lagom.metric: Metrics

lagom.metric.returns(gamma, rewards)[source]
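
Its signature takes only a discount factor and a reward sequence, i.e. plain (discounted) accumulated returns without bootstrapping. A minimal single-episode sketch, assuming rewards is a 1-D sequence of per-step rewards (illustrative only, not lagom's exact batched implementation):

import numpy as np

def returns_sketch(gamma, rewards):
    # G_t = r_t + gamma * r_{t+1} + ... + gamma^{T-t} * r_T,
    # computed by the backward recursion G_t = r_t + gamma * G_{t+1}
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out
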
lagom.metric.bootstrapped_returns(gamma, rewards, last_V, reach_terminal)[source]

Calculate (discounted) accumulated returns with bootstrapping for a batch of episodic transitions.

Formally, given all rewards \((r_1, \dots, r_T)\), it computes

\[Q_t = r_t + \gamma r_{t+1} + \dots + \gamma^{T - t} r_T + \gamma^{T - t + 1} V(s_{T+1})\]

Note

The state values for terminal states are masked out as zero!
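
The formula is a backward recursion \(Q_t = r_t + \gamma Q_{t+1}\) with \(Q_{T+1} = V(s_{T+1})\). A minimal single-episode sketch with the terminal-state masking from the note; treating rewards as a 1-D sequence and last_V as a scalar is an illustrative assumption, not lagom's exact batched interface:

import numpy as np

def bootstrapped_returns_sketch(gamma, rewards, last_V, reach_terminal):
    # Mask the bootstrap value to zero if the episode ended in a terminal state
    bootstrap = 0.0 if reach_terminal else float(last_V)
    out = np.zeros(len(rewards))
    running = bootstrap  # Q_{T+1} := V(s_{T+1}), or 0 at a terminal state
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # Q_t = r_t + gamma * Q_{t+1}
        out[t] = running
    return out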

lagom.metric.td0_target(gamma, rewards, Vs, last_V, reach_terminal)[source]

Calculate TD(0) targets of a batch of episodic transitions.

Let \(r_1, r_2, \dots, r_T\) be a list of rewards and let \(V(s_0), V(s_1), \dots, V(s_{T-1}), V(s_T)\) be a list of state values including the last state value. Let \(\gamma\) be the discount factor; the TD(0) targets are then calculated as follows

\[r_t + \gamma V(s_t), \forall t = 1, 2, \dots, T\]

Note

The state values for terminal states are masked out as zero!
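
A minimal single-episode sketch of the targets above, assuming Vs holds \(V(s_0), \dots, V(s_{T-1})\) and last_V is \(V(s_T)\) as in the definition (illustrative shapes, not lagom's exact batched interface):

import numpy as np

def td0_target_sketch(gamma, rewards, Vs, last_V, reach_terminal):
    # last_V is V(s_T); mask it to zero if the episode terminated (see the note)
    bootstrap = 0.0 if reach_terminal else float(last_V)
    next_Vs = np.append(np.asarray(Vs, dtype=float)[1:], bootstrap)  # V(s_1), ..., V(s_T)
    return np.asarray(rewards) + gamma * next_Vs  # r_t + gamma * V(s_t), t = 1, ..., T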

lagom.metric.td0_error(gamma, rewards, Vs, last_V, reach_terminal)[source]

Calculate TD(0) errors of a batch of episodic transitions.

Let \(r_1, r_2, \dots, r_T\) be a list of rewards and let \(V(s_0), V(s_1), \dots, V(s_{T-1}), V(s_T)\) be a list of state values including the last state value. Let \(\gamma\) be the discount factor; the TD(0) errors are then calculated as follows

\[\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t), \forall t = 0, 1, \dots, T - 1\]

Note

The state values for terminal states are masked out as zero!
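
A minimal single-episode sketch of the TD(0) errors, under the same assumptions about Vs and last_V as above (illustrative, not lagom's exact batched interface):

import numpy as np

def td0_error_sketch(gamma, rewards, Vs, last_V, reach_terminal):
    # last_V is V(s_T); mask it to zero if the episode terminated (see the note)
    bootstrap = 0.0 if reach_terminal else float(last_V)
    Vs = np.asarray(Vs, dtype=float)
    next_Vs = np.append(Vs[1:], bootstrap)             # V(s_1), ..., V(s_T)
    return np.asarray(rewards) + gamma * next_Vs - Vs  # delta_t, t = 0, ..., T-1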

lagom.metric.gae(gamma, lam, rewards, Vs, last_V, reach_terminal)[source]

Calculate the Generalized Advantage Estimation (GAE) of a batch of episodic transitions.

Let \(\delta_t\) be the TD(0) error at time step \(t\); the GAE at time step \(t\) is calculated as follows

\[A_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{k=0}^{\infty}(\gamma\lambda)^k \delta_{t + k}\]
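
Since the sum can be computed by the backward recursion \(A_t = \delta_t + \gamma\lambda A_{t+1}\), a minimal single-episode sketch under the same argument conventions as the TD(0) functions above (illustrative, not lagom's exact batched interface):

import numpy as np

def gae_sketch(gamma, lam, rewards, Vs, last_V, reach_terminal):
    # TD(0) errors, with the bootstrap value masked to zero at a terminal state
    bootstrap = 0.0 if reach_terminal else float(last_V)
    Vs = np.asarray(Vs, dtype=float)
    deltas = np.asarray(rewards) + gamma * np.append(Vs[1:], bootstrap) - Vs
    # Backward recursion A_t = delta_t + gamma * lam * A_{t+1}
    out = np.zeros(len(deltas))
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        out[t] = running
    return out
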
lagom.metric.vtrace(behavior_logprobs, target_logprobs, gamma, Rs, Vs, last_V, reach_terminal, clip_rho=1.0, clip_pg_rho=1.0)[source]
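
The vtrace signature corresponds to the V-trace targets of Espeholt et al. (2018, IMPALA): off-policy corrected value targets computed from behaviour and target log-probabilities with clipped importance ratios. A single-episode sketch of that computation follows; the fixed \(\bar{c} = 1\) clip on the trace coefficients, the scalar last_V, and returning both the targets and the policy-gradient advantages are assumptions for illustration, not lagom's documented behaviour:

import numpy as np

def vtrace_sketch(behavior_logprobs, target_logprobs, gamma, Rs, Vs, last_V,
                  reach_terminal, clip_rho=1.0, clip_pg_rho=1.0):
    Rs = np.asarray(Rs, dtype=float)
    Vs = np.asarray(Vs, dtype=float)
    # Importance ratios pi(a|s) / mu(a|s), clipped as in the IMPALA paper
    rhos = np.exp(np.asarray(target_logprobs) - np.asarray(behavior_logprobs))
    clipped_rhos = np.minimum(clip_rho, rhos)
    clipped_cs = np.minimum(1.0, rhos)  # assumed trace clip c-bar = 1
    # Bootstrap value, masked to zero at a terminal state
    bootstrap = 0.0 if reach_terminal else float(last_V)
    next_Vs = np.append(Vs[1:], bootstrap)
    deltas = clipped_rhos * (Rs + gamma * next_Vs - Vs)
    # Backward recursion: v_s - V(s_s) = delta_s + gamma * c_s * (v_{s+1} - V(s_{s+1}))
    vs = np.zeros(len(Rs))
    acc = 0.0
    for t in reversed(range(len(Rs))):
        acc = deltas[t] + gamma * clipped_cs[t] * acc
        vs[t] = acc + Vs[t]
    # Advantages for the policy-gradient term
    next_vs = np.append(vs[1:], bootstrap)
    advantages = np.minimum(clip_pg_rho, rhos) * (Rs + gamma * next_vs - Vs)
    return vs, advantages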