lagom.metric: Metrics¶

lagom.metric.bootstrapped_returns(gamma, rewards, last_V, reach_terminal)[source]¶
Return (discounted) accumulated returns with bootstrapping for a batch of episodic transitions.
Formally, given all rewards \((r_1, \dots, r_T)\), it computes
\[Q_t = r_t + \gamma r_{t+1} + \dots + \gamma^{T - t} r_T + \gamma^{T - t + 1} V(s_{T+1})\]
Note
The state values for terminal states are masked out as zero!
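As a rough illustration, the return \(Q_t = r_t + \gamma Q_{t+1}\) can be computed by a single backward pass, seeding the recursion with the bootstrapped value \(V(s_{T+1})\). The sketch below is a minimal NumPy re-implementation assuming the signature documented above; it is not lagom's actual code, and zeroing `last_V` on termination follows the note above.

```python
import numpy as np

def bootstrapped_returns(gamma, rewards, last_V, reach_terminal):
    """Discounted returns with a bootstrapped last state value (illustrative sketch)."""
    # Terminal states contribute no future value, so mask the bootstrap to zero.
    R = last_V * (1.0 - float(reach_terminal))
    out = np.empty(len(rewards))
    # Backward recursion: Q_t = r_t + gamma * Q_{t+1}
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R
        out[t] = R
    return out
```

For example, with `gamma=1.0`, `rewards=[1, 2, 3]` and `last_V=10.0`, a non-terminal batch gives `[16, 15, 13]`, while a terminal one drops the bootstrap and gives `[6, 5, 3]`.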

lagom.metric.td0_target(gamma, rewards, Vs, last_V, reach_terminal)[source]¶
Calculate TD(0) targets of a batch of episodic transitions.
Let \(r_1, r_2, \dots, r_T\) be a list of rewards and let \(V(s_0), V(s_1), \dots, V(s_{T-1}), V(s_T)\) be a list of state values, including the last state value. Let \(\gamma\) be a discount factor. The TD(0) targets are calculated as follows:
\[r_t + \gamma V(s_t), \quad \forall t = 1, 2, \dots, T\]Note
The state values for terminal states are masked out as zero!
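In other words, each reward is paired with the value of the state it leads to, where the final next-state value comes from `last_V`. A minimal NumPy sketch under the signature documented above (illustrative, not lagom's actual implementation; masking `last_V` for terminal states follows the note):

```python
import numpy as np

def td0_target(gamma, rewards, Vs, last_V, reach_terminal):
    """TD(0) targets r_t + gamma * V(s_t) for t = 1..T (illustrative sketch)."""
    rewards = np.asarray(rewards, dtype=float)
    Vs = np.asarray(Vs, dtype=float)  # V(s_0), ..., V(s_{T-1})
    # Next-state values V(s_1), ..., V(s_T); the terminal value is masked to zero.
    next_V = np.append(Vs[1:], last_V * (1.0 - float(reach_terminal)))
    return rewards + gamma * next_V
```

For instance, with `gamma=0.5`, `rewards=[1, 1]`, `Vs=[0.0, 2.0]` and `last_V=4.0`, the targets are `[2.0, 3.0]` for a non-terminal batch and `[2.0, 1.0]` for a terminal one.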

lagom.metric.td0_error(gamma, rewards, Vs, last_V, reach_terminal)[source]¶
Calculate TD(0) errors of a batch of episodic transitions.
Let \(r_1, r_2, \dots, r_T\) be a list of rewards and let \(V(s_0), V(s_1), \dots, V(s_{T-1}), V(s_T)\) be a list of state values, including the last state value. Let \(\gamma\) be a discount factor. The TD(0) errors are calculated as follows:
\[\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\]Note
The state values for terminal states are masked out as zero!
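The TD(0) error is simply the TD(0) target minus the current state value. A minimal NumPy sketch under the documented signature (illustrative, not lagom's actual implementation):

```python
import numpy as np

def td0_error(gamma, rewards, Vs, last_V, reach_terminal):
    """TD(0) errors delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t) (illustrative sketch)."""
    rewards = np.asarray(rewards, dtype=float)
    Vs = np.asarray(Vs, dtype=float)  # V(s_0), ..., V(s_{T-1})
    # Next-state values, with the terminal value masked to zero.
    next_V = np.append(Vs[1:], last_V * (1.0 - float(reach_terminal)))
    # Target minus current value.
    return rewards + gamma * next_V - Vs
```

Reusing the example above (`gamma=0.5`, `rewards=[1, 1]`, `Vs=[0.0, 2.0]`, `last_V=4.0`, non-terminal), the errors are `[2.0, 1.0]`.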

lagom.metric.gae(gamma, lam, rewards, Vs, last_V, reach_terminal)[source]¶
Calculate the Generalized Advantage Estimation (GAE) of a batch of episodic transitions.
Let \(\delta_t\) be the TD(0) error at time step \(t\). The GAE at time step \(t\) is calculated as follows:
\[A_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{k=0}^{\infty}(\gamma\lambda)^k \delta_{t + k}\]
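The infinite sum above can be computed with a backward recursion \(A_t = \delta_t + \gamma\lambda A_{t+1}\) over the finite batch, reusing the TD(0) errors. A minimal NumPy sketch under the documented signature (illustrative, not lagom's actual implementation; terminal masking of `last_V` follows the notes above):

```python
import numpy as np

def gae(gamma, lam, rewards, Vs, last_V, reach_terminal):
    """GAE advantages via backward recursion over TD(0) errors (illustrative sketch)."""
    rewards = np.asarray(rewards, dtype=float)
    Vs = np.asarray(Vs, dtype=float)
    # TD(0) errors, with the terminal bootstrap value masked to zero.
    next_V = np.append(Vs[1:], last_V * (1.0 - float(reach_terminal)))
    deltas = rewards + gamma * next_V - Vs
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    A = 0.0
    out = np.empty(len(rewards))
    for t in reversed(range(len(rewards))):
        A = deltas[t] + gamma * lam * A
        out[t] = A
    return out
```

Note that with `lam=1` the estimate reduces to the full discounted sum of TD errors (Monte Carlo advantage), while `lam=0` recovers the one-step TD(0) error, trading variance for bias.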