| Method | Model-based | Policy-based | Value-based | Actor-Critic | On-policy | Off-policy | Bootstrapping | Continuous Action | Function Approx | Stable |
|---|---|---|---|---|---|---|---|---|---|---|
| Policy Iteration | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| Value Iteration | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| Monte Carlo | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| TD(0) | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
| SARSA | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
| Q-learning | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| DQN | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ✅ | ❌ |
| REINFORCE | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ |
| Actor-Critic | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ |
| A2C | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ |
| A3C | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ |
| PPO | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
| DDPG | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ |
| TD3 | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ |
| SAC | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
Policy iteration (policy evaluation step):

$$v(s) \leftarrow \sum_a \pi(a \mid s)\Big(r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v(s')\Big)$$

Value iteration:

$$v(s) \leftarrow \max_a \Big(r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v(s')\Big)$$
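A minimal sketch of tabular value iteration implementing the update above; the MDP here (transition tensor `P`, reward matrix `R`, discount `gamma`) is randomly generated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Illustrative MDP: P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

v = np.zeros(n_states)
for _ in range(1000):
    # v(s) <- max_a ( r(s,a) + gamma * sum_s' P(s'|s,a) v(s') )
    q = R + gamma * (P @ v)          # shape (n_states, n_actions)
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

greedy_policy = q.argmax(axis=1)     # extract the greedy policy from the final Q-values
```

Policy evaluation for policy iteration replaces the `max` with an expectation under $\pi(a \mid s)$.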
Monte Carlo:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\big(G_t - Q(s,a)\big)$$

TD(0):

$$V(s_t) \leftarrow V(s_t) + \alpha\big(r_t + \gamma V(s_{t+1}) - V(s_t)\big)$$

n-step TD:

$$V(s_t) \leftarrow V(s_t) + \alpha\big(G_t^{(n)} - V(s_t)\big)$$
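The contrast between the Monte Carlo and TD(0) prediction targets in code, assuming a toy episode of `(state, reward, next_state)` transitions and arbitrary step size and discount:

```python
import numpy as np

gamma, alpha = 0.99, 0.1
V_td = np.zeros(10)   # TD(0) estimates
V_mc = np.zeros(10)   # Monte Carlo estimates

# Illustrative trajectory from one finished episode.
trajectory = [(0, 1.0, 1), (1, 0.0, 2), (2, 2.0, 3)]

# TD(0): bootstrap from the current estimate of the next state, update online.
for s, r, s_next in trajectory:
    V_td[s] += alpha * (r + gamma * V_td[s_next] - V_td[s])

# Monte Carlo: wait for the full return G_t, no bootstrapping.
G = 0.0
for s, r, _ in reversed(trajectory):
    G = r + gamma * G
    V_mc[s] += alpha * (G - V_mc[s])
```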
SARSA:

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\big(r_t + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\big)$$

Q-learning:

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\big(r_t + \gamma \max_a Q(s_{t+1},a) - Q(s_t,a_t)\big)$$

Expected SARSA:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\big(r + \gamma\, \mathbb{E}_{a' \sim \pi}[Q(s',a')] - Q(s,a)\big)$$
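The three control methods share the same tabular update and differ only in the bootstrap target, as this sketch shows (the Q-table layout, the next-state policy vector `pi_next`, and the helper names are assumptions for illustration):

```python
import numpy as np

def sarsa_target(Q, r, s_next, a_next, gamma):
    # On-policy: bootstrap from the action actually taken next.
    return r + gamma * Q[s_next, a_next]

def q_learning_target(Q, r, s_next, gamma):
    # Off-policy: bootstrap from the greedy action.
    return r + gamma * Q[s_next].max()

def expected_sarsa_target(Q, r, s_next, pi_next, gamma):
    # Bootstrap from the expectation over the policy's action probabilities.
    return r + gamma * float(np.dot(pi_next, Q[s_next]))

def td_update(Q, s, a, target, alpha):
    # Q(s,a) <- Q(s,a) + alpha * (target - Q(s,a))
    Q[s, a] += alpha * (target - Q[s, a])
```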
REINFORCE:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$$
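A sketch of the REINFORCE update for a tabular softmax policy; the episode data and hyperparameters are illustrative. For a softmax over logits, $\nabla_\theta \log \pi_\theta(a \mid s)$ reduces to `one_hot(a) - pi(.|s)` for the row of parameters belonging to state `s`.

```python
import numpy as np

n_states, n_actions = 6, 3
theta = np.zeros((n_states, n_actions))   # tabular softmax policy parameters
alpha, gamma = 0.01, 0.99

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

# Illustrative episode: (state, action, reward) triples collected by running the policy.
episode = [(0, 1, 0.0), (2, 0, 1.0), (3, 2, 0.5)]

G = 0.0
for s, a, r in reversed(episode):
    G = r + gamma * G                      # Monte Carlo return from time t
    grad_log_pi = -policy(s)               # d log pi(a|s) / d theta[s] = one_hot(a) - pi
    grad_log_pi[a] += 1.0
    theta[s] += alpha * grad_log_pi * G    # theta <- theta + alpha * grad log pi * G_t
```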
Actor-Critic (TD-error form):

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi(a_t \mid s_t)\, \delta_t$$

Advantage Actor-Critic (A2C / A3C):

$$A(s,a) = Q(s,a) - V(s)$$

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi(a \mid s)\, A(s,a)$$
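A single actor-critic step under the TD-error form above, again with a tabular softmax actor and a tabular critic; state indices and learning rates are placeholders:

```python
import numpy as np

n_states, n_actions = 6, 3
theta = np.zeros((n_states, n_actions))   # actor: tabular softmax policy
V = np.zeros(n_states)                    # critic: state-value table
alpha_actor, alpha_critic, gamma = 0.01, 0.1, 0.99

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def actor_critic_step(s, a, r, s_next, done):
    # Critic: one-step TD error, which doubles as the advantage estimate.
    delta = r + gamma * V[s_next] * (not done) - V[s]
    V[s] += alpha_critic * delta
    # Actor: policy-gradient step scaled by the TD error.
    grad_log_pi = -policy(s)
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * grad_log_pi * delta
```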
DQN (TD loss against a target network $\theta^-$):

$$L(\theta) = \big(r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta)\big)^2$$
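A framework-agnostic sketch of the DQN loss over a replay batch; `q_online` and `q_target` are assumed to be callables returning per-action value arrays of shape `(batch, n_actions)`, and the batch layout is an assumption:

```python
import numpy as np

def dqn_loss(q_online, q_target, batch, gamma=0.99):
    """Mean squared TD error over a replay batch of (s, a, r, s', done)."""
    q_sa = q_online(batch["s"])[np.arange(len(batch["a"])), batch["a"]]
    # Target uses the slow-moving parameters theta^- and is not differentiated through.
    y = batch["r"] + gamma * (1.0 - batch["done"]) * q_target(batch["s_next"]).max(axis=1)
    return np.mean((y - q_sa) ** 2)
```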
PPO (clipped surrogate objective):

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

$$L(\theta) = \mathbb{E}\big[\min\big(r_t(\theta)\, A_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, A_t\big)\big]$$
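The clipped surrogate in code, written here from log-probabilities (a common numerically stable choice; the argument names are illustrative):

```python
import numpy as np

def ppo_clip_objective(log_pi_new, log_pi_old, advantages, eps=0.2):
    # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    ratio = np.exp(log_pi_new - log_pi_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Take the elementwise minimum, i.e. the pessimistic surrogate, then average.
    return np.mean(np.minimum(unclipped, clipped))
```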
SAC (entropy-regularized target, critic and actor objectives):

$$y = r + \gamma\big(V(s') - \alpha \log \pi(a' \mid s')\big)$$

$$J_Q = \mathbb{E}\big[(Q(s,a) - y)^2\big]$$

$$J_\pi = \mathbb{E}\big[\alpha \log \pi(a \mid s) - Q(s,a)\big]$$
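A sketch that mirrors the three SAC expressions exactly as written above (batch arrays such as `v_next`, `log_pi_next`, and `q_sa` are assumed inputs; how they are produced by the networks is omitted):

```python
import numpy as np

def sac_critic_target(r, v_next, log_pi_next, alpha, gamma=0.99):
    # y = r + gamma * (V(s') - alpha * log pi(a'|s'))
    return r + gamma * (v_next - alpha * log_pi_next)

def sac_critic_loss(q_sa, y):
    # J_Q = E[(Q(s,a) - y)^2]
    return np.mean((q_sa - y) ** 2)

def sac_actor_loss(q_sa, log_pi, alpha):
    # J_pi = E[alpha * log pi(a|s) - Q(s,a)]
    return np.mean(alpha * log_pi - q_sa)
```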
Target network:

$$y = r + \gamma \max_{a'} Q(s',a';\theta^-)$$

Where:
- $\theta^-$ = slow-moving target network parameters
Experience replay:
- Breaks correlation between consecutive samples
- Improves sample efficiency

$$Q(s,a) \leftarrow Q(s,a) + \alpha\big(y - Q(s,a)\big)$$

Where transitions $(s, a, r, s') \sim \mathcal{D}$ are sampled from a replay buffer $\mathcal{D}$.
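A minimal replay buffer sketch, assuming uniform sampling and a fixed capacity (class and method names are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s', done) transitions, sampled uniformly at train time."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```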
Advantage normalization:
- Reduces variance
- Stabilizes updates

$$\hat{A}_t = \frac{A_t - \mu_A}{\sigma_A + \epsilon}$$
Generalized Advantage Estimation (GAE):

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$$A_t^{(\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$$
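A sketch of GAE over a single trajectory segment, folding in the advantage normalization above; `values` is assumed to have one extra entry, $V(s_T)$, for bootstrapping, and termination handling is omitted:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95, eps=1e-8):
    """Compute lambda-weighted advantages for one trajectory, then normalize them."""
    advantages = np.zeros(len(rewards))
    acc = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error delta_t
        acc = delta + gamma * lam * acc                          # A_t = delta_t + gamma*lambda*A_{t+1}
        advantages[t] = acc
    # Advantage normalization: (A_t - mu_A) / (sigma_A + eps)
    return (advantages - advantages.mean()) / (advantages.std() + eps)
```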
Entropy regularization:

$$J(\theta) = J_{\text{RL}}(\theta) + \beta\, H(\pi(\cdot \mid s))$$

Where:

$$H(\pi) = -\sum_a \pi(a \mid s) \log \pi(a \mid s)$$
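The entropy bonus in code, assuming a batch of action-probability vectors (the coefficient `beta` is the same $\beta$ as above):

```python
import numpy as np

def entropy_bonus(probs, beta=0.01):
    # H(pi) = -sum_a pi(a|s) log pi(a|s), averaged over the batch of states.
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=-1)
    return beta * entropy.mean()
```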
Ratio clipping (as in PPO):

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

$$L(\theta) = \mathbb{E}\big[\min\big(r_t A_t,\ \operatorname{clip}(r_t,\, 1-\epsilon,\, 1+\epsilon)\, A_t\big)\big]$$
Soft (Polyak) target update:

$$\theta^- \leftarrow \tau\theta + (1-\tau)\theta^-$$

Reward normalization:

$$\hat{r}_t = \frac{r_t - \mu_r}{\sigma_r + \epsilon}$$

Value clipping:

$$V_{\text{clip}} = V_{\text{old}} + \operatorname{clip}(V - V_{\text{old}},\, -\epsilon,\, \epsilon)$$

Double DQN target:

$$y = r + \gamma\, Q_{\theta^-}\big(s',\ \arg\max_a Q_\theta(s',a)\big)$$
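A sketch combining the Double DQN target with the soft (Polyak) target update; `q_online` and `q_target` are assumed callables as before, and parameters are represented as a plain dict of arrays for simplicity:

```python
import numpy as np

def double_dqn_target(q_online, q_target, r, s_next, done, gamma=0.99):
    # Select the action with the online network, evaluate it with the target network.
    a_star = q_online(s_next).argmax(axis=1)
    q_next = q_target(s_next)[np.arange(len(a_star)), a_star]
    return r + gamma * (1.0 - done) * q_next

def polyak_update(target_params, online_params, tau=0.005):
    # theta^- <- tau * theta + (1 - tau) * theta^-
    for name in target_params:
        target_params[name] = tau * online_params[name] + (1.0 - tau) * target_params[name]
    return target_params
```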