Model-Free RL Algorithms
Model-free RL algorithms learn optimal policies directly from interaction data, without building an explicit model of the environment. Five popular model-free RL algorithms are highlighted below, each with its key ideas, mathematical objective, and engineering applications.
1. Q-Learning
Overview: Q-Learning learns the optimal action-value function Q*(s,a) directly from experience. It is an off-policy, value-based method, so it can learn the greedy policy even while following exploratory actions.
Update Rule: Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_{a′} Q(s_{t+1}, a′) − Q(s_t, a_t) ]
Applications: Equipment scheduling, discrete process control (e.g., valves, simple robotics).
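To make the update concrete, here is a minimal tabular sketch in Python/NumPy on a toy 5-state chain; the environment, the ε-greedy exploration rate, and the hyperparameters are illustrative assumptions rather than part of the text above.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step: Q(s,a) += alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * np.max(Q[s_next])      # bootstrap with the greedy next-state value
    Q[s, a] += alpha * (td_target - Q[s, a])       # move the estimate toward the TD target
    return Q

# Toy example: 5-state chain, 2 actions (0 = left, 1 = right), reward 1 at the right end.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s < n_states - 1:
        # epsilon-greedy behaviour policy; the max in the target makes the method off-policy
        a = rng.integers(n_actions) if rng.random() < 0.1 else int(np.argmax(Q[s]))
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        Q = q_learning_update(Q, s, a, r, s_next)
        s = s_next

print(Q)  # right-moving actions should dominate after training
```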
2. Deep Q-Networks (DQN)
Overview: DQN extends Q-learning by using a neural network to approximate Q-values, making it suitable for large or high-dimensional state spaces.
Key Innovations:
- Experience Replay: Stores past transitions and samples them randomly, reducing correlation between consecutive updates and improving training stability.
- Target Network: Stabilizes training by computing the bootstrap target with a separate network whose parameters are only periodically copied from the online network.
Loss Function: L(θ) = ( Q_θ(s, a) − [ r + γ max_{a′} Q_{θ⁻}(s′, a′) ] )²
Applications: Autonomous driving, discrete robotic control, power systems management.
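A compact sketch of the DQN loss with experience replay and a target network, written with PyTorch; the network sizes, the 4-dimensional state, the buffer capacity, and the synthetic transitions are assumptions chosen only for illustration.

```python
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99   # illustrative dimensions, not tied to a specific task

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())          # target starts as a copy of the online net
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

replay_buffer = deque(maxlen=10_000)                     # stores (s, a, r, s', done) tuples

def dqn_update(batch_size=32):
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)     # random sampling decorrelates updates
    s, a, r, s_next, done = zip(*batch)
    s = torch.tensor(s, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64).unsqueeze(1)
    r = torch.tensor(r, dtype=torch.float32)
    s_next = torch.tensor(s_next, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a).squeeze(1)              # Q_theta(s, a)
    with torch.no_grad():                                # target net supplies a fixed bootstrap value
        target = r + GAMMA * (1 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)          # (Q_theta(s,a) - target)^2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Every N steps, resync the target: target_net.load_state_dict(q_net.state_dict())

# Example: push some synthetic transitions, then run one update.
for _ in range(64):
    s0 = [random.random() for _ in range(STATE_DIM)]
    s1 = [random.random() for _ in range(STATE_DIM)]
    replay_buffer.append((s0, random.randrange(N_ACTIONS), random.random(), s1, False))
dqn_update()
```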
3. Policy Gradient Methods
Overview: Policy gradient methods directly optimize the policy parameters θ by maximizing the expected return.
REINFORCE Update: θ ← θ + α ∇_θ log π_θ(a_t | s_t) · G_t
Variance Reduction: Subtract a baseline b(s) from the return, or use an advantage estimate A(s,a), to reduce gradient variance and improve stability.
Applications: Continuous control in robotics, process parameter tuning, network optimization.
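A minimal REINFORCE sketch with a mean-return baseline, again in PyTorch; the categorical policy, the dimensions, and the synthetic three-step episode are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

STATE_DIM, N_ACTIONS, GAMMA = 4, 3, 0.99   # illustrative dimensions

policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, rewards):
    """One REINFORCE step over a single episode, with a mean-return baseline."""
    # Compute discounted returns G_t from the end of the episode backwards.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + GAMMA * g
        returns.insert(0, g)
    returns = torch.tensor(returns, dtype=torch.float32)
    returns = returns - returns.mean()                   # simple baseline for variance reduction

    states = torch.tensor(states, dtype=torch.float32)
    actions = torch.tensor(actions, dtype=torch.int64)
    log_probs = Categorical(logits=policy(states)).log_prob(actions)

    # Gradient ascent on E[log pi_theta(a_t|s_t) * G_t] == descent on its negative.
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Usage with a synthetic 3-step episode:
reinforce_update(states=[[0.1] * STATE_DIM] * 3, actions=[0, 1, 2], rewards=[0.0, 0.0, 1.0])
```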
4. Proximal Policy Optimization (PPO)
Overview: PPO improves policy gradient stability by preventing overly aggressive updates.
Clipped Objective: J^{CLIP}(θ) = E_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ], where r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t) is the probability ratio between the current and the previous policy.
Applications: Robotic locomotion, drone control, continuous industrial parameter optimization.
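The clipped objective reduces to a few lines of code. The PyTorch sketch below negates J^CLIP so it can be minimized with a standard optimizer; the synthetic log-probabilities and advantages stand in for real rollout data and are purely illustrative.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped PPO surrogate, negated for gradient descent.

    new_log_probs: log pi_theta(a_t|s_t) under the current policy
    old_log_probs: log pi_theta_old(a_t|s_t), detached from the graph
    advantages:    advantage estimates A_hat_t
    """
    ratio = torch.exp(new_log_probs - old_log_probs)             # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                 # maximize J^CLIP -> minimize its negative

# Example with synthetic values:
new_lp = torch.tensor([-0.9, -1.1, -0.4], requires_grad=True)
old_lp = torch.tensor([-1.0, -1.0, -0.5])
adv = torch.tensor([0.5, -0.2, 1.0])
loss = ppo_clip_loss(new_lp, old_lp, adv)
loss.backward()                                                  # gradients flow only through new_lp
```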
5. Deep Deterministic Policy Gradient (DDPG)
Overview: DDPG combines policy gradients and Q-learning for continuous action spaces using actor-critic methods.
Updates:
- Critic Loss:
L(θ^Q) = ( Q_{θ^Q}(s, a) − [ r + γ Q_{θ^{Q⁻}}(s′, μ_{θ^{μ⁻}}(s′)) ] )²
- Actor Update:
∇_{θ^μ} J ≈ E_s[ ∇_a Q_{θ^Q}(s, a) |_{a = μ(s)} · ∇_{θ^μ} μ_{θ^μ}(s) ]
Applications: Robotics (pendulum balancing), chemical process control, portfolio management, voltage regulation in power systems.
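A condensed DDPG update sketch in PyTorch, showing the critic regression against target networks, the deterministic actor update through the critic, and soft target-network updates; the network sizes, the soft-update rate τ, and the random batch are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, GAMMA, TAU = 3, 1, 0.99, 0.005   # illustrative dimensions and rates

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

actor, actor_target = mlp(STATE_DIM, ACTION_DIM), mlp(STATE_DIM, ACTION_DIM)
critic, critic_target = mlp(STATE_DIM + ACTION_DIM, 1), mlp(STATE_DIM + ACTION_DIM, 1)
actor_target.load_state_dict(actor.state_dict())
critic_target.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next):
    # Critic: regress Q(s,a) toward r + gamma * Q'(s', mu'(s')), computed with target networks.
    with torch.no_grad():
        q_next = critic_target(torch.cat([s_next, actor_target(s_next)], dim=1))
        target = r.unsqueeze(1) + GAMMA * q_next
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend the critic's value of mu(s), i.e. minimize -Q(s, mu(s)).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft target updates: theta' <- tau * theta + (1 - tau) * theta'.
    for net, net_t in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1 - TAU).add_(TAU * p.data)

# Example with a synthetic batch of 8 transitions:
s = torch.randn(8, STATE_DIM); a = torch.randn(8, ACTION_DIM)
r = torch.randn(8); s_next = torch.randn(8, STATE_DIM)
ddpg_update(s, a, r, s_next)
```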
Definition of Symbols
- Q(s,a): Action-value function.
- α: Learning rate.
- γ: Discount factor.
- r: Reward.
- θ: Parameters of a neural network or policy.
- θ⁻: Parameters of the corresponding target network (a delayed copy used to stabilize bootstrapped targets).
- π_θ(a|s): Policy distribution parameterized by θ.
- G_t: Return (cumulative discounted reward from time t onwards).
- Â_t: Advantage estimate at time t.
- ε: PPO clip range hyperparameter.
- μ(s): Deterministic policy function.