
Model-Free RL Algorithms

Model-free RL algorithms learn optimal policies directly from interaction data, without an explicit model of the environment. Five popular model-free RL algorithms are highlighted below, with their key ideas, mathematical objectives, and engineering applications.

1. Q-Learning

Overview: Q-Learning learns the optimal action-value function $Q(s,a)$ directly from experience. It is an off-policy, value-based method, so it can learn the greedy policy while following an exploratory one.

Update Rule: $Q_{\text{new}}(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$
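
As a concrete illustration, here is a minimal tabular sketch of this update with ε-greedy exploration; the environment interface (`env.reset()` and `env.step(action)` returning `(next_state, reward, done)`) and all hyperparameter values are assumptions made for the example, not part of the original text.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, n_actions=4):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative sketch)."""
    Q = defaultdict(lambda: [0.0] * n_actions)  # Q[s][a], initialized to zero

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Behave epsilon-greedily; the learning target below still uses max (off-policy)
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[state][a])

            next_state, reward, done = env.step(action)  # assumed 3-tuple interface

            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            td_target = reward if done else reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (td_target - Q[state][action])
            state = next_state
    return Q
```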

Applications: Equipment scheduling, discrete process control (e.g., valves, simple robotics).

2. Deep Q-Networks (DQN)

Overview: DQN extends Q-learning by using a neural network to approximate Q-values, which makes it suitable for large or high-dimensional state spaces.

Key Innovations:

  • Experience Replay: Stores past transitions and samples them uniformly at random, reducing correlation between consecutive updates and improving training stability (see the buffer sketch below).
  • Target Network: Stabilizes the bootstrapped targets by keeping a separate copy of the Q-network fixed and syncing its parameters only periodically.
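
To make the experience-replay idea concrete, below is a minimal sketch of a uniform replay buffer; the transition layout `(state, action, reward, next_state, done)` and the default capacity are illustrative assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal uniform experience replay (illustrative sketch)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation of consecutive transitions
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```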

Loss Function: $L(\theta) = \left( Q_\theta(s, a) - \left[ r + \gamma \max_{a'} Q_{\theta^-}(s', a') \right] \right)^2$, where $\theta^-$ are the target network parameters.
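
The sketch below shows one way this loss can be computed for a sampled minibatch, assuming PyTorch, an online network `q_net`, a frozen copy `target_net`, and batched tensors; these names and tensor layouts are assumptions for illustration, not the article's own code.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """DQN loss: (Q_theta(s,a) - [r + gamma * max_a' Q_theta_minus(s',a')])^2 over a minibatch."""
    states, actions, rewards, next_states, dones = batch  # tensors; actions is int64, dones is float 0/1

    # Q-values of the actions actually taken, from the online network
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped target from the frozen target network; no gradient flows through it
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * max_next_q

    return F.mse_loss(q_sa, target)
```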

Applications: Autonomous driving, discrete robotic control, power systems management.

3. Policy Gradient Methods

Overview: Policy gradient methods directly optimize the policy parameters $\theta$ by maximizing the expected return.

REINFORCE Update: $\theta \leftarrow \theta + \alpha \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$

Variance Reduction: Subtracting a state-dependent baseline $b(s)$, or replacing the return with an advantage estimate $A(s,a)$, lowers the variance of the gradient estimate without introducing bias, which improves training stability.
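
A minimal PyTorch sketch of one REINFORCE update with a mean-return baseline is shown below; it assumes that per-step `log_probs` (with gradients attached) and `rewards` were collected during a rollout of the policy, and the helper name is hypothetical.

```python
import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """One REINFORCE step with a mean-return baseline (illustrative sketch)."""
    # Discounted returns G_t, computed backwards over the episode
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # Baseline: subtracting the mean return reduces gradient variance without adding bias
    advantages = returns - returns.mean()

    # Maximize sum_t log pi(a_t|s_t) * (G_t - b); negate because optimizers minimize
    loss = -(torch.stack(log_probs) * advantages).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```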

Applications: Continuous control in robotics, process parameter tuning, network optimization.

4. Proximal Policy Optimization (PPO)

Overview: PPO improves the stability of policy gradient training by clipping each update so that the new policy cannot move too far from the old one in a single step.

Clipped Objective: $J^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right]$, where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio between the new and old policies.
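
The following is a minimal sketch of the clipped surrogate loss in PyTorch, assuming per-timestep log-probabilities under the new and old policies and precomputed advantage estimates; the function name and default clip value are illustrative.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped PPO surrogate: E_t[min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)]."""
    ratio = torch.exp(new_log_probs - old_log_probs)       # r_t(theta) = pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Negated so that minimizing this loss maximizes the clipped objective
    return -torch.min(unclipped, clipped).mean()
```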

Applications: Robotic locomotion, drone control, continuous industrial parameter optimization.

5. Deep Deterministic Policy Gradient (DDPG)

Overview: DDPG is an actor-critic method that combines a deterministic policy gradient (actor) with Q-learning (critic) to handle continuous action spaces.

Updates:

  • Critic Loss:

$L(\theta^Q) = \left( Q_{\theta^Q}(s, a) - \left[ r + \gamma\, Q_{\theta^{Q'}}\!\big(s', \mu_{\theta^{\mu'}}(s')\big) \right] \right)^2$, where the primed parameters $\theta^{Q'}$ and $\theta^{\mu'}$ denote slowly updated target networks.

  • Actor Update:

$\nabla_{\theta^\mu} J \approx \mathbb{E}_s\!\left[ \nabla_a Q_{\theta^Q}(s, a)\big|_{a=\mu_{\theta^\mu}(s)}\; \nabla_{\theta^\mu}\, \mu_{\theta^\mu}(s) \right]$
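
A compact sketch of one DDPG update step is given below, assuming PyTorch actor/critic modules, target copies of each, and minibatch tensors with rewards and done flags stored as column vectors; the module and argument names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99):
    """One DDPG step: critic regresses onto the bootstrapped target, actor ascends the critic."""
    # All tensors are minibatches; rewards and dones are column vectors of shape (batch, 1)
    states, actions, rewards, next_states, dones = batch

    # Critic loss: (Q(s,a) - [r + gamma * Q'(s', mu'(s'))])^2 with target networks
    with torch.no_grad():
        target_q = rewards + gamma * (1.0 - dones) * target_critic(next_states, target_actor(next_states))
    critic_loss = F.mse_loss(critic(states, actions), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor loss: gradient of Q(s, mu(s)) w.r.t. the actor parameters (negated to ascend)
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    return critic_loss.item(), actor_loss.item()
```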

Applications: Robotics (pendulum balancing), chemical process control, portfolio management, voltage regulation in power systems.

Definition of Symbols

  • $Q(s,a)$: Action-value function.
  • $\alpha$: Learning rate.
  • $\gamma$: Discount factor.
  • $r$: Reward.
  • $\theta$: Parameters of the neural network or policy.
  • $\pi_\theta(a \mid s)$: Policy distribution parameterized by $\theta$.
  • $G_t$: Return (cumulative discounted reward from time $t$ onward).
  • $\hat{A}_t$: Advantage estimate at time $t$.
  • $\epsilon$: PPO clip-range hyperparameter.
  • $\mu(s)$: Deterministic policy function.