Model-Free RL Algorithms

Model-free RL algorithms learn policies directly from interaction data, without building an explicit model of the environment. This section highlights five popular model-free algorithms, giving the key idea, mathematical objective, and typical engineering applications of each.

1. Q-Learning

Overview: Q-Learning learns the optimal action-value function $Q^*(s,a)$ from experience. It is an off-policy, value-based method: it can learn the greedy policy while the agent follows a different, exploratory behavior policy.

Update Rule: $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \left[r_{t+1} + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t)\right]$
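
As an illustration, the update rule can be sketched for a tabular $Q$ stored as a NumPy array. The function name, the `done` flag for terminal transitions, and the toy numbers are assumptions made for this example, not part of the text above.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Off-policy target: bootstrap from the greedy action in the next state.
    # Assumption: at a terminal transition there is nothing to bootstrap from.
    target = r if done else r + gamma * np.max(Q[s_next])
    td_error = target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Hypothetical problem with 3 states and 2 actions.
Q = np.zeros((3, 2))
Q[1] = [0.5, 2.0]
delta = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(delta, Q[0, 1])
```

Here the target is $1 + 0.9 \cdot 2 = 2.8$, so with $\alpha = 0.5$ the entry $Q(0,1)$ moves halfway from 0 toward 2.8.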

Applications: Equipment scheduling, discrete process control (e.g., valves, simple robotics).

2. Deep Q-Networks (DQN)

Overview: DQN extends Q-learning by approximating Q-values with a neural network, which makes it suitable for large or high-dimensional state spaces with discrete actions.

Key Innovations:

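Experience Replay: Stores past transitions in a buffer and samples minibatches from it at random, which reduces correlation between training samples and improves stability.
Target Network: Computes the bootstrap target with a separate copy of the network whose parameters are held fixed and only updated periodically, which stabilizes training.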
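
The two mechanisms above can be sketched as follows. This is a minimal illustration, assuming uniform sampling and a hard parameter copy every `sync_every` steps; the class name, function name, and the dictionary-of-arrays parameter format are all hypothetical.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

def sync_target(online_params, target_params, step, sync_every):
    """Copy online parameters into the target every `sync_every` steps."""
    if step % sync_every != 0:
        return False
    for name, value in online_params.items():
        target_params[name] = value.copy()
    return True

# Hypothetical usage with a single weight vector standing in for a network.
buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.push(i, 0, 0.0, i + 1, False)
batch = buf.sample(2)

online = {"w": np.array([1.0, 2.0])}
target = {"w": np.zeros(2)}
skipped = sync_target(online, target, step=5, sync_every=10)
copied = sync_target(online, target, step=10, sync_every=10)
```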

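Loss Function: $L(\theta) = \left(Q_\theta(s,a) - [r + \gamma \max_{a'}Q_{\theta^-}(s',a')]\right)^2$, where $\theta^-$ denotes the target network parameters. In practice the loss is averaged over a minibatch of transitions $(s,a,r,s')$.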
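
A small numerical sketch of this loss, assuming the Q-values for a minibatch have already been computed by the online and target networks and are passed in as arrays. The function name and the `dones` mask for terminal transitions are assumptions for illustration.

```python
import numpy as np

def dqn_loss(q_online, q_target_next, actions, rewards, dones, gamma=0.99):
    """Mean squared TD error over a minibatch.

    q_online:      (B, A) array, Q_theta(s, .) from the online network
    q_target_next: (B, A) array, Q_{theta^-}(s', .) from the target network
    """
    batch = np.arange(len(actions))
    q_sa = q_online[batch, actions]
    # The target uses the target network; terminal transitions do not bootstrap.
    y = rewards + gamma * (1.0 - dones) * q_target_next.max(axis=1)
    return np.mean((q_sa - y) ** 2)

# Hypothetical minibatch of two transitions with two actions each.
q_online = np.array([[1.0, 2.0], [0.5, 0.0]])
q_target_next = np.array([[0.0, 3.0], [1.0, 1.0]])
loss = dqn_loss(
    q_online,
    q_target_next,
    actions=np.array([1, 0]),
    rewards=np.array([1.0, 0.0]),
    dones=np.array([0.0, 1.0]),
    gamma=0.5,
)
print(loss)
```

In a real DQN implementation the gradient of this loss is taken with respect to $\theta$ only; the target is treated as a constant.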

Applications: Autonomous driving, discrete robotic control, power systems management.

3. Policy Gradient Methods

Overview: Policy gradient methods directly optimize the policy parameters $\theta$ by gradient ascent on the expected return.

REINFORCE Update: $\theta \leftarrow \theta + \alpha \nabla_{\theta}\log \pi_\theta(a_t|s_t)G_t$

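Variance Reduction: Subtract a baseline $b(s)$ from $G_t$, or replace $G_t$ with an advantage estimate $A(s,a)$, to reduce the variance of the gradient estimate and improve stability.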
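
The REINFORCE update and the baseline idea can be sketched for a tabular softmax policy, where $\theta$ holds one logit per state-action pair. The policy parameterization, the function names, and the optional per-state `baseline` array are assumptions chosen to keep the example small.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Compute G_t for every step of one episode, working backwards."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def reinforce_update(theta, states, actions, rewards,
                     alpha=0.1, gamma=0.99, baseline=None):
    """Apply the REINFORCE update step by step over one episode, in place.

    theta: float array of shape (num_states, num_actions) holding logits.
    baseline: optional array of per-state baseline values b(s).
    """
    G = discounted_returns(rewards, gamma)
    for s, a, g in zip(states, actions, G):
        pi = softmax(theta[s])
        # For a softmax policy, grad log pi(a|s) w.r.t. theta[s] is onehot(a) - pi.
        grad_log_pi = -pi
        grad_log_pi[a] += 1.0
        b = 0.0 if baseline is None else baseline[s]
        theta[s] += alpha * grad_log_pi * (g - b)
    return theta

# Hypothetical one-step episode: one state, two actions, reward 1.
returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
theta = np.zeros((1, 2))
reinforce_update(theta, states=[0], actions=[0], rewards=[1.0], alpha=1.0, gamma=0.9)
```

With a positive return, the logit of the chosen action rises and the other falls; subtracting a baseline changes the step size but not the expected gradient direction.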

Applications: Continuous control in robotics, process parameter tuning, network optimization.

4. Proximal Policy Optimization (PPO)

Overview: PPO improves policy gradient stability by preventing overly aggressive updates.

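Clipped Objective: $J^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\hat{A}_t\right)\right]$, where $r_t(\theta) = \pi_\theta(a_t|s_t)/\pi_{\theta_{old}}(a_t|s_t)$ is the probability ratio between the new and old policies (not the reward $r$).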
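
A minimal sketch of the clipped objective, taking $r_t(\theta)$ to be the probability ratio between the new and old policies as in standard PPO. The function name and the choice to pass log-probabilities rather than probabilities are assumptions for this example.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective, averaged over a batch of time steps."""
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # The elementwise minimum removes any incentive to push the ratio
    # outside [1 - eps, 1 + eps] in the direction favored by the advantage.
    return np.mean(np.minimum(unclipped, clipped))

# Hypothetical batch: ratios 1.5 and 0.5, advantages +1 and -1.
logp_old = np.zeros(2)
logp_new = np.log(np.array([1.5, 0.5]))
advantages = np.array([1.0, -1.0])
objective = ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2)
print(objective)
```

In this example the first term is capped at $1.2 \cdot 1$ and the second at $0.8 \cdot (-1)$, so further movement of either ratio would no longer change the objective.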

Applications: Robotic locomotion, drone control, continuous industrial parameter optimization.

5. Deep Deterministic Policy Gradient (DDPG)

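Overview: DDPG is an off-policy actor-critic method for continuous action spaces. It combines a deterministic policy (the actor), trained by policy gradients, with a Q-function (the critic), trained as in Q-learning.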

Updates:

Critic Loss:

$L(\theta_Q) = \left(Q_{\theta_Q}(s,a) - [r + \gamma Q_{\theta_Q^-}(s',\mu_{\theta_\mu^-}(s'))]\right)^2$
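
The critic loss can be sketched with the networks represented as plain Python callables. The toy functions standing in for the critic, target critic, and target actor are hypothetical and chosen only so the numbers are easy to check by hand.

```python
import numpy as np

def ddpg_critic_loss(q, q_target, mu_target, s, a, r, s_next, gamma=0.99):
    """Mean squared TD error for the critic over a minibatch."""
    # The next action comes from the target actor, not from a max over actions.
    a_next = mu_target(s_next)
    y = r + gamma * q_target(s_next, a_next)
    return np.mean((q(s, a) - y) ** 2)

# Hypothetical scalar stand-ins for the three networks.
q = lambda s, a: s + a
q_target = lambda s, a: 2.0 * s + a
mu_target = lambda s: -s

s = np.array([1.0, 2.0])
a = np.array([0.0, 1.0])
r = np.array([1.0, 0.0])
s_next = np.array([1.0, 3.0])
critic_loss = ddpg_critic_loss(q, q_target, mu_target, s, a, r, s_next, gamma=0.5)
print(critic_loss)
```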

Actor Update:

$\nabla_{\theta_\mu} J \approx \mathbb{E}_{s}\left[\nabla_a Q_{\theta_Q}(s,a)|_{a=\mu_{\theta_\mu}(s)}\nabla_{\theta_\mu}\mu_{\theta_\mu}(s)\right]$
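
To make the chain rule in the actor update concrete, the sketch below uses a scalar linear actor $\mu(s) = w s$ and a hand-written critic $Q(s,a) = -(a - 2s)^2$, whose action gradient is known in closed form. Both choices are assumptions for illustration; a real implementation would obtain these gradients by automatic differentiation.

```python
import numpy as np

def actor_gradient(w, states, grad_a_q):
    """Estimate dJ/dw for the linear actor mu(s) = w * s."""
    actions = w * states                # a = mu(s)
    dq_da = grad_a_q(states, actions)   # grad_a Q(s, a) evaluated at a = mu(s)
    dmu_dw = states                     # grad_w mu(s) for the linear actor
    return np.mean(dq_da * dmu_dw)

# Hypothetical critic Q(s, a) = -(a - 2s)^2, so grad_a Q = -2 (a - 2s).
grad_a_q = lambda s, a: -2.0 * (a - 2.0 * s)

states = np.array([1.0, -1.0, 2.0])
initial_grad = actor_gradient(0.0, states, grad_a_q)

# Gradient ascent on w; this critic is maximized by the action a = 2s.
w = 0.0
for _ in range(200):
    w += 0.05 * actor_gradient(w, states, grad_a_q)
print(w)
```

Because the critic prefers $a = 2s$, ascent on the actor parameter moves $w$ toward 2.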

Applications: Robotics (pendulum balancing), chemical process control, portfolio management, voltage regulation in power systems.

Definition of Symbols

$Q(s,a)$ : Action-value function.
$\alpha$ : Learning rate.
$\gamma$ : Discount factor.
$r$ : Reward.
$\theta$ : Parameters of neural network or policy.
$\pi_{\theta}(a|s)$ : Policy distribution parameterized by $\theta$.
$G_t$ : Return (cumulative discounted reward from time $t$ onwards).
$\hat{A}_t$ : Advantage estimate at time $t$.
$\epsilon$ : PPO clip range hyperparameter.
$\mu(s)$ : Deterministic policy function.

Dynamic Optimization