Model-Free RL Algorithms
Main.RLAlgorithms History
Changed lines 21-22 from:
to:
* '''Experience Replay:''' Stores experiences to reduce correlation and improve training stability.
* '''Target Network:''' Stabilizes training by periodically updating target network parameters.
Changed line 1 from:
(:title Popular Model-Free Reinforcement Learning Algorithms:)
to:
(:title Model-Free RL Algorithms:)
Changed line 1 from:
(:title Model-Free RL Algorithms:)
to:
(:title Popular Model-Free Reinforcement Learning Algorithms:)
Changed lines 5-6 from:
Model-free RL algorithms enable agents to learn optimal policies directly from interaction data without explicit environment models. Below is a summary of five popular '''model-free RL algorithms''' with key ideas, mathematics, and engineering applications.
to:
Model-free RL algorithms learn optimal policies directly from interaction data without explicit environment models. Five popular '''model-free RL algorithms''' are highlighted with key ideas, mathematical objectives, and engineering applications.
Deleted line 53:
Added lines 60-73:
!!! Definition of Symbols
* {`Q(s,a)`}: Action-value function.
* {`\alpha`}: Learning rate.
* {`\gamma`}: Discount factor.
* {`r`}: Reward.
* {`\theta`}: Parameters of neural network or policy.
* {`\pi_{\theta}(a|s)`}: Policy distribution parameterized by {`\theta`}.
* {`G_t`}: Return (cumulative discounted reward from time {`t`} onwards).
* {`\hat{A}_t`}: Advantage estimate at time {`t`}.
* {`\epsilon`}: PPO clip range hyperparameter.
* {`\mu(s)`}: Deterministic policy function.
Changed lines 54-55 from:
to:
* '''Critic Loss:'''
Changed line 57 from:
to:
* '''Actor Update:'''
Changed line 1 from:
(:title Model-Free Reinforcement Learning Algorithms:)
to:
(:title Model-Free RL Algorithms:)
Changed lines 5-7 from:
We discuss five popular '''model-free RL algorithms''', highlighting key ideas, mathematics, and engineering applications. These algorithms enable agents to learn optimal policies directly from interaction data without explicit environment models.
to:
Model-free RL algorithms enable agents to learn optimal policies directly from interaction data without explicit environment models. Below is a summary of five popular '''model-free RL algorithms''' with key ideas, mathematics, and engineering applications.
Added lines 1-61:
(:title Model-Free Reinforcement Learning Algorithms:)
(:keywords reinforcement learning, RL algorithms, Q-learning, DQN, PPO, DDPG, model-free RL, engineering applications:)
(:description Summary of popular model-free reinforcement learning algorithms, their key concepts, equations, and engineering applications:)
!! Popular Model-Free RL Algorithms
We discuss five popular '''model-free RL algorithms''', highlighting key ideas, mathematics, and engineering applications. These algorithms enable agents to learn optimal policies directly from interaction data without explicit environment models.
!!! 1. Q-Learning
'''Overview:''' Q-Learning learns the optimal action-value function {`Q^*(s,a)`} through experience. It is an '''off-policy, value-based''' method, capable of learning from exploratory actions.
'''Update Rule:'''
{$ Q_{new}(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \left[r_{t+1} + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t)\right] $}
'''Applications:''' Equipment scheduling, discrete process control (e.g., valves, simple robotics).
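'''Example (Python sketch):''' A minimal sketch of the tabular update above, assuming a small discrete state and action space; the table size, learning rate, and sample transition are illustrative rather than drawn from a specific application.
[@
import numpy as np

# Minimal tabular Q-learning update; 5 states and 2 actions are illustrative.
n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.99          # learning rate and discount factor
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """Apply the Q-learning update for one transition (s, a, r, s_next)."""
    td_target = r + gamma * np.max(Q[s_next])   # r_{t+1} + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s, a) toward the TD target

# Example: one hypothetical transition from state 0 to state 3 with reward 1.0
q_update(s=0, a=1, r=1.0, s_next=3)
print(Q[0])
@]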
!!! 2. Deep Q-Networks (DQN)
'''Overview:''' DQN extends Q-learning with neural networks that approximate Q-values, making it suitable for large or high-dimensional state spaces.
'''Key Innovations:'''
- '''Experience Replay:''' Stores experiences to reduce correlation and improve training stability.
- '''Target Network:''' Stabilizes training by periodically updating target network parameters.
'''Loss Function:'''
{$ L(\theta) = \left(Q_\theta(s,a) - [r + \gamma \max_{a'}Q_{\theta^-}(s',a')]\right)^2 $}
'''Applications:''' Autonomous driving, discrete robotic control, power systems management.
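'''Example (Python sketch):''' A minimal sketch of the loss above in PyTorch, assuming a small fully connected network and a hand-made replay batch; the state dimension, action count, and batch contents are illustrative.
[@
import torch
import torch.nn as nn

gamma = 0.99
# Online and target networks; sizes (4 state features, 2 actions) are illustrative.
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
target_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
target_net.load_state_dict(q_net.state_dict())   # periodic target-network sync

# A fake replay-buffer batch: (s, a, r, s', done)
s    = torch.randn(8, 4)
a    = torch.randint(0, 2, (8, 1))
r    = torch.randn(8)
s2   = torch.randn(8, 4)
done = torch.zeros(8)

q_sa = q_net(s).gather(1, a).squeeze(1)           # Q_theta(s, a)
with torch.no_grad():                             # target uses the frozen network
    target = r + gamma * (1 - done) * target_net(s2).max(1).values
loss = nn.functional.mse_loss(q_sa, target)       # squared TD error
loss.backward()
@]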
!!! 3. Policy Gradient Methods
'''Overview:''' Policy gradient methods directly optimize the policy parameters {`\theta`} by maximizing expected rewards.
'''REINFORCE Update:'''
{$ \theta \leftarrow \theta + \alpha \nabla_{\theta}\log \pi_\theta(a_t|s_t)G_t $}
'''Variance Reduction:''' Use baseline {`b(s)`} or advantage {`A(s,a)`} to improve stability.
'''Applications:''' Continuous control in robotics, process parameter tuning, network optimization.
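'''Example (Python sketch):''' A minimal sketch of the REINFORCE update in PyTorch, assuming a single already-collected episode; the network, episode length, and data are illustrative.
[@
import torch
import torch.nn as nn

alpha, gamma = 1e-2, 0.99
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=alpha)

# One fake episode: states, actions taken, and rewards received
states  = torch.randn(10, 4)
actions = torch.randint(0, 2, (10,))
rewards = torch.randn(10)

# Returns G_t = sum_{k>=t} gamma^{k-t} r_k, computed backwards through the episode
returns, G = [], 0.0
for r in reversed(rewards.tolist()):
    G = r + gamma * G
    returns.append(G)
returns = torch.tensor(list(reversed(returns)))

log_probs = torch.log_softmax(policy(states), dim=1)   # log pi_theta(a|s)
chosen = log_probs[torch.arange(10), actions]          # log pi_theta(a_t|s_t)
loss = -(chosen * returns).mean()                      # ascend E[log pi * G_t]
optimizer.zero_grad()
loss.backward()
optimizer.step()
@]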
!!! 4. Proximal Policy Optimization (PPO)
'''Overview:''' PPO improves policy gradient stability by preventing overly aggressive updates.
'''Clipped Objective:'''
{$ J^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\text{clip}\left(r_t(\theta),1-\epsilon,1+\epsilon\right)\hat{A}_t\right)\right] $}
where {`r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}`} is the probability ratio between the updated and previous policies.
'''Applications:''' Robotic locomotion, drone control, continuous industrial parameter optimization.
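'''Example (Python sketch):''' A minimal sketch of the clipped objective in PyTorch, assuming the per-timestep log-probabilities and advantage estimates have already been computed; the batch values are illustrative.
[@
import torch

eps = 0.2                          # clip range hyperparameter epsilon
log_prob_new = torch.randn(16)     # log pi_theta(a_t|s_t) under the current policy
log_prob_old = torch.randn(16)     # log pi_theta_old(a_t|s_t), held fixed during the update
advantages   = torch.randn(16)     # advantage estimates A_hat_t (e.g. from GAE)

ratio = torch.exp(log_prob_new - log_prob_old)           # r_t(theta)
unclipped = ratio * advantages
clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
objective = torch.min(unclipped, clipped).mean()         # J^CLIP(theta)
loss = -objective                                        # minimize the negative objective
@]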
!!! 5. Deep Deterministic Policy Gradient (DDPG)
'''Overview:''' DDPG combines policy gradients and Q-learning for continuous action spaces using actor-critic methods.
'''Updates:'''
- '''Critic Loss:'''
{$ L(\theta_Q) = \left(Q_{\theta_Q}(s,a) - [r + \gamma Q_{\theta_Q^-}(s',\mu_{\theta_\mu^-}(s'))]\right)^2 $}
- '''Actor Update:'''
{$ \nabla_{\theta_\mu} J \approx \mathbb{E}_{s}\left[\nabla_a Q_{\theta_Q}(s,a)|_{a=\mu(s)}\nabla_{\theta_\mu}\mu_{\theta_\mu}(s)\right] $}
'''Applications:''' Robotics (pendulum balancing), chemical process control, portfolio management, voltage regulation in power systems.
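'''Example (Python sketch):''' A minimal sketch of one critic and one actor update in PyTorch, assuming a 3-dimensional state and a 1-dimensional continuous action; the networks, learning rates, and replay batch are illustrative.
[@
import torch
import torch.nn as nn

gamma = 0.99
actor         = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1), nn.Tanh())
critic        = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))  # input: [s, a]
actor_target  = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1), nn.Tanh())
critic_target = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
actor_target.load_state_dict(actor.state_dict())
critic_target.load_state_dict(critic.state_dict())
actor_opt  = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Fake replay-buffer batch: (s, a, r, s')
s, a, r, s2 = torch.randn(8, 3), torch.randn(8, 1), torch.randn(8, 1), torch.randn(8, 3)

# Critic loss: squared TD error against the target actor and critic
with torch.no_grad():
    y = r + gamma * critic_target(torch.cat([s2, actor_target(s2)], dim=1))
critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

# Actor update: ascend the critic's value of the actor's own actions mu(s)
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
@]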