(:title Reinforcement Learning for Engineers:)
(:keywords reinforcement learning, RL, machine learning, optimal control, engineering optimization, Markov Decision Process, MDP, model-free, model-based:)
(:description High-level overview and applications of Reinforcement Learning for engineers, covering key concepts, Markov Decision Processes, and model-based vs. model-free methods:)
!! High-Level Overview of RL
'''Definition and Key Concepts:'''
Reinforcement Learning (RL) is a paradigm of machine learning and optimal control in which an '''agent''' learns to make decisions by interacting with an '''environment''' to maximize a cumulative '''reward''' ([[https://en.wikipedia.org/wiki/Reinforcement_learning|Wikipedia]]). Unlike supervised learning, the agent is not told the correct actions; instead it '''experiments''' with actions and learns from the feedback (rewards) it receives. The agent observes the current '''state''' of the environment, takes an '''action''', and receives a '''reward''' (a scalar feedback signal); the environment then transitions to a new state. This loop repeats over time (see '''Figure 1'''). The agent seeks a '''policy''' (a mapping from states to actions) that maximizes expected cumulative reward. Key concepts:
* '''States''': Observations of the environment.
* '''Actions''': Decisions made by the agent.
* '''Rewards''': Immediate feedback signals.
* '''Policy''': Strategy to select actions.
Agents face the '''exploration vs. exploitation dilemma''': exploring new actions in search of higher rewards vs. exploiting actions already known to be rewarding.
%width=500px%Attach:rl_agent_environment.png
'''Figure 1:''' Agent-environment interaction in RL.
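To make the loop in Figure 1 concrete, the sketch below implements it in Python with a small, hypothetical '''LineWorld''' environment (the agent walks left or right toward a goal state) and a random placeholder policy. The environment, its dynamics, and its reward structure are illustrative assumptions, not a specific library API.
[@
import random

# A minimal, hypothetical environment: the agent moves left/right along a line of
# states and receives a reward of +1 only when it reaches the rightmost (goal) state.
class LineWorld:
    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action: 0 = move left, 1 = move right
        move = 1 if action == 1 else -1
        self.state = max(0, min(self.n_states - 1, self.state + move))
        done = (self.state == self.n_states - 1)
        reward = 1.0 if done else 0.0
        return self.state, reward, done

# Agent-environment loop from Figure 1: observe state, take action, receive reward.
env = LineWorld()
state = env.reset()
total_reward = 0.0
for t in range(50):
    action = random.choice([0, 1])      # placeholder policy: act at random
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
print("episode return:", total_reward)
@]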
'''Markov Decision Processes (MDPs):'''
RL problems are commonly formalized as Markov Decision Processes (MDPs), a mathematical framework for sequential decision-making under uncertainty. An MDP is defined by the tuple:
(:math:) \mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma) (:mathend:)
where:
* (:math:)\mathcal{S}(:mathend:) = Set of '''states'''.
* (:math:)\mathcal{A}(:mathend:) = Set of '''actions'''.
* (:math:)P(s'|s,a)(:mathend:) = '''Transition probability''' to state (:math:)s'(:mathend:) given state (:math:)s(:mathend:) and action (:math:)a(:mathend:).
* (:math:)R(s,a,s')(:mathend:) = '''Reward''' received when transitioning from state (:math:)s(:mathend:) to (:math:)s'(:mathend:) via action (:math:)a(:mathend:).
* (:math:)\gamma \in [0,1)(:mathend:) = '''Discount factor''', weighing future vs. immediate rewards.
MDPs satisfy the '''Markov property''': future states depend only on the current state and action. The '''optimal policy''' (:math:)\pi^*(s)(:mathend:) maximizes expected long-term reward. Solving an MDP typically involves computing '''value functions''' (:math:)V(s)(:mathend:) or '''action-value functions''' (:math:)Q(s,a)(:mathend:). The '''Bellman optimality equation''' for the value function is:
(:math:)V^*(s) = \max_{a \in \mathcal{A}} \sum_{s'} P(s'|s,a)\left[ R(s,a,s') + \gamma V^*(s') \right](:mathend:)
Similarly, the optimal action-value function satisfies:
(:math:)Q^*(s,a) = \sum_{s'}P(s'|s,a)\left[R(s,a,s') + \gamma \max_{a'}Q^*(s',a')\right](:mathend:)
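As a worked illustration of the Bellman optimality equation, the sketch below runs value iteration on a small, made-up 3-state MDP. The transition probabilities and rewards are illustrative assumptions, not taken from any particular application.
[@
import numpy as np

# Value iteration on a hypothetical 3-state, 2-action MDP, repeatedly applying
# the Bellman optimality equation until the value function converges.
n_states, n_actions, gamma = 3, 2, 0.9

# P[s, a, s'] = transition probability, R[s, a, s'] = reward (illustrative numbers)
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.8, 0.2, 0.0]; P[0, 1] = [0.1, 0.9, 0.0]
P[1, 0] = [0.0, 0.7, 0.3]; P[1, 1] = [0.0, 0.2, 0.8]
P[2, 0] = [0.0, 0.0, 1.0]; P[2, 1] = [0.0, 0.0, 1.0]   # state 2 is absorbing
R = np.zeros((n_states, n_actions, n_states))
R[:, :, 2] = 1.0        # reward for entering state 2
R[2, :, 2] = 0.0        # no further reward once in the absorbing state

V = np.zeros(n_states)
for _ in range(1000):
    # Q[s, a] = sum over s' of P(s'|s,a) * [R(s,a,s') + gamma * V(s')]
    Q = np.einsum('sat,sat->sa', P, R + gamma * V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

pi_star = Q.argmax(axis=1)   # greedy (optimal) policy from the converged Q-values
print("V* =", V)
print("pi* =", pi_star)
@]
For these particular numbers the greedy policy selects action 1 in states 0 and 1, since that action moves the system toward the rewarding absorbing state more quickly.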
'''Model-Free vs. Model-Based RL:'''
A key distinction:
* '''Model-based RL''': Uses or learns a model (:math:)P(s'|s,a)(:mathend:) and reward function, enabling planning and simulation (e.g., AlphaZero).
* '''Model-free RL''': Learns directly from trial-and-error interaction, without an explicit model. Simpler and more common, but typically requires more environment interaction (a minimal tabular Q-learning sketch follows below).
Hybrid approaches like Dyna-Q use learned models to simulate additional experiences.
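The value-iteration sketch above is a model-based calculation that assumes (:math:)P(:mathend:) and (:math:)R(:mathend:) are known. Its model-free counterpart, tabular Q-learning, estimates (:math:)Q(s,a)(:mathend:) purely from sampled transitions. The sketch below is a minimal version run on the hypothetical LineWorld environment defined earlier; the learning rate, exploration rate, and episode count are arbitrary illustrative choices.
[@
import random

# Tabular Q-learning (model-free): learn Q(s,a) from sampled transitions only,
# reusing the illustrative LineWorld environment from the earlier sketch.
alpha, gamma, epsilon = 0.1, 0.9, 0.2
env = LineWorld()
Q = [[0.0, 0.0] for _ in range(env.n_states)]

def greedy(s):
    # break ties at random so the untrained agent still explores both directions
    best = max(Q[s])
    return random.choice([a for a in (0, 1) if Q[s][a] == best])

for episode in range(500):
    s = env.reset()
    for t in range(100):
        # epsilon-greedy action selection: explore vs. exploit
        a = random.choice([0, 1]) if random.random() < epsilon else greedy(s)
        s_next, r, done = env.step(a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        target = r + (0.0 if done else gamma * max(Q[s_next]))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next
        if done:
            break

policy = [greedy(s) for s in range(env.n_states)]
print("learned greedy policy (0 = left, 1 = right):", policy)
@]
With enough episodes the learned greedy policy selects action 1 (move right) in the non-terminal states, heading directly for the goal.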
'''Applications of RL in Engineering Optimization:'''
RL applies broadly to engineering tasks involving sequential decision-making or control:
* '''Chemical Engineering''': Process control and reaction optimization (e.g., reactor settings, energy minimization, yield improvement).
* '''Mechanical Engineering''': Robotic control, autonomous systems (e.g., robotic arms, drones, inverted pendulum).
* '''Automotive''': Autonomous driving (lane-keeping, cruise control, collision avoidance).
* '''Industrial Energy Management''': HVAC optimization (e.g., DeepMind reduced Google data center cooling energy by ~40%).
These applications demonstrate RL's effectiveness in engineering optimization, addressing complex, uncertain, and dynamic conditions.