(:title Reinforcement Learning for Engineers:)
(:keywords reinforcement learning, RL, machine learning, optimal control, engineering optimization, Markov Decision Process, MDP, model-free, model-based:)
(:description High-level overview and applications of Reinforcement Learning for engineers, covering key concepts, Markov Decision Processes, and model-based vs. model-free methods:)
!! High-Level Overview of RL
'''Definition and Key Concepts:'''
Reinforcement Learning (RL) is a paradigm of machine learning and optimal control in which an '''agent''' learns to make decisions by interacting with an '''environment''' to maximize a cumulative '''reward''' ([[https://en.wikipedia.org/wiki/Reinforcement_learning|Wikipedia]]). Unlike supervised learning, the agent is not told the correct actions; instead it '''experiments''' with actions and learns from the feedback (rewards) it receives. The agent observes the current '''state''' of the environment, takes an '''action''', and receives a '''reward''' (a scalar feedback signal); the environment then transitions to a new state. This loop repeats over time (see '''Figure 1'''). The agent seeks a '''policy''' (a mapping from states to actions) that maximizes expected cumulative reward. Key concepts:
* '''States''': Observations of the environment.
* '''Actions''': Decisions made by the agent.
* '''Rewards''': Immediate feedback signals.
* '''Policy''': Strategy to select actions.
Agents face the '''exploration vs. exploitation dilemma''': exploring new actions in search of higher rewards vs. exploiting actions already known to be rewarding.
%width=500px%Attach:rl_agent_environment.png
'''Figure 1:''' Agent-environment interaction in RL.
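To make the loop in Figure 1 concrete, the sketch below implements it in Python with a small, hypothetical '''LineWorld''' environment (the agent walks left or right toward a goal state) and a random placeholder policy. The environment, its dynamics, and its reward structure are illustrative assumptions, not a specific library API.
[@
import random

# A minimal, hypothetical environment: the agent moves left/right along a line of
# states and receives a reward of +1 only when it reaches the rightmost (goal) state.
class LineWorld:
    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action: 0 = move left, 1 = move right
        move = 1 if action == 1 else -1
        self.state = max(0, min(self.n_states - 1, self.state + move))
        done = (self.state == self.n_states - 1)
        reward = 1.0 if done else 0.0
        return self.state, reward, done

# Agent-environment loop from Figure 1: observe state, take action, receive reward.
env = LineWorld()
state = env.reset()
total_reward = 0.0
for t in range(50):
    action = random.choice([0, 1])      # placeholder policy: act at random
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
print("episode return:", total_reward)
@]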
'''Markov Decision Processes (MDPs):'''
RL problems are commonly formalized as Markov Decision Processes (MDPs), a mathematical framework for sequential decision-making under uncertainty. An MDP is defined by the tuple:
(:math:) \mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma) (:mathend:)
where:
* (:math:)\mathcal{S}(:mathend:) = Set of '''states'''.
* (:math:)\mathcal{A}(:mathend:) = Set of '''actions'''.
* (:math:)P(s'|s,a)(:mathend:) = '''Transition probability''' to state (:math:)s'(:mathend:) given state (:math:)s(:mathend:) and action (:math:)a(:mathend:).
* (:math:)R(s,a,s')(:mathend:) = '''Reward''' received when transitioning from state (:math:)s(:mathend:) to (:math:)s'(:mathend:) via action (:math:)a(:mathend:).
* (:math:)\gamma \in [0,1)(:mathend:) = '''Discount factor''', weighing future vs. immediate rewards.
MDPs satisfy the '''Markov property''': future states depend only on the current state and action. The '''optimal policy''' (:math:)\pi^*(s)(:mathend:) maximizes expected long-term reward. Solving an MDP typically involves computing '''value functions''' (:math:)V(s)(:mathend:) or '''action-value functions''' (:math:)Q(s,a)(:mathend:). The '''Bellman optimality equation''' for the value function is:
(:math:)V^*(s) = \max_{a \in \mathcal{A}} \sum_{s'} P(s'|s,a)\left[ R(s,a,s') + \gamma V^*(s') \right](:mathend:)
Similarly, the optimal action-value function satisfies:
(:math:)Q^*(s,a) = \sum_{s'}P(s'|s,a)\left[R(s,a,s') + \gamma \max_{a'}Q^*(s',a')\right](:mathend:)
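As a worked illustration of the Bellman optimality equation, the sketch below runs value iteration on a small, made-up 3-state MDP. The transition probabilities and rewards are illustrative assumptions, not taken from any particular application.
[@
import numpy as np

# Value iteration on a hypothetical 3-state, 2-action MDP, repeatedly applying
# the Bellman optimality equation until the value function converges.
n_states, n_actions, gamma = 3, 2, 0.9

# P[s, a, s'] = transition probability, R[s, a, s'] = reward (illustrative numbers)
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.8, 0.2, 0.0]; P[0, 1] = [0.1, 0.9, 0.0]
P[1, 0] = [0.0, 0.7, 0.3]; P[1, 1] = [0.0, 0.2, 0.8]
P[2, 0] = [0.0, 0.0, 1.0]; P[2, 1] = [0.0, 0.0, 1.0]   # state 2 is absorbing
R = np.zeros((n_states, n_actions, n_states))
R[:, :, 2] = 1.0        # reward for entering state 2
R[2, :, 2] = 0.0        # no further reward once in the absorbing state

V = np.zeros(n_states)
for _ in range(1000):
    # Q[s, a] = sum over s' of P(s'|s,a) * [R(s,a,s') + gamma * V(s')]
    Q = np.einsum('sat,sat->sa', P, R + gamma * V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

pi_star = Q.argmax(axis=1)   # greedy (optimal) policy from the converged Q-values
print("V* =", V)
print("pi* =", pi_star)
@]
For these particular numbers the greedy policy selects action 1 in states 0 and 1, since that action moves the system toward the rewarding absorbing state more quickly.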
'''Model-Free vs. Model-Based RL:'''
A key distinction:
* '''Model-based RL''': Uses or learns a model (:math:)P(s'|s,a)(:mathend:) and reward function, enabling planning and simulation (e.g., AlphaZero).
* '''Model-free RL''': Learns directly from trial-and-error interaction, without an explicit model. Simpler and more common, but typically requires more environment interaction (a minimal tabular Q-learning sketch follows below).
Hybrid approaches like Dyna-Q use learned models to simulate additional experiences.
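The value-iteration sketch above is a model-based calculation that assumes (:math:)P(:mathend:) and (:math:)R(:mathend:) are known. Its model-free counterpart, tabular Q-learning, estimates (:math:)Q(s,a)(:mathend:) purely from sampled transitions. The sketch below is a minimal version run on the hypothetical LineWorld environment defined earlier; the learning rate, exploration rate, and episode count are arbitrary illustrative choices.
[@
import random

# Tabular Q-learning (model-free): learn Q(s,a) from sampled transitions only,
# reusing the illustrative LineWorld environment from the earlier sketch.
alpha, gamma, epsilon = 0.1, 0.9, 0.2
env = LineWorld()
Q = [[0.0, 0.0] for _ in range(env.n_states)]

def greedy(s):
    # break ties at random so the untrained agent still explores both directions
    best = max(Q[s])
    return random.choice([a for a in (0, 1) if Q[s][a] == best])

for episode in range(500):
    s = env.reset()
    for t in range(100):
        # epsilon-greedy action selection: explore vs. exploit
        a = random.choice([0, 1]) if random.random() < epsilon else greedy(s)
        s_next, r, done = env.step(a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        target = r + (0.0 if done else gamma * max(Q[s_next]))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next
        if done:
            break

policy = [greedy(s) for s in range(env.n_states)]
print("learned greedy policy (0 = left, 1 = right):", policy)
@]
With enough episodes the learned greedy policy selects action 1 (move right) in the non-terminal states, heading directly for the goal.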
'''Applications of RL in Engineering Optimization:'''
RL applies broadly to engineering tasks involving sequential decision-making or control:
* '''Chemical Engineering''': Process control and reaction optimization (e.g., reactor settings, energy minimization, yield improvement).
* '''Mechanical Engineering''': Robotic control, autonomous systems (e.g., robotic arms, drones, inverted pendulum).
* '''Automotive''': Autonomous driving (lane-keeping, cruise control, collision avoidance).
* '''Industrial Energy Management''': HVAC optimization (e.g., DeepMind reduced Google data center cooling energy by ~40%).
These applications demonstrate RL's effectiveness in engineering optimization, addressing complex, uncertain, and dynamic conditions.