RL with Gymnasium
Main.RLGymnasium History
Added lines 4-5:
%width=15px%Attach:github.png [[https://github.com/APMonitor/dynopt/blob/master/RL_for_Engineers.ipynb|GitHub]] | %width=20px%Attach:colab.png [[https://colab.research.google.com/github/APMonitor/dynopt/blob/master/RL_for_Engineers.ipynb|Google Colab]]
Added lines 26-27:
The gymnasium environment can also be used for demonstration exercises such as using [[https://apmonitor.com/pds/index.php/Main/HandTracking|hand-tracking to adjust the torque]] on the cart.
Changed line 296 from:
This code is '''executable''' and would train a DDPG agent on the Pendulum problem. With 200 episodes, it should achieve a decent policy (though possibly not perfect; longer training yields better results). In practice, one might incorporate learning rate schedules, noise decay, or more episodes.
to:
This code trains a DDPG agent on the Pendulum problem. With 200 episodes, it should achieve a decent policy (though possibly not perfect; longer training yields better results). In practice, one might incorporate learning rate schedules, noise decay, or more episodes.
Changed lines 91-98 from:
'''Pendulum Problem with DDPG:''' The Pendulum-v1 environment is a classic control problem: the agent applies a torque to a pendulum (which starts hanging downward) and tries to swing it up and balance it upright. The state is 3-dimensional (angle represented as cos and sin, and angular velocity) and the action is 1-dimensional continuous (torque ''-2'' to ''+2'' N·m). The reward is highest (less negative) when the pendulum is upright and not moving (and using minimal torque). This is a perfect testbed for DDPG.
*We will implement a simplified DDPG for Pendulum.
* This involves buildingthe actor and critic networks, a replay buffer, and the training loop. We’ll use '''PyTorch''' in code (though one could use TensorFlow similarly). The code will highlight how to connect to the gymnasium environment and perform learning.
* In practice, one might use an existing library like [[https://stable-baselines3.readthedocs.io/en/master/|Stable Baselines3]] for quick results, but here we write it out to understand the process.)
'''Setting up the Pendulum Environment and Networks:'''
to:
'''Pendulum Problem with DDPG:''' The Pendulum-v1 environment is a classic control problem: the agent applies a torque to a pendulum (which starts hanging downward) and tries to swing it up and balance it upright. The state is 3-dimensional (angle represented as cos and sin, and angular velocity) and the action is 1-dimensional continuous (torque ''-2'' to ''+2'' N·m). The reward is highest (less negative) when the pendulum is upright and not moving (and using minimal torque).
* Implement a simplified DDPG for Pendulum.
* Build the actor and critic networks, a replay buffer, and the training loop. We’ll use '''PyTorch''' in code (though one could use TensorFlow similarly). The code will highlight how to connect to the gymnasium environment and perform learning.
* In practice, use an existing library like [[https://stable-baselines3.readthedocs.io/en/master/|Stable Baselines3]].
'''Set up the Pendulum Environment and Networks:'''
Changed lines 113-114 from:
to:
Import the necessary libraries. Retrieve ''state_dim'', ''action_dim'', and ''max_action'' (to help scale outputs). Next, define the neural network models for actor and critic. A simple feedforward network is sufficient for this small problem:
Changed lines 157-160 from:
'''Replay Buffer:''' DDPG relies on a replay buffer to store transitions and sample them for training.
to:
The actor outputs a continuous action using a ''tanh'' activation to ensure it stays within ''[-1,1]'', then multiplies by ''max_action'' to scale to [-2,2]. The critic concatenates state and action and outputs a scalar Q-value. Use two hidden layers of 128 units as a reasonable size for this small problem.
'''Replay Buffer:''' DDPG relies on a replay buffer to store transitions and sample them for training. Implement a simple ring-buffer:
Changed lines 187-188 from:
The buffer stores transitions and can sample a random batch for training (returns tensors for convenience).
to:
The buffer stores transitions and samples a random batch for training as tensors.
Changed lines 208-211 from:
'''Training Loop:'''
to:
Create the networks and set target networks initially equal to the main networks. Use Adam optimizers for both actor and critic. Create the replay buffer. The learning rates (1e-3) are chosen as typical starting points.
'''Training Loop:''' Train the DDPG agent. Here is an outline of the loop and key steps:
Changed lines 282-289 from:
* Loop over episodes. For each episode, reset the environment and then loop for a certain number of steps (we use 500 as an upper bound to ensure we don’t loop forever if the environment doesn’t terminate early; Pendulum naturally truncates around 200 steps by default).
* '''Action selection:'''We get the action from the actor network ''actor(state_tensor)'' and then add Gaussian noise for exploration. The noise scale decays by ''exploration_noise * max_action'' (so here 0.1 * 2.0 = 0.2 std dev initial noise). This encourages exploration of different torques. We then clip the action to the allowed range.
* '''Storing transitions:'''We store ''state, action, reward, next_state, done'' in the ''ReplayBuffer''.
* '''Training updates:'''We wait until we have at least one batch worth of samples in the buffer, then each time step we do one gradient update for actor and critic:
** Compute '''target Q''': using the target networks, compute {`y_i = r_i + \gamma (1-d_i)\,Q_{\text{target}}(s_{i+1}, \mu_{\text{target}}(s_{i+1}))`}. ''dones'' is a 0/1 indicator for terminal states; we use ''(1 - done)'' to zero-out the future term if the episode ended.
to:
* Loop over episodes. For each episode, reset the environment and then loop for a certain number of steps (use 500 as an upper bound to ensure it doesn't loop forever if the environment doesn’t terminate early; Pendulum naturally truncates around 200 steps by default).
* '''Action selection:''' Get the action from the actor network ''actor(state_tensor)'' and then add Gaussian noise for exploration. The noise scale decays by ''exploration_noise * max_action'' (so here 0.1 * 2.0 = 0.2 std dev initial noise). This encourages exploration of different torques. Clip the action to the allowed range.
* '''Storing transitions:''' Store ''state, action, reward, next_state, done'' in the ''ReplayBuffer''.
* '''Training updates:''' Wait until at least one batch worth of samples is in the buffer, then each time step do one gradient update for actor and critic:
** Compute '''target Q''': using the target networks, compute {`y_i = r_i + \gamma (1-d_i)\,Q_{\text{target}}(s_{i+1}, \mu_{\text{target}}(s_{i+1}))`}. ''dones'' is a 0/1 indicator for terminal states; use ''(1 - done)'' to zero-out the future term if the episode ended.
Changed lines 291-292 from:
** '''Actor loss:''' we want to maximize the critic’s Q, so we minimize ''-critic(s, actor(s))'' (negative sign for gradient descent). This implements the deterministic policy gradient update.
**'''Target network update:''' We perform a soft update with factor {`\tau`}. A small {`\tau`} of ''0.005'' means the target networks change very slowly, providing a stable reference for targets.
to:
** '''Actor loss:''' maximize the critic’s Q, so minimize ''-critic(s, actor(s))'' (negative sign for gradient descent). This implements the deterministic policy gradient update.
** '''Target network update:''' perform a soft update with factor {`\tau`}. A small {`\tau`} of ''0.005'' means the target networks change very slowly, providing a stable reference for targets.
Changed lines 65-67 from:
Environments are custom RL setups:
to:
Gymnasium Environments are custom RL setups:
Changed lines 21-23 from:
The Pendulum
to:
The '''Pendulum-v1''' environment is a classic control task:
Changed lines 5-7 from:
Having covered the theory, we now discuss practical implementation using the '''gymnasium''' library (formerly OpenAI Gym). Gymnasium provides standard interfaces for reinforcement learning environments, facilitating easy experimentation with various RL algorithms.
!!! Using Gymnasium Environments
to:
Having covered the theory, we now discuss practical implementation using the '''gymnasium''' library (formerly OpenAI Gym). Gymnasium provides standard interfaces for reinforcement learning environments, facilitating experimentation with various RL algorithms.
Changed line 295 from:
** Compute '''target Q''': using the target networks, compute \(y_i = r_i + \gamma (1-d_i)\,Q_{\text{target}}(s_{i+1}, \mu_{\text{target}}(s_{i+1}))\). `dones` is a 0/1 indicator for terminal states; we use `(1 - done)` to zero-out the future term if the episode ended.
to:
** Compute '''target Q''': using the target networks, compute {`y_i = r_i + \gamma (1-d_i)\,Q_{\text{target}}(s_{i+1}, \mu_{\text{target}}(s_{i+1}))`}. ''dones'' is a 0/1 indicator for terminal states; we use ''(1 - done)'' to zero-out the future term if the episode ended.
Changed line 97 from:
'''Pendulum Problem with DDPG:''' The Pendulum-v1 environment is a classic control problem: the agent applies a torque to a pendulum (which starts hanging downward) and tries to swing it up and balance it upright. The state is 3-dimensional (angle represented as cos and sin, and angular velocity) and the action is 1-dimensional continuous (torque \(-2\) to \(+2\) N·m). The reward is \(-(\text{angle\_deviation}^2 + 0.1*\dot{\theta}^2 + 0.001*\text{torque}^2)\), which is highest (less negative) when the pendulum is upright and not moving (and using minimal torque). This is a perfect testbed for DDPG.
to:
'''Pendulum Problem with DDPG:''' The Pendulum-v1 environment is a classic control problem: the agent applies a torque to a pendulum (which starts hanging downward) and tries to swing it up and balance it upright. The state is 3-dimensional (angle represented as cos and sin, and angular velocity) and the action is 1-dimensional continuous (torque ''-2'' to ''+2'' N·m). The reward is highest (less negative) when the pendulum is upright and not moving (and using minimal torque). This is a perfect testbed for DDPG.
Changed lines 290-293 from:
* '''Action selection:''' We get the action from the actor network (`actor(state_tensor)`) and then add Gaussian noise for exploration. The noise scale decays by `exploration_noise * max_action` (so here 0.1 * 2.0 = 0.2 std dev initial noise). This encourages exploration of different torques. We then clip the action to the allowed range.
* '''Storing transitions:''' We store`(state, action, reward, next_state, done)` in the `ReplayBuffer`.
to:
* '''Action selection:''' We get the action from the actor network ''actor(state_tensor)'' and then add Gaussian noise for exploration. The noise scale decays by ''exploration_noise * max_action'' (so here 0.1 * 2.0 = 0.2 std dev initial noise). This encourages exploration of different torques. We then clip the action to the allowed range.
* '''Storing transitions:''' We store ''state, action, reward, next_state, done'' in the ''ReplayBuffer''.
Changed lines 296-298 from:
** '''Critic loss:''' mean squared error between `current_Q = critic(s,a)` and `target_Q`. This corresponds to minimizing \((Q(s,a) - y)^2\).
** '''Actor loss:''' we want to maximize the critic’s Q, so we minimize`-critic(s, actor(s))` (negative sign for gradient descent). This implements the deterministic policy gradient update.
** '''Target network update:''' We perform a soft update with factor`tau`. A small `tau` (0.005) means the target networks change very slowly, providing a stable reference for targets.
to:
** '''Critic loss:''' mean squared error between ''current_Q = critic(s,a)'' and ''target_Q''. This corresponds to minimizing {`(Q(s,a) - y)^2`}.
** '''Actor loss:''' we want to maximize the critic’s Q, so we minimize ''-critic(s, actor(s))'' (negative sign for gradient descent). This implements the deterministic policy gradient update.
** '''Target network update:''' We perform a soft update with factor {`\tau`}. A small {`\tau`} of ''0.005'' means the target networks change very slowly, providing a stable reference for targets.
Changed lines 99-101 from:
*We will implement a simplified DDPG for Pendulum.* This involves building the actor and critic networks, a replay buffer, and the training loop. We’ll use '''PyTorch''' in code (though one could use TensorFlow similarly). The code will highlight how to connect to the gymnasium environment and perform learning. (In practice, one might use an existing library like Stable Baselines3 for quick results, but here we write it out to understand the process.)
to:
*We will implement a simplified DDPG for Pendulum.
* This involves building the actor and critic networks, a replay buffer, and the training loop. We’ll use '''PyTorch''' in code (though one could use TensorFlow similarly). The code will highlight how to connect to the gymnasium environment and perform learning.
* In practice, one might use an existing library like [[https://stable-baselines3.readthedocs.io/en/master/|Stable Baselines3]] for quick results, but here we write it out to understand the process.)
Changed line 2 from:
(:keywords gymnasium, reinforcement learning, RL environments, Pendulum-v1, DDPG, CSTR, engineering applications:)
to:
(:keywords gymnasium, reinforcement learning, RL environments, Pendulum-v1, DDPG, engineering applications:)
Changed lines 69-72 from:
!!! Custom Engineering Environment: CSTR Optimization
Continuous Stirred Tank Reactor (CSTR) environments are custom RLsetups relevant in chemical engineering:
to:
!!! Custom Engineering Environment
Environments are custom RL setups:
Changed line 201 from:
to:
# Copy weights from actor to target_actor, and critic to target_critic
Added lines 96-300:
'''Pendulum Problem with DDPG:''' The Pendulum-v1 environment is a classic control problem: the agent applies a torque to a pendulum (which starts hanging downward) and tries to swing it up and balance it upright. The state is 3-dimensional (angle represented as cos and sin, and angular velocity) and the action is 1-dimensional continuous (torque \(-2\) to \(+2\) N·m). The reward is \(-(\text{angle\_deviation}^2 + 0.1*\dot{\theta}^2 + 0.001*\text{torque}^2)\), which is highest (less negative) when the pendulum is upright and not moving (and using minimal torque). This is a perfect testbed for DDPG.
*We will implement a simplified DDPG for Pendulum.* This involves building the actor and critic networks, a replay buffer, and the training loop. We’ll use '''PyTorch''' in code (though one could use TensorFlow similarly). The code will highlight how to connect to the gymnasium environment and perform learning. (In practice, one might use an existing library like Stable Baselines3 for quick results, but here we write it out to understand the process.)
'''Setting up the Pendulum Environment and Networks:'''
(:source lang=python:)
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import gymnasium as gym
# Make Pendulum environment
env = gym.make('Pendulum-v1', g=9.81) # g is gravitational acceleration, default 10
state_dim = env.observation_space.shape[0] # dimension of state (should be 3 for Pendulum)
action_dim = env.action_space.shape[0] # dimension of action (1 for Pendulum)
max_action = float(env.action_space.high[0]) # max torque (=2.0)
(:sourceend:)
We’ve imported the necessary libraries. We retrieve ''state_dim'', ''action_dim'', and ''max_action'' (to help scale outputs). Next, define the neural network models for actor and critic. A simple feedforward network suffices for this small problem:
(:source lang=python:)
# Actor Network: maps state -> action (within [-max_action, max_action])
class Actor(nn.Module):
def __init__(self, state_dim, action_dim, max_action):
super(Actor, self).__init__()
self.max_action = max_action
# Simple 2-layer MLP
self.net = nn.Sequential(
nn.Linear(state_dim, 128),
nn.ReLU(),
nn.Linear(128, 128),
nn.ReLU(),
nn.Linear(128, action_dim)
)
def forward(self, state):
# Output raw action, then scale to range [-max_action, max_action] using tanh
raw_action = self.net(state)
# bound output action between -1 and 1 via tanh, then scale
action = self.max_action * torch.tanh(raw_action)
return action
# Critic Network: maps (state, action) -> Q-value
class Critic(nn.Module):
def __init__(self, state_dim, action_dim):
super(Critic, self).__init__()
# Q-network takes state and action concatenated
self.net = nn.Sequential(
nn.Linear(state_dim + action_dim, 128),
nn.ReLU(),
nn.Linear(128, 128),
nn.ReLU(),
nn.Linear(128, 1)
)
def forward(self, state, action):
# Ensure state and action are concatenated as vectors
if action.dim() == 1:
action = action.unsqueeze(1)
x = torch.cat([state, action], dim=1)
Q = self.net(x)
return Q
(:sourceend:)
Here, the actor outputs a continuous action using a $\tanh$ activation to ensure it stays within \([-1,1]\), then we multiply by ''max_action'' to scale to [-2,2]. The critic concatenates state and action and outputs a scalar Q-value. We use two hidden layers of 128 units (a reasonable size for this small problem).
'''Replay Buffer:''' DDPG relies on a replay buffer to store transitions and sample them for training. Let’s implement a simple ring-buffer:
(:source lang=python:)
# Replay Buffer for experience replay
class ReplayBuffer:
def __init__(self, capacity=100000):
self.capacity = capacity
self.buffer = []
self.pos = 0 # position to insert next entry (for circular buffer)
def add(self, state, action, reward, next_state, done):
if len(self.buffer) < self.capacity:
self.buffer.append(None)
self.buffer[self.pos] = (state, action, reward, next_state, done)
# Move position pointer (overwrite oldest if full)
self.pos = (self.pos + 1) % self.capacity
def sample(self, batch_size):
batch = np.random.choice(len(self.buffer), batch_size, replace=False)
states, actions, rewards, next_states, dones = zip(*(self.buffer[i] for i in batch))
# Convert to torch tensors
return (torch.tensor(np.array(states), dtype=torch.float32),
torch.tensor(np.array(actions), dtype=torch.float32),
torch.tensor(np.array(rewards), dtype=torch.float32).unsqueeze(1),
torch.tensor(np.array(next_states), dtype=torch.float32),
torch.tensor(np.array(dones), dtype=torch.float32).unsqueeze(1))
def __len__(self):
return len(self.buffer)
(:sourceend:)
The buffer stores transitions and can sample a random batch for training (returns tensors for convenience).
'''Initialize DDPG components:'''
(:source lang=python:)
# Initialize actor, critic, target networks and optimizers
actor = Actor(state_dim, action_dim, max_action)
critic = Critic(state_dim, action_dim)
target_actor = Actor(state_dim, action_dim, max_action)
target_critic = Critic(state_dim, action_dim)
# Copy weights from actor to target_actor, and critic to target_critic
target_actor.load_state_dict(actor.state_dict())
target_critic.load_state_dict(critic.state_dict())
target_actor.eval()
target_critic.eval()
actor_optimizer = optim.Adam(actor.parameters(), lr=1e-3)
critic_optimizer = optim.Adam(critic.parameters(), lr=1e-3)
buffer = ReplayBuffer(capacity=100000)
(:sourceend:)
We create the networks and set target networks initially equal to the main networks. We use Adam optimizers for both actor and critic. We also create the replay buffer. The learning rates (1e-3) are chosen as typical starting points.
'''Training Loop:''' Now, we train the DDPG agent. We will outline the loop and key steps:
(:source lang=python:)
import math
num_episodes = 200 # number of episodes to train
batch_size = 64 # batch size for sampling from replay
gamma = 0.99 # discount factor
tau = 0.005 # target network update rate (tau)
exploration_noise = 0.1 # stddev for Gaussian exploration noise
for episode in range(num_episodes):
state, _ = env.reset()
state = state.astype(np.float32)
episode_reward = 0.0
for step in range(500): # max steps per episode (Pendulum typically truncated at 200)
# Select action according to current policy + exploration noise
state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
with torch.no_grad():
action = actor(state_tensor).cpu().numpy()[0]
# Add exploration noise (Gaussian)
action = action + np.random.normal(0, exploration_noise * max_action, size=action_dim)
action = np.clip(action, -max_action, max_action)
next_state, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated
next_state = next_state.astype(np.float32)
# Store transition in replay buffer
buffer.add(state, action, reward, next_state, done)
state = next_state
episode_reward += reward
# Train the networks if we have enough samples in replay buffer
if len(buffer) >= batch_size:
# Sample a batch
states, actions, rewards, next_states, dones = buffer.sample(batch_size)
# Compute target Q values using target networks
with torch.no_grad():
# Target actor for next action
next_actions = target_actor(next_states)
target_Q = target_critic(next_states, next_actions)
# If done (terminal), no future reward; use (1-done) mask
target_Q = rewards + gamma * (1 - dones) * target_Q
# Critic loss = MSE between current Q and target Q
current_Q = critic(states, actions)
critic_loss = nn.MSELoss()(current_Q, target_Q)
# Update critic
critic_optimizer.zero_grad()
critic_loss.backward()
critic_optimizer.step()
# Actor loss = -mean(Q) (because we want to maximize Q, so minimize -Q)
actor_actions = actor(states)
actor_loss = -critic(states, actor_actions).mean()
# Update actor
actor_optimizer.zero_grad()
actor_loss.backward()
actor_optimizer.step()
# Soft update target networks
for param, target_param in zip(critic.parameters(), target_critic.parameters()):
target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
for param, target_param in zip(actor.parameters(), target_actor.parameters()):
target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
if done:
break # episode ends
# Logging (print) the cumulative reward of the episode
print(f"Episode {episode+1}: Reward = {episode_reward:.2f}")
(:sourceend:)
A few notes on the code above:
* Loop over episodes. For each episode, reset the environment and then loop for a certain number of steps (we use 500 as an upper bound to ensure we don’t loop forever if the environment doesn’t terminate early; Pendulum naturally truncates around 200 steps by default).
* '''Action selection:''' We get the action from the actor network (`actor(state_tensor)`) and then add Gaussian noise for exploration. The noise scale decays by `exploration_noise * max_action` (so here 0.1 * 2.0 = 0.2 std dev initial noise). This encourages exploration of different torques. We then clip the action to the allowed range.
* '''Storing transitions:''' We store `(state, action, reward, next_state, done)` in the `ReplayBuffer`.
* '''Training updates:''' We wait until we have at least one batch worth of samples in the buffer, then each time step we do one gradient update for actor and critic:
** Compute '''target Q''': using the target networks, compute \(y_i = r_i + \gamma (1-d_i)\,Q_{\text{target}}(s_{i+1}, \mu_{\text{target}}(s_{i+1}))\). `dones` is a 0/1 indicator for terminal states; we use `(1 - done)` to zero-out the future term if the episode ended.
** '''Critic loss:''' mean squared error between `current_Q = critic(s,a)` and `target_Q`. This corresponds to minimizing \((Q(s,a) - y)^2\).
** '''Actor loss:''' we want to maximize the critic’s Q, so we minimize `-critic(s, actor(s))` (negative sign for gradient descent). This implements the deterministic policy gradient update.
** '''Target network update:''' We perform a soft update with factor `tau`. A small `tau` (0.005) means the target networks change very slowly, providing a stable reference for targets.
* Print episode rewards for monitoring. Over training, we expect the episode reward (which is negative in Pendulum, but closer to 0 is better since 0 would be perfect upright balance) to increase (become less negative), indicating the pendulum is kept more upright.
This code is '''executable''' and would train a DDPG agent on the Pendulum problem. With 200 episodes, it should achieve a decent policy (though possibly not perfect; longer training yields better results). In practice, one might incorporate learning rate schedules, noise decay, or more episodes.
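After training, a quick check of the learned policy is a greedy rollout with the exploration noise removed. The following is a minimal sketch that assumes the ''actor'', ''max_action'', and imports defined above; add ''render_mode='human''' to ''gym.make'' to visualize the pendulum.
(:source lang=python:)
# Evaluate the trained actor with a deterministic (noise-free) rollout
eval_env = gym.make('Pendulum-v1', g=9.81)
state, _ = eval_env.reset()
total_reward = 0.0
for _ in range(200):  # Pendulum-v1 truncates at 200 steps
    state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        action = actor(state_tensor).cpu().numpy()[0]  # no exploration noise at evaluation
    action = np.clip(action, -max_action, max_action)
    state, reward, terminated, truncated, _ = eval_env.step(action)
    total_reward += reward
    if terminated or truncated:
        break
print(f"Evaluation reward: {total_reward:.2f}")
eval_env.close()
(:sourceend:)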
Changed line 97 from:
!!! Additional Gymnasium Examples for Engineering
to:
!!! Gymnasium Examples for Engineering
Changed lines 63-64 from:
to:
# Initialize environment and replay buffer.
# Collect experiences and perform network updates:
Changed lines 65-67 from:
to:
** Critic loss: minimize squared difference from target Q.
** Actor loss: maximize critic Q-values.
** Use soft updates for target networks.
Changed lines 29-30 from:
* Reward: {`-(angle\_deviation^2 + 0.1*\dot{\theta}^2 + 0.001*torque^2)`}
to:
* Reward: {`-(\Delta \theta^2 + 0.1*\dot{\theta}^2 + 0.001*\tau^2)`}
Changed lines 65-68 from:
to:
** Critic loss: minimize squared difference from target Q.
** Actor loss: maximize critic Q-values.
** Use soft updates for target networks.
Changed lines 99-106 from:
Gymnasium provides
to:
* '''CartPole''': Balancing pole, discrete actions, robotics/control theory.
* '''MountainCar''': Illustrates energy management.
* '''Acrobot''': Robotic arm analogy.
* '''LunarLander''': Aerospace engineering control.
* '''BipedalWalker''': Biomechanical control using continuous actions.
* '''HVAC or Water Resource Management''': Custom environments applicable to industrial optimization.
Gymnasium provides a test environment to implement RL algorithms across engineering domains.
Added lines 1-106:
(:title RL with Gymnasium:)
(:keywords gymnasium, reinforcement learning, RL environments, Pendulum-v1, DDPG, CSTR, engineering applications:)
(:description Practical guide for implementing RL algorithms using Gymnasium environments, with examples including Pendulum control using DDPG and custom engineering environments:)
Having covered the theory, we now discuss practical implementation using the '''gymnasium''' library (formerly OpenAI Gym). Gymnasium provides standard interfaces for reinforcement learning environments, facilitating easy experimentation with various RL algorithms.
!!! Using Gymnasium Environments
To use Gymnasium environments:
(:source lang=python:)
import gymnasium as gym
env = gym.make('<Environment-Name>')
state, info = env.reset()
(:sourceend:)
The environment can be stepped through using:
(:source lang=python:)
next_state, reward, terminated, truncated, info = env.step(action)
(:sourceend:)
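For example, a complete episode with random actions ties these calls together. This is a minimal sketch using ''Pendulum-v1''; any installed environment name works:
(:source lang=python:)
import gymnasium as gym

# Run one episode with random actions to exercise reset/step
env = gym.make('Pendulum-v1')
state, info = env.reset()
episode_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()      # random torque in [-2, 2]
    state, reward, terminated, truncated, info = env.step(action)
    episode_reward += reward
    done = terminated or truncated          # episode ends on either flag
env.close()
print(f"Random-policy episode reward: {episode_reward:.2f}")
(:sourceend:)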
!!! Pendulum Problem with DDPG
The Pendulum-v1 environment is a classic control task:
* State: {`[\cos\theta, \sin\theta, \dot{\theta}]`}
* Action: Continuous torque between {`[-2, +2]`} N·m
* Reward: {`-(angle\_deviation^2 + 0.1*\dot{\theta}^2 + 0.001*torque^2)`}
'''Actor and Critic Networks (PyTorch Example):'''
(:source lang=python:)
import torch
import torch.nn as nn
class Actor(nn.Module):
def __init__(self, state_dim, action_dim, max_action):
super().__init__()
self.max_action = max_action
self.net = nn.Sequential(
nn.Linear(state_dim, 128), nn.ReLU(),
nn.Linear(128, 128), nn.ReLU(),
nn.Linear(128, action_dim)
)
def forward(self, state):
return self.max_action * torch.tanh(self.net(state))
class Critic(nn.Module):
def __init__(self, state_dim, action_dim):
super().__init__()
self.net = nn.Sequential(
nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
nn.Linear(128, 128), nn.ReLU(),
nn.Linear(128, 1)
)
def forward(self, state, action):
return self.net(torch.cat([state, action], dim=1))
(:sourceend:)
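For ''Pendulum-v1'' the networks are instantiated with the dimensions listed above (state 3, action 1, max torque 2.0). A brief sketch queries the untrained actor and critic for one state:
(:source lang=python:)
# Instantiate the networks for Pendulum-v1 and query them for a single state
state_dim, action_dim, max_action = 3, 1, 2.0
actor = Actor(state_dim, action_dim, max_action)
critic = Critic(state_dim, action_dim)

obs = torch.tensor([[1.0, 0.0, 0.0]])   # cos(theta)=1, sin(theta)=0, theta_dot=0 (upright, at rest)
with torch.no_grad():
    a = actor(obs)                       # torque bounded to [-2, 2] by the tanh scaling
    q = critic(obs, a)                   # scalar Q-value estimate
print(a.item(), q.item())
(:sourceend:)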
'''Training Steps Outline''' (a compact update-step sketch follows this list):
1. Initialize environment and replay buffer.
2. Collect experiences and perform network updates:
- Critic loss: minimize squared difference from target Q.
- Actor loss: maximize critic Q-values.
- Use soft updates for target networks.
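The sketch below shows one update step corresponding to this outline. It assumes the ''Actor''/''Critic'' networks above, target copies ''target_actor''/''target_critic'', Adam optimizers, and a sampled batch of transition tensors (''states'', ''actions'', ''rewards'', ''next_states'', ''dones''):
(:source lang=python:)
# One DDPG update step (sketch): critic regression, actor ascent on Q, soft target update
gamma, tau = 0.99, 0.005

# Critic loss: minimize squared difference from the target Q
with torch.no_grad():
    target_Q = rewards + gamma * (1 - dones) * target_critic(next_states, target_actor(next_states))
critic_loss = nn.MSELoss()(critic(states, actions), target_Q)
critic_optimizer.zero_grad(); critic_loss.backward(); critic_optimizer.step()

# Actor loss: maximize the critic's Q by minimizing its negative
actor_loss = -critic(states, actor(states)).mean()
actor_optimizer.zero_grad(); actor_loss.backward(); actor_optimizer.step()

# Soft updates for target networks
for net, target in [(critic, target_critic), (actor, target_actor)]:
    for p, tp in zip(net.parameters(), target.parameters()):
        tp.data.copy_(tau * p.data + (1 - tau) * tp.data)
(:sourceend:)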
!!! Custom Engineering Environment: CSTR Optimization
Continuous Stirred Tank Reactor (CSTR) environments are custom RL setups relevant in chemical engineering:
* State: Concentration, temperature.
* Action: Coolant flow rate or other continuous variables.
* Reward: Based on optimization goals (e.g., yield, safety constraints).
Implement by subclassing Gymnasium:
(:source lang=python:)
import gymnasium as gym
import numpy as np

class CSTREnv(gym.Env):
    def __init__(self):
        # State: [concentration, temperature]; action: coolant flow rate (normalized)
        self.observation_space = gym.spaces.Box(
            low=np.array([0.0, 0.0]), high=np.array([10.0, 500.0]), dtype=np.float32)
        self.action_space = gym.spaces.Box(
            low=np.array([0.0]), high=np.array([1.0]), dtype=np.float32)
        # Define reactor dynamics parameters and initial conditions here
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = np.array([1.0, 300.0], dtype=np.float32)  # initialize state (placeholder values)
        return self.state, {}
    def step(self, action):
        # Update state using reactor dynamics equations (placeholder: state held constant)
        next_state = self.state
        # Calculate reward and done status from the optimization goals
        reward, terminated, truncated = 0.0, False, False
        return next_state, reward, terminated, truncated, {}
(:sourceend:)
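Once the reactor dynamics are filled in, the custom environment can be used directly or registered so that ''gym.make'' can create it by name. This is a sketch; the id ''CSTR-v0'' and module path ''cstr_env'' are placeholders:
(:source lang=python:)
from gymnasium.envs.registration import register

# Use the class directly
env = CSTREnv()
state, info = env.reset()

# Or register it so gym.make('CSTR-v0') works anywhere in the project
register(id='CSTR-v0', entry_point='cstr_env:CSTREnv')  # 'cstr_env' = module file defining CSTREnv
env = gym.make('CSTR-v0')
(:sourceend:)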
!!! Additional Gymnasium Examples for Engineering
- '''CartPole''': Balancing pole, discrete actions, robotics/control theory.
- '''MountainCar''': Illustrates energy management.
- '''Acrobot''': Robotic arm analogy.
- '''LunarLander''': Aerospace engineering control.
- '''BipedalWalker''': Biomechanical control using continuous actions.
- '''HVAC or Water Resource Management''': Custom environments applicable to industrial optimization.
Gymnasium provides flexibility to test and implement RL algorithms across diverse engineering domains.