TCLab with Reinforcement Learning

This page demonstrates a Reinforcement Learning (RL) approach for controlling the Temperature Control Lab (TCLab) using a Deep Deterministic Policy Gradient (DDPG) algorithm in PyTorch. The RL agent learns to adjust the heater power to maintain a desired temperature setpoint.

TCLab Environment

The TCLab is an Arduino-based temperature control system with:

  • Two heaters
  • Two temperature sensors
  • Python / MATLAB / Simulink interface

The RL agent learns to control heater power to maintain a temperature set point.
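
The short sketch below shows how to read temperatures and set heater power with the tclab Python package. It assumes the package is installed (pip install tclab) and uses the simulated TCLabModel so it runs without hardware.

import time
import tclab

# TCLabModel() simulates the lab; replace with tclab.TCLab() when hardware is connected
with tclab.TCLabModel() as lab:
    lab.Q1(50)  # Set heater 1 to 50% power
    for _ in range(5):
        print(f"T1 = {lab.T1:.2f} C, T2 = {lab.T2:.2f} C")
        time.sleep(1)  # Wait one second between readings
    lab.Q1(0)  # Turn heater off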

Gymnasium Custom Environment

Define a custom Gymnasium environment that interfaces the TCLab (Temperature Control Lab) hardware with Python. The environment lets the RL agent apply heater actions and receive temperature readings from the sensors.

import gymnasium as gym
import numpy as np
import torch
import tclab

class TCLabEnv(gym.Env):
    def __init__(self, setpoint=50):
        super(TCLabEnv, self).__init__()
        self.lab = tclab.TCLabModel()  # Simulated lab; use tclab.TCLab() to connect to real hardware
        self.setpoint = setpoint
        self.action_space = gym.spaces.Box(low=np.array([0]), high=np.array([100]), dtype=np.float32)
        self.observation_space = gym.spaces.Box(low=np.array([0]), high=np.array([100]), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)  # Gymnasium API expects seed/options arguments
        self.lab.Q1(0)  # Turn off heaters
        self.lab.Q2(0)
        return np.array([self.lab.T1], dtype=np.float32), {}

    def step(self, action):
        self.lab.Q1(action[0])  # Apply heater power (0-100%)
        self.lab.Q2(action[0])
        temperature = self.lab.T1  # Read temperature sensor
        reward = -abs(temperature - self.setpoint)  # Reward: minimize setpoint error
        done = False  # No terminal state in continuous control
        # On real hardware, pause between steps (e.g., time.sleep(1)) so the action takes effect
        return np.array([temperature], dtype=np.float32), reward, done, False, {}

    def close(self):
        self.lab.Q1(0)
        self.lab.Q2(0)
        self.lab.close()
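
A quick sanity check of the environment (a sketch; the printed temperatures depend on the simulated lab state):

env = TCLabEnv(setpoint=50)
obs, info = env.reset()
for _ in range(3):
    obs, reward, terminated, truncated, info = env.step([80.0])  # 80% heater power
    print(f"T1 = {obs[0]:.2f} C, reward = {reward:.2f}")
env.close()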

Actor-Critic Networks (PyTorch)

Define the Actor and Critic neural networks. This code creates two neural network classes using PyTorch:

  • Actor: Determines the control action (heater power level) based on the current temperature.
  • Critic: Evaluates the quality of actions by estimating the expected future rewards from a given state-action pair.

import torch.nn as nn
import torch.optim as optim

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(Actor, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
            nn.Sigmoid()  # Output between 0 and 1
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1)
        )

    def forward(self, state, action):
        if action.dim() == 1:  # If action is 1D, reshape it to (batch_size, action_dim)
            action = action.unsqueeze(1)
        return self.net(torch.cat([state, action], dim=1))
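
An illustrative shape check of the two networks, using a random batch of states; the dimensions match the single-temperature, single-heater setup used below:

actor = Actor(state_dim=1, action_dim=1, max_action=100)
critic = Critic(state_dim=1, action_dim=1)

states = torch.rand(8, 1) * 100      # Batch of 8 temperatures in [0, 100]
actions = actor(states)              # Shape (8, 1), values in [0, 100]
q_values = critic(states, actions)   # Shape (8, 1)
print(actions.shape, q_values.shape)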

Replay Buffer

Implements a memory buffer to store experience tuples (state, action, reward, next state, done). The replay buffer randomly samples batches of experiences for stable training of the neural networks.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (
            torch.tensor(np.array(states), dtype=torch.float32),
            torch.tensor(np.array(actions), dtype=torch.float32),
            torch.tensor(rewards, dtype=torch.float32).unsqueeze(1),
            torch.tensor(np.array(next_states), dtype=torch.float32),
            torch.tensor(dones, dtype=torch.float32).unsqueeze(1)
        )

    def __len__(self):  # Allows len(buffer) checks in the training loop
        return len(self.buffer)
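
Example usage of the buffer with a single synthetic transition (placeholder values):

buffer = ReplayBuffer(capacity=10000)
buffer.add(np.array([25.0]), 40.0, -25.0, np.array([26.5]), False)
states, actions, rewards, next_states, dones = buffer.sample(1)
print(states.shape, actions.shape, rewards.shape)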

Training Loop

Executes the main training process where the RL agent interacts with the TCLab environment over multiple episodes. In each step:

  • The actor selects heater power actions.
  • Experience data is collected and stored in the replay buffer.
  • The actor and critic networks are trained using sampled experiences.
  • Target networks are softly updated to stabilize learning.

# Initialize Gymnasium environment
env = TCLabEnv(setpoint=50)

actor = Actor(state_dim=1, action_dim=1, max_action=100)
critic = Critic(state_dim=1, action_dim=1)
target_actor = Actor(state_dim=1, action_dim=1, max_action=100)
target_critic = Critic(state_dim=1, action_dim=1)

target_actor.load_state_dict(actor.state_dict())
target_critic.load_state_dict(critic.state_dict())

actor_optimizer = optim.Adam(actor.parameters(), lr=1e-3)
critic_optimizer = optim.Adam(critic.parameters(), lr=1e-3)
buffer = ReplayBuffer()

gamma = 0.99
tau = 0.005

for episode in range(100):
    state, _ = env.reset()
    episode_reward = 0
    for step in range(200):
        state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
        action = actor(state_tensor).detach().cpu().numpy().flatten()[0]  # Convert to scalar
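        # DDPG normally perturbs this action with exploration noise (e.g., Gaussian); omitted here for simplicity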
        next_state, reward, done, _, _ = env.step([action])  # Wrap action in list
        buffer.add(state, action, reward, next_state, done)
        state = next_state
        episode_reward += reward

        if len(buffer) > 64:
            states, actions, rewards, next_states, dones = buffer.sample(64)
            with torch.no_grad():
                next_actions = target_actor(next_states)
                target_q = target_critic(next_states, next_actions)
                target_q = rewards + gamma * (1 - dones) * target_q
            current_q = critic(states, actions)
            critic_loss = nn.MSELoss()(current_q, target_q)

            critic_optimizer.zero_grad()
            critic_loss.backward()
            critic_optimizer.step()

            actor_loss = -critic(states, actor(states)).mean()
            actor_optimizer.zero_grad()
            actor_loss.backward()
            actor_optimizer.step()

            for param, target_param in zip(critic.parameters(), target_critic.parameters()):
                target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

            for param, target_param in zip(actor.parameters(), target_actor.parameters()):
                target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

    print(f"Episode {episode+1}: Reward = {episode_reward:.2f}")

env.close()

This RL implementation attempts to control the TCLab temperature using DDPG with PyTorch. The agent learns to adjust heater power to maintain the temperature setpoint with minimal error.
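
After training, the policy can be evaluated greedily (no exploration, no learning updates). The sketch below re-opens the environment, since it was closed above, and records the measured temperatures; plotting is omitted.

# Evaluate the trained actor for one episode (greedy, no updates)
env = TCLabEnv(setpoint=50)
state, _ = env.reset()
temperatures = []
for step in range(200):
    with torch.no_grad():
        action = actor(torch.tensor(state, dtype=torch.float32).unsqueeze(0)).numpy().flatten()[0]
    state, reward, done, _, _ = env.step([action])
    temperatures.append(state[0])
print(f"Final temperature: {temperatures[-1]:.2f} C (setpoint = {env.setpoint} C)")
env.close()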

Next Steps

  • Increase the number of training episodes beyond 100
  • Train on real TCLab hardware
  • Optimize hyperparameters
  • Compare RL vs. PID control (a baseline PID sketch follows below)
  • Convert from SISO to MIMO control
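
For the PID comparison, a simple discrete PI controller on the simulated lab can serve as a baseline; the gains below are illustrative assumptions, not tuned values.

import time
import tclab

# Simple PI baseline for comparison with the RL policy (illustrative gains)
Kp, Ki = 5.0, 0.1
setpoint = 50.0
integral = 0.0

with tclab.TCLabModel() as lab:
    for step in range(200):
        error = setpoint - lab.T1
        integral += error
        Q = max(0.0, min(100.0, Kp * error + Ki * integral))  # Clamp heater power to 0-100%
        lab.Q1(Q)
        time.sleep(1)  # One-second sample time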

This guide provides a step-by-step RL implementation for process control applications using TCLab.
