Transformers: LLMs and Time-Series

Transformers have revolutionized the field of Natural Language Processing (NLP) and are increasingly being used in time-series forecasting. Originally introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017), transformers have become the backbone of many state-of-the-art language models such as BERT and GPT.

Transformers in Large Language Models (LLMs)

Transformers in LLMs like GPT and BERT use self-attention mechanisms to process text. They are capable of capturing contextual information from the entire text input, making them effective for a variety of NLP tasks.
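
To make self-attention concrete, here is a minimal sketch of scaled dot-product attention in PyTorch; the dimensions and random inputs are illustrative only, not from any particular model:

import torch
import torch.nn.functional as F

def self_attention(x):
    # Attention(Q,K,V) = softmax(Q K^T / sqrt(d_k)) V, with Q = K = V = x
    d_k = x.size(-1)
    scores = x @ x.transpose(-2, -1) / d_k**0.5
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ x

x = torch.randn(4, 8)           # 4 token embeddings of dimension 8
print(self_attention(x).shape)  # torch.Size([4, 8])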

Basic Usage Example with the transformers package

from transformers import pipeline

# Using a pre-trained model
generator = pipeline('text-generation', model='gpt2')
generated_text = generator("Today is a beautiful day and", max_length=30)
print(generated_text)  # list of dicts with a 'generated_text' key

Fine-tuning a transformer model customizes it for a specific task or dataset. For example, fine-tuning on Python Gekko data adjusts the model to better understand and generate Python code that uses the Gekko library. For detailed steps and code for fine-tuning the Microsoft Phi-2 (microsoft/phi-2) model with Python Gekko training data, refer to the accompanying GitHub repository.
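
As a rough illustration of the pattern (not the repository's exact code), causal-LM fine-tuning with the Hugging Face Trainer looks like the sketch below; gekko_train.txt is a hypothetical text file of Gekko examples:

# Minimal causal-LM fine-tuning sketch (not the repository's exact code)
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

model_id = 'microsoft/phi-2'
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

# 'gekko_train.txt' is a hypothetical file of Gekko code examples
raw = load_dataset('text', data_files={'train': 'gekko_train.txt'})
def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=512)
tokens = raw.map(tokenize, batched=True, remove_columns=['text'])

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
args = TrainingArguments(output_dir='phi2-gekko',
                         num_train_epochs=1,
                         per_device_train_batch_size=1)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokens['train'],
                  data_collator=collator)
trainer.train()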

Transformers in Time-Series Forecasting

In time-series forecasting, transformers are used to analyze sequential data, capturing temporal dependencies. They are particularly effective in scenarios where long-range dependencies are important.

Data Generation: This section of the script creates synthetic time-series data from a sine function. The data is split into overlapping windows of a specified length, and each window is paired with the next point in the series as its prediction target, mirroring the common forecasting setup in which past observations are used to predict future values.

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Generating synthetic time-series data
def generate_data(size=1000, sequence_length=10):
    data = np.sin(np.linspace(0, 10 * np.pi, size))  # Sine wave data
    # Pair each window data[i:i+sequence_length] with the next value
    sequences = [data[i:i+sequence_length] for i in range(size-sequence_length)]
    next_points = data[sequence_length:]
    return np.array(sequences), next_points

Custom Dataset Class: This part defines a custom Dataset class, TimeSeriesDataset, that makes the time-series data compatible with PyTorch's data-handling utilities. It stores the sequences and their corresponding next points and implements __len__ and __getitem__ so that each window and its target can be retrieved by index, ready to feed into a DataLoader for batching and shuffling during training.

# Custom dataset class
class TimeSeriesDataset(Dataset):
    def __init__(self, sequences, next_points):
        self.sequences = sequences
        self.next_points = next_points

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx], self.next_points[idx]
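
A quick usage check of the two pieces so far; the values assume the default sizes above:

# Quick check: 990 windows, each of length 10
seqs, nxt = generate_data()
ds = TimeSeriesDataset(seqs, nxt)
print(len(ds))         # 990
print(ds[0][0].shape)  # (10,)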

Transformer Model Definition: In this section, a transformer model tailored for numerical time-series data is defined. The TransformerModel class extends PyTorch's nn.Module, passes the data through a transformer encoder, and uses a fully connected output layer to predict the next point in the series. Note that this simplified model flattens each window into a single d_model-dimensional token, so the encoder attends over a sequence of length one; a fuller forecasting transformer would treat each time step as a separate token and add positional encodings.

# Transformer Model (simplified for numerical data)
class TransformerModel(nn.Module):
    def __init__(self, input_size=1, sequence_length=10, num_layers=1,
                 num_heads=2, dim_feedforward=512):
        super(TransformerModel, self).__init__()
        self.sequence_length = sequence_length
        self.encoder_layer = nn.TransformerEncoderLayer(
            d_model=input_size*sequence_length, nhead=num_heads,
            dim_feedforward=dim_feedforward)
        self.transformer_encoder = nn.TransformerEncoder(
            self.encoder_layer, num_layers=num_layers)
        self.fc_out = nn.Linear(input_size * sequence_length, 1)

    def forward(self, src):
        # Flatten each window into a single d_model-dimensional token
        src = src.reshape(-1, self.sequence_length, 1)
        src = src.flatten(start_dim=1)
        src = src.unsqueeze(0)  # (seq_len=1, batch, d_model) for the encoder
        out = self.transformer_encoder(src)
        out = out.squeeze(0)  # drop the sequence dimension
        return self.fc_out(out)

Data Preparation for Training: This part of the script instantiates the dataset and wraps it in a DataLoader, which batches and shuffles the data for efficient training.

# Prepare data
sequences, next_points = generate_data()
dataset = TimeSeriesDataset(sequences, next_points)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
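
As a quick sanity check, one batch from the loader has these shapes (assuming the default sizes above):

# Inspect one batch from the DataLoader
seq_batch, next_batch = next(iter(dataloader))
print(seq_batch.shape)   # torch.Size([32, 10]) -> 32 windows of 10 points
print(next_batch.shape)  # torch.Size([32])     -> one target per window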

Model Training: Here, the model is trained using the time-series data. The training loop involves passing batches of data through the model, calculating the loss (using Mean Squared Error as the criterion), and updating the model's weights with backpropagation. The optimizer used is Adam, a popular choice for training neural networks.

# Model
model = TransformerModel()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(9):  # Number of epochs
    for seq, next_point in dataloader:
        seq, next_point = seq.float(), next_point.float().unsqueeze(1)
        output = model(seq)
        loss = criterion(output, next_point)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")
  Epoch 1, Loss: 0.057174958288669586
  Epoch 2, Loss: 0.030068831518292427
  Epoch 3, Loss: 0.011044860817492008
  Epoch 4, Loss: 0.034356195479631424
  Epoch 5, Loss: 0.029425013810396194
  Epoch 6, Loss: 0.10149335861206055
  Epoch 7, Loss: 0.007862072438001633
  Epoch 8, Loss: 0.0072705368511378765
  Epoch 9, Loss: 0.008393393829464912

Prediction: In the final part of the script, the trained model takes a sequence from the dataset and predicts the next data point, demonstrating the intended one-step-ahead forecast.

# Predict the next point after a sequence
test_seq = torch.tensor(sequences[0]).float()
predicted_point = model(test_seq)
print("Predicted next point:", predicted_point.item())
  Predicted next point: 0.3461025655269623
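
The model predicts a single step ahead. A multi-step forecast can be produced by feeding each prediction back into the input window; a minimal sketch using the model and data from above:

# Autoregressive multi-step forecast: feed predictions back in
window = torch.tensor(sequences[0]).float()  # most recent 10 points
forecast = []
model.eval()
with torch.no_grad():
    for _ in range(20):  # predict 20 steps ahead
        next_pt = model(window).item()
        forecast.append(next_pt)
        # slide the window: drop the oldest point, append the prediction
        window = torch.cat([window[1:], torch.tensor([next_pt])])
print(forecast)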

The complete code for generating synthetic data, training the transformer, and predicting the next point as a time-series forecast is given below.

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Generating synthetic time-series data
def generate_data(size=1000, sequence_length=10):
    data = np.sin(np.linspace(0, 10 * np.pi, size))  # Sine wave data
    # Pair each window data[i:i+sequence_length] with the next value
    sequences = [data[i:i+sequence_length] for i in range(size-sequence_length)]
    next_points = data[sequence_length:]
    return np.array(sequences), next_points

# Custom dataset class
class TimeSeriesDataset(Dataset):
    def __init__(self, sequences, next_points):
        self.sequences = sequences
        self.next_points = next_points

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx], self.next_points[idx]

# Transformer Model (simplified for numerical data)
class TransformerModel(nn.Module):
    def __init__(self, input_size=1, sequence_length=10, num_layers=1,
                 num_heads=2, dim_feedforward=512):
        super(TransformerModel, self).__init__()
        self.sequence_length = sequence_length
        self.encoder_layer = nn.TransformerEncoderLayer(
            d_model=input_size*sequence_length, nhead=num_heads,
            dim_feedforward=dim_feedforward)
        self.transformer_encoder = nn.TransformerEncoder(
            self.encoder_layer, num_layers=num_layers)
        self.fc_out = nn.Linear(input_size * sequence_length, 1)

    def forward(self, src):
        # Flatten each window into a single d_model-dimensional token
        src = src.reshape(-1, self.sequence_length, 1)
        src = src.flatten(start_dim=1)
        src = src.unsqueeze(0)  # (seq_len=1, batch, d_model) for the encoder
        out = self.transformer_encoder(src)
        out = out.squeeze(0)  # drop the sequence dimension
        return self.fc_out(out)

# Prepare data
sequences, next_points = generate_data()
dataset = TimeSeriesDataset(sequences, next_points)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Model
model = TransformerModel()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(9):  # Number of epochs
    for seq, next_point in dataloader:
        seq, next_point = seq.float(), next_point.float().unsqueeze(1)
        output = model(seq)
        loss = criterion(output, next_point)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")

# Predict the next point after a sequence
test_seq = torch.tensor(sequences[0]).float()
predicted_point = model(test_seq)
print("Predicted next point:", predicted_point.item())

✅ Knowledge Check

1. What is the primary mechanism that transformers use to process text in language models?

A. Recurrent Neural Networks (RNN)
Incorrect. RNNs process tokens sequentially; transformers instead rely on self-attention.
B. Self-Attention Mechanisms
Correct. Transformers use self-attention mechanisms to process text.
C. Long Short-Term Memory (LSTM) units
Incorrect. LSTMs are a type of recurrent network; transformers instead rely on self-attention.

2. Why are transformers effective in time-series forecasting?

A. Transformers are good at capturing long-range dependencies in data.
Correct. Transformers are effective because they can capture long-range dependencies in sequential data.
B. They reduce the need for large datasets.
Incorrect. The effectiveness of transformers is not primarily due to reduced data requirements.
C. Transformers have vanishing gradients.
Incorrect. Vanishing gradients are a weakness of recurrent models such as RNNs and LSTMs, not transformers, and are one reason those models struggle to capture long-range dependencies.