Scale Data for Machine Learning

Scaling (inputs and outputs) can improve the training process for machine learning. Certain types of classifiers do not improve with data scaling. These include Decision Trees, RandomForest, and XGBoost. Most other types of classifiers are very sensitive to scaling. The example at the end of this page shows the negative impact of unscaled data for a Neural Network Classifier. The data is scaled before training and unscaled to evaluate the performance after prediction.

A common scaling technique is to divide by the standard deviation and shift the mean to 0. Another common scaling approach is to adjust all of the data to a range of 0 to 1 or -1 to 1. Each data column is scaled individually.

There are different methods for scaling that are important based on the presence of outliers or statistical properties of the data. Two primary methods for scaling are a standard scaler (scale by the standard deviation) and a min-max (e.g. 0-1) scaler. For classifiers and regressor such as neural networks, most of the data should be between 0 and 1 or -1 and 1.

import numpy as np
import matplotlib.pyplot as plt

# Generate a distribution
x = 0.5*np.random.randn(1000)+4

# Standard (mean=0, stdev=1) Scaler
y = (x-np.mean(x))/np.std(x)

# Min-Max (0-1) Scaler
z = (x-np.min(x))/(np.max(x)-np.min(x))

# Plot distributions
plt.figure(figsize=(8,4))
plt.hist(x, bins=30, label='original')
plt.hist(y, alpha=0.7, bins=30, label='standard scaler')
plt.hist(z, alpha=0.7, bins=30, label='minmax scaler')
plt.legend()
plt.show()

The scaled output is y and the unscaled input is x.

$$y = a \left(x-b\right)$$

The a and b adjust the original data to a transformed state.

Sample Data

Import sample data and split into training (80%) and testing (20%) sets. More information on Splitting Data is available in another module.

from sklearn.model_selection import train_test_split
import pandas as pd

data = pd.read_csv('http://apmonitor.com/pds/uploads/Main/tclab_data6.txt')
data.set_index('Time',inplace=True)

# Split into train and test subsets (20% for test)
train, test = train_test_split(data, test_size=0.2, shuffle=False)

Standard Scaler

The standard scaler transforms data train into a standard normal distribution from mean=`\bar x` and standard deviation `\sigma` to a new distribution with mean=0 and unit standard deviation (`\sigma=1`).

$$y = \frac{1}{\sigma} \left(x-\bar x\right)$$

$$a=\frac{1}{\sigma},\;b=\bar x$$

The scikit-learn (sklearn) package facilitates scaling with either the fit or the fit_transform functions. The fit_transform function combined the fit and transform functions into a single operation.

from sklearn.preprocessing import StandardScaler
s = StandardScaler()
s_train = s.fit_transform(train)

The scaling factors are available for each of the data columns.

print('a: ', s.scale_)
print('Scaler mean')
print('b: ', s.mean_)

For the example problem, this produces approximately s.scale_=0.5 and s.mean_=4.0 because the original data has a standard deviation of 0.5 and a mean of 4.0. If there is another data set that needs to be transformed, the transform function uses the same scaling factors as generated from the fit_transform function.

s_test = s.fit_transform(test)

If the original source is a Pandas dataframe, the data can be returned to a Pandas dataframe as s_train_df and s_test_df. This is needed because the fit_transform and transform functions return a Numpy array.

# convert scaled values back to dataframe
s_train_df = pd.DataFrame(s_train, columns=train.columns.values)
s_test_df = pd.DataFrame(s_test, columns=test.columns.values)

Min Max Scaler

An alternative to the standard scaler is the min-max scaler that adjusts all data between an upper and lower limit in the feature_range.

$$y = \frac{1}{x_{max}-x_{min}} \left(x-x_{min}\right)$$

$$a=\frac{1}{x_{max}-x_{min}},\;b=x_{min}$$

This type of scaler is useful for machine learning algorithms that require non-negative data or when the data set does not contain outliers. Outliers skew most of the data to a very narrow region within the 0-1 interval.

from sklearn.preprocessing import MinMaxScaler
s = MinMaxScaler(feature_range=(0,1))
s_train = s.fit_transform(train)
s_test  = s.transform(test)

The scaling factors are available for each of the data columns with scale_ and min_.

print('Scaler multipliers')
print('a: ', s.scale_)
print('Scaler minimum')
print('b: ', s.min_)

Inverse Transform

An inverse transform returns data to the original scaling. Scaled data is only for the machine learning methods that need well-conditioned data for processing. Once the training or prediction is completed, the data needs to be returned to the unscaled form for visualization or interpretation. The inverse_transform function is used to unscale the data.

$$x = \frac{y}{a}+b$$

x = s.inverse_transform(s_test)

Exercise

The Temperature Control Lab (TCLab) has two heaters (Q1 and Q2) and two temperature sensors (T1 and T2). Heater 1 (Q1) is cycled between 0% and 100% and Heater 2 Q2 is off as shown in the Equipment Monitoring exercise. A histogram plot shows the heaters and temperature distributions.

import pandas as pd
data = pd.read_csv('http://apmonitor.com/pds/uploads/Main/tclab_data6.txt')
data.set_index('Time',inplace=True)
data.plot(kind='hist',alpha=0.7,bins=30,figsize=(8,4))

Activity 1: Scale Data

Scale the TCLab data with a standard scaler and plot the distributions.

from sklearn.preprocessing import StandardScaler
s = StandardScaler()
sdata = s.fit_transform(data)
sdata = pd.DataFrame(sdata, columns=data.columns.values, index=data.index)
sdata.plot(kind='hist',alpha=0.7,bins=10,figsize=(8,4))

Activity 2: Neural Network with Scaled Data

Train a neural network to predict T2 (output) from Q1 and T1 (input features). Split the data into a train (80%) and test (20%) set. Create a parity plot of the predicted versus measured T2.

from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# split into training (80%) and testing (20%)
train, test = train_test_split(sdata, test_size=0.2, shuffle=True)
train=train.copy(); test=test.copy()

# train neural network
nn = MLPRegressor(hidden_layer_sizes=(3,3),activation='tanh',\
                  solver='lbfgs',max_iter=5000)
model = nn.fit(train[['Q1','T1']],train['T2'])

# test neural network
predict = test.copy()
predict['T2'] = nn.predict(test[['Q1','T1']])

# unscale data
d1 = s.inverse_transform(test)
d2 = s.inverse_transform(predict)
test_results = pd.DataFrame({'T2':d1[:,-1],'T2p':d2[:,-1]})

# plot results
test_results.plot(x='T2',y='T2p',kind='scatter')
plt.plot([15,28],[15,28],'r-')
plt.savefig('results.png',dpi=600)
plt.show()

Activity 3: Neural Network with Unscaled Data

Repeat Activity 2 but use unscaled data instead of scaled data to perform the regression. Evaluate the neural network performance with a parity plot of the predicted and measured T2 on the test set.

from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import pandas as pd

data = pd.read_csv('http://apmonitor.com/pds/uploads/Main/tclab_data6.txt')
data.set_index('Time',inplace=True)

# split into training (80%) and testing (20%)
train, test = train_test_split(data, test_size=0.2, shuffle=True)
train=train.copy(); test=test.copy()

# train neural network
nn = MLPRegressor(hidden_layer_sizes=(3,3),activation='tanh',\
                  solver='lbfgs',max_iter=5000)
model = nn.fit(train[['Q1','T1']],train['T2'])

# test neural network
test['T2p'] = nn.predict(test[['Q1','T1']])

# plot results
test.plot(x='T2',y='T2p',kind='scatter')
plt.plot([15,28],[15,28],'r-')
plt.show()

✅ Knowledge Check

1. Which of the following classifiers is sensitive to data scaling?

A. Neural Network Classifier
Correct. Neural Network Classifier is very sensitive to scaling. As mentioned in the example, the negative impact of unscaled data on a Neural Network Classifier is evident.
B. Decision Trees
Incorrect. Decision Trees do not improve with data scaling and aren't sensitive to it.
C. RandomForest
Incorrect. RandomForest, like Decision Trees, do not improve with data scaling.

2. What is a common technique for data scaling?

A. Multiplying every value by 2
Incorrect. Multiplying every value by 2 is not a standard technique for data scaling.
B. Shifting the mean to 3
Incorrect. The common technique mentioned is shifting the mean to 0, not 3.
C. Adjusting data to a range of 0 to 1
Correct. One of the common techniques mentioned is adjusting all of the data to a range of 0 to 1 or -1 to 1 (Min/Max Scalar) or shifting to mean of 0 and standard deviation of 1 (Standard Scalar).
D. Taking the absolute value of each data point
Incorrect. Taking the absolute value is not a common technique for data scaling.