## Scale Data for Machine Learning

Scaling (inputs and outputs) can improve the training process for machine learning. Certain types of classifiers do not improve with data scaling. These include Decision Trees, RandomForest, and XGBoost. Most other types of classifiers are very sensitive to scaling. The example at the end of this page shows the negative impact of unscaled data for a Neural Network Classifier. The data is scaled before training and unscaled to evaluate the performance after prediction.

A common scaling technique is to divide by the standard deviation and shift the mean to 0. Another common scaling approach is to adjust all of the data to a range of 0 to 1 or -1 to 1. Each data column is scaled individually.

There are different methods for scaling that are important based on the presence of outliers or statistical properties of the data. Two primary methods for scaling are a standard scaler (scale by the standard deviation) and a min-max (e.g. 0-1) scaler. For classifiers and regressor such as neural networks, most of the data should be between 0 and 1 or -1 and 1.

import matplotlib.pyplot as plt

# Generate a distribution

x = 0.5*np.random.randn(1000)+4

# Standard (mean=0, stdev=1) Scaler

y = (x-np.mean(x))/np.std(x)

# Min-Max (0-1) Scaler

z = (x-np.min(x))/(np.max(x)-np.min(x))

# Plot distributions

plt.figure(figsize=(8,4))

plt.hist(x, bins=30, label='original')

plt.hist(y, alpha=0.7, bins=30, label='standard scaler')

plt.hist(z, alpha=0.7, bins=30, label='minmax scaler')

plt.legend()

plt.show()

The scaled output is *y* and the unscaled input is *x*.

$$y = a \left(x-b\right)$$

The *a* and *b* adjust the original data to a transformed state.

#### Sample Data

Import sample data and split into training (80%) and testing (20%) sets. More information on Splitting Data is available in another module.

import pandas as pd

data = pd.read_csv('http://apmonitor.com/pds/uploads/Main/tclab_data6.txt')

data.set_index('Time',inplace=True)

# Split into train and test subsets (20% for test)

train, test = train_test_split(data, test_size=0.2, shuffle=False)

#### Standard Scaler

The standard scaler transforms data *train* into a standard normal distribution from mean=`\bar x` and standard deviation `\sigma` to a new distribution with mean=0 and unit standard deviation (`\sigma=1`).

$$y = \frac{1}{\sigma} \left(x-\bar x\right)$$

$$a=\frac{1}{\sigma},\;b=\bar x$$

The scikit-learn (sklearn) package facilitates scaling with either the *fit* or the *fit_transform* functions. The *fit_transform* function combined the *fit* and *transform* functions into a single operation.

s = StandardScaler()

s_train = s.fit_transform(train)

The scaling factors are available for each of the data columns.

print('Scaler mean')

print('b: ', s.mean_)

For the example problem, this produces approximately *s.scale_=0.5* and *s.mean_=4.0* because the original data has a standard deviation of 0.5 and a mean of 4.0. If there is another data set that needs to be transformed, the *transform* function uses the same scaling factors as generated from the *fit_transform* function.

If the original source is a Pandas dataframe, the data can be returned to a Pandas dataframe as *s_train_df* and *s_test_df*. This is needed because the *fit_transform* and *transform* functions return a Numpy array.

s_train_df = pd.DataFrame(s_train, columns=train.columns.values)

s_test_df = pd.DataFrame(s_test, columns=test.columns.values)

#### Min Max Scaler

An alternative to the standard scaler is the min-max scaler that adjusts all data between an upper and lower limit in the *feature_range*.

$$y = \frac{1}{x_{max}-x_{min}} \left(x-x_{min}\right)$$

$$a=\frac{1}{x_{max}-x_{min}},\;b=x_{min}$$

This type of scaler is useful for machine learning algorithms that require non-negative data or when the data set does not contain outliers. Outliers skew most of the data to a very narrow region within the 0-1 interval.

s = MinMaxScaler(feature_range=(0,1))

s_train = s.fit_transform(train)

s_test = s.transform(test)

The scaling factors are available for each of the data columns with *scale_* and *min_*.

print('a: ', s.scale_)

print('Scaler minimum')

print('b: ', s.min_)

#### Inverse Transform

An inverse transform returns data to the original scaling. Scaled data is only for the machine learning methods that need well-conditioned data for processing. Once the training or prediction is completed, the data needs to be returned to the unscaled form for visualization or interpretation. The *inverse_transform* function is used to unscale the data.

$$x = \frac{y}{a}+b$$

#### Exercise

The Temperature Control Lab (TCLab) has two heaters (*Q1* and *Q2*) and two temperature sensors (*T1* and *T2*). Heater 1 (*Q1*) is cycled between 0% and 100% and Heater 2 *Q2* is off as shown in the Equipment Monitoring exercise. A histogram plot shows the heaters and temperature distributions.

data = pd.read_csv('http://apmonitor.com/pds/uploads/Main/tclab_data6.txt')

data.set_index('Time',inplace=True)

data.plot(kind='hist',alpha=0.7,bins=30,figsize=(8,4))

**Activity 1: Scale Data**

Scale the TCLab data with a standard scaler and plot the distributions.

s = StandardScaler()

sdata = s.fit_transform(data)

sdata = pd.DataFrame(sdata, columns=data.columns.values, index=data.index)

sdata.plot(kind='hist',alpha=0.7,bins=10,figsize=(8,4))

**Activity 2: Neural Network with Scaled Data**

Train a neural network to predict *T2* (output) from *Q1* and *T1* (input features). Split the data into a train (80%) and test (20%) set. Create a parity plot of the predicted versus measured *T2*.

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

# split into training (80%) and testing (20%)

train, test = train_test_split(sdata, test_size=0.2, shuffle=True)

train=train.copy(); test=test.copy()

# train neural network

nn = MLPRegressor(hidden_layer_sizes=(3,3),activation='tanh',\

solver='lbfgs',max_iter=5000)

model = nn.fit(train[['Q1','T1']],train['T2'])

# test neural network

predict = test.copy()

predict['T2'] = nn.predict(test[['Q1','T1']])

# unscale data

d1 = s.inverse_transform(test)

d2 = s.inverse_transform(predict)

test_results = pd.DataFrame({'T2':d1[:,-1],'T2p':d2[:,-1]})

# plot results

test_results.plot(x='T2',y='T2p',kind='scatter')

plt.plot([15,28],[15,28],'r-')

plt.savefig('results.png',dpi=600)

plt.show()

**Activity 3: Neural Network with Unscaled Data**

Repeat *Activity 2* but use unscaled data instead of scaled data to perform the regression. Evaluate the neural network performance with a parity plot of the predicted and measured *T2* on the test set.

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

import pandas as pd

data = pd.read_csv('http://apmonitor.com/pds/uploads/Main/tclab_data6.txt')

data.set_index('Time',inplace=True)

# split into training (80%) and testing (20%)

train, test = train_test_split(data, test_size=0.2, shuffle=True)

train=train.copy(); test=test.copy()

# train neural network

nn = MLPRegressor(hidden_layer_sizes=(3,3),activation='tanh',\

solver='lbfgs',max_iter=5000)

model = nn.fit(train[['Q1','T1']],train['T2'])

# test neural network

test['T2p'] = nn.predict(test[['Q1','T1']])

# plot results

test.plot(x='T2',y='T2p',kind='scatter')

plt.plot([15,28],[15,28],'r-')

plt.show()

#### ✅ Knowledge Check

**1.** Which of the following classifiers is sensitive to data scaling?

**A.**Neural Network Classifier

- Correct. Neural Network Classifier is very sensitive to scaling. As mentioned in the example, the negative impact of unscaled data on a Neural Network Classifier is evident.

**B.**Decision Trees

- Incorrect. Decision Trees do not improve with data scaling and aren't sensitive to it.

**C.**RandomForest

- Incorrect. RandomForest, like Decision Trees, do not improve with data scaling.

**2.** What is a common technique for data scaling?

**A.**Multiplying every value by 2

- Incorrect. Multiplying every value by 2 is not a standard technique for data scaling.

**B.**Shifting the mean to 3

- Incorrect. The common technique mentioned is shifting the mean to 0, not 3.

**C.**Adjusting data to a range of 0 to 1

- Correct. One of the common techniques mentioned is adjusting all of the data to a range of 0 to 1 or -1 to 1 (Min/Max Scalar) or shifting to mean of 0 and standard deviation of 1 (Standard Scalar).

**D.**Taking the absolute value of each data point

- Incorrect. Taking the absolute value is not a common technique for data scaling.