Feature Engineering

Feature engineering is the process of selecting and creating the input descriptors for machine learning. Categorical data is converted to numeric values, such as True=1 and False=0. Encoding creates numeric indicators from images, words, numbers, or discrete categories. Three common encoding methods are:

  • Ordinal Encoding: assign number to each option (e.g. 0=red, 1=blue)
  • One-Hot Encoding: new binary (0 or 1) feature for each option
  • Feature Hashing: compromise between ordinal and one-hot encoding

Feature generation may also create new data columns that are derived from the existing features. This may include a nonlinear transform of an individual feature or a product of two features.

Selection and creation of features is an important step in machine learning. Too many features increase the chance that the classifier or regressor predicts poorly: with many inputs, a single bad value can lead to a bad prediction. More features also increase the time required for data curation, training, and prediction.

There are methods to rank features in order of significance, such as SHAP (SHapley Additive exPlanations), which assigns each feature an importance value to explain the output of a machine learning model, and SelectKBest, which ranks the input features with statistical tests. Unimportant features are identified and removed to improve training time, reduce storage cost, and minimize deployment resources.
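Below is a minimal sketch of SHAP feature ranking. It is illustrative only: it assumes the shap package is installed and uses a synthetic regression dataset with a random forest in place of an application-specific model.

import shap
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# synthetic data as a placeholder for an application-specific dataset
X, y = make_regression(n_samples=200, n_features=6, n_informative=3, random_state=0)
X = pd.DataFrame(X, columns=['x'+str(i) for i in range(6)])

# train a model and explain its predictions with SHAP values
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# summary plot ranks features by mean absolute SHAP value
shap.summary_plot(shap_values, X)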


Data Transformation

Data transformation is the modification of a data column to create a new feature.

Nonlinear Transform

A common example of data transformation is a nonlinear transform, such as the log of time.

import numpy as np
import pandas as pd
# read TCLab data and add the log of time as a new feature
data = pd.read_csv('http://apmonitor.com/pds/uploads/Main/tclab_data6.txt')
data['log_time'] = np.log(data['Time'].values)  # log(0) = -inf for the first row
data.head()
     Time   Q1   Q2     T1     T2  log_time
  0   0.0  0.0  0.0  16.06  16.00      -inf
  1   1.0  0.0  0.0  16.06  15.97  0.000000
  2   2.0  0.0  0.0  16.06  16.03  0.693147
  3   3.0  0.0  0.0  16.03  16.00  1.098612
  4   4.0  0.0  0.0  16.03  15.94  1.386294
import matplotlib.pyplot as plt
data.set_index('Time',inplace=True)
data['log_time'].plot()
plt.show()
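Derived features are not limited to transforms of a single column. A product of two features, as mentioned above, is added the same way; a minimal sketch with the two TCLab temperatures (the column names T1xT2 and T1_sq are illustrative):

data['T1xT2'] = data['T1']*data['T2']   # interaction of two features
data['T1_sq'] = data['T1']**2           # squared (polynomial) feature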

Statistical Transform

Statistical measures are another example of data transformations. A rolling window of the mean, standard deviation, kurtosis, skew, or other statistical information may give additional features for classification or regression.

# rolling statistics over a centered window of 100 samples
data['T1_mean']  = data['T1'].rolling(window=100,center=True).mean()
data['T1_stdev'] = data['T1'].rolling(window=100,center=True).std()
data['T1_skew']  = data['T1'].rolling(window=100,center=True).skew()
data[['T1','T1_mean']].plot(figsize=(8,4),ylabel='Temperature (degC)')

Categorical Data

Categorical data takes a limited number of discrete values and is typically represented as strings. An example of categorical data is the classification of cats and dogs based on weight and color.

import pandas as pd
url = 'http://apmonitor.com/pds/uploads/Main/animals.txt'
data = pd.read_csv(url)
print(data)
      Weight   Color Type
  0     0.35    Gray  Cat
  1     0.52   White  Dog
  2     0.73   White  Dog
  3     1.20  Orange  Cat
  4     1.30    Gray  Cat
  5     1.40  Orange  Cat
  6     2.20   Brown  Dog
  7     2.30   White  Cat
  8     3.00    Gray  Dog
  9     4.50   Brown  Dog
  10    8.30   White  Dog
  11   18.20  Orange  Dog
  12   24.40     Tan  Dog
  13   29.30  Orange  Dog
  14   42.60     Tan  Dog
  15  220.00  Orange  Cat

The output label is Cat or Dog, but most machine learning algorithms need the label as a numerical value such as 0 or 1. This is accomplished for the output label with a map.

# map Cat/Dog labels to 0/1
mapping = {'Cat': 0, 'Dog': 1}
data['Label'] = data['Type'].replace(mapping)
data.head()
     Weight   Color Type  Label
  0    0.35    Gray  Cat      0
  1    0.52   White  Dog      1
  2    0.73   White  Dog      1
  3    1.20  Orange  Cat      0
  4    1.30    Gray  Cat      0

The color feature of the animal is transformed with one of 3 methods:

  1. Ordinal Encoding
  2. One-Hot Encoding
  3. Feature Hashing

1. Ordinal Encoding

The simplest method for converting categorical data into a numeric value is to use an ordinal transform to produce CNumber.

  0 = Gray
  1 = White
  2 = Orange
  3 = Brown
  4 = Tan

The disadvantage of this approach is that the numerical feature implies an order even though there is no significance to the order of the categories.

# Ordinal number encoding (factorize assigns numbers in order of appearance)
data['CNumber'] = pd.factorize(data['Color'])[0]
      Weight   Color Type  CNumber
  0     0.35    Gray  Cat        0
  1     0.52   White  Dog        1
  2     0.73   White  Dog        1
  3     1.20  Orange  Cat        2
  4     1.30    Gray  Cat        0
  5     1.40  Orange  Cat        2
  6     2.20   Brown  Dog        3
  7     2.30   White  Cat        1
  8     3.00    Gray  Dog        0
  9     4.50   Brown  Dog        3
  10    8.30   White  Dog        1
  11   18.20  Orange  Dog        2
  12   24.40     Tan  Dog        4
  13   29.30  Orange  Dog        2
  14   42.60     Tan  Dog        4
  15  220.00  Orange  Cat        2
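scikit-learn provides OrdinalEncoder as an alternative to pd.factorize. A minimal sketch is below; note that OrdinalEncoder numbers the categories in sorted order (Brown=0, Gray=1, Orange=2, Tan=3, White=4) rather than order of appearance, and the column name CNumber2 is illustrative.

from sklearn.preprocessing import OrdinalEncoder
# encode Color with categories numbered in sorted (alphabetical) order
enc = OrdinalEncoder()
data['CNumber2'] = enc.fit_transform(data[['Color']]).ravel()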

2. One-Hot Encoding

When there are fewer than 15 categories, one-hot encoding is a viable option to transform features into numerical values and avoid the problems with ordinal encoding. The pandas function get_dummies performs this transformation. In this case, each color becomes a new feature with a value of 0 or 1.

# one new column for each color, forced to 0/1 integers (newer pandas returns True/False by default)
data = data.join(pd.get_dummies(data['Color']).astype(int))
data.head()
     Weight   Color Type  Brown  Gray  Orange  Tan  White
  0    0.35    Gray  Cat      0     1       0    0      0
  1    0.52   White  Dog      0     0       0    0      1
  2    0.73   White  Dog      0     0       0    0      1
  3    1.20  Orange  Cat      0     0       1    0      0
  4    1.30    Gray  Cat      0     1       0    0      0

This also allows multiple colors to be represented as features, possibly with a fractional value for the surface area of the color.

3. Feature Hashing

Feature hashing is a compromise between ordinal encoding and one-hot encoding. It encodes the categories into a non-binary vector with a specified number of new features. In this example, 2 new features Color_0 and Color_1 are created.

from sklearn.feature_extraction import FeatureHasher
fh = FeatureHasher(n_features=2, input_type='string')
ht = fh.fit_transform(data['Color'].apply(lambda x: [x])).toarray()
nc = pd.DataFrame(ht)
nc.columns = ['Color_'+str(i) for i in range(fh.n_features)]
data = data.join(nc)
print(data.head())
     Weight   Color Type  Label  Color_0  Color_1
  0    0.35    Gray  Cat      0      2.0      2.0
  1    0.52   White  Dog      1     -1.0     -2.0
  2    0.73   White  Dog      1     -1.0     -2.0
  3    1.20  Orange  Cat      0      0.0      2.0
  4    1.30    Gray  Cat      0      2.0      2.0

# create plot to show new features
import matplotlib.pyplot as plt
plt.figure(figsize=(8,3))
plt.rcParams['axes.facecolor'] = 'black'
for i in range(len(data)):
    plt.plot([0,data['Color_0'][i]],\
             [0,data['Color_1'][i]],\
             color=data['Color'][i],\
             marker='o',linestyle='-',lw=3)
plt.show()

Feature Collection and Generation

Data aggregation brings in new features from a combination of data sources. Python packages such as Feature Tools assist in automating the generation and curation of derived features. When a tool such as Feature Tools generates many new potential features, it is important to retain only the best set of features for training and testing.
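As a minimal illustration of aggregating features from a second data source, pandas merge joins new columns on a shared key. The info table and its Popularity column below are hypothetical, not part of the original example:

import pandas as pd
# hypothetical second data source keyed by color
info = pd.DataFrame({'Color': ['Gray','White','Orange','Brown','Tan'],
                     'Popularity': [3, 5, 2, 4, 1]})
# join the new feature onto the animal data using Color as the key
merged = data.merge(info, on='Color', how='left')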

Feature Selection

Features should be removed if a feature is a single constant value, if there is a large imbalance of categories, or if there is little correlation to the output label. There are statistical tests to select the features that have the strongest relationship with the output label. One such tool is SelectKBest with associated statistical tests. This method uses a chi-squared (`\chi^2`) statistical test for non-negative features to rank the best features for predicting the output. If there are negative features, a MinMaxScaler can be used to scale all features to the range 0 to 1.

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import matplotlib.pyplot as plt
plt.rcParams['axes.facecolor'] = 'white'
# rank all features with the chi-squared test
bestfeatures = SelectKBest(score_func=chi2, k='all')
features = ['Brown','Gray','Orange','Tan','White','CNumber']
X = data[features]
y = data['Label']
fit = bestfeatures.fit(X,y)
# collect the chi-squared score for each feature
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
scores = pd.concat([dfcolumns,dfscores],axis=1)
scores.columns = ['Specs','Score']
scores.index = features
# bar chart of feature scores
scores.plot(kind='bar',figsize=(8,2))
plt.show()

An iterative method is to remove the lowest performing feature and test the goodness of fit after training without the feature. The iterative method continues to remove features until the test score drops.
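scikit-learn automates this style of backward elimination with Recursive Feature Elimination (RFE), which repeatedly drops the weakest feature down to a requested count (RFECV does the same while monitoring a cross-validated score). Below is a minimal sketch on the animal features from above; the logistic regression base model and the choice of 3 retained features are illustrative assumptions.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# recursively drop the weakest feature until only 3 remain
X = data[['Brown','Gray','Orange','Tan','White','CNumber']]
y = data['Label']
rfe = RFE(LogisticRegression(), n_features_to_select=3).fit(X, y)
print(dict(zip(X.columns, rfe.ranking_)))  # rank 1 = retained feature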

Exercise 1

The first and second derivatives of the temperature are calculated as features to classify when the TCLab heater is on or off in the case study for equipment health monitoring.
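A possible starting point for the derivative features is sketched below, assuming the same TCLab data file used earlier; the column names dT1 and d2T1 are illustrative.

import numpy as np
import pandas as pd
# reload the TCLab data and compute time derivatives of T1
data = pd.read_csv('http://apmonitor.com/pds/uploads/Main/tclab_data6.txt')
data['dT1']  = np.gradient(data['T1'], data['Time'])   # first derivative
data['d2T1'] = np.gradient(data['dT1'], data['Time'])  # second derivative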

Exercise 2

Physical and chemical properties of Lithium-ion silicate cathodes are used to predict the crystal structure of a Lithium-ion battery material as monoclinic, orthorhombic, or triclinic. This case study demonstrates how feature engineering improves the classification results.

✅ Knowledge Check

1. Which of the following methods is described as a compromise between ordinal encoding and one-hot encoding?

A. Feature Tools
Incorrect. Feature Tools is a Python package for automating the generation and curation of derived features.
B. Feature Hashing
Correct. Feature hashing is encoding into a non-binary vector with a specified number of new features. It is a compromise between ordinal and one-hot encoding.
C. One-Hot Encoding
Incorrect. One-Hot Encoding is a method to transform features into numerical values, where each category is a new feature with a 0 or 1 value.
D. Ordinal Encoding
Incorrect. Ordinal Encoding is a simple method for converting categorical data into a numeric value, producing an ordered set.

2. What is the primary function of SHAP (SHapley Additive exPlanations)?

A. It performs a nonlinear transform of data.
Incorrect. SHAP does not perform a nonlinear transform of data; it ranks the significance of features.
B. It generates many new potential features.
Incorrect. SHAP is not primarily for generating new features but rather ranking existing ones.
C. It gives each feature a ranking to explain the output of a machine learning model.
Correct. SHAP (SHapley Additive exPlanations) gives each feature a ranking to explain the output of a machine learning model. It is a game theoretic approach to explain the output of any machine learning model.
D. It is a method for one-hot encoding.
Incorrect. SHAP is not a method for one-hot encoding. One-hot encoding is a way to transform categorical data into numeric format.