Feature Engineering

Feature engineering is the process of selecting and creating the input descriptors for machine learning. Categorical data is converted to numeric values such as True=1 and False=0. Encoding creates indicators from images, words, numbers, or discrete categories.

  • Ordinal Encoding: assign a number to each option (e.g. 0=red, 1=blue)
  • One-Hot Encoding: new binary (0 or 1) feature for each option
  • Feature Hashing: compromise between ordinal and one-hot encoding

Feature generation may also create new data columns that are derived from the existing features. This may include a nonlinear transform of an individual feature or a product of two features.
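
As a minimal sketch of derived columns, the example below adds a squared term and a product of two features. The table and the column names x1 and x2 are hypothetical and serve only as an illustration.

```python
import numpy as np
import pandas as pd

# small hypothetical table; column names are illustrative only
df = pd.DataFrame({'x1': [1.0, 2.0, 3.0, 4.0],
                   'x2': [0.5, 1.5, 2.5, 3.5]})

# nonlinear transform of an individual feature
df['x1_sq'] = df['x1']**2
# interaction term: product of two features
df['x1_x2'] = df['x1'] * df['x2']
print(df)
```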

Selection and creation of features is an important step in machine learning. Too many features increase the chance that the classifier or regressor predicts poorly: with many inputs, it is more likely that one of them has a bad value that causes a bad prediction. More features also take longer for data curation, training, and prediction.

Methods such as SHAP (SHapley Additive exPlanations) and SelectKBest rank features in order of significance. SHAP assigns each feature an importance value that explains its contribution to the output of a machine learning model, while SelectKBest scores the input features with a statistical test. Unimportant features are identified and removed to improve training time, reduce storage cost, and minimize deployment resources.

Data Transformation

Data transformation modifies a data column to create a new feature.

Nonlinear Transform

A common example of data transformation is a nonlinear transform such as the logarithm of time. The logarithm is undefined at zero, so the first row (Time=0) evaluates to -inf.

import numpy as np
import pandas as pd
data = pd.read_csv('http://apmonitor.com/pds/uploads/Main/tclab_data6.txt')
data['log_time'] = np.log(data['Time'].values)
     Time   Q1   Q2     T1     T2  log_time
  0   0.0  0.0  0.0  16.06  16.00      -inf
  1   1.0  0.0  0.0  16.06  15.97  0.000000
  2   2.0  0.0  0.0  16.06  16.03  0.693147
  3   3.0  0.0  0.0  16.03  16.00  1.098612
  4   4.0  0.0  0.0  16.03  15.94  1.386294
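
The -inf in the first row can cause problems in later training steps. Two possible ways to avoid it are sketched below, assuming a small stand-in table for the Time column: np.log1p computes log(1+t), which is defined at zero, and masking the non-positive values marks the undefined point as missing instead. The column name log1p_time is an illustrative choice.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the first rows of the Time column
data = pd.DataFrame({'Time': [0.0, 1.0, 2.0, 3.0, 4.0]})

# log(1 + t) is defined at t = 0
data['log1p_time'] = np.log1p(data['Time'])

# alternative: keep log(t) and mark the undefined point as missing (NaN)
data['log_time'] = np.log(data['Time'].where(data['Time'] > 0))
print(data)
```

Which option is appropriate depends on whether the model can handle missing values and whether the shift by one is acceptable for the time scale of the data.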

Statistical Transform

Statistical measures are another example of data transformations. A rolling window of the mean, standard deviation, kurtosis, skew, or other statistical information may give additional features for classification or regression.

data['T1_mean']  = data['T1'].rolling(window=100,center=True).mean()
data['T1_stdev'] = data['T1'].rolling(window=100,center=True).std()
data['T1_skew']  = data['T1'].rolling(window=100,center=True).skew()
data[['T1','T1_mean']].plot(figsize=(8,4),ylabel='Temperature (degC)')
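
A centered rolling window leaves missing values at both ends of the column because a full window is not available there. The sketch below uses a synthetic signal as a stand-in for T1 (an assumption, not the TCLab data) to count the missing rows and to show the min_periods option, which computes the statistic from the points that are available.

```python
import numpy as np
import pandas as pd

# synthetic temperature-like signal (assumption: stands in for T1)
t = np.arange(300)
df = pd.DataFrame({'T1': 20.0 + 0.05*t + np.sin(t/10.0)})

w = 100
roll = df['T1'].rolling(window=w, center=True)
df['T1_mean'] = roll.mean()
df['T1_kurt'] = roll.kurt()

# by default a full window is required, so rows near the edges are NaN
n_missing = df['T1_mean'].isna().sum()
print(n_missing)

# min_periods=1 uses whatever points fall inside the window
df['T1_mean_full'] = df['T1'].rolling(window=w, center=True, min_periods=1).mean()
```

Rows with missing statistics need to be dropped or filled before training with methods that do not accept NaN values.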

Categorical Data

Categorical data takes a limited number of discrete values and is typically represented as strings. An example is the classification of cats and dogs based on weight and color, where both the color and the animal type are categorical.

import pandas as pd
url = 'http://apmonitor.com/pds/uploads/Main/animals.txt'
data = pd.read_csv(url)
  Weight Color   Type
   0.35  Gray    Cat
   0.52  White   Dog
   0.73  White   Dog
   1.2   Orange  Cat
   1.3   Gray    Cat
   1.4   Orange  Cat
   2.2   Brown   Dog
   2.3   White   Cat
   3.0   Gray    Dog
   4.5   Brown   Dog
   8.3   White   Dog
  18.2   Orange  Dog
  24.4   Tan     Dog
  29.3   Orange  Dog
  42.6   Tan     Dog
 220.0   Orange  Cat

The output label is Cat or Dog, but most machine learning algorithms need the label as a numerical value such as 0 or 1. This is accomplished for the output label with a map.

mapping = {'Cat': 0, 'Dog': 1}
# map each string label to its integer code
data['Label'] = data['Type'].map(mapping)
     Weight   Color Type  Label
  0    0.35    Gray  Cat      0
  1    0.52   White  Dog      1
  2    0.73   White  Dog      1
  3    1.20  Orange  Cat      0
  4    1.30    Gray  Cat      0

The color feature of the animal is transformed with one of 3 methods:

  1. Ordinal Encoding
  2. One-Hot Encoding
  3. Feature Hashing

1. Ordinal Encoding

The simplest method for converting categorical data into a numeric value is to use an ordinal transform to produce CNumber.

  0 = Gray
  1 = White
  2 = Orange
  3 = Brown
  4 = Tan

The disadvantage of this approach is that the numbers imply an order (Gray < White < Orange) even though the order of the colors has no significance.

# Ordinal number encoding
data['CNumber'] = pd.factorize(data['Color'])[0]
      Weight   Color Type  CNumber
  0     0.35    Gray  Cat        0
  1     0.52   White  Dog        1
  2     0.73   White  Dog        1
  3     1.20  Orange  Cat        2
  4     1.30    Gray  Cat        0
  5     1.40  Orange  Cat        2
  6     2.20   Brown  Dog        3
  7     2.30   White  Cat        1
  8     3.00    Gray  Dog        0
  9     4.50   Brown  Dog        3
  10    8.30   White  Dog        1
  11   18.20  Orange  Dog        2
  12   24.40     Tan  Dog        4
  13   29.30  Orange  Dog        2
  14   42.60     Tan  Dog        4
  15  220.00  Orange  Cat        2
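
The codes from pd.factorize follow the order in which each color first appears, so a different row order gives different numbers. A possible way to keep the codes reproducible is to fix the category list, as sketched below. The short color series and the alphabetical order are illustrative assumptions.

```python
import pandas as pd

# hypothetical short list of colors
colors = pd.Series(['Gray', 'White', 'White', 'Orange', 'Gray', 'Brown', 'Tan'])

# factorize: codes follow the order of first appearance
codes, uniques = pd.factorize(colors)

# fixed category list: codes follow the position in the list
order = ['Brown', 'Gray', 'Orange', 'Tan', 'White']   # arbitrary, alphabetical
fixed = pd.Categorical(colors, categories=order).codes

print(list(codes))   # [0, 1, 1, 2, 0, 3, 4]
print(list(fixed))   # [1, 4, 4, 2, 1, 0, 3]
```

Either way the numbers remain arbitrary labels; the fixed list only makes them consistent between training and deployment data.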

2. One-Hot Encoding

When there are fewer than 15 categories, one-hot encoding is a viable option to transform features into numerical values and avoid problems with ordinal encoding. The pandas function get_dummies performs this transformation. In this case, each color is a new feature with a 0 or 1 for the color.

# dtype=int gives 0/1 columns (recent pandas versions default to True/False)
data = data.join(pd.get_dummies(data['Color'], dtype=int))
     Weight   Color Type  Brown  Gray  Orange  Tan  White
  0    0.35    Gray  Cat      0     1       0    0      0
  1    0.52   White  Dog      0     0       0    0      1
  2    0.73   White  Dog      0     0       0    0      1
  3    1.20  Orange  Cat      0     0       1    0      0
  4    1.30    Gray  Cat      0     1       0    0      0

This also allows multiple colors to be represented as features, possibly with a fractional value for the surface area of the color.
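
A minimal sketch of the fractional representation is shown below. The three animals and their coat fractions are hypothetical values, not part of the animals data set.

```python
import pandas as pd

colors = ['Brown', 'Gray', 'Orange', 'Tan', 'White']

# hypothetical animals: fraction of coat area covered by each color
coats = [{'Gray': 1.0},
         {'White': 0.7, 'Orange': 0.3},
         {'Brown': 0.5, 'Tan': 0.5}]

# one column per color; colors absent from an animal are filled with 0
frac = pd.DataFrame(coats, columns=colors).fillna(0.0)
print(frac)
```

A single-color animal reduces to the one-hot case with a 1 in one column and 0 elsewhere.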

3. Feature Hashing

Feature hashing is a compromise between ordinal encoding and one-hot encoding. A hash function maps each category into a non-binary vector with a specified number of new features. In this example, two new features, Color_0 and Color_1, are created.

from sklearn.feature_extraction import FeatureHasher
fh = FeatureHasher(n_features=2, input_type='string')
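# each sample must be an iterable of string tokens; list() splits a color
# name into its characters, which are the tokens hashed in this example
ht = fh.fit_transform(data['Color'].apply(list)).toarray()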
nc = pd.DataFrame(ht)
nc.columns = ['Color_'+str(i) for i in range(fh.n_features)]
data = data.join(nc)

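# create plot to show new features
import matplotlib.pyplot as plt
plt.rcParams['axes.facecolor'] = 'black'
for i in range(len(data)):
    # loop body is missing in the source; assumed here: one marker per
    # animal at (Color_0, Color_1), drawn in the animal's own color
    plt.plot(data['Color_0'][i], data['Color_1'][i], 'o',
             color=data['Color'][i].lower())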
     Weight   Color Type  Label  Color_0  Color_1
  0    0.35    Gray  Cat      0      2.0      2.0
  1    0.52   White  Dog      1     -1.0     -2.0
  2    0.73   White  Dog      1     -1.0     -2.0
  3    1.20  Orange  Cat      0      0.0      2.0
  4    1.30    Gray  Cat      0      2.0      2.0
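
The idea behind the hashing trick can be sketched without scikit-learn. The function hash_encode below is a hypothetical helper that hashes each whole color name into one of n_features columns with a sign of +1 or -1. It uses md5 as an illustrative hash, which is an assumption and not the hash function that FeatureHasher uses, so the values differ from the table above.

```python
import hashlib
import numpy as np

def hash_encode(token, n_features=2):
    """Map one string token to a signed indicator vector of length n_features."""
    # stable integer hash of the token (md5 is an illustrative choice)
    h = int(hashlib.md5(token.encode('utf-8')).hexdigest(), 16)
    vec = np.zeros(n_features)
    index = h % n_features                       # column the token lands in
    sign = 1.0 if (h // n_features) % 2 == 0 else -1.0
    vec[index] = sign
    return vec

colors = ['Gray', 'White', 'Orange', 'Brown', 'Tan']
encoded = {c: hash_encode(c) for c in colors}
for c, v in encoded.items():
    print(c, v)
```

With two columns and two signs there are only four possible codes, so at least two of the five colors must share a code. This collision is the cost of the fixed, small number of features, and it decreases as n_features increases.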

Feature Collection and Generation

Data aggregation imports new features from a combination of data sources. Python packages such as Featuretools assist in automating the generation and curation of derived features. When a tool such as Featuretools generates many new potential features, it is important to retain only the best set of features for training and testing.

Feature Selection

A feature should be removed if it has a single constant value, if there is a large imbalance of categories, or if it has little correlation to the output label. Statistical tests select the features that have strong relationships with the output label. One tool is SelectKBest with its associated statistical tests. The example uses a `\chi^2` statistical test, which requires non-negative features, to rank the best features for predicting the output. If there are negative features, a MinMaxScaler can scale all features to the range 0 to 1.

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import matplotlib.pyplot as plt
plt.rcParams['axes.facecolor'] = 'white'
bestfeatures = SelectKBest(score_func=chi2, k='all')
features = ['Brown','Gray','Orange','Tan','White','CNumber']
X = data[features]
y = data['Label']
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
scores = pd.concat([dfcolumns,dfscores],axis=1)
scores.columns = ['Specs','Score']
scores.index = features
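
The first and last removal criteria can also be checked directly with pandas. The sketch below rebuilds the animals table inline, adds a hypothetical constant column named Legs, drops any constant feature, and ranks the remaining color features by absolute correlation with the label. It is a simple screening step under these assumptions, not a replacement for the statistical test above.

```python
import pandas as pd

# animals table from above, rebuilt inline
df = pd.DataFrame({
    'Weight': [0.35, 0.52, 0.73, 1.2, 1.3, 1.4, 2.2, 2.3,
               3.0, 4.5, 8.3, 18.2, 24.4, 29.3, 42.6, 220.0],
    'Color': ['Gray', 'White', 'White', 'Orange', 'Gray', 'Orange',
              'Brown', 'White', 'Gray', 'Brown', 'White', 'Orange',
              'Tan', 'Orange', 'Tan', 'Orange'],
    'Type': ['Cat', 'Dog', 'Dog', 'Cat', 'Cat', 'Cat', 'Dog', 'Cat',
             'Dog', 'Dog', 'Dog', 'Dog', 'Dog', 'Dog', 'Dog', 'Cat']})
df['Label'] = df['Type'].map({'Cat': 0, 'Dog': 1})

X = pd.get_dummies(df['Color'], dtype=float)
X['Legs'] = 4.0   # hypothetical feature with a single constant value

# 1. drop features that take only one value
constant = [c for c in X.columns if X[c].nunique() <= 1]
X = X.drop(columns=constant)

# 2. rank the remaining features by absolute correlation with the label
corr = X.corrwith(df['Label']).abs().sort_values(ascending=False)
print(constant)
print(corr)
```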

An iterative method is to remove the lowest performing feature and test the goodness of fit after training without the feature. The iterative method continues to remove features until the test score drops.

Exercise 1

The first and second derivatives of the temperature are calculated as features to classify when the TCLab heater is on or off in the case study for equipment health monitoring.
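
A minimal sketch of derivative features is shown below with np.gradient, which estimates derivatives by finite differences. The temperature curve is a synthetic first-order rise and is an assumption that stands in for the TCLab data. The column names dT1 and d2T1 are illustrative.

```python
import numpy as np
import pandas as pd

# synthetic heater response (assumption): first-order rise sampled every 1 s
t = np.arange(0.0, 60.0, 1.0)
df = pd.DataFrame({'Time': t, 'T1': 20.0 + 30.0*(1.0 - np.exp(-t/20.0))})

# finite-difference estimates of dT/dt and d2T/dt2
df['dT1'] = np.gradient(df['T1'].values, df['Time'].values)
df['d2T1'] = np.gradient(df['dT1'].values, df['Time'].values)
print(df.head())
```

Measured temperatures contain noise that finite differences amplify, so smoothing the signal first (for example with a rolling mean) is a common step before the derivatives are calculated.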

Exercise 2

Physical and chemical properties of lithium-ion silicate cathodes predict the crystal structure of a lithium-ion battery material as monoclinic, orthorhombic, or triclinic. This case study demonstrates how feature engineering improves the classification results.

Further Reading