Classification predicts discrete labels (outcomes) such as yes/no, True/False, or any number of discrete levels such as a letter from text recognition. An example of classification is to suggest a movie you will want to watch next (label) based on your prior viewing history (feature). Regression differs from classification because it predicts continuous outcomes, such as any floating point number in a range. An example of regression is to build a correlation of the temperature of a pan of water (label) based on the time it has been heating (feature). The temperature values are continuous while the next movie is one of many discrete options.
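As a minimal illustrative sketch (not part of the lesson data, with made-up numbers), the same feature can drive either a regressor (continuous label) or a classifier (discrete label):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
# made-up data: heating time (feature) with one continuous and one discrete label
time_min = np.array([[1],[2],[3],[4],[5]])   # feature: heating time (min)
temp_C   = np.array([25,34,42,51,60])        # continuous label: temperature
hot      = np.array([0,0,0,1,1])             # discrete label: hot (1) or not (0)
reg = LinearRegression().fit(time_min, temp_C)
clf = LogisticRegression().fit(time_min, hot)
print(reg.predict([[6]]))    # regression: continuous temperature estimate
print(clf.predict([[6]]))    # classification: discrete 0/1 label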
Features are input values to regression or classification models and labels are the measured outcomes. Below is a table that relates terminology from machine learning, optimization, and estimation methods in GEKKO, with a description of each term.
Machine Learning | Optimization | Gekko Estimation | Description
---|---|---|---
Loss | Objective Function | `m.Minimize()` | The mathematical quantity that represents the difference between the predicted and measured outcomes
Weights | Adjustable Parameters | Fixed Values (`m.FV()`) with `STATUS=1` | Adjustable values to minimize the loss function
Label | Measured Outcome | Controlled Variable (`m.CV()`) with `FSTATUS=1` | Measurements of the predicted system output
Feature | Measured Input | Parameter (`m.Param()`) | Input measurements that predict the outcome label
Train | Optimize | Solve (`m.solve()`) | Adjust the unknown parameters (weights) to minimize the objective (loss) function
Test | Evaluate | Solve with `STATUS=0` for `m.FV()` | Predict the labels with the tuned model to evaluate the performance of the classifier or regressor
Regressor or Classifier | Model | `m = GEKKO()` | Mathematical equations and parameters that use feature inputs to predict an outcome label
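A minimal GEKKO sketch (a made-up linear regression, assuming the gekko package is installed) shows how the terms in the table map to code:

from gekko import GEKKO
import numpy as np
# made-up feature (x) and label (y) measurements for y = a*x + b
x = np.array([0,1,2,3,4]); y = np.array([1.1,2.9,5.2,6.8,9.1])
m = GEKKO(remote=False)               # regressor (model)
a = m.FV(value=1); a.STATUS = 1       # weight (adjustable parameter)
b = m.FV(value=0); b.STATUS = 1       # weight (adjustable parameter)
xp = m.Param(value=x)                 # feature (measured input)
yp = m.CV(value=y); yp.FSTATUS = 1    # label (measured outcome)
m.Equation(yp == a*xp + b)            # model equation
m.options.IMODE = 2                   # regression mode
m.solve(disp=False)                   # train: minimize the loss (objective)
print('a =',a.value[0],'b =',b.value[0])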
Selection and creation of features is an important step in machine learning. Too many features increase the chance that the classifier or regressor predicts poorly: with many inputs, a single bad value can cause a bad prediction, and more features require more time for data curation, training, and prediction. This lesson shows how to derive and select features for regression and classification.
The first step in building a regressor or classifier is to determine what measurements (input features and output label) are available. You can select the data columns as features or generate derived features with a package such as Featuretools.
You may want to use stock market data to give you an indicator on when to buy (`1`) or sell (`-1`). This indicator is a label. Import the daily stock data for Google for 23 days.
import pandas as pd
import numpy as np
# import 23 days of daily Google stock data
url = 'http://apmonitor.com/che263/uploads/Main/goog.csv'
data = pd.read_csv(url)
# remove the adjusted close column
data = data.drop(columns=['Adj Close'])
data.head()
The features may be any of the categories that could be useful in predicting a future stock price change. The `Open` price, the difference between `High` and `Low` (`Volatility`), the difference between `Close` and `Open` (`Change`), and the `Volume` of trades are features. The `.diff()` function calculates the difference between successive rows and `.fillna(0)` replaces any `NaN` values with zero. Add any other additional features that you would like to consider.
features = ['Open','Volatility','Change','Volume']
# derived features: day-to-day change in (High-Low) and in (Close-Open)
data['Volatility'] = (data['High']-data['Low']).diff()
data['Change'] = (data['Close']-data['Open']).diff()
# any other features?
data.head()
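As a hypothetical example of an additional feature (the name `MA3` and the 3-day window are arbitrary choices, not part of the lesson), a short moving average of the closing price could be appended; the leading `NaN` rows it creates are removed later by `.dropna()`.

# hypothetical extra feature: 3-day moving average of the closing price
data['MA3'] = data['Close'].rolling(3).mean()
features.append('MA3')
data.head()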
A label (outcome) for classification is the sign (`+` or `-`) of the change in close price from one day to the next. The `np.roll( ,-1)` shifts all the values up by one row so that each row shows the change for the next day. The `np.sign()` returns the sign of the difference as a buy or sell indicator and `.dropna()` drops the last row that is `NaN`.
# next-day change in closing price (np.roll shifts the values up one row)
data['Close_diff'] = np.roll(data['Close'].diff(),-1)
data = data.dropna()
label = ['Buy/Sell']
# buy (+1) or sell (-1) indicator from the sign of the next-day change
data['Buy/Sell'] = np.sign(data['Close_diff'])
data.head()
We have now generated a number of features, but we want to evaluate which ones are the best predictors of the labeled outcome. There are a variety of methods to evaluate how many features are needed (correlation) and which are the best (selection). The first step is to separate the data into input `X` and output `y` with data scaling (`0` to `1`).
data[features+label].head()

from sklearn.preprocessing import MinMaxScaler
# scale the features and label to the range 0 to 1
s = MinMaxScaler()
ds = s.fit_transform(data[features+label])
ds = pd.DataFrame(ds, columns=data[features+label].columns)
# separate into input features X and output label y
X = ds[features]
y = ds[label]
ds.head()
There are statistical tests to select the features that have the strongest relationships with the output label. One tool is the scikit-learn `SelectKBest` method with associated statistical tests. This example uses a $\chi^2$ statistical test on the non-negative (scaled) features to rank the best features for predicting the output.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import matplotlib.pyplot as plt
%matplotlib inline
# score each feature with the chi-squared statistical test
bestfeatures = SelectKBest(score_func=chi2, k='all')
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
# combine the feature names and scores and plot as a bar chart
scores = pd.concat([dfcolumns,dfscores],axis=1)
scores.columns = ['Specs','Score']
scores.index = features
scores.plot(kind='bar')
plt.show()
Based on this information, drop any low scoring features with `.remove()`.
# drop any low scoring features with features.remove('')
print(features)
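A hypothetical illustration of `.remove()` is to drop the single lowest-scoring feature; whether a feature should actually be removed is a judgment call based on the scores.

# hypothetical illustration: remove the lowest-scoring feature from the list
lowest = scores['Score'].idxmin()   # scores.index holds the feature names
if lowest in features:
    features.remove(lowest)
print(features)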
There is a method that comes with a tree-based classifier to give a score for each feature of the data. A higher score indicates more importance and relevance for predicting the output variable. The results change with each analysis due to the stochastic nature of the calculation, but `Volatility` is again a feature that typically ranks highest.
from sklearn.ensemble import ExtraTreesClassifier
# fit a tree ensemble and rank the relative importance of each feature
model = ExtraTreesClassifier(n_estimators=100)
model.fit(X,np.ravel(y))
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(4).plot(kind='bar')
plt.show()
Correlation shows how strongly features are related to each other. A large value, either positive or negative, indicates that the values are correlated. Correlated values mean that one of them may be eliminated because they provide similar input information. A heatmap is a symmetric visual grid of the correlation matrix. The diagonal is always 1 because each value perfectly correlates with itself. From the heatmap, determine which features are most correlated and could be removed.
import seaborn as sns
corrmat = ds.corr()
top_features = corrmat.index
plt.figure(figsize=(5,5))
sns.heatmap(ds[top_features].corr(),annot=True,cmap="RdYlGn")
b, t = plt.ylim(); plt.ylim(b+0.5, t-0.5) # addresses issue in matplotlib 3.1.1
plt.show()
Consider how you could determine if the TCLab heater is on or off without knowing the `Q1` value. You would likely measure the `T1` temperature and observe if the temperature is rising or falling. However, the temperature and slope alone are not enough because the temperature continues to rise for 10-20 seconds after the heater is turned off. As a result, you also need the 2nd derivative of the temperature to classify the heater state as on or off. A positive 2nd derivative is another clue that the heater is on.
In the case of whether the heater is on or off (measured outcome), the temperature and its derivatives are the features. Run the code below to generate temperature data with the heater `on` at 100% or `off` at 0% in 20-30 second intervals for 3 minutes.
import tclab, time
import numpy as np
import pandas as pd
try:
    # generate data with a connected TCLab device
    with tclab.TCLab() as lab:
        n = 180; t = np.linspace(0,n-1,n)
        Q1 = np.zeros(n); T1 = np.zeros(n)
        Q2 = np.zeros(n); T2 = np.zeros(n)
        # heater on (100%) / off (0%) intervals
        # (slices beyond n=180 have no effect for this 3 minute test)
        Q1[20:41] = 100.0; Q1[60:91] = 100.0
        Q1[150:181] = 100.0; Q1[190:206] = 100.0
        Q1[220:251] = 100.0; Q1[260:291] = 100.0
        print('Time Q1 Q2 T1 T2')
        for i in range(180):
            T1[i] = lab.T1; T2[i] = lab.T2
            lab.Q1(Q1[i])
            if i%10==0:
                print(int(t[i]),Q1[i],Q2[i],T1[i],T2[i])
            time.sleep(1)
        data = np.column_stack((t,Q1,Q2,T1,T2))
        data7 = pd.DataFrame(data,columns=['Time','Q1','Q2','T1','T2'])
        data7.to_csv('07-tclab.csv',index=False)
except:
    # no device connected: import data from an online source
    print('Connect TCLab to generate new data')
    print('Importing data from online source')
    url = 'http://apmonitor.com/do/uploads/Main/tclab_data5.txt'
    data7 = pd.read_csv(url)
- Create three features from the data, including the temperature and its derivatives.
- Add the derivatives as columns in `data7`.
- Scale `data7` to be between `0` and `1` with `d7 = s.fit_transform(data7)`. Don't forget to translate the scaled values back to a `pandas` DataFrame.
- Use `SelectKBest` to determine the best features.
- Generate a heat map to determine the correlation of the features (one possible sketch of these steps is shown below).
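One possible sketch of these steps, reusing `s`, `SelectKBest`, `chi2`, `sns`, and `plt` from the cells above (the column names `dT1`, `d2T1`, and `On/Off` are arbitrary choices):

import numpy as np
import pandas as pd
# features: temperature plus its 1st and 2nd derivatives
data7['dT1']  = np.gradient(data7['T1'], data7['Time'])
data7['d2T1'] = np.gradient(data7['dT1'], data7['Time'])
# label: heater on (1) or off (0), derived from Q1 only to build the label
data7['On/Off'] = (data7['Q1']>50).astype(int)
featuresT = ['T1','dT1','d2T1']
labelT = ['On/Off']
# scale to 0-1 and translate back to a pandas DataFrame
d7 = s.fit_transform(data7)
d7 = pd.DataFrame(d7, columns=data7.columns)
XT = d7[featuresT]; yT = d7[labelT]
# SelectKBest scores (chi2 needs non-negative features, satisfied by scaling)
fitT = SelectKBest(score_func=chi2, k='all').fit(XT, np.ravel(yT))
print(pd.Series(fitT.scores_, index=featuresT))
# heat map of the feature and label correlations
sns.heatmap(d7[featuresT+labelT].corr(), annot=True, cmap="RdYlGn")
plt.show()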