Regression is the process of adjusting model parameters to fit a prediction y to measured values z. The independent variables x are the inputs that the model uses to generate the predictions y. For machine learning, the objective is to minimize a loss function by adjusting the model parameters.
x = Independent Variable, Input, Feature
y = Dependent Variable, Output, Label
z = Output Measurement
A common choice of loss function is the sum of squared errors between the measured and predicted values:
$Loss = \sum_{i=1}^{n}\left(y_i-z_i\right)^2$
where n is the number of observations. Regression requires labelled data (output values) for training. Classification, on the other hand, can be either supervised (with z measurements as labels) or unsupervised (without labels or z measurements). Run the following code to load 31 sample data points with input x and measured output z.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
n = 31
x = np.linspace(0,3,n)
z = np.array([1.89,-0.12,-0.32,2.11,-0.25,1.01,0.17,2.75,2.01,5.5,\
0.87,3.31,2.29,2.73,3.67,3.92,4.29,5.27,3.85,4.26,\
5.75,5.82,6.36,9.13,7.61,9.52,9.53,12.49,12.29,13.7,14.12])
data = pd.DataFrame(np.vstack((x,z)).T,columns=['x','z'])
plt.plot(x,z,'ro',label='Measured')
plt.xlabel('x'); plt.ylabel('z'); plt.legend()
plt.show()
There are many model forms such as linear, polynomial, and nonlinear. A familiar linear model is a line with slope a and intercept b:

$y = a\,x + b$
A simple method for linear regression is with numpy: fit the model with p=np.polyfit(x,z,1) and evaluate it with np.polyval(p,x). Run the following code to determine the slope and intercept that minimize the sum of squared errors (least squares) between the predicted y and measured z output.
p1 = np.polyfit(x,z,1)
print('Slope, Intercept:' + str(p1))
plt.plot(x,z,'ro',label='Measured (z)')
plt.plot(x,np.polyval(p1,x),'b-',label='Predicted (y)')
plt.legend(); plt.show()
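With the fitted coefficients, the loss from the formula above can be checked directly:

y = np.polyval(p1,x)                   # model predictions
print('Loss (SSE):',np.sum((y-z)**2))  # sum of squared errors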
The $R^2$ value can be calculated with the measured (true) and model (predicted) values for any regression model, not just linear.
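For reference, it compares the squared model error to the variance of the measurements:

$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(z_i-y_i\right)^2}{\sum_{i=1}^{n}\left(z_i-\bar{z}\right)^2}$

where $\bar{z}$ is the mean of the measured values; a value of 1 indicates a perfect fit.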
from sklearn.metrics import r2_score
meas = [3.0, 2.0, 1.9, 7.1]
model = [2.5, 1.8, 2.0, 8.0]
r2_score(meas, model)
Another package is statsmodels, which performs standard Ordinary Least Squares (OLS) analysis and produces a report summary. The input x is augmented with a np.ones(n) column so that the model also predicts the intercept:

xc = np.vstack((x,np.ones(n))).T

This is accomplished more conveniently with xc=sm.add_constant(x).
import statsmodels.api as sm
xc = sm.add_constant(x)
model = sm.OLS(z,xc).fit()
predictions = model.predict(xc)
model.summary()
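Individual quantities are also available from the fitted results object; for example:

print(model.params)    # intercept (const) and slope
print(model.rsquared)  # R^2 of the fit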
Create a linear model with the data:
xr = [0.0,1.0,2.0,3.5,5.0]
yr = [0.7,0.55,0.34,0.3,0.2]
Calculate the $R^2$ value and show the data and linear fit on a plot.
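One possible solution sketch with np.polyfit and r2_score (using the imports from the earlier cells):

xr = [0.0,1.0,2.0,3.5,5.0]
yr = [0.7,0.55,0.34,0.3,0.2]
pr = np.polyfit(xr,yr,1)          # linear fit
yfit = np.polyval(pr,xr)
print('R^2:',r2_score(yr,yfit))
plt.plot(xr,yr,'bo',label='data')
plt.plot(xr,yfit,'k-',label='linear fit')
plt.xlabel('xr'); plt.ylabel('yr'); plt.legend(); plt.show()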
A polynomial model may also be quadratic:

$y = a\,x^2 + b\,x + c$

A quadratic model is really just a linear model with two inputs, x and w=x^2:

$y = a\,w + b\,x + c$

This is also called multiple linear regression when there is more than one input, y=f(x,w), where f is a function of the inputs x and w.
p2 = np.polyfit(x,z,2)
print(p2)
plt.plot(x,z,'ro',label='Measured (z)')
plt.plot(x,np.polyval(p2,x),'b-',label='Predicted (y)')
plt.legend(); plt.show()
There is more information on the regressed coefficients for the quadratic fit if you use statsmodels.
import statsmodels.api as sm
xc = np.vstack((x**2,x,np.ones(n))).T
model = sm.OLS(z,xc).fit()
predictions = model.predict(xc)
model.summary()
Create a polynomial model with the data:
xr = [0.0,1.0,2.0,3.5,5.0]
yr = [1.7,1.45,1.05,0.4,0.2]
Show the polynomial model on a plot.
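One possible solution sketch, using a quadratic fit with np.polyfit:

xr = [0.0,1.0,2.0,3.5,5.0]
yr = [1.7,1.45,1.05,0.4,0.2]
pr = np.polyfit(xr,yr,2)          # quadratic polynomial
xf = np.linspace(0,5,100)
plt.plot(xr,yr,'ro',label='data')
plt.plot(xf,np.polyval(pr,xf),'b-',label='quadratic fit')
plt.legend(); plt.show()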
Nonlinear regression requires a different tool such as curve_fit, which needs a function f that returns a prediction, along with the data x and z. The unknown parameters a and b are adjusted so that the predicted output matches the measured output z.
from scipy.optimize import curve_fit
def f(x,a,b):
    return a * np.exp(b*x)
p, pcov = curve_fit(f,x,z)
print('p = '+str(p))
plt.plot(x,z,'ro')
plt.plot(x,f(x,*p),'b-')
plt.show()
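The covariance matrix pcov from curve_fit also provides approximate standard errors for the regressed parameters:

perr = np.sqrt(np.diag(pcov))   # 1-sigma parameter uncertainties
print('Parameter std errors: '+str(perr))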
Create a nonlinear model with the data:
$y = a \ln\left( b \, x \right)$
xr = [0.1,1.0,2.0,3.5,5.0]
yr = [0.2,0.4,1.05,1.45,1.7]
Show the nonlinear model on a plot.
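One possible solution sketch with curve_fit (the function name fl is arbitrary):

def fl(x,a,b):
    return a*np.log(b*x)
xr = np.array([0.1,1.0,2.0,3.5,5.0])
yr = np.array([0.2,0.4,1.05,1.45,1.7])
pr,_ = curve_fit(fl,xr,yr)
xf = np.linspace(0.1,5,100)
plt.plot(xr,yr,'ro',label='data')
plt.plot(xf,fl(xf,*pr),'b-',label='a ln(b x)')
plt.legend(); plt.show()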
Machine learning refers to computer algorithms and statistical models that rely on patterns and inference to perform a specific task without explicit instructions. Machine-learned regression models can be as simple as linear regression or as complex as deep learning. This tutorial demonstrates several regression methods with scikit-learn.
To make the plot interactive, add the command:
%matplotlib notebook
%matplotlib inline
from mpl_toolkits.mplot3d import Axes3D
import math
def f(x,y):
    return 2*math.cos(x)*y + x*math.cos(y) - 3*x*y
n = 500
x = (np.random.rand(n)-0.5)*2.0
y = (np.random.rand(n)-0.5)*2.0
z = np.empty_like(x)
for i in range(n):
    z[i] = f(x[i],y[i])
data = pd.DataFrame(np.vstack((x,y,z)).T,columns=['x','y','z'])
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x,y,z,c=z,cmap='plasma')
plt.show()
The data can be scaled with a StandardScaler or left unscaled because the values of x, y, and z are already close to acceptable ranges. If scaling is desired, it can be done with a few lines of additional code and changes to the plots.
from sklearn.preprocessing import StandardScaler
s = StandardScaler()
ds = s.fit_transform(data)
ds = pd.DataFrame(ds,columns=data.columns)
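If the scaled values in ds are used for training, predictions can be mapped back to the original units with the same scaler, e.g. s.inverse_transform(ds).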
The data is split into training and testing sets with train_test_split.
# no data scaling
ds = data
# data splitting into train and test sets
from sklearn.model_selection import train_test_split
train,test = train_test_split(ds, test_size=0.2, shuffle=True)
Run this code to define a fit function that trains each regressor, reports the $R^2$ value on the test set, and displays the data and model predictions on a 3D scatter and surface plot.
def fit(method):
    # create points for plotting surface
    xp = np.arange(-1, 1, 0.1)
    yp = np.arange(-1, 1, 0.1)
    XP, YP = np.meshgrid(xp, yp)
    model = method.fit(train[['x','y']],train['z'])
    zp = method.predict(np.vstack((XP.flatten(),YP.flatten())).T)
    ZP = zp.reshape(np.size(XP,0),np.size(XP,1))
    r2 = method.score(test[['x','y']],test['z'])
    print('R^2: ' + str(r2))
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(ds['x'],ds['y'],ds['z'],c=z,cmap='plasma',label='data')
    ax.plot_surface(XP, YP, ZP, cmap='coolwarm', alpha=0.7,
                    linewidth=0, antialiased=False)
    plt.show()
    return
sklearn
The simplest regressor is a linear model. As expected, this model doesn't perform very well with the nonlinear data, but it does predict the slope of the data.
from sklearn import linear_model
lm = linear_model.LinearRegression()
fit(lm)
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=2)
fit(knn)
from sklearn import svm
s = svm.SVR(gamma='scale')
fit(s)
from sklearn.neural_network import MLPRegressor
nn = MLPRegressor(hidden_layer_sizes=(3),
activation='tanh', solver='lbfgs')
fit(nn)
Find another regressor in scikit-learn such as the Decision Tree Regressor or the Passive Aggressive Regressor. Train and test the regressor with the fit() function in this notebook by passing in the regressor object, such as:
Decision Tree Regressor
from sklearn import tree
dt = tree.DecisionTreeRegressor()
fit(dt)
Passive Aggressive Regressor
from sklearn.linear_model import PassiveAggressiveRegressor
par = PassiveAggressiveRegressor(max_iter=2000,tol=1e-3)
fit(par)
Change the options of the regressor to see if you can improve the $R^2$ value, such as using max_iter=2000 with the PassiveAggressiveRegressor.
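One systematic way to explore the options is a small grid search. Below is a minimal sketch with scikit-learn's GridSearchCV; the parameter grid is only an illustration:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import PassiveAggressiveRegressor
params = {'max_iter':[1000,2000,5000],'tol':[1e-3,1e-4]}
gs = GridSearchCV(PassiveAggressiveRegressor(),params,cv=3)
gs.fit(train[['x','y']],train['z'])
print(gs.best_params_)       # best options found
fit(gs.best_estimator_)      # retrain and plot with the fit() helper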
Deep learning is regression with a neural network that has multiple layers. Regression with deep learning has specialized packages such as TensorFlow, built for large-scale data, and Gekko, built for configurable model structures. Below is one of the examples from a deep learning tutorial with Gekko. The same example with Keras (TensorFlow) is also shown in the link.
from gekko import brain
x = np.linspace(0.0,2*np.pi)
y = np.sin(x)
b = brain.Brain(remote=False)
b.input_layer(1)
b.layer(linear=2)
b.layer(tanh=2)
b.layer(linear=2)
b.output_layer(1)
b.learn(x,y,disp=False)
xp = np.linspace(-2*np.pi,4*np.pi,100)
yp = b.think(xp)
plt.figure()
plt.plot(x,y,'bo',label='meas (label)')
plt.plot(xp,yp[0],'r-',label='model')
plt.legend(); plt.show()
Set heater 1 to 80% with lab.Q1(80) and heater 2 to 60% with lab.Q2(60). Record the temperatures (lab.T1 and lab.T2) every 0.5 seconds (use time.sleep(0.5)) for 30 seconds. Store the values for time, temperature 1, and temperature 2 in numpy arrays. Use time.time() to get the current time in seconds.
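A possible data-collection sketch, assuming the tclab package provides the TCLab() device connection used as lab:

import time
import tclab
with tclab.TCLab() as lab:
    lab.Q1(80); lab.Q2(60)            # set heater levels
    tm,T1,T2 = [],[],[]
    start = time.time()
    for i in range(61):               # 61 samples = 30 seconds at 0.5 s
        tm.append(time.time()-start)
        T1.append(lab.T1); T2.append(lab.T2)
        time.sleep(0.5)
tm,T1,T2 = np.array(tm),np.array(T1),np.array(T2)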
Create a linear model for T2 with regression. Report the $R^2$ value. Add the regression line as a black line to a plot with the measured T2 as blue circles. Add appropriate labels to the plot.
Create a nonlinear regression for T1. Fit the $T_1$ data with a nonlinear model where a, b, and c are adjustable parameters. Create the regression with scikit-learn, keras (tensorflow), or gekko. Report the $R^2$ value for each.