Linear Regression

Linear regression is a statistical model that is used to predict the value of a continuous dependent variable (the response variable) based on the value of one or more independent variables (the predictor variables). It is based on the assumption that there is a linear relationship between the predictor variables and the response variable.

Regression adjusts the parameters of a model to minimize the difference between the predicted output and the measured output. The predicted output may be calculated from a single measured input (univariate linear regression), multiple inputs with a single output (multiple linear regression), or multiple inputs with multiple outputs (multivariate linear regression).

  • linear regression: x and y are scalars
  • multiple linear regression: x is a vector, y is a scalar response
  • multivariate linear regression: x is a vector, y is a vector response

In machine learning terminology, the data inputs (x) are features and the measured outputs (y) are labels. For a single input and single output, m is the slope and c is the intercept.

$$y=m x + c$$

An alternate way to write this is in matrix form, with the slope renamed `\beta_1` and the intercept renamed `\beta_2`.

$$y = \begin{bmatrix}x&1\end{bmatrix} \begin{bmatrix}m\\c\end{bmatrix} = \begin{bmatrix}x&1\end{bmatrix} \begin{bmatrix}\beta_1\\\beta_2\end{bmatrix}$$

Capital letters are often used to indicate multiple inputs (X) or multiple outputs (Y). The difference between the predicted output `X \beta` and the measured output `Y` is the error `\epsilon`.

$$Y=X \beta + \epsilon$$

Linear regression analysis determines if the error `\epsilon` has certain statistical properties. For regression, the objective is to minimize the sum of squared errors.

$$J = \min\left(\sum_{i=1}^n \epsilon_i^2\right)$$

The minimum is where the gradient of the objective `J` with respect to `\beta` equals zero.
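Writing the gradient explicitly (with `J=\left(Y-X\beta\right)^T\left(Y-X\beta\right)`) gives the condition

$$\nabla_\beta J = -2\,X^T \left(Y - X \beta\right) = 0$$

which rearranges to the normal equations. The same result follows informally by dropping the error term and solving for `\beta`: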

$$Y=X \beta$$

Multiply each side by `X^T`

$$X^T Y= X^T X \beta$$

Multiply each side by the inverse of `X^T X` to solve for `\beta`.

$$\beta = \left(X^T X \right)^{-1} X^T Y$$

Although it is possible to solve for the regression parameters this way, dedicated least-squares solvers are more efficient and more numerically stable.
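
As a quick illustration (separate from the worked examples below), the normal-equation formula can be checked against numpy's built-in least-squares solver; the small data set here is made up for demonstration only.

import numpy as np

# hypothetical data for illustration only
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# design matrix [x 1] with a column of ones for the intercept
X = np.vstack((x, np.ones(len(x)))).T

# normal equations: (X^T X) beta = X^T y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# dedicated least-squares solver (preferred numerically)
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(beta_normal, beta_lstsq)  # the two solutions agree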

A common requirement is that the errors (residuals) are normally distributed (`N(\mu,\Sigma)`) with zero mean `\mu=0` and covariance `\Sigma=\sigma^2 I` (a scaled identity matrix). This implies that the residuals are i.i.d. (independent and identically distributed) random variables. Statistical tests determine if the data fits a linear regression model or if there are unmodeled features of the data that may require a different type of regression model.
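
As a minimal sketch of such a check (assuming the residuals from a fitted model are already available as an array), the Shapiro-Wilk test from scipy.stats can flag non-normal residuals; the values below are hypothetical.

import numpy as np
from scipy.stats import shapiro

# hypothetical residuals from a fitted linear model (illustration only)
residuals = np.array([0.05,-0.12,0.08,-0.03,0.10,-0.07,0.02,-0.04])

# Shapiro-Wilk test of normality: a small p-value (e.g. <0.05)
# suggests the residuals are not normally distributed
stat, p = shapiro(residuals)
print('Shapiro-Wilk p-value =', round(p, 3))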

Two examples demonstrate multiple Python methods for (1) univariate linear regression and (2) multiple linear regression.

Example 1: Linear Regression

Objective: Perform univariate (single input factor) linear regression on sample data with and without a parameter constraint.

For linear regression, find unknown parameters a0 (slope) and a1 (intercept) to minimize the difference between measured y and predicted yfit.

Data

$$x = [4,5,2,3,-1,1,6,7]$$

$$y = [0.3,0.8,-0.05,0.1,-0.8,-0.5,0.5,0.65]$$

Linear Equation

$$y_{fit} = a_0 x + a_1$$

Minimize Objective

$$\min_{a_0,a_1} \sum_{i=1}^n \left(y_i-y_{fit,i}\right)^2$$

where n is the length of y and a0 and a1 are adjusted to minimize the sum of the squared errors.

Report the parameter values, the R2 value of the fit, and display a plot of the results. Enforce a constraint that the intercept is greater than -0.5 and show the effect of that constraint on the regression fit compared to the unconstrained least-squares solution.

Solution

There are many approaches to regression in Python; five different packages are used here to generate the solution. The unconstrained methods all give the same solution, but the underlying approaches differ. The methods are from the packages:

  1. scipy.stats.linregress
  2. numpy.polyfit
  3. numpy.linalg
  4. statsmodels ordinary least squares
  5. gekko optimization (allows constraints)

Sample output from the statsmodels OLS summary (method 4):

                            OLS Regression Results                            
============================================================================
Dep. Variable:                      y  R-squared:                     0.897
Model:                            OLS  Adj. R-squared:                0.880
Method:                 Least Squares  F-statistic:                   52.19
Date:                Wed, 26 Aug 2020  Prob (F-statistic):         0.000357
Time:                        22:05:45  Log-Likelihood:               2.9364
No. Observations:                   8  AIC:                          -1.873
Df Residuals:                       6  BIC:                          -1.714
Df Model:                           1                                        
Covariance Type:          nonrobust                                        
============================================================================
             coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------
x1         0.1980      0.027      7.224      0.000       0.131       0.265
const     -0.5432      0.115     -4.721      0.003      -0.825      -0.262
============================================================================
Omnibus:                    2.653   Durbin-Watson:                 0.811
Prob(Omnibus):              0.265   Jarque-Bera (JB):              0.918
Skew:                       0.827   Prob(JB):                      0.632
Kurtosis:                   2.862   Cond. No.                       7.32
============================================================================

import numpy as np
from scipy.stats import linregress
import statsmodels.api as sm
import matplotlib.pyplot as plt
from gekko import GEKKO

# Data
x = np.array([4,5,2,3,-1,1,6,7])
y = np.array([0.3,0.8,-0.05,0.1,-0.8,-0.5,0.5,0.65])

# calculate R^2
def rsq(y1,y2):
    yresid= y1 - y2
    SSresid = np.sum(yresid**2)
    SStotal = len(y1) * np.var(y1)
    r2 = 1 - SSresid/SStotal
    return r2

# Method 1: scipy linregress
slope,intercept,r,p_value,std_err = linregress(x,y)
a = [slope,intercept]
print('R^2 linregress = '+str(r**2))

# Method 2: numpy polyfit (1=linear)
a = np.polyfit(x,y,1); print(a)
yfit = np.polyval(a,x)
print('R^2 polyfit    = '+str(rsq(y,yfit)))

# Method 3: numpy linalg solution
#       y =     X a
#   X^T y = X^T X a
X = np.vstack((x,np.ones(len(x)))).T
# matrix operations
XX = np.dot(X.T,X)
XTy = np.dot(X.T,y)
a = np.linalg.solve(XX,XTy)
# same solution with lstsq
a = np.linalg.lstsq(X,y,rcond=None)[0]
yfit = a[0]*x+a[1]; print(a)
print('R^2 matrix     = '+str(rsq(y,yfit)))

# Method 4: statsmodels ordinary least squares
X = sm.add_constant(x,prepend=False)
model = sm.OLS(y,X).fit()
yfit = model.predict(X)
a = model.params
print(model.summary())

# Method 5: Gekko for constrained regression
m = GEKKO(remote=False); m.options.IMODE=2
c  = m.Array(m.FV,2); c[0].STATUS=1; c[1].STATUS=1
c[1].lower=-0.5
xd = m.Param(x); yd = m.Param(y); yp = m.Var()
m.Equation(yp==c[0]*xd+c[1])
m.Minimize((yd-yp)**2)
m.solve(disp=False)
c = [c[0].value[0],c[1].value[0]]
print(c)

# plot data and regressed line
plt.plot(x,y,'ko',label='data')
xp = np.linspace(-2,8,100)
slope     = str(np.round(a[0],2))
intercept = str(np.round(a[1],2))
eqn = 'LstSQ: y='+slope+'x'+intercept
plt.plot(xp,a[0]*xp+a[1],'r-',label=eqn)
slope     = str(np.round(c[0],2))
intercept = str(np.round(c[1],2))
eqn = 'Constraint: y='+slope+'x'+intercept
plt.plot(xp,c[0]*xp+c[1],'b--',label=eqn)
plt.grid()
plt.legend()
plt.show()
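
The constrained fit above uses gekko. As an alternative sketch (not part of the original solution), scipy.optimize.lsq_linear also supports simple bounds on the parameters and should give a comparable constrained result for this problem.

from scipy.optimize import lsq_linear
import numpy as np

x = np.array([4,5,2,3,-1,1,6,7])
y = np.array([0.3,0.8,-0.05,0.1,-0.8,-0.5,0.5,0.65])

# design matrix [x 1] as in Method 3
A = np.vstack((x,np.ones(len(x)))).T

# lower bound of -0.5 on the intercept (second parameter), slope unbounded
res = lsq_linear(A, y, bounds=([-np.inf,-0.5],[np.inf,np.inf]))
print(res.x)  # [slope, intercept] with intercept >= -0.5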

Example 2: Multiple Linear Regression

Objective: Perform multiple linear regression on sample data with two inputs.

For linear regression, find unknown parameters a0-a2 to minimize the difference between measured y and predicted yfit.

Data

$$x_0 = [4,5,2,3,-1,1,6,7]$$

$$x_1 = [3,2,3,4, 3,5,2,6]$$

$$y = [0.3,0.8,-0.05,0.1,-0.8,-0.5,0.5,0.65]$$

Linear Equation

$$y_{fit} = a_0 x_0 + a_1 x_1 + a_2$$

Minimize Objective

$$\min_{a_0,a_1,a_2} \sum_{i=1}^n \left(y_i-y_{fit,i}\right)^2$$

where n is the length of y and a0-a2 are adjusted to minimize the sum of the squared errors.

Report the parameter values, the R2 value of fit, and display a plot of the results.

Solution

As with univariate linear regression, there are several methods for multiple linear regression in Python; three packages are used here to generate the solution, since fewer Python packages support multiple or multivariate linear regression. The methods are from the packages:

  1. numpy.linalg
  2. statsmodels ordinary least squares
  3. gekko optimization (allows constraints)

Sample output from the statsmodels OLS summary (method 2):

                            OLS Regression Results                            
==========================================================================
Dep. Variable:                      y   R-squared:                   0.933
Model:                            OLS   Adj. R-squared:              0.906
Method:                 Least Squares   F-statistic:                 34.77
Date:                Wed, 26 Aug 2020   Prob (F-statistic):        0.00117
Time:                        23:16:24   Log-Likelihood:             4.6561
No. Observations:                   8   AIC:                        -3.312
Df Residuals:                       5   BIC:                        -3.074
Df Model:                           2                                        
Covariance Type:        nonrobust                                        
==========================================================================
             coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------
x1         0.2003      0.024      8.256      0.000       0.138       0.263
x2        -0.0750      0.046     -1.639      0.162      -0.193       0.043
const     -0.2883      0.186     -1.551      0.182      -0.766       0.190
==========================================================================
Omnibus:                    1.262   Durbin-Watson:                   1.558
Prob(Omnibus):              0.532   Jarque-Bera (JB):                0.075
Skew:                      -0.237   Prob(JB):                        0.963
Kurtosis:                   3.026   Cond. No.                         16.9
==========================================================================

import numpy as np
import statsmodels.api as sm
from gekko import GEKKO

# Data
x0 = np.array([4,5,2,3,-1,1,6,7])
x1 = np.array([3,2,3,4, 3,5,2,6])
y = np.array([0.3,0.8,-0.05,0.1,-0.8,-0.5,0.5,0.65])

# calculate R^2
def rsq(y1,y2):
    yresid= y1 - y2
    SSresid = np.sum(yresid**2)
    SStotal = len(y1) * np.var(y1)
    r2 = 1 - SSresid/SStotal
    return r2

# Method 1: numpy linalg solution
#       Y =     X a
#   X^T Y = X^T X a
X = np.vstack((x0,x1,np.ones(len(x0)))).T
a = np.linalg.lstsq(X,y,rcond=None)[0]; print(a)
yfit = a[0]*x0+a[1]*x1+a[2]
print('R^2 = '+str(rsq(y,yfit)))

# Method 2: statsmodels ordinary least squares
model = sm.OLS(y,X).fit()
predictions = model.predict(X)
print(model.summary())

# Method 3: gekko
m = GEKKO(remote=False); m.options.IMODE=2
c  = m.Array(m.FV,3)
for ci in c:
    ci.STATUS=1
xd = m.Array(m.Param,2); xd[0].value=x0; xd[1].value=x1
yd = m.Param(y); yp = m.Var()
s =  m.sum([c[i]*xd[i] for i in range(2)])
m.Equation(yp==s+c[-1])
m.Minimize((yd-yp)**2)
m.solve(disp=False)
a = [c[i].value[0] for i in range(3)]
print(a)

# plot data
from mpl_toolkits import mplot3d
from matplotlib import cm
import matplotlib.pyplot as plt
fig = plt.figure()
ax  = plt.axes(projection='3d')
ax.plot3D(x0,x1,y,'ko')
x0t = np.arange(-1,7,0.25)
x1t = np.arange(2,6,0.25)
X0,X1 = np.meshgrid(x0t,x1t)
Yt = a[0]*X0+a[1]*X1+a[2]
ax.plot_surface(X0,X1,Yt,cmap=cm.coolwarm,alpha=0.5)
plt.show()
  • Dep. Variable: Model output
  • Model: Regression model (OLS=Ordinary Least Squares)
  • Method: Regression method
  • Date/Time: Time stamp
  • No. Observations: Number of data points
  • DF Residuals: Residual degrees of freedom. Number of data points – number of parameters
  • DF Model: Number of parameters but not including the constant term (intercept)
  • R-squared: Coefficient of determination (0-1), a statistical measure of how closely the regression line fits the data points (1=perfect fit)
  • Adj. R-squared: Adjusted R-squared based on the number of data points and DF Residuals
  • F-statistic: Significance of the fit
  • Prob (F-statistic): Probability of the F-statistic
  • Log-likelihood: log of the likelihood function
  • AIC: Akaike Information Criterion
  • BIC: Bayesian Information Criterion
  • coef: the regression coefficient
  • std err: standard error of the estimated coefficient
  • t: t-statistic value, a measure of the coefficient significance
  • P > |t|: P-value; if less than the significance level (typically 0.05), the coefficient is statistically significant in predicting the output
  • [0.025 0.975]: 95% confidence interval coefficient bounds
  • Skewness: measure of data symmetry. With |skewness|>1 data is highly skewed. If |skewness|<0.5 the data is approximately symmetric.
  • Kurtosis: shape of the distribution that compares data at the center with the tails. Data sets with high kurtosis have heavy tails or more outliers. Data sets with low kurtosis have fewer outliers. Kurtosis is 3 for a normal distribution.
  • Omnibus: D'Agostino's test, a statistical test for the presence of skewness and kurtosis
  • Prob(Omnibus): Omnibus probability
  • Jarque-Bera: Test of skewness and kurtosis
  • Prob (JB): Jarque-Bera probability
  • Durbin-Watson: Test for autocorrelation in the residuals, relevant when the errors have a time-series component
  • Cond. No: Condition number, a test for multicollinearity (whether the inputs are linearly related). A high condition number indicates that some of the inputs or coefficients may be redundant
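
Most of these summary quantities are also available programmatically from the fitted statsmodels results object. A brief sketch, assuming model is the fitted result from sm.OLS(y,X).fit() in the examples above:

print(model.params)      # coef
print(model.bse)         # std err
print(model.tvalues)     # t
print(model.pvalues)     # P>|t|
print(model.conf_int())  # [0.025, 0.975] confidence bounds
print(model.rsquared, model.rsquared_adj)   # R-squared, Adj. R-squared
print(model.fvalue, model.f_pvalue)         # F-statistic and its probability
print(model.aic, model.bic)                 # AIC, BIC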

Example 3: Scale-up

For large problems, it is important to know how a linear regression package performs with larger data sets on CPU and GPU hardware. The IPython Notebook below evaluates the clock time for problems with up to `10^8` samples.
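
The notebook is not reproduced here, but a minimal timing sketch along the same lines (assuming synthetic random data and numpy's least-squares solver, not the notebook's exact code) looks like this:

import time
import numpy as np

# time a least-squares solve on increasingly large synthetic problems
for n in [10**4, 10**5, 10**6]:
    X = np.column_stack((np.random.rand(n,2), np.ones(n)))  # two inputs + intercept
    beta_true = np.array([0.2, -0.1, 0.5])                  # hypothetical coefficients
    y = X @ beta_true + 0.01*np.random.randn(n)
    start = time.time()
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    print(n, 'samples:', round(time.time()-start, 4), 'sec')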

Linear Regression with CPU (i7-6600U Intel Processor)

Linear Regression with GPU (Google Colab with GPU Kernel)

The GPU version changes test_time = 5.0 to avoid exceeding the RAM limit that would otherwise require upgrading to a Colab Pro subscription.

The results also include a bottom subplot that shows a neural network trained on the same data.

MATLAB Live Script


✅ Knowledge Check

1. Which of the following statements best describes Linear Regression?

A. Linear regression can only predict discrete values.
Incorrect. Linear regression is used to predict continuous dependent variables.
B. Linear regression assumes there is a nonlinear relationship between the predictor and response variables.
Incorrect. Linear regression assumes a linear relationship between the predictor and response variables.
C. In linear regression, the errors (residuals) should be i.i.d. (independent and identically distributed) random variables.
Correct. This is a common assumption in linear regression to ensure that the model is accurate and meaningful.
D. The formula for linear regression is `y = m \times x`, where m is the intercept and x is the slope.
Incorrect. The correct formula is `y = m \times x + c` where m is the slope and c is the intercept.

2. In multiple linear regression, how do the predictor variables relate to the response variable?

A. Each predictor variable contributes to the prediction of the response variable in a nonlinear manner.
Incorrect. In multiple linear regression, predictor variables contribute linearly to predict the response variable.
B. Only one predictor variable is used to predict the response variable.
Incorrect. Multiple linear regression uses more than one predictor variable.
C. The relationship between the predictor variables and the response variable is represented using multiple equations.
Incorrect. The relationship is represented using a single equation, but with multiple terms for each predictor.
D. Each predictor variable contributes linearly to the prediction of the response variable.
Correct. This is the essence of multiple linear regression.