Linear Regression

Linear regression is a statistical model that is used to predict the value of a continuous dependent variable (the response variable) based on the value of one or more independent variables (the predictor variables). It is based on the assumption that there is a linear relationship between the predictor variables and the response variable.

Regression is the method of adjusting the parameters of a model to minimize the difference between the predicted output and the measured output. The predicted output is calculated from a single measured input (univariate linear regression), from multiple inputs with a single output (multiple linear regression), or from multiple inputs with multiple outputs (multivariate linear regression).

  • linear regression: x and y are scalars
  • multiple linear regression: x is a vector, y is a scalar response
  • multivariate linear regression: x is a vector, y is a vector response

In machine learning terminology, the data inputs (x) are features and the measured outputs (y) are labels. For a single input and single output, m is the slope and c is the intercept.

y=mx+c

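An alternative is to write this in matrix form, renaming the slope to β1 and the intercept to β2.

$y = \begin{bmatrix} x & 1 \end{bmatrix} \begin{bmatrix} m \\ c \end{bmatrix} = \begin{bmatrix} x & 1 \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}$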

Capital letters are often used to indicate when there are multiple inputs (X) or multiple outputs (Y). The difference between the predicted Xβ and measured Y output is the error ε.

Y=Xβ+ϵ

Linear regression analysis determines if the error ε has certain statistical properties. For regression, the objective is to minimize the sum of squared errors.

$J = \min \left( \sum_{i=1}^{n} \epsilon_i^2 \right)$

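The minimum is where the gradient of the objective J with respect to β is zero. Setting that gradient, $-2X^T(Y - X\beta)$, to zero gives the normal equations. The same result follows from a shortcut that starts with the model without the error term and solves for β.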

Y=Xβ

Multiply each side by $X^T$.

$X^T Y = X^T X \beta$

Multiply each side by the inverse of $X^T X$ to solve for β.

$\beta = (X^T X)^{-1} X^T Y$

Although it is possible to solve for the linear regression parameters this way, there are more efficient and numerically stable methods, such as QR or singular value decomposition, that avoid forming the inverse explicitly.
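As a minimal sketch of the matrix solution, the snippet below builds the design matrix [x 1] and solves for β two ways: through the normal equations and through a least-squares solver. The data is illustrative (an exact line chosen for this sketch, not taken from the examples that follow), and the variable names are assumptions.

```python
import numpy as np

# Illustrative data (an assumption for this sketch): points on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Design matrix X = [x 1] so that beta = [slope, intercept]
X = np.column_stack((x, np.ones_like(x)))

# Normal equations: solve (X^T X) beta = X^T Y as a linear system
# instead of computing the inverse explicitly
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver (SVD based), generally the more robust choice
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", beta_normal)
print("lstsq:           ", beta_lstsq)
```

Both approaches should agree for a well-conditioned problem such as this one. The difference matters mainly when the columns of X are nearly collinear.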

A common requirement is that the errors (residuals) are normally distributed (N(μ,Σ)) with zero mean μ=0 and covariance Σ=σ²I (a constant variance σ² times the identity matrix). For normally distributed errors, this means that the residuals are i.i.d. (independent and identically distributed) random variables. Statistical tests determine if the data fits a linear regression model or if there are unmodeled features of the data that may require a different type of regression model.
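One way to check the residual assumption is sketched below. It fits a line to synthetic data (generated here for illustration only, with a fixed random seed), then inspects the residual mean and applies the Shapiro-Wilk normality test from scipy.stats. The choice of test is an assumption for this sketch; other normality tests can be used in the same way.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for this sketch): a line plus Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 1.5 * x - 2.0 + rng.normal(0.0, 0.5, x.size)

# Ordinary least squares fit with an intercept column
X = np.column_stack((x, np.ones_like(x)))
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals: measured minus predicted
resid = y - X @ beta
resid_mean = resid.mean()

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal
stat, p_value = stats.shapiro(resid)

print(f"residual mean = {resid_mean:.2e}")
print(f"Shapiro-Wilk statistic = {stat:.3f}, p-value = {p_value:.3f}")
```

With an intercept in the model, the least-squares residuals have zero mean by construction, so the normality and independence checks carry the useful information.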

Two examples demonstrate multiple Python methods for (1) univariate linear regression and (2) multiple linear regression.

Example 1: Linear Regression

Objective: Perform univariate (single input factor) linear regression on sample data with and without a parameter constraint.

For linear regression, find unknown parameters a0 (slope) and a1 (intercept) to minimize the difference between measured y and predicted yfit.

Data

x=[4,5,2,3,-1,1,6,7]

y=[0.3,0.8,-0.05,0.1,-0.8,-0.5,0.5,0.65]

Linear Equation

$y_{fit} = a_0 x + a_1$

Minimize Objective

$\min_{a_0,a_1} \sum_{i=1}^{n} \left(y_i - y_{fit,i}\right)^2$

where n is the length of y and a0 and a1 are adjusted to minimize the sum of the squared errors.

Report the parameter values and the $R^2$ value of the fit, and display a plot of the results. Enforce the constraint intercept > -0.5 and show its effect on the regression fit compared to the unconstrained least squares solution.

Solution

There are many ways to perform regression in Python. Five methods from four packages generate the solution here. The unconstrained methods all give the same solution, but the approaches differ. The methods are:

  1. scipy.stats.linregress
  2. numpy.polyfit
  3. numpy.linalg
  4. statsmodels ordinary least squares
  5. gekko optimization (allows constraints)
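The first three methods in the list can be sketched as follows. The data arrays are written so that they reproduce the coefficients reported in the statsmodels summary for this example (slope 0.1980, intercept -0.5432). The statsmodels and gekko versions are omitted here to keep the sketch dependent only on numpy and scipy.

```python
import numpy as np
from scipy import stats

# Example 1 data, with the signs that reproduce the reported coefficients
x = np.array([4, 5, 2, 3, -1, 1, 6, 7], dtype=float)
y = np.array([0.3, 0.8, -0.05, 0.1, -0.8, -0.5, 0.5, 0.65])

# Method 1: scipy.stats.linregress returns slope, intercept, and r-value
res = stats.linregress(x, y)
slope1, intercept1 = res.slope, res.intercept
r2 = res.rvalue**2

# Method 2: numpy.polyfit with degree 1 (highest power first)
slope2, intercept2 = np.polyfit(x, y, 1)

# Method 3: numpy.linalg.lstsq on the design matrix [x 1]
X = np.column_stack((x, np.ones_like(x)))
beta = np.linalg.lstsq(X, y, rcond=None)[0]
slope3, intercept3 = beta

print(f"linregress: slope={slope1:.4f} intercept={intercept1:.4f} R^2={r2:.3f}")
print(f"polyfit:    slope={slope2:.4f} intercept={intercept2:.4f}")
print(f"lstsq:      slope={slope3:.4f} intercept={intercept3:.4f}")
```

All three calls solve the same unconstrained least-squares problem, so the printed values should agree to numerical precision.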

The statsmodels OLS summary for this fit:

                            OLS Regression Results                            
============================================================================
Dep. Variable:                      y  R-squared:                     0.897
Model:                            OLS  Adj. R-squared:                0.880
Method:                 Least Squares  F-statistic:                   52.19
Date:                Wed, 26 Aug 2020  Prob (F-statistic):         0.000357
Time:                        22:05:45  Log-Likelihood:               2.9364
No. Observations:                   8  AIC:                          -1.873
Df Residuals:                       6  BIC:                          -1.714
Df Model:                           1                                        
Covariance Type:          nonrobust                                        
============================================================================
             coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------
x1         0.1980      0.027      7.224      0.000       0.131       0.265
const     -0.5432      0.115     -4.721      0.003      -0.825      -0.262
============================================================================
Omnibus:                    2.653   Durbin-Watson:                 0.811
Prob(Omnibus):              0.265   Jarque-Bera (JB):              0.918
Skew:                       0.827   Prob(JB):                      0.632
Kurtosis:                   2.862   Cond. No.                       7.32
============================================================================
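The constrained fit can be sketched without gekko by using scipy.optimize.lsq_linear, which solves a linear least-squares problem with bounds on the parameters. This is a substitute technique chosen for this sketch, not the gekko approach listed above. The bound is written as intercept ≥ -0.5, and the data arrays are the ones that reproduce the unconstrained coefficients in the summary table.

```python
import numpy as np
from scipy.optimize import lsq_linear

# Example 1 data, with the signs that reproduce the reported coefficients
x = np.array([4, 5, 2, 3, -1, 1, 6, 7], dtype=float)
y = np.array([0.3, 0.8, -0.05, 0.1, -0.8, -0.5, 0.5, 0.65])

# Design matrix [x 1]: parameters are [slope, intercept]
X = np.column_stack((x, np.ones_like(x)))

# Unconstrained least squares for comparison
slope_u, intercept_u = np.linalg.lstsq(X, y, rcond=None)[0]

# Bounds per parameter: slope is free, intercept must be >= -0.5
lower = np.array([-np.inf, -0.5])
upper = np.array([np.inf, np.inf])
res = lsq_linear(X, y, bounds=(lower, upper))
slope_c, intercept_c = res.x

print(f"unconstrained: slope={slope_u:.4f} intercept={intercept_u:.4f}")
print(f"constrained:   slope={slope_c:.4f} intercept={intercept_c:.4f}")
```

Because the unconstrained intercept (-0.5432) violates the bound, the constrained intercept moves to the bound and the slope adjusts to compensate, which slightly increases the sum of squared errors.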

Example 2: Multiple Linear Regression

Objective: Perform multiple linear regression on sample data with two inputs.

For linear regression, find unknown parameters a0-a2 to minimize the difference between measured y and predicted yfit.

Data

x0=[4,5,2,3,-1,1,6,7]

x1=[3,2,3,4,3,5,2,6]

y=[0.3,0.8,-0.05,0.1,-0.8,-0.5,0.5,0.65]

Linear Equation

$y_{fit} = a_0 x_0 + a_1 x_1 + a_2$

Minimize Objective

$\min_{a_0,a_1,a_2} \sum_{i=1}^{n} \left(y_i - y_{fit,i}\right)^2$

where n is the length of y and a0-a2 are adjusted to minimize the sum of the squared errors.

Report the parameter values and the $R^2$ value of the fit, and display a plot of the results.

Solution

As with univariate linear regression, there are several methods for multiple regression in Python. Three packages generate the solution here, because fewer Python packages support multiple or multivariate linear regression. The methods are from the packages:

  1. numpy.linalg
  2. statsmodels ordinary least squares
  3. gekko optimization (allows constraints)
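The numpy.linalg approach for two inputs can be sketched as follows. The data arrays are written so that they reproduce the coefficients in the statsmodels summary for this example (0.2003, -0.0750, and -0.2883). The variable names are assumptions for this sketch.

```python
import numpy as np

# Example 2 data, with the signs that reproduce the reported coefficients
x0 = np.array([4, 5, 2, 3, -1, 1, 6, 7], dtype=float)
x1 = np.array([3, 2, 3, 4, 3, 5, 2, 6], dtype=float)
y = np.array([0.3, 0.8, -0.05, 0.1, -0.8, -0.5, 0.5, 0.65])

# Design matrix [x0 x1 1]: parameters are [a0, a1, a2]
X = np.column_stack((x0, x1, np.ones_like(x0)))
a = np.linalg.lstsq(X, y, rcond=None)[0]

# Coefficient of determination from residual and total sums of squares
y_fit = X @ a
ss_res = np.sum((y - y_fit) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"a0={a[0]:.4f} a1={a[1]:.4f} a2={a[2]:.4f} R^2={r2:.3f}")
```

Adding another input only requires another column in X, which is why the matrix form extends naturally from univariate to multiple regression.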

The statsmodels OLS summary for this fit:

                            OLS Regression Results                            
==========================================================================
Dep. Variable:                      y   R-squared:                   0.933
Model:                            OLS   Adj. R-squared:              0.906
Method:                 Least Squares   F-statistic:                 34.77
Date:                Wed, 26 Aug 2020   Prob (F-statistic):        0.00117
Time:                        23:16:24   Log-Likelihood:             4.6561
No. Observations:                   8   AIC:                        -3.312
Df Residuals:                       5   BIC:                        -3.074
Df Model:                           2                                        
Covariance Type:        nonrobust                                        
==========================================================================
             coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------
x1         0.2003      0.024      8.256      0.000       0.138       0.263
x2        -0.0750      0.046     -1.639      0.162      -0.193       0.043
const     -0.2883      0.186     -1.551      0.182      -0.766       0.190
==========================================================================
Omnibus:                    1.262   Durbin-Watson:                   1.558
Prob(Omnibus):              0.532   Jarque-Bera (JB):                0.075
Skew:                      -0.237   Prob(JB):                        0.963
Kurtosis:                   3.026   Cond. No.                         16.9
==========================================================================

  • Dep. Variable: Model output
  • Model: Regression model (OLS=Ordinary Least Squares)
  • Method: Regression method
  • Date/Time: Time stamp
  • No. Observations: Number of data points
  • Df Residuals: Residual degrees of freedom, equal to the number of data points minus the number of parameters
  • Df Model: Number of parameters, not including the constant term (intercept)
  • R-squared: Coefficient of determination (0-1), a statistical measure of how close the regression line is to the data points (1=perfect alignment)
  • Adj. R-squared: R-squared adjusted for the number of data points and Df Residuals
  • F-statistic: Significance of the fit
  • Prob (F-statistic): P-value of the F-statistic
  • Log-likelihood: log of the likelihood function
  • AIC: Akaike Information Criterion
  • BIC: Bayesian Information Criterion
  • coef: the regression coefficient
  • std err: standard error of the estimated coefficient
  • t: t-statistic value that is a measure of the coefficient significance
  • P > |t|: P-value; if it is less than the significance level (typically 0.05), the coefficient is statistically significant in predicting the output
  • [0.025 0.975]: 95% confidence interval coefficient bounds
  • Skew: measure of data symmetry. With |skewness|>1 the data is highly skewed. If |skewness|<0.5 the data is approximately symmetric.
  • Kurtosis: shape of the distribution that compares data at the center with the tails. Data sets with high kurtosis have heavy tails or more outliers. Data sets with low kurtosis have fewer outliers. Kurtosis is 3 for a normal distribution.
  • Omnibus: D'Agostino's test, a statistical test for the presence of skewness and kurtosis
  • Prob(Omnibus): Omnibus probability
  • Jarque-Bera: Test of skewness and kurtosis
  • Prob (JB): Jarque-Bera probability
  • Durbin-Watson: Test for autocorrelation if the errors have a time-series component
  • Cond. No: Test for multicollinearity, where the inputs are linearly related to each other. A high condition number indicates that some of the inputs and coefficients are not needed.
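Several of the summary entries can be reproduced directly from the residuals. The sketch below does this for the univariate fit with numpy only, using data arrays that reproduce the Example 1 coefficients. The formulas are the standard definitions (biased moment estimates for skew and kurtosis), and the variable names are assumptions for this sketch.

```python
import numpy as np

# Example 1 data, with the signs that reproduce the reported coefficients
x = np.array([4, 5, 2, 3, -1, 1, 6, 7], dtype=float)
y = np.array([0.3, 0.8, -0.05, 0.1, -0.8, -0.5, 0.5, 0.65])

X = np.column_stack((x, np.ones_like(x)))
beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta                      # residuals
n, p = X.shape                        # p counts the constant term

# R-squared and adjusted R-squared (n - p is Df Residuals)
ss_res = np.sum(e**2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p)

# Skew and kurtosis of the residuals from central moments
d = e - e.mean()
m2 = np.mean(d**2)
skew = np.mean(d**3) / m2**1.5
kurt = np.mean(d**4) / m2**2          # 3 for a normal distribution

# Jarque-Bera statistic combines skew and excess kurtosis
jb = n / 6.0 * (skew**2 + (kurt - 3.0) ** 2 / 4.0)

# Durbin-Watson statistic: near 2 indicates no first-order autocorrelation
dw = np.sum(np.diff(e) ** 2) / ss_res

print(f"R^2={r2:.3f} adj R^2={adj_r2:.3f}")
print(f"skew={skew:.3f} kurtosis={kurt:.3f} JB={jb:.3f} DW={dw:.3f}")
```

The printed values can be compared with the corresponding entries of the Example 1 summary table (R-squared, Adj. R-squared, Skew, Kurtosis, Jarque-Bera, and Durbin-Watson).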

Example 3: Scale-up

For large problems, it is important to know how a linear regression package performs with larger data sets on CPU and GPU hardware. The IPython Notebook below evaluates the clock time for problems up to $10^8$ samples.

Linear Regression with CPU (i7-6600U Intel Processor)

Linear Regression with GPU (Google Colab with GPU Kernel)

For the GPU run, test_time was changed to 5.0 to avoid exceeding the RAM limit, which would otherwise require upgrading to a Colab Pro subscription.

The results also include a bottom subplot that shows a neural network trained on the same data.
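A small CPU-only timing sketch is shown below. It is not the notebook referenced above: the synthetic data, the sample sizes, and the helper name time_fit are assumptions chosen so that the snippet runs quickly. It times numpy.linalg.lstsq for increasing sample counts.

```python
import time
import numpy as np

def time_fit(n, seed=0):
    """Time a univariate least-squares fit on n synthetic samples."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 10.0, n)
    # Synthetic line (assumed for this sketch) with Gaussian noise
    y = 0.2 * x - 0.5 + rng.normal(0.0, 0.1, n)
    X = np.column_stack((x, np.ones(n)))
    start = time.perf_counter()
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    elapsed = time.perf_counter() - start
    return elapsed, beta

results = {}
for n in (10**3, 10**4, 10**5):
    elapsed, beta = time_fit(n)
    results[n] = (elapsed, beta)
    print(f"n={n:>7d}  time={elapsed:.4f} s  "
          f"slope={beta[0]:.3f}  intercept={beta[1]:.3f}")
```

Larger sample counts can be added to the loop, with memory use growing in proportion to n because the design matrix is stored densely.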

MATLAB Live Script


✅ Knowledge Check

1. Which of the following statements best describes Linear Regression?

A. Linear regression can only predict discrete values.
B. Linear regression assumes there is a nonlinear relationship between the predictor and response variables.
C. In linear regression, the errors (residuals) should be i.i.d. (independent and identically distributed) random variables.
D. The formula for linear regression is y=m×x, where m is the intercept and x is the slope.

2. In multiple linear regression, how do the predictor variables relate to the response variable?

A. Each predictor variable contributes to the prediction of the response variable in a nonlinear manner.
B. Only one predictor variable is used to predict the response variable.
C. The relationship between the predictor variables and the response variable is represented using multiple equations.
D. Each predictor variable contributes linearly to the prediction of the response variable.