Regression is the method of adjusting parameters in a model to minimize the difference between the predicted output and the measured output. The predicted output is calculated from a single measured input (univariate linear regression), multiple inputs with a single output (multiple linear regression), or multiple inputs and multiple outputs (multivariate linear regression).
linear regression: x and y are scalars
multiple linear regression: x is a vector, y is a scalar response
multivariate linear regression: x is a vector, y is a vector response
In machine learning terminology, the data inputs (x) are features and the measured outputs (y) are labels. For a single input and single output, m is the slope and c is the intercept.
$$y=m x + c$$
An alternate way to write this is in matrix form, with the slope renamed to `\beta_1` and the intercept to `\beta_2`.
Capital letters are often used to indicate when there are multiple inputs (X) or multiple outputs (Y). The difference between the predicted `X \beta` and measured `Y` output is the error `\epsilon`.
$$Y=X \beta + \epsilon$$
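For the univariate case, the matrix form stacks the measured inputs with a column of ones so that the intercept is part of `\beta` (a worked expansion of the equation above):

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$

The column of ones is the same device used in the code below with np.vstack and sm.add_constant.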
Linear regression analysis determines whether the error `\epsilon` has certain statistical properties. A common requirement is that the errors (residuals) are normally distributed (`N(\mu,\Sigma)`) with zero mean `\mu=0` and covariance `\Sigma=\sigma^2 I` (a scaled identity matrix, meaning constant variance and no correlation between errors). This implies that the residuals are i.i.d. (independent and identically distributed) random variables. Statistical tests determine whether the data fit a linear regression model or whether there are unmodeled features of the data that require a different type of regression model.
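As a minimal sketch of such a residual check (assuming the measured y and predicted yfit from one of the fits below are available as 1D arrays), the D'Agostino normality test from scipy evaluates whether the residuals are consistent with a normal distribution:

import numpy as np
from scipy import stats

# residuals between measured and predicted values
eps = np.asarray(y) - np.asarray(yfit)

# D'Agostino K^2 test: a small p-value suggests the residuals are not normal
stat, pval = stats.normaltest(eps)
print('normality p-value:', pval)

# the mean of the residuals should be near zero for an unbiased fit
print('residual mean:', np.mean(eps))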
Two examples demonstrate multiple Python methods for (1) univariate linear regression and (2) multiple linear regression.
Example 1: Linear Regression
Objective: Perform univariate (single input factor) linear regression on sample data with and without a parameter constraint.
For linear regression, find the unknown parameters a0 (slope) and a1 (intercept) that minimize the difference between the measured y and the predicted yfit:

$$\min_{a_0,a_1} \sum_{i=1}^{n} \left(y_i - \mathrm{yfit}_i\right)^2 \quad \mathrm{with} \quad \mathrm{yfit}_i = a_0 x_i + a_1$$

where n is the length of y and a0 and a1 are adjusted to minimize the sum of the squared errors.
Report the parameter values, the R^2 value of the fit, and display a plot of the results. Enforce a constraint that the intercept is greater than -0.5 and show the effect of that constraint on the regression fit compared to the unconstrained least squares solution.
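The constrained case that Gekko solves below is the same least-squares objective with a bound on the intercept (a restatement of the requirement above):

$$\min_{a_0,a_1} \sum_{i=1}^{n} \left(y_i - (a_0 x_i + a_1)\right)^2 \quad \mathrm{subject\ to} \quad a_1 \ge -0.5$$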
Solution
There are many methods for regression in Python; five different packages can generate the solution. The unconstrained methods all give the same least-squares result, but the approaches differ. Three of the methods are shown below: numpy (linalg), statsmodels, and Gekko, which also handles the parameter constraint.
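The first two methods are not reproduced in this excerpt. As a minimal sketch of two common alternatives (not necessarily the original Method 1 and Method 2), numpy polyfit and scipy linregress both fit a line. The rsq helper used by the code below is assumed to compute the coefficient of determination, and a possible definition is included; x and y are assumed to be the sample data as 1D arrays.

import numpy as np
from scipy import stats

# assumed helper: coefficient of determination (R^2)
def rsq(y,yfit):
    y = np.asarray(y); yfit = np.asarray(yfit)
    ss_res = np.sum((y-yfit)**2)
    ss_tot = np.sum((y-np.mean(y))**2)
    return 1.0 - ss_res/ss_tot

# sketch: numpy polyfit (degree-1 polynomial fit)
p = np.polyfit(x,y,1)            # p[0]=slope, p[1]=intercept
print(p, rsq(y,np.polyval(p,x)))

# sketch: scipy linregress
res = stats.linregress(x,y)
print(res.slope, res.intercept, res.rvalue**2)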
# Method 3: numpy linalg solution
#   y = X a  =>  X^T y = X^T X a  (normal equations)
X = np.vstack((x,np.ones(len(x)))).T # matrix operations
XX = np.dot(X.T,X)
XTy = np.dot(X.T,y)
a = np.linalg.solve(XX,XTy) # same solution with lstsq
a = np.linalg.lstsq(X,y,rcond=None)[0]
yfit = a[0]*x+a[1]
print(a)
print('R^2 matrix = '+str(rsq(y,yfit)))
# Method 4: statsmodels ordinary least squares
X = sm.add_constant(x,prepend=False)
model = sm.OLS(y,X).fit()
yfit = model.predict(X)
a = model.params
print(model.summary())
# Method 5: Gekko for constrained regression
m = GEKKO(remote=False); m.options.IMODE=2          # IMODE=2: regression mode
c = m.Array(m.FV,2); c[0].STATUS=1; c[1].STATUS=1   # two adjustable parameters (slope, intercept)
c[1].lower=-0.5                                     # constraint: intercept >= -0.5
xd = m.Param(x); yd = m.Param(y); yp = m.Var()
m.Equation(yp==c[0]*xd+c[1])
m.Minimize((yd-yp)**2)
m.solve(disp=False)
c = [c[0].value[0],c[1].value[0]]
print(c)
# plot data and regressed line
plt.plot(x,y,'ko',label='data')
xp = np.linspace(-2,8,100)
slope = str(np.round(a[0],2))
intercept = str(np.round(a[1],2))
eqn = 'LstSQ: y='+slope+'x'+intercept
plt.plot(xp,a[0]*xp+a[1],'r-',label=eqn)
slope = str(np.round(c[0],2))
intercept = str(np.round(c[1],2))
eqn = 'Constraint: y='+slope+'x'+intercept
plt.plot(xp,c[0]*xp+c[1],'b--',label=eqn)
plt.grid()
plt.legend()
plt.show()
Example 2: Multiple Linear Regression

Objective: Perform multiple linear regression (two input factors) on sample data. Find the unknown parameters a0, a1 (slopes), and a2 (intercept) that minimize the difference between the measured y and the predicted yfit:

$$\min_{a_0,a_1,a_2} \sum_{i=1}^{n} \left(y_i - \mathrm{yfit}_i\right)^2 \quad \mathrm{with} \quad \mathrm{yfit}_i = a_0 x_{0,i} + a_1 x_{1,i} + a_2$$

where n is the length of y and a0, a1, and a2 are adjusted to minimize the sum of the squared errors.
Report the parameter values, the R2 value of fit, and display a plot of the results.
Solution
As with univariate linear regression, there are several methods for multiple linear regression in Python, although fewer packages support multiple or multivariate regression. Three packages are used below to generate the solution: numpy (linalg), statsmodels, and Gekko.
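Another common option, not part of the three methods below, is scikit-learn; a minimal sketch, assuming x0, x1, and y are 1D arrays of the sample data:

import numpy as np
from sklearn.linear_model import LinearRegression

# design matrix with the two input factors as columns
X2 = np.column_stack((x0,x1))

# fit_intercept=True (the default) adds the constant term a2 automatically
lm = LinearRegression().fit(X2,y)
print(lm.coef_, lm.intercept_)   # slopes a0, a1 and intercept a2
print('R^2 =', lm.score(X2,y))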
# Method 1: numpy linalg solution
#   Y = X a  =>  X^T Y = X^T X a  (normal equations)
X = np.vstack((x0,x1,np.ones(len(x0)))).T
a = np.linalg.lstsq(X,y,rcond=None)[0]
print(a)
yfit = a[0]*x0+a[1]*x1+a[2]
print('R^2 = '+str(rsq(y,yfit)))
# Method 2: statsmodels ordinary least squares
model = sm.OLS(y,X).fit()
predictions = model.predict(X)
print(model.summary())
# Method 3: gekko
m = GEKKO(remote=False); m.options.IMODE=2
c = m.Array(m.FV,3)
for ci in c:
    ci.STATUS=1
xd = m.Array(m.Param,2); xd[0].value=x0; xd[1].value=x1
yd = m.Param(y); yp = m.Var()
s = m.sum([c[i]*xd[i] for i in range(2)])
m.Equation(yp==s+c[-1])
m.Minimize((yd-yp)**2)
m.solve(disp=False)
a = [c[i].value[0] for i in range(3)]
print(a)
# plot data
from mpl_toolkits import mplot3d
from matplotlib import cm
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.plot3D(x0,x1,y,'ko')
x0t = np.arange(-1,7,0.25)
x1t = np.arange(2,6,0.25)
X0,X1 = np.meshgrid(x0t,x1t)
Yt = a[0]*X0+a[1]*X1+a[2]
ax.plot_surface(X0,X1,Yt,cmap=cm.coolwarm,alpha=0.5)
plt.show()
The statsmodels summary output (model.summary()) also reports several diagnostic statistics on the residuals:

Skewness: a measure of data symmetry. If |skewness| > 1 the data are highly skewed; if |skewness| < 0.5 the data are approximately symmetric.
Kurtosis: shape of the distribution that compares data at the center with the tails. Data sets with high kurtosis have heavy tails or more outliers. Data sets with low kurtosis have fewer outliers. Kurtosis is 3 for a normal distribution.
Omnibus: D’Agostino’s test, a statistical test for the presence of skewness and kurtosis
Prob(Omnibus): Omnibus probability
Jarque-Bera: Test of skewness and kurtosis
Prob (JB): Jarque-Bera probability
Durbin-Watson: Test for autocorrelation in the residuals, relevant when the errors have a time-series component
Cond. No: Condition number of the design matrix, a check for multicollinearity (inputs that are strongly correlated with each other). A high condition number indicates that some of the inputs and their coefficients may be redundant.
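These diagnostics can also be computed directly from the residuals; a minimal sketch, assuming model is a fitted statsmodels OLS result and X is the design matrix used in the fit:

import numpy as np
from statsmodels.stats.stattools import jarque_bera, durbin_watson, omni_normtest

resid = model.resid                        # residuals from the fitted OLS model
jb, jb_pv, skew, kurt = jarque_bera(resid) # Jarque-Bera statistic, p-value, skew, kurtosis
omni, omni_pv = omni_normtest(resid)       # D'Agostino omnibus normality test
dw = durbin_watson(resid)                  # values near 2 indicate little autocorrelation
cond_no = np.linalg.cond(X)                # condition number of the design matrix
print(skew, kurt, jb_pv, omni_pv, dw, cond_no)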