Data Regression with Python

Correlations from data are obtained by adjusting the parameters of a model to best fit the measured outcomes. The analysis may include statistics, data visualization, or other calculations that synthesize the measurements into relevant and actionable conclusions. This tutorial demonstrates how to create linear, polynomial, and nonlinear functions that best approximate the data and how to analyze the result. Script files of the Python source code with sample data are available below.

Linear and Polynomial Regression

import numpy as np

# sample data
x = np.array([0,1,2,3,4,5])
y = np.array([0,0.8,0.9,0.1,-0.8,-1])
print(x)
print(y)

# fit polynomials of degree 1 (linear), 2 (quadratic), and 3 (cubic)
p1 = np.polyfit(x,y,1)
p2 = np.polyfit(x,y,2)
p3 = np.polyfit(x,y,3)

# coefficients are returned with the highest power first
print(p1)
print(p2)
print(p3)
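
The coefficient arrays can also be wrapped with np.poly1d, which turns them into callable polynomials. A minimal sketch of this optional convenience:

# wrap the linear-fit coefficients as a callable polynomial
f1 = np.poly1d(p1)
print(f1(2.5)) # evaluate the linear fit at x=2.5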

import matplotlib.pyplot as plt

# plot the data points and the three polynomial fits
plt.plot(x,y,'o')
xp = np.linspace(-2,6,100)
plt.plot(xp,np.polyval(p1,xp),'r-')
plt.plot(xp,np.polyval(p2,xp),'b--')
plt.plot(xp,np.polyval(p3,xp),'m:')

# compute R^2 for the linear fit
yfit = p1[0] * x + p1[1]
yresid = y - yfit
SSresid = np.sum(yresid**2)
SStotal = len(y) * np.var(y)
rsq = 1 - SSresid/SStotal
print(yfit)
print(y)
print(rsq)

# linear regression statistics with scipy
from scipy.stats import linregress
slope,intercept,r_value,p_value,std_err = linregress(x,y)
print(r_value**2) # R^2, matches the value computed above
print(p_value)    # p-value for the null hypothesis that the slope is zero
plt.show()
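
The R^2 calculation above applies only to the linear fit. The same statistic extends to any polynomial order by evaluating the fit with np.polyval; a minimal sketch, with the helper name poly_rsq chosen here for illustration:

# R^2 for a polynomial fit of any degree
def poly_rsq(p,x,y):
    yfit = np.polyval(p,x)        # evaluate the polynomial at the data points
    SSresid = np.sum((y-yfit)**2) # residual sum of squares
    SStotal = len(y) * np.var(y)  # total sum of squares
    return 1 - SSresid/SStotal

for p in [p1,p2,p3]:
    print(poly_rsq(p,x,y))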

Regression with Python (GEKKO or SciPy)

import numpy as np
from gekko import GEKKO

# data
xm = np.array([18.3447,79.86538,85.09788,10.5211,44.4556, \
               69.567,8.960,86.197,66.857,16.875, \
               52.2697,93.917,24.35,5.118,25.126, \
               34.037,61.4445,42.704,39.531,29.988])
ym = np.array([5.072,7.1588,7.263,4.255,6.282, \
               6.9118,4.044,7.2595,6.898,4.8744, \
               6.5179,7.3434,5.4316,3.38,5.464, \
               5.90,6.80,6.193,6.070,5.737])
# regression
m = GEKKO() # remote=False for local mode
# parameters and variables
a = m.FV(value=0); a.STATUS=1
b = m.FV(value=0); b.STATUS=1
c = m.FV(value=0,lb=-100,ub=100); c.STATUS=1
# load data
x = m.Param(value=xm)
ymeas = m.Param(value=ym)
ypred = m.Var()
# define model
m.Equation(ypred == a + b/x + c*m.log(x))
# minimize the sum of squared relative errors
m.Minimize(((ypred-ymeas)/ymeas)**2)
m.options.IMODE = 2 # regression mode: one model instance per data point
m.solve()

# show final objective (sum of squared relative errors)
print('Final Objective: ' + str(m.options.objfcnval))

# print solution
print('Solution')
print('a = ' + str(a.value[0]))
print('b = ' + str(b.value[0]))
print('c = ' + str(c.value[0]))

# plot solution
import matplotlib.pyplot as plt
plt.figure(1)
plt.plot(xm,ym,'ro')
plt.plot(xm,ypred.value,'bx')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(['Measured','Predicted'],loc='best')
plt.savefig('results.png')
plt.show()
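
Once the parameters are known, the correlation can be evaluated at new points with plain NumPy. A minimal sketch, assuming the same model form and a few illustrative x values:

# evaluate the fitted correlation at new x values
xnew = np.array([15.0,50.0,90.0]) # illustrative points, not from the data set
ynew = a.value[0] + b.value[0]/xnew + c.value[0]*np.log(xnew)
print(ynew)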

import numpy as np
from scipy.optimize import minimize

# load data
xm = np.array([18.3447,79.86538,85.09788,10.5211,44.4556, \
               69.567,8.960,86.197,66.857,16.875, \
               52.2697,93.917,24.35,5.118,25.126, \
               34.037,61.4445,42.704,39.531,29.988])

ym = np.array([5.072,7.1588,7.263,4.255,6.282, \
               6.9118,4.044,7.2595,6.898,4.8744, \
               6.5179,7.3434,5.4316,3.38,5.464, \
               5.90,6.80,6.193,6.070,5.737])

# model prediction for a parameter vector x = [a,b,c]
def calc_y(x):
    a,b,c = x
    y = a + b/xm + c*np.log(xm)
    return y

# objective: sum of squared relative errors
def objective(x):
    return np.sum(((calc_y(x)-ym)/ym)**2)

# initial guesses
x0 = np.zeros(3)

# show initial objective (sum of squared relative errors)
print('Initial Objective: ' + str(objective(x0)))

# optimize
# bounds on variables
bnds100 = (-100.0, 100.0)
no_bnds = (-1.0e10, 1.0e10)
bnds = (no_bnds, no_bnds, bnds100)
solution = minimize(objective,x0,method='SLSQP',bounds=bnds)
x = solution.x
y = calc_y(x)

# show final objective
print('Final Objective: ' + str(objective(x)))

# print solution
print('Solution')
print('a = ' + str(x[0]))
print('b = ' + str(x[1]))
print('c = ' + str(x[2]))

# plot solution
import matplotlib.pyplot as plt
plt.figure(1)
plt.plot(xm,ym,'ro')
plt.plot(xm,y,'bx')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(['Measured','Predicted'],loc='best')
plt.savefig('results.png')
plt.show()
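
An alternative to hand-coding the objective is scipy.optimize.curve_fit, which fits the same model form with an ordinary (unweighted) least-squares objective, so the parameters differ slightly from the relative-error fit above. A minimal sketch, reusing xm and ym:

from scipy.optimize import curve_fit

# same model form, fit by unweighted least squares
def f(x,a,b,c):
    return a + b/x + c*np.log(x)

popt,pcov = curve_fit(f,xm,ym)
print('a,b,c = ' + str(popt))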

While this exercise uses only one independent variable and one dependent variable, any number of independent or dependent terms can be included, as the sketch below illustrates. See the Energy Price regression with three independent variables as a complete example.
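
A minimal multivariate sketch, assuming an illustrative linear-in-parameters model with two independent variables (the data below is made up for demonstration, not the energy price data):

import numpy as np
from scipy.optimize import curve_fit

# illustrative data: two independent variables, one dependent variable
x1 = np.array([1.0,2.0,3.0,4.0,5.0])
x2 = np.array([0.5,0.8,0.2,0.9,0.4])
y  = np.array([2.1,3.9,4.2,6.3,6.1])

# curve_fit passes the tuple (x1,x2) through to the model unchanged
def f(X,a,b,c):
    x1,x2 = X
    return a + b*x1 + c*x2

popt,pcov = curve_fit(f,(x1,x2),y)
print('a,b,c = ' + str(popt))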

Excel and MATLAB

This regression tutorial can also be completed with Excel and MATLAB. A multivariate nonlinear regression case with multiple factors, using example data for energy prices, is also available in Python. Click on the appropriate link for additional information.

There is additional information on regression in the Data Science online course.
