Logistic Regression

Logistic regression is a machine learning algorithm for classification. In this algorithm, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.

Logistic regression makes a binary classification prediction based on the sigmoid function with n input features, `x_1 \ldots x_n`.

$$p = \frac{1}{1+e^{-(\beta_0+\beta_1x_1+...+\beta_nx_n)}}$$

The `\beta_i` are coefficients determined with stochastic gradient descent or another optimization method applied to the training data. The desired classification is either 0 or 1, so the probability calculated with the sigmoid function is rounded to the nearest of the two values to produce the predicted class.
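For example, the probability and predicted class can be computed directly. Below is a minimal sketch with made-up coefficients and inputs (the values are hypothetical, for illustration only):

import numpy as np

# Hypothetical coefficients [b0, b1, b2] and one sample [x1, x2]
beta = np.array([0.5, -1.2, 0.8])
x = np.array([2.0, 1.5])

# Sigmoid of the linear combination gives the probability of class 1
p = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1:] @ x)))
print(p, round(p))  # ~0.33, rounds to class 0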

Advantages: Logistic regression is designed for this purpose (classification), and is most useful for understanding the influence of several independent variables on a single outcome variable.

Disadvantages: Assumes that all predictors are independent of each other and that the data set is free of missing values.
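A minimal scikit-learn usage sketch is below; the names `XA` (training features), `yA` (training labels), and `XB` (new features) are placeholders for data defined elsewhere.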

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='lbfgs')
lr.fit(XA, yA)       # train on features XA and labels yA
yP = lr.predict(XB)  # predict labels for new features XB

Optical Character Recognition with Logistic Regression

Optical character recognition (OCR) converts images of text into machine-readable text. It is typically used to process scanned documents and recognize the text they contain. In the context of OCR, a logistic regression classifier can be trained to recognize the individual characters in images of text.

Here is an example of using logistic regression for OCR in Python:

# Import necessary libraries
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the dataset of images of handwritten digits
digits = load_digits()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(digits.data,
                                                    digits.target,
                                                    random_state=0)

# Create a logistic regression classifier
clf = LogisticRegression(max_iter=5000, random_state=0)

# Train the model using the training set
clf.fit(X_train, y_train)

# Evaluate the model's performance on the test set
accuracy = clf.score(X_test, y_test)
print("Accuracy: %0.2f" % accuracy)

In this example, we use the scikit-learn library to load the dataset of images of handwritten digits, split the dataset into training and testing sets, and train a logistic regression classifier. We then evaluate the model's performance on the test set by computing the accuracy, which is the proportion of test images that the model correctly identifies. Below is an additional example that shows a sample image from the test set.

from sklearn import datasets
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(max_iter=5000, solver='lbfgs')

# The digits dataset
digits = datasets.load_digits()
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

# Split into train and test subsets (50% each)
X_train, X_test, y_train, y_test = train_test_split(
    data, digits.target, test_size=0.5, shuffle=False)

# Learn the digits on the first half of the data
classifier.fit(X_train, y_train)

# Test on a random digit from the second half of the data
n = np.random.randint(n_samples // 2, n_samples)
print('Predicted: ' + str(classifier.predict(digits.data[n:n+1])[0]))

# Show number
plt.imshow(digits.images[n], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
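As in the first example, the overall accuracy on the held-out half can be checked with the classifier's score method:

# Fraction of the test digits classified correctly
print('Accuracy: %0.2f' % classifier.score(X_test, y_test))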

Logistic Regression Programming

Using only Python code (not a machine learning package), create a Logistic Regression classifier. Use the blobs test data to evaluate the performance of the code with any number of features (n_features). First develop the code to test 2 features, using the source code blocks below as needed.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
import seaborn as sns

# Generate blobs dataset
features, label = make_blobs(n_samples=800, centers=2,
                             n_features=2, random_state=12)
data = pd.DataFrame()
data['x1'] = features[:,0]
data['x2'] = features[:,1]
data['y']  = label

sns.pairplot(data, hue='y')

Split the data into train (80%) and test (20%) sets.

from sklearn.model_selection import train_test_split
train, test = train_test_split(data.values, test_size=0.2)

Create a function that makes a prediction from the input features and coefficients.

$$p = \frac{1}{1+e^{-(\beta_0+\beta_1x_1+...+\beta_nx_n)}}$$

The prediction `\hat y` is `p` rounded to the nearest of 0 and 1 to produce the final class label.

from math import exp

def predict(row, beta):
    # linear combination of the two input features, then the sigmoid
    x = row[0:2]
    t = beta[0] + beta[1]*x[0] + beta[2]*x[1]
    return 1.0 / (1.0 + exp(-t))
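With all coefficients initialized to zero, `t` is zero and the sigmoid returns 0.5 for any row, which gives a quick sanity check before training:

print(predict(train[0], [0.0, 0.0, 0.0]))  # 0.5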

Update the intercept coefficient `\beta_0` with

$$\beta_0 = \beta_0 + l_{rate} (err) p (1-p)$$

and the feature coefficients `\beta_{i+1}` with

$$\beta_{i+1} = \beta_{i+1} + l_{rate} (err) p (1-p) x_i$$

where `l_{rate}` is the learning rate (use 0.3), `err` is the error (difference) between the measured label `y` and the predicted probability `p`, and `x_i` is the input feature. Continue for 100 epochs (full passes through the training set that update `\beta`) and plot the loss function.

l_rate = 0.3
n_epoch = 100

loss = np.zeros(n_epoch)
beta = [0.0,0.0,0.0]
for epoch in range(n_epoch):
    sum_error = 0
    for row in train:
        x = row[0:-1] # input features
        y = row[-1]   # output label
        p = predict(row, beta)
        error = y - p
        sum_error += error**2
        beta[0] += l_rate * error * p * (1.0 - p)
        beta[1] += l_rate * error * p * (1.0 - p) * x[0]
        beta[2] += l_rate * error * p * (1.0 - p) * x[1]
    loss[epoch] = sum_error

print('Coefficients:', beta)
plt.plot(loss)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()

Test the logistic regression model on the remaining 20% of the data not used for training.

yhat = []
for row in test:
    yhat.append(round(predict(row, beta)))

from sklearn.metrics import confusion_matrix
cmat = confusion_matrix(test[:,-1], yhat)
sns.heatmap(cmat, annot=True)
plt.show()
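The overall accuracy can also be recovered from the confusion matrix as the fraction of predictions on the diagonal:

# diagonal entries of the confusion matrix are correct predictions
accuracy = np.trace(cmat) / np.sum(cmat)
print('Accuracy: %0.2f' % accuracy)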

As a final step, make the code general to accept any number of input features. Test with a random integer for n_features between 3 and 10.
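One possible approach (a sketch, not a complete solution) is to replace the hard-coded two-feature expressions with loops over however many features each row contains:

from math import exp

def predict(row, beta):
    # row = [x1, ..., xn, y]; beta = [b0, b1, ..., bn]
    t = beta[0]
    for i in range(len(row) - 1):
        t += beta[i + 1] * row[i]
    return 1.0 / (1.0 + exp(-t))

The coefficient list is then initialized with `beta = [0.0] * (n_features + 1)`, and the training loop updates each `beta[i+1]` inside a similar loop over the features.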

✅ Knowledge Check

1. Which statement best describes Logistic Regression?

A. It's an algorithm used primarily for regression problems.
Incorrect. Logistic Regression, despite its name, is used primarily for classification problems.
B. It predicts the output based on a linear combination of input features.
Incorrect. Though it uses a linear combination of features, the prediction is transformed using the sigmoid function.
C. Logistic Regression assumes all predictors are independent of each other.
Correct. One of the assumptions of Logistic Regression is that all predictors are independent.
D. It is most effective for multi-class classification problems.
Incorrect. While logistic regression can be adapted for multi-class problems (e.g., using the One-Versus-Rest method), it is inherently binary.

2. In the context of Optical Character Recognition (OCR), what is the primary role of Logistic Regression?

A. To convert images of text into a digital format.
Incorrect. While OCR does convert images of text to digital format, the role of logistic regression is to classify or recognize characters in those images.
B. To train on images to recognize and classify text characters.
Correct. Logistic Regression can be used to recognize characters in images of text by being trained on a dataset of images.
C. To improve the resolution of scanned documents.
Incorrect. The resolution of scanned documents is not affected by logistic regression. Its role is in classification.
D. To detect and remove noise from scanned documents.
Incorrect. Noise removal is a preprocessing step and not the primary role of logistic regression in the context of OCR.

Return to Classification Overview