Stochastic gradient descent (SGD) is a popular optimization algorithm commonly used in machine learning. It is an iterative algorithm that seeks to find the minimum of a function by taking small steps in the direction of the negative gradient of the function. In other words, at each iteration of the algorithm, the parameters of the function are updated in the direction that reduces the function's value.
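
The update described above can be written compactly. The notation here is an assumption for illustration and not taken from the original: $\theta_k$ is the parameter vector at iteration $k$, $f$ is the function being minimized, and $\eta > 0$ is the step size (learning rate).

```latex
\theta_{k+1} = \theta_k - \eta \, \nabla f(\theta_k)
```

A small $\eta$ gives slow but steady progress, while a value that is too large can overshoot the minimum.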

Stochastic gradient descent is a simple and very efficient approach to fit linear models. It is particularly useful when the number of samples is very large. It supports different loss functions and penalties for classification. Here is a tutorial that shows how to develop SGD from scratch.
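
As a rough sketch of what "different loss functions and penalties" can look like in practice, the snippet below fits scikit-learn's `SGDClassifier` with a few loss and penalty combinations. The synthetic data from `make_classification` and the particular combinations are assumptions chosen for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data, used only for illustration (not from the original)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit one linear classifier per loss/penalty combination
scores = {}
for loss in ['hinge', 'modified_huber', 'perceptron']:
    for penalty in ['l2', 'l1', 'elasticnet']:
        clf = SGDClassifier(loss=loss, penalty=penalty, random_state=0)
        clf.fit(X_train, y_train)
        scores[(loss, penalty)] = clf.score(X_test, y_test)

# Report the test accuracy of each combination
for (loss, penalty), acc in scores.items():
    print(f"{loss:15s} {penalty:10s} accuracy = {acc:.2f}")
```

Which combination works best depends on the data, so the printed accuracies are a starting point for comparison and not a general ranking.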

Advantages: Efficiency and ease of implementation.

Disadvantages: Requires a number of hyper-parameters, such as the learning rate, and is sensitive to feature scaling.
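
One common way to handle the sensitivity to feature scaling is to standardize the features before fitting. The sketch below compares the same `SGDClassifier` settings with and without a `StandardScaler` step. The wine data set and the hyper-parameter values are assumptions chosen for illustration and are not part of the original tutorial.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Wine data set (assumed example): its features span very different ranges
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same classifier settings, without and with standardization
raw_model = SGDClassifier(max_iter=1000, tol=1e-3, random_state=0)
scaled_model = make_pipeline(
    StandardScaler(),
    SGDClassifier(max_iter=1000, tol=1e-3, random_state=0))

raw_model.fit(X_train, y_train)
scaled_model.fit(X_train, y_train)

raw_acc = raw_model.score(X_test, y_test)
scaled_acc = scaled_model.score(X_test, y_test)
print("Accuracy without scaling: %0.2f" % raw_acc)
print("Accuracy with scaling:    %0.2f" % scaled_acc)
```

Putting the scaler inside the pipeline means it is fit on the training data only and then applied unchanged to the test data.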

Here is a simple example of gradient descent in Python:

# initialize parameters
p = [1, 2, 3]

# define the function to be minimized
def my_func(p):
    return (p[0]-4)**2 + (p[1]-5)**2 + (p[2]-6)**2

# define the gradient of the function
def my_grad(p):
    return [2*(p[0]-4), 2*(p[1]-5), 2*(p[2]-6)]

# set the learning rate
lr = 0.01

# perform gradient descent for a number of iterations
for i in range(1000):
    # evaluate the gradient at the current parameters
    g = my_grad(p)

    # update the parameters
    p[0] = p[0] - lr * g[0]
    p[1] = p[1] - lr * g[1]
    p[2] = p[2] - lr * g[2]

# print the final parameters
print(p)

In this example, we define a simple function to be minimized and its gradient, and then use gradient descent to find the minimum of the function by updating the parameters in the direction of the negative gradient. After 1000 iterations, the final values of the parameters are close to the minimum of the function at p = [4, 5, 6]. Other algorithms such as conjugate gradient, Newton's method, and steepest descent also use gradient (derivative) information to find an optimal solution.

In stochastic gradient descent, observations are chosen randomly from the training set, so each iteration computes the gradient from a random subset of the training data instead of the full set. Here is a tutorial with a complete example of SGD from scratch. Python also has an SGD classification algorithm in the scikit-learn package.

from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier(loss='modified_huber',
                    shuffle=True,
                    random_state=101)
# XA, yA: training features and labels; XB: features to classify
sgd.fit(XA, yA)
yP = sgd.predict(XB)
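
The random sampling that distinguishes SGD from ordinary gradient descent can also be sketched from scratch with NumPy. The example below fits a linear model by mini-batch SGD on synthetic data. The data, the squared-error loss, the batch size, and the learning rate are all assumptions chosen for illustration, and this is not the method from the tutorial referenced above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative): y = 2*x1 - 3*x2 + 1 plus small noise
n = 1000
X = rng.normal(size=(n, 2))
true_w = np.array([2.0, -3.0])
true_b = 1.0
y = X @ true_w + true_b + 0.01 * rng.normal(size=n)

# Start from zero and take small steps along the negative gradient
w = np.zeros(2)
b = 0.0
lr = 0.05
batch_size = 32

for step in range(2000):
    # choose a random mini-batch of observations (with replacement)
    idx = rng.integers(0, n, size=batch_size)
    Xb, yb = X[idx], y[idx]

    # gradient of the mean squared error on the mini-batch only
    err = Xb @ w + b - yb
    grad_w = 2 * Xb.T @ err / batch_size
    grad_b = 2 * err.mean()

    # update the parameters
    w -= lr * grad_w
    b -= lr * grad_b

print("weights:", w, "intercept:", b)
```

Each step uses only 32 of the 1000 observations, so the cost of an iteration does not grow with the size of the training set. That is the reason SGD scales well to large data sets.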

Optical Character Recognition with SGD in Python

Optical character recognition (OCR) is a method used to convert images of text into machine-readable text. It is typically used to process scanned documents and recognize the text in the images. Stochastic gradient descent (SGD) is a method for training a machine learning model, such as a neural network, by iteratively updating the model's parameters to minimize a loss function. In the context of OCR, SGD can be used to train a neural network to recognize characters in images of text.

Here is an example of using SGD as a solver to train an OCR neural network model in Python:

# Import necessary libraries
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Load the dataset of images of handwritten digits
digits = load_digits()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=0)

# Create a neural network classifier
clf = MLPClassifier(max_iter=1000, tol=1e-3,
                    solver='sgd', random_state=0)

# Train the model using the training set
clf.fit(X_train, y_train)

# Evaluate the model performance on the test set
accuracy = clf.score(X_test, y_test)
print("Accuracy: %0.2f" % accuracy)

In this example, we use the scikit-learn library to load the dataset of images of handwritten digits, split the dataset into training and testing sets, and train a neural network classifier using SGD. We then evaluate the model's performance on the test set by computing the accuracy, which is the proportion of test images that the model correctly identifies.
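
To make the accuracy definition concrete, the sketch below computes the proportion of correct predictions by hand and checks that it matches `clf.score`. It repeats the data loading and training so that it runs on its own. The name `manual_accuracy` is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Same data split and classifier settings as in the example above
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)
clf = MLPClassifier(max_iter=1000, tol=1e-3,
                    solver='sgd', random_state=0)
clf.fit(X_train, y_train)

# Accuracy by hand: fraction of test images whose predicted digit is correct
y_pred = clf.predict(X_test)
manual_accuracy = np.mean(y_pred == y_test)

print("Manual accuracy: %0.2f" % manual_accuracy)
print("clf.score:       %0.2f" % clf.score(X_test, y_test))
```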

SGD scikit-learn Classifier

SGD is also available as a stand-alone classifier in scikit-learn, not just as a solver for other methods. Below is an example of OCR with the SGDClassifier.

from sklearn import datasets
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

from sklearn.linear_model import SGDClassifier
classifier = SGDClassifier(loss='modified_huber',
                           shuffle=True,
                           random_state=101)

# The digits dataset
digits = datasets.load_digits()
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

# Split into train and test subsets (50% each)
X_train, X_test, y_train, y_test = train_test_split(
    data, digits.target, test_size=0.5, shuffle=False)

# Learn the digits on the first half of the digits
classifier.fit(X_train, y_train)

# Test on second half of data
n = np.random.randint(int(n_samples/2),n_samples)
print('Predicted: ' + str(classifier.predict(digits.data[n:n+1])[0]))

# Show number
plt.imshow(digits.images[n], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
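
The example above checks a single randomly chosen digit. A fuller picture comes from scoring the whole test half, as in the sketch below. The use of `accuracy_score` and `confusion_matrix` from `sklearn.metrics` is an addition for illustration and not part of the original example. The setup is repeated so the block runs on its own.

```python
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Same data preparation and classifier settings as above
digits = datasets.load_digits()
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
X_train, X_test, y_train, y_test = train_test_split(
    data, digits.target, test_size=0.5, shuffle=False)

classifier = SGDClassifier(loss='modified_huber',
                           shuffle=True,
                           random_state=101)
classifier.fit(X_train, y_train)

# Predict every image in the second half and summarize the results
y_pred = classifier.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("Accuracy on the second half: %0.2f" % acc)
print(cm)  # rows: true digit, columns: predicted digit
```

Off-diagonal entries of the confusion matrix show which digits are mistaken for one another.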


• Singh, J. Implementing SGD From Scratch, Towards Data Science.

#### ✅ Knowledge Check

1. What is the primary purpose of Stochastic Gradient Descent (SGD)?

A. To make the function's value increase with each iteration.
Incorrect. The primary purpose of SGD is to minimize a function. It works by updating the function's parameters in the direction that reduces its value.
B. To find the maximum of a function by taking small steps in random directions.
Incorrect. SGD seeks to find the minimum, not the maximum, of a function. The steps it takes are based on the negative gradient of the function, not random directions.
C. To find the minimum of a function by updating its parameters in the direction of its negative gradient.
Correct. SGD is an iterative algorithm that seeks to find the minimum of a function by taking small steps in the direction of the negative gradient.
D. To efficiently train only non-linear models.
Incorrect. SGD is an optimization algorithm that can be used to train both linear and non-linear models. Its primary advantage is efficiency, especially with large datasets.

2. Which of the following statements about SGD is NOT true?

A. SGD is particularly useful when the number of samples is very large.
Incorrect. This statement is true. SGD is efficient and especially useful when dealing with large datasets.
B. SGD requires no hyper-parameters.
Correct. This statement is false. One of the disadvantages of SGD is that it requires a number of hyper-parameters, such as the learning rate.
C. In SGD, observations are chosen randomly from the training set between iterations.
Incorrect. This statement is true. SGD trains on a random subset of the training set at each iteration, which differentiates it from standard gradient descent.
D. SGD is sensitive to feature scaling.
Incorrect. This statement is true. One of the characteristics of SGD is its sensitivity to feature scaling, which is why preprocessing and normalization of features can be crucial when using this algorithm.