## Stochastic Gradient Descent

Stochastic gradient descent (SGD) is an optimization algorithm widely used in machine learning. It is an iterative algorithm that seeks the minimum of a function by taking small steps in the direction of the negative gradient of the function. In other words, at each iteration the parameters are updated in the direction that reduces the function's value.
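The update described above can be written compactly. The symbols here ($\theta$ for the parameters, $\eta$ for the learning rate, $f$ for the function being minimized) are notation assumed for illustration:

$$
\theta_{k+1} = \theta_k - \eta \, \nabla f(\theta_k)
$$

In the stochastic variant, the full gradient $\nabla f(\theta_k)$ is replaced by an estimate computed from a randomly selected observation or small batch of observations:

$$
\theta_{k+1} = \theta_k - \eta \, \nabla f_{i_k}(\theta_k)
$$

where $i_k$ indexes the observation (or batch) drawn at iteration $k$.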

Stochastic gradient descent is a simple and very efficient approach to fitting linear models. It is particularly useful when the number of samples is very large. It supports different loss functions and penalties for classification. The tutorial by Singh listed under Additional Information shows how to develop SGD from scratch.
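One way SGD handles a large number of samples is by processing the data in chunks. The following is a minimal sketch using the `partial_fit` method of the *scikit-learn* `SGDClassifier`. The synthetic data, the chunk size, and the choice of `loss='hinge'` with `penalty='l2'` are assumptions made only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data stands in for a large dataset (assumption for illustration)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
classes = np.unique(y)

# The loss function and penalty are selectable options of the classifier
clf = SGDClassifier(loss='hinge', penalty='l2', random_state=0)

# Feed the data one chunk at a time; the full list of classes
# must be supplied on the first call to partial_fit
chunk = 200
for start in range(0, len(X), chunk):
    clf.partial_fit(X[start:start + chunk], y[start:start + chunk],
                    classes=classes)

train_accuracy = clf.score(X, y)
print("Training accuracy: %0.2f" % train_accuracy)
```

Each call updates the model with only the chunk it receives, so the full dataset does not need to be passed to the classifier at once.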

**Advantages:** Efficiency and ease of implementation.

**Disadvantages:** Requires a number of hyper-parameters and is sensitive to feature scaling.
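A common way to address the sensitivity to feature scaling is to standardize the features before fitting. The sketch below chains `StandardScaler` and `SGDClassifier` in a *scikit-learn* pipeline. The dataset and the hyper-parameter values (`alpha`, `max_iter`, `tol`) are assumptions chosen for illustration, not recommended settings.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize features (zero mean, unit variance), then fit the SGD classifier.
# alpha (regularization strength), max_iter, and tol are hyper-parameters
# set here only as an example.
model = make_pipeline(
    StandardScaler(),
    SGDClassifier(alpha=1e-4, max_iter=1000, tol=1e-3, random_state=0),
)
model.fit(X_train, y_train)

scaled_accuracy = model.score(X_test, y_test)
print("Accuracy with scaling: %0.2f" % scaled_accuracy)
```

Placing the scaler inside the pipeline means the scaling statistics are computed from the training set only and then reused on the test set.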

**Gradient Descent in Python**

Here is a simple example of gradient descent in Python:

```python
# initial guess for the parameters
p = [1, 2, 3]

# define the function to be minimized
def my_func(p):
    return (p[0]-4)**2 + (p[1]-5)**2 + (p[2]-6)**2

# define the gradient of the function
def grad(p):
    return [2*(p[0]-4), 2*(p[1]-5), 2*(p[2]-6)]

# set the learning rate
lr = 0.01

# perform gradient descent for a number of iterations
for i in range(1000):
    # calculate the gradient
    g = grad(p)
    # update the parameters
    p[0] = p[0] - lr * g[0]
    p[1] = p[1] - lr * g[1]
    p[2] = p[2] - lr * g[2]

# print the final parameters
print(p)
```

In this example, we define a simple function to be minimized along with its gradient, and then use gradient descent to find the minimum by updating the parameters in the direction of the negative gradient. After 1000 iterations, the final parameter values are close to the minimum of the function at (4, 5, 6). Other algorithms, such as conjugate gradient, Newton's method, and steepest descent, also use gradient (derivative) information to find an optimal solution.
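For this particular quadratic function, each update multiplies the distance to the minimum by the factor `1 - 2*lr`, because `p - lr*2*(p - 4) - 4 = (1 - 2*lr)*(p - 4)` for the first parameter and likewise for the others. The sketch below repeats the gradient descent example in vectorized NumPy form (NumPy and the variable names `target` and `predicted_error` are additions for illustration) and checks the result against that prediction.

```python
import numpy as np

target = np.array([4.0, 5.0, 6.0])   # minimizer of the quadratic function
p = np.array([1.0, 2.0, 3.0])        # same starting point as the example above
lr = 0.01
n_iter = 1000

p0 = p.copy()
for _ in range(n_iter):
    g = 2.0 * (p - target)   # gradient of the sum of squared differences
    p = p - lr * g           # vectorized update of all three parameters

# Each step scales the error by (1 - 2*lr), so after n_iter steps:
predicted_error = (1.0 - 2.0 * lr) ** n_iter * (p0 - target)

print(p)
print(np.allclose(p - target, predicted_error))
```

With `lr = 0.01` the factor is 0.98, which is why many iterations are needed. A larger learning rate converges faster on this function, but a value above 1.0 would make the factor exceed 1 in magnitude and the iteration would diverge.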

**Stochastic Gradient Descent in Python**

In *stochastic* gradient descent, observations are chosen randomly from the training set, so each iteration trains on a random subset of the training data. The tutorial by Singh listed under Additional Information gives a complete example of SGD from scratch. The *scikit-learn* package also provides an SGD classification algorithm.

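```python
from sklearn.linear_model import SGDClassifier

# XA, yA: training features and labels; XB: test features
sgd = SGDClassifier(loss='modified_huber',
                    shuffle=True,
                    random_state=101)
sgd.fit(XA, yA)
yP = sgd.predict(XB)
```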
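To make the random selection of observations concrete, here is a minimal from-scratch sketch of mini-batch SGD for a one-variable linear regression. It is not taken from the referenced tutorial. The synthetic data, the batch size, the learning rate, and the number of epochs are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear data: y = 3*x + 2 plus small noise (assumption for illustration)
X = rng.uniform(-1.0, 1.0, size=(500, 1))
y = 3.0 * X[:, 0] + 2.0 + 0.1 * rng.standard_normal(500)

w, b = 0.0, 0.0      # model parameters (slope and intercept)
lr = 0.1             # learning rate
batch_size = 16

for epoch in range(50):
    order = rng.permutation(len(X))          # reshuffle observations each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        err = w * X[idx, 0] + b - y[idx]     # prediction error on the mini-batch
        # Gradient of the mean squared error over the mini-batch only
        grad_w = 2.0 * np.mean(err * X[idx, 0])
        grad_b = 2.0 * np.mean(err)
        w -= lr * grad_w
        b -= lr * grad_b

print("w = %0.2f, b = %0.2f" % (w, b))
```

Each update uses only a small random batch, so the gradient is a noisy estimate of the full gradient. The estimates for `w` and `b` should still settle near the slope and intercept used to generate the data.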

**Optical Character Recognition with SGD in Python**

Optical character recognition (OCR) is a method used to convert images of text into machine-readable text. It is typically used to process scanned documents and recognize the text in the images. Stochastic gradient descent (SGD) is a method for training a machine learning model, such as a neural network, by iteratively updating the model's parameters to minimize a loss function. In the context of OCR, SGD can be used to train a neural network to recognize characters in images of text.

Here is an example of using SGD as a solver to train an OCR neural network model in Python:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Load the dataset of images of handwritten digits
digits = load_digits()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

# Create a neural network classifier
clf = MLPClassifier(max_iter=1000, tol=1e-3,
                    solver='sgd', random_state=0)

# Train the model using the training set
clf.fit(X_train, y_train)

# Evaluate the model performance on the test set
accuracy = clf.score(X_test, y_test)
print("Accuracy: %0.2f" % accuracy)
```

In this example, we use the *scikit-learn* library to load the dataset of images of handwritten digits, split the dataset into training and testing sets, and train a neural network classifier using SGD. We then evaluate the model's performance on the test set by computing the accuracy, which is the proportion of test images that the model correctly identifies.
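The accuracy calculation can be illustrated on a small example. The labels and predictions below are hypothetical values made up for illustration. The manual calculation is compared with `accuracy_score` from *scikit-learn*.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions, for illustration only
y_true = np.array([0, 1, 2, 2, 1, 0, 3, 3])
y_pred = np.array([0, 1, 2, 1, 1, 0, 3, 2])

manual = np.mean(y_true == y_pred)          # fraction of matching entries
library = accuracy_score(y_true, y_pred)    # same quantity from scikit-learn

print(manual, library)   # 6 of 8 predictions match, so both are 0.75
```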

**SGD scikit-learn Classifier**

SGD is also available as a stand-alone classifier in *scikit-learn*, not just as a solver for other methods. Below is an example of OCR with the *SGDClassifier*.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

classifier = SGDClassifier(loss='modified_huber',
                           shuffle=True,
                           random_state=101)

# The digits dataset
digits = datasets.load_digits()
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

# Split into train and test subsets (50% each)
X_train, X_test, y_train, y_test = train_test_split(
    data, digits.target, test_size=0.5, shuffle=False)

# Learn the digits on the first half of the digits
classifier.fit(X_train, y_train)

# Test on a randomly selected digit from the second half of the data
n = np.random.randint(int(n_samples/2), n_samples)
print('Predicted: ' + str(classifier.predict(digits.data[n:n+1])[0]))

# Show number
plt.imshow(digits.images[n], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
```
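The example above checks a single randomly selected digit. To summarize performance over the entire second half of the data, one option is to compute the test accuracy and a confusion matrix. This is a sketch that repeats the same split and classifier settings; the use of `confusion_matrix` is an addition for illustration.

```python
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
data = digits.images.reshape((len(digits.images), -1))

# Same 50/50 unshuffled split as the example above
X_train, X_test, y_train, y_test = train_test_split(
    data, digits.target, test_size=0.5, shuffle=False)

classifier = SGDClassifier(loss='modified_huber',
                           shuffle=True,
                           random_state=101)
classifier.fit(X_train, y_train)

# Fraction of test digits classified correctly
test_accuracy = classifier.score(X_test, y_test)

# Rows are true digits, columns are predicted digits
cm = confusion_matrix(y_test, classifier.predict(X_test))

print("Test accuracy: %0.2f" % test_accuracy)
print(cm)
```

Off-diagonal entries of the confusion matrix show which digits are mistaken for one another.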

#### Additional Information

- Singh, J. Implementing SGD From Scratch, Towards Data Science.

#### ✅ Knowledge Check

**1.** What is the primary purpose of Stochastic Gradient Descent (SGD)?

**A.** To make the function's value increase with each iteration.

- Incorrect. The primary purpose of SGD is to minimize a function. It works by updating the function's parameters in the direction that reduces its value.

**B.** To find the maximum of a function by taking small steps in random directions.

- Incorrect. SGD seeks to find the minimum, not the maximum, of a function. The steps it takes are based on the negative gradient of the function, not random directions.

**C.** To find the minimum of a function by updating its parameters in the direction of its negative gradient.

- Correct. SGD is an iterative algorithm that seeks to find the minimum of a function by taking small steps in the direction of the negative gradient.

**D.** To efficiently train only non-linear models.

- Incorrect. SGD is an optimization algorithm that can be used to train both linear and non-linear models. Its primary advantage is efficiency, especially with large datasets.

**2.** Which of the following statements about SGD is NOT true?

**A.** SGD is particularly useful when the number of samples is very large.

- Incorrect. This statement is true. SGD is efficient and especially useful when dealing with large datasets.

**B.** SGD requires no hyper-parameters.

- Correct. This statement is false. One of the disadvantages of SGD is that it requires a number of hyper-parameters, such as the learning rate.

**C.** In SGD, observations are chosen randomly from the training set between iterations.

- Incorrect. This statement is true. SGD trains on a random subset of the training set between iterations, which differentiates it from standard gradient descent.

**D.** SGD is sensitive to feature scaling.

- Incorrect. This statement is true. One of the characteristics of SGD is its sensitivity to feature scaling, which is why preprocessing and normalization of features can be crucial when using this algorithm.
