A comprehensive exploration of the mathematical foundations of Stochastic Gradient Descent, crucial for understanding its role as an optimization algorithm in deep learning.
Delving into Stochastic Gradient Descent.
This article delves into the fundamental concept of Stochastic Gradient Descent, elucidating the basic principles from which this optimization algorithm is derived. Among the topics covered are the univariate linear regression model, its $L_2$ loss and closed-form solution, and the batch and stochastic variants of gradient descent.
At its core, a basic linear regression function is employed in various machine learning and statistical modeling applications. Conceptually, it represents a single straight line, defined mathematically by two real-valued parameters, the slope $w_1$ and the intercept $w_0$:

$$h_{\mathbf{w}}(x) = w_1 x + w_0$$
Linear regression optimizes this function by choosing the parameter values that minimize a loss over the training data. In scenarios where the noise in the dependent variable is normally distributed, minimizing the squared-error ($L_2$) loss yields the maximum-likelihood estimate of the parameters.
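Concretely, the quantity minimized throughout this article is the $L_2$ loss over the $N$ training examples $(x_j, y_j)$:

$$\mathit{Loss}(h_{\mathbf{w}}) = \sum_{j=1}^{N} \bigl(y_j - h_{\mathbf{w}}(x_j)\bigr)^2 = \sum_{j=1}^{N} \bigl(y_j - (w_1 x_j + w_0)\bigr)^2$$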
from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from torch import tensor
%matplotlib inline
df = pd.read_csv("houses.csv").drop(columns="Unnamed: 0")
df.sample(n=10)
|    | Price  | Size | Lot   |
|----|--------|------|-------|
| 9  | 123500 | 1161 | 9626  |
| 4  | 160000 | 2536 | 9234  |
| 7  | 145000 | 1572 | 12588 |
| 5  | 85000  | 2368 | 13329 |
| 12 | 156000 | 2240 | 21780 |
| 6  | 85000  | 1264 | 8407  |
| 16 | 182000 | 1320 | 15768 |
| 13 | 146500 | 1269 | 11250 |
| 0  | 212000 | 4148 | 25264 |
| 18 | 125000 | 1274 | 13634 |
Given the price, the size of the house, and the lot, we eliminate the lot so that there is only one independent variable (size).
df_indep = df["Size"].to_numpy()
df_dep = df["Price"].to_numpy()
We create a linear regression model from the sklearn machine learning library. It uses ordinary least squares for optimization, i.e. it finds the global minimum of the $L_2$ loss between pred, the model's prediction, and dep, the value of the dependent variable that the model tries to predict from the independent variable and the parameters. After the model is fitted, the .score method returns the coefficient of determination $R^2$ of the prediction on the given data:
reg = LinearRegression(fit_intercept=True, n_jobs=-1).fit(
    df_indep.reshape(-1, 1), df_dep
)
reg.score(df_indep.reshape(-1, 1), df_dep)
0.4689809992584135
reg.coef_ gives the optimal value for the slope parameter, while reg.intercept_ gives the optimal value for the intercept parameter.
w1 = reg.coef_
w0 = reg.intercept_
print(
f"The optimal value for the slope parameter is: {w1},\nwhile {w0} is the optimal value for the intercept."
)
The optimal value for the slope parameter is: [48.19930529],
while 64553.68328966276 is the optimal value for the intercept.
The function hw implements the generic univariate linear regression function and makes the code below more reusable:
def hw(x, w1, w0):
    # Univariate linear regression hypothesis: h_w(x) = w1 * x + w0.
    return w1 * x + w0
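For example, the fitted parameters can be used to predict the price of a house of a given size; the 1,500 square feet below is just an arbitrary illustrative input.

print(hw(1500, w1, w0))  # predicted price in USD for a house of 1,500 square feet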
The plot below shows the training examples for the independent and dependent variables, together with the linear regression line using the optimal parameters.
fig, ax = plt.subplots(1, 1, figsize=(6, 6), tight_layout=True)
x_vals = np.linspace(min(df_indep) + 10, max(df_indep) + 5, 1000)
ax.scatter(df_indep, df_dep, c="r", label="actual x,y pairs")
ax.plot(x_vals, hw(x_vals, w1, w0), c="y", label="regression line")
ax.set_xlabel("Size of the house in square feet")
ax.set_ylabel("Price in USD")
plt.title("Ordinary Least Squares Univariate Linear Regression using $$L_{2}$$")
plt.legend(loc="best")
plt.show()
Next, we plot the surface of the $L_2$ loss as a function of the two parameters $w_1$ and $w_0$.
def plotls(indep, dep, w1, w0):
    # Evaluate the summed squared-error loss on a grid of parameter values
    # around the fitted optimum (w1, w0).
    slope = np.linspace(w1 - 0.5 * w1, w1 + 0.5 * w1, 20)
    bias = np.linspace(w0 - 0.5 * w0, w0 + 0.5 * w0, 20)
    W1, W0 = np.meshgrid(slope, bias)
    loss = np.power(hw(indep, W1[..., None], W0[..., None]) - dep, 2).sum(axis=-1)
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(projection="3d")
    surface = ax.plot_surface(
        W0,
        W1,
        loss,
        label=r"$\mathit{L}_{2}$ loss surface",
        cmap="viridis",
        edgecolor="none",
    )
    # Workaround so that a 3D surface can appear in the legend.
    surface._facecolors2d = surface._facecolor3d
    surface._edgecolors2d = surface._edgecolor3d
    ax.set_xlabel("w0 - bias")
    ax.set_ylabel("w1 - slope")
    plt.legend(loc="best")
The $L_2$ loss function for linear regression is convex, so any local minimum is also the global minimum.
plotls(df_indep, df_dep, w1, w0)
The gradient of the loss function is zero where the minimum is found. The equations that have to be solved therefore fulfill

$$\frac{\partial}{\partial w_0} \sum_{j=1}^{N} \bigl(y_j - (w_1 x_j + w_0)\bigr)^2 = 0 \qquad \frac{\partial}{\partial w_1} \sum_{j=1}^{N} \bigl(y_j - (w_1 x_j + w_0)\bigr)^2 = 0$$
Solving these equations for the two parameters yields the closed-form solutions

$$w_1 = \frac{N \sum_{j} x_j y_j - \sum_{j} x_j \sum_{j} y_j}{N \sum_{j} x_j^2 - \bigl(\sum_{j} x_j\bigr)^2} \qquad w_0 = \frac{\sum_{j} y_j - w_1 \sum_{j} x_j}{N}$$

We plug in the values for the independent and dependent variables and solve for $w_1$ and $w_0$ in code:
def w1_solve(indep, dep):
    # Closed-form least-squares solution for the slope parameter w1.
    N = len(indep)
    num = N * np.sum(indep * dep) - np.sum(indep) * np.sum(dep)
    denom = N * np.sum(np.power(indep, 2)) - np.power(np.sum(indep), 2)
    opt = num / denom
    return opt
def w0_solve(indep, dep):
    # Closed-form least-squares solution for the intercept parameter w0.
    N = len(indep)
    num = np.sum(dep) - w1_solve(indep, dep) * np.sum(indep)
    opt = num / N
    return opt
Solving for the slope parameter using w1_solve:
w1_solve(df_indep, df_dep)
48.19930529470915
In the same way, we use w0_solve to solve for the intercept parameter:
w0_solve(df_indep, df_dep)
64553.683289662775
The closed-form solutions match the optimal values that LinearRegression calculated for the two parameters. This is confirmed using np.allclose:
rig = [w0, w0_solve(df_indep, df_dep), w1, w1_solve(df_indep, df_dep)]

def allclose(rig):
    print(np.allclose(rig[0], rig[1]))
    print(np.allclose(rig[2], rig[3]))

allclose(rig)
True
True
The univariate linear model always has an optimal closed-form solution, found where the partial derivatives are zero. However, this is not the case for many other models, so we need an algorithm for minimizing loss that does not depend on solving the derivatives for zero and that can be applied to any loss function. The gradient descent optimization algorithm is that optimizer. Its variant, stochastic gradient descent, is widely used in deep learning to drive the training process.
The starting point for gradient descent is any point in the weight space; here, that is a point in the two-dimensional $(w_0, w_1)$ plane.
For each weight $w_i$, the update rule at every step is

$$w_i \leftarrow w_i - \alpha \, \frac{\partial}{\partial w_i} \mathit{Loss}(\mathbf{w})$$

Parameter $\alpha$ is the learning rate, which controls the size of each step.
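Written as code, these steps could look like the following sketch; the helper name gradient_descent and the function loss_grad, which is assumed to return both partial derivatives of the loss at the current point, are illustrative and not part of the code above.

def gradient_descent(loss_grad, w1, w0, alpha=0.01, steps=1000):
    # Repeatedly move each weight a small step against its partial derivative.
    for _ in range(steps):
        dw1, dw0 = loss_grad(w1, w0)  # partial derivatives at the current point
        w1 = w1 - alpha * dw1
        w0 = w0 - alpha * dw0
    return w1, w0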
We calculate the partial derivatives, using the chain rule, and a single pair of independent and dependent variables $(x, y)$:

$$\frac{\partial}{\partial w_i} \mathit{Loss}(\mathbf{w}) = \frac{\partial}{\partial w_i} \bigl(y - h_{\mathbf{w}}(x)\bigr)^2 = 2\,\bigl(y - h_{\mathbf{w}}(x)\bigr) \cdot \frac{\partial}{\partial w_i} \bigl(y - (w_1 x + w_0)\bigr)$$

The partial derivative of the 'inner function' was not specified yet, since it depends on which of the two parameters we differentiate with respect to: the partial derivatives of the inner function with respect to $w_0$ and $w_1$ are $-1$ and $-x$, respectively. Thus, the partial derivatives of the loss function for a single training example are

$$\frac{\partial}{\partial w_0} \mathit{Loss}(\mathbf{w}) = -2\,\bigl(y - h_{\mathbf{w}}(x)\bigr) \qquad \frac{\partial}{\partial w_1} \mathit{Loss}(\mathbf{w}) = -2\,\bigl(y - h_{\mathbf{w}}(x)\bigr)\,x$$
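These derivatives can be checked symbolically, for instance with sympy, an extra dependency that the rest of the code does not use:

import sympy as sp

# Both differences simplify to zero, confirming the chain-rule results above.
xs, ys, w1s, w0s = sp.symbols("x y w1 w0")
loss_single = (ys - (w1s * xs + w0s)) ** 2
print(sp.simplify(sp.diff(loss_single, w0s) - (-2) * (ys - (w1s * xs + w0s))))       # 0
print(sp.simplify(sp.diff(loss_single, w1s) - (-2) * (ys - (w1s * xs + w0s)) * xs))  # 0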
With these equations calculated, it is possible to plug the values into the pseudocode under 'Gradient Descent: Steps'. The constant $-2$ is folded into the learning rate, which yields the update rules for a single training example:

$$w_0 \leftarrow w_0 + \alpha\,\bigl(y - h_{\mathbf{w}}(x)\bigr) \qquad w_1 \leftarrow w_1 + \alpha\,\bigl(y - h_{\mathbf{w}}(x)\bigr)\,x$$
In the case of $N$ training examples, the aim is to minimize the sum of the individual losses, so the updates sum over all examples:

$$w_0 \leftarrow w_0 + \alpha \sum_{j=1}^{N} \bigl(y_j - h_{\mathbf{w}}(x_j)\bigr) \qquad w_1 \leftarrow w_1 + \alpha \sum_{j=1}^{N} \bigl(y_j - h_{\mathbf{w}}(x_j)\bigr)\,x_j$$
The equations for updating the weights are applied after each batch. A batch consists of a specified number of training examples that are loaded into memory at once. Batch gradient descent for univariate linear regression treats the entire training set as one batch and updates the weights only after all of its examples have been processed. It is computationally expensive, since every update sums over all $N$ training examples.
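To make this concrete, a minimal batch gradient descent for the univariate model above could look like the sketch below. It reuses hw, df_indep and df_dep; the standardization of the input, the learning rate and the number of epochs are arbitrary choices made here so that the iteration converges, not values taken from the text.

def batch_gd(indep, dep, alpha=0.01, epochs=2000):
    # Standardize the independent variable so one small learning rate works.
    x_mean, x_std = indep.mean(), indep.std()
    x = (indep - x_mean) / x_std
    w1, w0 = 0.0, 0.0
    for _ in range(epochs):
        error = dep - hw(x, w1, w0)          # residuals over all N examples
        w0 = w0 + alpha * error.sum()        # batch updates, -2 folded into alpha
        w1 = w1 + alpha * (error * x).sum()
    # Map the parameters back to the original (unstandardized) scale.
    return w1 / x_std, w0 - w1 * x_mean / x_std

gd_w1, gd_w0 = batch_gd(df_indep, df_dep)
print(gd_w1, gd_w0)  # close to w1_solve(df_indep, df_dep) and w0_solve(df_indep, df_dep)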
Stochastic gradient descent, or SGD, is a faster variant. It randomly picks a small subset of the training examples at each step and updates the weights using the equations under the heading 'Single Training Example'. It is common for SGD to select a minibatch of $m$ training examples rather than a single one, as sketched below.
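A corresponding minibatch SGD for the same model could look like the following sketch; the minibatch size, learning rate and number of steps are again illustrative choices, and with a constant learning rate the weights fluctuate around the optimum instead of settling on it exactly.

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def sgd(indep, dep, alpha=0.01, steps=5000, m=4):
    x_mean, x_std = indep.mean(), indep.std()
    x = (indep - x_mean) / x_std
    w1, w0 = 0.0, 0.0
    for _ in range(steps):
        batch = rng.choice(len(x), size=m, replace=False)  # random minibatch
        error = dep[batch] - hw(x[batch], w1, w0)
        w0 = w0 + alpha * error.sum()
        w1 = w1 + alpha * (error * x[batch]).sum()
    return w1 / x_std, w0 - w1 * x_mean / x_std

sgd_w1, sgd_w0 = sgd(df_indep, df_dep)
print(sgd_w1, sgd_w0)  # approximately the optimal parameters, up to gradient noise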
The standard error of the mean (SEM) quantifies how much the mean of a sample is expected to deviate from the mean of the full population. It can be calculated with the following formula:

$$\mathrm{SEM} = \frac{\sigma}{\sqrt{n}}$$

where $\sigma$ is the standard deviation and $n$ is the sample size.
Therefore, the standard error shrinks only with the square root of the sample size. That means that, given a minibatch that is 100 times smaller than the full training set, the standard error of the estimate is only 10 times larger. In other words, for this example SGD trades being 100 times less computationally expensive per update against a 10 times larger standard error.
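The scaling can be verified with a quick calculation; the standard deviation and the two sample sizes below are illustrative numbers, not statistics of the housing data.

sigma = 1.0  # assumed standard deviation
for n in (10_000, 100):
    print(n, sigma / np.sqrt(n))  # SEM is 0.01 for n=10000 and 0.1 for n=100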
In this article, we started by introducing the function of a univariate linear regression model, fitted it with ordinary least squares, and explored its $L_2$ loss surface and closed-form solution. The batch gradient descent algorithm was explained, in particular how it updates the weights during each step: we calculated the partial derivatives for the parameters and used them to derive the update rules. Finally, stochastic gradient descent was introduced and compared to batch gradient descent.