Gradient Descent
Gradient descent is another application where derivatives plays a role. The main objective of gradient descent is to design a scheme that will converge to the (or better a local) minimum of a function.
Given a smooth function \(f:\mathbb{R}^d \to \mathbb{R}\), the goal is to find a local minimum \(\mathbf{x}^\ast\), that is
for any \(\mathbf{y}\) in a neighborhood of \(\mathbf{x}^\ast\). If this relation holds for any \(\mathbf{y}\), then \(\mathbf{x}^\ast\) is called a global minimum.
Since \(f\) is assumed to be smooth, it follows that a necessary (but not sufficient, think about a saddle point) condition for it is that
If an explicit solution or a direct root finding method is not available, we want to construct a sequence \(\mathbf{x}_0, \mathbf{x}_1, \ldots\) such that \(\mathbf{x}_n \to \mathbf{x}^\ast\).
The classical intuition behind the gradient descent, is the situation where you are in the mountains, it is foggy and you want to find your way down to the valley. Your vision is just local but you can estimate in a close radius in which direction the path is going downwards. You move further and update along the way the direction in which you are going.
This principle rely on the following result:
Descent Direction
A vector \(\mathbf{d}\) is called a descent direction at \(\mathbf{x}\) if \(f(\mathbf{x} + t\mathbf{d})<f(\mathbf{x})\) for all \(t>0\) sufficiently small.
Proposition
If \(f\) is smooth, then any \(\mathbf{d}\) such that \(\mathbf{d}^\top \nabla f(\mathbf{x})<0\) is a descent direction.
Note that \(\inf_{\|\mathbf{d}\|=1} \mathbf{d}^\top \nabla f(\mathbf{x}) = -\nabla f(\mathbf{x})\), therefore \(-\nabla f(\mathbf{x})\) is called the steepest descent direction.
Proof
Let \(\mathbf{d}\) be such that \(\mathbf{d}^\top \nabla f(\mathbf{x}) <0\), by continuity of \(\nabla f\) there exist \(t_0\) such that \(\mathbf{d}^\top \nabla f(\mathbf{x}+t\mathbf{d})<0\) for all \(0<t<t_0\). Applying Taylor for any \(0<t<t_0\) it holds
for some \(0<\gamma <1\) from which follows the claim.
A First Try
Following this result and the intuition, we therefore construct the sequence as follows. We start from \(\mathbf{x}_0\) and update recursively with
and stop whenever we have something like \(\nabla f(\mathbf{x}_n) \approx 0\). However from the proof, even if the steeest direction is \(-\nabla f(\mathbf{x})\), the amount in which you walk down along this direction is part of the Taylor expansion and it is not clear how large \(\gamma\) shall be optimally.
- \(\gamma\) large: Even if \(\nabla f(\mathbf{x})\) will decrease to \(0\) as you close the minimum, your iterative step \(\gamma\) might be too large so that you are going to miss it by overshooting and will have to come back, eventually ending up into an oscillatory or unstable behavior.
- \(\gamma\) small: You run into less problems of overshooting, but you will walk slowly making the overall scheme very slow.
A lot of theory has been done on this topic for choosing the right step, with updating as you walk down. It depends very much on the problem, the properties of your function (convex, lipschitz, second differentiable, etc.)
Though we do not enter in this problem here, we consider another strategy to improve the convergence called momentum closely related to exponential moving average, in the sense that we keep in our update a decaying memory of the previous gradients.
Doing a short expansion it follows that
The second term in the sum is an exponential decaying function of the previous step size difference, therefore the term momentum (in time series it is often referred to as exponential moving average).
Implementation using Numpy
import numpy as np
# define the gradient descent function with inputs
# * max_iterations
# * threshold
# * x0 initialization
# * obj_func
# * grad_func
# * alpha (learning rate)
# * momentum
# return the trajectory of the gradient descent as well as the value
def gradient_descent(
max_iterations,
threshold,
x0,
obj_func,
grad_func,
gamma=0.01,
alpha=0.9,
):
# initialize the w and the history
x = x0
x_history = x
f_history = obj_func(x)
deltax = np.zeros(x.shape)
# initialize the iteration counter
i = 0
# initialize the previous loss
diff = 1e10
# loop until the max iterations
while i < max_iterations and diff > threshold:
deltax = alpha * deltax + gamma * grad_func(x)
x = x - deltax
# store the history
x_history = np.vstack([x_history, x])
f_history = np.vstack([f_history, obj_func(x)])
# update i and diff
i += 1
diff = np.abs(f_history[-1] - f_history[-2])
return x_history, f_history
With our gradient descent function at hand let us try on a simple function. We import some plotting functionalities too to illustrate.
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
# define the objective and gradient functions
def func(x):
return np.sum(x**2)
def grad(x):
return 2 * x
# plot the function
x = np.linspace(-10, 10, 100)
y = np.linspace(-10, 10, 100)
X, Y = np.meshgrid(x, y)
cart_prod = np.dstack((X, Y)).reshape(-1, 2)
Z = np.apply_along_axis(func, 1, cart_prod)
fig = go.Figure()
fig.add_trace(
go.Surface(
x=x,
y=y,
z=Z.reshape(100, 100),
colorscale="Inferno",
showscale=False,
)
)
fig.show()
We now run the gradient descent for different learning rate and moment values
learning_rates = np.array([0.05, 0.2, 0.5, 0.8])
momentums = np.array([0.0, 0.5, 0.9])
max_iterations = 5
threshold = 1e-6
fig = make_subplots(cols=len(learning_rates), rows=len(momentums))
for idx, gamma in enumerate(learning_rates):
for idy, alpha in enumerate(momentums):
rng = np.random.default_rng(seed=10)
x0 = 20 * rng.random(2) - 10
x_history, f_history = gradient_descent(
max_iterations,
threshold,
x0,
func,
grad,
gamma=gamma,
alpha=alpha,
)
# add the plot of the function
fig.add_trace(
px.imshow(
Z.reshape(100, 100),
x=x,
y=y,
color_continuous_scale="Oranges",
zmin=100,
zmax=200,
).data[0],
row=idy + 1,
col=idx + 1,
)
# add the curve of the path with arrows
fig.add_trace(
go.Scatter(
x=x_history[:, 0],
y=x_history[:, 1],
mode="lines+markers",
showlegend=False,
marker=dict(
symbol="arrow",
size=15,
angleref="previous",
color="blue",
),
line=dict(color="blue"),
),
row=idy + 1,
col=idx + 1,
)
fig.update_layout(coloraxis_showscale=False)
fig.show()
Basic Image Classification on Real Data
sklearn
provides several datasets, among which are blurred images of \(0\) and \(1\).
The images are 8x8
pixels, that is arrays of length 64
where the value corresponds to the shade of grey (from white to black).
# library to get the dataset as well as the train_test function to split the dataset
import sklearn.datasets as dt
from sklearn.model_selection import train_test_split
# load the images and their feature (if it is 0 or 1)
digits, target = dt.load_digits(n_class=2, return_X_y=True)
# the shape of the dataset
print(digits.shape)
# We plot a short sample
px.imshow(digits.reshape(360, 8, 8)[:10, :, :], facet_col=0, binary_string=True)
The goal it to perform a simple linear regression but using gradient descent by minimizing among all \(\mathbf{w} = [w_0, \ldots, w_{64}]^\top\) the objective function
where \(y_n\) is either \(0\) or \(1\) for the label of the \(n\)-th image and \(\bar{\mathbf{x}}_n = [1, \mathbf{x}_n]\) is the array of the image augmented with a \(1\).
Compute the gradient descent and compare with the linear regression.