Gradient-Based Optimisation
Introduction
Optimisation is a ubiquitous process; people do it every day. A classic example would be: how do I get somewhere as (fast, cheap, environmentally friendly, …) as possible? Nature optimises too; a big part of what I did in my PhD was finding out how systems of particles optimise their potential energy. Needless to say, optimisation is an important process, and quite a bit of effort has been dedicated to developing algorithms that do it well. As we will see, optimisation algorithms form a key component of neural networks. The first section of these notes introduces some of the theory underpinning these optimisation algorithms, and how to implement them in Python.
Some Background Maths
Before getting stuck into optimisation algorithms, we should first introduce some of the maths needed to understand the algorithms and the notation we’ll use. First, and maybe most important, is the objective function. This function should provide some quantitative description of our process of interest (think of it as a measure of success).
For instance, in our above example of getting to some destination the objective function could be some mathematical expression which involves the time or cost of the journey.
We denote the objective function as $f(\textbf{x})$, where $\textbf{x}$ is an $n$-dimensional vector of parameters.
So in our journey example, $\textbf{x}$ would be a 2D vector where the first parameter ($x_{0}$) would be time and the second parameter ($x_{1}$) would be cost.
We could therefore write $\textbf{x}=(x_{0},x_{1})^{T}$ (where the $^{T}$ denotes the transpose, so that $\textbf{x}$ is a column vector).
Our objective function will then involve some combination of these parameters, for instance this could be:
$$
f(\textbf{x}) = x_{0}^2 + x_{1}^2
$$
I have plotted this function for $x_{0},x_{1}\in[-10,10]$ below, and we can see (from the 3D representation on the left) it has a bowl-like shape (actually we would call this a paraboloid).
We can also see that when $\textbf{x}=(0,0)^{T}$, $f(x_{0},x_{1})=0$. We would say that this point corresponds to the minimum of the function $f(\textbf{x})$.
On the right I’ve shown a 2D contour plot of the same function, as well as the gradient. The gradient for this function is given by: \begin{align} \textbf{g}(\textbf{x}) \equiv \nabla f(x_{0},x_{1}) &= \Big(\dfrac{\partial f}{\partial x_{0}}, \dfrac{\partial f}{\partial x_{1}}\Big)^{T} = \Big(2x_{0}, 2x_{1}\Big)^{T} \end{align} The gradient defines a vector field, which points in the direction of maximum rate of increase for its corresponding function. This is an important property of the gradient which we are going to exploit. If none of this is familiar to you I would suggest going through a course like this one.
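To make this concrete, here is a minimal sketch (just for illustration, not part of the optimiser we’ll build below) that evaluates this $f$ and its gradient $\textbf{g}$ at a few points and checks that taking a small step against the gradient decreases the function:

import numpy as np

# f(x) = x0^2 + x1^2 and its gradient g(x) = (2*x0, 2*x1).
f = lambda x0, x1: x0**2 + x1**2
g = lambda x0, x1: np.array([2*x0, 2*x1])

for x0, x1 in [(3., 4.), (-1., 2.), (0.5, -0.5)]:
    step = np.array([x0, x1]) - 0.01*g(x0, x1)   # small step against the gradient
    print(f(x0, x1), f(*step))                   # the second value is always smaller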
Optimisation Problems
We are interested in solving the general optimisation problem: $$ \min_{\textbf{x}\in\mathbb{R}^{n}}f(\textbf{x}) $$ In other words, we want to find a set of coordinates which correspond to a minimum of the function $f(\textbf{x})$. We are going to do this using an optimisation algorithm. There are different algorithms to choose from, but they all follow the same general process: iteratively update an initial guess for $\textbf{x}$ until some termination condition is satisfied.
The difference between different algorithms comes from how we move from $\textbf{x}_{i}$, at iteration $i$ of our algorithm, to $\textbf{x}_{i+1}$.
We want to write some Python code that can optimise arbitrary functions using different optimisation algorithms.
The first step is to define a Class:
 1  class Optimise:
 2      def __init__(self, X, function, gradient, err, method):
 3          # Initialise input parameters for the optimisation algorithms.
 4          self.X = X                # Coordinates of the function.
 5          self.f = function         # Function to be optimised.
 6          self.g = gradient         # Gradient of the function.
 7          self.err = err            # Threshold convergence value.
 8          self.method = method      # ID of the line search method.
This Class has three key attributes:
- X: The coordinates which we are to optimise, $\textbf{x}$.
- f: The objective function to be optimised, $f(\textbf{x})$.
- g: The gradient of the objective function, $\textbf{g}(\textbf{x})$.
There are two more parameters, err and method. We use the err parameter to define our termination condition, and the method variable to decide which optimisation method to use. For now we’ll just include the gradient descent (steepest descent) algorithm.
 9          # Initialise parameters describing the convergence of the optimisation algorithms.
10          self.nsteps = 0
11          self.path = []
12          self.steps = []
13          # Perform local optimisation.
14          if(method==1):
15              self.gradient_descent()
16          # Extract the coordinates of the local minimum and the path taken to it.
17          self.minimum = self.path[-1]
18          self.path = np.array(self.path).T
We have also created some new variables which we are going to use to monitor the path the algorithms take to the minimum, and their convergence properties. Also, on line 18 we use a function from NumPy (so we need import numpy as np at the top of our file), which we’ll be using a lot more of later on when we actually start to code our optimisation algorithms.
Gradient Descent
In this section we are going to focus on line search strategies – algorithms where we select a direction $\textbf{d}_{i}$ from $\textbf{x}_{i}$, and search along that direction to find a new set of coordinates $\textbf{x}_{i+1}$: $$ \textbf{x}_{i+1} = \textbf{x}_{i} + \alpha_{i}\textbf{d}_{i}, \quad i=0, 1, 2, … $$ where $f(\textbf{x}_{i+1}) < f(\textbf{x}_{i})$ and so we refer to $\textbf{d}_{i}$ as a descent direction. Therefore, we have two tasks at each iteration of our optimisation algorithm:
- Determine $\textbf{d}_{i}$ (the step direction)
- Determine $\alpha_{i}$ (the step length)
If we look back at the 2D contour plot of $f(\textbf{x})$ above (specifically focusing on the gradient field), an obvious choice for $\textbf{d}_{i}$ is staring us in the face. Maybe the most obvious descent direction for any (differentiable) function is $\textbf{d}_{i}=-\textbf{g}_{i}$, and this forms the basis of our entry-level optimisation algorithm:
19      def gradient_descent(self):
20          # Define the initial coordinates for iteration i=0.
21          xi = self.X
22          # Add the initial coordinates to the path to the local minimum.
23          self.path.append(xi)
24          # Calculate the square of the convergence threshold.
25          errsq = self.err**2
26          gd = 1.  # Initialise the squared gradient to 1.
27          # Iteratively update the coordinates using the gradient descent algorithm
28          # until the convergence criterion is met.
29          while gd > errsq:
30              # Calculate the gradient and the square of its magnitude at the current coordinates.
31              gi = np.array(self.g(*xi)); gd = np.dot(gi,gi)
32              # Set the step direction to be the negative of the gradient.
33              di = -gi
34              # Determine the step size for this iteration using the backtracking algorithm.
35              a = self.backtrack(xi=xi,gi=gi,di=di,a0=1)
36              # Update the coordinates.
37              xi = xi + a*di
38              # Update parameters describing the convergence of the optimisation algorithm.
39              self.nsteps += 1; self.path.append(xi); self.steps.append(a)
You’ll see in the code snippet we call a function backtrack() on line 35; this function chooses an appropriate step length $\alpha_{i}$, which we’ll go over in the next section. But in just a few lines of code we’ve written a function which should optimise any function we give it!
Deciding the Step Length
Now that we know how to pick a search direction which points downhill, we need to figure out how far we want to travel in that direction.
You may have noticed in the gradient descent code snippet above (and you’ll see the same in the conjugate gradient method later) that a third method is called: backtrack(xi=xi,gi=gi,di=di,a0=1).
This is the method we use to determine our step size $\alpha_{i}$.
Backtracking
When computing $\alpha_{i}$ we face a tradeoff: we want $\alpha_{i}$ to make a substantial decrease in $f(\textbf{x}_{i})$, but we also don’t want to waste a lot of time choosing its value. In practice we try out a number of candidate $\alpha_{i}$ values, and choose the first one which satisfies some conditions. We are going to use the sufficient decrease condition: $$ f(\textbf{x}_{i}+\alpha_{i}\textbf{d}_{i}) \leq f(\textbf{x}_{i}) + c_{1}\alpha_{i}\textbf{g}(\textbf{x}_{i})^{T}\textbf{d}_{i}, \quad c_{1}\in(0,1) $$ One issue with this inequality is that it’s satisfied for all sufficiently small values of $\alpha_{i}$. This issue can be avoided though by choosing the candidate $\alpha_{i}$ values appropriately. We do this using the backtracking approach:
40      def backtrack(self, xi, gi, di, a0, c1=0.5, tau=0.5):
41          # Calculate the value of the function at the coordinates for the
42          # current iteration of the optimisation algorithm.
43          fi = self.f(*xi)
44          # Calculate the dot product of the gradient and the search direction,
45          # to be used to evaluate the Armijo (sufficient decrease) condition.
46          gd = np.dot(gi, di)
47          ai = a0
48          # While the step size does not provide a sufficient decrease in the function f(X),
49          # shrink the step size using the contraction factor tau.
50          while( self.f( *(xi+ai*di) ) > (fi + c1*ai*gd) ):
51              ai *= tau
52
53          return ai
We select an initial guess for $\alpha^{0}_{i}$ (which we just set to 1 by default) and iteratively decrease its size by the factor $\tau\in(0,1)$ until the sufficient decrease condition is satisfied. We’ll introduce more sophisticated algorithms for choosing the step size when we get to quasi-Newton methods, but for now this does the job.
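To see the backtracking loop in isolation, here is a minimal standalone sketch (separate from our Optimise class, purely for illustration) applying the same idea to the 1D function $f(x)=x^{2}$ starting from $x=5$, where the descent direction is simply the negative of the derivative:

f = lambda x: x**2
df = lambda x: 2*x

x = 5.0
d = -df(x)            # descent direction (negative gradient)
slope = df(x)*d       # the g^T d term used in the sufficient decrease condition
a, c1, tau = 1.0, 0.5, 0.5
# Shrink a by the factor tau until the sufficient decrease condition holds.
while f(x + a*d) > f(x) + c1*a*slope:
    a *= tau
print(a, x + a*d)     # accepted step length and the updated coordinate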
Testing out our Code
Now that we’ve built our optimiser we are ready to test it out. There are a set of canonical test functions for optimisation algorithms, which we’ll use to be absolutely sure our code works.
We’ll start off with the 2D Beale function: $$ f(x,y) = (1.5-x+xy)^{2} + (2.25-x+xy^{2})^{2} + (2.625-x+xy^{3})^{2} $$ which has the minimum $f(3,0.5) = 0$.
First we need to define our function in Python, which we can do really easily using a lambda expression:
f = lambda x,y: (1.5-x+x*y)**2 + (2.25-x+x*y**2)**2 + (2.625-x+x*y**3)**2
Similarly, we need to define our gradient vector: $$ \textbf{g}(x,y) = \Bigg(\dfrac{\partial f(x,y)}{\partial x}, \dfrac{\partial f(x,y)}{\partial y}\Bigg)^{T} $$ where we have: \begin{align} \dfrac{\partial f(x,y)}{\partial x} &= 2\big[(1.5-x+xy)(y-1) + (2.25-x+xy^{2})(y^{2}-1) + (2.625-x+xy^{3})(y^{3}-1)\big] \\ \dfrac{\partial f(x,y)}{\partial y} &= 2\big[(1.5-x+xy)x + (2.25-x+xy^{2})(2xy) + (2.625-x+xy^{3})(3xy^{2})\big] \end{align} Again, we can do this using a lambda expression that returns a list:
g = lambda x,y: [2*( (1.5-x+x*y)*(y-1) + (2.25-x+x*y**2)*(y**2-1) + (2.625-x+x*y**3)*(y**3-1) ),
2*( (1.5-x+x*y)*x + (2.25-x+x*y**2)*(2*x*y) + (2.625-x+x*y**3)*(3*x*y**2) )]
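Analytic gradients like this are easy to get wrong, so a quick optional sanity check (just a sketch, not part of the optimiser) is to compare the g we just defined against a central finite-difference approximation of the f we just defined:

import numpy as np

h = 1.e-6
x0, y0 = 1.3, -0.7
numerical = [(f(x0+h, y0) - f(x0-h, y0)) / (2*h),
             (f(x0, y0+h) - f(x0, y0-h)) / (2*h)]
print(np.allclose(numerical, g(x0, y0)))   # should print True if our gradient is correct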
Now we need to import our Optimise Class (which I have saved in a file called optimiser.py) and call it using the different method IDs:
import optimiser
# Optimise using the Steepest Descent method
sd = optimiser.Optimise(X=[3.,4.],function=f,gradient=g,err=1.e-9,method=1)
print(sd.minimum, sd.nsteps)
Now if we run this we should get something like this:
aneo@computer:~$ python test_function.py
[3. 0.5] 205
It works! We can even watch an animation of our optimisation algorithm at work:
Issues with Gradient Descent
Now (if you didn’t get too excited over the success of our code) you might’ve noticed that it took 205 iterations of our algorithm to reach the minimum. In the video we can see that almost all of those steps are taken when we get really close to the minimum (the video isn’t longer than it needs to be; it’s just that the steps being taken are so small our video doesn’t have the resolution to see them!). There are two main reasons why this happens, which we’ll go through below.
Crawling Behaviour of Gradient Descent
Now while we call $\alpha_{i}$ our step length, technically the size of each step (which we’ll represent as $\Delta\textbf{x}_{i}$) in our optimisation algorithm is only equal to this when $||\textbf{d}_{i}||=1$ (in other words, when our step direction is a unit vector). In practice the size of our step at each iteration is given by: $$ \Delta\textbf{x}_{i} = ||\textbf{x}_{i+1}-\textbf{x}_{i}|| = ||(\textbf{x}_{i}+\alpha_{i}\textbf{d}_{i})-\textbf{x}_{i}|| = \alpha_{i}||\textbf{d}_{i}|| $$ which means that the size of the steps we take at each iteration of the gradient descent algorithm is actually equal to $\Delta\textbf{x}_{i}=\alpha_{i}||\textbf{g}_{i}||$, and we know that as we approach the minimum $||\textbf{g}_{i}||\to 0$. So the closer we get to the minimum, the smaller our steps become.
This is a pretty easy fix though: we can just make our step direction a unit vector instead, $\textbf{d}_{i}=-\textbf{g}_{i}/||\textbf{g}_{i}||$. In our code this would look like
def gradient_descent(self):
.
.
.
# Set the step direction to be the negative of the normalised gradient
di = -gi/np.sqrt(gd)
and then if we rerun our code we get the following output:
aneo@computer:~$ python test_function.py
[3. 0.5] 174
We get a modest improvement in performance, but it’s still not great.
Zig-Zag Behaviour of Gradient Descent
The second issue with gradient descent is that it can zig-zag to the minimum instead of going straight down. This was somewhat apparent for our test run above, but to really highlight this problem we’re going to minimise the following quadratic function instead: $$ f(\textbf{x}) = \textbf{x}^{T}\textbf{A}\textbf{x} + \textbf{x}^{T}\textbf{b} + c $$ where we have set $$ \textbf{A} = \begin{pmatrix} 1 & 0 \\ 0 & 10 \end{pmatrix}, \quad \textbf{b} =\begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad c=0 $$ Since $\textbf{A}$ is symmetric, the gradient of this function is given by $$ \textbf{g}(\textbf{x}) = 2\textbf{A}\textbf{x} + \textbf{b} $$ If we choose our initial point to be $\textbf{x}_{0}=(8, -0.75)^{T}$, then our algorithm really shows this zig-zag behaviour. Below is an animation of our algorithm at work, clearly zigging and zagging to the minimum:
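For completeness, here is a sketch of how this quadratic test could be set up and run with the same Optimise class (assuming, as with the Beale example above, that optimiser.py contains our class and that we pass the function and gradient as lambdas):

import numpy as np
import optimiser

A = np.array([[1., 0.], [0., 10.]])
b = np.array([1., 1.])

# f(x) = x^T A x + x^T b + c (with c = 0) and its gradient g(x) = 2 A x + b.
f = lambda x0, x1: np.dot([x0, x1], np.dot(A, [x0, x1])) + np.dot([x0, x1], b)
g = lambda x0, x1: 2*np.dot(A, [x0, x1]) + b

# Optimise using the gradient descent method from the starting point (8, -0.75).
sd = optimiser.Optimise(X=[8., -0.75], function=f, gradient=g, err=1.e-9, method=1)
print(sd.minimum, sd.nsteps)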
Conjugate Gradient
The zig-zag behaviour of the gradient descent algorithm stems from the fact that (with an exact line search) consecutive search directions are orthogonal to one another (i.e., $\textbf{g}^{T}_{i+1}\textbf{g}_{i}=0$). One way of getting around this issue is by using the conjugate gradient method, so called because its search directions are conjugate to one another. Explaining this deserves its own dedicated resource (which is why I made a blog post on it here).
For now we’ll just say that this method avoids the inefficiencies of gradient descent by introducing some history to our search directions. The principle of the method is exactly the same as the gradient descent algorithm, but now we say our search direction is $\textbf{d}_{i}=-\textbf{g}_{i} + \beta_{i}\textbf{d}_{i-1}$, where $$ \beta_{i} = \begin{cases} 0 & \text{if } i=0 \\ \dfrac{\textbf{g}_{i}^{T}(\textbf{g}_{i}-\textbf{g}_{i-1})}{\textbf{g}_{i-1}^{T}\textbf{g}_{i-1}} & \text{otherwise} \end{cases} $$ There are in fact many valid choices of $\beta_{i}$ that ensure consecutive search directions are conjugate, but this one is known as the Polak–Ribière formula (named after its developers). In practice there can be a few issues with this choice of $\beta$, meaning that it doesn’t always result in a descent direction, but we can avoid this by using $\beta^{+}=\max\{0,\beta\}$ instead.
54      def conjugate_gradient(self, line_method=1):
55          # Define the initial coordinates for iteration i=0.
56          xi = self.X
57          # Add the initial coordinates to the path to the local minimum.
58          self.path.append(xi)
59          # Calculate the square of the convergence threshold.
60          errsq = self.err**2
61          # Initialise variables needed for the main loop of the algorithm.
62          gd = 1.; gj = 1.; dj = 0.
63          # Iteratively update the coordinates using the conjugate gradient algorithm
64          # until the convergence criterion is met.
65          while gd > errsq:
66              # Compute the gradient and the square of its magnitude at the current coordinates.
67              gi = np.array(self.g(*xi))
68              gd = np.dot(gi,gi)
69              # Calculate the search direction for this iteration using the Polak-Ribiere formula.
70              b = np.max([0.,np.dot(gi,(gi-gj)) / np.dot(gj,gj)])
71              di = b*dj - gi
72              # Determine the step size for this iteration using the backtracking algorithm.
73              a = self.backtrack(xi=xi,gi=gi,di=di, a0=1., tau=0.9)
74              # Update the coordinates.
75              xi = xi + a*di
76              # Save the old gradient and search direction, which will be used to calculate
77              # the search direction for the next iteration.
78              gj = np.copy(gi); dj = np.copy(di)
79              # Update parameters describing the convergence of the optimisation algorithm.
80              self.nsteps += 1; self.path.append(xi); self.steps.append(a)
The layout of the two optimisation algorithms is identical; the only difference with the conjugate gradient method is that we need to save the gradient and search direction at each iteration to use in the next one. Now if we optimise our function again using the conjugate gradient method we see that our zig-zag behaviour is greatly reduced!
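One last practical note: the __init__ we wrote earlier only dispatches on method==1, so to actually run the conjugate gradient method we need a second branch. Below is a minimal sketch of how this could look, under the assumption that we give the conjugate gradient method the ID method==2:

# Inside Optimise.__init__ (extending lines 14-15 above):
        if(method==1):
            self.gradient_descent()
        elif(method==2):
            self.conjugate_gradient()

We can then run the same Beale test as before, just changing the method ID:

# Optimise using the Conjugate Gradient method
cg = optimiser.Optimise(X=[3.,4.], function=f, gradient=g, err=1.e-9, method=2)
print(cg.minimum, cg.nsteps)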
This is a good starting point, but in the next section we’ll implement quasi-Newton methods that can have even better performance.