Gradient-based optimization#

If information about the gradient is available, it can accelerate convergence substantially.

Newton’s optimization method#

Newton’s optimization routine aims to find the root of the gradient, which is an extremum. Since we are now focused on a scalar \(f(\vec{x})\), the gradient is a vector and we will need the Hessian matrix \(H = \frac{\partial^2 f}{\partial \vec{x}^2}\), with components \(H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}\).

The increment is now solved as:

\[ H \Delta \vec{x} = - \nabla f \]

Note that \(H\) is symmetric.

Newton’s optimization method has excellent convergence properties (quadratic convergence near the minimum) but requires calculation of the Hessian, which can be computationally expensive.

import numpy as np
from scipy.linalg import solve
import plotly.graph_objects as go

def newton_method(f, grad_f, hessian_f, x0, tol=1e-6, max_iter=100):
  x = x0
  print(x)
  guesses = [x]
  for _ in range(max_iter):
    grad = grad_f(x)
    hess = hessian_f(x)

    # Stop once the gradient is (numerically) zero
    if np.linalg.norm(grad) < tol:
      break

    # ~~ What goes here?

    ###
    # Newton step: solve H delta_x = -grad, exploiting the symmetry of H
    delta_x = solve(hess, -grad, assume_a='sym')
    ###
    x = x + delta_x
    guesses.append(x)
    print(x)

  # Create a surface plot of the function
  xs = np.linspace(-2, 2, 100)
  ys = np.linspace(-2, 2, 100)
  X, Y = np.meshgrid(xs, ys)
  Z = f([X, Y])

  fig = go.Figure(data=[go.Surface(x=X, y=Y, z=Z)])

  # Add markers for each guess
  for guess in guesses:
    fig.add_trace(go.Scatter3d(
        x=[guess[0]],
        y=[guess[1]],
        z=[f(guess)],
        mode='markers',
        marker=dict(
            size=5,
            color='red'
        )
    ))

  fig.update_layout(
      title='Newton\'s Method Optimization',
      scene=dict(
          xaxis_title='x',
          yaxis_title='y',
          zaxis_title='f(x,y)'
      )
  )
  fig.show()

Example: Minimize \(x^4 + 2y^4\)#

def f(x):
  return x[0]**4 + 2*x[1]**4

def grad_f(x):
  return np.array([4*x[0]**3, 8*x[1]**3])

def hessian_f(x):
  return np.array([[12*x[0]**2, 0], [0, 24*x[1]**2]])

# Initial guess
x0 = np.array([1.5, 2])

# Perform Newton's method
newton_method(f, grad_f, hessian_f, x0)

Example: Minimize \(x^2+y^2\)#

import numpy as np
from scipy.linalg import solve
import plotly.graph_objects as go

def f(x):
  return x[0]**2 + x[1]**2

def grad_f(x):
  return np.array([2*x[0], 2*x[1]])

def hessian_f(x):
  return np.array([[2, 0], [0, 2]])

# Initial guess
x0 = np.array([1.5, 1.5])

# Perform Newton's method
newton_method(f, grad_f, hessian_f, x0)

Why does this converge so fast?
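
For this example, \(f\) is exactly quadratic: the Hessian is constant, so the quadratic model behind Newton’s method is exact and a single step lands on the stationary point,

\[\Delta \vec{x} = -H^{-1}\nabla f = -\begin{bmatrix}2 & 0\\ 0 & 2\end{bmatrix}^{-1}\begin{bmatrix}2x\\ 2y\end{bmatrix} = -\begin{bmatrix}x\\ y\end{bmatrix}, \qquad \vec{x}^{1} = \vec{x}^{0} + \Delta\vec{x} = \vec{0}\]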

Example: \(x^2 - 6 x y +y^2\)#

def f(x):
  return x[0]**2 + x[1]**2 - 6*x[0]*x[1]

def grad_f(x):
  return np.array([2*x[0] - 6*x[1], 2*x[1] - 6*x[0]])

def hessian_f(x):
  return np.array([[2, -6], [-6, 2]])

# Initial guess
x0 = np.array([1.5, 2.])

# Perform Newton's method
newton_method(f, grad_f, hessian_f, x0)

Yeehaw Giddyup!

Gradient descent methods#

If you have information about the gradient (exact or approximate), moving down the gradient is an intuitive approach to reach the minimum. While fairly fool-proof, we saw that steepest descent can lead to zig-zagging, which motivated constructing orthogonal / conjugate directions that limit their interference with each other.

Recall that each step is incremented as \(\vec{x}^{i+1} = \vec{x}^i + a \vec{p}\), where \(a\) is the step length and \(\vec{p}\) is the step direction.

Steepest descent, \(\vec{p}=-\nabla f\), maximizes the change in \(f\) in the immediate neighbourhood, but a different direction may permit longer step lengths. In general:

\[f(\vec{x}^{i+1})-f(\vec{x}^i) \le -a \|\nabla f(\vec{x}^i)\| \|\vec{p}\| \bigg[\cos(\theta) - \max_{t \in [0,1]} \frac{\|\nabla f(\vec{x}^i - t a \vec{p}) - \nabla f(\vec{x}^i)\|}{\|\nabla f(\vec{x}^i)\|}\bigg]\]

The second term assesses the rate of change of \(\nabla f\) and involves new quantities, so an exact calculation is typically avoided. Approximations to this term (or alternative algorithmic tools) give rise to different methods.
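
To make the zig-zag behaviour concrete, here is a minimal fixed-step steepest-descent sketch; the gradient_descent helper, its step length a, and the elongated quadratic test function are just illustrative choices, not part of the course code above.

import numpy as np

def gradient_descent(grad_f, x0, a=0.09, tol=1e-6, max_iter=500):
  # Repeatedly step along p = -grad f with a fixed step length a
  x = x0
  guesses = [x]
  for _ in range(max_iter):
    grad = grad_f(x)
    if np.linalg.norm(grad) < tol:
      break
    x = x - a * grad          # x^{i+1} = x^i + a p with p = -grad f
    guesses.append(x)
  return np.array(guesses)

# Elongated quadratic bowl f = x^2 + 10 y^2: the gradient points mostly across
# the valley, so fixed-step steepest descent zig-zags in y while creeping in x
grad_f = lambda x: np.array([2*x[0], 20*x[1]])

path = gradient_descent(grad_f, np.array([2.0, 1.0]))
print(len(path), path[-1])

On this elongated bowl the \(y\) component overshoots and changes sign at every step while \(x\) shrinks slowly, which is exactly the zig-zag pattern that conjugate-direction methods are designed to avoid.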

Stochastic gradient descent#

Machine learning training involves optimizing a model by finding an appropriate set of parameters. The parameter space can be very high dimensional, and the resulting function can be highly complex. In this case the gradient is approximated at each step by evaluating it on a randomly sampled subset (a mini-batch) of the training data rather than the full dataset. The step length \(a\) is renamed the learning rate.
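
As a rough sketch of the idea, a mini-batch SGD loop on a small least-squares problem; the linear model, learning_rate, and batch_size below are illustrative choices, not anything fixed by the method.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = X @ w_true + noise
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

def batch_grad(w, idx):
  # Gradient of the mean-squared error evaluated only on the mini-batch idx
  Xb, yb = X[idx], y[idx]
  return 2 / len(idx) * Xb.T @ (Xb @ w - yb)

def sgd(w0, learning_rate=0.05, batch_size=32, epochs=50):
  w = w0
  for _ in range(epochs):
    order = rng.permutation(n)                    # reshuffle the data each epoch
    for start in range(0, n, batch_size):
      idx = order[start:start + batch_size]
      w = w - learning_rate * batch_grad(w, idx)  # noisy steepest-descent step
  return w

w_fit = sgd(np.zeros(d))
print(np.round(w_fit - w_true, 3))  # should be close to zero

Each mini-batch gradient is a noisy but unbiased estimate of the full gradient, which is why many small, frequent steps still drift toward the minimum at a fraction of the cost per step.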