Optimization for ML: Beyond Gradient Descent

Summary

Optimization is the engine that transforms a loss function into a trained model. While gradient descent is the workhorse, modern machine learning relies on a rich toolbox of optimizers—each with unique strengths, weaknesses, and use cases. This article explores the optimization landscape: from convex to non‑convex problems, from first‑order to second‑order methods, and from simple SGD to adaptive algorithms like Adam and RMSprop. Through intuitive explanations and executable Python code, you’ll learn how momentum helps escape plateaus, why adaptive methods handle sparse features, and when second‑order information might be worth the cost.

Why Optimization Matters

Every ML training loop is an optimization problem: find the parameters θ that minimize a loss function L(θ, data). But the path from random initialization to a good minimum is rarely straightforward.

Ill‑conditioned landscapes can make gradient descent painfully slow.
Saddle points and local minima can trap optimization.
Constraints (e.g., non‑negative weights, unit norms) often reflect real‑world requirements.

Understanding optimization algorithms helps you:

Choose the right solver for your problem.
Tune hyperparameters (learning rate, momentum) effectively.
Diagnose why training might be failing.

In robotics, optimization appears in inverse kinematics (finding joint angles that reach a target pose), trajectory smoothing, and model‑predictive control. The same principles—gradient‑based search, constraints, and local minima—apply.

Core Optimization Concepts with ML Applications

1. The Optimization Problem in ML

We aim to solve:

$\theta^* = \arg\min_\theta \frac{1}{N}\sum_{i=1}^N \ell(f_\theta(x_i), y_i) + \lambda R(\theta)$

Where $ℓ$ is a loss (e.g., cross‑entropy) and $R$ is a regularizer.

Linear regression with MSE loss has a closed‑form solution, but most ML models require iterative methods.

import numpy as np
import matplotlib.pyplot as plt

# Simple quadratic loss: L(θ) = (θ - 5)^2
def loss(theta):
    return (theta - 5)**2

def gradient(theta):
    return 2 * (theta - 5)

# Gradient descent
theta = 0.0
lr = 0.1
trajectory = [theta]

for i in range(20):
    theta = theta - lr * gradient(theta)
    trajectory.append(theta)

# Plot
thetas = np.linspace(-2, 8, 100)
plt.plot(thetas, loss(thetas), label='Loss')
plt.plot(trajectory, [loss(t) for t in trajectory], 'ro-', label='GD steps')
plt.xlabel('θ')
plt.ylabel('L(θ)')
plt.title('Gradient Descent on a Convex Function')
plt.legend()
plt.grid(True)
plt.show()

2. Gradient Descent Variants

Batch GD: uses entire dataset per step – accurate but slow.
Stochastic GD (SGD): uses one random sample – fast but noisy.
Mini‑batch GD: compromise – uses a small batch.

# Simulate mini‑batch gradient descent
np.random.seed(42)
X = np.random.randn(1000, 10)
y = X @ np.array([1.5, -2.0, 0.5, 0, 0, 0, 0, 0, 0, 3.0]) + 0.1 * np.random.randn(1000)

def mse_gradient(theta, X_batch, y_batch):
    pred = X_batch @ theta
    error = pred - y_batch
    return (2 / len(X_batch)) * X_batch.T @ error

theta = np.zeros(10)
lr = 0.01
batch_size = 32
losses = []

for epoch in range(100):
    idx = np.random.permutation(len(X))
    X_shuffled = X[idx]
    y_shuffled = y[idx]
    for i in range(0, len(X), batch_size):
        X_batch = X_shuffled[i:i+batch_size]
        y_batch = y_shuffled[i:i+batch_size]
        grad = mse_gradient(theta, X_batch, y_batch)
        theta -= lr * grad
    losses.append(np.mean((X @ theta - y)**2))

plt.plot(losses)
plt.xlabel('Epoch')
plt.ylabel('MSE')
plt.title('Mini‑batch GD Convergence')
plt.show()

3. Challenges in Optimization

Ill‑conditioning: different directions have very different curvatures → zig‑zagging.
Saddle points: gradient ≈ 0 but not a minimum – common in high dimensions.
Local minima: in non‑convex problems, many minima exist.

Visualising a saddle point:

from mpl_toolkits.mplot3d import Axes3D

x = np.linspace(-2, 2, 50)
y = np.linspace(-2, 2, 50)
X, Y = np.meshgrid(x, y)
Z = X**2 - Y**2  # saddle at (0,0)

fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z, cmap='viridis', alpha=0.8)
ax.set_title('Saddle Point: $f(x,y) = x^2 - y^2$')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('f')
plt.show()

4. Momentum: Accelerating Convergence

Momentum accumulates past gradients to dampen oscillations and speed up progress.

$v_t = \beta v_{t-1} + (1-\beta) \nabla L(\theta_t)$ $\theta_{t+1} = \theta_t - \eta v_t$

def gradient_descent_with_momentum(gradient, init, lr=0.01, beta=0.9, steps=50):
    theta = init
    v = 0
    path = [theta]
    for _ in range(steps):
        g = gradient(theta)
        v = beta * v + (1 - beta) * g
        theta = theta - lr * v
        path.append(theta)
    return path

# Example: 2D quadratic with different curvatures
def grad_quad(theta):
    return np.array([2*theta[0], 20*theta[1]])  # ill‑conditioned

path_mom = gradient_descent_with_momentum(grad_quad, np.array([5.0, 1.0]), lr=0.1, beta=0.9)
path_gd = gradient_descent_with_momentum(grad_quad, np.array([5.0, 1.0]), lr=0.1, beta=0.0)  # no momentum

# Plot trajectories
path_mom = np.array(path_mom)
path_gd = np.array(path_gd)

plt.plot(path_gd[:,0], path_gd[:,1], 'o-', label='GD')
plt.plot(path_mom[:,0], path_mom[:,1], 's-', label='Momentum')
plt.xlim(-1,6)
plt.ylim(-1,2)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Effect of Momentum in Ill‑conditioned Problem')
plt.legend()
plt.grid(True)
plt.show()

5. Adaptive Methods: AdaGrad, RMSprop, Adam

AdaGrad adapts learning rates per parameter based on past gradients. Good for sparse features.

RMSprop modifies AdaGrad to use a moving average of squared gradients, preventing aggressive decay.

Adam combines momentum and RMSprop – currently the most popular optimizer.

import torch
import torch.nn as nn
import torch.optim as optim

# Simple neural net on synthetic data
X = torch.randn(1000, 10)
y = X @ torch.tensor([1.5, -2.0, 0.5, 0, 0, 0, 0, 0, 0, 3.0], dtype=torch.float32) + 0.1*torch.randn(1000)

model = nn.Linear(10, 1)
criterion = nn.MSELoss()

# Compare optimizers
optimizers = {
    'SGD': optim.SGD(model.parameters(), lr=0.01),
    'Momentum': optim.SGD(model.parameters(), lr=0.01, momentum=0.9),
    'Adam': optim.Adam(model.parameters(), lr=0.01)
}

losses = {name: [] for name in optimizers}

for name, opt in optimizers.items():
    # Reinitialize model for each optimizer
    model = nn.Linear(10, 1)
    opt = optimizers[name]
    for epoch in range(100):
        opt.zero_grad()
        output = model(X).squeeze()
        loss = criterion(output, y)
        loss.backward()
        opt.step()
        losses[name].append(loss.item())

# Plot
for name, loss_list in losses.items():
    plt.plot(loss_list, label=name)
plt.xlabel('Epoch')
plt.ylabel('MSE Loss')
plt.title('Optimizer Comparison')
plt.legend()
plt.show()

6. Second‑Order Methods: Newton and Quasi‑Newton

First‑order methods use gradient; second‑order also use curvature (Hessian). They can converge faster but are expensive.

Newton’s method: $\theta_{t+1} = \theta_t - H^{-1} \nabla L$ – exact, but Hessian inversion is O(n³).
L‑BFGS: approximates Hessian using past gradients – popular for smaller problems.

from scipy.optimize import minimize

# Use L-BFGS via scipy on a simple logistic regression
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

def loss(theta, X, y):
    z = X @ theta
    pred = 1 / (1 + np.exp(-z))
    return -np.mean(y * np.log(pred) + (1-y) * np.log(1-pred))

def grad(theta, X, y):
    z = X @ theta
    pred = 1 / (1 + np.exp(-z))
    return (X.T @ (pred - y)) / len(y)

theta0 = np.zeros(5)
result = minimize(loss, theta0, args=(X, y), method='L-BFGS-B', jac=grad)
print(result)

7. Constrained Optimization and Lagrange Multipliers

Many ML problems have constraints: e.g., SVM dual, non‑negative matrix factorization, or norm constraints. Lagrange multipliers turn constrained problems into unconstrained ones.

Example: Minimize $f(x) = x^2$ subject to $x \ge 1$ .

Lagrangian: $\mathcal{L}(x, \lambda) = x^2 + \lambda(1 - x)$ , with $\lambda \ge 0$ .

Solution: $x=1$ , $\lambda=2$ (from KKT conditions).

In practice, constrained problems are often solved via projected gradient or specialized algorithms.

# Projected gradient: after each step, project onto feasible set
def project(x):
    return max(x, 1)  # projection onto x >= 1

x = 0.0
lr = 0.1
for i in range(50):
    grad = 2*x
    x = x - lr * grad
    x = project(x)
    print(f"Step {i}: x = {x:.4f}")

8. Convex vs. Non‑convex Optimization

Convex: any local minimum is global. Many classic models (linear regression, SVM) are convex.
Non‑convex: deep neural networks are highly non‑convex. Yet SGD finds good solutions due to overparameterization and implicit regularization.

Implication: In convex problems, we can guarantee convergence; in non‑convex, we rely on heuristics and empirical success.

9. Practical Tips for Training

Learning rate schedules: step decay, exponential decay, cosine annealing.
Warm‑up: start with small LR, then increase – helps early training stability.
Gradient clipping: prevent exploding gradients.
Normalization: batch norm, layer norm improve conditioning.

Conclusions

No single optimizer rules them all: SGD with momentum is simple and general; Adam works well across many tasks; L‑BFGS is excellent for small‑scale problems.
Adaptive methods reduce sensitivity to hyperparameters, making them a good default.
Second‑order information is powerful but costly – often reserved for low‑dimensional or carefully structured problems.
Constraints appear frequently (e.g., fairness, safety) and require specialised algorithms.
Understanding optimisation helps you debug training: if loss plateaus, try momentum or adaptive LR; if it diverges, clip gradients or reduce LR.

Real‑World Applications

Training LLMs: AdamW is the standard.
Reinforcement Learning: often uses Adam or RMSprop.
Robotics: Trajectory optimisation via sequential quadratic programming (SQP).
Finance: Portfolio optimisation with constraints.