Gradient Descent Explained Simply

Gradient descent is the core optimization algorithm in machine learning. It finds good model parameters by repeatedly stepping downhill on the loss surface toward a minimum of the error.

The Concept: Walking Downhill

# Imagine you're on a mountain in fog
# Goal: Reach the lowest point (valley)
# Strategy: Feel the slope, take step downhill, repeat

1. Start at random position
2. Calculate slope (gradient)
3. Move opposite to slope direction
4. Repeat until you reach the bottom

The same principle applies to minimizing model error!
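
A minimal sketch of those four steps on a one-dimensional "hill", using the made-up loss J(x) = x² (whose slope is 2x) and an arbitrary starting point:

# Toy example: minimize J(x) = x**2, whose slope at x is 2*x
x = 5.0          # 1. start at some position
step_size = 0.1  # how far to move each step

for _ in range(100):
    slope = 2 * x              # 2. calculate the slope
    x = x - step_size * slope  # 3. move opposite to the slope
                               # 4. repeat

print(x)  # ends up very close to 0, the bottom of the "valley"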

The Math (Simplified)

# Update rule:
θ = θ - α * ∇J(θ)

Where:
θ = model parameters (weights)
α = learning rate (step size)
∇J(θ) = gradient (slope) of loss function

"Move parameters opposite to gradient direction"

Simple Implementation

import numpy as np

# Simple linear regression with gradient descent
def gradient_descent(X, y, learning_rate=0.01, iterations=1000):
    m = len(y)
    theta = np.zeros(X.shape[1])  # Initialize parameters

    for i in range(iterations):
        # Make predictions
        predictions = X.dot(theta)

        # Calculate error
        errors = predictions - y

        # Calculate gradient
        gradient = (1/m) * X.T.dot(errors)

        # Update parameters
        theta = theta - learning_rate * gradient

        # Calculate loss
        if i % 100 == 0:
            loss = np.mean(errors**2)
            print(f"Iteration {i}: Loss = {loss:.4f}")

    return theta

# Example usage
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])  # Features with bias term
y = np.array([2, 4, 6, 8])  # Target values

theta = gradient_descent(X, y)
print(f"Final parameters: {theta}")

Learning Rate: The Step Size

Too Large (α too big)

# Takes huge steps
# May overshoot minimum
# Loss bounces around, doesn't converge
# Can even diverge (get worse)

learning_rate = 1.0  # Often too large

Too Small (α too small)

# Takes tiny steps
# Very slow convergence
# May need millions of iterations
# May stall on plateaus or in shallow local minima

learning_rate = 0.00001  # Often too small

Just Right

# Steady progress toward minimum
# Converges in reasonable time
# Typical values: 0.001 to 0.1

learning_rate = 0.01  # Good starting point
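
To see the difference concretely, here is a small sketch (the same toy X and y as the example above, with illustrative learning rates) that compares the loss reached after 50 iterations:

import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])
y = np.array([2, 4, 6, 8])
m = len(y)

for lr in [1.0, 0.00001, 0.01]:
    theta = np.zeros(X.shape[1])
    for _ in range(50):
        errors = X.dot(theta) - y
        theta = theta - lr * (1/m) * X.T.dot(errors)
    loss = np.mean((X.dot(theta) - y)**2)
    print(f"learning_rate={lr}: loss after 50 iterations = {loss:.3g}")

# Expect: lr=1.0 blows up, lr=0.00001 barely moves, lr=0.01 makes steady progress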

Types of Gradient Descent

Batch Gradient Descent

# Uses ALL training samples for each update
# Accurate gradient, but slow for large datasets
# (the gradient_descent function above works this way)

for iteration in range(iterations):
    gradient = calculate_gradient(all_data)  # All samples
    theta = theta - learning_rate * gradient

Stochastic Gradient Descent (SGD)

# Uses ONE sample for each update
# Fast but noisy updates

for iteration in range(iterations):
    for sample in shuffle(data):
        gradient = calculate_gradient(sample)  # Single sample
        theta = theta - learning_rate * gradient
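
A runnable sketch of SGD on the toy linear-regression data from above (the learning rate and epoch count here are illustrative choices, not recommendations):

import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])
y = np.array([2, 4, 6, 8])
theta = np.zeros(X.shape[1])
rng = np.random.default_rng(0)

for epoch in range(1000):
    for idx in rng.permutation(len(y)):      # shuffle the samples each epoch
        error = X[idx].dot(theta) - y[idx]   # error on ONE sample
        gradient = error * X[idx]            # gradient from that sample alone
        theta = theta - 0.01 * gradient      # noisy but frequent update

print(theta)  # close to [0, 2] (intercept 0, slope 2) for this noise-free data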

Mini-Batch Gradient Descent

# Uses small batch (e.g., 32 samples)
# Balance between speed and stability
# Most commonly used in practice

batch_size = 32
for iteration in range(iterations):
    for batch in get_batches(data, batch_size):
        gradient = calculate_gradient(batch)  # Small batch
        theta = theta - learning_rate * gradient
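
And a runnable sketch of the mini-batch version on the same toy data (batch_size = 2 only because the dataset has 4 rows; batch_size = 1 gives SGD and batch_size = len(y) gives batch gradient descent):

import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])
y = np.array([2, 4, 6, 8])
theta = np.zeros(X.shape[1])
batch_size = 2
rng = np.random.default_rng(0)

for epoch in range(1000):
    order = rng.permutation(len(y))                     # shuffle each epoch
    for start in range(0, len(y), batch_size):
        batch = order[start:start + batch_size]         # indices of one mini-batch
        errors = X[batch].dot(theta) - y[batch]
        gradient = X[batch].T.dot(errors) / len(batch)  # average over the batch
        theta = theta - 0.05 * gradient

print(theta)  # close to [0, 2] for this y = 2x toy data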

Monitoring Convergence

import matplotlib.pyplot as plt

# Reuse X and y from the example above; restart from fresh parameters
theta = np.zeros(X.shape[1])
m = len(y)
learning_rate = 0.01
iterations = 1000

# Track loss over iterations
losses = []

for i in range(iterations):
    predictions = X.dot(theta)
    loss = np.mean((predictions - y)**2)
    losses.append(loss)

    # Update parameters
    gradient = (1/m) * X.T.dot(predictions - y)
    theta = theta - learning_rate * gradient

# Plot learning curve
plt.plot(losses)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Gradient Descent Convergence')
plt.show()

# Loss should decrease and flatten out

Advanced Variants

# Momentum (faster convergence)
# velocity starts at 0; 0.9 is the momentum coefficient
velocity = 0.9 * velocity + learning_rate * gradient
theta = theta - velocity

# Adam (adaptive learning rates)
from torch.optim import Adam
optimizer = Adam(model.parameters(), lr=0.001)

# Most popular in deep learning
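
For context, here is what Adam looks like in a minimal PyTorch training loop; the tiny model, data, learning rate, and epoch count are illustrative choices, not recommendations:

import torch
from torch import nn
from torch.optim import Adam

# Toy data: y = 2x, learned by a single linear layer
X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[2.0], [4.0], [6.0], [8.0]])

model = nn.Linear(1, 1)
optimizer = Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(1000):
    optimizer.zero_grad()         # clear gradients from the previous step
    loss = loss_fn(model(X), y)   # forward pass + loss
    loss.backward()               # backpropagation computes the gradients
    optimizer.step()              # Adam update with adaptive step sizes

print(model.weight.item(), model.bias.item())  # should approach 2.0 and 0.0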

Pro Tip: Start with mini-batch gradient descent and Adam optimizer—they work well in most cases. Monitor your loss curve: it should steadily decrease. If it bounces around, lower the learning rate!
