Useful Data Tips

Overfitting vs Underfitting

🤖 AI & ML ⏱️ 30 sec read

The Goldilocks Problem

Underfitting: Model too simple, can't capture patterns (high bias)

Overfitting: Model too complex, memorizes training data (high variance)

Just right: Model generalizes well to new data

Simple Analogy

Underfitting: Studying only chapter 1, then taking exam on entire book

Overfitting: Memorizing practice exam answers, failing on new questions

Good fit: Understanding concepts, can apply to new problems

Visual Example

# Data: Points roughly following a curve
# True relationship: y = x^2 + noise

Underfitting:  y = constant        ───────────  (straight line, misses pattern)
Good fit:      y = a*x^2 + b*x + c  ∪∪∪∪∪∪∪  (captures pattern)
Overfitting:   y = polynomial_50   ∿∿∿∿∿∿∿  (fits noise, not pattern)
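
A minimal runnable sketch of the same idea (not from the original tip; plain numpy, with degrees 0, 2, and 9 standing in for the three cases):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = x**2 + rng.normal(scale=1.0, size=x.shape)   # true relationship plus noise

for degree, label in [(0, "underfit (constant)"),
                      (2, "good fit (quadratic)"),
                      (9, "overfit (high-degree)")]:
    coeffs = np.polyfit(x, y, degree)            # least-squares polynomial fit
    mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
    print(f"degree={degree}  training MSE={mse:.2f}  # {label}")

# The high-degree fit gets the lowest *training* error by chasing the noise;
# on fresh points from the same curve it would generalize worse.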

How to Detect

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train model
model = DecisionTreeRegressor(max_depth=10)
model.fit(X_train, y_train)

# Evaluate
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print(f"Training score: {train_score:.3f}")
print(f"Test score: {test_score:.3f}")

# Diagnosis:
# train=0.99, test=0.98 → Good fit ✅
# train=0.99, test=0.60 → Overfitting ❌ (huge gap)
# train=0.65, test=0.63 → Underfitting ❌ (both low)

Underfitting in Detail

Symptoms

Low scores on both training and test data; the model misses even obvious patterns.

Causes

Model too simple, too few or uninformative features, too much regularization, or too little training.

Solutions

# 1. Increase model complexity
# Before: Linear model
from sklearn.linear_model import LinearRegression
model = LinearRegression()

# After: Polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
model.fit(X_poly, y)

# 2. Add more features
X_new = X.copy()
X_new['feature_interaction'] = X['A'] * X['B']
X_new['feature_squared'] = X['A'] ** 2

# 3. Reduce regularization
from sklearn.linear_model import Ridge

# Before: Strong regularization
model = Ridge(alpha=100)

# After: Weaker regularization
model = Ridge(alpha=0.1)

# 4. Train longer (neural networks; `epochs` is a Keras model.fit() argument, not scikit-learn)
model.fit(X_train, y_train, epochs=100)  # Instead of 10

Overfitting in Detail

Symptoms

Near-perfect training score but a much lower test score; performance drops sharply on new data.

Causes

Model too complex for the amount of data, too many features, too few training examples, or training for too long.

Solutions

1. Get More Data

# More data → harder to memorize → better generalization
# Best solution if possible!
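
A rough sketch of why this works, reusing the X_train/X_test split and DecisionTreeRegressor from the detection example above (the subset size of 50 is just illustrative):

for n in (50, len(X_train)):
    model = DecisionTreeRegressor(max_depth=None, random_state=0)  # unconstrained tree
    model.fit(X_train[:n], y_train[:n])
    gap = model.score(X_train[:n], y_train[:n]) - model.score(X_test, y_test)
    print(f"n={n}: train/test gap = {gap:.3f}")  # the gap typically shrinks as n grows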

2. Regularization (L1/L2)

# L2 Regularization (Ridge)
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)  # alpha controls strength

# L1 Regularization (Lasso) - also does feature selection
from sklearn.linear_model import Lasso
model = Lasso(alpha=1.0)

# ElasticNet - combines L1 and L2
from sklearn.linear_model import ElasticNet
model = ElasticNet(alpha=1.0, l1_ratio=0.5)

3. Reduce Model Complexity

# Decision Trees: Limit depth
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    max_depth=5,           # Limit tree depth
    min_samples_split=20,  # Require more samples to split
    min_samples_leaf=10    # Require more samples in leaf
)

# Neural Networks: Fewer layers/neurons
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(32, activation='relu'),  # Instead of 512
    Dense(16, activation='relu'),  # Simpler architecture
    Dense(1, activation='sigmoid')
])

4. Dropout (Neural Networks)

from tensorflow.keras.layers import Dropout

model = Sequential([
    Dense(128, activation='relu'),
    Dropout(0.5),  # Randomly drop 50% of neurons during training
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(1)
])

5. Early Stopping

from tensorflow.keras.callbacks import EarlyStopping

# Stop training when validation loss stops improving
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,       # Wait 10 epochs for improvement
    restore_best_weights=True
)

model.fit(X_train, y_train,
          validation_split=0.2,
          epochs=1000,
          callbacks=[early_stop])  # Will stop early if overfitting

6. Cross-Validation

from sklearn.model_selection import cross_val_score

# Evaluate on multiple train/test splits
scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-validation scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std():.3f})")

7. Feature Selection

# Remove irrelevant/redundant features
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=10)  # Keep top 10 features
X_selected = selector.fit_transform(X, y)

# Check feature importance
feature_scores = selector.scores_
print("Feature scores:", feature_scores)

8. Data Augmentation

# For images: rotate, flip, zoom
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    horizontal_flip=True
)

# Increases effective training data size
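
A usage sketch: here X_train is assumed to be an array of images shaped (n, height, width, channels), y_train its labels, model a compiled Keras network, and X_val/y_val a hypothetical held-out set.

model.fit(datagen.flow(X_train, y_train, batch_size=32),
          validation_data=(X_val, y_val),
          epochs=20)
# Each epoch sees freshly rotated/shifted/flipped copies of the images,
# so the network cannot simply memorize the original pixels.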

The Bias-Variance Tradeoff

Total Error = Bias² + Variance + Irreducible Error

Bias: Error from wrong assumptions (underfitting)
Variance: Error from sensitivity to training data (overfitting)

High Bias, Low Variance  → Underfitting
Low Bias, High Variance  → Overfitting
Low Bias, Low Variance   → Just right!
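
The decomposition can be estimated empirically: refit a model on many resampled training sets, then measure how far the average prediction sits from the truth (bias²) and how much the individual fits scatter around that average (variance). A hedged sketch on the y = x² + noise toy data, with degree 1 vs. degree 9 polynomials standing in for simple vs. flexible models:

import numpy as np

rng = np.random.default_rng(0)
x_grid = np.linspace(-3, 3, 50)
true_y = x_grid**2

for degree in (1, 9):
    preds = []
    for _ in range(200):                                   # 200 independent training sets
        x = rng.uniform(-3, 3, 40)
        y = x**2 + rng.normal(scale=1.0, size=x.shape)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_grid))
    preds = np.array(preds)
    bias_sq = np.mean((preds.mean(axis=0) - true_y) ** 2)   # squared error of the average fit
    variance = np.mean(preds.var(axis=0))                   # spread of fits around their average
    print(f"degree={degree}: bias^2={bias_sq:.2f}, variance={variance:.2f}")

# Expect degree 1: high bias^2, low variance (underfits);
#        degree 9: low bias^2, higher variance (prone to overfit).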

Learning Curves

from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import numpy as np

train_sizes, train_scores, test_scores = learning_curve(
    model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10)
)

plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
plt.plot(train_sizes, test_scores.mean(axis=1), label='Test score')
plt.xlabel('Training Set Size')
plt.ylabel('Score')
plt.legend()
plt.show()

# Overfitting: Large gap between curves
# Underfitting: Both curves plateau at low value
# Good fit: Curves converge at high value

Validation Curves

from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

param_range = [1, 2, 3, 5, 10, 20, 50]
train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(), X, y,
    param_name='max_depth',
    param_range=param_range,
    cv=5
)

# Plot to find optimal max_depth
# Sweet spot where test score is highest
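
A short plotting sketch to go with the call above (assumes matplotlib is imported as plt, as in the learning-curve example):

plt.plot(param_range, train_scores.mean(axis=1), marker='o', label='Training score')
plt.plot(param_range, test_scores.mean(axis=1), marker='o', label='Cross-validation score')
plt.xlabel('max_depth')
plt.ylabel('Score')
plt.legend()
plt.show()

best_depth = param_range[test_scores.mean(axis=1).argmax()]
print(f"Best max_depth by cross-validation score: {best_depth}")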

Real-World Example

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Generate data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Test different complexities
for max_depth in [2, 5, 10, 20, None]:
    model = RandomForestClassifier(max_depth=max_depth, random_state=42)
    model.fit(X_train, y_train)

    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)

    print(f"max_depth={max_depth}:")
    print(f"  Train: {train_acc:.3f}")
    print(f"  Test:  {test_acc:.3f}")
    print(f"  Gap:   {train_acc - test_acc:.3f}\n")

# Output:
# max_depth=2:   Train: 0.850, Test: 0.845 (underfitting, low scores)
# max_depth=5:   Train: 0.920, Test: 0.905 (good fit!)
# max_depth=10:  Train: 0.965, Test: 0.900 (slight overfit)
# max_depth=None: Train: 1.000, Test: 0.885 (overfitting, big gap)

Quick Diagnostic Guide

Train Score    Test Score       Problem        Solution
High (>95%)    Much lower       Overfitting    Regularization, more data, simpler model
Low (<80%)     Low (<80%)       Underfitting   More complex model, more features
High (90%)     Similar (88%)    Good fit       You're done!
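
The table boils down to two comparisons, which can be wrapped in a tiny helper (the 0.10 gap and 0.80 floor below are rough rules of thumb, not hard cutoffs):

def diagnose(train_score, test_score, max_gap=0.10, low_floor=0.80):
    """Rough diagnosis from train/test scores."""
    if train_score - test_score > max_gap:
        return "Overfitting: regularize, simplify the model, or get more data"
    if train_score < low_floor and test_score < low_floor:
        return "Underfitting: use a more complex model or add features"
    return "Good fit"

print(diagnose(0.99, 0.60))  # Overfitting
print(diagnose(0.65, 0.63))  # Underfitting
print(diagnose(0.92, 0.90))  # Good fit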

Best Practices

Key Takeaways:

  • Overfitting: Model memorizes, doesn't generalize (high variance)
  • Underfitting: Model too simple, misses patterns (high bias)
  • Detect: Large gap between train/test scores = overfitting
  • Fix overfitting: More data, regularization, simpler model, dropout
  • Fix underfitting: More complex model, more features, train longer
  • Goal: Find sweet spot with good generalization