Overfitting vs Underfitting

⏱️ 35 sec read 📈 Data Analysis

An overfit model memorizes the training data and fails on new data; an underfit model is too simple to capture the pattern even on the training data. The diagnostic is the gap between training and validation error: large gap = overfitting, both errors high = underfitting. The fixes are opposite directions of the same dial.

Definitions in One Line Each

Overfitting: the model learns the noise in the training set, so training error is low but validation error is high.
Underfitting: the model is too simple to capture the real pattern, so both training and validation error are high.

The Polynomial Example Everyone Uses

Imagine you fit a polynomial to 30 noisy points sampled from a sine curve.

Degree 1 (linear):    train MSE = 0.45, test MSE = 0.48   → UNDERFIT
Degree 3 (cubic):     train MSE = 0.04, test MSE = 0.05   → GOOD FIT
Degree 15:            train MSE = 0.001, test MSE = 0.62  → OVERFIT

The degree-15 model wiggles through every training point, noise included, and is wildly wrong between them. The cubic captures the curve and ignores the wiggles. The line is too rigid to bend at all.
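
Here is a minimal sketch of that experiment in scikit-learn. The setup is an assumption (30 points from sin(x) with Gaussian noise and a fixed seed), so expect numbers in the same spirit as the table above, not the same digits.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 2 * np.pi, size=(30, 1))              # 30 noisy points...
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=30)   # ...from a sine curve
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree {degree:2d}: "
          f"train MSE={mean_squared_error(y_train, model.predict(X_train)):.3f}, "
          f"test MSE={mean_squared_error(y_test, model.predict(X_test)):.3f}")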

How to Tell Which One You Have

train_score    test_score    diagnosis
-----------    ----------    -----------------------------------------
low            low           Underfitting (model too simple)
high           high          Good fit
high           low           Overfitting (model memorized training set)
low            high          Bug: leakage, wrong split, or tiny test set

"Low" means the metric is bad: high error or low accuracy. The fourth row is a red flag, not a model problem β€” investigate before tuning anything.

Read the Learning Curve

Plot training and validation error as the training set grows (the code sketch below produces this plot). Two signatures to look for:

- Overfitting: training error stays low while validation error sits well above it, and the gap persists as the data grows. More data usually narrows it.
- Underfitting: both curves converge quickly to the same high error. More data will not help; more capacity will.

How to Fix Underfitting

- Add capacity: a higher polynomial degree, deeper trees, more layers.
- Add informative features or feature interactions.
- Reduce regularization: lower penalty strength, less dropout.
- Train longer if the optimizer has not converged.

How to Fix Overfitting

- Get more training data; when it is available, this is the most reliable fix.
- Regularize: L1/L2 penalties, dropout, early stopping (see the Ridge sketch after this list).
- Reduce capacity: a lower polynomial degree, shallower trees, fewer features.
- Cross-validate hyperparameter choices so you do not tune on noise.

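To make the dial concrete, here is a sketch under the same noisy-sine assumption as before: keep the degree-15 features that overfit badly, but let Ridge's alpha penalize the wild coefficients. The alpha values are illustrative, not tuned.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 2 * np.pi, size=(30, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# The scaler keeps the polynomial columns comparable so the penalty
# is applied evenly; watch the train/test gap shrink as alpha grows.
for alpha in (1e-4, 1e-2, 1.0):
    model = make_pipeline(PolynomialFeatures(15), StandardScaler(), Ridge(alpha=alpha))
    model.fit(X_train, y_train)
    print(f"alpha={alpha}: "
          f"train MSE={mean_squared_error(y_train, model.predict(X_train)):.3f}, "
          f"test MSE={mean_squared_error(y_test, model.predict(X_test)):.3f}")
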
The Bias-Variance Tradeoff

Total prediction error decomposes into three parts:

Total error = Bias² + Variance + Irreducible noise

- High bias → underfitting (assumptions are wrong)
- High variance → overfitting (sensitive to training sample)
- Irreducible noise → can't be fixed by any model

Tuning the model is mostly a tradeoff between bias and variance. Regularization moves the dial toward bias; flexibility moves it toward variance. The minimum total error sits between the extremes.
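
You can estimate the first two terms empirically: refit the same model class on many fresh training draws, then measure at fixed test points how far the average prediction sits from the truth (bias²) and how much individual predictions scatter around that average (variance). A minimal simulation, again assuming the noisy-sine setup:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x_grid = np.linspace(0, 2 * np.pi, 50)[:, None]    # fixed evaluation points
true_f = np.sin(x_grid).ravel()

for degree in (1, 3, 15):
    preds = []
    for _ in range(200):                           # 200 fresh training draws
        X = rng.uniform(0, 2 * np.pi, size=(30, 1))
        y = np.sin(X).ravel() + rng.normal(scale=0.2, size=30)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds.append(model.fit(X, y).predict(x_grid))
    preds = np.asarray(preds)                      # shape (200, 50)
    bias_sq = ((preds.mean(axis=0) - true_f) ** 2).mean()
    variance = preds.var(axis=0).mean()
    print(f"degree {degree:2d}: bias^2={bias_sq:.3f}, variance={variance:.3f}")

Expect bias² to dominate at degree 1 and variance to explode at degree 15 (often with conditioning warnings, which is itself part of the story).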

Code Sketch in scikit-learn

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for your data: the noisy sine from the example above.
rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=300)

sizes, train_scores, val_scores = learning_curve(
    GradientBoostingRegressor(max_depth=5, n_estimators=200),
    X, y, cv=5, scoring="neg_mean_squared_error",
    train_sizes=np.linspace(0.1, 1.0, 8),
)

# scoring is negated MSE, so flip the sign back to plain error.
plt.plot(sizes, -train_scores.mean(axis=1), label="train")
plt.plot(sizes, -val_scores.mean(axis=1), label="validation")
plt.xlabel("training set size"); plt.ylabel("MSE"); plt.legend()
plt.show()

# Big persistent gap = overfitting; both flat-and-high = underfitting.

Common Pitfalls

- Tuning against the test set: every peek turns the test score into a training signal, and you overfit the test set itself.
- Leakage in preprocessing: scalers, imputers, and encoders fit on the full dataset before splitting inflate validation scores.
- Tiny validation sets: with few samples, the train/validation gap you are reading may be mostly noise.

Pro Tip: Always look at both training and validation metrics, not just validation. A model with 95% validation accuracy and 95% training accuracy is healthy. A model with 95% validation accuracy and 99.9% training accuracy is overfitting and will be the first to degrade when the data drifts.

← Back to Data Analysis Tips