Overfitting vs Underfitting

⏱️ 35 sec read 📈 Data Analysis

An overfit model memorizes the training data and fails on new data; an underfit model is too simple to capture the pattern even on the training data. The diagnostic is the gap between training and validation error: large gap = overfitting, both errors high = underfitting. The fixes are opposite directions of the same dial.

Definitions in One Line Each

Overfitting: the model learns the noise in the training set, so training error is low but validation error is high.
Underfitting: the model is too simple to capture the real pattern, so both training and validation error are high.

The Polynomial Example Everyone Uses

Imagine you fit a polynomial to 30 noisy points sampled from a sine curve.

Degree 1 (linear):    train MSE = 0.45, test MSE = 0.48   → UNDERFIT
Degree 3 (cubic):     train MSE = 0.04, test MSE = 0.05   → GOOD FIT
Degree 15:            train MSE = 0.001, test MSE = 0.62  → OVERFIT

The degree-15 model wiggles through every training point, noise included, and is wildly wrong between them. The cubic captures the curve and ignores the wiggles. The line is too rigid to bend at all.
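
Here is a minimal sketch of that experiment in scikit-learn. The setup is an assumption (30 points from sin(x) with Gaussian noise and a fixed seed), so expect numbers in the same spirit as the table above, not the same digits.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 2 * np.pi, size=(30, 1))              # 30 noisy points...
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=30)   # ...from a sine curve
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree {degree:2d}: "
          f"train MSE={mean_squared_error(y_train, model.predict(X_train)):.3f}, "
          f"test MSE={mean_squared_error(y_test, model.predict(X_test)):.3f}")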

How to Tell Which One You Have

train_score    test_score    diagnosis
-----------    ----------    -----------------------------------------
low            low           Underfitting (model too simple)
high           high          Good fit
high           low           Overfitting (model memorized training set)
low            high          Bug: leakage, wrong split, or tiny test set

"Low" means the metric is bad: high error or low accuracy. The fourth row is a red flag, not a model problem β€” investigate before tuning anything.

Read the Learning Curve

Plot training and validation error as the training set grows (the code sketch below produces this plot). Two signatures to look for:

- Overfitting: training error stays low while validation error sits well above it, and the gap persists as the data grows. More data usually narrows it.
- Underfitting: both curves converge quickly to the same high error. More data will not help; more capacity will.

How to Fix Underfitting

- Add capacity: a higher polynomial degree, deeper trees, more layers.
- Add informative features or feature interactions.
- Reduce regularization: lower penalty strength, less dropout.
- Train longer if the optimizer has not converged.

How to Fix Overfitting

- Get more training data; when it is available, this is the most reliable fix.
- Regularize: L1/L2 penalties, dropout, early stopping (see the Ridge sketch after this list).
- Reduce capacity: a lower polynomial degree, shallower trees, fewer features.
- Cross-validate hyperparameter choices so you do not tune on noise.

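To make the dial concrete, here is a sketch under the same noisy-sine assumption as before: keep the degree-15 features that overfit badly, but let Ridge's alpha penalize the wild coefficients. The alpha values are illustrative, not tuned.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 2 * np.pi, size=(30, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# The scaler keeps the polynomial columns comparable so the penalty
# is applied evenly; watch the train/test gap shrink as alpha grows.
for alpha in (1e-4, 1e-2, 1.0):
    model = make_pipeline(PolynomialFeatures(15), StandardScaler(), Ridge(alpha=alpha))
    model.fit(X_train, y_train)
    print(f"alpha={alpha}: "
          f"train MSE={mean_squared_error(y_train, model.predict(X_train)):.3f}, "
          f"test MSE={mean_squared_error(y_test, model.predict(X_test)):.3f}")
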
The Bias-Variance Tradeoff

Total prediction error decomposes into three parts:

Total error = Bias² + Variance + Irreducible noise

- High bias → underfitting (assumptions are wrong)
- High variance → overfitting (sensitive to training sample)
- Irreducible noise → can't be fixed by any model

Tuning the model is mostly a tradeoff between bias and variance. Regularization moves the dial toward bias; flexibility moves it toward variance. The minimum total error sits between the extremes.
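
You can estimate the first two terms empirically: refit the same model class on many fresh training draws, then measure at fixed test points how far the average prediction sits from the truth (bias²) and how much individual predictions scatter around that average (variance). A minimal simulation, again assuming the noisy-sine setup:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x_grid = np.linspace(0, 2 * np.pi, 50)[:, None]    # fixed evaluation points
true_f = np.sin(x_grid).ravel()

for degree in (1, 3, 15):
    preds = []
    for _ in range(200):                           # 200 fresh training draws
        X = rng.uniform(0, 2 * np.pi, size=(30, 1))
        y = np.sin(X).ravel() + rng.normal(scale=0.2, size=30)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds.append(model.fit(X, y).predict(x_grid))
    preds = np.asarray(preds)                      # shape (200, 50)
    bias_sq = ((preds.mean(axis=0) - true_f) ** 2).mean()
    variance = preds.var(axis=0).mean()
    print(f"degree {degree:2d}: bias^2={bias_sq:.3f}, variance={variance:.3f}")

Expect bias² to dominate at degree 1 and variance to explode at degree 15 (often with conditioning warnings, which is itself part of the story).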

Code Sketch in scikit-learn

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for your data: the noisy sine from the example above.
rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=300)

sizes, train_scores, val_scores = learning_curve(
    GradientBoostingRegressor(max_depth=5, n_estimators=200),
    X, y, cv=5, scoring="neg_mean_squared_error",
    train_sizes=np.linspace(0.1, 1.0, 8),
)

# scoring is negated MSE, so flip the sign back to plain error.
plt.plot(sizes, -train_scores.mean(axis=1), label="train")
plt.plot(sizes, -val_scores.mean(axis=1), label="validation")
plt.xlabel("training set size"); plt.ylabel("MSE"); plt.legend()
plt.show()

# Big persistent gap = overfitting; both flat-and-high = underfitting.

Common Pitfalls

- Tuning against the test set: every peek turns the test score into a training signal, and you overfit the test set itself.
- Leakage in preprocessing: scalers, imputers, and encoders fit on the full dataset before splitting inflate validation scores.
- Tiny validation sets: with few samples, the train/validation gap you are reading may be mostly noise.

Pro Tip: Always look at both training and validation metrics, not just validation. A model with 95% validation accuracy and 95% training accuracy is healthy. A model with 95% validation accuracy and 99.9% training accuracy is overfitting and will be the first to degrade when the data drifts.

← Back to Data Analysis Tips