Overfitting vs Underfitting
The Goldilocks Problem
Underfitting: Model too simple, can't capture patterns (high bias)
Overfitting: Model too complex, memorizes training data (high variance)
Just right: Model generalizes well to new data
Simple Analogy
Underfitting: Studying only chapter 1, then taking exam on entire book
Overfitting: Memorizing practice exam answers, failing on new questions
Good fit: Understanding concepts, can apply to new problems
Visual Example
# Data: Points roughly following a curve
# True relationship: y = x^2 + noise
Underfitting: y = constant              ─────────── (straight line, misses pattern)
Good fit:     y = a*x^2 + b*x + c       ∪∪∪∪∪∪∪     (captures pattern)
Overfitting:  y = polynomial_50         ∿∿∿∿∿∿∿     (fits noise, not pattern)
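The same picture can be reproduced in code. Here is a minimal sketch (the sample size, noise level, and degree choices are illustrative, not from the original): fit polynomials of increasing degree to noisy y = x^2 data and compare train/test R² scores. The exact numbers vary run to run, but the degree-1 fit scores low everywhere and the high-degree fit opens a train/test gap.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Noisy quadratic data: y = x^2 + noise
rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(40, 1))
y = X[:, 0] ** 2 + rng.normal(scale=1.0, size=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

for degree in [1, 2, 10]:  # too simple, just right, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree={degree}: "
          f"train R^2={model.score(X_train, y_train):.3f}, "
          f"test R^2={model.score(X_test, y_test):.3f}")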
How to Detect
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Train model
model = DecisionTreeRegressor(max_depth=10)
model.fit(X_train, y_train)
# Evaluate
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"Training score: {train_score:.3f}")
print(f"Test score: {test_score:.3f}")
# Diagnosis:
# train=0.99, test=0.98 → Good fit ✓
# train=0.99, test=0.60 → Overfitting ✗ (huge gap)
# train=0.65, test=0.63 → Underfitting ✗ (both low)
Underfitting in Detail
Symptoms
- Low training accuracy
- Low test accuracy
- Model is too simple
- High bias
Causes
- Model too simple (linear for non-linear data)
- Too few features
- Too much regularization
- Not trained long enough
Solutions
# 1. Increase model complexity
# Before: Linear model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
# After: Polynomial features fed into the same linear model
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
model.fit(X_poly, y)
# 2. Add more features (assumes X is a pandas DataFrame with columns 'A' and 'B')
X_new = X.copy()
X_new['feature_interaction'] = X['A'] * X['B']
X_new['feature_squared'] = X['A'] ** 2
# 3. Reduce regularization
from sklearn.linear_model import Ridge
# Before: Strong regularization
model = Ridge(alpha=100)
# After: Weaker regularization
model = Ridge(alpha=0.1)
# 4. Train longer (Keras neural networks; epochs is not an sklearn argument)
model.fit(X_train, y_train, epochs=100)  # Instead of 10
Overfitting in Detail
Symptoms
- Very high training accuracy (>99%)
- Much lower test accuracy
- Large gap between train and test scores
- High variance
Causes
- Model too complex
- Too many features
- Too little training data
- Trained too long
- No regularization
Solutions
1. Get More Data
# More data → harder to memorize → better generalization
# Best solution if possible!
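A quick way to see this effect is to train the same flexible model on a small and a large subset and watch the train/test gap shrink. This is a self-contained sketch; the dataset sizes and the unconstrained decision tree are illustrative choices, not from the original.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same (complex) model, two training-set sizes
X_demo, y_demo = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

for n in [100, 3000]:
    model = DecisionTreeClassifier(random_state=0).fit(X_tr[:n], y_tr[:n])
    gap = model.score(X_tr[:n], y_tr[:n]) - model.score(X_te, y_te)
    print(f"n={n}: train/test gap = {gap:.3f}")  # gap typically shrinks as n grows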
2. Regularization (L1/L2)
# L2 Regularization (Ridge)
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0) # alpha controls strength
# L1 Regularization (Lasso) - also does feature selection
from sklearn.linear_model import Lasso
model = Lasso(alpha=1.0)
# ElasticNet - combines L1 and L2
from sklearn.linear_model import ElasticNet
model = ElasticNet(alpha=1.0, l1_ratio=0.5)
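To see how alpha trades off under- and overfitting, a small sweep with cross-validation works well. A minimal sketch, assuming X and y are already defined; the alpha grid is illustrative (scikit-learn's RidgeCV automates the same idea).

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Too small alpha ≈ no regularization (overfit risk);
# too large alpha ≈ heavy shrinkage (underfit risk)
for alpha in [0.01, 1.0, 100.0]:
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5)
    print(f"alpha={alpha}: mean CV R^2 = {scores.mean():.3f}")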
3. Reduce Model Complexity
# Decision Trees: Limit depth
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(
    max_depth=5,           # Limit tree depth
    min_samples_split=20,  # Require more samples to split
    min_samples_leaf=10    # Require more samples in each leaf
)
# Neural Networks: Fewer layers/neurons
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
model = Sequential([
    Dense(32, activation='relu'),   # Instead of 512
    Dense(16, activation='relu'),   # Simpler architecture
    Dense(1, activation='sigmoid')
])
4. Dropout (Neural Networks)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
model = Sequential([
    Dense(128, activation='relu'),
    Dropout(0.5),   # Randomly zero out 50% of this layer's outputs during training
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(1)
])
5. Early Stopping
from tensorflow.keras.callbacks import EarlyStopping
# Stop training when validation loss stops improving
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,                 # Wait 10 epochs for improvement
    restore_best_weights=True
)
model.fit(X_train, y_train,
          validation_split=0.2,
          epochs=1000,
          callbacks=[early_stop])  # Will stop early if overfitting
6. Cross-Validation
from sklearn.model_selection import cross_val_score
# Evaluate on multiple train/test splits
scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-validation scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std():.3f})")
7. Feature Selection
# Remove irrelevant/redundant features
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(f_classif, k=10) # Keep top 10 features
X_selected = selector.fit_transform(X, y)
# Check feature importance
feature_scores = selector.scores_
print("Feature scores:", feature_scores)
8. Data Augmentation
# For images: rotate, flip, zoom
from tensorflow.keras.preprocessing.image import ImageDataGenerator
datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    horizontal_flip=True
)
# Increases effective training data size
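For completeness, here is an illustrative way to train with the generator; it assumes X_train is an image array of shape (n, height, width, channels), y_train are the labels, and model is a compiled Keras model (none of these are defined in the snippet above).

# Train on augmented batches streamed from the generator
model.fit(datagen.flow(X_train, y_train, batch_size=32), epochs=20)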
The Bias-Variance Tradeoff
Total Error = Bias² + Variance + Irreducible Error
Bias: Error from wrong assumptions (underfitting)
Variance: Error from sensitivity to training data (overfitting)
High Bias, Low Variance → Underfitting
Low Bias, High Variance → Overfitting
Low Bias, Low Variance → Just right!
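The decomposition can be estimated numerically by refitting a model on many independently drawn noisy datasets and measuring, at fixed input points, how far the average prediction is from the truth (bias²) and how much predictions scatter across refits (variance). A minimal simulation sketch; the sample sizes, noise level, and polynomial degrees are illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
x_grid = np.linspace(-3, 3, 50).reshape(-1, 1)
true_y = x_grid[:, 0] ** 2

for degree in [1, 2, 12]:
    preds = []
    for _ in range(200):  # refit on 200 independent noisy samples of the same problem
        X_sim = rng.uniform(-3, 3, size=(30, 1))
        y_sim = X_sim[:, 0] ** 2 + rng.normal(scale=1.0, size=30)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_sim, y_sim)
        preds.append(model.predict(x_grid))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - true_y) ** 2)  # squared bias
    variance = np.mean(preds.var(axis=0))                # variance across refits
    print(f"degree={degree}: bias^2={bias2:.3f}, variance={variance:.3f}")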
Learning Curves
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10)
)
plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Cross-validation score')
plt.xlabel('Training Set Size')
plt.ylabel('Score')
plt.legend()
plt.show()
# Overfitting: Large gap between curves
# Underfitting: Both curves plateau at low value
# Good fit: Curves converge at high value
Validation Curves
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier
param_range = [1, 2, 3, 5, 10, 20, 50]
train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(), X, y,
    param_name='max_depth',
    param_range=param_range,
    cv=5
)
# Plot to find the optimal max_depth
# Sweet spot: where the cross-validation score is highest
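You don't have to read the sweet spot off a plot; the best depth can be pulled straight from the validation-curve scores. A short follow-up sketch reusing param_range and test_scores from the block above.

import numpy as np
# Pick the depth with the best mean cross-validation score
mean_val = test_scores.mean(axis=1)
best_depth = param_range[int(np.argmax(mean_val))]
print(f"Best max_depth: {best_depth} (CV score {mean_val.max():.3f})")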
Real-World Example
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
# Generate data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Test different complexities
for max_depth in [2, 5, 10, 20, None]:
    model = RandomForestClassifier(max_depth=max_depth, random_state=42)
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"max_depth={max_depth}:")
    print(f"  Train: {train_acc:.3f}")
    print(f"  Test:  {test_acc:.3f}")
    print(f"  Gap:   {train_acc - test_acc:.3f}\n")
# Typical output (exact numbers vary with the split):
# max_depth=2:    Train: 0.850, Test: 0.845 (underfitting, low scores)
# max_depth=5:    Train: 0.920, Test: 0.905 (good fit!)
# max_depth=10:   Train: 0.965, Test: 0.900 (slight overfit)
# max_depth=None: Train: 1.000, Test: 0.885 (overfitting, big gap)
Quick Diagnostic Guide
| Train Score | Test Score | Problem | Solution |
|---|---|---|---|
| High (>95%) | Much lower | Overfitting | Regularization, more data, simpler model |
| Low (<80%) | Low (<80%) | Underfitting | More complex model, more features |
| High (90%) | Similar (88%) | Good fit | You're done! |
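The table translates directly into a small helper. This is a rough sketch: the function name and the gap/score thresholds are illustrative, not fixed rules.

def diagnose(train_score, test_score, gap_threshold=0.10, low_threshold=0.80):
    """Rough diagnosis following the table above; thresholds are illustrative."""
    if train_score < low_threshold and test_score < low_threshold:
        return "Underfitting: try a more complex model or more features"
    if train_score - test_score > gap_threshold:
        return "Overfitting: try regularization, more data, or a simpler model"
    return "Good fit"

print(diagnose(0.99, 0.60))  # Overfitting
print(diagnose(0.65, 0.63))  # Underfitting
print(diagnose(0.90, 0.88))  # Good fit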
Best Practices
- Always use a train/validation/test split (see the sketch after this list)
- Start simple, increase complexity gradually
- Monitor both train and test performance
- Use cross-validation for robust evaluation
- Regularize by default (especially with many features)
- More data > better algorithm (usually)
- Plot learning curves to diagnose issues
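Since the examples above only use a two-way train/test split, here is one common way to get a three-way split by applying train_test_split twice; the 60/20/20 proportions are an illustrative choice.

from sklearn.model_selection import train_test_split

# 60% train / 20% validation / 20% test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Tune hyperparameters against (X_val, y_val); report final performance once on (X_test, y_test)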
Key Takeaways:
- Overfitting: Model memorizes, doesn't generalize (high variance)
- Underfitting: Model too simple, misses patterns (high bias)
- Detect: Large gap between train/test scores = overfitting
- Fix overfitting: More data, regularization, simpler model, dropout
- Fix underfitting: More complex model, more features, train longer
- Goal: Find sweet spot with good generalization