Useful Data Tips

Overfitting vs Underfitting

🤖 AI & ML ⏱️ 30 sec read

The Goldilocks Problem

Underfitting: Model too simple, can't capture patterns (high bias)

Overfitting: Model too complex, memorizes training data (high variance)

Just right: Model generalizes well to new data

Simple Analogy

Underfitting: Studying only chapter 1, then taking exam on entire book

Overfitting: Memorizing practice exam answers, failing on new questions

Good fit: Understanding concepts, can apply to new problems

Visual Example

# Data: Points roughly following a curve
# True relationship: y = x^2 + noise

Underfitting:  y = constant        ───────────  (straight line, misses pattern)
Good fit:      y = a*x^2 + b*x + c  ∪∪∪∪∪∪∪  (captures pattern)
Overfitting:   y = polynomial_50   ∿∿∿∿∿∿∿  (fits noise, not pattern)
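
A minimal runnable sketch of the same idea (not from the original tip; plain numpy, with degrees 0, 2, and 9 standing in for the three cases):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = x**2 + rng.normal(scale=1.0, size=x.shape)   # true relationship plus noise

for degree, label in [(0, "underfit (constant)"),
                      (2, "good fit (quadratic)"),
                      (9, "overfit (high-degree)")]:
    coeffs = np.polyfit(x, y, degree)            # least-squares polynomial fit
    mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
    print(f"degree={degree}  training MSE={mse:.2f}  # {label}")

# The high-degree fit gets the lowest *training* error by chasing the noise;
# on fresh points from the same curve it would generalize worse.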

How to Detect

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train model
model = DecisionTreeRegressor(max_depth=10)
model.fit(X_train, y_train)

# Evaluate
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print(f"Training score: {train_score:.3f}")
print(f"Test score: {test_score:.3f}")

# Diagnosis:
# train=0.99, test=0.98 → Good fit ✅
# train=0.99, test=0.60 → Overfitting ❌ (huge gap)
# train=0.65, test=0.63 → Underfitting ❌ (both low)

Underfitting in Detail

Symptoms

Low scores on both training and test data; the model misses even obvious patterns.

Causes

Model too simple, too few or uninformative features, too much regularization, or too little training.

Solutions

# 1. Increase model complexity
# Before: Linear model
from sklearn.linear_model import LinearRegression
model = LinearRegression()

# After: Polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
model.fit(X_poly, y)

# 2. Add more features
X_new = X.copy()
X_new['feature_interaction'] = X['A'] * X['B']
X_new['feature_squared'] = X['A'] ** 2

# 3. Reduce regularization
from sklearn.linear_model import Ridge

# Before: Strong regularization
model = Ridge(alpha=100)

# After: Weaker regularization
model = Ridge(alpha=0.1)

# 4. Train longer (neural networks; `epochs` is a Keras model.fit() argument, not scikit-learn)
model.fit(X_train, y_train, epochs=100)  # Instead of 10

Overfitting in Detail

Symptoms

Near-perfect training score but a much lower test score; performance drops sharply on new data.

Causes

Model too complex for the amount of data, too many features, too few training examples, or training for too long.

Solutions

1. Get More Data

# More data → harder to memorize → better generalization
# Best solution if possible!
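
A rough sketch of why this works, reusing the X_train/X_test split and DecisionTreeRegressor from the detection example above (the subset size of 50 is just illustrative):

for n in (50, len(X_train)):
    model = DecisionTreeRegressor(max_depth=None, random_state=0)  # unconstrained tree
    model.fit(X_train[:n], y_train[:n])
    gap = model.score(X_train[:n], y_train[:n]) - model.score(X_test, y_test)
    print(f"n={n}: train/test gap = {gap:.3f}")  # the gap typically shrinks as n grows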

2. Regularization (L1/L2)

# L2 Regularization (Ridge)
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)  # alpha controls strength

# L1 Regularization (Lasso) - also does feature selection
from sklearn.linear_model import Lasso
model = Lasso(alpha=1.0)

# ElasticNet - combines L1 and L2
from sklearn.linear_model import ElasticNet
model = ElasticNet(alpha=1.0, l1_ratio=0.5)

3. Reduce Model Complexity

# Decision Trees: Limit depth
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    max_depth=5,           # Limit tree depth
    min_samples_split=20,  # Require more samples to split
    min_samples_leaf=10    # Require more samples in leaf
)

# Neural Networks: Fewer layers/neurons
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(32, activation='relu'),  # Instead of 512
    Dense(16, activation='relu'),  # Simpler architecture
    Dense(1, activation='sigmoid')
])

4. Dropout (Neural Networks)

from tensorflow.keras.layers import Dropout

model = Sequential([
    Dense(128, activation='relu'),
    Dropout(0.5),  # Randomly drop 50% of neurons during training
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(1)
])

5. Early Stopping

from tensorflow.keras.callbacks import EarlyStopping

# Stop training when validation loss stops improving
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,       # Wait 10 epochs for improvement
    restore_best_weights=True
)

model.fit(X_train, y_train,
          validation_split=0.2,
          epochs=1000,
          callbacks=[early_stop])  # Will stop early if overfitting

6. Cross-Validation

from sklearn.model_selection import cross_val_score

# Evaluate on multiple train/test splits
scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-validation scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std():.3f})")

7. Feature Selection

# Remove irrelevant/redundant features
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=10)  # Keep top 10 features
X_selected = selector.fit_transform(X, y)

# Check feature importance
feature_scores = selector.scores_
print("Feature scores:", feature_scores)

8. Data Augmentation

# For images: rotate, flip, zoom
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    horizontal_flip=True
)

# Increases effective training data size
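
A usage sketch: here X_train is assumed to be an array of images shaped (n, height, width, channels), y_train its labels, model a compiled Keras network, and X_val/y_val a hypothetical held-out set.

model.fit(datagen.flow(X_train, y_train, batch_size=32),
          validation_data=(X_val, y_val),
          epochs=20)
# Each epoch sees freshly rotated/shifted/flipped copies of the images,
# so the network cannot simply memorize the original pixels.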

The Bias-Variance Tradeoff

Total Error = Bias² + Variance + Irreducible Error

Bias: Error from wrong assumptions (underfitting)
Variance: Error from sensitivity to training data (overfitting)

High Bias, Low Variance  → Underfitting
Low Bias, High Variance  → Overfitting
Low Bias, Low Variance   → Just right!
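
The decomposition can be estimated empirically: refit a model on many resampled training sets, then measure how far the average prediction sits from the truth (bias²) and how much the individual fits scatter around that average (variance). A hedged sketch on the y = x² + noise toy data, with degree 1 vs. degree 9 polynomials standing in for simple vs. flexible models:

import numpy as np

rng = np.random.default_rng(0)
x_grid = np.linspace(-3, 3, 50)
true_y = x_grid**2

for degree in (1, 9):
    preds = []
    for _ in range(200):                                   # 200 independent training sets
        x = rng.uniform(-3, 3, 40)
        y = x**2 + rng.normal(scale=1.0, size=x.shape)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_grid))
    preds = np.array(preds)
    bias_sq = np.mean((preds.mean(axis=0) - true_y) ** 2)   # squared error of the average fit
    variance = np.mean(preds.var(axis=0))                   # spread of fits around their average
    print(f"degree={degree}: bias^2={bias_sq:.2f}, variance={variance:.2f}")

# Expect degree 1: high bias^2, low variance (underfits);
#        degree 9: low bias^2, higher variance (prone to overfit).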

Learning Curves

from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import numpy as np

train_sizes, train_scores, test_scores = learning_curve(
    model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10)
)

plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
plt.plot(train_sizes, test_scores.mean(axis=1), label='Test score')
plt.xlabel('Training Set Size')
plt.ylabel('Score')
plt.legend()
plt.show()

# Overfitting: Large gap between curves
# Underfitting: Both curves plateau at low value
# Good fit: Curves converge at high value

Validation Curves

from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

param_range = [1, 2, 3, 5, 10, 20, 50]
train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(), X, y,
    param_name='max_depth',
    param_range=param_range,
    cv=5
)

# Plot to find optimal max_depth
# Sweet spot where test score is highest
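
A short plotting sketch to go with the call above (assumes matplotlib is imported as plt, as in the learning-curve example):

plt.plot(param_range, train_scores.mean(axis=1), marker='o', label='Training score')
plt.plot(param_range, test_scores.mean(axis=1), marker='o', label='Cross-validation score')
plt.xlabel('max_depth')
plt.ylabel('Score')
plt.legend()
plt.show()

best_depth = param_range[test_scores.mean(axis=1).argmax()]
print(f"Best max_depth by cross-validation score: {best_depth}")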

Real-World Example

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Generate data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Test different complexities
for max_depth in [2, 5, 10, 20, None]:
    model = RandomForestClassifier(max_depth=max_depth, random_state=42)
    model.fit(X_train, y_train)

    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)

    print(f"max_depth={max_depth}:")
    print(f"  Train: {train_acc:.3f}")
    print(f"  Test:  {test_acc:.3f}")
    print(f"  Gap:   {train_acc - test_acc:.3f}\n")

# Output:
# max_depth=2:   Train: 0.850, Test: 0.845 (underfitting, low scores)
# max_depth=5:   Train: 0.920, Test: 0.905 (good fit!)
# max_depth=10:  Train: 0.965, Test: 0.900 (slight overfit)
# max_depth=None: Train: 1.000, Test: 0.885 (overfitting, big gap)

Quick Diagnostic Guide

Train Score    Test Score       Problem        Solution
High (>95%)    Much lower       Overfitting    Regularization, more data, simpler model
Low (<80%)     Low (<80%)       Underfitting   More complex model, more features
High (90%)     Similar (88%)    Good fit       You're done!
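
The table boils down to two comparisons, which can be wrapped in a tiny helper (the 0.10 gap and 0.80 floor below are rough rules of thumb, not hard cutoffs):

def diagnose(train_score, test_score, max_gap=0.10, low_floor=0.80):
    """Rough diagnosis from train/test scores."""
    if train_score - test_score > max_gap:
        return "Overfitting: regularize, simplify the model, or get more data"
    if train_score < low_floor and test_score < low_floor:
        return "Underfitting: use a more complex model or add features"
    return "Good fit"

print(diagnose(0.99, 0.60))  # Overfitting
print(diagnose(0.65, 0.63))  # Underfitting
print(diagnose(0.92, 0.90))  # Good fit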

Best Practices

Key Takeaways:

  • Overfitting: Model memorizes, doesn't generalize (high variance)
  • Underfitting: Model too simple, misses patterns (high bias)
  • Detect: Large gap between train/test scores = overfitting
  • Fix overfitting: More data, regularization, simpler model, dropout
  • Fix underfitting: More complex model, more features, train longer
  • Goal: Find sweet spot with good generalization