Handling Missing Data
What is Missing Data?
Missing data (NaN, NULL, or blank entries) occurs when no value is stored for a variable in an observation. It is common in real-world datasets, and how you handle it directly affects the accuracy of your analysis.
Step 1: Detect Missing Data
import pandas as pd
import numpy as np
# Load data
df = pd.read_csv('data.csv')
# Check for missing values
print(df.isnull().sum())
# Output:
# age 5
# income 12
# gender 0
# Percentage missing
print(df.isnull().sum() / len(df) * 100)
# age 2.5%
# income 6.0%
# Visualize missing data
import missingno as msno
msno.matrix(df) # Shows pattern of missingness
msno.heatmap(df) # Shows correlation between missing values
Types of Missing Data
1. MCAR (Missing Completely At Random)
Missingness is totally random, unrelated to any variables.
# Example: Survey respondents randomly skip questions
# No pattern, no bias
# Safe to: Delete rows or simple imputation
2. MAR (Missing At Random)
Missingness depends on observed data, not the missing value itself.
# Example: Older people more likely to skip income question
# Missing income relates to age (which we have), not to income value
# Solution: Use other variables to predict missing values
3. MNAR (Missing Not At Random)
Missingness depends on the unobserved value.
# Example: People with high income more likely to skip income question
# Missing relates to the value itself (which we don't have)
# Problem: Hard to handle, may need domain knowledge
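To make the three mechanisms concrete, here is a minimal sketch on synthetic data (the age/income setup below is illustrative, not the dataset from Step 1) that simulates each type of missingness:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
age = rng.integers(20, 70, n)
income = 20000 + 800 * age + rng.normal(0, 5000, n)
df_sim = pd.DataFrame({'age': age, 'income': income})

# MCAR: every income value has the same 10% chance of being missing
mcar = df_sim.copy()
mcar.loc[rng.random(n) < 0.10, 'income'] = np.nan

# MAR: older respondents are more likely to skip the income question
mar = df_sim.copy()
p_skip = (mar['age'] - 20) / 100   # probability grows with the observed age
mar.loc[rng.random(n) < p_skip, 'income'] = np.nan

# MNAR: high earners are more likely to skip the income question
mnar = df_sim.copy()
p_skip = (mnar['income'] > mnar['income'].quantile(0.8)) * 0.5
mnar.loc[rng.random(n) < p_skip, 'income'] = np.nan

print(mcar['income'].isnull().mean(), mar['income'].isnull().mean(), mnar['income'].isnull().mean())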
Method 1: Delete Missing Data
Delete Rows (Listwise Deletion)
# Remove any row with at least one missing value
df_clean = df.dropna()
# Remove rows where specific column is missing
df_clean = df.dropna(subset=['age'])
# Remove rows where ALL values are missing
df_clean = df.dropna(how='all')
# Remove rows with too many missing values
df_clean = df.dropna(thresh=3) # Keep rows with at least 3 non-null values
# When to use:
# ✅ MCAR data
# ✅ Small % missing (<5%)
# ✅ Large dataset (can afford to lose rows)
# ❌ Large % missing (lose too much data)
# ❌ MAR or MNAR (biased results)
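Before committing to listwise deletion, it is worth checking how much data you would actually lose; a quick sketch using the df loaded in Step 1:
# How many rows (and what share of the data) would dropna() remove?
n_before = len(df)
n_after = len(df.dropna())
print(f"Dropping rows with any NaN removes {n_before - n_after} rows "
      f"({(n_before - n_after) / n_before:.1%} of the data)")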
Delete Columns
# Remove columns with >50% missing
threshold = 0.5
df_clean = df.dropna(axis=1, thresh=int(len(df) * threshold))  # keep columns with at least 50% non-null values
# When to use:
# ✅ Column has too much missing data (>50%)
# ✅ Column not critical for analysis
Method 2: Simple Imputation
Mean/Median/Mode
# Mean imputation (for numerical data)
df['age'] = df['age'].fillna(df['age'].mean())
# Median imputation (better with outliers)
df['income'] = df['income'].fillna(df['income'].median())
# Mode imputation (for categorical data)
df['gender'] = df['gender'].fillna(df['gender'].mode()[0])
# Forward fill (use previous value)
df['price'] = df['price'].ffill()
# Backward fill (use next value)
df['price'] = df['price'].bfill()
# Constant value
df['discount'] = df['discount'].fillna(0)
# When to use:
# ✅ MCAR data
# ✅ Quick solution
# ❌ Can reduce variance
# ❌ Can bias correlations
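The same mean/median/mode strategies are also available through scikit-learn's SimpleImputer, which is convenient when the imputation has to be fitted on the training split only (see Mistake 1 below). A minimal sketch, assuming 'age' and 'income' are numeric and 'gender' is categorical:
from sklearn.impute import SimpleImputer

# Median for numeric columns (robust to outliers)
num_imputer = SimpleImputer(strategy='median')
df[['age', 'income']] = num_imputer.fit_transform(df[['age', 'income']])

# Most frequent value (mode) for categorical columns
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['gender']] = cat_imputer.fit_transform(df[['gender']])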
Method 3: Advanced Imputation
K-Nearest Neighbors (KNN)
from sklearn.impute import KNNImputer
# Use similar rows to predict missing values
# Note: KNNImputer only handles numeric columns, so encode or drop
# categorical columns first
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns
)
# How it works:
# 1. Find 5 most similar rows (by other features)
# 2. Average their values for missing field
# 3. Use that as the imputed value
Iterative Imputer (MICE)
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Model each feature as function of others
imputer = IterativeImputer(random_state=0)
df_imputed = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns
)
# How it works:
# 1. Model each column based on all other columns
# 2. Iteratively improve predictions
# 3. Best for MAR data
Predictive Modeling
from sklearn.ensemble import RandomForestRegressor
# Train model to predict missing values
# Use rows without missing data as training set
train = df[df['income'].notnull()]
test = df[df['income'].isnull()]
# Features to use for prediction (these columns must themselves be complete)
features = ['age', 'education', 'experience']
# Train model
model = RandomForestRegressor()
model.fit(train[features], train['income'])
# Predict missing values
predicted_income = model.predict(test[features])
df.loc[df['income'].isnull(), 'income'] = predicted_income
Method 4: Flag Missing Data
# Create indicator column
df['income_missing'] = df['income'].isnull().astype(int)
# Then impute the actual column
df['income'] = df['income'].fillna(df['income'].median())
# Why: Preserves information that value was missing
# Useful if missingness is informative
Choosing the Right Method
| Scenario | Method | Why |
|---|---|---|
| < 5% missing, MCAR | Delete rows | Simple, no bias |
| Numerical, MCAR | Mean/Median | Fast, reasonable |
| Categorical | Mode | Most common value |
| Time series | Forward/Backward fill | Use adjacent values |
| MAR, relationships exist | KNN or MICE | Use other features |
| MNAR | Flag + Impute | Preserve missingness info |
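As a rough illustration of the table, here is a hypothetical helper (suggest_strategy is not a library function) that proposes a starting point from the missing fraction and the column dtype; the missingness mechanism still has to be judged from domain knowledge:
def suggest_strategy(s, assume_mechanism='MCAR'):
    """Very rough first-pass suggestion; not a substitute for judgment."""
    frac_missing = s.isnull().mean()
    if frac_missing == 0:
        return 'nothing to do'
    if assume_mechanism == 'MNAR':
        return 'flag + impute (and consult domain knowledge)'
    if frac_missing < 0.05 and assume_mechanism == 'MCAR':
        return 'drop rows'
    if s.dtype.kind in 'biufc':  # numeric dtypes
        return 'median (simple) or KNN/MICE (if MAR)'
    return 'mode or a dedicated "Unknown" category'

for col in df.columns:
    print(col, '->', suggest_strategy(df[col]))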
Complete Pandas Example
import pandas as pd
import numpy as np
# Sample data with missing values
df = pd.DataFrame({
    'age': [25, 30, np.nan, 35, 40],
    'income': [50000, np.nan, 60000, np.nan, 80000],
    'city': ['NYC', 'LA', np.nan, 'Chicago', 'Boston']
})
print("Missing values:")
print(df.isnull().sum())
# Strategy 1: Drop rows with any missing
df1 = df.dropna()
# Strategy 2: Fill numerical with median
df2 = df.copy()
df2['age'] = df2['age'].fillna(df2['age'].median())
df2['income'] = df2['income'].fillna(df2['income'].median())
# Strategy 3: Fill categorical with mode
df2['city'] = df2['city'].fillna(df2['city'].mode()[0])
# Strategy 4: Multiple methods per column
df3 = df.copy()
df3['age'] = df3['age'].fillna(df3['age'].mean())
df3['income'] = df3['income'].interpolate() # Linear interpolation
df3['city'] = df3['city'].fillna('Unknown')
print("\nAfter imputation:")
print(df3)
Handling Missing Data in SQL
-- Find rows with missing values
SELECT * FROM users WHERE age IS NULL;
-- Count missing values
SELECT COUNT(*) - COUNT(age) AS missing_age
FROM users;
-- Impute with the average
-- (note: MySQL does not allow the target table in an UPDATE subquery;
--  wrap the subquery in a derived table there)
UPDATE users
SET age = (SELECT AVG(age) FROM users WHERE age IS NOT NULL)
WHERE age IS NULL;
-- Coalesce (use first non-null value)
SELECT COALESCE(income, 0) AS income_clean
FROM users;
Common Mistakes
Mistake 1: Imputing Before Train/Test Split
from sklearn.model_selection import train_test_split
# WRONG: impute on the full dataset before splitting
df['age'] = df['age'].fillna(df['age'].mean())  # uses information from the test rows!
train, test = train_test_split(df)
# RIGHT: split first, then impute with statistics computed on the training set only
train, test = train_test_split(df)
train_mean = train['age'].mean()
train['age'] = train['age'].fillna(train_mean)
test['age'] = test['age'].fillna(train_mean)  # use the TRAIN mean
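A cleaner way to guarantee that imputation statistics come only from the training data is to put the imputer inside a scikit-learn Pipeline, so fit only ever sees the training split. A sketch (the feature and target column names are illustrative):
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

X = df[['age', 'income']]   # illustrative feature columns
y = df['target']            # illustrative target column

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('model', LinearRegression()),
])
pipe.fit(X_train, y_train)   # imputer statistics computed on X_train only
print(pipe.score(X_test, y_test))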
Mistake 2: Imputing with Mean in Skewed Data
# Data: [10, 12, 11, 13, 1000, NaN]
# Mean: 209 (pulled up by outlier 1000)
# WRONG:
df['value'] = df['value'].fillna(df['value'].mean()) # 209 is wrong!
# RIGHT:
df['value'] = df['value'].fillna(df['value'].median()) # 12 makes sense
Mistake 3: Not Documenting Imputation
# Always document what you did:
# - Which columns had missing data
# - What % was missing
# - Which method you used
# - Why you chose that method
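One lightweight way to do this is to keep the decisions in a small dictionary (or a YAML/JSON file) next to the code; the structure below is just a suggestion, and the percentages should be recorded before imputing, while the NaNs are still present:
imputation_log = {
    'age': {
        'pct_missing': round(df['age'].isnull().mean() * 100, 1),
        'method': 'median',
        'reason': 'numeric, outliers present, assumed MCAR',
    },
    'income': {
        'pct_missing': round(df['income'].isnull().mean() * 100, 1),
        'method': 'flag + median',
        'reason': 'missingness suspected to be informative (MNAR)',
    },
}
print(imputation_log)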
Best Practices
- Understand WHY data is missing before choosing method
- Document your approach - reproducibility matters
- Visualize missingness patterns (missingno library)
- Try multiple approaches and compare results
- Consider domain knowledge - sometimes 0 != missing
- Impute after split (train/test) to avoid data leakage
Key Takeaways:
- Always detect and understand missing data first
- Delete if <5% missing and MCAR
- Mean/median for simple numerical imputation
- KNN/MICE for sophisticated imputation (MAR)
- Flag + impute when missingness is informative
- Never impute before train/test split
- Document your approach for reproducibility