Handling Missing Data
What is Missing Data?
Missing data (NaN, NULL, or blank entries) occurs when no value is stored for a variable in an observation. It is common in real-world datasets, and how you handle it directly affects the accuracy of your analysis.
Step 1: Detect Missing Data
import pandas as pd
import numpy as np
# Load data
df = pd.read_csv('data.csv')
# Check for missing values
print(df.isnull().sum())
# Output:
# age 5
# income 12
# gender 0
# Percentage missing
print(df.isnull().sum() / len(df) * 100)
# age 2.5%
# income 6.0%
# Visualize missing data
import missingno as msno
msno.matrix(df) # Shows pattern of missingness
msno.heatmap(df) # Shows correlation between missing values
Types of Missing Data
1. MCAR (Missing Completely At Random)
Missingness is totally random, unrelated to any variables.
# Example: Survey respondents randomly skip questions
# No pattern, no bias
# Safe to: Delete rows or simple imputation
2. MAR (Missing At Random)
Missingness depends on observed data, not the missing value itself.
# Example: Older people more likely to skip income question
# Missing income relates to age (which we have), not to income value
# Solution: Use other variables to predict missing values
3. MNAR (Missing Not At Random)
Missingness depends on the unobserved value.
# Example: People with high income more likely to skip income question
# Missing relates to the value itself (which we don't have)
# Problem: Hard to handle, may need domain knowledge
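To make the three mechanisms concrete, here is a minimal sketch on synthetic data (the age/income setup below is illustrative, not the dataset from Step 1) that simulates each type of missingness:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
age = rng.integers(20, 70, n)
income = 20000 + 800 * age + rng.normal(0, 5000, n)
df_sim = pd.DataFrame({'age': age, 'income': income})

# MCAR: every income value has the same 10% chance of being missing
mcar = df_sim.copy()
mcar.loc[rng.random(n) < 0.10, 'income'] = np.nan

# MAR: older respondents are more likely to skip the income question
mar = df_sim.copy()
p_skip = (mar['age'] - 20) / 100   # probability grows with the observed age
mar.loc[rng.random(n) < p_skip, 'income'] = np.nan

# MNAR: high earners are more likely to skip the income question
mnar = df_sim.copy()
p_skip = (mnar['income'] > mnar['income'].quantile(0.8)) * 0.5
mnar.loc[rng.random(n) < p_skip, 'income'] = np.nan

print(mcar['income'].isnull().mean(), mar['income'].isnull().mean(), mnar['income'].isnull().mean())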
Method 1: Delete Missing Data
Delete Rows (Listwise Deletion)
# Remove any row with at least one missing value
df_clean = df.dropna()
# Remove rows where specific column is missing
df_clean = df.dropna(subset=['age'])
# Remove rows where ALL values are missing
df_clean = df.dropna(how='all')
# Remove rows with too many missing values
df_clean = df.dropna(thresh=3) # Keep rows with at least 3 non-null values
# When to use:
# ✅ MCAR data
# ✅ Small % missing (<5%)
# ✅ Large dataset (can afford to lose rows)
# ❌ Large % missing (lose too much data)
# ❌ MAR or MNAR (biased results)
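Before committing to listwise deletion, it is worth checking how much data you would actually lose; a quick sketch using the df loaded in Step 1:
# How many rows (and what share of the data) would dropna() remove?
n_before = len(df)
n_after = len(df.dropna())
print(f"Dropping rows with any NaN removes {n_before - n_after} rows "
      f"({(n_before - n_after) / n_before:.1%} of the data)")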
Delete Columns
# Remove columns with >50% missing
threshold = 0.5
df_clean = df.dropna(axis=1, thresh=int(len(df) * threshold))  # keep columns with at least 50% non-null values
# When to use:
# ✅ Column has too much missing data (>50%)
# ✅ Column not critical for analysis
Method 2: Simple Imputation
Mean/Median/Mode
# Mean imputation (for numerical data)
df['age'] = df['age'].fillna(df['age'].mean())
# Median imputation (better with outliers)
df['income'] = df['income'].fillna(df['income'].median())
# Mode imputation (for categorical data)
df['gender'] = df['gender'].fillna(df['gender'].mode()[0])
# Forward fill (use previous value)
df['price'] = df['price'].ffill()
# Backward fill (use next value)
df['price'] = df['price'].bfill()
# Constant value
df['discount'] = df['discount'].fillna(0)
# When to use:
# ✅ MCAR data
# ✅ Quick solution
# ❌ Can reduce variance
# ❌ Can bias correlations
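The same mean/median/mode strategies are also available through scikit-learn's SimpleImputer, which is convenient when the imputation has to be fitted on the training split only (see Mistake 1 below). A minimal sketch, assuming 'age' and 'income' are numeric and 'gender' is categorical:
from sklearn.impute import SimpleImputer

# Median for numeric columns (robust to outliers)
num_imputer = SimpleImputer(strategy='median')
df[['age', 'income']] = num_imputer.fit_transform(df[['age', 'income']])

# Most frequent value (mode) for categorical columns
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['gender']] = cat_imputer.fit_transform(df[['gender']])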
Method 3: Advanced Imputation
K-Nearest Neighbors (KNN)
from sklearn.impute import KNNImputer
# Use similar rows to predict missing values
# Note: KNNImputer only handles numeric columns, so encode or drop
# categorical columns first
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns
)
# How it works:
# 1. Find 5 most similar rows (by other features)
# 2. Average their values for missing field
# 3. Use that as the imputed value
Iterative Imputer (MICE)
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Model each feature as function of others
imputer = IterativeImputer(random_state=0)
df_imputed = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns
)
# How it works:
# 1. Model each column based on all other columns
# 2. Iteratively improve predictions
# 3. Best for MAR data
Predictive Modeling
from sklearn.ensemble import RandomForestRegressor
# Train model to predict missing values
# Use rows without missing data as training set
train = df[df['income'].notnull()]
test = df[df['income'].isnull()]
# Features to use for prediction (these columns must themselves be complete)
features = ['age', 'education', 'experience']
# Train model
model = RandomForestRegressor()
model.fit(train[features], train['income'])
# Predict missing values
predicted_income = model.predict(test[features])
df.loc[df['income'].isnull(), 'income'] = predicted_income
Method 4: Flag Missing Data
# Create indicator column
df['income_missing'] = df['income'].isnull().astype(int)
# Then impute the actual column
df['income'] = df['income'].fillna(df['income'].median())
# Why: Preserves information that value was missing
# Useful if missingness is informative
Choosing the Right Method
| Scenario | Method | Why |
|---|---|---|
| < 5% missing, MCAR | Delete rows | Simple, no bias |
| Numerical, MCAR | Mean/Median | Fast, reasonable |
| Categorical | Mode | Most common value |
| Time series | Forward/Backward fill | Use adjacent values |
| MAR, relationships exist | KNN or MICE | Use other features |
| MNAR | Flag + Impute | Preserve missingness info |
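As a rough illustration of the table, here is a hypothetical helper (suggest_strategy is not a library function) that proposes a starting point from the missing fraction and the column dtype; the missingness mechanism still has to be judged from domain knowledge:
def suggest_strategy(s, assume_mechanism='MCAR'):
    """Very rough first-pass suggestion; not a substitute for judgment."""
    frac_missing = s.isnull().mean()
    if frac_missing == 0:
        return 'nothing to do'
    if assume_mechanism == 'MNAR':
        return 'flag + impute (and consult domain knowledge)'
    if frac_missing < 0.05 and assume_mechanism == 'MCAR':
        return 'drop rows'
    if s.dtype.kind in 'biufc':  # numeric dtypes
        return 'median (simple) or KNN/MICE (if MAR)'
    return 'mode or a dedicated "Unknown" category'

for col in df.columns:
    print(col, '->', suggest_strategy(df[col]))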
Complete Pandas Example
import pandas as pd
import numpy as np
# Sample data with missing values
df = pd.DataFrame({
    'age': [25, 30, np.nan, 35, 40],
    'income': [50000, np.nan, 60000, np.nan, 80000],
    'city': ['NYC', 'LA', np.nan, 'Chicago', 'Boston']
})
print("Missing values:")
print(df.isnull().sum())
# Strategy 1: Drop rows with any missing
df1 = df.dropna()
# Strategy 2: Fill numerical with median
df2 = df.copy()
df2['age'] = df2['age'].fillna(df2['age'].median())
df2['income'] = df2['income'].fillna(df2['income'].median())
# Strategy 3: Fill categorical with mode
df2['city'] = df2['city'].fillna(df2['city'].mode()[0])
# Strategy 4: Multiple methods per column
df3 = df.copy()
df3['age'] = df3['age'].fillna(df3['age'].mean())
df3['income'] = df3['income'].interpolate() # Linear interpolation
df3['city'] = df3['city'].fillna('Unknown')
print("\nAfter imputation:")
print(df3)
Handling Missing Data in SQL
-- Find rows with missing values
SELECT * FROM users WHERE age IS NULL;
-- Count missing values
SELECT COUNT(*) - COUNT(age) AS missing_age
FROM users;
-- Impute with the average
-- (note: MySQL does not allow the target table in an UPDATE subquery;
--  wrap the subquery in a derived table there)
UPDATE users
SET age = (SELECT AVG(age) FROM users WHERE age IS NOT NULL)
WHERE age IS NULL;
-- Coalesce (use first non-null value)
SELECT COALESCE(income, 0) AS income_clean
FROM users;
Common Mistakes
Mistake 1: Imputing Before Train/Test Split
from sklearn.model_selection import train_test_split
# WRONG: impute on the full dataset before splitting
df['age'] = df['age'].fillna(df['age'].mean())  # uses information from the test rows!
train, test = train_test_split(df)
# RIGHT: split first, then impute with statistics computed on the training set only
train, test = train_test_split(df)
train_mean = train['age'].mean()
train['age'] = train['age'].fillna(train_mean)
test['age'] = test['age'].fillna(train_mean)  # use the TRAIN mean
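A cleaner way to guarantee that imputation statistics come only from the training data is to put the imputer inside a scikit-learn Pipeline, so fit only ever sees the training split. A sketch (the feature and target column names are illustrative):
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

X = df[['age', 'income']]   # illustrative feature columns
y = df['target']            # illustrative target column

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('model', LinearRegression()),
])
pipe.fit(X_train, y_train)   # imputer statistics computed on X_train only
print(pipe.score(X_test, y_test))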
Mistake 2: Imputing with Mean in Skewed Data
# Data: [10, 12, 11, 13, 1000, NaN]
# Mean: 209 (pulled up by outlier 1000)
# WRONG:
df['value'] = df['value'].fillna(df['value'].mean()) # 209 is wrong!
# RIGHT:
df['value'] = df['value'].fillna(df['value'].median()) # 12 makes sense
Mistake 3: Not Documenting Imputation
# Always document what you did:
# - Which columns had missing data
# - What % was missing
# - Which method you used
# - Why you chose that method
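One lightweight way to do this is to keep the decisions in a small dictionary (or a YAML/JSON file) next to the code; the structure below is just a suggestion, and the percentages should be recorded before imputing, while the NaNs are still present:
imputation_log = {
    'age': {
        'pct_missing': round(df['age'].isnull().mean() * 100, 1),
        'method': 'median',
        'reason': 'numeric, outliers present, assumed MCAR',
    },
    'income': {
        'pct_missing': round(df['income'].isnull().mean() * 100, 1),
        'method': 'flag + median',
        'reason': 'missingness suspected to be informative (MNAR)',
    },
}
print(imputation_log)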
Best Practices
- Understand WHY data is missing before choosing method
- Document your approach - reproducibility matters
- Visualize missingness patterns (missingno library)
- Try multiple approaches and compare results
- Consider domain knowledge - sometimes 0 != missing
- Impute after split (train/test) to avoid data leakage
Key Takeaways:
- Always detect and understand missing data first
- Delete if <5% missing and MCAR
- Mean/median for simple numerical imputation
- KNN/MICE for sophisticated imputation (MAR)
- Flag + impute when missingness is informative
- Never impute before train/test split
- Document your approach for reproducibility