Missing Data Handling Strategies
First: Understand WHY Data Is Missing
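MCAR (Missing Completely at Random): Missingness is unrelated to any variable, observed or not
- Example: Random sensor failures
- Deleting rows does not bias estimates (it only costs sample size), and simple imputation is usually acceptable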
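MAR (Missing at Random): Missingness depends on other observed variables, not on the missing value itself
- Example: Younger people skip income questions more
- Imputation methods that use those related variables (regression, multiple imputation) can correct for it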
MNAR (Missing Not at Random): Missingness depends on the value that is missing
- Example: High earners refuse to report income
- Hardest case - deletion causes bias, and standard imputation can too, because the observed data alone cannot reveal the pattern
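One way to start the investigation is to compare missing rates across groups of an observed variable. The sketch below uses pandas on a small made-up survey table; the column names `age_group` and `income` and all values are illustrative assumptions, not real data.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: income is missing more often in the younger group.
df = pd.DataFrame({
    "age_group": ["18-29", "18-29", "18-29", "18-29", "30+", "30+", "30+", "30+"],
    "income": [np.nan, np.nan, 32000, np.nan, 54000, 61000, np.nan, 48000],
})

# Share of missing values in each column.
missing_share = df.isna().mean()

# Missing rate of income within each age group.
missing_by_group = df["income"].isna().groupby(df["age_group"]).mean()

print(missing_share)
print(missing_by_group)
```

A clear difference between groups argues against MCAR and is consistent with MAR. A check like this cannot detect MNAR, because that pattern depends on values that were never recorded.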
Handling Strategies
1. Deletion
Listwise deletion: Remove entire row if any value missing
When to use: Less than 5% missing, MCAR data, large dataset
Risk: Reduces sample size, introduces bias if not MCAR
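Listwise deletion can be sketched in pandas with `dropna()`. The data below is made up for illustration; measuring how many rows would be lost before dropping them is an added precaution, not a required step.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two columns.
df = pd.DataFrame({
    "age": [25, 31, np.nan, 47, 52],
    "income": [32000, np.nan, 41000, 58000, 61000],
})

# Fraction of rows that contain at least one missing value.
rows_with_missing = df.isna().any(axis=1).mean()

# Listwise deletion: keep only rows where every column is present.
complete_cases = df.dropna()

print(f"{rows_with_missing:.0%} of rows would be dropped")
print(complete_cases)
```

In this toy table two of five rows are lost, far more than either column's individual missing rate suggests, which is why listwise deletion gets expensive as the number of columns grows.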
2. Simple Imputation
Mean/median/mode: Replace missing values with a single typical value (mean or median for numeric data, mode for categorical data)
- ✅ Quick and simple
- ❌ Reduces variance
- ❌ Ignores relationships between variables
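A minimal pandas sketch of simple imputation, using a made-up skewed `income` series (the values are illustrative assumptions). It also shows the variance shrinkage listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed income data with two missing entries.
income = pd.Series([30000, 35000, np.nan, 40000, np.nan, 250000], name="income")

# pandas skips NaN when computing the median and mean.
median_filled = income.fillna(income.median())
mean_filled = income.fillna(income.mean())

print("median used:", income.median())
print("mean used:", income.mean())
print("std before imputation:", income.std())
print("std after mean imputation:", mean_filled.std())
```

Every filled entry sits exactly at the center of the distribution, so the spread after imputation is smaller than the spread of the observed values.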
3. Advanced Imputation
Multiple imputation: Generate several plausible values for each missing entry, analyze each completed dataset, and pool the results
Regression imputation: Predict missing values from other variables
When to use: More than 5% missing, important analysis
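Regression imputation can be sketched with a straight-line fit from NumPy. The predictor `years_experience`, the data, and the choice of a simple linear model are all assumptions made for illustration; a real analysis would pick predictors and a model that suit the data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: income is missing for the last two rows.
df = pd.DataFrame({
    "years_experience": [1, 3, 5, 7, 9, 4, 8],
    "income": [30000, 40000, 50000, 60000, 70000, np.nan, np.nan],
})

# Fit income ~ years_experience on the rows where income is observed.
observed = df["income"].notna()
slope, intercept = np.polyfit(
    df.loc[observed, "years_experience"],
    df.loc[observed, "income"],
    deg=1,
)

# Predict for every row, then use the predictions only where income is missing.
predicted = intercept + slope * df["years_experience"]
df["income_imputed"] = df["income"].fillna(predicted)

print(df)
```

Filling in the fitted value alone places every imputed point exactly on the regression line, which overstates how strongly the variables are related. Multiple imputation addresses this by drawing several plausible values per gap instead of one.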
4. Flag as Missing
Create indicator variable: "income_missing" (0/1)
When to use: Missingness itself is informative
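A short pandas sketch of the indicator approach, on made-up data. Filling the gaps with the median afterwards is an illustrative choice; the key point is that the flag is created before any imputation overwrites the gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing incomes.
df = pd.DataFrame({"income": [52000, np.nan, 61000, np.nan]})

# Record which rows were missing BEFORE filling anything in.
df["income_missing"] = df["income"].isna().astype(int)

# Optional: impute afterwards so models can still use the column.
df["income"] = df["income"].fillna(df["income"].median())

print(df)
```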
Common Mistakes
❌ Replacing missing with 0 - Zero is a value, not "unknown"
❌ Imputing without investigating why - Pattern might be important
❌ Using mean for skewed data - Use median instead
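To see why filling with 0 is a problem, the sketch below compares the mean of a made-up `income` series computed two ways (the values are illustrative assumptions).

```python
import numpy as np
import pandas as pd

# Hypothetical income data with two unknown values.
income = pd.Series([40000, np.nan, 50000, np.nan, 60000])

# pandas ignores NaN by default, so this averages the three known values.
mean_ignoring_missing = income.mean()

# Filling with 0 treats "unknown" as "earns nothing" and drags the mean down.
mean_zero_filled = income.fillna(0).mean()

print(mean_ignoring_missing)
print(mean_zero_filled)
```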
Best practice: Check missingness patterns FIRST. If more than 40% of a variable is missing, question whether you should use that variable at all.
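The 40% check can be sketched in pandas as below. The table and column names are made up, and the threshold is the rule of thumb from the text, not a hard statistical cutoff.

```python
import numpy as np
import pandas as pd

# Hypothetical table where income is mostly missing.
df = pd.DataFrame({
    "age": [25, 31, 47, 52, 38],
    "income": [32000, np.nan, np.nan, np.nan, 61000],
    "region": ["N", "S", None, "E", "W"],
})

# Fraction of missing values per column.
missing_share = df.isna().mean()

# Columns above the 40% rule of thumb deserve a closer look before use.
too_sparse = missing_share[missing_share > 0.40].index.tolist()

print(missing_share)
print("Question these columns:", too_sparse)
```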