Missing Data Handling Strategies

⏱️ 30 sec read 🔧 Data Cleaning

First: Understand WHY Data Is Missing

MCAR (Missing Completely at Random): Missingness is unrelated to any values, observed or unobserved

MAR (Missing at Random): Missingness is related to other observed variables, but not to the missing value itself

MNAR (Missing Not at Random): Missingness related to the missing value itself
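
A rough first check is to compare the missingness rate of one variable across the levels of another. The sketch below uses made-up survey data; the column names `age_group` and `income` are assumptions for illustration. Clearly different rates across groups suggest the data are not MCAR, but this check cannot separate MAR from MNAR, since that depends on values you never observed.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: income is missing more often for younger respondents.
df = pd.DataFrame({
    "age_group": ["18-29", "18-29", "18-29", "18-29",
                  "30-49", "30-49", "30-49", "30-49"],
    "income": [np.nan, np.nan, 32000, np.nan,
               54000, 61000, np.nan, 58000],
})

# Share of missing income within each age group.
missing_rate_by_group = df["income"].isna().groupby(df["age_group"]).mean()
print(missing_rate_by_group)
```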

Listwise deletion: Remove entire row if any value missing

When to use: Less than 5% missing, MCAR data, large dataset

Risk: Reduces sample size, introduces bias if not MCAR
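
A minimal pandas sketch of listwise deletion, using made-up data. It reports how many rows are affected before dropping them, so the cost in sample size is visible. This toy table is far above the 5% guideline and is only meant to show the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two columns.
df = pd.DataFrame({
    "age": [25, 31, np.nan, 47, 52],
    "income": [40000, np.nan, 52000, 61000, 58000],
    "score": [0.7, 0.4, 0.9, 0.5, 0.6],
})

# Fraction of rows with at least one missing value.
rows_with_missing = df.isna().any(axis=1).mean()

# Listwise deletion: keep only rows where every column is present.
complete = df.dropna()

print(f"{rows_with_missing:.0%} of rows affected; "
      f"{len(complete)} of {len(df)} rows kept")
```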

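Mean/median/mode: Replace missing values with a typical value (mean or median for numeric data, mode for categorical data)

Multiple imputation: Generate several plausible values for each missing entry, analyze each completed dataset, then pool the results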

Regression imputation: Predict missing values from other variables

When to use: More than 5% missing, or an analysis where bias from deletion would matter
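
The simple and regression approaches above can be sketched as follows. The data and column names (`years_experience`, `income`, `region`) are invented for illustration, and the regression is a plain one-predictor least-squares line fitted with NumPy, not a recommendation of a specific model. Multiple imputation is not shown; note that a single regression fill like this one treats predictions as if they were observed, which understates uncertainty.

```python
import numpy as np
import pandas as pd

# Hypothetical data: income and region each have gaps.
df = pd.DataFrame({
    "years_experience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "income": [30000.0, 35000.0, np.nan, 45000.0, np.nan, 55000.0],
    "region": ["north", "south", None, "north", "north", "south"],
})

# Simple imputation: median for a numeric column, mode for a categorical one.
income_median = df["income"].fillna(df["income"].median())
region_mode = df["region"].fillna(df["region"].mode().iloc[0])

# Regression imputation: fit income ~ years_experience on the observed rows,
# then use the fitted line to predict income where it is missing.
observed = df["income"].notna()
slope, intercept = np.polyfit(
    df.loc[observed, "years_experience"],
    df.loc[observed, "income"],
    deg=1,
)
predicted = intercept + slope * df["years_experience"]
income_regression = df["income"].fillna(predicted)

print(pd.DataFrame({
    "median_fill": income_median,
    "regression_fill": income_regression,
    "region_fill": region_mode,
}))
```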

Create indicator variable: "income_missing" (0/1)

When to use: Missingness itself is informative
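
A short sketch of the indicator approach, on made-up data. The flag has to be created before any imputation, otherwise the information about which values were missing is lost. The median fill afterwards is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing incomes.
df = pd.DataFrame({"income": [40000.0, np.nan, 52000.0, np.nan, 58000.0]})

# Record missingness BEFORE filling anything in.
df["income_missing"] = df["income"].isna().astype(int)

# Then impute, so downstream models see both the value and the flag.
df["income"] = df["income"].fillna(df["income"].median())

print(df)
```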

❌ Replacing missing with 0 - Zero is a value, not "unknown"

❌ Imputing without investigating why - Pattern might be important

❌ Using mean for skewed data - Use median instead
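
To see why the first and third mistakes matter, the sketch below compares fill values on a small, deliberately skewed, made-up income series with one extreme value.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed incomes: one very large value and one missing value.
income = pd.Series([30000, 32000, 35000, 38000, 1_000_000, np.nan])

mean_fill = income.mean()      # pulled far upward by the outlier
median_fill = income.median()  # stays near the typical respondent

# Filling with 0 treats "unknown" as "earns nothing" and drags the mean down.
zero_filled_mean = income.fillna(0).mean()

print(f"mean fill:   {mean_fill:,.0f}")
print(f"median fill: {median_fill:,.0f}")
print(f"mean after zero-fill: {zero_filled_mean:,.0f}")
```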

Best practice: Check missingness patterns FIRST. If more than 40% of a variable is missing, question whether you should use that variable at all.
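
One way to apply this check is to compute the share of missing values per column and list the columns above the 40% mark. The data and the `referral_code` column below are invented; the threshold is the rule of thumb from this card, not a hard limit.

```python
import numpy as np
import pandas as pd

# Hypothetical data: referral_code is mostly empty.
df = pd.DataFrame({
    "age": [25, 31, 40, 47, 52],
    "income": [40000, np.nan, 52000, 61000, 58000],
    "referral_code": [np.nan, np.nan, "A1", np.nan, np.nan],
})

# Share of missing values per column, highest first.
missing_share = df.isna().mean().sort_values(ascending=False)

# Columns above the 40% rule of thumb deserve a second look.
flagged = missing_share[missing_share > 0.40].index.tolist()

print(missing_share)
print("Question whether to use:", flagged)
```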