5 Descriptive Statistics You Should Always Check First
Before diving into complex analysis, always start with these 5 statistics. They reveal data quality issues, outliers, and distribution shape.
1. Mean vs Median - Detect Skewness
Mean: Average of all values (sensitive to outliers)
Median: Middle value (robust to outliers)
# Python
df['salary'].mean() # 75,000
df['salary'].median() # 58,000
What this tells you:
- Mean ≈ Median → Roughly symmetrical distribution
- Mean > Median → Right-skewed (few high outliers pulling mean up)
- Mean < Median → Left-skewed (few low outliers pulling mean down)
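A minimal sketch of the comparison above, using a small made-up salary sample (the values are hypothetical, not taken from any real dataset). It also calls pandas' `Series.skew()`, which returns a positive number when the sample is right-skewed:

```python
import pandas as pd

# Hypothetical salary data: most values are moderate, one is very high.
salaries = pd.Series([40_000, 45_000, 50_000, 55_000, 60_000, 250_000])

mean_salary = salaries.mean()      # pulled upward by the 250,000 value
median_salary = salaries.median()  # midpoint of the two middle values
skewness = salaries.skew()         # sample skewness; positive means right-skewed

print(f"mean={mean_salary:.0f}  median={median_salary:.0f}  skew={skewness:.2f}")
```

Here the single large salary lifts the mean well above the median, which is the "Mean > Median" case from the list.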
2. Standard Deviation - Measure Spread
How much do values vary from the mean?
# Python
df['age'].std() # 12.5
# SQL (function name and sample-vs-population behavior vary by dialect; SQL Server uses STDEV)
SELECT STDDEV(age) FROM users;
Rules of thumb (these hold only for roughly normal, bell-shaped data):
- ~68% of data within 1 std dev of mean
- ~95% within 2 std devs
- ~99.7% within 3 std devs
High std dev = lots of variability. Low std dev = values clustered near mean.
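One way to see whether the 68/95/99.7 rule of thumb fits a column is to measure the share of values within 1, 2, and 3 standard deviations directly. The sketch below uses simulated, normally distributed ages (an assumption made for illustration) and a hypothetical helper named `share_within`. On skewed real data the shares can differ noticeably from the rule.

```python
import numpy as np
import pandas as pd

# Simulated ages drawn from a normal distribution (illustrative only).
rng = np.random.default_rng(0)
ages = pd.Series(rng.normal(loc=35, scale=12, size=10_000))

def share_within(series, k):
    """Fraction of values within k standard deviations of the mean."""
    distance = (series - series.mean()).abs()
    return (distance <= k * series.std()).mean()

for k in (1, 2, 3):
    print(f"within {k} std dev: {share_within(ages, k):.1%}")

# pandas defaults to the sample standard deviation (ddof=1);
# pass ddof=0 for the population version.
sample_std = ages.std()
population_std = ages.std(ddof=0)
```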
3. Min and Max - Spot Data Issues
df['age'].min() # 0 ← Suspicious! Baby customers?
df['age'].max() # 150 ← Data entry error!
Check for:
- Impossible values (negative ages, future dates)
- Placeholder values (-1, 999, 0)
- Data entry errors (150-year-old person)
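The checks above can be turned into a filter that lists the offending rows. This is a sketch under stated assumptions: the plausible age range (18 to 110) and the placeholder list are invented for this example and should be replaced with limits that make sense for your data.

```python
import pandas as pd

# Hypothetical ages, including a few suspicious entries.
df = pd.DataFrame({"age": [34, 0, 27, -1, 150, 999, 41]})

# Assumed limits and placeholder codes; adjust for your own dataset.
MIN_AGE, MAX_AGE = 18, 110
PLACEHOLDERS = [-1, 0, 999]

out_of_range = ~df["age"].between(MIN_AGE, MAX_AGE)  # between() is inclusive
is_placeholder = df["age"].isin(PLACEHOLDERS)

suspicious = df[out_of_range | is_placeholder]
print(suspicious)
```

Inspecting the flagged rows, rather than dropping them right away, helps tell a data entry error apart from a legitimate edge case.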
4. Percentiles (25th, 75th) - Understand Distribution
Quartiles split the sorted data into four equal-sized groups:
# Python
df['price'].quantile([0.25, 0.5, 0.75])
# Result:
0.25 19.99 ← 25th percentile (Q1)
0.50 39.99 ← 50th percentile (median)
0.75 79.99 ← 75th percentile (Q3)
Interquartile Range (IQR) = Q3 - Q1
The IQR is the spread of the middle 50% of the data. Unlike the full range (max - min), it is barely affected by outliers.
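The IQR is often used to flag outliers with the 1.5 × IQR convention (Tukey's fences). That rule is a common heuristic and is not part of the checklist above. The sketch below uses hypothetical prices:

```python
import pandas as pd

# Hypothetical prices with one extreme value.
prices = pd.Series([9.99, 19.99, 24.99, 39.99, 59.99, 79.99, 499.99])

q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1

# Tukey's fences: values beyond 1.5 * IQR from the quartiles are flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = prices[(prices < lower_fence) | (prices > upper_fence)]
print(f"IQR={iqr:.2f}  fences=({lower_fence:.2f}, {upper_fence:.2f})")
print(outliers)
```

The multiplier 1.5 is a convention rather than a law. A flagged value deserves a closer look, not automatic removal.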
5. Count and Missing Values
# Python
df['email'].count() # 8,500 non-missing values
df['email'].isna().sum() # 1,500 missing
# Percentage missing
df['email'].isna().mean() * 100 # 15% missing
Questions to ask:
- Why are values missing? (Random or pattern?)
- Can I fill them or should I drop them?
- Is missing data informative? (e.g., "no response" is meaningful)
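To probe the "random or pattern?" question, one option is to compare the missing rate across groups of another column. The column `signup_source` and the rows below are hypothetical, chosen only to illustrate the idea:

```python
import pandas as pd

# Hypothetical users; emails are missing only for some app signups.
df = pd.DataFrame({
    "signup_source": ["web", "web", "web", "app", "app", "app"],
    "email": ["a@example.com", "b@example.com", "c@example.com",
              None, None, "f@example.com"],
})

# Share of missing emails within each signup source.
missing_by_source = df["email"].isna().groupby(df["signup_source"]).mean()
print(missing_by_source)
```

A large gap between groups suggests the values are not missing at random, which affects whether filling or dropping them is safe.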
Quick Analysis Pattern
# Python: Get all at once
df.describe()
# Output:
           age    salary  experience
count  10000.0   10000.0     10000.0
mean      35.5   75000.0         8.2
std       12.3   45000.0         5.1
min       18.0   25000.0         0.0
25%       27.0   45000.0         4.0
50%       34.0   58000.0         7.0
75%       43.0   95000.0        12.0
max       65.0  250000.0        30.0
Red Flags to Look For
- 🚩 Mean >> Median → Check for outliers on high end
- 🚩 Min/Max unrealistic → Data quality issues
- 🚩 Std dev > Mean (for non-negative data) → Huge variability or outliers
- 🚩 High % missing → Investigate data collection
- 🚩 Count differs by column → Missing values are concentrated in some columns
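Some of these red flags can be checked automatically. The function below is a rough sketch, not a standard pandas feature: the name `red_flags` and both thresholds (`skew_ratio`, `missing_pct`) are assumptions chosen for illustration and should be tuned to your data.

```python
import pandas as pd

def red_flags(df, skew_ratio=1.2, missing_pct=10.0):
    """Return (column, message) pairs for a few simple warning signs."""
    flags = []
    for col in df.select_dtypes(include="number").columns:
        mean, median, std = df[col].mean(), df[col].median(), df[col].std()
        # Mean far above median: possible high-end outliers.
        if median > 0 and mean / median > skew_ratio:
            flags.append((col, "mean >> median"))
        # Std dev larger than the mean: only meaningful for positive data.
        if mean > 0 and std > mean:
            flags.append((col, "std dev > mean"))
    for col in df.columns:
        pct = df[col].isna().mean() * 100
        if pct > missing_pct:
            flags.append((col, f"{pct:.0f}% missing"))
    return flags

# Hypothetical example data.
example = pd.DataFrame({
    "salary": [40_000, 45_000, 50_000, 55_000, 1_000_000],
    "age": [25, 31, None, None, 47],
})
print(red_flags(example))
```

Unrealistic min/max values are left out on purpose, because sensible limits depend on what each column means.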
Golden Rule: Never skip descriptive statistics. Five minutes of checking these stats can save hours of debugging bad analysis later.