5 Descriptive Statistics You Should Always Check First


Before diving into complex analysis, always start with these 5 statistics. They reveal data quality issues, outliers, and distribution shape.

1. Mean vs Median - Detect Skewness

Mean: Average of all values (sensitive to outliers)

Median: Middle value (robust to outliers)

# Python
df['salary'].mean()   # 75,000
df['salary'].median() # 58,000

What this tells you:

Mean ≈ Median → Roughly symmetric distribution
Mean > Median → Right-skewed (a few high values pull the mean up)
Mean < Median → Left-skewed (a few low values pull the mean down)
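
The comparison above can be sketched as a small helper. This is a minimal illustration using only the standard library; the function name `skew_hint` and the 5% `tolerance` threshold are assumptions made for this example, not an established cutoff.

```python
from statistics import mean, median

def skew_hint(values, tolerance=0.05):
    """Compare mean and median and return a rough skew label.

    `tolerance` is a hypothetical relative threshold for "about equal".
    """
    m, med = mean(values), median(values)
    scale = max(abs(m), abs(med)) or 1  # avoid dividing by zero
    gap = (m - med) / scale
    if abs(gap) <= tolerance:
        return "roughly symmetric"
    return "right-skewed" if gap > 0 else "left-skewed"

# Made-up salaries: one very high value pulls the mean far above the median
salaries = [40_000, 45_000, 50_000, 55_000, 60_000, 400_000]
print(mean(salaries), median(salaries))
print(skew_hint(salaries))  # right-skewed
```

Treat the label as a prompt to plot the data, not as a formal skewness test.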

2. Standard Deviation - Measure Spread

How much do values vary from the mean?

# Python
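df['age'].std()  # 12.5 (sample std dev, ddof=1 by default)

# SQL (function name and sample-vs-population default vary by dialect)
SELECT STDDEV(age) FROM users;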

Rules of thumb (for roughly normal, bell-shaped data):

~68% of data within 1 std dev of mean
~95% within 2 std devs
~99.7% within 3 std devs

High std dev = lots of variability. Low std dev = values clustered near mean.
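
To see how closely a column follows the 68/95/99.7 pattern, you can measure the share of values within k standard deviations of the mean. A minimal sketch, assuming a plain list of numbers; `share_within` is a hypothetical helper name and the ages are made up.

```python
from statistics import mean, stdev

def share_within(values, k):
    """Fraction of values within k sample standard deviations of the mean."""
    m, s = mean(values), stdev(values)  # stdev is the sample std dev, like pandas .std()
    return sum(abs(v - m) <= k * s for v in values) / len(values)

# Made-up ages with one high outlier
ages = [22, 25, 27, 29, 30, 31, 33, 35, 38, 70]
for k in (1, 2, 3):
    print(k, share_within(ages, k))
```

Here 90% of values fall within 1 std dev rather than ~68%, a hint that the data is not bell-shaped and that the outlier is inflating the spread.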

3. Min and Max - Spot Data Issues

df['age'].min()  # 0  ← Suspicious! Baby customers?
df['age'].max()  # 150 ← Data entry error!

Check for:

Impossible values (negative ages, future dates)
Placeholder values (-1, 999, 0)
Data entry errors (150-year-old person)
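
These checks can be automated once you decide what a plausible range looks like. The sketch below is illustrative only: the function name, the 18 to 110 age bounds, and the placeholder codes are assumptions you would replace with rules from your own domain.

```python
def flag_suspect_values(values, low, high, placeholders=(-1, 999)):
    """Separate known placeholder codes from other values outside [low, high]."""
    return {
        "placeholder": [v for v in values if v in placeholders],
        "out_of_range": [
            v for v in values
            if v not in placeholders and not (low <= v <= high)
        ],
    }

# Made-up ages containing a zero, an impossible age, and two placeholder codes
ages = [34, 0, 27, 150, -1, 45, 999]
print(min(ages), max(ages))
print(flag_suspect_values(ages, low=18, high=110))
```

Keeping placeholders separate from other out-of-range values matters because they usually mean "unknown" and belong with missing data, not with typos.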

4. Percentiles (25th, 75th) - Understand Distribution

Quartiles split data into 4 equal parts:

# Python
df['price'].quantile([0.25, 0.5, 0.75])

# Result:
0.25    19.99  ← 25th percentile (Q1)
0.50    39.99  ← 50th percentile (median)
0.75    79.99  ← 75th percentile (Q3)

Interquartile Range (IQR) = Q3 - Q1

IQR tells you the range of the middle 50% of data (less affected by outliers than range).
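
A small worked example shows why IQR is more robust than the full range. This sketch uses the standard library's `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between data points; the prices are made up.

```python
from statistics import quantiles

# Made-up prices with one extreme value
prices = [5, 10, 20, 30, 40, 60, 80, 100, 900]

# n=4 returns the three cut points: Q1, median, Q3
q1, q2, q3 = quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1
full_range = max(prices) - min(prices)

print(q1, q2, q3)       # 20.0 40.0 80.0
print(iqr, full_range)  # 60.0 895
```

The single 900 value stretches the range to 895, while the IQR stays at 60 because it only looks at the middle half of the data.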

5. Count and Missing Values

# Python
df['email'].count()        # 8,500 non-missing values
df['email'].isna().sum()   # 1,500 missing

# Percentage missing
df['email'].isna().mean() * 100  # 15% missing

Questions to ask:

Why are values missing? (Random or pattern?)
Can I fill them or should I drop them?
Is missing data informative? (e.g., "no response" is meaningful)
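
One way to approach the "random or pattern?" question is to compare the missing rate across groups. A minimal sketch, assuming rows are dictionaries and missing values are stored as `None`; the `channel` field and the sample rows are hypothetical.

```python
from collections import defaultdict

def missing_rate_by_group(rows, value_key, group_key):
    """Share of rows per group where `value_key` is missing (None or absent)."""
    totals, missing = defaultdict(int), defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row.get(value_key) is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}

# Made-up signups: emails are missing mostly for in-store customers
rows = [
    {"channel": "web", "email": "a@example.com"},
    {"channel": "web", "email": "b@example.com"},
    {"channel": "web", "email": None},
    {"channel": "web", "email": "c@example.com"},
    {"channel": "store", "email": None},
    {"channel": "store", "email": None},
    {"channel": "store", "email": "d@example.com"},
    {"channel": "store", "email": None},
]
rates = missing_rate_by_group(rows, "email", "channel")
print(rates)  # {'web': 0.25, 'store': 0.75}
```

A large gap between groups suggests the values are not missing at random, so dropping those rows would bias the sample toward one group.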

Quick Analysis Pattern

# Python: Get all at once
df.describe()

# Output:
            age    salary  experience
count   10000.0   10000.0     10000.0
mean       35.5   75000.0         8.2
std        12.3   45000.0         5.1
min        18.0   25000.0         0.0
25%        27.0   45000.0         4.0
50%        34.0   58000.0         7.0
75%        43.0   95000.0        12.0
max        65.0  250000.0        30.0

Red Flags to Look For

🚩 Mean >> Median → Check for outliers on high end
🚩 Min/Max unrealistic → Data quality issues
🚩 Std dev > Mean (for non-negative data) → Huge variability or outliers
🚩 High % missing → Investigate data collection
🚩 Count differs by column → Inconsistent missing data
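
The single-column flags above can be bundled into one check. This is a rough sketch, not a standard routine: the function name, the 1.5x mean-to-median ratio, and the 10% missing threshold are assumptions chosen for illustration, and the cross-column count comparison is left out.

```python
from statistics import mean, median, stdev

def red_flags(values, low, high, max_missing=0.10):
    """Return red-flag labels for one numeric column (None = missing)."""
    present = [v for v in values if v is not None]
    m, med, s = mean(present), median(present), stdev(present)
    flags = []
    if med > 0 and m > 1.5 * med:  # hypothetical cutoff for "mean >> median"
        flags.append("mean >> median")
    if min(present) < low or max(present) > high:
        flags.append("min/max unrealistic")
    if m > 0 and s > m:
        flags.append("std dev > mean")
    if 1 - len(present) / len(values) > max_missing:
        flags.append("high % missing")
    return flags

# Made-up salaries with two missing values and one extreme entry
salary = [30_000, 35_000, 40_000, 45_000, None, None, 2_000_000]
print(red_flags(salary, low=0, high=1_000_000))
```

On this sample all four flags fire; a clean column such as `[30, 32, 34, 36, 38]` with bounds 18 to 65 returns an empty list.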

Golden Rule: Never skip descriptive statistics. Five minutes of checking these stats can save hours of debugging bad analysis later.
