Useful Data Tips

Standard Deviation Explained

What is Standard Deviation?

Standard deviation measures how spread out data is from the mean. Low standard deviation means data is clustered near the mean; high standard deviation means data is spread out.

Simple Explanation

Imagine two classes with the same average test score of 75:

Same mean, very different distributions!
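
As a rough illustration, the sketch below uses two made-up score lists (hypothetical values, chosen only so that both average 75) and Python's standard `statistics` module to show how the spread can differ while the mean stays the same.

```python
import statistics

# Hypothetical scores, invented for illustration; both classes average 75
class_a = [73, 74, 75, 76, 77]    # scores bunched tightly around 75
class_b = [50, 60, 75, 90, 100]   # scores spread far from 75

for name, scores in [("Class A", class_a), ("Class B", class_b)]:
    mean = statistics.mean(scores)
    spread = statistics.pstdev(scores)  # population standard deviation
    print(f"{name}: mean = {mean}, std dev = {spread:.2f}")
```

With these invented numbers, Class A comes out near 1.41 and Class B near 18.44, even though the averages are identical.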

The Formula

Population Standard Deviation (σ) = √(Σ(x - μ)² / N)

Where:
- x = each value
- μ = mean of all values
- N = number of values
- Σ = sum of

Step-by-Step Calculation

Data: [4, 8, 6, 5, 3]

Step 1: Calculate mean
mean = (4 + 8 + 6 + 5 + 3) / 5 = 5.2

Step 2: Find differences from mean
4 - 5.2 = -1.2
8 - 5.2 = 2.8
6 - 5.2 = 0.8
5 - 5.2 = -0.2
3 - 5.2 = -2.2

Step 3: Square the differences
(-1.2)² = 1.44
(2.8)²  = 7.84
(0.8)²  = 0.64
(-0.2)² = 0.04
(-2.2)² = 4.84

Step 4: Calculate variance (average of squared differences)
variance = (1.44 + 7.84 + 0.64 + 0.04 + 4.84) / 5 = 2.96

Step 5: Take square root
standard deviation = √2.96 ≈ 1.72

Python Implementation

import numpy as np

data = [4, 8, 6, 5, 3]

# Method 1: NumPy (easiest)
std_dev = np.std(data)  # Population std dev
print(f"Standard deviation: {std_dev:.2f}")  # 1.72

# Method 2: Sample standard deviation (use ddof=1)
std_dev_sample = np.std(data, ddof=1)  # 1.92

# Method 3: Manual calculation
mean = np.mean(data)
squared_diffs = [(x - mean)**2 for x in data]
variance = sum(squared_diffs) / len(data)
std_dev_manual = variance ** 0.5

# Also get variance
variance = np.var(data)  # 2.96
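
If NumPy is not available, the same numbers can probably be reproduced with Python's built-in `statistics` module. This is a minimal sketch of that alternative, not a replacement for the NumPy calls above; the variable names are only illustrative.

```python
import statistics

data = [4, 8, 6, 5, 3]

pop_std = statistics.pstdev(data)      # population std dev: divides by N
sample_std = statistics.stdev(data)    # sample std dev: divides by N - 1
pop_var = statistics.pvariance(data)   # population variance

print(f"{pop_std:.2f} {sample_std:.2f} {pop_var:.2f}")  # 1.72 1.92 2.96
```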

Population vs Sample Standard Deviation

# NumPy defaults to population (N, ddof=0)
np.std(data)           # Population (N)
np.std(data, ddof=1)   # Sample (N-1)

# Pandas defaults to sample (N-1, ddof=1)
df['column'].std()        # Sample standard deviation
df['column'].std(ddof=0)  # Population standard deviation
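
To see what the choice of divisor actually changes, here is a small sketch with a hand-written helper. The `std_dev` function and its `ddof` parameter are hypothetical, named only to mirror NumPy's argument. The two results differ by a factor of √(N / (N-1)), so the gap matters mostly for small datasets.

```python
import math

def std_dev(values, ddof=0):
    """Standard deviation with divisor N - ddof (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n
    squared_diffs = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_diffs / (n - ddof))

data = [4, 8, 6, 5, 3]
population = std_dev(data)       # divide by N     -> about 1.72
sample = std_dev(data, ddof=1)   # divide by N - 1 -> about 1.92

# The ratio sample / population is sqrt(N / (N - 1)), which approaches 1 as N grows
for n in (5, 50, 500):
    print(n, round(math.sqrt(n / (n - 1)), 4))
```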

Interpreting Standard Deviation

Real-World Examples

# Example 1: Quality control
widget_weights = [100.1, 99.9, 100.0, 100.2, 99.8]
std = np.std(widget_weights)  # 0.14
# Low std dev = consistent manufacturing

# Example 2: Stock volatility
stock_returns = [0.05, -0.03, 0.08, -0.12, 0.15, -0.08]
volatility = np.std(stock_returns)  # 0.094
# High std dev = risky stock

# Example 3: Outlier detection
data = [10, 12, 11, 13, 12, 85]  # 85 is outlier
mean = np.mean(data)
std = np.std(data)
# Values beyond mean ± 2*std are often outliers
outliers = [x for x in data if abs(x - mean) > 2*std]
print(outliers)  # [85]
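
One caveat with this rule is that the outlier itself inflates both the mean and the standard deviation. In the six-value example, 85 sits only about 2.2 standard deviations from the mean, so a stricter cutoff would miss it. The sketch below shows this with a hypothetical `flag_outliers` helper, where the multiplier `k` is an assumed parameter rather than anything from the original example.

```python
import statistics

def flag_outliers(values, k=2.0):
    """Return values more than k population std devs from the mean (hypothetical helper)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [x for x in values if abs(x - mean) > k * std]

data = [10, 12, 11, 13, 12, 85]
print(flag_outliers(data, k=2))  # [85]
print(flag_outliers(data, k=3))  # []
```

For small samples like this one, a threshold of 2 is therefore a more practical starting point than 3.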

SQL Implementation

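-- Calculate standard deviation and variance
-- (function names vary by database; SQL Server, for example, uses STDEVP, STDEV, and VARP)
SELECT
    STDDEV_POP(salary) AS pop_std_dev,      -- Population
    STDDEV_SAMP(salary) AS sample_std_dev,  -- Sample
    VAR_POP(salary) AS pop_variance         -- Population variance
FROM employees;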

Excel Formulas

=STDEV.P(A1:A100)  -- Population standard deviation
=STDEV.S(A1:A100)  -- Sample standard deviation (most common)
=VAR.P(A1:A100)    -- Population variance
=VAR.S(A1:A100)    -- Sample variance

Relationship: Variance vs Standard Deviation

# Both measure spread, but std dev is more interpretable
data = [4, 8, 6, 5, 3]
variance = np.var(data)       # 2.96 (squared units)
std_dev = np.std(data)        # 1.72 (original units)
std_dev = np.sqrt(variance)   # Same value: std dev is the square root of variance
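
The "squared units" point can be checked by changing the unit of measurement. In this sketch the values are treated as hypothetical lengths in metres (an assumption made only for illustration) and converted to centimetres. The standard deviation scales with the unit, while the variance scales with its square.

```python
import statistics

data_m = [4, 8, 6, 5, 3]               # hypothetical lengths in metres
data_cm = [x * 100 for x in data_m]    # the same lengths in centimetres

std_ratio = statistics.pstdev(data_cm) / statistics.pstdev(data_m)
var_ratio = statistics.pvariance(data_cm) / statistics.pvariance(data_m)

print(f"std dev scales by {std_ratio:.0f}x")    # 100x
print(f"variance scales by {var_ratio:.0f}x")   # 10000x
```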

Using Standard Deviation

Key Takeaways:

  • Standard deviation measures spread around the mean
  • Low std dev: Data clustered near mean (predictable)
  • High std dev: Data spread out (variable)
  • Use sample std dev (N-1) when your data is a sample from a larger population (the most common case)
  • Always report std dev alongside the mean for a complete picture