Standard Deviation Explained
What is Standard Deviation?
Standard deviation measures how spread out data is from the mean. Low standard deviation means data is clustered near the mean; high standard deviation means data is spread out.
Simple Explanation
Imagine two classes with the same average test score of 75:
- Class A scores: [73, 74, 75, 76, 77] → Standard deviation: 1.58 (tight cluster)
- Class B scores: [50, 60, 75, 90, 100] → Standard deviation: 20.62 (wide spread)
Same mean, very different distributions!
The Formula
Standard Deviation (σ) = √(Σ(x - μ)² / N)
Where:
- x = each value
- μ = mean of all values
- N = number of values
- Σ = sum of
Step-by-Step Calculation
Data: [4, 8, 6, 5, 3]
Step 1: Calculate mean
mean = (4 + 8 + 6 + 5 + 3) / 5 = 5.2
Step 2: Find differences from mean
4 - 5.2 = -1.2
8 - 5.2 = 2.8
6 - 5.2 = 0.8
5 - 5.2 = -0.2
3 - 5.2 = -2.2
Step 3: Square the differences
(-1.2)² = 1.44
(2.8)² = 7.84
(0.8)² = 0.64
(-0.2)² = 0.04
(-2.2)² = 4.84
Step 4: Calculate variance (average of squared differences)
variance = (1.44 + 7.84 + 0.64 + 0.04 + 4.84) / 5 = 2.96
Step 5: Take square root
standard deviation = √2.96 = 1.72
Python Implementation
import numpy as np
data = [4, 8, 6, 5, 3]
# Method 1: NumPy (easiest)
std_dev = np.std(data) # Population std dev
print(f"Standard deviation: {std_dev:.2f}") # 1.72
# Method 2: Sample standard deviation (use ddof=1)
std_dev_sample = np.std(data, ddof=1) # 1.92
# Method 3: Manual calculation
mean = np.mean(data)
squared_diffs = [(x - mean)**2 for x in data]
variance = sum(squared_diffs) / len(data)
std_dev_manual = variance ** 0.5
# Also get variance
variance = np.var(data) # 2.96
Population vs Sample Standard Deviation
- Population (σ): Divide by N (when you have all data)
- Sample (s): Divide by N-1 (when you have a sample)
# Python
np.std(data) # Population (N)
np.std(data, ddof=1) # Sample (N-1)
# Pandas defaults to sample (N-1)
df['column'].std() # Sample standard deviation
Interpreting Standard Deviation
- Low std dev (e.g., 0.5): Data is consistent and predictable
- Medium std dev: Some variability, but not extreme
- High std dev (e.g., 50): Data is scattered and unpredictable
Real-World Examples
# Example 1: Quality control
widget_weights = [100.1, 99.9, 100.0, 100.2, 99.8]
std = np.std(widget_weights) # 0.14
# Low std dev = consistent manufacturing
# Example 2: Stock volatility
stock_returns = [0.05, -0.03, 0.08, -0.12, 0.15, -0.08]
volatility = np.std(stock_returns) # 0.094
# High std dev = risky stock
# Example 3: Outlier detection
data = [10, 12, 11, 13, 12, 85] # 85 is outlier
mean = np.mean(data)
std = np.std(data)
# Values beyond mean ± 2*std are often outliers
outliers = [x for x in data if abs(x - mean) > 2*std]
print(outliers) # [85]
SQL Implementation
-- Calculate standard deviation and variance
SELECT
STDDEV_POP(salary) as pop_std_dev, -- Population
STDDEV_SAMP(salary) as sample_std_dev, -- Sample
VAR_POP(salary) as variance
FROM employees;
Excel Formulas
=STDEV.P(A1:A100) -- Population standard deviation
=STDEV.S(A1:A100) -- Sample standard deviation (most common)
=VAR.P(A1:A100) -- Population variance
=VAR.S(A1:A100) -- Sample variance
Relationship: Variance vs Standard Deviation
- Variance: Average of squared differences (σ²)
- Standard Deviation: Square root of variance (σ)
- Standard deviation is in the same units as your data (easier to interpret)
# Both measure spread, but std dev is more interpretable
variance = np.var(data) # 2.96 (squared units)
std_dev = np.std(data) # 1.72 (original units)
std_dev = np.sqrt(variance) # Same thing
Using Standard Deviation
- Comparing variability: Which dataset is more consistent?
- Outlier detection: Flag values > 2 or 3 standard deviations from mean
- Confidence intervals: Mean ± 1.96*std (95% CI for normal data)
- Quality control: Monitor if process stays within acceptable range
- Risk assessment: Higher std dev = higher uncertainty
Key Takeaways:
- Standard deviation measures spread around the mean
- Low std dev: Data clustered near mean (predictable)
- High std dev: Data spread out (variable)
- Use sample std dev (N-1) for sample data (most common)
- Always report std dev alongside the mean for complete picture