Standard Deviation Explained

📈 Data Analysis ⏱️ 30 sec read

What is Standard Deviation?

Standard deviation measures how spread out data is from the mean. Low standard deviation means data is clustered near the mean; high standard deviation means data is spread out.

Simple Explanation

Imagine two classes with the same average test score of 75:

Class A scores: [73, 74, 75, 76, 77] → Standard deviation: 1.58 (tight cluster)
Class B scores: [50, 60, 75, 90, 100] → Standard deviation: 20.62 (wide spread)

Same mean, very different distributions!

The Formula

Standard Deviation (σ) = √(Σ(x - μ)² / N)

Where:
- x = each value
- μ = mean of all values
- N = number of values
- Σ = sum of

Step-by-Step Calculation

Data: [4, 8, 6, 5, 3]

Step 1: Calculate mean
mean = (4 + 8 + 6 + 5 + 3) / 5 = 5.2

Step 2: Find differences from mean
4 - 5.2 = -1.2
8 - 5.2 = 2.8
6 - 5.2 = 0.8
5 - 5.2 = -0.2
3 - 5.2 = -2.2

Step 3: Square the differences
(-1.2)² = 1.44
(2.8)²  = 7.84
(0.8)²  = 0.64
(-0.2)² = 0.04
(-2.2)² = 4.84

Step 4: Calculate variance (average of squared differences)
variance = (1.44 + 7.84 + 0.64 + 0.04 + 4.84) / 5 = 2.96

Step 5: Take square root
standard deviation = √2.96 = 1.72

Python Implementation

import numpy as np

data = [4, 8, 6, 5, 3]

# Method 1: NumPy (easiest)
std_dev = np.std(data)  # Population std dev
print(f"Standard deviation: {std_dev:.2f}")  # 1.72

# Method 2: Sample standard deviation (use ddof=1)
std_dev_sample = np.std(data, ddof=1)  # 1.92

# Method 3: Manual calculation
mean = np.mean(data)
squared_diffs = [(x - mean)**2 for x in data]
variance = sum(squared_diffs) / len(data)
std_dev_manual = variance ** 0.5

# Also get variance
variance = np.var(data)  # 2.96

Population vs Sample Standard Deviation

Population (σ): Divide by N (when you have all data)
Sample (s): Divide by N-1 (when you have a sample)

# Python
np.std(data)           # Population (N)
np.std(data, ddof=1)   # Sample (N-1)

# Pandas defaults to sample (N-1)
df['column'].std()     # Sample standard deviation

Interpreting Standard Deviation

Low std dev (e.g., 0.5): Data is consistent and predictable
Medium std dev: Some variability, but not extreme
High std dev (e.g., 50): Data is scattered and unpredictable

Real-World Examples

# Example 1: Quality control
widget_weights = [100.1, 99.9, 100.0, 100.2, 99.8]
std = np.std(widget_weights)  # 0.14
# Low std dev = consistent manufacturing

# Example 2: Stock volatility
stock_returns = [0.05, -0.03, 0.08, -0.12, 0.15, -0.08]
volatility = np.std(stock_returns)  # 0.094
# High std dev = risky stock

# Example 3: Outlier detection
data = [10, 12, 11, 13, 12, 85]  # 85 is outlier
mean = np.mean(data)
std = np.std(data)
# Values beyond mean ± 2*std are often outliers
outliers = [x for x in data if abs(x - mean) > 2*std]
print(outliers)  # [85]

SQL Implementation

-- Calculate standard deviation and variance
SELECT
    STDDEV_POP(salary) as pop_std_dev,    -- Population
    STDDEV_SAMP(salary) as sample_std_dev, -- Sample
    VAR_POP(salary) as variance
FROM employees;

Excel Formulas

=STDEV.P(A1:A100)  -- Population standard deviation
=STDEV.S(A1:A100)  -- Sample standard deviation (most common)
=VAR.P(A1:A100)    -- Population variance
=VAR.S(A1:A100)    -- Sample variance

Relationship: Variance vs Standard Deviation

Variance: Average of squared differences (σ²)
Standard Deviation: Square root of variance (σ)
Standard deviation is in the same units as your data (easier to interpret)

# Both measure spread, but std dev is more interpretable
variance = np.var(data)       # 2.96 (squared units)
std_dev = np.std(data)        # 1.72 (original units)
std_dev = np.sqrt(variance)   # Same thing

Using Standard Deviation

Comparing variability: Which dataset is more consistent?
Outlier detection: Flag values > 2 or 3 standard deviations from mean
Confidence intervals: Mean ± 1.96*std (95% CI for normal data)
Quality control: Monitor if process stays within acceptable range
Risk assessment: Higher std dev = higher uncertainty

Key Takeaways:

Standard deviation measures spread around the mean
Low std dev: Data clustered near mean (predictable)
High std dev: Data spread out (variable)
Use sample std dev (N-1) for sample data (most common)
Always report std dev alongside the mean for complete picture