Box Plots: Understanding Distributions
Box plots (box-and-whisker plots) summarize a distribution by showing its median, quartiles, and outliers. They are well suited to comparing distributions across groups and spotting anomalies.
Understanding Box Plot Components
  |    ← Maximum (within 1.5×IQR)
  |
┌───┐
│   │  ← Q3 (75th percentile)
├───┤  ← Median (50th percentile)
│   │  ← Q1 (25th percentile)
└───┘
  |
  |    ← Minimum (within 1.5×IQR)
  •    ← Outliers
Box height = IQR (Interquartile Range) = Q3 - Q1
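To make the components above concrete, here is a minimal sketch that computes them by hand with NumPy. The sample values are made up for illustration, and the 1.5×IQR rule with linearly interpolated percentiles is assumed to match what the plotting library draws by default.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
values = np.array([12, 15, 17, 18, 19, 20, 21, 22, 24, 25, 48])

# Quartiles and the interquartile range (the height of the box)
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: points beyond these limits are drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"outliers: {outliers.tolist()}")
```

Note that the whiskers stop at actual data points (12 and 25 here), not at the fences themselves; the value 48 lies above the upper fence and would be plotted as a separate point.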
Creating Box Plots in Python
import seaborn as sns
import matplotlib.pyplot as plt
# Compare total bills across days of the week (seaborn's built-in 'tips' dataset)
data = sns.load_dataset('tips')
plt.figure(figsize=(10, 6))
sns.boxplot(x='day', y='total_bill', data=data)
plt.title('Total Bill Distribution by Day')
plt.ylabel('Total Bill ($)')
plt.show()
Interpreting Box Plots
# What to look for:
1. Center (Median line):
Where is the middle of the data?
2. Spread (Box height):
How variable is the data?
3. Skewness (Median position in box):
- Median near top of box = negatively skewed (longer tail toward low values)
- Median in middle of box = roughly symmetric
- Median near bottom of box = positively skewed (longer tail toward high values)
4. Outliers (Individual points):
- Values far from the rest
- May indicate errors or special cases
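The skewness reading above can also be checked numerically. One option is the quartile (Bowley) skewness coefficient, which measures where the median sits between Q1 and Q3. This is a sketch under assumptions: the helper name `quartile_skewness` and the sample arrays are hypothetical, and this coefficient is only one of several skewness measures.

```python
import numpy as np

def quartile_skewness(values):
    """Bowley skewness: (Q3 + Q1 - 2 * median) / (Q3 - Q1).

    Positive when the median sits nearer Q1 (bottom of the box),
    negative when it sits nearer Q3 (top of the box), zero when centered.
    """
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return (q3 + q1 - 2 * median) / (q3 - q1)

# Hypothetical samples, chosen only for illustration
right_tailed = np.array([1, 2, 2, 3, 3, 4, 6, 9, 15])
symmetric = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
left_tailed = 16 - right_tailed  # mirror image of right_tailed

print(quartile_skewness(right_tailed))  # positive: median near bottom of box
print(quartile_skewness(symmetric))     # zero: median in the middle
print(quartile_skewness(left_tailed))   # negative: median near top of box
```

The coefficient always lies between -1 and 1 and ignores the whiskers and outliers, so it describes only the middle half of the data.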
Comparing Multiple Groups
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Sales performance across regions
sales_data = pd.DataFrame({
'region': ['North']*50 + ['South']*50 + ['East']*50 + ['West']*50,
'sales': list(range(100, 150)) + list(range(80, 130)) +
list(range(120, 170)) + list(range(90, 140))
})
sns.boxplot(x='region', y='sales', data=sales_data)
plt.title('Sales Distribution by Region')
plt.show()
# Quickly see which region has:
# - Highest median sales
# - Most consistent performance
# - Most outliers
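The three questions above can also be answered with numbers instead of by eye. The sketch below rebuilds the same synthetic `sales_data` and summarizes each region with pandas; the helper names `iqr` and `count_outliers` are hypothetical, and the 1.5×IQR outlier rule is assumed.

```python
import pandas as pd

# Same synthetic data as the plotting example
sales_data = pd.DataFrame({
    'region': ['North']*50 + ['South']*50 + ['East']*50 + ['West']*50,
    'sales': list(range(100, 150)) + list(range(80, 130)) +
             list(range(120, 170)) + list(range(90, 140))
})

def iqr(s):
    """Interquartile range: the height of the box."""
    return s.quantile(0.75) - s.quantile(0.25)

def count_outliers(s):
    """Number of points beyond 1.5 x IQR from the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    spread = q3 - q1
    return int(((s < q1 - 1.5 * spread) | (s > q3 + 1.5 * spread)).sum())

summary = sales_data.groupby('region')['sales'].agg(
    median='median',          # highest median sales
    iqr=iqr,                  # smaller IQR = more consistent
    outliers=count_outliers,  # most outliers
)
print(summary)
print("Highest median:", summary['median'].idxmax())
```

Because this data is built from evenly spaced ranges of equal length, every region has the same IQR and no outliers; only the medians differ. Real data would usually show differences in all three columns.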
When to Use Box Plots
- Comparing distributions across categories
- Identifying outliers
- Understanding data spread
- Spotting skewness
- Quality control analysis
Box Plot vs Violin Plot
# Box plot: Shows summary statistics
# Violin plot: Shows full distribution shape
sns.violinplot(x='day', y='total_bill', data=data)
# Violin plots show density - good for multimodal distributions
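To see why density matters for multimodal data, consider a sample with two clusters and nothing in between. The sketch below uses made-up values and compares the quartiles a box plot would draw with simple bin counts, which stand in here for the density a violin plot would show.

```python
import numpy as np

# Hypothetical bimodal sample: two clusters with an empty gap between them
low_cluster = np.linspace(1, 3, 50)
high_cluster = np.linspace(7, 9, 50)
bimodal = np.concatenate([low_cluster, high_cluster])

# What a box plot summarizes: three quartiles
q1, median, q3 = np.percentile(bimodal, [25, 50, 75])

# A rough stand-in for density: how many points fall in each interval
counts, edges = np.histogram(bimodal, bins=[0, 2, 4, 6, 8, 10])

print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}")
print(counts.tolist())  # the middle bin (4 to 6) is empty
```

The median lands at 5.0, inside the gap where there are no observations at all. A box plot would draw one wide box across that gap, while a violin plot would show two separate bulges.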
Pro Tip: Box plots are excellent for comparing distributions side-by-side. Pay attention to outliers: they often represent data quality issues or interesting edge cases worth investigating!