Useful Data Tips

Box Plots: Understanding Distributions


Box plots (box-and-whisker plots) summarize a distribution by showing its median, quartiles, and outliers. They are well suited to comparing distributions across groups and spotting anomalies.

Understanding Box Plot Components

    |  ← Maximum (within 1.5×IQR)
    |
   ┌─┐
   │ │ ← Q3 (75th percentile)
   ├─┤ ← Median (50th percentile)
   │ │ ← Q1 (25th percentile)
   └─┘
    |
    |  ← Minimum (within 1.5×IQR)
    •  ← Outliers

Box height = IQR (Interquartile Range) = Q3 - Q1
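
The quantities in the diagram can be computed by hand. The following is a minimal sketch, assuming NumPy is installed; the helper name `box_stats` and the sample values are made up for illustration, and the 1.5×IQR rule is the whisker convention shown in the diagram above.

```python
import numpy as np

def box_stats(values):
    """Return the quantities a box plot draws (illustrative helper)."""
    x = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1                      # box height
    lower_fence = q1 - 1.5 * iqr       # points below this are outliers
    upper_fence = q3 + 1.5 * iqr       # points above this are outliers
    inside = x[(x >= lower_fence) & (x <= upper_fence)]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": inside.min(),   # "Minimum" in the diagram
        "whisker_high": inside.max(),  # "Maximum" in the diagram
        "outliers": x[(x < lower_fence) | (x > upper_fence)].tolist(),
    }

# Made-up sample: nine evenly spaced values plus one extreme value
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

Note that the whiskers end at the most extreme data points inside the fences, not at the fences themselves.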

Creating Box Plots in Python

import seaborn as sns
import matplotlib.pyplot as plt

# Compare total bills across days of the week (seaborn's built-in 'tips' dataset)
data = sns.load_dataset('tips')

plt.figure(figsize=(10, 6))
sns.boxplot(x='day', y='total_bill', data=data)
plt.title('Total Bill Distribution by Day')
plt.ylabel('Total Bill ($)')
plt.show()

Interpreting Box Plots

# What to look for:

1. Center (Median line):
   Where is the middle of the data?

2. Spread (Box height):
   How variable is the data?

3. Skewness (Median position in box):
   - Median near top = negatively skewed
   - Median in middle = symmetric
   - Median near bottom = positively skewed

4. Outliers (Individual points):
   - Values far from the rest
   - May indicate errors or special cases
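
The checklist above can also be read off numerically. This is a rough sketch, assuming pandas is installed; `read_box` is a hypothetical helper and the sample values are invented. It reports where the median sits inside the box (0.0 = on the Q1 edge, 0.5 = centered, 1.0 = on the Q3 edge) rather than a formal skewness statistic.

```python
import pandas as pd

def read_box(values):
    """Summarize center, spread, median position, and outlier count (illustrative)."""
    s = pd.Series(values, dtype=float)
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    # Position of the median inside the box; undefined if the box has zero height
    median_position = (median - q1) / iqr if iqr > 0 else float("nan")
    outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
    return {
        "center": median,                    # 1. where is the middle?
        "spread": iqr,                       # 2. how variable is the data?
        "median_position": median_position,  # 3. below 0.5 hints at positive skew
        "n_outliers": len(outliers),         # 4. points beyond the 1.5×IQR fences
    }

# Made-up sample with a long right tail
result = read_box([1, 2, 2, 3, 3, 3, 4, 5, 8, 20])
print(result)
```

For this sample the median sits in the lower part of the box and one point falls beyond the upper fence, which matches the "median near bottom = positively skewed" reading.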

Comparing Multiple Groups

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sales performance across regions
sales_data = pd.DataFrame({
    'region': ['North']*50 + ['South']*50 + ['East']*50 + ['West']*50,
    'sales': list(range(100, 150)) + list(range(80, 130)) +
             list(range(120, 170)) + list(range(90, 140))
})

sns.boxplot(x='region', y='sales', data=sales_data)
plt.title('Sales Distribution by Region')
plt.show()

# Quickly see which region has:
# - Highest median sales
# - Most consistent performance
# - Most outliers
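
The same three questions can be answered with a table instead of a plot. This is a sketch, assuming pandas is installed; it rebuilds the toy `sales_data` from the snippet above, and the helper names `box_height` and `count_outliers` are illustrative. Because the toy data is evenly spaced, every region ends up with the same spread and no outliers; real data would differ.

```python
import pandas as pd

# Same toy data as in the plotting example
sales_data = pd.DataFrame({
    'region': ['North']*50 + ['South']*50 + ['East']*50 + ['West']*50,
    'sales': list(range(100, 150)) + list(range(80, 130)) +
             list(range(120, 170)) + list(range(90, 140))
})

def box_height(s):
    """IQR = Q3 - Q1; smaller means more consistent performance."""
    return s.quantile(0.75) - s.quantile(0.25)

def count_outliers(s):
    """Number of points beyond the 1.5×IQR fences."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

# One row per region: median, IQR, and outlier count
summary = sales_data.groupby('region')['sales'].agg(
    median='median', iqr=box_height, outliers=count_outliers
)
print(summary.sort_values('median', ascending=False))
```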

When to Use Box Plots

Box Plot vs Violin Plot

# Box plot: Shows summary statistics
# Violin plot: Shows full distribution shape

import seaborn as sns
import matplotlib.pyplot as plt

data = sns.load_dataset('tips')
sns.violinplot(x='day', y='total_bill', data=data)
plt.show()
# Violin plots show density - good for multimodal distributions

Pro Tip: Box plots are excellent for comparing distributions side-by-side. Pay attention to outliers: they often represent data quality issues or interesting edge cases worth investigating!
