Useful Data Tips

How to Choose the Right Chart


The Key Question

Ask first: "What story am I trying to tell?"

Chart Selection Decision Tree

1. What are you showing?
   ├─ Comparison between categories → Bar/Column Chart
   ├─ Change over time → Line Chart
   ├─ Relationship between variables → Scatter Plot
   ├─ Distribution → Histogram/Box Plot
   ├─ Part of whole → Pie Chart/Treemap
   └─ Many variables → Heatmap

2. How many variables?
   ├─ One → Histogram, bar chart
   ├─ Two → Scatter, line, bar
   └─ Three+ → Bubble chart, faceted plots

3. How much data?
   ├─ Few (<10 data points) → Bar, pie
   ├─ Medium (10-100 data points) → Line, scatter
   └─ Many (>100 data points) → Heatmap, density plot
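As a rough illustration, the first and third questions of the tree can be written as a small lookup. This is only a sketch: the names `CHART_BY_GOAL`, `suggest_charts`, and `size_bucket` are made up for this example, and the goal strings are shortened versions of the labels above.

```python
# Hypothetical lookup mirroring question 1 of the decision tree.
CHART_BY_GOAL = {
    "comparison": ["bar chart", "column chart"],
    "change over time": ["line chart"],
    "relationship": ["scatter plot"],
    "distribution": ["histogram", "box plot"],
    "part of whole": ["pie chart", "treemap"],
    "many variables": ["heatmap"],
}


def suggest_charts(goal: str) -> list[str]:
    """Return the chart types the tree lists for a goal."""
    key = goal.strip().lower()
    if key not in CHART_BY_GOAL:
        raise ValueError(f"Unknown goal: {goal!r}. Try one of {sorted(CHART_BY_GOAL)}")
    return CHART_BY_GOAL[key]


def size_bucket(n_points: int) -> str:
    """Classify data volume using the thresholds from question 3."""
    if n_points < 10:
        return "few"
    if n_points <= 100:
        return "medium"
    return "many"


print(suggest_charts("Change over time"))  # ['line chart']
print(size_bucket(250))                    # many
```

A lookup like this is only a starting point; the goal and the data size often pull in different directions, and the final choice is still a judgment call.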

Bar Chart / Column Chart

When to use: Compare values across categories

import matplotlib.pyplot as plt

# Horizontal bar chart (suits many categories or long labels)
categories = ['Product A', 'Product B', 'Product C']
values = [450, 380, 290]
plt.barh(categories, values)
plt.xlabel('Sales ($K)')
plt.title('Sales by Product')
plt.show()

# Vertical column chart (few categories, shows height well)
months = ['Jan', 'Feb', 'Mar']
revenue = [100, 150, 130]
plt.figure()  # New figure so the two charts don't overlap
plt.bar(months, revenue)
plt.ylabel('Revenue ($K)')
plt.show()
# When to use:
# ✅ Comparing discrete categories
# ✅ Clear rankings (which is biggest?)
# ✅ Precise value comparisons
# ❌ Avoid for long time series (use a line chart)

Line Chart

When to use: Show trends over time

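import pandas as pd
import matplotlib.pyplot as plt

# Time series: 12 month-start dates
# ('MS' = month start; the 'M' alias is deprecated in recent pandas)
dates = pd.date_range('2024-01-01', periods=12, freq='MS')
revenue = [100, 105, 110, 108, 115, 120, 125, 130, 128, 135, 140, 145]

plt.plot(dates, revenue, marker='o')
plt.xlabel('Date')
plt.ylabel('Revenue ($K)')
plt.title('Revenue Trend')
plt.xticks(rotation=45)
plt.show()

# Multiple lines for comparison, plotted against month names
# (revenue_2023 is an illustrative placeholder; substitute real data)
months = dates.strftime('%b')
revenue_2024 = revenue
revenue_2023 = [v - 10 for v in revenue]
plt.plot(months, revenue_2023, label='2023')
plt.plot(months, revenue_2024, label='2024')
plt.legend()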

# When to use:
# ✅ Time series data
# ✅ Showing trends, patterns
# ✅ Multiple series comparison
# ✅ Continuous data
# ❌ Don't use for categorical comparisons

Scatter Plot

When to use: Show relationship between two continuous variables

# Correlation between variables
# (df is assumed to be a pandas DataFrame with 'age', 'income',
#  and 'education_years' columns)
plt.scatter(df['age'], df['income'], alpha=0.5)
plt.xlabel('Age')
plt.ylabel('Income ($)')
plt.title('Age vs Income')

# With color encoding third variable
plt.scatter(df['age'], df['income'], c=df['education_years'],
            cmap='viridis', alpha=0.6)
plt.colorbar(label='Years of Education')

# When to use:
# ✅ Finding correlations
# ✅ Identifying clusters
# ✅ Spotting outliers
# ✅ Two continuous variables
# ❌ Don't use with categorical data

Histogram

When to use: Show the distribution of a single continuous variable

# Distribution
plt.hist(df['age'], bins=20, edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution')

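# With a fitted normal curve overlaid
import numpy as np
from scipy.stats import norm
mean, std = df['age'].mean(), df['age'].std()
x = np.linspace(df['age'].min(), df['age'].max(), 200)
plt.hist(df['age'], bins=20, density=True, alpha=0.7)
plt.plot(x, norm.pdf(x, mean, std), 'r-', linewidth=2)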

# When to use:
# ✅ Understanding data distribution
# ✅ Finding skewness, outliers
# ✅ Comparing distributions
# ❌ Don't use for categorical data (use bar chart)

Box Plot

When to use: Compare distributions across categories

# Compare salary distributions by department
df.boxplot(column='salary', by='department')
plt.title('Salary Distribution by Department')
plt.suptitle('')  # Remove auto title

# Shows:
# - Median (line in box)
# - First and third quartiles (box edges)
# - Whiskers (by default, the furthest points within 1.5 x IQR of the box)
# - Outliers (points beyond the whiskers)

# When to use:
# ✅ Comparing distributions
# ✅ Identifying outliers
# ✅ Seeing median, quartiles
# ✅ Multiple groups
# ❌ Don't use with small data (<20 points)
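To make the elements of a box plot concrete, the sketch below computes them by hand with the standard library. It assumes the common 1.5 x IQR whisker rule (matplotlib's default, `whis=1.5`) and linear interpolation for the quartiles; the function name `box_plot_summary` and the salary figures are invented for illustration.

```python
import statistics


def box_plot_summary(values, whisker=1.5):
    """Compute the numbers a box plot draws, using the 1.5 x IQR whisker rule."""
    data = sorted(values)
    # method="inclusive" interpolates linearly between data points
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers end at the most extreme points that are still inside the fences
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


salaries = [42, 45, 47, 50, 52, 55, 58, 60, 120]  # illustrative values, in $K
summary = box_plot_summary(salaries)
print(summary["median"], summary["q1"], summary["q3"])  # 52.0 47.0 58.0
print(summary["outliers"])                              # [120]
```

Here the upper whisker stops at 60 rather than 120, which is why the whiskers should not be read as the full range of the data.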

Pie Chart

When to use: Show parts of a whole (use sparingly!)

# Market share
labels = ['Company A', 'Company B', 'Company C', 'Others']
sizes = [35, 25, 20, 20]
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
plt.title('Market Share')

# When to use:
# ✅ Simple proportions (2-5 slices max)
# ✅ One variable, parts sum to 100%
# ✅ Emphasizing one large slice
# ❌ Don't use for:
#     - Many categories (>5)
#     - Precise comparisons (use bar chart)
#     - Multiple pies (very hard to compare)
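When a dataset has more categories than a pie chart can carry, one common workaround is to keep the largest slices and merge the rest. The helper below is a minimal sketch of that idea; `collapse_small_slices`, the "Other" label, and the sample shares are all assumptions made for this example.

```python
def collapse_small_slices(shares, max_slices=5, other_label="Other"):
    """Keep the largest (max_slices - 1) categories and merge the rest into one."""
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) <= max_slices:
        return dict(ranked)
    kept = ranked[: max_slices - 1]
    remainder = sum(value for _, value in ranked[max_slices - 1:])
    return dict(kept + [(other_label, remainder)])


market = {"A": 35, "B": 25, "C": 15, "D": 10, "E": 8, "F": 4, "G": 3}  # illustrative
print(collapse_small_slices(market))
# {'A': 35, 'B': 25, 'C': 15, 'D': 10, 'Other': 15}
```

The keys and values of the result can be passed to `plt.pie` as `labels` and `sizes`. If the merged slice ends up as one of the largest, a bar chart is probably the better choice.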

Heatmap

When to use: Show patterns in matrix data

import seaborn as sns

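# Correlation matrix (numeric columns only; recent pandas versions
# raise an error if text columns are included)
corr = df.corr(numeric_only=True)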
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')

# Time-based patterns
# (pivot_table averages by default; aggfunc='sum' gives total sales per cell)
pivot = df.pivot_table(values='sales', index='day', columns='hour', aggfunc='sum')
sns.heatmap(pivot, cmap='YlOrRd')
plt.title('Sales by Day and Hour')

# When to use:
# ✅ Matrix data (rows x columns)
# ✅ Finding patterns, clusters
# ✅ Correlation matrices
# ✅ Time-based patterns
# ❌ Don't use for simple comparisons

Area Chart

When to use: Show how a total and its components change over time

# Stacked area chart
# (dates and revenue_A/B/C are placeholders for equal-length series)
plt.stackplot(dates, revenue_A, revenue_B, revenue_C,
              labels=['Product A', 'Product B', 'Product C'])
plt.legend(loc='upper left')
plt.title('Revenue by Product Over Time')

# When to use:
# ✅ Show total and components
# ✅ Multiple categories over time
# ✅ Emphasize magnitude
# ❌ Don't use when:
#     - Categories don't stack logically
#     - Individual trends matter more than totals

Bad Chart Choices

❌ 3D Charts

# Almost never use 3D charts
# - Hard to read accurately
# - Distorts values
# - Looks dated

# Exception: True 3D scatter plots for scientific data

❌ Dual-Axis Charts (Usually)

# Be very careful with dual Y-axes
# Can be misleading if scales are manipulated
# Better: Use small multiples or normalize
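One way to "normalize" two series with different units is to index both to a common base, so they can share a single axis. The sketch below rescales each series so its first value is 100; `index_to_base` and the sample numbers are hypothetical.

```python
def index_to_base(series, base=100.0):
    """Rescale a series so its first value equals `base`."""
    first = series[0]
    if first == 0:
        raise ValueError("Cannot index a series whose first value is zero")
    return [value * base / first for value in series]


revenue_k = [100, 110, 125, 150]      # illustrative, in $K
customers = [2000, 2100, 2300, 2400]  # illustrative counts

print(index_to_base(revenue_k))  # [100.0, 110.0, 125.0, 150.0]
print(index_to_base(customers))  # [100.0, 105.0, 115.0, 120.0]
```

Both indexed series can then be drawn with two `plt.plot` calls on one y-axis labeled something like "Index (first period = 100)". The trade-off is that absolute values are no longer visible on the chart.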

❌ Too Many Pie Charts

# Don't compare multiple pies
# Very hard to compare slices across pies
# Better: Use grouped bar chart
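A grouped bar chart can be sketched in matplotlib by shifting each series sideways within every category. The helper names below (`grouped_bar_offsets`, `plot_grouped_bars`) and the axis label are assumptions for illustration, not part of any library.

```python
def grouped_bar_offsets(n_groups, n_series, group_width=0.8):
    """Return the bar width and, per series, the x position of each bar."""
    bar_width = group_width / n_series
    positions = []
    for s in range(n_series):
        # Centre the cluster of bars on each integer group position
        shift = (s - (n_series - 1) / 2) * bar_width
        positions.append([g + shift for g in range(n_groups)])
    return bar_width, positions


def plot_grouped_bars(categories, series_by_label):
    """Draw one cluster of bars per category, one bar per series."""
    import matplotlib.pyplot as plt  # imported here so the helper above has no dependency

    width, positions = grouped_bar_offsets(len(categories), len(series_by_label))
    fig, ax = plt.subplots()
    for xs, (label, values) in zip(positions, series_by_label.items()):
        ax.bar(xs, values, width=width, label=label)
    ax.set_xticks(list(range(len(categories))))
    ax.set_xticklabels(categories)
    ax.set_ylabel("Market share (%)")  # assumed unit for this example
    ax.legend()
    return ax
```

For example, `plot_grouped_bars(['Company A', 'Company B'], {'2023': [40, 60], '2024': [45, 55]})` followed by `plt.show()` would place the two years side by side for each company, which is the comparison that two separate pies make difficult.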

Chart Selection by Goal

Your Goal                      Best Chart Type
-----------------------------  -----------------------------
Compare categories             Bar chart
Show trend over time           Line chart
Find correlation               Scatter plot
Show distribution              Histogram, box plot
Show composition               Stacked bar, pie (if simple)
Show rankings                  Ordered bar chart
Show deviation                 Bar chart with reference line
Show relationship + magnitude  Bubble chart
Show geographic data           Choropleth map
Show hierarchical data         Treemap, sunburst

Data Type Guide

Data Type                     Chart Options
----------------------------  ---------------------------------
1 categorical                 Bar chart, pie chart
1 continuous                  Histogram, box plot, density plot
1 categorical + 1 continuous  Bar chart, box plot
2 continuous                  Scatter plot, line chart
Time + continuous             Line chart, area chart
3 continuous                  Bubble chart, 3D scatter
Many variables                Heatmap, parallel coordinates
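The guide above can also be applied programmatically by classifying each column and looking up the combination. The sketch below uses a deliberately rough heuristic (dates are "time", numbers are "continuous", everything else is "categorical") and assumes non-empty columns; `column_kind`, `chart_options`, and `OPTIONS_BY_KINDS` are hypothetical names.

```python
from datetime import date
from numbers import Number

# Hypothetical lookup built from the guide; keys are sorted tuples of column kinds.
OPTIONS_BY_KINDS = {
    ("categorical",): ["bar chart", "pie chart"],
    ("continuous",): ["histogram", "box plot", "density plot"],
    ("categorical", "continuous"): ["bar chart", "box plot"],
    ("continuous", "continuous"): ["scatter plot", "line chart"],
    ("continuous", "time"): ["line chart", "area chart"],
    ("continuous", "continuous", "continuous"): ["bubble chart", "3D scatter"],
}


def column_kind(values):
    """Rough classification of a column from its Python value types."""
    if all(isinstance(v, date) for v in values):  # datetime is a subclass of date
        return "time"
    if all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
        return "continuous"
    return "categorical"


def chart_options(*columns):
    """Return the guide's chart options for the given columns."""
    if len(columns) > 3:
        return ["heatmap", "parallel coordinates"]
    kinds = tuple(sorted(column_kind(col) for col in columns))
    # Combinations the guide does not cover return an empty list
    return OPTIONS_BY_KINDS.get(kinds, [])


print(chart_options(["North", "South"], [12.5, 9.8]))  # ['bar chart', 'box plot']
print(chart_options([date(2024, 1, 1)], [100.0]))      # ['line chart', 'area chart']
```

Integer codes such as postal codes or IDs would be misclassified as continuous by this heuristic, so the result should be treated as a hint rather than an answer.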

Common Mistakes

Mistake 1: Wrong Chart for Data Type

# WRONG: Line chart for categorical data
categories = ['Red', 'Blue', 'Green']
values = [10, 15, 8]
plt.plot(categories, values)  # Implies order/trend that doesn't exist

# RIGHT: Bar chart
plt.bar(categories, values)

Mistake 2: Too Much Information

# WRONG: 20 lines on one chart
# Can't distinguish colors, too busy

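# RIGHT: Use small multiples or facets (one panel per category)
import seaborn as sns
# kind='line' is needed because relplot draws scatter plots by default
sns.relplot(data=df, x='year', y='value', col='category', col_wrap=3, kind='line')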

Mistake 3: Not Starting Y-Axis at Zero

# For bar charts, always start the y-axis at zero
# Otherwise, differences look exaggerated

# values = the plotted bar heights
plt.ylim(0, max(values) * 1.1)  # Start at 0, leave 10% headroom

# Exception: Line charts can use a non-zero baseline when the trend matters more than absolute size
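A quick calculation shows how much a truncated axis can exaggerate a difference. The function below is a small sketch (the name `apparent_ratio` and the numbers are made up) that compares the drawn heights of two bars for a given axis baseline.

```python
def apparent_ratio(small, large, baseline=0.0):
    """Ratio of drawn bar heights when the y-axis starts at `baseline`."""
    if baseline >= small:
        raise ValueError("Baseline must be below the smaller value")
    return (large - baseline) / (small - baseline)


print(apparent_ratio(100, 110))               # 1.1 (true ratio, axis from zero)
print(apparent_ratio(100, 110, baseline=95))  # 3.0 (bar looks three times as tall)
```

With the axis starting at 95, a 10% difference is drawn as a 200% difference in bar height, which is the distortion the zero-baseline rule is meant to prevent.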

Best Practices

Quick Decision Matrix

Question 1: Time series? → Line chart
Question 2: Comparison? → Bar chart
Question 3: Relationship? → Scatter plot
Question 4: Distribution? → Histogram
Question 5: Part-to-whole? → Stacked bar (preferred over pie)
Question 6: Many variables? → Heatmap

Still unsure? → Start with a bar chart (most versatile)

Key Takeaways:

  • Start with your story: What do you want to show?
  • Match chart to data type: Categorical vs continuous
  • Bar charts for comparisons, line charts for trends
  • Scatter plots for relationships, histograms for distributions
  • Use pie charts sparingly (2-5 slices max)
  • When in doubt, start with a bar chart
  • Simplicity > complexity