Pandas Merge and Join: How to Combine DataFrames

⏱️ 33 sec read 🐍 Python

Combining DataFrames is essential for data analysis. Here's everything you need to know about pandas merge and join operations:

1. Basic Merge Operations

import pandas as pd

# Sample data
users = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'user_id': [1, 2, 1],
    'amount': [50, 75, 100]
})

# Inner merge (default) - only matching rows ✅
result = pd.merge(users, orders, on='user_id')
# Returns rows where user_id exists in both DataFrames

2. Four Types of Joins

# Inner join - only matching rows
inner = pd.merge(users, orders, on='user_id', how='inner')

# Left join - all rows from left, matching from right
left = pd.merge(users, orders, on='user_id', how='left')
# All users, with NaN for users without orders

# Right join - all rows from right, matching from left
right = pd.merge(users, orders, on='user_id', how='right')

# Outer join - all rows from both DataFrames
outer = pd.merge(users, orders, on='user_id', how='outer')
# All users and all orders, with NaN where no match

3. Merge on Different Column Names

# DataFrames with different column names
df1 = pd.DataFrame({
    'employee_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

df2 = pd.DataFrame({
    'emp_id': [1, 2, 4],
    'salary': [50000, 60000, 55000]
})

# Merge using left_on and right_on
result = pd.merge(df1, df2,
                  left_on='employee_id',
                  right_on='emp_id',
                  how='left')

# Drop duplicate column if needed
result = result.drop('emp_id', axis=1)

4. Merge on Multiple Columns

# Merge on multiple columns
df1 = pd.DataFrame({
    'year': [2023, 2023, 2024],
    'month': [1, 2, 1],
    'sales': [100, 150, 120]
})

df2 = pd.DataFrame({
    'year': [2023, 2023, 2024],
    'month': [1, 2, 2],
    'costs': [80, 90, 85]
})

# Merge on both year and month
result = pd.merge(df1, df2, on=['year', 'month'], how='outer')

5. Using Index for Joins

# Join on index
df1 = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie']
}, index=[1, 2, 3])

df2 = pd.DataFrame({
    'salary': [50000, 60000, 55000]
}, index=[1, 2, 4])

# Join using index (shorthand method)
result = df1.join(df2, how='left')

# Or use merge with left_index and right_index
result = pd.merge(df1, df2,
                  left_index=True,
                  right_index=True,
                  how='left')

6. Handle Duplicate Column Names

# Both DataFrames have 'status' column
df1 = pd.DataFrame({
    'id': [1, 2],
    'status': ['active', 'inactive']
})

df2 = pd.DataFrame({
    'id': [1, 2],
    'status': ['verified', 'pending']
})

# Use suffixes to distinguish columns
result = pd.merge(df1, df2, on='id', suffixes=('_user', '_order'))
# Creates: status_user and status_order columns

7. Indicator Column

# Add indicator to show merge source
result = pd.merge(users, orders,
                  on='user_id',
                  how='outer',
                  indicator=True)

# indicator column shows: 'left_only', 'right_only', or 'both'
print(result['_merge'].value_counts())

# Filter based on merge result
only_in_left = result[result['_merge'] == 'left_only']
only_in_right = result[result['_merge'] == 'right_only']

8. Concatenate DataFrames

# Stack DataFrames vertically (append rows)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Concatenate vertically
result = pd.concat([df1, df2], ignore_index=True)

# Concatenate horizontally (add columns)
result = pd.concat([df1, df2], axis=1)

# Concatenate with keys
result = pd.concat([df1, df2], keys=['first', 'second'])

9. Practical Examples

# Example: Customer orders with product details
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'city': ['NYC', 'LA', 'Chicago']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'customer_id': [1, 1, 2, 3],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [1200, 25, 80, 300]
})

# Get all customer orders
result = pd.merge(customers, orders, on='customer_id', how='left')

# Calculate total spent per customer
summary = result.groupby(['customer_id', 'name'])['price'].sum().reset_index()
summary.columns = ['customer_id', 'name', 'total_spent']

Common Pitfalls

✅ Use merge() for column-based joins
✅ Use join() for index-based joins (faster)
✅ Use concat() for stacking DataFrames
✅ Always check result size after merge (data might explode!)
✅ Use indicator=True to debug merge issues
❌ Don't merge on columns with duplicates unless you expect multiple matches
❌ Don't forget to reset index if needed after merge
⚠️ Be careful with one-to-many and many-to-many joins (can create many rows)

Pro Tip: When merging large DataFrames, check for duplicates in the merge key columns first using df['key'].duplicated().sum(). Unexpected duplicates can cause your result to have many more rows than expected!

← Back to Python Tips