Pandas Merge and Join: How to Combine DataFrames
Combining DataFrames is essential for data analysis. Here's everything you need to know about pandas merge and join operations:
1. Basic Merge Operations
import pandas as pd
# Sample data
users = pd.DataFrame({
'user_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'David']
})
orders = pd.DataFrame({
'order_id': [101, 102, 103],
'user_id': [1, 2, 1],
'amount': [50, 75, 100]
})
# Inner merge (default) - only matching rows ✅
result = pd.merge(users, orders, on='user_id')
# Returns rows where user_id exists in both DataFrames
2. Four Types of Joins
# Inner join - only matching rows
inner = pd.merge(users, orders, on='user_id', how='inner')
# Left join - all rows from left, matching from right
left = pd.merge(users, orders, on='user_id', how='left')
# All users, with NaN for users without orders
# Right join - all rows from right, matching from left
right = pd.merge(users, orders, on='user_id', how='right')
# Outer join - all rows from both DataFrames
outer = pd.merge(users, orders, on='user_id', how='outer')
# All users and all orders, with NaN where no match
3. Merge on Different Column Names
# DataFrames with different column names
df1 = pd.DataFrame({
'employee_id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Charlie']
})
df2 = pd.DataFrame({
'emp_id': [1, 2, 4],
'salary': [50000, 60000, 55000]
})
# Merge using left_on and right_on
result = pd.merge(df1, df2,
left_on='employee_id',
right_on='emp_id',
how='left')
# Drop duplicate column if needed
result = result.drop('emp_id', axis=1)
4. Merge on Multiple Columns
# Merge on multiple columns
df1 = pd.DataFrame({
'year': [2023, 2023, 2024],
'month': [1, 2, 1],
'sales': [100, 150, 120]
})
df2 = pd.DataFrame({
'year': [2023, 2023, 2024],
'month': [1, 2, 2],
'costs': [80, 90, 85]
})
# Merge on both year and month
result = pd.merge(df1, df2, on=['year', 'month'], how='outer')
5. Using Index for Joins
# Join on index
df1 = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie']
}, index=[1, 2, 3])
df2 = pd.DataFrame({
'salary': [50000, 60000, 55000]
}, index=[1, 2, 4])
# Join using index (shorthand method)
result = df1.join(df2, how='left')
# Or use merge with left_index and right_index
result = pd.merge(df1, df2,
left_index=True,
right_index=True,
how='left')
6. Handle Duplicate Column Names
# Both DataFrames have 'status' column
df1 = pd.DataFrame({
'id': [1, 2],
'status': ['active', 'inactive']
})
df2 = pd.DataFrame({
'id': [1, 2],
'status': ['verified', 'pending']
})
# Use suffixes to distinguish columns
result = pd.merge(df1, df2, on='id', suffixes=('_user', '_order'))
# Creates: status_user and status_order columns
7. Indicator Column
# Add indicator to show merge source
result = pd.merge(users, orders,
on='user_id',
how='outer',
indicator=True)
# indicator column shows: 'left_only', 'right_only', or 'both'
print(result['_merge'].value_counts())
# Filter based on merge result
only_in_left = result[result['_merge'] == 'left_only']
only_in_right = result[result['_merge'] == 'right_only']
8. Concatenate DataFrames
# Stack DataFrames vertically (append rows)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# Concatenate vertically
result = pd.concat([df1, df2], ignore_index=True)
# Concatenate horizontally (add columns)
result = pd.concat([df1, df2], axis=1)
# Concatenate with keys
result = pd.concat([df1, df2], keys=['first', 'second'])
9. Practical Examples
# Example: Customer orders with product details
customers = pd.DataFrame({
'customer_id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Charlie'],
'city': ['NYC', 'LA', 'Chicago']
})
orders = pd.DataFrame({
'order_id': [101, 102, 103, 104],
'customer_id': [1, 1, 2, 3],
'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
'price': [1200, 25, 80, 300]
})
# Get all customer orders
result = pd.merge(customers, orders, on='customer_id', how='left')
# Calculate total spent per customer
summary = result.groupby(['customer_id', 'name'])['price'].sum().reset_index()
summary.columns = ['customer_id', 'name', 'total_spent']
Common Pitfalls
- ✅ Use
merge()for column-based joins - ✅ Use
join()for index-based joins (faster) - ✅ Use
concat()for stacking DataFrames - ✅ Always check result size after merge (data might explode!)
- ✅ Use
indicator=Trueto debug merge issues - ❌ Don't merge on columns with duplicates unless you expect multiple matches
- ❌ Don't forget to reset index if needed after merge
- ⚠️ Be careful with one-to-many and many-to-many joins (can create many rows)
Pro Tip: When merging large DataFrames, check for duplicates in the merge key columns first using df['key'].duplicated().sum(). Unexpected duplicates can cause your result to have many more rows than expected!