K-Means Clustering Algorithm
K-means clustering groups similar data points together by assigning each point to the cluster with the nearest center. It is an unsupervised learning algorithm well suited to customer segmentation, pattern discovery, and finding natural groupings in data.
How K-Means Works
1. Choose number of clusters (k)
2. Randomly initialize k cluster centers
3. Assign each point to nearest center
4. Recalculate centers as mean of assigned points
5. Repeat steps 3-4 until the centers stop moving (or a maximum number of iterations is reached)
Simple but effective!
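The five steps above can be sketched in plain NumPy. This is a minimal illustration rather than how scikit-learn implements it: the function name `kmeans_sketch`, the `n_iters` and `seed` parameters, and the choice to start from k randomly picked data points are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Plain k-means loop following the five steps (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest center
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each center as the mean of its assigned points
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 5: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated groups of three points each
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centers = kmeans_sketch(X, k=2)
print(labels)
print(centers)
```

The first three points should share one label and the last three the other; which group gets label 0 depends on the random start.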
Basic Implementation
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
# Create sample data
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
# Fit k-means
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)
# Plot results
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0],
            kmeans.cluster_centers_[:, 1],
            s=300, c='red', marker='X', label='Centroids')
plt.title('K-Means Clustering')
plt.legend()
plt.show()
Finding Optimal Number of Clusters
Elbow Method
# Try different k values
inertias = []
K_range = range(1, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
# Plot elbow curve
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
# Look for "elbow" - point where improvement slows
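The inertia plotted on the elbow curve is the sum of squared distances from each point to the center of its assigned cluster. As a rough check, it can be recomputed by hand and compared with `inertia_`; the variable names `assigned_centers` and `manual_inertia` are just illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# Look up the center of the cluster each point was assigned to
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
# Sum the squared distances between points and their centers
manual_inertia = np.sum((X - assigned_centers) ** 2)

print(f"manual:  {manual_inertia:.3f}")
print(f"sklearn: {kmeans.inertia_:.3f}")
```

The two values should agree up to floating-point rounding.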
Silhouette Score
from sklearn.metrics import silhouette_score
# Higher score = better clustering
scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X)
    score = silhouette_score(X, labels)
    scores.append(score)
    print(f"k={k}: Silhouette Score = {score:.3f}")
# Choose k with highest silhouette score
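Picking the k with the highest score can also be done in code instead of by reading the printed list. A small sketch, assuming the same `make_blobs` data as earlier; `k_values` and `best_k` are names chosen for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

k_values = list(range(2, 11))  # silhouette score needs at least 2 clusters
scores = []
for k in k_values:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores.append(silhouette_score(X, labels))

# Index of the highest score maps back to the matching k
best_k = k_values[int(np.argmax(scores))]
print(f"Best k by silhouette score: {best_k}")
```

It is still worth looking at the full list of scores: when two values of k score almost the same, the smaller one is often easier to interpret.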
Customer Segmentation Example
import pandas as pd
# Customer data: age, income, spending
customers = pd.DataFrame({
    'age': [25, 35, 45, 23, 47, 55, 30, 40],
    'income': [50000, 60000, 80000, 45000, 90000, 120000, 55000, 75000],
    'spending': [1000, 1500, 3000, 800, 3500, 5000, 1200, 2500]
})
# Standardize features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(customers)
# Cluster into 3 segments
kmeans = KMeans(n_clusters=3, random_state=42)
customers['segment'] = kmeans.fit_predict(X_scaled)
# Analyze segments
print(customers.groupby('segment').mean())
# Segment numbers are arbitrary labels: inspect the printed means
# to see which segment holds the low, medium, and high spenders
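Because k-means numbers its clusters arbitrarily, it can help to name the segments from the data rather than from the raw cluster ids. The sketch below ranks segments by mean spending; the `segment_name` column and the `low`/`medium`/`high` names are hypothetical choices for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

customers = pd.DataFrame({
    'age': [25, 35, 45, 23, 47, 55, 30, 40],
    'income': [50000, 60000, 80000, 45000, 90000, 120000, 55000, 75000],
    'spending': [1000, 1500, 3000, 800, 3500, 5000, 1200, 2500]
})
features = ['age', 'income', 'spending']

X_scaled = StandardScaler().fit_transform(customers[features])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
customers['segment'] = kmeans.fit_predict(X_scaled)

# Rank the raw cluster ids by mean spending, lowest first
order = customers.groupby('segment')['spending'].mean().sort_values().index
names = ['low', 'medium', 'high']  # hypothetical segment names
name_by_segment = {segment: name for segment, name in zip(order, names)}
customers['segment_name'] = customers['segment'].map(name_by_segment)

means = customers.groupby('segment_name')['spending'].mean()
print(means)
```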
Common Use Cases
- Customer segmentation: Group by behavior, demographics
- Image compression: Reduce colors
- Anomaly detection: Points far from centers
- Document clustering: Group similar articles
- Recommendation systems: Find similar users
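As an illustration of the anomaly detection use case, the sketch below flags points that lie far from every cluster center. The 99th-percentile cutoff and the injected point at (30, 30) are assumptions made for the example, not a recommended rule.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# Add one point well outside the blobs (illustrative outlier)
outlier = np.array([[30.0, 30.0]])
X_all = np.vstack([X, outlier])

# transform() returns the distance from each point to every center
train_distance = kmeans.transform(X).min(axis=1)
nearest_distance = kmeans.transform(X_all).min(axis=1)

# Assumed rule: flag anything beyond the 99th percentile of training distances
threshold = np.percentile(train_distance, 99)
is_anomaly = nearest_distance > threshold
print(f"flagged {is_anomaly.sum()} of {len(X_all)} points")
```

The injected point should be flagged, along with the few original points that sit furthest from their centers.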
Limitations and Considerations
- Must specify k in advance
- Sensitive to initial centroids (use multiple runs)
- Assumes roughly spherical clusters of similar spread
- Struggles with clusters of very different sizes or densities
- Affected by outliers
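The spherical-cluster limitation is easiest to see on data that is not blob-shaped. The sketch below compares k-means with DBSCAN (a different, density-based algorithm) on two interleaved half-moons, using the adjusted Rand index against the known labels; the `eps=0.3` and `min_samples=5` settings are assumptions that happen to suit this toy dataset.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two crescent-shaped clusters that curve around each other
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"k-means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f}")
```

K-means splits the plane with a straight boundary, so it cuts across both moons, while DBSCAN can follow the curved shapes.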
Improving K-Means
# Run multiple times with different initializations
kmeans = KMeans(
    n_clusters=4,
    n_init=10,      # Run 10 times, keep best
    max_iter=300,   # Max iterations per run
    random_state=42
)
# K-means++ initialization (default, better than random)
kmeans = KMeans(n_clusters=4, init='k-means++')
# MiniBatch K-Means for large datasets (faster)
from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(n_clusters=4, batch_size=100)
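To see why multiple runs matter, one can fit k-means several times with a single random initialization each and compare the resulting inertias. This is only a sketch; the seeds 0 to 9 and the name `single_run_inertias` are arbitrary choices for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# One purely random initialization per run, with a different seed each time
single_run_inertias = [
    KMeans(n_clusters=4, init='random', n_init=1, random_state=seed).fit(X).inertia_
    for seed in range(10)
]

print(f"best:  {min(single_run_inertias):.1f}")
print(f"worst: {max(single_run_inertias):.1f}")
```

Any gap between the best and worst value comes from runs that settled in a poorer local optimum. Keeping the lowest-inertia run is in effect what `n_init=10` does.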
Pro Tip: Always scale your features before k-means! Use the elbow method or silhouette score to find optimal k. Remember that k-means favors roughly spherical clusters; for complex shapes, try DBSCAN or hierarchical clustering instead.