K-Means Clustering Algorithm


K-means clustering is an unsupervised learning algorithm that partitions data points into k groups, assigning each point to the cluster with the nearest center. It is commonly used for customer segmentation, pattern discovery, and finding natural groupings in data.

How K-Means Works

1. Choose number of clusters (k)
2. Randomly initialize k cluster centers
3. Assign each point to nearest center
4. Recalculate centers as mean of assigned points
5. Repeat steps 3-4 until centers stop moving (or a maximum number of iterations is reached)

Simple but effective!
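The five steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the helper name kmeans_lloyd, its parameters (max_iter, tol, seed), and the tiny demo array are all assumptions made for this example.

```python
import numpy as np

def kmeans_lloyd(X, k, max_iter=100, tol=1e-6, seed=0):
    """Hypothetical helper: a bare-bones version of the five steps."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest center (squared Euclidean distance)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each center to the mean of its assigned points;
        # a center that received no points stays where it is
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 5: stop once the centers have (almost) stopped moving
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return centers, labels

# Tiny demo: two well-separated groups of 2-D points (made-up values)
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers_demo, labels_demo = kmeans_lloyd(X_demo, k=2)
print(centers_demo)
print(labels_demo)
```

The sketch starts from randomly chosen data points and handles an empty cluster by leaving its center in place; production libraries typically use smarter initialization and empty-cluster handling.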

Basic Implementation

from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

# Create sample data
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit k-means
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Plot results
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0],
           kmeans.cluster_centers_[:, 1],
           s=300, c='red', marker='X', label='Centroids')
plt.title('K-Means Clustering')
plt.legend()
plt.show()

Finding Optimal Number of Clusters

Elbow Method

# Try different k values
inertias = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

# Plot elbow curve
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()

# Look for the "elbow": the value of k after which inertia decreases only slowly
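The inertia plotted in the elbow curve is the sum of squared distances from each point to the center of its assigned cluster. As a sanity check, the sketch below recomputes it by hand and compares it with the fitted model's inertia_ attribute. The variable names (X_blobs, km, manual_inertia) are chosen for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_blobs, _ = make_blobs(n_samples=300, centers=4, random_state=42)
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_blobs)

# Look up, for every point, the center of the cluster it was assigned to
own_centers = km.cluster_centers_[km.labels_]

# Sum of squared distances from each point to its own center
manual_inertia = ((X_blobs - own_centers) ** 2).sum()

# The two values should agree up to floating-point error
print(manual_inertia, km.inertia_)
```

Because adding a cluster can only lower this sum at the optimum, inertia alone cannot pick k; that is why the elbow method looks for diminishing returns rather than the minimum.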

Silhouette Score

from sklearn.metrics import silhouette_score

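# Silhouette score ranges from -1 to 1; higher means tighter, better-separated clusters
# It needs at least 2 clusters, so the loop starts at k=2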
scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X)
    score = silhouette_score(X, labels)
    scores.append(score)
    print(f"k={k}: Silhouette Score = {score:.3f}")

# Choose k with highest silhouette score
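Picking the k with the highest silhouette score can be automated rather than read off the printed list. The following is a small sketch under the same setup as above; the names k_values, sil_scores, and best_k are introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X_blobs, _ = make_blobs(n_samples=300, centers=4, random_state=42)

k_values = list(range(2, 11))
sil_scores = []
for k in k_values:
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_blobs)
    sil_scores.append(silhouette_score(X_blobs, labels_k))

# argmax gives the position of the highest score; map it back to its k
best_k = k_values[int(np.argmax(sil_scores))]
print(f"Best k by silhouette score: {best_k}")
```

The score is a guide rather than a verdict: when two candidate values of k score almost the same, the choice usually comes down to which grouping is more useful for the task.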

Customer Segmentation Example

import pandas as pd

# Customer data: age, income, spending
customers = pd.DataFrame({
    'age': [25, 35, 45, 23, 47, 55, 30, 40],
    'income': [50000, 60000, 80000, 45000, 90000, 120000, 55000, 75000],
    'spending': [1000, 1500, 3000, 800, 3500, 5000, 1200, 2500]
})

# Standardize features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(customers)

# Cluster into 3 segments
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
customers['segment'] = kmeans.fit_predict(X_scaled)

# Analyze segments
print(customers.groupby('segment').mean())

# Segment numbers are arbitrary labels, not a ranking.
# Read the group means printed above to decide which segment
# corresponds to low, medium, or high spenders.
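To attach stable names to the segments, one option is to convert the cluster centers back to the original units and rank them by mean spending. This is a sketch of that idea, not the only way to do it; the names features, centers, and names, and the low/medium/high labels, are assumptions for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

customers = pd.DataFrame({
    'age': [25, 35, 45, 23, 47, 55, 30, 40],
    'income': [50000, 60000, 80000, 45000, 90000, 120000, 55000, 75000],
    'spending': [1000, 1500, 3000, 800, 3500, 5000, 1200, 2500]
})
features = ['age', 'income', 'spending']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(customers[features])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

# Cluster centers live in standardized space; map them back to years and dollars
centers = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_), columns=features
)

# Rank cluster ids by mean spending, lowest first, and name them in that order
order = centers['spending'].sort_values().index
names = dict(zip(order, ['low', 'medium', 'high']))
customers['segment'] = [names[label] for label in kmeans.labels_]

print(centers.round(0))
print(customers)
```

Ranking by a single feature is a simplification; with less correlated features, a segment may need a description that mentions several columns.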

Common Use Cases

Customer segmentation: Group by behavior, demographics
Image compression: Reduce the palette to k representative colors
Anomaly detection: Flag points that lie far from every center
Document clustering: Group similar articles
Recommendation systems: Find similar users
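As an illustration of the anomaly detection use case, the sketch below fits k-means on reference data and flags new points whose distance to the nearest center is unusually large. The 99th-percentile threshold and the test point at (30, 30) are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_blobs, _ = make_blobs(n_samples=300, centers=4, random_state=42)
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_blobs)

# transform() returns the distance from each point to every center;
# the row-wise minimum is the distance to the nearest center
train_dist = km.transform(X_blobs).min(axis=1)

# Arbitrary cutoff: anything farther out than 99% of the training points
threshold = np.percentile(train_dist, 99)

# Two new points: one sitting on a center, one far from all the data
new_points = np.vstack([km.cluster_centers_[0], [30.0, 30.0]])
is_anomaly = km.transform(new_points).min(axis=1) > threshold

# The first point is not flagged, the second one is
print(is_anomaly)
```

Fitting on clean reference data matters here: if extreme points are included in the fit, k-means may give one of them its own center, which makes its distance zero.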

Limitations and Considerations

Must specify k in advance
Sensitive to initial centroids (use multiple runs)
Assumes roughly spherical clusters
Struggles with clusters of very different sizes or densities
Affected by outliers, which pull the mean-based centers toward them
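The spherical-cluster assumption is easy to see on crescent-shaped data. The sketch below compares k-means with DBSCAN on scikit-learn's make_moons dataset, scoring both against the true labels with the adjusted Rand index (1.0 is a perfect match). The eps and min_samples values are assumptions tuned for this toy data, not general recommendations.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: clearly two groups, but not spherical ones
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)

ari_kmeans = adjusted_rand_score(y_moons, km_labels)
ari_dbscan = adjusted_rand_score(y_moons, db_labels)

print(f"K-means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f}")
```

K-means splits the plane with a straight boundary between the two centers, so it cuts across both crescents, while a density-based method can follow their shape.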

Improving K-Means

# Run multiple times with different initializations
kmeans = KMeans(
    n_clusters=4,
    n_init=10,      # Run 10 times, keep the run with the lowest inertia
    max_iter=300,   # Max iterations per run
    random_state=42
)

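# k-means++ initialization (the scikit-learn default): spreads the initial
# centers apart, which usually beats purely random starting points
kmeans = KMeans(n_clusters=4, init='k-means++')

# MiniBatch K-Means for large datasets: fits on small random batches,
# trading a little cluster quality for speed
from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(n_clusters=4, batch_size=100)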

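Pro Tip: Scale your features before running k-means whenever they are measured in different units, because the algorithm relies on Euclidean distance and large-valued features would otherwise dominate. Use the elbow method or silhouette score to choose k. Remember that k-means works best on roughly spherical clusters; for complex shapes, try DBSCAN or hierarchical clustering instead.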
