Essential SQL String Functions for Data Cleaning
String functions are essential for cleaning, formatting, and transforming text data. Here are the most useful ones you'll use daily.
Combining Strings
CONCAT / || (Concatenation)
-- Standard SQL
SELECT CONCAT(first_name, ' ', last_name) as full_name
FROM users;
-- PostgreSQL, SQLite
SELECT first_name || ' ' || last_name as full_name
FROM users;
Extracting Parts of Strings
SUBSTRING
-- Extract area code from phone
SELECT SUBSTRING(phone, 1, 3) as area_code
FROM customers;
-- Get first initial
SELECT SUBSTRING(first_name, 1, 1) as initial
FROM users;
LEFT / RIGHT
SELECT
LEFT(product_code, 3) as category,
RIGHT(product_code, 4) as item_number
FROM products;
Cleaning Strings
TRIM, LTRIM, RTRIM
-- Remove leading/trailing spaces
SELECT TRIM(email) as clean_email
FROM users;
-- Remove specific characters
SELECT TRIM('$' FROM price_text) as clean_price
FROM products;
REPLACE
-- Standardize phone formats
SELECT REPLACE(REPLACE(phone, '-', ''), ' ', '') as phone_clean
FROM customers;
-- Fix common typos
UPDATE products
SET description = REPLACE(description, 'teh', 'the')
WHERE description LIKE '%teh%';
Changing Case
SELECT
UPPER(email) as email_upper,
LOWER(email) as email_lower,
INITCAP(name) as name_titlecase -- PostgreSQL
FROM users;
Finding and Searching
LENGTH
-- Find invalid phone numbers
SELECT phone
FROM customers
WHERE LENGTH(REPLACE(REPLACE(phone, '-', ''), ' ', '')) != 10;
POSITION / CHARINDEX / INSTR
-- PostgreSQL/MySQL
SELECT email, POSITION('@' IN email) as at_position
FROM users;
-- Find domain
SELECT SUBSTRING(email, POSITION('@' IN email) + 1) as domain
FROM users;
Practical Data Cleaning Example
-- Clean and standardize email addresses
SELECT
TRIM(LOWER(email)) as cleaned_email,
CASE
WHEN email LIKE '%@%' AND email LIKE '%.%' THEN 'Valid'
ELSE 'Invalid'
END as email_status
FROM users;
Performance Note: String functions can be slow on large datasets. Consider adding computed columns or materialized views for frequently used transformations. Always use LIKE or pattern matching with leading wildcards sparingly.
← Back to SQL Tips