Useful Data Tips

Apache Hive

⏱️ 8 sec read 🗄️ Data Management

What it is: SQL-on-Hadoop data warehouse infrastructure. Query massive datasets in HDFS using SQL-like HiveQL. Pioneered big data SQL.

What It Does Best

Batch processing. ETL and data transformation at petabyte scale. Runs on MapReduce (deprecated since Hive 2), Tez, or Spark execution engines.

Schema-on-read. Define tables over files already in storage; the schema is applied at query time, not at load time. CSV, JSON, Parquet, ORC support.
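A minimal Python sketch of the schema-on-read idea (not Hive itself): the raw text stays untouched, and a separately declared schema is applied only when rows are read. The names RAW_CSV, SCHEMA, cast_or_none, and read_with_schema are hypothetical and exist only for illustration. Returning None for a field that does not fit its declared type is an assumption modeled on how Hive's default text format typically yields NULL in that case.

```python
import csv
import io

# Raw file contents, stored as-is (stands in for a file sitting in HDFS).
RAW_CSV = "1,alice,34.5\n2,bob,not_a_number\n3,carol,12\n"

# Hypothetical schema, declared apart from the data and applied only at read time.
SCHEMA = [("id", int), ("name", str), ("score", float)]


def cast_or_none(value, caster):
    """Cast a raw string field; return None if it does not fit the declared type."""
    try:
        return caster(value)
    except ValueError:
        return None


def read_with_schema(raw_text, schema):
    """Yield one dict per line, interpreting each field through the schema."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {
            name: cast_or_none(field, caster)
            for (name, caster), field in zip(schema, fields)
        }


rows = list(read_with_schema(RAW_CSV, SCHEMA))
for row in rows:
    print(row)
```

In Hive, the rough equivalent is a CREATE EXTERNAL TABLE statement with a LOCATION clause pointing at existing files: dropping or redefining the table changes how the files are interpreted, not the files themselves.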

Mature ecosystem. More than a decade of enterprise use. Extensive documentation and tooling.

Pricing

Free: Open source, Apache 2.0. Cloud managed: AWS EMR, Azure HDInsight, Cloudera (compute-based pricing).

When to Use It

✅ Existing Hadoop infrastructure

✅ Large-scale batch ETL jobs

✅ Historical data processing

✅ Team already knows HiveQL

When NOT to Use It

❌ Interactive queries (too slow; use Trino/Presto)

❌ Real-time analytics (batch-oriented)

❌ New projects (consider Spark, Trino, cloud warehouses)

Bottom line: Legacy technology, but still widely used in enterprises with Hadoop. Batch ETL workhorse. For new projects, choose Spark for processing or Trino for querying. Hive's era has passed.
