Apache Hive
What it is: SQL-on-Hadoop data warehouse infrastructure. Query massive datasets in HDFS using HiveQL, a SQL-like language. Pioneered SQL on Hadoop.
What It Does Best
Batch processing. ETL and data transformation at petabyte scale. Runs on Tez in current releases; MapReduce has been deprecated since Hive 2, and Hive on Spark was removed in Hive 4.
Schema-on-read. Query files in place as tables, with the schema applied at query time rather than at load time. CSV, JSON, Parquet, ORC support.
Mature ecosystem. More than a decade of enterprise use. Extensive documentation and tooling.
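A minimal sketch of what schema-on-read looks like in practice: the Python helper below builds a HiveQL CREATE EXTERNAL TABLE statement that lays a schema over delimited text files already sitting in storage, without moving or rewriting them. The helper name external_table_ddl, the table web_logs, its columns, and the HDFS path are all hypothetical, chosen only for illustration.

```python
def external_table_ddl(table, columns, location, delimiter=","):
    """Build a HiveQL statement that maps a schema onto existing delimited files.

    `columns` is a list of (name, hive_type) pairs. All names used in the
    example call below are hypothetical.
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {cols}\n"
        f")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table over CSV files that already exist in HDFS.
ddl = external_table_ddl(
    "web_logs",
    [("ts", "STRING"), ("user_id", "BIGINT"), ("url", "STRING")],
    "hdfs:///data/raw/web_logs",
)
print(ddl)
```

Because the table is EXTERNAL, dropping it removes only the metadata; the files under LOCATION stay in place. The generated statement could be run with a Hive client such as beeline.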
Pricing
Free: Open source, Apache 2.0. Cloud managed: Amazon EMR, Azure HDInsight, Cloudera (compute-based pricing).
When to Use It
✅ Existing Hadoop infrastructure
✅ Large-scale batch ETL jobs
✅ Historical data processing
✅ Team already knows HiveQL
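The batch ETL case in the list above often takes the shape of a scheduled job that rebuilds one partition per run. The sketch below, under stated assumptions, generates such a HiveQL statement from Python. The function daily_partition_etl, the tables web_logs and page_views_daily, and the partition column dt are hypothetical names, not part of any Hive API.

```python
def daily_partition_etl(source, target, columns, partition_col, partition_value, where):
    """Build a HiveQL statement that rebuilds a single partition of `target`.

    INSERT OVERWRITE replaces the partition's contents, so rerunning the
    job for the same day does not duplicate rows.
    """
    select_list = ", ".join(columns)
    return (
        f"INSERT OVERWRITE TABLE {target} "
        f"PARTITION ({partition_col}='{partition_value}')\n"
        f"SELECT {select_list}\n"
        f"FROM {source}\n"
        f"WHERE {where}"
    )


# Hypothetical daily job: copy one day of raw logs into a partitioned table.
stmt = daily_partition_etl(
    source="web_logs",
    target="page_views_daily",
    columns=["user_id", "url"],
    partition_col="dt",
    partition_value="2024-01-01",
    where="to_date(ts) = '2024-01-01'",
)
print(stmt)
```

Overwriting a whole partition rather than appending keeps reruns safe after a failure, which is a large part of why this pattern suits batch workloads.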
When NOT to Use It
❌ Interactive queries (too slow; use Trino or Presto)
❌ Real-time analytics (batch-oriented)
❌ New projects (consider Spark, Trino, cloud warehouses)
Bottom line: Legacy technology, but still widely used in enterprises with Hadoop. Batch ETL workhorse. For new projects, choose Spark for processing, Trino for querying, or a cloud warehouse. Hive's era has passed.