Awesome Apache Iceberg

A curated list of awesome Apache Iceberg resources, libraries, tools, and learning materials. Apache Iceberg is a high-performance open table format for huge analytic datasets, bringing reliability and simplicity to data lakes.

Core Project

  • Apache Iceberg - Official website for Apache Iceberg, a high-performance open table format for analytic datasets. Open source (Apache 2.0).
  • Apache Iceberg GitHub - Reference Java implementation with core table format specifications, catalog implementations, and processing engine integrations. Open source (Apache 2.0).
  • Documentation - Comprehensive official documentation covering all features, APIs, and integrations. Open source.

Query Engines

Open Source Engines

  • Apache Spark - First-class native Iceberg support with full read/write capabilities, streaming, time travel, and all DDL/DML operations. Supports Spark 3.3, 3.4, 3.5. Open source (Apache 2.0).
  • Apache Flink - Native streaming and batch support with CDC pipeline connectors, incremental reads, and equality delete writes for efficient streaming use cases. Open source (Apache 2.0).
  • Trino - Production-ready native connector with full read/write support, time travel, multiple catalog types, and metadata table access. Open source (Apache 2.0).
  • PrestoDB - Native connector with V1/V2 table support, time travel queries, and REST catalog integration. Open source (Apache 2.0).
  • Apache Impala - Native C++ implementation with read/write support, partition transforms, time travel, and OPTIMIZE statement for maintenance. Open source (Apache 2.0).
  • Apache Hive - StorageHandler-based integration with full DDL/DML support in Hive 4.0+ and table migration from Avro/Parquet/ORC tables. Open source (Apache 2.0).
  • StarRocks - MPP SQL engine with native Iceberg catalog support, materialized views, and local caching for performance. Open source (Apache 2.0).
  • Apache Doris - Real-time MPP data warehouse with native read/write support for V1/V2 tables, position/equality deletes, and CTAS operations. Open source (Apache 2.0).
  • DuckDB - In-process analytical database with Iceberg support via extension, excellent for local analytics and single-machine deployments. Open source (MIT).
  • ClickHouse - Columnar database with read-only Iceberg support via table function and engine, optimized for real-time analytics. Open source (Apache 2.0).
  • Polars - DataFrame query engine with read support for Iceberg and experimental write support. Open source.
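
Several engines above advertise time travel. Conceptually, an engine resolves an "as of" timestamp to the most recent snapshot committed at or before that time, then reads the table state recorded by that snapshot. The sketch below illustrates only that lookup, in plain Python. The `Snapshot` class and `snapshot_as_of` function are hypothetical names chosen for illustration; they are not part of any Iceberg library or engine.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int  # commit time in milliseconds since the epoch


def snapshot_as_of(history: list[Snapshot], as_of_ms: int) -> Snapshot:
    """Return the latest snapshot committed at or before as_of_ms.

    Assumes `history` is sorted by timestamp_ms in ascending order.
    """
    timestamps = [s.timestamp_ms for s in history]
    idx = bisect_right(timestamps, as_of_ms)
    if idx == 0:
        raise ValueError("no snapshot exists at or before the requested time")
    return history[idx - 1]


# Hypothetical snapshot log with three commits.
history = [
    Snapshot(snapshot_id=101, timestamp_ms=1_000),
    Snapshot(snapshot_id=102, timestamp_ms=2_000),
    Snapshot(snapshot_id=103, timestamp_ms=3_000),
]
print(snapshot_as_of(history, 2_500).snapshot_id)  # 102
```

Each engine exposes this through its own SQL syntax or reader options, so check the engine's documentation for the exact form.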

Cloud and Commercial Engines

  • Amazon Athena - Serverless query engine with native Iceberg support for DDL/DML operations, time travel, and automatic metadata generation. Commercial (AWS).
  • Google BigQuery - BigLake Iceberg tables with fully managed storage, automatic optimization, high-throughput streaming, and cross-engine compatibility. Commercial (Google Cloud).
  • Snowflake - Native Iceberg support (GA April 2025) with managed and external tables, catalog integrations, and data sharing capabilities. Commercial.
  • Databricks - Unity Catalog-managed Iceberg tables (Public Preview) with Predictive Optimization, REST Catalog API, and Delta UniForm for interoperability. Commercial.
  • Dremio - SQL lakehouse platform with native Iceberg integration, Apache Polaris catalog, automatic optimization, and sub-second BI performance. Commercial with open source components.
  • Starburst/Trino Enterprise - Enterprise Trino distribution with Iceberg v3 support, deletion vectors, materialized view auto-refresh, and AI-powered features. Commercial.

Catalogs

Open Source Catalogs

  • Apache Polaris (Incubating) - Fully-featured REST catalog donated by Snowflake with multi-engine interoperability, RBAC, credential vending, and support for internal/external modes. Open source (Apache 2.0).
  • Project Nessie - Git-like transactional catalog with branches, tags, commits, and multi-table transaction support for data-as-code workflows. Open source (Apache 2.0).
  • Hive Metastore Catalog - Traditional HMS integration storing Iceberg metadata in relational databases, best for existing Hive ecosystems. Open source (Apache 2.0).
  • REST Catalog - RESTful specification enabling standardized catalog operations across any language and platform. Open source specification (Apache 2.0).
  • Hadoop/File-Based Catalog - File-based catalog using version-hint.text files, suitable for testing and single-writer scenarios but not recommended for production with object storage. Open source (Apache 2.0).
  • JDBC Catalog - Uses relational database table for Iceberg metadata management with serializable isolation for ACID transactions. Open source (Apache 2.0).
  • OpenMetadata - Unified metadata platform with native Iceberg connector for automated metadata extraction, lineage tracking, and data quality integration across 90+ sources. Open source (Apache 2.0).
  • DataHub - Metadata management platform with native Iceberg source connector and REST Catalog API implementation for discovery and governance. Open source (Apache 2.0).

Cloud and Commercial Catalogs

  • AWS Glue Data Catalog - Native AWS catalog with optimistic locking, automatic table optimization, REST API endpoint, and integration with Athena, EMR, and Redshift. Commercial (AWS).
  • DynamoDB Catalog - AWS DynamoDB-based catalog with optimistic locking and high write throughput for streaming workloads. Commercial (AWS).
  • Google Dataproc Metastore - Fully managed Hive metastore service with BigLake REST Catalog support and native Iceberg integration. Commercial (Google Cloud).
  • Unity Catalog - Databricks unified governance solution with full Iceberg REST API implementation, multi-format support, and lakehouse federation. Commercial.
  • Snowflake Catalog - Native catalog for Snowflake-managed Iceberg tables with external volume integrations and Polaris catalog support. Commercial.
  • Dremio Arctic - Managed Nessie-based catalog with Git-like versioning, automated optimization, and branch/tag management. Commercial.
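
Several catalogs above mention optimistic locking. The general idea is that a writer reads the table's current metadata pointer, prepares new metadata, and asks the catalog to swap the pointer only if it still matches what the writer originally read. If another writer got there first, the commit is rejected and the writer retries. The sketch below models this with an in-memory dictionary; `InMemoryCatalog`, `CommitConflict`, and the metadata file names are hypothetical and do not correspond to any real catalog API.

```python
import threading
from typing import Optional


class CommitConflict(Exception):
    """Raised when the table changed since the writer last read it."""


class InMemoryCatalog:
    """Toy catalog mapping a table name to its current metadata location."""

    def __init__(self) -> None:
        self._pointers = {}
        self._lock = threading.Lock()  # makes the compare-and-swap atomic

    def current(self, table: str) -> Optional[str]:
        return self._pointers.get(table)

    def commit(self, table: str, expected: Optional[str], new: str) -> None:
        """Swap the pointer only if it still equals `expected`."""
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(table)
            self._pointers[table] = new


catalog = InMemoryCatalog()
catalog.commit("db.events", expected=None, new="v1.metadata.json")

# Two writers start from the same base version; writer A commits first.
base = catalog.current("db.events")
catalog.commit("db.events", expected=base, new="v2-writer-a.metadata.json")

try:
    catalog.commit("db.events", expected=base, new="v2-writer-b.metadata.json")
except CommitConflict:
    # Writer B refreshes its view of the table and retries on the new base.
    base = catalog.current("db.events")
    catalog.commit("db.events", expected=base, new="v3-writer-b.metadata.json")

print(catalog.current("db.events"))  # v3-writer-b.metadata.json
```

Real catalogs implement the atomic swap with whatever their backing store provides, such as a conditional write or a database transaction.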

Libraries and SDKs

Official Implementations

  • PyIceberg - Official Python library with full read/write capabilities, multiple catalog support, PyArrow integration, and DuckDB queries on Iceberg tables. Python 3.9+. Open source (Apache 2.0).
  • Apache Iceberg Java API - Reference implementation containing core specifications, all catalog types, transaction support, and processing engine integrations. Java 11+. Open source (Apache 2.0).
  • iceberg-rust - Official Rust implementation with DataFusion integration, multiple catalog support, and Arrow-based data processing. Stable Rust. Open source (Apache 2.0).
  • iceberg-go - Official Golang implementation with read/write capabilities, REST/Hive/Glue catalog support, and CLI tool similar to PyIceberg. Go 1.23+. Open source (Apache 2.0).
  • iceberg-cpp - Official C++ implementation with Apache Arrow integration and CMake build system. C++23 compliant. Open source (Apache 2.0).

Community Implementations

  • Icebird - JavaScript/TypeScript library for reading Iceberg tables with Parquet/Avro support, browser and Node.js compatible. Read-only. Open source.

Data Ingestion

CDC Tools

  • Debezium Server Iceberg - Direct CDC to Iceberg without Kafka or Spark, supporting PostgreSQL, MySQL, MongoDB, SQL Server, Oracle with real-time change capture. Open source (Apache 2.0).
  • AWS DMS with Amazon Data Firehose - Captures database changes and streams updates to Iceberg tables with automatic table creation and schema management. Commercial (AWS).
  • Upsolver - Cloud-based CDC platform with native Iceberg support, equality deletes as a first-class citizen, and aggressive compaction for thousands of changes per minute. Commercial (acquired by Qlik in 2025).
  • Estuary Flow - Real-time data integration with sub-100ms latency, automatic schema evolution, and CDC capabilities from Kafka to Iceberg. Commercial.

Streaming Connectors

  • Apache Iceberg Kafka Connect - Official sink connector with exactly-once semantics, multi-table fan-out, automatic schema evolution, and DebeziumTransform SMT for CDC. Open source (Apache 2.0).
  • Apache Flink Iceberg Connector - Native streaming and batch integration with equality delete support, CDC processing, and both DataStream API and Table/SQL API. Open source (Apache 2.0).
  • Apache Flink CDC Pipeline Connector - Dedicated CDC connector with automatic table creation, schema synchronization, and direct pipeline from databases to Iceberg. Open source (Apache 2.0).
  • Apache Spark Structured Streaming - Native streaming reads and writes with micro-batch processing, checkpoint support, and fanout writer for low-latency ingestion. Open source (Apache 2.0).
  • Confluent Tableflow - Automatically exposes Kafka topics as Iceberg tables with Flink SQL transformations and seamless integration with Confluent Cloud. Commercial.
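
The connectors above mention equality deletes, which streaming writers tend to favor because a delete can be recorded by key without first locating the row. Position deletes, by contrast, identify a row by its data file and row position. The following is a minimal sketch of how a reader might apply both kinds at read time. It is a simplified model that ignores sequence numbers and file formats, and `live_rows` plus the sample file name are hypothetical.

```python
from typing import Iterator

# Rows of one data file; a row's position is its index in the list.
data_file = "data-0001.parquet"
rows = [
    {"id": 1, "status": "new"},
    {"id": 2, "status": "new"},
    {"id": 3, "status": "new"},
    {"id": 4, "status": "new"},
]

# Position deletes name a data file and a row position within it.
position_deletes = {("data-0001.parquet", 0)}

# Equality deletes name column values; every matching row is dropped.
equality_deletes = [{"id": 3}]


def live_rows(file_path, rows, position_deletes, equality_deletes) -> Iterator[dict]:
    """Yield the rows that survive both kinds of delete."""
    for pos, row in enumerate(rows):
        if (file_path, pos) in position_deletes:
            continue
        if any(
            all(row.get(col) == val for col, val in delete.items())
            for delete in equality_deletes
        ):
            continue
        yield row


surviving = [r["id"] for r in live_rows(data_file, rows, position_deletes, equality_deletes)]
print(surviving)  # [2, 4]
```

The trade-off this illustrates: equality deletes are cheap to write but cost a comparison per row at read time, which is one reason compaction matters for CDC-heavy tables.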

Batch Ingestion

  • Airbyte - Open-source ELT platform with 600+ connectors, S3 Data Lake destination supporting Iceberg format, and automated schema mapping with AWS Glue/Nessie catalogs. Open source core with Cloud offering.
  • Fivetran - Managed data movement with 700+ connectors, Managed Data Lake Service with native Iceberg support, ACID transactions, and Iceberg REST catalog. Commercial.
  • dbt - Data transformation framework with native Iceberg materializations for Snowflake, BigQuery, Spark, and Databricks, supporting incremental strategies and catalog integrations. Open source core with Cloud offering.
  • AWS Glue - Managed ETL service with native Iceberg connector, optimistic locking, and Lake Formation integration. Commercial (AWS).
  • dlt (data load tool) - Lightweight Python library for moving data, with Iceberg support through a third-party connector. Open source.

Data Quality and Governance

Data Quality Tools

  • Great Expectations - Expectations-based testing framework working with Iceberg through Spark/Trino, supporting Write-Audit-Publish pattern with Iceberg branching. Open source (Apache 2.0).
  • Soda Core - Python-based data quality framework with native Iceberg branch-level checks, 25+ built-in metrics, and YAML-based SodaCL. Open source (Apache 2.0).
  • Soda Cloud - Commercial platform with AI-powered metrics observability (SodaGPT), collaborative data contracts, and metrics storage in Iceberg tables. Commercial.
  • Monte Carlo - Data observability platform with end-to-end pipeline monitoring, anomaly detection, and incident management for Iceberg tables. Commercial.
  • Microsoft Purview - Data quality assessment on Iceberg assets with profiling, schema import, and time travel for historical quality views. Commercial (Microsoft, Public Preview).
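
The Write-Audit-Publish pattern mentioned above pairs naturally with Iceberg branches: new data is written to a side branch, checked there, and only then made visible on the main branch. The sketch below shows the control flow with a toy in-memory table. `ToyTable`, `audit`, and the branch name are hypothetical and stand in for whatever engine and quality tool are actually used.

```python
class ToyTable:
    """Toy table whose branches each point at an immutable tuple of rows."""

    def __init__(self) -> None:
        self.refs = {"main": ()}

    def create_branch(self, name: str, source: str = "main") -> None:
        self.refs[name] = self.refs[source]

    def append(self, branch: str, rows: list) -> None:
        self.refs[branch] = self.refs[branch] + tuple(rows)

    def fast_forward(self, target: str, source: str) -> None:
        self.refs[target] = self.refs[source]


def audit(rows) -> bool:
    """Example check: every row must carry a non-null id."""
    return all(row.get("id") is not None for row in rows)


table = ToyTable()
table.create_branch("audit")

# Write: new data lands on the audit branch, invisible to readers of main.
table.append("audit", [{"id": 1}, {"id": 2}])

# Audit, then publish only if the checks pass.
if audit(table.refs["audit"]):
    table.fast_forward("main", "audit")

print(len(table.refs["main"]))  # 2
```

If the audit fails, main is never touched, so downstream readers keep seeing the last good state while the bad batch is investigated on the branch.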

Governance Tools

  • OpenMetadata - Unified metadata platform with native Iceberg connector (v0.9.0+), automated lineage tracking, and support for REST, Hive, Glue, and DynamoDB catalogs. Open source (Apache 2.0).
  • DataHub - Metadata management with native Iceberg connector, REST Catalog API implementation, and real-time metadata updates using PyIceberg. Open source (Apache 2.0).
  • Apache Atlas - Metadata management for Hadoop ecosystems with native Iceberg integration in Cloudera CDP, lineage tracking, and schema evolution support. Open source (Apache 2.0).
  • Project Nessie - Git-like catalog with reference-based and path-based access control, commit-level governance, and branch-specific permissions. Open source (Apache 2.0).
  • AWS Lake Formation - Cell-level access permissions, fine-grained access control, and row/column-level security for Iceberg tables on AWS. Commercial (AWS).
  • Unity Catalog - Unified governance for Delta, Iceberg, Hudi, and Parquet with centralized access control, lineage, and business metrics governance. Commercial.

Monitoring and Observability

  • Apache Iceberg Metrics API - Built-in MetricsReporter API (v1.1.0+) with ScanReport and CommitReport metrics for scan planning, file operations, and commit tracking. Open source (Apache 2.0).
  • AWS CloudWatch for Iceberg - Time-series metrics collection from Iceberg metadata with EventBridge triggers, Lambda functions, and dashboard templates. Open source sample + AWS service.
  • Monte Carlo - Full observability platform with automatic anomaly detection, data freshness/volume/schema monitoring, and integration with Slack/PagerDuty/Jira. Commercial.
  • Dremio Arctic - Automatic table optimization monitoring, branch/tag health tracking, and query performance metrics with Git-like version control observability. Commercial.

Table Maintenance

  • Apache Spark Actions - Built-in maintenance operations including rewriteDataFiles (binpack, sort, z-order), expireSnapshots, rewriteManifests, and removeOrphanFiles with parallel execution. Open source (Apache 2.0).
  • Spark SQL Procedures - SQL-based maintenance commands such as CALL system.rewrite_data_files() that run compaction and cleanup from a Spark SQL session without custom application code. Open source (Apache 2.0).
  • AWS Glue Optimization - Automatic compaction configuration with binpack, sort, z-order strategies and integration with AWS Glue jobs and EMR Serverless. Commercial (AWS).
  • Amazon Athena Maintenance - OPTIMIZE command for compaction and VACUUM for snapshot expiration and orphan file removal, with table property-based configuration. Commercial (AWS).
  • Snowflake Automatic Maintenance - Automatic compaction for Snowflake-managed tables with configurable target file sizes and position delete handling. Commercial.
  • Upsolver - Continuous table optimization with small file compaction, sorting, compression, repartitioning, and health analysis of Iceberg tables. Commercial.
  • Dremio Arctic Optimizer - Automatic background optimization with compaction strategies, snapshot management, orphan file cleanup, and branch-based optimization. Commercial.
  • Amazon S3 Tables - Managed Iceberg with automatic maintenance scheduling including binpack, sort, and z-order compaction strategies. Commercial (AWS).
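
Many of the tools above offer a binpack compaction strategy, which combines small data files into fewer files close to a target size. The sketch below plans such groups with a greedy first-fit-decreasing heuristic. This is an illustrative stand-in, not the algorithm used by Iceberg or any of the listed services, and `plan_binpack` and the sample sizes are hypothetical.

```python
def plan_binpack(file_sizes_mb: list[int], target_mb: int) -> list[list[int]]:
    """Group small files into bins of at most target_mb (first-fit decreasing).

    Files at or above the target are left alone, since rewriting them gains little.
    """
    bins: list[list[int]] = []
    small = sorted((s for s in file_sizes_mb if s < target_mb), reverse=True)
    for size in small:
        for group in bins:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([size])
    # A bin holding a single file has nothing to merge with, so skip it.
    return [group for group in bins if len(group) > 1]


file_sizes = [10, 200, 60, 512, 300, 40, 128]
for group in plan_binpack(file_sizes, target_mb=512):
    print(group, "->", sum(group), "MB")
# [300, 200, 10] -> 510 MB
# [128, 60, 40] -> 228 MB
```

Sort and z-order strategies additionally reorder rows within the rewritten files, which this size-only sketch does not model.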

Migration Tools

  • Iceberg Table Migration - Built-in migration capabilities supporting in-place metadata migration (Snapshot Table, Migrate Table, Add Files) and full data migration via CTAS/INSERT. Open source (Apache 2.0).
  • Hive Migration Module - Official migration from ORC, Parquet, and Avro Hive tables via Spark Procedures. Open source (Apache 2.0).
  • Delta Lake Migration Module - iceberg-delta-lake module with snapshotDeltaLakeTable action supporting Delta protocol minReaderVersion 1, minWriterVersion 2. Open source (Apache 2.0).
  • Iceberg Catalog Migrator - CLI tool for bulk catalog migrations without data copy, supporting migrate and register commands. Requires Java 21+. Open source (Apache 2.0).
  • Delta Lake UniForm - Enables Iceberg reads on Delta Lake tables by generating Iceberg metadata asynchronously without data rewriting. Requires Unity Catalog. Commercial (Databricks).
  • Apache XTable (Incubating) - Seamless interoperability between Hudi, Delta, and Iceberg co-launched by Microsoft, Google, and Onehouse. Open source (Apache incubating).

Cloud Integrations

AWS

  • Amazon Athena - Serverless query engine with native Iceberg support, MERGE/UPDATE/DELETE operations, time travel, and automatic metadata generation. Commercial.
  • Amazon EMR - Managed big data platform with Spark integration, Iceberg v3 deletion vectors (EMR 7.10.0+), and AWS Glue catalog integration. Commercial.
  • AWS Glue - Managed ETL with native Iceberg connector, optimistic locking by default (Glue 4.0+), and REST catalog support (Glue 5.0+). Commercial.
  • Amazon Redshift - Query Redshift Managed Storage (RMS) tables as Iceberg tables via SageMaker Lakehouse with REST catalog backed by AWS Glue. Commercial.
  • AWS Lake Formation - Row and column-level security, fine-grained access control, and centralized permissions management for Iceberg tables. Commercial.
  • Amazon SageMaker Lakehouse - Unified lakehouse experience with Iceberg REST catalog backed by AWS Glue, accessing Redshift and S3 data via Iceberg format. Commercial.
  • Amazon S3 Tables - Fully managed Iceberg tables with automatic maintenance (compaction, snapshot expiration, orphan removal) and sort/z-order strategies. Commercial.
  • Amazon Data Firehose - Streaming CDC to Iceberg tables with automatic scaling and schema management for PostgreSQL, MySQL, Oracle, SQL Server, MongoDB. Commercial.

Google Cloud Platform

  • BigQuery - BigLake Iceberg tables (GA) with fully managed storage, automatic optimization, high-throughput streaming, and cross-engine compatibility via Spark/Flink connectors. Commercial.
  • Google Dataproc - Managed Spark/Hadoop with full Iceberg capabilities through Apache Spark, BigLake metastore integration, and serverless Spark option. Commercial.
  • BigLake Metastore - REST Catalog implementation with unified catalog access and BigQuery Storage API integration. Commercial.
  • Google Cloud Storage - Native GCS integration with GCSFileIO, Application Default Credentials, and customer-managed encryption keys. Commercial.

Microsoft Azure

  • Azure Synapse Analytics - Spark pools with Iceberg support (manual JAR configuration), ACID transactions, schema evolution, and ADLS Gen2 integration. Commercial.
  • Microsoft Fabric - OneLake bidirectional table format virtualization (Public Preview Nov 2024) with Delta-Iceberg interoperability and shortcut integration to external Iceberg tables. Commercial.
  • Azure Data Factory - Data pipeline support for Iceberg format (Nov 2024) with copy activity and transformation capabilities. Commercial.
  • Microsoft Purview - Data governance with Iceberg support for metadata management, lineage tracking, and data quality assessment. Commercial.

Commercial Products

  • Snowflake - Cloud data platform with native Iceberg support (GA April 2025), managed and external tables, multiple catalog integrations, data sharing, and Polaris catalog. Commercial.
  • Databricks - Lakehouse platform with Unity Catalog-managed Iceberg tables (Public Preview), Predictive Optimization, and Delta UniForm for interoperability; acquired Tabular in June 2024. Commercial.
  • Dremio - SQL lakehouse with native Iceberg integration, Apache Polaris catalog, automatic optimization, Git-like versioning, and sub-second BI performance. Commercial with open source components.
  • Starburst Galaxy - Managed Trino with Iceberg v3 support, deletion vectors, materialized view auto-refresh, Apache Polaris connector, and AI-powered features. Commercial.
  • Tabular - Managed Iceberg service created by Iceberg founders with centralized RBAC, auto-optimization, and multi-engine support (acquired by Databricks June 2024). Commercial.
  • Cloudera Data Platform - Hybrid cloud platform with first-class Iceberg support via Hive, Impala, Spark, Flink, REST catalog, and SDX integration for security and governance. Commercial.
  • Upsolver - Real-time streaming ingestion platform with native Iceberg support, Adaptive Iceberg Optimizer, and CDC processing (acquired by Qlik January 2025). Commercial.
  • Confluent - Enterprise Kafka platform with Tableflow for automatic Iceberg table creation, managed Kafka Connect with Iceberg sink, and Flink integration. Commercial.

Development Tools

Command Line Tools

  • PyIceberg CLI - Official CLI for describing, listing, and managing Iceberg tables, with commands for schema, properties, snapshots, and table operations. Python. Open source (Apache 2.0).
  • Iceberg Go CLI - Go implementation similar to PyIceberg CLI for table and namespace operations. Go 1.23+. Open source (Apache 2.0).
  • Upsolver Iceberg Diagnostic CLI - Evaluates Iceberg tables for optimization opportunities with side-by-side comparison of current vs optimized metrics. Installable via Homebrew or pip. Commercial.
  • Iceberg Catalog Migrator CLI - Bulk migrate Iceberg tables between catalogs with migrate and register commands. Java 21+. Open source (Apache 2.0).

Development Environments

  • Spark Iceberg Docker Image - Official tabulario/spark-iceberg Docker image with pre-configured Spark cluster, Iceberg catalog, and Jupyter notebook environment for local development. Open source (Apache 2.0).
  • Iceberg REST API Test - Example project testing Iceberg REST API with PyIceberg for namespace and table operations. Open source.
  • AWS Iceberg Streaming Examples - High-throughput IoT and CDC examples with best practices for Spark + Iceberg streaming, deployable to EMR Serverless and AWS Glue. Open source samples.

Testing

  • TestHiveMetastore - JUnit test helper from the iceberg-hive-metastore module that runs a local Thrift service backed by Derby databases for isolated Iceberg testing. Open source (Apache 2.0).
  • Testcontainers - Docker-based integration testing framework used in the official Iceberg repository for testing catalog implementations and engine integrations. Open source (MIT).

Community

Events and Conferences

  • Iceberg Summit 2025 - Second annual ASF-sanctioned event, hybrid format (San Francisco April 8, virtual April 9), organized by AWS, Dremio, Microsoft, Snowflake with 500+ expected attendees. Official.
  • Apache Iceberg Talks - Curated list of talks and videos related to Iceberg, updated regularly with conference presentations and technical deep-dives. Official.

Discussion Channels

  • Apache Iceberg Slack - Active community workspace with channels for meetups, vendors, jobs, and technical discussions. Join here. Official.
  • Mailing Lists - dev@, commits@, and private@ lists for development discussions, commit notifications, and PMC discussions. Subscribe at [email protected]. Official (ASF).
  • GitHub Discussions - Issue tracking, pull requests, and community discussions on specific features and bugs. Official (ASF).

Meetups

  • Iceberg Community Meetups - Platform for starting and joining local Apache Iceberg meetups with Slack channels like #meetup-seattle, #meetup-atlanta for regional communities. Community-organized.

Social Media

  • @IcebergDevs Twitter - Developer-focused Twitter account for latest announcements, blogs, and project updates. Community advocacy.

Learning Resources

Official Documentation

  • Apache Iceberg Documentation - Comprehensive official documentation covering schema evolution, partitioning, time travel, performance tuning, and multi-engine support. Official (ASF).
  • API Documentation - Java API reference for core Iceberg interfaces and implementations. Official (ASF).

Books

Online Courses

Tutorials and Guides

Best Practices

Blogs and Articles

  • Dremio Blog - Extensive collection of Iceberg articles, tutorials, comparisons, and feature announcements with regular updates. Vendor.
  • Tabular Blog - Monthly Iceberg community news (2023-2024), technical deep-dives, and the Apache Iceberg Cookbook with 30+ recipes. Vendor.
  • 2025 Guide to Architecting an Iceberg Lakehouse - Comprehensive architecture guide with self-audit questions, tool selection, and best practices (December 2024). Community.
  • What's New in Apache Iceberg V3? - Binary deletion vectors with Roaring bitmaps, default column values, and V3 specification ratification (2025). Google official.
  • 10 Future Iceberg Developments for 2025 - Scan Planning endpoint, interoperable views, streaming support, and Hybrid Catalog GA (January 2025). Community.

Video Resources

  • Apache Iceberg YouTube Channel - Official talks, tutorials, conference recordings, and Iceberg Summit 2024 playlist with 32 talks. Official (ASF).
  • Gnarly Data Waves Podcast - Apache Iceberg Office Hours recordings and dedicated episodes on Iceberg topics, available on YouTube and Spotify. Community.

Example Projects

  • Iceberg Spark S3 Examples - Spark SQL examples with Iceberg on S3 and Java API usage demonstrations. Community.
  • Iceberg Streaming Examples (AWS) - High-throughput IoT and CDC scenarios with Spark + Iceberg streaming best practices and local development environment. AWS samples.
  • Iceberg Demo (Flink + Trino) - Streaming writes to Iceberg on GCS with Flink, and reads with Trino via the Iceberg connector. Community.
  • Iceberg in Production - Curated blogs and videos showing real-world Iceberg production usage from various organizations. Community.

Contributing

Contributions welcome! Please read the contribution guidelines first. Submit a pull request with your suggestions following the awesome-x format.

License

CC0

To the extent possible under law, the contributors have waived all copyright and related rights to this work.
