A curated list of awesome Apache Iceberg resources, libraries, tools, and learning materials. Apache Iceberg is a high-performance open table format for huge analytic datasets, bringing reliability and simplicity to data lakes.
- Core Project
- Query Engines
- Catalogs
- Libraries and SDKs
- Data Ingestion
- Data Quality and Governance
- Monitoring and Observability
- Table Maintenance
- Migration Tools
- Cloud Integrations
- Commercial Products
- Development Tools
- Testing
- Community
- Learning Resources
- Apache Iceberg - Official website for Apache Iceberg, a high-performance open table format for analytic datasets. Open source (Apache 2.0).
- Apache Iceberg GitHub - Reference Java implementation with core table format specifications, catalog implementations, and processing engine integrations. Open source (Apache 2.0).
- Documentation - Comprehensive official documentation covering all features, APIs, and integrations. Open source.
- Apache Spark - First-class native Iceberg support with full read/write capabilities, streaming, time travel, and all DDL/DML operations. Supports Spark 3.3, 3.4, 3.5. Open source (Apache 2.0).
- Apache Flink - Native streaming and batch support with CDC pipeline connectors, incremental reads, and equality delete writes for efficient streaming use cases. Open source (Apache 2.0).
- Trino - Production-ready native connector with full read/write support, time travel, multiple catalog types, and metadata table access. Open source (Apache 2.0).
- PrestoDB - Native connector with V1/V2 table support, time travel queries, and REST catalog integration. Open source (Apache 2.0).
- Apache Impala - Native C++ implementation with read/write support, partition transforms, time travel, and OPTIMIZE statement for maintenance. Open source (Apache 2.0).
- Apache Hive - StorageHandler-based integration with full DDL/DML support in Hive 4.0+ and table migration from Avro/Parquet/ORC. Open source (Apache 2.0).
- StarRocks - MPP SQL engine with native Iceberg catalog support, materialized views, and local caching for performance. Open source (Apache 2.0).
- Apache Doris - Real-time MPP data warehouse with native read/write support for V1/V2 tables, position/equality deletes, and CTAS operations. Open source (Apache 2.0).
- DuckDB - In-process analytical database with Iceberg support via extension, excellent for local analytics and single-machine deployments. Open source (MIT).
- ClickHouse - Columnar database with read-only Iceberg support via table function and engine, optimized for real-time analytics. Open source (Apache 2.0).
- Polars - DataFrame query engine with read support for Iceberg and experimental write support. Open source (MIT).
- Amazon Athena - Serverless query engine with native Iceberg support for DDL/DML operations, time travel, and automatic metadata generation. Commercial (AWS).
- Google BigQuery - BigLake Iceberg tables with fully managed storage, automatic optimization, high-throughput streaming, and cross-engine compatibility. Commercial (Google Cloud).
- Snowflake - Native Iceberg support (GA April 2025) with managed and external tables, catalog integrations, and data sharing capabilities. Commercial.
- Databricks - Unity Catalog-managed Iceberg tables (Public Preview) with Predictive Optimization, REST Catalog API, and Delta UniForm for interoperability. Commercial.
- Dremio - SQL lakehouse platform with native Iceberg integration, Apache Polaris catalog, automatic optimization, and sub-second BI performance. Commercial with open source components.
- Starburst/Trino Enterprise - Enterprise Trino distribution with Iceberg v3 support, deletion vectors, materialized view auto-refresh, and AI-powered features. Commercial.
- Apache Polaris (Incubating) - Fully-featured REST catalog donated by Snowflake with multi-engine interoperability, RBAC, credential vending, and support for internal/external modes. Open source (Apache 2.0).
- Project Nessie - Git-like transactional catalog with branches, tags, commits, and multi-table transaction support for data-as-code workflows. Open source (Apache 2.0).
- Hive Metastore Catalog - Traditional HMS integration storing Iceberg metadata in relational databases, best for existing Hive ecosystems. Open source (Apache 2.0).
- REST Catalog - RESTful specification enabling standardized catalog operations across any language and platform. Open source specification (Apache 2.0).
- Hadoop/File-Based Catalog - File-based catalog using version-hint.text files, suitable for testing and single-writer scenarios but not recommended for production with object storage. Open source (Apache 2.0).
- JDBC Catalog - Uses relational database tables for Iceberg metadata management with serializable isolation for ACID transactions. Open source (Apache 2.0).
- OpenMetadata - Unified metadata platform with native Iceberg connector for automated metadata extraction, lineage tracking, and data quality integration across 90+ sources. Open source (Apache 2.0).
- DataHub - Metadata management platform with native Iceberg source connector and REST Catalog API implementation for discovery and governance. Open source (Apache 2.0).
- AWS Glue Data Catalog - Native AWS catalog with optimistic locking, automatic table optimization, REST API endpoint, and integration with Athena, EMR, and Redshift. Commercial (AWS).
- DynamoDB Catalog - AWS DynamoDB-based catalog with optimistic locking and high write throughput for streaming workloads. Commercial (AWS).
- Google Dataproc Metastore - Fully managed Hive metastore service with BigLake REST Catalog support and native Iceberg integration. Commercial (Google Cloud).
- Unity Catalog - Databricks unified governance solution with full Iceberg REST API implementation, multi-format support, and lakehouse federation. Commercial.
- Snowflake Catalog - Native catalog for Snowflake-managed Iceberg tables with external volume integrations and Polaris catalog support. Commercial.
- Dremio Arctic - Managed Nessie-based catalog with Git-like versioning, automated optimization, and branch/tag management. Commercial.
- PyIceberg - Official Python library with full read/write capabilities, multiple catalog support, PyArrow integration, and DuckDB queries on Iceberg tables. Python 3.9+. Open source (Apache 2.0).
- Apache Iceberg Java API - Reference implementation containing core specifications, all catalog types, transaction support, and processing engine integrations. Java 11+. Open source (Apache 2.0).
- iceberg-rust - Official Rust implementation with DataFusion integration, multiple catalog support, and Arrow-based data processing. Stable Rust. Open source (Apache 2.0).
- iceberg-go - Official Golang implementation with read/write capabilities, REST/Hive/Glue catalog support, and CLI tool similar to PyIceberg. Go 1.23+. Open source (Apache 2.0).
- iceberg-cpp - Official C++ implementation with Apache Arrow integration and CMake build system. C++23 compliant. Open source (Apache 2.0).
- Icebird - JavaScript/TypeScript library for reading Iceberg tables with Parquet/Avro support, browser and Node.js compatible. Read-only. Open source.
- Debezium Server Iceberg - Direct CDC to Iceberg without Kafka or Spark, supporting PostgreSQL, MySQL, MongoDB, SQL Server, Oracle with real-time change capture. Open source (Apache 2.0).
- AWS DMS with Amazon Data Firehose - Captures database changes and streams updates to Iceberg tables with automatic table creation and schema management. Commercial (AWS).
- Upsolver - Cloud-based CDC platform with native Iceberg support, equality deletes as a first-class citizen, and aggressive compaction for thousands of changes per minute. Commercial (acquired by Qlik in January 2025).
- Estuary Flow - Real-time data integration with sub-100ms latency, automatic schema evolution, and CDC capabilities from Kafka to Iceberg. Commercial.
- Apache Iceberg Kafka Connect - Official sink connector with exactly-once semantics, multi-table fan-out, automatic schema evolution, and DebeziumTransform SMT for CDC. Open source (Apache 2.0).
- Apache Flink Iceberg Connector - Native streaming and batch integration with equality delete support, CDC processing, and both DataStream API and Table/SQL API. Open source (Apache 2.0).
- Apache Flink CDC Pipeline Connector - Dedicated CDC connector with automatic table creation, schema synchronization, and direct pipeline from databases to Iceberg. Open source (Apache 2.0).
- Apache Spark Structured Streaming - Native streaming reads and writes with micro-batch processing, checkpoint support, and fanout writer for low-latency ingestion. Open source (Apache 2.0).
- Confluent Tableflow - Automatically exposes Kafka topics as Iceberg tables with Flink SQL transformations and seamless integration with Confluent Cloud. Commercial.
- Airbyte - Open-source ELT platform with 600+ connectors, S3 Data Lake destination supporting Iceberg format, and automated schema mapping with AWS Glue/Nessie catalogs. Open source core with Cloud offering.
- Fivetran - Managed data movement with 700+ connectors, Managed Data Lake Service with native Iceberg support, ACID transactions, and Iceberg REST catalog. Commercial.
- dbt - Data transformation framework with native Iceberg materializations for Snowflake, BigQuery, Spark, and Databricks, supporting incremental strategies and catalog integrations. Open source core with Cloud offering.
- AWS Glue - Managed ETL service with native Iceberg connector, optimistic locking, and Lake Formation integration. Commercial (AWS).
- dlt - Lightweight Python library for moving data, with a third-party connector. Open source.
- Great Expectations - Expectations-based testing framework working with Iceberg through Spark/Trino, supporting Write-Audit-Publish pattern with Iceberg branching. Open source (Apache 2.0).
- Soda Core - Python-based data quality framework with native Iceberg branch-level checks, 25+ built-in metrics, and YAML-based SodaCL. Open source (Apache 2.0).
- Soda Cloud - Commercial platform with AI-powered metrics observability (SodaGPT), collaborative data contracts, and metrics storage in Iceberg tables. Commercial.
- Monte Carlo - Data observability platform with end-to-end pipeline monitoring, anomaly detection, and incident management for Iceberg tables. Commercial.
- Microsoft Purview - Data quality assessment on Iceberg assets with profiling, schema import, and time travel for historical quality views. Commercial (Microsoft, Public Preview).
- OpenMetadata - Unified metadata platform with native Iceberg connector (v0.9.0+), automated lineage tracking, and support for REST, Hive, Glue, and DynamoDB catalogs. Open source (Apache 2.0).
- DataHub - Metadata management with native Iceberg connector, REST Catalog API implementation, and real-time metadata updates using PyIceberg. Open source (Apache 2.0).
- Apache Atlas - Metadata management for Hadoop ecosystems with native Iceberg integration in Cloudera CDP, lineage tracking, and schema evolution support. Open source (Apache 2.0).
- Project Nessie - Git-like catalog with reference-based and path-based access control, commit-level governance, and branch-specific permissions. Open source (Apache 2.0).
- AWS Lake Formation - Cell-level access permissions, fine-grained access control, and row/column-level security for Iceberg tables on AWS. Commercial (AWS).
- Unity Catalog - Unified governance for Delta, Iceberg, Hudi, and Parquet with centralized access control, lineage, and business metrics governance. Commercial.
- Apache Iceberg Metrics API - Built-in MetricsReporter API (v1.1.0+) with ScanReport and CommitReport metrics for scan planning, file operations, and commit tracking. Open source (Apache 2.0).
- AWS CloudWatch for Iceberg - Time-series metrics collection from Iceberg metadata with EventBridge triggers, Lambda functions, and dashboard templates. Open source sample + AWS service.
- Monte Carlo - Full observability platform with automatic anomaly detection, data freshness/volume/schema monitoring, and integration with Slack/PagerDuty/Jira. Commercial.
- Dremio Arctic - Automatic table optimization monitoring, branch/tag health tracking, and query performance metrics with Git-like version control observability. Commercial.
- Apache Spark Actions - Built-in maintenance operations including rewriteDataFiles (binpack, sort, z-order), expireSnapshots, rewriteManifests, and removeOrphanFiles with parallel execution. Open source (Apache 2.0).
- Spark SQL Procedures - SQL-based maintenance commands like `CALL system.rewrite_data_files()` for no-code, command-line execution of compaction and cleanup. Open source (Apache 2.0).
- AWS Glue Optimization - Automatic compaction configuration with binpack, sort, z-order strategies and integration with AWS Glue jobs and EMR Serverless. Commercial (AWS).
- Amazon Athena Maintenance - OPTIMIZE command for compaction and VACUUM for orphan file cleanup with table property-based configuration. Commercial (AWS).
- Snowflake Automatic Maintenance - Automatic compaction for Snowflake-managed tables with configurable target file sizes and position delete handling. Commercial.
- Upsolver - Continuous table optimization with small file compaction, sorting, compression, repartitioning, and health analysis of Iceberg tables. Commercial.
- Dremio Arctic Optimizer - Automatic background optimization with compaction strategies, snapshot management, orphan file cleanup, and branch-based optimization. Commercial.
- Amazon S3 Tables - Managed Iceberg with automatic maintenance scheduling including binpack, sort, and z-order compaction strategies. Commercial (AWS).
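
As a rough illustration of the Spark SQL procedures mentioned in the maintenance entries above, the sketch below only builds the `CALL` statements as strings, so it runs without a Spark installation. The helper name `build_maintenance_calls`, the catalog name `demo`, the table `db.events`, and the 7-day retention window are all hypothetical; the procedure names and named arguments are the ones documented for Iceberg's Spark procedures, but check them against the Iceberg version you run.

```python
from datetime import datetime, timedelta, timezone


def build_maintenance_calls(catalog, table, retain_days=7, now=None):
    """Return Spark SQL CALL statements for routine Iceberg table maintenance.

    `catalog` and `table` are placeholders chosen by the caller; nothing here
    talks to Spark or to a real catalog.
    """
    now = now or datetime.now(timezone.utc)
    # Snapshots and orphan files older than this cutoff become eligible for cleanup.
    cutoff = (now - timedelta(days=retain_days)).strftime("%Y-%m-%d %H:%M:%S")
    return [
        # Compact small data files using the bin-packing strategy.
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}', strategy => 'binpack')",
        # Rewrite manifests to speed up scan planning.
        f"CALL {catalog}.system.rewrite_manifests(table => '{table}')",
        # Drop snapshots older than the cutoff.
        f"CALL {catalog}.system.expire_snapshots(table => '{table}', older_than => TIMESTAMP '{cutoff}')",
        # Delete files no longer referenced by any table metadata.
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}', older_than => TIMESTAMP '{cutoff}')",
    ]


if __name__ == "__main__":
    for statement in build_maintenance_calls("demo", "db.events"):
        print(statement)
```

With an existing `SparkSession` configured for an Iceberg catalog, each returned string could be passed to `spark.sql(...)`. Building the statements separately from executing them keeps the schedule and retention policy easy to review before anything touches a table.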
- Iceberg Table Migration - Built-in migration capabilities supporting in-place metadata migration (Snapshot Table, Migrate Table, Add Files) and full data migration via CTAS/INSERT. Open source (Apache 2.0).
- Hive Migration Module - Official migration from ORC, Parquet, and Avro Hive tables via Spark Procedures. Open source (Apache 2.0).
- Delta Lake Migration Module - iceberg-delta-lake module with snapshotDeltaLakeTable action supporting Delta protocol minReaderVersion 1, minWriterVersion 2. Open source (Apache 2.0).
- Iceberg Catalog Migrator - CLI tool for bulk catalog migrations without data copy, supporting migrate and register commands. Requires Java 21+. Open source (Apache 2.0).
- Delta Lake UniForm - Enables Iceberg reads on Delta Lake tables by generating Iceberg metadata asynchronously without data rewriting. Requires Unity Catalog. Commercial (Databricks).
- Apache XTable (Incubating) - Seamless interoperability between Hudi, Delta, and Iceberg co-launched by Microsoft, Google, and Onehouse. Open source (Apache incubating).
- Amazon Athena - Serverless query engine with native Iceberg support, MERGE/UPDATE/DELETE operations, time travel, and automatic metadata generation. Commercial.
- Amazon EMR - Managed big data platform with Spark integration, Iceberg v3 deletion vectors (EMR 7.10.0+), and AWS Glue catalog integration. Commercial.
- AWS Glue - Managed ETL with native Iceberg connector, optimistic locking by default (Glue 4.0+), and REST catalog support (Glue 5.0+). Commercial.
- Amazon Redshift - Query RMS tables as Iceberg tables via SageMaker Lakehouse with REST catalog backed by AWS Glue. Commercial.
- AWS Lake Formation - Row and column-level security, fine-grained access control, and centralized permissions management for Iceberg tables. Commercial.
- Amazon SageMaker Lakehouse - Unified lakehouse experience with Iceberg REST catalog backed by AWS Glue, accessing Redshift and S3 data via Iceberg format. Commercial.
- Amazon S3 Tables - Fully managed Iceberg tables with automatic maintenance (compaction, snapshot expiration, orphan removal) and sort/z-order strategies. Commercial.
- Amazon Data Firehose - Streaming CDC to Iceberg tables with automatic scaling and schema management for PostgreSQL, MySQL, Oracle, SQL Server, MongoDB. Commercial.
- BigQuery - BigLake Iceberg tables (GA) with fully managed storage, automatic optimization, high-throughput streaming, and cross-engine compatibility via Spark/Flink connectors. Commercial.
- Google Dataproc - Managed Spark/Hadoop with full Iceberg capabilities through Apache Spark, BigLake metastore integration, and serverless Spark option. Commercial.
- BigLake Metastore - REST Catalog implementation with unified catalog access and BigQuery Storage API integration. Commercial.
- Google Cloud Storage - Native GCS integration with GCSFileIO, Application Default Credentials, and customer-managed encryption keys. Commercial.
- Azure Synapse Analytics - Spark pools with Iceberg support (manual JAR configuration), ACID transactions, schema evolution, and ADLS Gen2 integration. Commercial.
- Microsoft Fabric - OneLake bidirectional table format virtualization (Public Preview Nov 2024) with Delta-Iceberg interoperability and shortcut integration to external Iceberg tables. Commercial.
- Azure Data Factory - Data pipeline support for Iceberg format (Nov 2024) with copy activity and transformation capabilities. Commercial.
- Microsoft Purview - Data governance with Iceberg support for metadata management, lineage tracking, and data quality assessment. Commercial.
- Snowflake - Cloud data platform with native Iceberg support (GA April 2025), managed and external tables, multiple catalog integrations, data sharing, and Polaris catalog. Commercial.
- Databricks - Lakehouse platform with Unity Catalog-managed Iceberg (Public Preview), acquired Tabular (June 2024), Predictive Optimization, and Delta UniForm for interoperability. Commercial.
- Dremio - SQL lakehouse with native Iceberg integration, Apache Polaris catalog, automatic optimization, Git-like versioning, and sub-second BI performance. Commercial with open source components.
- Starburst Galaxy - Managed Trino with Iceberg v3 support, deletion vectors, materialized view auto-refresh, Apache Polaris connector, and AI-powered features. Commercial.
- Tabular - Managed Iceberg service created by Iceberg founders with centralized RBAC, auto-optimization, and multi-engine support (acquired by Databricks June 2024). Commercial.
- Cloudera Data Platform - Hybrid cloud platform with first-class Iceberg support via Hive, Impala, Spark, Flink, REST catalog, and SDX integration for security and governance. Commercial.
- Upsolver - Real-time streaming ingestion platform with native Iceberg support, Adaptive Iceberg Optimizer, and CDC processing (acquired by Qlik January 2025). Commercial.
- Confluent - Enterprise Kafka platform with Tableflow for automatic Iceberg table creation, managed Kafka Connect with Iceberg sink, and Flink integration. Commercial.
- PyIceberg CLI - Official CLI for describing, listing, and managing Iceberg tables with commands for schema, properties, snapshots, and table operations. Python. Open source (Apache 2.0).
- Iceberg Go CLI - Go implementation similar to PyIceberg CLI for table and namespace operations. Go 1.23+. Open source (Apache 2.0).
- Upsolver Iceberg Diagnostic CLI - Evaluates Iceberg tables for optimization opportunities with side-by-side comparison of current vs optimized metrics. Install via Brew/PIP. Commercial.
- Iceberg Catalog Migrator CLI - Bulk migrate Iceberg tables between catalogs with migrate and register commands. Java 21+. Open source (Apache 2.0).
- Spark Iceberg Docker Image - Official tabulario/spark-iceberg Docker image with pre-configured Spark cluster, Iceberg catalog, and Jupyter notebook environment for local development. Open source (Apache 2.0).
- Iceberg REST API Test - Example project testing Iceberg REST API with PyIceberg for namespace and table operations. Open source.
- AWS Iceberg Streaming Examples - High-throughput IoT and CDC examples with best practices for Spark + Iceberg streaming, deployable to EMR Serverless and AWS Glue. Open source samples.
- TestHiveMetastore - JUnit testing with local Thrift service and Derby databases for isolated Iceberg testing from iceberg-hive-metastore module. Open source (Apache 2.0).
- TestContainers - Docker-based integration testing framework used in official Iceberg repository for testing catalog implementations and engine integrations. Open source (Apache 2.0).
- Iceberg Summit 2025 - Second annual ASF-sanctioned event, hybrid format (San Francisco April 8, virtual April 9), organized by AWS, Dremio, Microsoft, Snowflake with 500+ expected attendees. Official.
- Apache Iceberg Talks - Curated list of talks and videos related to Iceberg, updated regularly with conference presentations and technical deep-dives. Official.
- Apache Iceberg Slack - Active community workspace with channels for meetups, vendors, jobs, and technical discussions. Official.
- Mailing Lists - dev@, commits@, and private@ lists for development discussions, commit notifications, and PMC discussions. Subscription addresses are listed on the Apache Iceberg community page. Official (ASF).
- GitHub Discussions - Issue tracking, pull requests, and community discussions on specific features and bugs. Official (ASF).
- Iceberg Community Meetups - Platform for starting and joining local Apache Iceberg meetups with Slack channels like #meetup-seattle, #meetup-atlanta for regional communities. Community-organized.
- @IcebergDevs Twitter - Developer-focused Twitter account for latest announcements, blogs, and project updates. Community advocacy.
- Apache Iceberg Documentation - Comprehensive official documentation covering schema evolution, partitioning, time travel, performance tuning, and multi-engine support. Official (ASF).
- API Documentation - Java API reference for core Iceberg interfaces and implementations. Official (ASF).
- Apache Iceberg: The Definitive Guide - Comprehensive O'Reilly guide by Tomer Shiran, Jason Hughes, and Alex Merced covering architecture, features, WAP, CDC, and streaming. Free PDF. Official O'Reilly publication.
- Apache Iceberg 101 (Dremio University) - Free 12-part video course by Alex Merced covering architecture, transactions, catalogs, time-travel, and maintenance. Free.
- Apache Iceberg: The Complete Masterclass (Udemy) - Hands-on practical course with schema evolution, hidden partitioning, Spark integration, and production best practices. Paid.
- Getting Started with Apache Iceberg (O'Reilly) - Live training with interactive exercises covering ACID guarantees, table creation, migration, and time-travel. Paid.
- Apache Iceberg Courses (Class Central) - Aggregator of 100+ Apache Iceberg courses covering Kafka, Spark, Impala, and AWS integrations. Mixed free/paid.
- Spark Iceberg Quickstart - Official Docker-based quick start with tabulario/spark-iceberg image and sample code for table operations. Official (ASF).
- Apache Iceberg Tutorial (DataCamp) - Complete beginner's guide with hands-on setup instructions and optimization techniques. Third-party.
- Mastering Apache Iceberg (DigitalOcean) - Scalable data lake management guide with prerequisites, architecture, and practical implementation. Third-party.
- Introduction to Apache Iceberg (Baeldung) - Practical deployment tutorial with Minio, Trino, Docker, and code examples. Third-party.
- Dremio Apache Iceberg 101 - Comprehensive resource hub with 100+ links organized by core concepts, features, hands-on exercises, and production usage. Third-party.
- AWS Prescriptive Guidance for Iceberg - Optimization for cost, performance, and data retention with configuration trade-offs. AWS official.
- Snowflake Iceberg Best Practices - Refresh schedules, storage serialization, and query optimization guidance. Snowflake official.
- Cloudera Iceberg Best Practices - Format version 2, parallelism, positional deletes, and drop table performance optimization. Cloudera official.
- 7 Apache Iceberg Best Practices (Monte Carlo) - Schema evolution, metadata management, catalog choice, and integration patterns with dbt, Snowflake, Databricks, Kafka. Third-party.
- Optimize Iceberg Performance (Upsolver) - 10 tips for partitioning, compaction, file sizes, z-ordering with webinar and advanced techniques. Third-party.
- Dremio Blog - Extensive collection of Iceberg articles, tutorials, comparisons, and feature announcements with regular updates. Vendor.
- Tabular Blog - Monthly Iceberg community news (2023-2024), technical deep-dives, Apache Iceberg Cookbook with 30+ recipes. Vendor.
- 2025 Guide to Architecting an Iceberg Lakehouse - Comprehensive architecture guide with self-audit questions, tool selection, and best practices (December 2024). Community.
- What's New in Apache Iceberg V3? - Binary deletion vectors with Roaring bitmaps, default column values, and V3 specification ratification (2025). Google official.
- 10 Future Iceberg Developments for 2025 - Scan Planning endpoint, interoperable views, streaming support, and Hybrid Catalog GA (January 2025). Community.
- Apache Iceberg YouTube Channel - Official talks, tutorials, conference recordings, and Iceberg Summit 2024 playlist with 32 talks. Official (ASF).
- Gnarly Data Waves Podcast - Apache Iceberg Office Hours recordings and dedicated episodes on Iceberg topics, available on YouTube and Spotify. Community.
- Iceberg Spark S3 Examples - Spark SQL examples with Iceberg on S3 and Java API usage demonstrations. Community.
- Iceberg Streaming Examples (AWS) - High-throughput IoT and CDC scenarios with Spark + Iceberg streaming best practices and local development environment. AWS samples.
- Iceberg Demo (Flink + Trino) - Stream writes to Iceberg on GCS using Flink and reading with Trino via Iceberg connector. Community.
- Iceberg in Production - Curated blogs and videos showing real-world Iceberg production usage from various organizations. Community.
Contributions welcome! Please read the contribution guidelines first. Submit a pull request with your suggestions following the awesome-x format.
To the extent possible under law, the contributors have waived all copyright and related rights to this work.