Open table formats (OTFs) like Apache Iceberg are increasingly being adopted, for example, to improve the transactional consistency of a data lake or to consolidate batch and streaming data pipelines on a single file format and reduce complexity. In practice, architects need to integrate the chosen format with the various layers of a modern data platform. However, the level of support for the different OTFs varies across common analytical services.
Commercial vendors and the open source community have recognized this situation and are working on interoperability between table formats. One approach is to make a single physical dataset readable in different formats by translating its metadata and avoiding reprocessing of actual data files. Apache XTable is an open source solution that follows this approach and provides abstractions and tools for the translation of open table format metadata.
This repository shows how to get started with Apache XTable on AWS. It is meant to be used in combination with the related blog posts on this topic:
- Blog - Run Apache XTable in AWS Lambda for background conversion of open table formats
- Blog - Run Apache XTable on Amazon MWAA to translate open table formats
Let’s dive deeper into how to deploy the provided AWS CDK stack.
You need one of the following container runtimes, as well as the AWS CDK, installed:

- Docker
- Finch
You also need an existing lakehouse setup, or you can use the provided scripts to set up a test environment.
(1) To deploy the stack, clone the GitHub repository, change into the folder for this blog post (xtable_lambda), and deploy the CDK stack. This deploys a set of Lambda functions and an Amazon EventBridge Scheduler schedule:
```bash
git clone https://github.com/aws-samples/apache-xtable-on-aws-samples.git
cd xtable_lambda
cdk deploy
```

When using Finch, you need to set the CDK_DOCKER environment variable before deployment:

```bash
export CDK_DOCKER=finch
```

After the deployment, all correctly configured Glue tables will be transformed every hour.
(2) Set required Glue Data Catalog parameters:
For the solution to work, you have to set the correct parameters in the Glue Data Catalog on each of the Glue tables you want to transform:
"xtable_table_type": "<source_format>""xtable_target_formats": "<target_format>, <target_format>"
In the AWS console, you can set these parameters under “Table properties” when editing a Glue table.
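If you prefer to set these parameters programmatically rather than in the console, the following is a minimal sketch using boto3; the database name, table name, and format values are placeholders to adapt:

```python
import boto3

glue = boto3.client("glue")

def set_xtable_parameters(database: str, table: str,
                          source_format: str, target_formats: str) -> None:
    """Add the xtable_* parameters to an existing Glue table."""
    current = glue.get_table(DatabaseName=database, Name=table)["Table"]

    # UpdateTable expects a TableInput, which accepts only a subset of the
    # fields returned by GetTable.
    table_input = {
        key: current[key]
        for key in ("Name", "Description", "Owner", "Retention",
                    "StorageDescriptor", "PartitionKeys", "TableType",
                    "Parameters")
        if key in current
    }
    table_input.setdefault("Parameters", {})
    table_input["Parameters"]["xtable_table_type"] = source_format
    table_input["Parameters"]["xtable_target_formats"] = target_formats

    glue.update_table(DatabaseName=database, TableInput=table_input)

# Placeholder values; the format names must match what the Lambda expects.
set_xtable_parameters("my_database", "my_table", "HUDI", "DELTA, ICEBERG")
```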
(3) (Optional) Create data lake test environment
In case you do not have a lakehouse setup, these scripts can help you set up a test environment, either on your local machine or in an AWS Glue for Spark job.
```bash
# local: create Hudi dataset on S3
cd scripts
pip install -r requirements.txt
python ./create_hudi_s3.py
```
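For reference, a script like this typically boils down to a small PySpark job that writes a Hudi table to S3. The following is a minimal sketch, not the actual script; the bucket path, table name, and Hudi bundle version are assumptions, and running it locally additionally requires S3 filesystem credentials:

```python
from pyspark.sql import SparkSession

# Spark session with the Hudi bundle on the classpath (version is a placeholder).
spark = (
    SparkSession.builder
    .appName("create-hudi-test-data")
    .config("spark.jars.packages",
            "org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# A couple of sample records to populate the test table.
df = spark.createDataFrame(
    [("1", "alice", "2024-01-01"), ("2", "bob", "2024-01-02")],
    ["id", "name", "creation_date"],
)

# Write a Hudi table to S3; the bucket is a placeholder.
(
    df.write.format("hudi")
    .option("hoodie.table.name", "my_hudi_table")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "creation_date")
    .mode("overwrite")
    .save("s3://<your-bucket>/my_hudi_table/")
)
```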
To clean up, complete the following steps:

- Delete the deployed CDK stack:

```bash
cdk destroy
```

- Delete the downloaded Git repository:

```bash
rm -r apache-xtable-on-aws-samples
```

- Delete the used Docker image:

```bash
docker image rm public.ecr.aws/lambda/python:3.12.2024.07.10.11-arm64
```

- Reverse the AWS Glue configurations in your Glue tables
- (Optional) Delete the test data files created with the test scripts in your S3 bucket
To deploy the custom operator to Amazon MWAA, we upload it together with the DAGs into the configured DAG folder. Besides the operator itself, we also need to upload XTable’s executable JAR. As of this writing, the JAR needs to be compiled from source by the user. To simplify this, we provide a container-based build script.
We assume you have an environment consisting of at least Amazon MWAA itself, an S3 bucket, and an AWS Identity and Access Management (IAM) role for Amazon MWAA that has read access to the bucket and, optionally, write access to the AWS Glue Data Catalog. In addition, you need one of the following container runtimes to run the provided build script for XTable:

- Docker
- Finch
To compile XTable, you can use the provided build script and complete the following steps:
(1) Clone the sample code from GitHub:
git clone [email protected]:aws-samples/apache-xtable-on-aws-samples.git
cd ./apache-xtable-on-aws-samples/xtable_operator(2) Run the build script:
./build-airflow-operator.sh(3) Because the Airflow operator uses the library JPype to invoke XTable’s JAR, add a dependency in the Amazon MWAA requirement.txt file:
```
# requirements.txt
JPype1==1.5.0
```

For background on installing additional Python libraries on Amazon MWAA, see Installing Python dependencies.
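To illustrate what this dependency is for: JPype starts a JVM inside the Python process and makes Java classes callable from Python. A minimal sketch with a placeholder JAR path (the actual invocation happens inside the operator):

```python
import jpype

# Start an in-process JVM with the compiled XTable JAR on the classpath;
# the path below is a placeholder, not the operator's actual layout.
jpype.startJVM(classpath=["/path/to/xtable-utilities-bundled.jar"])

# Any Java class is now reachable from Python, for example:
System = jpype.JClass("java.lang.System")
print(System.getProperty("java.version"))  # requires the JRE installed in step (4)

jpype.shutdownJVM()
```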
(4) Because XTable is Java-based, a Java 11 runtime environment (JRE) is required on Amazon MWAA. You can use Amazon MWAA’s startup script to install a JRE. Add the following lines to an existing startup script, or create a new one as provided in the sample code base of this post:
```bash
# startup.sh
if [[ "${MWAA_AIRFLOW_COMPONENT}" != "webserver" ]]
then
    sudo yum install -y java-11-amazon-corretto-headless
fi
```

For more information about this mechanism, see Using a startup script with Amazon MWAA.
(5) Upload xtable_operator/, requirements.txt, startup.sh, and .airflowignore to the S3 bucket and the respective paths from which Amazon MWAA will read files. Make sure the IAM role for Amazon MWAA has appropriate read permissions.
With regard to the custom operator, make sure to upload the local folder xtable_operator/ into the configured DAG folder along with the .airflowignore file.
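Once uploaded, the operator can be referenced from a DAG in the same folder. The following is a hypothetical sketch only; the import path, operator name, and its arguments are assumptions modeled on the folder layout above and XTable's dataset configuration, not a confirmed interface:

```python
from datetime import datetime

from airflow import DAG
from xtable_operator.operator import XtableOperator  # assumed import path

with DAG(
    dag_id="xtable_conversion",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # trigger manually while testing
    catchup=False,
) as dag:
    convert = XtableOperator(  # assumed operator class and arguments
        task_id="hudi_to_iceberg",
        dataset_config={
            "sourceFormat": "HUDI",
            "targetFormats": ["ICEBERG"],
            "datasets": [
                {
                    "tableBasePath": "s3://<your-bucket>/my_hudi_table/",
                    "tableName": "my_hudi_table",
                }
            ],
        },
    )
```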
(6) Update the configuration of your Amazon MWAA environment as follows and start the update process (a programmatic alternative is sketched after this list):
- Add or update the S3 URI to the requirements.txt file through the Requirements file configuration option.
- Add or update the S3 URI to the startup.sh script through the Startup script configuration option.
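You can make these changes in the Amazon MWAA console; alternatively, the same update can be triggered through the API, as in the following sketch (environment name and S3 paths are placeholders):

```python
import boto3

# Point the environment at the uploaded files; paths are relative to the
# environment's configured S3 bucket.
mwaa = boto3.client("mwaa")
mwaa.update_environment(
    Name="my-mwaa-environment",
    RequirementsS3Path="requirements.txt",
    StartupScriptS3Path="startup.sh",
)
```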
Optionally, you can use the AWS Glue Data Catalog as an Iceberg catalog.
(7) In case you create Iceberg metadata and want to register it in the AWS Glue Data Catalog, the Amazon MWAA role needs permissions to create or modify tables in AWS Glue. The following listing shows a minimal policy for this. It constrains permissions to a defined database in AWS Glue:
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateTable",
"glue:GetTables",
"glue:UpdateTable",
"glue:GetDatabases",
"glue:GetTable"
],
"Resource": [
"arn:aws:glue:<AWS Region>:<AWS Account ID>:catalog",
"arn:aws:glue:<AWS Region>:<AWS Account ID>:database/<Database name>",
"arn:aws:glue:<AWS Region>:<AWS Account ID>:table/<Database name>/*"
]
}
]
}
```

To clean up, complete the following steps:

- Delete the downloaded Git repository:

```bash
rm -r apache-xtable-on-aws-samples
```

- Delete the used Docker image:

```bash
docker image rm public.ecr.aws/amazonlinux/amazonlinux:2023.4.20240319.1
```
- Reverse the Amazon MWAA configurations in:
  - requirements.txt
  - startup.sh
  - DAG
  - MWAA execution role
- Delete files or versions of files in your S3 bucket:
  - requirements.txt
  - startup.sh
  - DAG
  - .airflowignore
See CONTRIBUTING for more information.
See CODE OF CONDUCT for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.