Open table formats (OTFs) like Apache Iceberg are increasingly being adopted, for example, to improve the transactional consistency of a data lake or to consolidate batch and streaming data pipelines on a single file format and reduce complexity. In practice, architects need to integrate the chosen format with the various layers of a modern data platform. However, the level of support for the different OTFs varies across common analytical services.
Commercial vendors and the open source community have recognized this situation and are working on interoperability between table formats. One approach is to make a single physical dataset readable in different formats by translating its metadata and avoiding reprocessing of actual data files. Apache XTable is an open source solution that follows this approach and provides abstractions and tools for the translation of open table format metadata.
This repository shows how to get started with Apache XTable on AWS. It is meant to be used in combination with the related blog posts on this topic:
- Blog - Run Apache XTable in AWS Lambda for background conversion of open table formats
- Blog - Run Apache XTable on Amazon MWAA to translate open table formats
Let’s dive deeper into how to deploy the provided AWS CDK stack.
You need one of the following container runtimes, as well as the AWS CDK, installed:

- Docker
- Finch
You also need an existing lakehouse setup, or you can use the provided scripts to set up a test environment.
(1) To deploy the stack, clone the GitHub repository, change into the folder for this blog post (xtable_lambda), and deploy the CDK stack. This deploys a set of Lambda functions and an Amazon EventBridge Scheduler schedule:
```bash
git clone https://github.com/aws-samples/apache-xtable-on-aws-samples.git
cd xtable_lambda
cdk deploy
```

When using Finch, you need to set the CDK_DOCKER environment variable before deployment:

```bash
export CDK_DOCKER=finch
```

After the deployment, all correctly configured Glue tables will be transformed every hour.
(2) Set required Glue Data Catalog parameters:
For the solution to work, you have to set the correct parameters in the Glue Data Catalog on each of the Glue tables you want to transform:
"xtable_table_type": "<source_format>""xtable_target_formats": "<target_format>, <target_format>"
In the AWS console, you can set these parameters under “Table properties” when editing a Glue table.
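If you prefer to set these parameters programmatically rather than in the console, the following is a minimal sketch using boto3; the database name, table name, and format values are placeholders to adapt:

```python
import boto3

glue = boto3.client("glue")

def set_xtable_parameters(database: str, table: str,
                          source_format: str, target_formats: str) -> None:
    """Add the xtable_* parameters to an existing Glue table."""
    current = glue.get_table(DatabaseName=database, Name=table)["Table"]

    # UpdateTable expects a TableInput, which accepts only a subset of the
    # fields returned by GetTable.
    table_input = {
        key: current[key]
        for key in ("Name", "Description", "Owner", "Retention",
                    "StorageDescriptor", "PartitionKeys", "TableType",
                    "Parameters")
        if key in current
    }
    table_input.setdefault("Parameters", {})
    table_input["Parameters"]["xtable_table_type"] = source_format
    table_input["Parameters"]["xtable_target_formats"] = target_formats

    glue.update_table(DatabaseName=database, TableInput=table_input)

# Placeholder values; the format names must match what the Lambda expects.
set_xtable_parameters("my_database", "my_table", "HUDI", "DELTA, ICEBERG")
```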
(3) (Optional) Create data lake test environment
In case you do not have a lakehouse setup, these scripts can help you set up a test environment, either on your local machine or in an AWS Glue for Spark job.
```bash
# local: create Hudi dataset on S3
cd scripts
pip install -r requirements.txt
python ./create_hudi_s3.py
```
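For reference, a script like this typically boils down to a small PySpark job that writes a Hudi table to S3. The following is a minimal sketch, not the actual script; the bucket path, table name, and Hudi bundle version are assumptions, and running it locally additionally requires S3 filesystem credentials:

```python
from pyspark.sql import SparkSession

# Spark session with the Hudi bundle on the classpath (version is a placeholder).
spark = (
    SparkSession.builder
    .appName("create-hudi-test-data")
    .config("spark.jars.packages",
            "org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# A couple of sample records to populate the test table.
df = spark.createDataFrame(
    [("1", "alice", "2024-01-01"), ("2", "bob", "2024-01-02")],
    ["id", "name", "creation_date"],
)

# Write a Hudi table to S3; the bucket is a placeholder.
(
    df.write.format("hudi")
    .option("hoodie.table.name", "my_hudi_table")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "creation_date")
    .mode("overwrite")
    .save("s3://<your-bucket>/my_hudi_table/")
)
```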
To clean up, complete the following steps:

- Delete the deployed CDK stack:

```bash
cdk destroy
```

- Delete the downloaded Git repository:

```bash
rm -r apache-xtable-on-aws-samples
```

- Delete the used Docker image:

```bash
docker image rm public.ecr.aws/lambda/python:3.12.2024.07.10.11-arm64
```

- Reverse the AWS Glue configurations in your Glue tables
- (Optional) Delete the test data files created with the test scripts in your S3 bucket
To deploy the custom operator to Amazon MWAA, we upload it together with the DAGs into the configured DAG folder. Besides the operator itself, we also need to upload XTable’s executable JAR. As of this writing, the JAR needs to be compiled from source by the user. To simplify this, we provide a container-based build script.
We assume you have an environment consisting of at least Amazon MWAA itself, an S3 bucket, and an AWS Identity and Access Management (IAM) role for Amazon MWAA that has read access to the bucket and, optionally, write access to the AWS Glue Data Catalog. In addition, you need one of the following container runtimes to run the provided build script for XTable:

- Docker
- Finch
To compile XTable, you can use the provided build script and complete the following steps:
(1) Clone the sample code from GitHub:
git clone [email protected]:aws-samples/apache-xtable-on-aws-samples.git
cd ./apache-xtable-on-aws-samples/xtable_operator(2) Run the build script:
./build-airflow-operator.sh(3) Because the Airflow operator uses the library JPype to invoke XTable’s JAR, add a dependency in the Amazon MWAA requirement.txt file:
```
# requirements.txt
JPype1==1.5.0
```

For background on installing additional Python libraries on Amazon MWAA, see Installing Python dependencies.
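To illustrate what this dependency is for: JPype starts a JVM inside the Python process and makes Java classes callable from Python. A minimal sketch with a placeholder JAR path (the actual invocation happens inside the operator):

```python
import jpype

# Start an in-process JVM with the compiled XTable JAR on the classpath;
# the path below is a placeholder, not the operator's actual layout.
jpype.startJVM(classpath=["/path/to/xtable-utilities-bundled.jar"])

# Any Java class is now reachable from Python, for example:
System = jpype.JClass("java.lang.System")
print(System.getProperty("java.version"))  # requires the JRE installed in step (4)

jpype.shutdownJVM()
```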
(4) Because XTable is Java-based, a Java 11 runtime environment (JRE) is required on Amazon MWAA. You can use Amazon MWAA’s startup script to install a JRE. Add the following lines to an existing startup script, or create a new one as provided in the sample code base of this post:
```bash
# startup.sh
if [[ "${MWAA_AIRFLOW_COMPONENT}" != "webserver" ]]
then
    sudo yum install -y java-11-amazon-corretto-headless
fi
```

For more information about this mechanism, see Using a startup script with Amazon MWAA.
(5) Upload xtable_operator/, requirements.txt, startup.sh, and .airflowignore to the S3 bucket and the respective paths from which Amazon MWAA will read files. Make sure the IAM role for Amazon MWAA has appropriate read permissions.
With regard to the custom operator, make sure to upload the local folder xtable_operator/ into the configured DAG folder along with the .airflowignore file.
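Once uploaded, the operator can be referenced from a DAG in the same folder. The following is a hypothetical sketch only; the import path, operator name, and its arguments are assumptions modeled on the folder layout above and XTable's dataset configuration, not a confirmed interface:

```python
from datetime import datetime

from airflow import DAG
from xtable_operator.operator import XtableOperator  # assumed import path

with DAG(
    dag_id="xtable_conversion",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # trigger manually while testing
    catchup=False,
) as dag:
    convert = XtableOperator(  # assumed operator class and arguments
        task_id="hudi_to_iceberg",
        dataset_config={
            "sourceFormat": "HUDI",
            "targetFormats": ["ICEBERG"],
            "datasets": [
                {
                    "tableBasePath": "s3://<your-bucket>/my_hudi_table/",
                    "tableName": "my_hudi_table",
                }
            ],
        },
    )
```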
(6) Update the configuration of your Amazon MWAA environment as follows and start the update process (a programmatic alternative is sketched after this list):
- Add or update the S3 URI to the requirements.txt file through the Requirements file configuration option.
- Add or update the S3 URI to the startup.sh script through the Startup script configuration option.
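You can make these changes in the Amazon MWAA console; alternatively, the same update can be triggered through the API, as in the following sketch (environment name and S3 paths are placeholders):

```python
import boto3

# Point the environment at the uploaded files; paths are relative to the
# environment's configured S3 bucket.
mwaa = boto3.client("mwaa")
mwaa.update_environment(
    Name="my-mwaa-environment",
    RequirementsS3Path="requirements.txt",
    StartupScriptS3Path="startup.sh",
)
```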
Optionally, you can use the AWS Glue Data Catalog as an Iceberg catalog.
(7) In case you create Iceberg metadata and want to register it in the AWS Glue Data Catalog, the Amazon MWAA role needs permissions to create or modify tables in AWS Glue. The following listing shows a minimal policy for this. It constrains permissions to a defined database in AWS Glue:
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateTable",
"glue:GetTables",
"glue:UpdateTable",
"glue:GetDatabases",
"glue:GetTable"
],
"Resource": [
"arn:aws:glue:<AWS Region>:<AWS Account ID>:catalog",
"arn:aws:glue:<AWS Region>:<AWS Account ID>:database/<Database name>",
"arn:aws:glue:<AWS Region>:<AWS Account ID>:table/<Database name>/*"
]
}
]
}
```

To clean up, complete the following steps:

- Delete the downloaded Git repository:

```bash
rm -r apache-xtable-on-aws-samples
```

- Delete the used Docker image:

```bash
docker image rm public.ecr.aws/amazonlinux/amazonlinux:2023.4.20240319.1
```
- Reverse the Amazon MWAA configurations in:
  - requirements.txt
  - startup.sh
  - DAG
  - MWAA execution role
- Delete files or versions of files in your S3 bucket:
  - requirements.txt
  - startup.sh
  - DAG
  - .airflowignore
See CONTRIBUTING for more information.
See CODE OF CONDUCT for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.