Skip to content

Conversation

@SanjayMarreddi
Copy link
Contributor

@SanjayMarreddi SanjayMarreddi commented Feb 17, 2025

Description

  • This is an initial PR for integrating analytics-accelerator-s3 into iceberg/aws
  • In short, our library accelerates read performance of objects stored in Amazon S3 by integrating AWS Common Run Time (CRT) libraries and implementing optimisations specific to Apache Parquet files.

Testing

  • Added relevant integration tests

PS: We are also integrating this library into hadoop-aws.

@geruh
Copy link
Contributor

geruh commented Feb 18, 2025

Hey @SanjayMarreddi, I think this is a cool addition!!

Since this will rely on the analytics-accelerator-s3 I'd assume there would be some tests to ensure nothing breaks.

I was wondering if your team has any benchmark numbers or performance metrics for this that you can share here? Meanwhile I'll ping some others to get some opinions

cc: @amogh-jahagirdar @kevinjqliu

@kevinjqliu kevinjqliu self-requested a review February 19, 2025 01:07
@SanjayMarreddi
Copy link
Contributor Author

SanjayMarreddi commented Feb 19, 2025

Hey @geruh, thanks for the review! I will make changes as you suggested above.

  • Yes, we will be adding some integration tests for it.
  • We have some internal benchmark results, we need some more time to validate them before we post here.

@SanjayMarreddi SanjayMarreddi force-pushed the feat/integrate-analytics-accelerator branch from 0a5ff57 to 361e465 Compare February 24, 2025 19:10
@SanjayMarreddi SanjayMarreddi force-pushed the feat/integrate-analytics-accelerator branch 2 times, most recently from dbee3c1 to e6aa71e Compare February 25, 2025 16:55
@SanjayMarreddi SanjayMarreddi marked this pull request as ready for review February 25, 2025 16:57
@SanjayMarreddi SanjayMarreddi force-pushed the feat/integrate-analytics-accelerator branch from e6aa71e to e8e2645 Compare February 27, 2025 16:47
@SanjayMarreddi SanjayMarreddi force-pushed the feat/integrate-analytics-accelerator branch 4 times, most recently from c4814ea to 989ed63 Compare March 5, 2025 16:07
@SanjayMarreddi SanjayMarreddi force-pushed the feat/integrate-analytics-accelerator branch from 989ed63 to 8d65c2c Compare March 11, 2025 16:39
@kevinjqliu
Copy link
Contributor

kevinjqliu commented Mar 11, 2025

I verified that the async client should only affect S3 FileIO when the feature flag is enabled.

s3Async() is the factory function that returns a S3AsyncClient.
It is called in S3FileIO's initialize function and sets the s3Async variable.

    // Do not override s3Async client if it was provided
    if (s3Async == null) {
      Object clientFactory = S3FileIOAwsClientFactories.initialize(props);
      if (clientFactory instanceof S3FileIOAwsClientFactory) {
        this.s3Async = ((S3FileIOAwsClientFactory) clientFactory)::s3Async;
      }
      if (clientFactory instanceof AwsClientFactory) {
        this.s3Async = ((AwsClientFactory) clientFactory)::s3Async;
      }
    }

s3Async is called in S3FileIO's asyncClient() function to return an async client.

  public S3AsyncClient asyncClient() {
    if (asyncClient == null) {
      synchronized (this) {
        if (asyncClient == null) {
          asyncClient = s3Async.get();
        }
      }
    }
    return asyncClient;
  }

asyncClient() is only called within the accelerator library or when gated by shouldUseAsyncClient()

@jackye1995
Copy link
Contributor

It would also be great to outline the migration path going forward.

Yes, I think in general there is data point supporting using async client & CRT client makes the performance faster, so we would like to gradually complete the whole code path using async, and then let people opt-in through some config like s3.async-client.enabled. But I am glad that we are introducing async into this as a part of the effort of integrating analytics accelerator library, it seems like a good first step towards that goal.

@jackye1995
Copy link
Contributor

Are there plans to replace the current s3 client with the async client?

Maybe after many versions, once we have enough confidence that it is stable. But probably not in the short term.

@jackye1995
Copy link
Contributor

@HonahX could you take a look? Given the fact that we plan to refactor the HTTPClientProperties and other related classes as the next step, it's probably good for you to take a look at this PR and the follow up PRs.

Copy link
Contributor

@geruh geruh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay changes LGTM! Breaking up the PR to follow up with documentation and refactoring the S3FileIOProperties configurations to work with async clients makes sense.

@HonahX HonahX self-requested a review March 12, 2025 00:26
Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SanjayMarreddi Thanks for the great contribution! Looking forward to documentation and follow-up PRs

Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jackye1995
Copy link
Contributor

Looks like all comments are addressed, thanks @SanjayMarreddi for all the work! Let us know when you have the follow up PRs for async client configs and doc update!

@jackye1995 jackye1995 merged commit 3dba6af into apache:main Mar 12, 2025
43 checks passed
@danielcweeks
Copy link
Contributor

danielcweeks commented Mar 12, 2025

@SanjayMarreddi and @jackye1995 This PR added a new dependency, but it doesn't look like the LICENSE/NOTICE files were updated. This needs to be taken care of before the next release (cc @ajantha-bhat is handling that release).

Edit: Looks like there's a NOTICE from that dependency that needs to be evaluated as well.

@ajantha-bhat
Copy link
Member

@SanjayMarreddi: Will you be working a PR to update the Notice and License file for this?
cc: @jbonofre

@SanjayMarreddi
Copy link
Contributor Author

@ajantha-bhat: Yeah, I will be working on it.

@Neuw84
Copy link
Contributor

Neuw84 commented Apr 29, 2025

Hi,
One question that I have here, are we supporting the configs of the underlying client? For S3 tables this was useful:

--conf spark.sql.catalog.my_catalog.http-client.apache.max-connections=5

Looking at this...

https://github.com/awslabs/analytics-accelerator-s3/blob/main/doc/CONFIGURATION.md

Thanks

@SanjayMarreddi
Copy link
Contributor Author

Hi, One question that I have here, are we supporting the configs of the underlying client? For S3 tables this was useful:

--conf spark.sql.catalog.my_catalog.http-client.apache.max-connections=5

Looking at this...

https://github.com/awslabs/analytics-accelerator-s3/blob/main/doc/CONFIGURATION.md

Thanks

Hey, thanks for reaching out.

As of now, when we create the async client that's used with analytics-accelerator-s3, we are not applying the HttpClientProperties. So the http related configs of the underlying client won't be picked up. We plan to work on it.

Meanwhile, would you mind creating an opensource issue here on the codebase - so I can discuss with team.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants