Conversation

@viirya viirya commented Oct 27, 2025

What changes were proposed in this pull request?

This patch proposes changes to optimize Arrow memory usage in Spark by compressing Arrow IPC data during serialization.

Why are the changes needed?

We have encountered OOMs when loading and processing data in PySpark through `toArrow` or `toPandas`. The same data can be loaded by PyArrow directly, but fails to load into PySpark through `toArrow` or `toPandas` due to OOM.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests, plus manual local testing.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code v2.0.13

@viirya viirya marked this pull request as draft October 27, 2025 22:09
@viirya viirya changed the title Release buffers [SPARK-xxxxx][SQL] Release Arrow buffers early Oct 29, 2025
@github-actions github-actions bot added the BUILD label Nov 2, 2025
@viirya viirya changed the title [SPARK-xxxxx][SQL] Release Arrow buffers early [SPARK-xxxxx][SQL] Optimize Arrow memory usage Nov 2, 2025
@viirya viirya changed the title [SPARK-xxxxx][SQL] Optimize Arrow memory usage [SPARK-54134][SQL] Optimize Arrow memory usage Nov 2, 2025
@viirya viirya marked this pull request as ready for review November 2, 2025 02:02

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @viirya and @cloud-fan .

Merged to master/4.1.

dongjoon-hyun pushed a commit that referenced this pull request Nov 3, 2025

Closes #52747 from viirya/release_buffers.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 02ba89f)
Signed-off-by: Dongjoon Hyun <[email protected]>

viirya commented Nov 4, 2025

Thank you @cloud-fan @dongjoon-hyun


@allisonwang-db allisonwang-db left a comment

Thanks for adding this!

      val codecType = new Lz4CompressionCodec().getCodecType()
      factory.createCodec(codecType)
    case other =>
      throw new IllegalArgumentException(
Contributor

Should be SparkException.internalError

Member Author

Yea, that would be better.

Member Author

I will change it to SparkException when I extend this to Pandas UDFs.


  val ARROW_EXECUTION_COMPRESSION_CODEC =
    buildConf("spark.sql.execution.arrow.compressionCodec")
      .doc("Compression codec used to compress Arrow IPC data when transferring data " +
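A hedged usage sketch of this conf (assumes a live `SparkSession` bound to `spark`; the doc string and default are truncated above, so only the conf key is taken from the snippet):

```python
# Usage sketch, assuming an existing SparkSession `spark`.
spark.conf.set("spark.sql.execution.arrow.compressionCodec", "zstd")

# Arrow IPC batches produced for the driver-side conversion are now
# compressed with zstd.
pdf = spark.range(1_000_000).toPandas()
```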
Contributor

Does this optimization take effect in pandas UDFs?

Member Author

@viirya viirya Nov 5, 2025

I think not; it is currently applied to `toArrow` and `toPandas`, which are what the reported issue is about. It should also be made available to Arrow UDFs and Pandas UDFs. I will try to extend it to those cases.

  case "zstd" =>
    val factory = CompressionCodec.Factory.INSTANCE
    val codecType = new ZstdCompressionCodec().getCodecType()
    factory.createCodec(codecType)
Member

It would be great to have an option to set the compression level.

Member Author

Okay, we can add a compression level option together.

Member Author

I am going to add the option in #52925 along with Pandas UDF support.

dongjoon-hyun pushed a commit that referenced this pull request Nov 7, 2025
### What changes were proposed in this pull request?

This is an extension to #52747, which added Arrow compression support to `toArrow` and `toPandas` to reduce memory usage. We would like to extend the memory optimization to the Pandas UDF case.

### Why are the changes needed?

To optimize memory usage for Pandas UDF case.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code v2.0.14

Closes #52925 from viirya/arrow_compress_udf.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Nov 7, 2025
(cherry picked from commit 96ed48d)
Signed-off-by: Dongjoon Hyun <[email protected]>
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025