Conversation

@viirya viirya commented Oct 27, 2025

What changes were proposed in this pull request?

This patch proposes changes to optimize Arrow memory usage in Spark by compressing Arrow IPC data during serialization.

Why are the changes needed?

We have encountered OOMs when loading and processing data in PySpark through `toArrow` or `toPandas`. The same data can be loaded by PyArrow directly, but fails to load into PySpark through `toArrow` or `toPandas` due to OOM.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests, plus manual local testing.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code v2.0.13

@viirya viirya marked this pull request as draft October 27, 2025 22:09
@viirya viirya changed the title Release buffers [SPARK-xxxxx][SQL] Release Arrow buffers early Oct 29, 2025
@github-actions github-actions bot added the BUILD label Nov 2, 2025
@viirya viirya changed the title [SPARK-xxxxx][SQL] Release Arrow buffers early [SPARK-xxxxx][SQL] Optimize Arrow memory usage Nov 2, 2025
@viirya viirya changed the title [SPARK-xxxxx][SQL] Optimize Arrow memory usage [SPARK-54134][SQL] Optimize Arrow memory usage Nov 2, 2025
@viirya viirya marked this pull request as ready for review November 2, 2025 02:02

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @viirya and @cloud-fan .

Merged to master/4.1.

dongjoon-hyun pushed a commit that referenced this pull request Nov 3, 2025

Closes #52747 from viirya/release_buffers.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 02ba89f)
Signed-off-by: Dongjoon Hyun <[email protected]>

viirya commented Nov 4, 2025

Thank you @cloud-fan @dongjoon-hyun


@allisonwang-db allisonwang-db left a comment

Thanks for adding this!

      val codecType = new Lz4CompressionCodec().getCodecType()
      factory.createCodec(codecType)
    case other =>
      throw new IllegalArgumentException(
Contributor

Should be SparkException.internalError

Member Author

Yea, that would be better.

Member Author

I will change it to SparkException when I extend this to Pandas UDFs.


  val ARROW_EXECUTION_COMPRESSION_CODEC =
    buildConf("spark.sql.execution.arrow.compressionCodec")
      .doc("Compression codec used to compress Arrow IPC data when transferring data " +
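A hedged usage sketch of this conf (assumes a live `SparkSession` bound to `spark`; the doc string and default are truncated above, so only the conf key is taken from the snippet):

```python
# Usage sketch, assuming an existing SparkSession `spark`.
spark.conf.set("spark.sql.execution.arrow.compressionCodec", "zstd")

# Arrow IPC batches produced for the driver-side conversion are now
# compressed with zstd.
pdf = spark.range(1_000_000).toPandas()
```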
Contributor

Does this optimization take effect in pandas UDFs?

Member Author

@viirya viirya Nov 5, 2025

I think not; it is currently applied to `toArrow` and `toPandas`, which are what the reported issue is about. It should also be made available to Arrow UDFs and Pandas UDFs. I will try to extend it to those cases.

  case "zstd" =>
    val factory = CompressionCodec.Factory.INSTANCE
    val codecType = new ZstdCompressionCodec().getCodecType()
    factory.createCodec(codecType)
Member

It would be great to have an option to set the compression level.

Member Author

Okay, we can add a compression level option together.

Member Author

I am going to add the option in #52925 along with Pandas UDF support.

dongjoon-hyun pushed a commit that referenced this pull request Nov 7, 2025
### What changes were proposed in this pull request?

This is an extension to #52747, which added Arrow compression support to `toArrow` and `toPandas` to reduce memory usage. We would like to extend the memory optimization to the Pandas UDF case.

### Why are the changes needed?

To optimize memory usage for Pandas UDF case.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code v2.0.14

Closes #52925 from viirya/arrow_compress_udf.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Nov 7, 2025
(cherry picked from commit 96ed48d)
Signed-off-by: Dongjoon Hyun <[email protected]>
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025