Conversation

@JeffreySmith

No description provided.

HyukjinKwon and others added 30 commits August 27, 2024 17:38
…cumentation

Followup PR to change JRE version from 8 to 11

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 9fc1e05)
Signed-off-by: Hyukjin Kwon <[email protected]>
…cumentation

Followup PR to change JRE version from 8 to 17

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 9fc1e05)
Signed-off-by: Hyukjin Kwon <[email protected]>
…nnect notebook

This is a followup of apache#47883 that adds manual `source ~/.profile`.

Ever since we switched to `Dockerfile`, none of `~/.profile`, `~/.bashrc`, `~/.bash_profile`, etc. seems to work. There are a couple of related issues in Jupyter, but I cannot figure it out.

This is the only cell that needs the environment variable, so I decided to simply work around it.
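The workaround's effect can be sketched as follows (a hypothetical helper, not the notebook's actual cell): source the profile in a subshell and copy its environment into the current process, since Jupyter does not run shell startup files. The demo uses a throwaway file instead of the real `~/.profile`.

```python
import os
import subprocess
import tempfile

def load_profile(path):
    # Source the file in a bash subshell, then harvest the resulting
    # environment variables into this Python process.
    out = subprocess.check_output(
        ["bash", "-c", f"source '{path}' >/dev/null 2>&1; env -0"])
    for pair in out.split(b"\0"):
        if b"=" in pair:
            key, value = pair.split(b"=", 1)
            os.environ[key.decode()] = value.decode()

# Demo with a throwaway profile instead of the real ~/.profile
with tempfile.NamedTemporaryFile("w", suffix=".profile", delete=False) as f:
    f.write('export DEMO_VAR="hello"\n')
load_profile(f.name)
```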

No.

Manually tested.

No.

Closes apache#47902 from HyukjinKwon/SPARK-49402-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 1c9cde5)
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit df07aa7)
Signed-off-by: Hyukjin Kwon <[email protected]>
…without codegen

This is a re-submission of apache#43938 to fix a join correctness bug caused by apache#41398. Credits go to mcdull-zhang

correctness fix

Yes, the query result will be corrected.

new test

no

Closes apache#47905 from cloud-fan/join.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit af5e0a2)
Signed-off-by: Wenchen Fan <[email protected]>
…uffle corruption diagnose

#### What changes were proposed in this pull request?
port to 3.5 for [[SPARK-43242](https://issues.apache.org/jira/browse/SPARK-43242)][CORE] Fix throw 'Unexpected type of BlockId' in shuffle corruption diagnose

#### Why are the changes needed?
The 3.5 branch conflicts with the PR in master; see the end of the discussion in apache#40921

#### Does this PR introduce any user-facing change?
No

#### How was this patch tested?
Existing tests

Closes apache#47910 from CavemanIV/port3.5-SPARK-43242.

Authored-by: zhangliang <[email protected]>
Signed-off-by: Yi Wu <[email protected]>
### What changes were proposed in this pull request?
Add `artifacts` to `.gitignore`

### Why are the changes needed?
```
bin/spark-shell --remote "local[*]"
```

generates a lot of files in that directory:
```
(spark_dev_312) ➜  spark git:(master) ✗ git status
On branch master
Your branch is ahead of 'origin/master' by 1386 commits.
  (use "git push" to publish your local commits)

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	new file:   artifacts/spark-37fc351b-0207-4957-ac39-5b23ae672c0c/85157252-6f8a-46b3-ab42-585c70184d08/classes/ammonite/$sess/cmd0$.class
	new file:   artifacts/spark-37fc351b-0207-4957-ac39-5b23ae672c0c/85157252-6f8a-46b3-ab42-585c70184d08/classes/ammonite/$sess/cmd0$Helper.class
	new file:   artifacts/spark-37fc351b-0207-4957-ac39-5b23ae672c0c/85157252-6f8a-46b3-ab42-585c70184d08/classes/ammonite/$sess/cmd0.class
	new file:   artifacts/spark-37fc351b-0207-4957-ac39-5b23ae672c0c/85157252-6f8a-46b3-ab42-585c70184d08/classes/ammonite/$sess/cmd1$.class
	new file:   artifacts/spark-37fc351b-0207-4957-ac39-5b23ae672c0c/85157252-6f8a-46b3-ab42-585c70184d08/classes/ammonite/$sess/cmd1$Helper.class
	new file:   artifacts/spark-37fc351b-0207-4957-ac39-5b23ae672c0c/85157252-6f8a-46b3-ab42-585c70184d08/classes/ammonite/$sess/cmd1.class
	new file:   artifacts/spark-37fc351b-0207-4957-ac39-5b23ae672c0c/85157252-6f8a-46b3-ab42-585c70184d08/classes/ammonite/$sess/cmd2$.class
	new file:   artifacts/spark-37fc351b-0207-4957-ac39-5b23ae672c0c/85157252-6f8a-46b3-ab42-585c70184d08/classes/ammonite/$sess/cmd2$Helper.class
	new file:   artifacts/spark-37fc351b-0207-4957-ac39-5b23ae672c0c/85157252-6f8a-46b3-ab42-585c70184d08/classes/ammonite/$sess/cmd2.class
	new file:   artifacts/spark-37fc351b-0207-4957-ac39-5b23ae672c0c/85157252-6f8a-46b3-ab42-585c70184d08/classes/ammonite/$sess/cmd9999999$.class
	new file:   artifacts/spark-37fc351b-0207-4957-ac39-5b23ae672c0c/85157252-6f8a-46b3-ab42-585c70184d08/classes/ammonite/$sess/cmd9999999$Helper.class
	new file:   artifacts/spark-37fc351b-0207-4957-ac39-5b23ae672c0c/8515
```

### Does this PR introduce _any_ user-facing change?
No, dev only

### How was this patch tested?
manually check

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47936 from zhengruifeng/infra_artifacts.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
(cherry picked from commit df42568)
Signed-off-by: Kent Yao <[email protected]>
…er.isInternalError`

### What changes were proposed in this pull request?

Handle null input for `SparkThrowableHelper.isInternalError` method.

### Why are the changes needed?

The `SparkThrowableHelper.isInternalError` method doesn't handle null input, which can lead to a NullPointerException. This happens when `isInternalError` is invoked on a `SparkException` without an `errorClass`.
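A minimal null-safe sketch of the check (hypothetical Python; the real method is Scala in `SparkThrowableHelper`, and the exact prefix is an assumption):

```python
def is_internal_error(error_class):
    # Guard against a missing error class before the prefix check; without
    # it, a SparkException created with no errorClass would blow up here
    # (the Python analogue of the NullPointerException being fixed).
    return error_class is not None and error_class.startswith("INTERNAL_ERROR")
```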

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Add 2 assertions to current test cases to cover this issue.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47946 from jshmchenxi/SPARK-49480/null-pointer-is-internal-error.

Authored-by: Xi Chen <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit cef3c86)
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?

Fix the nullability of the `Base64` expression to be based on the child's nullability, and not always be nullable.

### Why are the changes needed?

apache#47303 had a side effect of changing the nullability through the switch to `StaticInvoke`. This was also backported to Spark 3.5.2 and caused schema mismatch errors for stateful streams when we upgraded. This restores the previous behavior, which `StaticInvoke` supports through its `returnNullable` argument: if the child is non-nullable, we know the result will be non-nullable.
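The idea of the fix can be sketched as follows (a hypothetical mini-model, not Spark's expression classes): the output's nullability mirrors the child's instead of being hard-coded to true.

```python
import base64

def base64_expr(child_value, child_nullable):
    # Mirrors StaticInvoke's returnNullable flag: the output can only be
    # null if the child can be null.
    out_nullable = child_nullable
    out_value = None if child_value is None else base64.b64encode(child_value).decode()
    return out_value, out_nullable
```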

### Does this PR introduce _any_ user-facing change?

Restores the nullability of the `Base64` expression to what it was in Spark 3.5.1 and earlier.

### How was this patch tested?

New UT

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47941 from Kimahriman/base64-nullability.

Lead-authored-by: Adam Binford <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
(cherry picked from commit c274c5a)
Signed-off-by: Max Gekk <[email protected]>
### What changes were proposed in this pull request?

Fix a test that is failing from backporting apache#47941

### Why are the changes needed?

Fix test

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Fixed test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47964 from Kimahriman/base64-proto-test.

Authored-by: Adam Binford <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
### What changes were proposed in this pull request?

This is a cherry-pick of apache#47796.

The `xpath` expression incorrectly marks its return type as an array of non-null strings. However, it can actually return an array containing nulls. This can cause an NPE in code generation, for example in the query `select coalesce(xpath(repeat('<a></a>', id), 'a')[0], '') from range(1, 2)`.
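A hypothetical analogue of the bug in plain Python (not Spark's type system): an array whose declared type claims "no nulls" but whose elements can be null. Declaring `containsNull` correctly lets downstream operators such as `coalesce` handle the null instead of failing.

```python
def coalesce(*values):
    # Return the first non-null argument, like SQL's coalesce
    return next((v for v in values if v is not None), None)

# xpath over '<a></a>' can yield an entry with no text; modeling that as a
# list containing None shows why the element type must admit nulls.
xpath_result = [None]
result = coalesce(xpath_result[0], "")
```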

### Why are the changes needed?

It avoids potential failures in queries that use the `xpath` expression.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new unit test. It would fail without the change in the PR.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47959 from chenhao-db/fix_xpath_nullness_3.5.

Authored-by: Chenhao Li <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
Fix the site.SPARK_VERSION pattern in the RDD Programming Guide. I found this when I was developing apache#47968.

doc fix

no

doc build

no

Closes apache#47985 from yaooqinn/version.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 90a236e)
Signed-off-by: Hyukjin Kwon <[email protected]>
…yteBuffer.allocateDirect`

This PR aims to use `Platform.allocateDirectBuffer` instead of `ByteBuffer.allocateDirect`.

apache#47733 (review)

Allocating off-heap memory should use the `allocateDirectBuffer` API provided by `Platform`.

No

GA

No

Closes apache#47987 from cxzl25/SPARK-49509.

Authored-by: sychen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 2ed6c3e)
Signed-off-by: Dongjoon Hyun <[email protected]>
In `Dataset#toJSON`, use the schema from `exprEnc`. This schema reflects any changes (e.g., decimal precision, column ordering) that `exprEnc` might make to input rows.

`Dataset#toJSON` currently uses the schema from the logical plan, but that schema does not necessarily describe the rows passed to `JacksonGenerator`: the function passed to `mapPartitions` uses `exprEnc` to serialize the input, and this could potentially change the precision on decimals or rearrange columns.

Here's an example that tricks `UnsafeRow#getDecimal` (called from `JacksonGenerator`) into mistakenly assuming the decimal is stored as a Long:
```
scala> case class Data(a: BigDecimal)
class Data

scala> sql("select 123.456bd as a").as[Data].toJSON.collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
val res0: Array[String] = Array({"a":68719476.745})

scala>
```
Here's an example that tricks `JacksonGenerator` into asking for a string from an array and an array from a string. This case actually crashes the JVM:
```
scala> case class Data(x: Array[Int], y: String)
class Data

scala> sql("select repeat('Hey there', 17) as y, array_repeat(22, 17) as x").as[Data].toJSON.collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
	at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$5(JacksonGenerator.scala:129) ~[spark-catalyst_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$5$adapted(JacksonGenerator.scala:128) ~[spark-catalyst_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.spark.sql.catalyst.json.JacksonGenerator.writeArrayData(JacksonGenerator.scala:258) ~[spark-catalyst_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$23(JacksonGenerator.scala:201) ~[spark-catalyst_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.spark.sql.catalyst.json.JacksonGenerator.writeArray(JacksonGenerator.scala:249) ~[spark-catalyst_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
...
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)

bash-3.2$
```
Both these cases work correctly without `toJSON`.
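The failure mode can be sketched outside Spark (a hypothetical Python analogue, not Spark code): bytes written under one schema and read back under another are silently misinterpreted, much like `UnsafeRow#getDecimal` assuming a compact long-backed layout.

```python
import struct

raw = struct.pack("<d", 123.456)       # serializer's view: a double
misread = struct.unpack("<q", raw)[0]  # reader applies the wrong schema (a long)
correct = struct.unpack("<d", raw)[0]  # reader applies the serializer's schema
```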

Before the PR, converting the dataframe to a dataset of Tuple would preserve the column names in the JSON strings:
```
scala> sql("select 123.456d as a, 12 as b").as[(Double, Int)].toJSON.collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
val res0: Array[String] = Array({"a":123.456,"b":12})

scala>
```
After the PR, the JSON strings use the field names from the Tuple class:
```
scala> sql("select 123.456d as a, 12 as b").as[(Double, Int)].toJSON.collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
val res1: Array[String] = Array({"_1":123.456,"_2":12})

scala>
```

New tests.

No.

Closes apache#47982 from bersprockets/to_json_issue.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 5375ce2)
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
In `ProjectingInternalRow`, accessing `colOrdinals` causes poor performance. Replace `colOrdinals` with the `IndexedSeq` type.

### Why are the changes needed?
Accessing `colOrdinals` by position is O(n) on a linear `Seq`; an `IndexedSeq` provides constant-time indexed access.
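The access-cost difference can be sketched with a plain linked list versus an array-backed list (a Python analogue of Scala's linear `Seq` versus `IndexedSeq`; not Spark code):

```python
class Node:
    def __init__(self, value, nxt=None):
        self.value = value
        self.next = nxt

def nth(head, i):
    # Walks i links on every access: O(i) per lookup, like List.apply
    node = head
    for _ in range(i):
        node = node.next
    return node.value

head = None
for v in reversed(range(5)):
    head = Node(v, head)

indexed = list(range(5))  # O(1) random access, like IndexedSeq
```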

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No need to add UT

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47890 from wzx140/project-row-fix.

Lead-authored-by: wzx <[email protected]>
Co-authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
(cherry picked from commit 37f2fa9)
Signed-off-by: Kent Yao <[email protected]>
…se V1 commands

### What changes were proposed in this pull request?

This is a followup of apache#47660 . If users override `spark_catalog` with
`DelegatingCatalogExtension`, we should still use v1 commands as `DelegatingCatalogExtension` forwards requests to HMS and there are still behavior differences between v1 and v2 commands targeting HMS.

This PR also forces the use of v1 commands for certain commands that do not have a v2 version.

### Why are the changes needed?

Avoid introducing behavior changes to Spark plugins that implement `DelegatingCatalogExtension` to override `spark_catalog`.
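A hypothetical sketch of the dispatch rule described above (illustrative names, not Spark's actual resolution code): commands targeting the session catalog keep the v1 path when the catalog is the built-in one, or an override that merely delegates to it.

```python
def use_v1_command(catalog_name, overridden, is_delegating_extension):
    if catalog_name != "spark_catalog":
        return False  # non-session catalogs always take the v2 path
    if not overridden:
        return True   # built-in session catalog keeps v1 behavior
    # An override that merely forwards to HMS should behave like the
    # built-in catalog, so it also keeps the v1 commands.
    return is_delegating_extension
```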

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new test case

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47995 from amaliujia/fix_catalog_v2.

Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Rui Wang <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit f7cfeb5)
Signed-off-by: Wenchen Fan <[email protected]>
…be changed by falling back to v1 command

This is a followup of apache#47772. The behavior of `saveAsTable` should not be changed by switching from a v1 to a v2 command. This is similar to apache#47995: for the case of `DelegatingCatalogExtension`, we need it to go through v1 commands to be consistent with the previous behavior.

Behavior regression.

No

UT

No

Closes apache#48019 from amaliujia/regress_v2.

Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Rui Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 37b39b4)
Signed-off-by: Wenchen Fan <[email protected]>
Change the implementation of `createTable` to avoid escaping of special chars in `UnresolvedTableSpec.location`. This field should contain the original user-provided `path` option and not the URI that is constructed by the `buildStorageFormatFromOptions()` call.

In addition this commit extends `SparkFunSuite` and `SQLTestUtils` to allow creating temporary directories with a custom prefix. This can be used to create temporary directories with special chars.

Bug fix. The following code would result in the creation of a table that is stored in `/tmp/test%20table` instead of `/tmp/test table`:
```
spark.catalog.createTable("testTable", source = "parquet", schema = new StructType().add("id", "int"), description = "", options = Map("path" -> "/tmp/test table"))
```

Note that this was not consistent with the SQL API, e.g. `create table testTable(id int) using parquet location '/tmp/test table'`
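The core of the bug can be sketched with standard URI percent-escaping (Python's `urllib.parse`; the real code builds a Hadoop-style URI, so details differ): storing the escaped URI string as the location makes the table land in a directory literally named with the escape sequence.

```python
from urllib.parse import quote, unquote

user_path = "/tmp/test table"     # the original user-provided path option
escaped = quote(user_path)        # what URI construction produces
restored = unquote(escaped)       # the fix keeps the original path instead
```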

Yes. The previous behaviour would result in the table path being escaped. After this change the path will not be escaped.

Updated existing test in `CatalogSuite`.

No

Closes apache#47976 from cstavr/location-double-escaping.

Authored-by: Christos Stavrakakis <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit dc3333b)
Signed-off-by: Wenchen Fan <[email protected]>
senthh and others added 28 commits September 2, 2025 15:56
…ent. (#40)

* ODP-2118: Hudi, DeltaLake, Iceberg version upgrade for open table clients.

* ODP-2118: Delta spark version fix

* ODP-2118: delta-spark and iceberg jars scala version fix.

* ODP-2118: iceberg jars scala version fix.
…ent.

### What changes were proposed in this pull request?
Update `kubernetes-client` from 6.10.0 to 6.11.0

### Why are the changes needed?

[Release notes for 6.11.0](https://github.com/fabric8io/kubernetes-client/releases/tag/v6.11.0)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#45707 from bjornjorgensen/kub-client6.11.0.

Authored-by: Bjørn Jørgensen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 7b9b3cb)
(cherry picked from commit 06e2b2e)


### What changes were proposed in this pull request?

This PR aims to upgrade `Parquet` to 1.15.2.

### Why are the changes needed?

To bring the latest bug fixes.
- https://parquet.apache.org/blog/2025/05/01/1.15.2/
- https://github.com/apache/parquet-java/releases/tag/apache-parquet-1.15.2

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#50755 from dongjoon-hyun/SPARK-51950.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

(cherry picked from commit 15732fc)
…s-client from `6.x` to `7.x`"

This reverts commit 9b17fca
…ager based on Hadoop's Abortable interface"

This reverts commit b89d077
…ions to PartitionedFileUtil API to reduce memory requirements"

This reverts commit 23637fe.
@shubhluck shubhluck closed this Dec 9, 2025
@shubhluck shubhluck deleted the rel/ODP-3.2.3.4-2 branch December 9, 2025 19:01