Spark3 diff on top of tag 3.0.1 #735
Conversation
Link to the survey we did on the diff in our fork and what we're keeping and dropping: https://pl.ntr/1Ut
rshkv left a comment:
Hope those comments don't get dropped when we rebase.
.circleci/config.yml
We disabled this cache for this branch. It was causing puzzling compile errors and we couldn't get to the bottom of it.
.circleci/config.yml
Added the -Phadoop-palantir profile here. Our pom previously declared hadoop-cloud and dists/palantir-pom/bom as default modules. I now moved those under the hadoop-palantir profile for a more orderly diff that makes clear we're building those modules as dependencies of our "Palantir build".
FORK.md
Can remove this whole section.
FORK.md
Remove.
We were repeatedly getting build failures, which this cherry-pick fixed: apache#30756
Matching what we have on master here. Not sure if we want this, but IIRC we had failing tests without it.
Matching palantir/spark's master here.
This is from that cherry-pick to fix our build failures: apache#30756
Again, matching palantir/spark's master.
Co-authored-by: Robert Kruszewski <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
…ITION` tests that delete files out of partition path

### What changes were proposed in this pull request?

Modify the tests that add partitions with `LOCATION`, and where the number of nested folders in `LOCATION` doesn't match the number of partitioned columns. In that case, `ALTER TABLE .. DROP PARTITION` tries to access (delete) a folder outside of the "base" path in `LOCATION`. The problem belongs to Hive's MetaStore method `drop_partition_common`:
https://github.com/apache/hive/blob/8696c82d07d303b6dbb69b4d443ab6f2b241b251/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L4876
which tries to delete empty partition sub-folders recursively, starting from the deepest partition sub-folder up to the base folder. When the number of sub-folders is not equal to the number of partitioned columns `part_vals.size()`, the method will try to list and delete folders outside of the base path.

### Why are the changes needed?

To fix test failures like apache#30643 (comment):
```
org.apache.spark.sql.hive.execution.command.AlterTableAddPartitionSuite.ALTER TABLE .. ADD PARTITION Hive V1: SPARK-33521: universal type conversions of partition values
sbt.ForkMain$ForkError: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: File file:/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-832cb19c-65fd-41f3-ae0b-937d76c07897 does not exist;
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112)
    at org.apache.spark.sql.hive.HiveExternalCatalog.dropPartitions(HiveExternalCatalog.scala:1014)
    ...
Caused by: sbt.ForkMain$ForkError: org.apache.hadoop.hive.metastore.api.MetaException: File file:/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-832cb19c-65fd-41f3-ae0b-937d76c07897 does not exist
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_partition_with_environment_context(HiveMetaStore.java:3381)
    at sun.reflect.GeneratedMethodAccessor304.invoke(Unknown Source)
```

The issue can be reproduced by the following steps:
1. Create a base folder, for example: `/Users/maximgekk/tmp/part-location`
2. Create a sub-folder in the base folder and drop permissions for it:
```
$ mkdir /Users/maximgekk/tmp/part-location/aaa
$ chmod a-rwx /Users/maximgekk/tmp/part-location/aaa
$ ls -al /Users/maximgekk/tmp/part-location
total 0
drwxr-xr-x   3 maximgekk  staff    96 Dec 13 18:42 .
drwxr-xr-x  33 maximgekk  staff  1056 Dec 13 18:32 ..
d---------   2 maximgekk  staff    64 Dec 13 18:42 aaa
```
3. Create a table with a partition folder in the base folder:
```sql
spark-sql> create table tbl (id int) partitioned by (part0 int, part1 int);
spark-sql> alter table tbl add partition (part0=1,part1=2) location '/Users/maximgekk/tmp/part-location/tbl';
```
4. Try to drop this partition:
```
spark-sql> alter table tbl drop partition (part0=1,part1=2);
20/12/13 18:46:07 ERROR HiveClientImpl:
======================
Attempt to drop the partition specs in table 'tbl' database 'default':
Map(part0 -> 1, part1 -> 2)
In this attempt, the following partitions have been dropped successfully:

The remaining partitions have not been dropped:
[1, 2]
======================

Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Error accessing file:/Users/maximgekk/tmp/part-location/aaa;
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Error accessing file:/Users/maximgekk/tmp/part-location/aaa;
```

The command fails because it tries to access the sub-folder `aaa` that is outside of the partition path `/Users/maximgekk/tmp/part-location/tbl`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

By running the affected tests from local IDEA, which does not have access to folders outside of partition paths.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 9160d59)
Signed-off-by: Max Gekk <max.gekk@gmail.com>

Closes apache#30756 from MaxGekk/fix-drop-partition-location-3.1.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
We use the InProcessLauncher internally [1] to launch into different YARN clusters. The clusters might need different configuration files, which we can't keep apart if the InProcessLauncher discovers config folders from the same HADOOP_CONF_DIR environment variable. This change allows us to specify different config directories using Spark config. See upstream PR [2] and ticket [3].

[1] https://pl.ntr/1UK
[2] apache#22289
[3] https://issues.apache.org/jira/browse/SPARK-25200

Co-authored-by: Adam Balogh <[email protected]>
Co-authored-by: Robert Kruszewski <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
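For illustration, a minimal sketch of launching with `InProcessLauncher` and pointing it at a cluster-specific config directory via Spark config. The `spark.yarn.config.dir` key is a hypothetical stand-in for whatever key this change introduces, and the jar path and main class are placeholders:

```scala
import org.apache.spark.launcher.{InProcessLauncher, SparkAppHandle}

object LaunchIntoClusterSketch {
  def main(args: Array[String]): Unit = {
    // Launch into one of several YARN clusters from the same JVM. The config key
    // below is hypothetical; it stands in for the config-directory setting added here.
    val handle: SparkAppHandle = new InProcessLauncher()
      .setMaster("yarn")
      .setDeployMode("cluster")
      .setAppResource("/opt/jobs/my-job.jar")        // placeholder job artifact
      .setMainClass("com.example.MyJob")             // placeholder entry point
      .setConf("spark.yarn.config.dir", "/etc/cluster-a/hadoop-conf")
      .startApplication()

    // Poll the handle until the launched application reaches a terminal state.
    while (!handle.getState.isFinal) Thread.sleep(1000)
  }
}
```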
… SQL expressions

We have internal applications (BS and C) prone to OOMs with repeated use of aliases. See ticket [1] and upstream PR [2].

[1] https://issues.apache.org/jira/browse/SPARK-26626
[2] apache#23556

Co-authored-by: j-esse <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
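For context, a small sketch (assuming the issue is the usual alias-substitution blow-up described in SPARK-26626) of the kind of repeated-alias pattern that can balloon the analyzed/optimized expression tree. The column names and loop bound are arbitrary:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object AliasBlowupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("alias-sketch").getOrCreate()

    // Each step references the previous alias twice; when adjacent projections are
    // collapsed and aliases substituted, the underlying expression roughly doubles per step.
    var df = spark.range(10).toDF("c0")
    for (i <- 1 to 20) {
      df = df.select((col(s"c${i - 1}") + col(s"c${i - 1}")).alias(s"c$i"))
    }
    df.explain(true) // without a cap on substituted aliases, this plan can grow very large
    spark.stop()
  }
}
```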
…kjoinpool

See ticket [1] and the PR [2] @robert3005 opened on apache/spark. It seems upstream wasn't convinced by our use-case.

We need this change because Spark reads parquet footers on a different thread. Without this change, that thread doesn't inherit the thread-local that stores the TaskContext, meaning we don't have access to the properties stored inside that TaskContext. Our internal filesystem needs those properties in the task context and fails to read footers without them.

[1] https://issues.apache.org/jira/browse/SPARK-20952
[2] apache#18176

Co-authored-by: Robert Kruszewski <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
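To illustrate the problem (a sketch of the symptom, not of the fix): `TaskContext` lives in a thread-local, so a thread spawned from inside a task sees nothing unless the context is explicitly propagated, which is what breaks footer reads on a separate parquet-reading thread:

```scala
import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.{SparkContext, TaskContext}

object TaskContextThreadLocalSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[1]", "taskcontext-sketch")
    sc.parallelize(Seq(1)).foreachPartition { _ =>
      val onTaskThread = TaskContext.get() // non-null: set for the task's own thread
      val seenByOtherThread = new AtomicReference[TaskContext]()
      val t = new Thread(new Runnable {
        override def run(): Unit = seenByOtherThread.set(TaskContext.get())
      })
      t.start(); t.join()
      // Without propagation the spawned thread has no TaskContext, so code that reads
      // local properties from it (like our internal filesystem) has nothing to read.
      assert(onTaskThread != null)
      assert(seenByOtherThread.get() == null)
    }
    sc.stop()
  }
}
```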
… per-partition limits

We have an internal product that needs this. See upstream PR [1].

[1] apache#15614

Co-authored-by: Patrick Woody <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
…n task start and end events

Proposing a new set of APIs for ExecutorPlugins, to provide callbacks invoked at the start and end of each task of a job. Not very opinionated on the shape of the API; tried to be as minimal as possible for now. Changes are described in detail on [SPARK-33088](https://issues.apache.org/jira/browse/SPARK-33088), but mostly this boils down to:

1. This feature was considered when the ExecutorPlugin API was initially introduced in apache#21923, but never implemented.
2. The use-case which **requires** this feature is to propagate tracing information from the driver to the executor, such that calls from the same job can all be traced.
   a. Tracing frameworks are usually set up in thread locals, therefore it's important for the setup to happen in the same thread that runs the tasks.
   b. Executors can be used for multiple jobs, therefore it's not sufficient to set tracing information at executor startup time -- it needs to happen every time a task starts or ends.

No user-facing change. This PR introduces new features for future developers to use.

Unit tests on `PluginContainerSuite`.

Closes apache#29977 from fsamuel-bs/SPARK-33088.

Authored-by: Samuel Souza <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
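For reference, a minimal sketch of what a plugin using these callbacks could look like. The class name and the tracing calls in the comments are hypothetical; the overridden methods are the ones added by SPARK-33088:

```scala
import org.apache.spark.TaskFailedReason
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, SparkPlugin}

// Enabled via: spark.plugins=com.example.TracingSparkPlugin
class TracingSparkPlugin extends SparkPlugin {
  override def driverPlugin(): DriverPlugin = null // no driver-side component needed

  override def executorPlugin(): ExecutorPlugin = new ExecutorPlugin {
    override def onTaskStart(): Unit = {
      // Runs on the same thread that will run the task, so thread-local tracing
      // state (spans, trace ids) can be installed here.
    }

    override def onTaskSucceeded(): Unit = {
      // Close the span opened in onTaskStart().
    }

    override def onTaskFailed(failureReason: TaskFailedReason): Unit = {
      // Close the span and record failureReason.toErrorString.
    }
  }
}
```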
This adds config that allows us to inject a custom session builder. Internally we use it to build SparkSessions that are highly configured beyond what Spark's built-in configs allow. Most importantly, that includes building and registering our own session catalog (v1) implementation with the SparkSession. You can find how we use this config here [1] and our own SessionStateBuilder here [2].

[1]: https://pl.ntr/1UU
[2]: https://pl.ntr/1UT

Co-authored-by: Robert Kruszewski <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
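As an illustration only: the config key and builder class below are hypothetical placeholders (the real key and our SessionStateBuilder live in the internal code linked above); the point is that session construction picks up a custom session state builder from Spark config:

```scala
import org.apache.spark.sql.SparkSession

object CustomSessionBuilderSketch {
  def main(args: Array[String]): Unit = {
    // "spark.sql.sessionStateBuilder.className" is a placeholder for the key this change
    // adds; the named class would extend Spark's session state builder and register our
    // own v1 session catalog implementation.
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.sessionStateBuilder.className",
        "com.example.CustomSessionStateBuilder")
      .getOrCreate()

    spark.sql("SELECT 1").show()
    spark.stop()
  }
}
```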
Authored-by: Robert Kruszewski <[email protected]>
Palantir's mechanism to mount local files into pods using secret-populated volumes. Used internally to provide pods with config files.

Co-authored-by: mccheah <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Willi Raschkowski <[email protected]>
Authored-by: Onur Satici <[email protected]>
When serializing a Pandas dataframe using Arrow under Python 2, Arrow can't tell if string columns should be serialized as string type or as binary (due to how Python 2 stores strings). The result is that Arrow serializes string columns in Pandas dataframes as binary ones. We can remove this when we discontinue support for Python 2. See original PR [1] and the follow-up for 'mixed'-type columns [2].

[1] #679
[2] #702

Co-authored-by: Dan Sanduleac <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
This PR upgrades Py4J from 0.10.9 to 0.10.9.1, which contains some bug fixes and improvements; notably it contains one bug fix (py4j/py4j@4152353). The motivation is to leverage fixes from upstream Py4J. No user-facing change. The Jenkins build and GitHub Actions will test it out.

Closes apache#31009 from HyukjinKwon/SPARK-33984.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…d metrics in sources

commit 9704d3a47bc6ec0d6e5ac08575808a182ad503b7
Author: Robert Kruszewski <[email protected]>
Date: Fri Oct 9 00:00:41 2020 +0100

    better variable name

commit 89f863270d11a8fd583b02b3467c49cb403bbf5e
Author: Robert Kruszewski <[email protected]>
Date: Thu Oct 8 23:59:17 2020 +0100

    variable name

commit d20847a8d0d91a6d5e436a5d5c47df1f875d84a1
Author: Robert Kruszewski <[email protected]>
Date: Thu Jun 22 16:53:30 2017 +0100

    Merge existing registry with default one or configure default metric registry
Closing in favour of #737
Upstream SPARK-XXXXX ticket and PR link (if not applicable, explain)
This is our entire diff post-spark3-upgrade, chunked into logical components where each commit builds successfully.
A diff this big is really hard to review, but reviewing the diff against current pt/master is even less tractable, so I believe the way to go here is to review each commit individually.
The first commit (8f1e51e) sets up our palantir-hadoop and palantir-parquet dependencies and has a lot of the eccentricities associated with getting Spark to work with Foundry. This could perhaps be broken up further; if you have any opinions on that, please let me know.
The second commit (8f1e51e) is basically #381 and related updates. It's giant, but so was the original PR and I didn't see a great way to split it up further.
What changes were proposed in this pull request?
How was this patch tested?