Commit b29b67d

dongjoon-hyun authored and ericm-db committed
[SPARK-47152][SQL][BUILD] Provide CodeHaus Jackson dependencies via a new optional directory
### What changes were proposed in this pull request?

This PR aims to provide `Apache Hive`'s `CodeHaus Jackson` dependencies via a new optional directory, `hive-jackson`, instead of the standard `jars` directory of the Apache Spark binary distribution. Additionally, two internal configurations are added whose default values are `hive-jackson/*`:

- `spark.driver.defaultExtraClassPath`
- `spark.executor.defaultExtraClassPath`

For example, Apache Spark distributions have been providing the `spark-*-yarn-shuffle.jar` file under the `yarn` directory instead of `jars`.

**YARN SHUFFLE EXAMPLE**
```
$ ls -al yarn/*jar
-rw-r--r--  1 dongjoon  staff  77352048 Sep  8 19:08 yarn/spark-3.5.0-yarn-shuffle.jar
```

This PR relocates `Apache Hive`'s `CodeHaus Jackson` dependencies in a similar way.

**BEFORE**
```
$ ls -al jars/*asl*
-rw-r--r--  1 dongjoon  staff  232248 Sep  8 19:08 jars/jackson-core-asl-1.9.13.jar
-rw-r--r--  1 dongjoon  staff  780664 Sep  8 19:08 jars/jackson-mapper-asl-1.9.13.jar
```

**AFTER**
```
$ ls -al jars/*asl*
zsh: no matches found: jars/*asl*

$ ls -al hive-jackson
total 1984
drwxr-xr-x   4 dongjoon  staff     128 Feb 23 15:37 .
drwxr-xr-x  16 dongjoon  staff     512 Feb 23 16:34 ..
-rw-r--r--   1 dongjoon  staff  232248 Feb 23 15:37 jackson-core-asl-1.9.13.jar
-rw-r--r--   1 dongjoon  staff  780664 Feb 23 15:37 jackson-mapper-asl-1.9.13.jar
```

### Why are the changes needed?

Since Apache Hadoop 3.3.5, only Apache Hive requires the old CodeHaus Jackson dependencies. Apache Spark 3.5.0 tried to eliminate them completely, but the change was reverted to keep Hive UDF support.

- apache#40893
- apache#42446

SPARK-47119 added a way to exclude the Apache Hive Jackson dependencies at the distribution-building stage for Apache Spark 4.0.0.

- apache#45201

This PR provides a way to exclude the Apache Hive Jackson dependencies at runtime for Apache Spark 4.0.0.

- Spark Shell without the Apache Hive Jackson dependencies:
```
$ bin/spark-shell --driver-default-class-path ""
```
- Spark SQL Shell without the Apache Hive Jackson dependencies:
```
$ bin/spark-sql --driver-default-class-path ""
```
- Spark Thrift Server without the Apache Hive Jackson dependencies:
```
$ sbin/start-thriftserver.sh --driver-default-class-path ""
```

In addition, last but not least, this PR removes the `CodeHaus Jackson` dependencies from the following Apache Spark daemons (started via `spark-daemon.sh start`) because they don't require Hive's `CodeHaus Jackson` dependencies:

- Spark Master
- Spark Worker
- Spark History Server

```
$ grep 'spark-daemon.sh start' *
start-history-server.sh:exec "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 "$@"
start-master.sh:"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
start-worker.sh:  "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS $WORKER_NUM \
```

### Does this PR introduce _any_ user-facing change?

No. There is no user-facing change by default.

- For distributions built with the `hive-jackson-provided` profile, the `scope` of the Apache Hive Jackson dependencies is `provided` and the `hive-jackson` directory is not created at all.
- For distributions built with the default settings, the `scope` of the Apache Hive Jackson dependencies is still `compile`. In addition, they remain on Apache Spark's built-in class path, as shown below.

![Screenshot 2024-02-23 at 16 48 08](https://github.com/apache/spark/assets/9700541/99ed0f02-2792-4666-ae19-ce4f4b7b8ff9)

- The following Spark daemons don't use the `CodeHaus Jackson` dependencies:
  - Spark Master
  - Spark Worker
  - Spark History Server

### How was this patch tested?

Pass the CIs, and manually build a distribution and check the class paths in the `Environment` tab.

```
$ dev/make-distribution.sh -Phive,hive-thriftserver
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#45237 from dongjoon-hyun/SPARK-47152.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
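For programmatic launches, the same runtime exclusion can be expressed through the two configuration keys introduced by this PR. A minimal sketch, assuming a hypothetical application jar and main class; clearing both keys is the programmatic counterpart of `--driver-default-class-path ""`:

```java
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class NoHiveJacksonLaunch {
  public static void main(String[] args) throws Exception {
    // Clearing both default-class-path configs keeps hive-jackson/* off the
    // driver and executor class paths for this launch.
    SparkAppHandle handle = new SparkLauncher()
        .setAppResource("/path/to/app.jar")   // hypothetical application jar
        .setMainClass("com.example.Main")     // hypothetical main class
        .setMaster("local[*]")
        .setConf("spark.driver.defaultExtraClassPath", "")
        .setConf("spark.executor.defaultExtraClassPath", "")
        .startApplication();
    while (!handle.getState().isFinal()) {
      Thread.sleep(1000);                     // poll until the application finishes
    }
  }
}
```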
1 parent 98ca3ea commit b29b67d

6 files changed: 43 additions, 0 deletions

core/src/main/scala/org/apache/spark/internal/config/package.scala

Lines changed: 17 additions & 0 deletions
```diff
@@ -17,6 +17,7 @@
 
 package org.apache.spark.internal
 
+import java.io.File
 import java.util.Locale
 import java.util.concurrent.TimeUnit
 
@@ -64,8 +65,16 @@ package object config {
       .stringConf
       .createOptional
 
+  private[spark] val DRIVER_DEFAULT_EXTRA_CLASS_PATH =
+    ConfigBuilder(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH)
+      .internal()
+      .version("4.0.0")
+      .stringConf
+      .createWithDefault(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH_VALUE)
+
   private[spark] val DRIVER_CLASS_PATH =
     ConfigBuilder(SparkLauncher.DRIVER_EXTRA_CLASSPATH)
+      .withPrepended(DRIVER_DEFAULT_EXTRA_CLASS_PATH.key, File.pathSeparator)
       .version("1.0.0")
       .stringConf
       .createOptional
@@ -254,8 +263,16 @@ package object config {
   private[spark] val EXECUTOR_ID =
     ConfigBuilder("spark.executor.id").version("1.2.0").stringConf.createOptional
 
+  private[spark] val EXECUTOR_DEFAULT_EXTRA_CLASS_PATH =
+    ConfigBuilder(SparkLauncher.EXECUTOR_DEFAULT_EXTRA_CLASS_PATH)
+      .internal()
+      .version("4.0.0")
+      .stringConf
+      .createWithDefault(SparkLauncher.EXECUTOR_DEFAULT_EXTRA_CLASS_PATH_VALUE)
+
   private[spark] val EXECUTOR_CLASS_PATH =
     ConfigBuilder(SparkLauncher.EXECUTOR_EXTRA_CLASSPATH)
+      .withPrepended(EXECUTOR_DEFAULT_EXTRA_CLASS_PATH.key, File.pathSeparator)
      .version("1.0.0")
      .stringConf
      .createOptional
```
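The `withPrepended` hook above means that when Spark resolves `spark.driver.extraClassPath` (or the executor variant), the value of the default config is joined in front of any user-supplied value with the platform path separator. A minimal sketch of that composition, not Spark's actual implementation; the helper name and the user value are illustrative:

```java
import java.io.File;

public class DefaultClassPathDemo {
  // Illustrative helper: joins a default class path entry in front of a
  // user-supplied value, the way `withPrepended` composes the two configs.
  static String effectiveClassPath(String defaultEntry, String userValue) {
    if (userValue == null || userValue.isEmpty()) {
      return defaultEntry;
    }
    return defaultEntry + File.pathSeparator + userValue;
  }

  public static void main(String[] args) {
    // "hive-jackson/*" is the shipped default; "/opt/libs/*" stands in for a user setting.
    System.out.println(effectiveClassPath("hive-jackson/*", "/opt/libs/*"));
    // => hive-jackson/*:/opt/libs/*  (on Unix-like systems)
  }
}
```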

dev/make-distribution.sh

Lines changed: 6 additions & 0 deletions
```diff
@@ -189,6 +189,12 @@ echo "Build flags: $@" >> "$DISTDIR/RELEASE"
 # Copy jars
 cp "$SPARK_HOME"/assembly/target/scala*/jars/* "$DISTDIR/jars/"
 
+# Only create the hive-jackson directory if they exist.
+for f in "$DISTDIR"/jars/jackson-*-asl-*.jar; do
+  mkdir -p "$DISTDIR"/hive-jackson
+  mv $f "$DISTDIR"/hive-jackson/
+done
+
 # Only create the yarn directory if the yarn artifacts were built.
 if [ -f "$SPARK_HOME"/common/network-yarn/target/scala*/spark-*-yarn-shuffle.jar ]; then
   mkdir "$DISTDIR/yarn"
```

launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java

Lines changed: 2 additions & 0 deletions
```diff
@@ -271,6 +271,8 @@ Map<String, String> getEffectiveConfig() throws IOException {
       Properties p = loadPropertiesFile();
       p.stringPropertyNames().forEach(key ->
         effectiveConfig.computeIfAbsent(key, p::getProperty));
+      effectiveConfig.putIfAbsent(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH,
+        SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH_VALUE);
     }
     return effectiveConfig;
   }
```
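A small sketch of the fallback this hunk adds, assuming values loaded from `spark-defaults.conf` have already been copied into the map: the built-in default is injected only when the key is absent, so an explicit user value, even an empty one, always wins:

```java
import java.util.HashMap;
import java.util.Map;

public class EffectiveConfigDemo {
  public static void main(String[] args) {
    Map<String, String> effectiveConfig = new HashMap<>();
    // Simulate a user setting the key to "" in spark-defaults.conf.
    effectiveConfig.put("spark.driver.defaultExtraClassPath", "");
    // putIfAbsent only fills in the default when the key is missing,
    // so the user's empty value survives and disables the entry.
    effectiveConfig.putIfAbsent("spark.driver.defaultExtraClassPath", "hive-jackson/*");
    System.out.println(effectiveConfig.get("spark.driver.defaultExtraClassPath")); // prints ""
  }
}
```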

launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java

Lines changed: 8 additions & 0 deletions
```diff
@@ -54,6 +54,10 @@ public class SparkLauncher extends AbstractLauncher<SparkLauncher> {
 
   /** Configuration key for the driver memory. */
   public static final String DRIVER_MEMORY = "spark.driver.memory";
+  /** Configuration key for the driver default extra class path. */
+  public static final String DRIVER_DEFAULT_EXTRA_CLASS_PATH =
+    "spark.driver.defaultExtraClassPath";
+  public static final String DRIVER_DEFAULT_EXTRA_CLASS_PATH_VALUE = "hive-jackson/*";
   /** Configuration key for the driver class path. */
   public static final String DRIVER_EXTRA_CLASSPATH = "spark.driver.extraClassPath";
   /** Configuration key for the default driver VM options. */
@@ -65,6 +69,10 @@ public class SparkLauncher extends AbstractLauncher<SparkLauncher> {
 
   /** Configuration key for the executor memory. */
   public static final String EXECUTOR_MEMORY = "spark.executor.memory";
+  /** Configuration key for the executor default extra class path. */
+  public static final String EXECUTOR_DEFAULT_EXTRA_CLASS_PATH =
+    "spark.executor.defaultExtraClassPath";
+  public static final String EXECUTOR_DEFAULT_EXTRA_CLASS_PATH_VALUE = "hive-jackson/*";
   /** Configuration key for the executor class path. */
   public static final String EXECUTOR_EXTRA_CLASSPATH = "spark.executor.extraClassPath";
   /** Configuration key for the default executor VM options. */
```

launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java

Lines changed: 8 additions & 0 deletions
```diff
@@ -267,6 +267,12 @@ private List<String> buildSparkSubmitCommand(Map<String, String> env)
     Map<String, String> config = getEffectiveConfig();
     boolean isClientMode = isClientMode(config);
     String extraClassPath = isClientMode ? config.get(SparkLauncher.DRIVER_EXTRA_CLASSPATH) : null;
+    String defaultExtraClassPath = config.get(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH);
+    if (extraClassPath == null || extraClassPath.trim().isEmpty()) {
+      extraClassPath = defaultExtraClassPath;
+    } else {
+      extraClassPath += File.pathSeparator + defaultExtraClassPath;
+    }
 
     List<String> cmd = buildJavaCommand(extraClassPath);
     // Take Thrift/Connect Server as daemon
@@ -498,6 +504,8 @@ protected boolean handle(String opt, String value) {
       case DRIVER_MEMORY -> conf.put(SparkLauncher.DRIVER_MEMORY, value);
      case DRIVER_JAVA_OPTIONS -> conf.put(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS, value);
      case DRIVER_LIBRARY_PATH -> conf.put(SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH, value);
+      case DRIVER_DEFAULT_CLASS_PATH ->
+        conf.put(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH, value);
      case DRIVER_CLASS_PATH -> conf.put(SparkLauncher.DRIVER_EXTRA_CLASSPATH, value);
      case CONF -> {
        checkArgument(value != null, "Missing argument to %s", CONF);
```
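A minimal sketch of the resolution in the first hunk, with illustrative inputs, showing why `--driver-default-class-path ""` drops `hive-jackson/*` from the generated command while a user `--driver-class-path` still gets the default appended (this mirrors only the branch above, not the full builder):

```java
import java.io.File;

public class DriverClassPathResolution {
  // Mirrors the diff: the default is used as-is when no user class path is
  // set, and appended after the user value otherwise.
  static String resolve(String extraClassPath, String defaultExtraClassPath) {
    if (extraClassPath == null || extraClassPath.trim().isEmpty()) {
      return defaultExtraClassPath;
    }
    return extraClassPath + File.pathSeparator + defaultExtraClassPath;
  }

  public static void main(String[] args) {
    System.out.println(resolve(null, "hive-jackson/*"));          // hive-jackson/*
    System.out.println(resolve(null, ""));                        // "" (default cleared)
    System.out.println(resolve("/opt/libs/*", "hive-jackson/*")); // /opt/libs/*:hive-jackson/* on Unix
  }
}
```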

launcher/src/main/java/org/apache/spark/launcher/SparkSubmitOptionParser.java

Lines changed: 2 additions & 0 deletions
```diff
@@ -40,6 +40,7 @@ class SparkSubmitOptionParser {
   protected final String CONF = "--conf";
   protected final String DEPLOY_MODE = "--deploy-mode";
   protected final String DRIVER_CLASS_PATH = "--driver-class-path";
+  protected final String DRIVER_DEFAULT_CLASS_PATH = "--driver-default-class-path";
   protected final String DRIVER_CORES = "--driver-cores";
   protected final String DRIVER_JAVA_OPTIONS = "--driver-java-options";
   protected final String DRIVER_LIBRARY_PATH = "--driver-library-path";
@@ -94,6 +95,7 @@ class SparkSubmitOptionParser {
     { DEPLOY_MODE },
     { DRIVER_CLASS_PATH },
     { DRIVER_CORES },
+    { DRIVER_DEFAULT_CLASS_PATH },
     { DRIVER_JAVA_OPTIONS },
     { DRIVER_LIBRARY_PATH },
     { DRIVER_MEMORY },
```
