
Conversation

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Jun 20, 2025

What changes were proposed in this pull request?

This PR aims to use /nonexistent explicitly instead of the non-existent /home/spark because the current status is misleading.

Please note that SPARK-40528 introduced useradd --system, which has created the spark user with a non-existent /home/spark home directory since the beginning of this repository, spark-docker.
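
For context, a minimal sketch of the useradd pattern in question (the flags and values here are assumed for illustration; the actual Dockerfile may differ):

# Before (sketch): --system creates no home directory, yet /etc/passwd
# still records /home/spark in the home field, which is misleading.
RUN useradd --system --uid=185 --gid=spark --home-dir=/home/spark spark

# After (sketch): point the home field at /nonexistent explicitly.
RUN useradd --system --uid=185 --gid=spark --home-dir=/nonexistent spark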

Rejected Alternatives

  • We can set HOME to /opt/spark, like Apache Spark's behavior. However, that is also different from the WORKDIR (/opt/spark/work-dir).
  • We can create /home/spark, but that could be more vulnerable than the as-is status. For system accounts, /nonexistent is frequently used as a security practice to prevent any side effects from the HOME directory.
$ docker run -it --rm apache/spark:4.0.0 cat /etc/passwd | grep /nonexistent
nobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin
_apt:x:100:65534::/nonexistent:/usr/sbin/nologin

Why are the changes needed?

Apache Spark 3.3.3

$ docker run -it --rm apache/spark:3.3.3 /opt/spark/bin/spark-sql
...
25/06/20 20:15:41 WARN SparkSQLCLIDriver: WARNING: Directory for Hive history file: /home/spark does not exist.   History will not be available during this session.
$ docker run -it --rm -uroot apache/spark:3.3.3 tail -1 /etc/passwd
spark:x:185:185::/home/spark:/bin/sh

$ docker run -it --rm -uroot apache/spark:3.3.3 ls -al /home/spark
ls: cannot access '/home/spark': No such file or directory

Apache Spark 3.4.4

$ docker run -it --rm -uroot apache/spark:3.4.4 tail -1 /etc/passwd
spark:x:185:185::/home/spark:/bin/sh

$ docker run -it --rm -uroot apache/spark:3.4.4 ls -al /home/spark
ls: cannot access '/home/spark': No such file or directory

Apache Spark 3.5.6

$ docker run -it --rm -uroot apache/spark:3.5.6 tail -1 /etc/passwd
spark:x:185:185::/home/spark:/bin/sh

$ docker run -it --rm -uroot apache/spark:3.5.6 ls /home/spark
ls: cannot access '/home/spark': No such file or directory

Apache Spark 4.0.0

$ docker run -it --rm -uroot apache/spark:4.0.0 tail -1 /etc/passwd
spark:x:185:185::/home/spark:/bin/sh

$ docker run -it --rm -uroot apache/spark:4.0.0 ls /home/spark
ls: cannot access '/home/spark': No such file or directory

Does this PR introduce any user-facing change?

No behavior change, because the /home/spark directory has never existed.

How was this patch tested?

Manual review.
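
For reference, a sketch of how the new passwd entry could be verified (the tag is a placeholder, and the expected line follows from this change rather than from a captured run):

$ docker run -it --rm -uroot apache/spark:<tag> tail -1 /etc/passwd
spark:x:185:185::/nonexistent:/bin/sh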

@dongjoon-hyun dongjoon-hyun changed the title from "[SPARK-52542] Use /nonexistent instead of nonexistent /opt/spark" to "[SPARK-52542] Use /nonexistent instead of nonexistent /home/spark" on Jun 20, 2025
@dongjoon-hyun
Member Author

It seems that the master branch is broken.

@dongjoon-hyun
Member Author

It's due to the flakiness of archive.apache.org.

0.064 + wget -nv -O spark.tgz https://archive.apache.org/dist/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3.tgz
133.8 failed: Connection timed out.
133.8 failed: Network is unreachable.

@dongjoon-hyun
Member Author

Rebased onto master to bring in the ASF mirror patch.

@dongjoon-hyun
Member Author

cc @Yikun , @yaooqinn , @viirya , @HyukjinKwon , @LuciferYang .

Member

@viirya viirya left a comment


Sounds reasonable. Just wondering if it could possibly break any downstream images?

Member

@Yikun Yikun left a comment


@dongjoon-hyun Thanks for catching this and fixing it. I'm OK with this change:

  • IIRC, this behavior was introduced in #11; I revisited that PR. It was addressing the root user problem and meeting the DOI requirements.
  • Considering that the home path (/home/spark) does not exist, it's safe to switch to /nonexistent; a non-existent home is also used in the nginx Docker official image.
  • The default workspace, /opt/spark/work-dir, is also unaffected.
  • This PR changes the behavior (see the sketch after this list):
      1. It will raise a different error (with different path info) on cd ~, which I think is OK.
      2. There will be a behavior change on cd ~ if a user created /home/spark via mkdir in their downstream Dockerfile based on the spark-docker base image. Such users should use cd /home/spark instead.
  • Maybe we should document this behavior change in the release notes or somewhere.
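
A hypothetical downstream sketch for that second case (the base-image tag is assumed; this is not taken from any real downstream image):

FROM apache/spark:4.0.0
USER root
# Recreate the old home directory and point HOME back at it explicitly.
RUN mkdir -p /home/spark && chown spark:spark /home/spark
ENV HOME=/home/spark
USER spark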

cc @yosifkit @tianon It would be better to make the DOI maintainers aware of this and get their blessing, to avoid unexpected reviews on the Spark Docker version-upgrade PR.

@dongjoon-hyun
Member Author

Thank you, @viirya , @HyukjinKwon , @yaooqinn , @LuciferYang , @Yikun .

This PR only affects Apache Spark-maintained images. A user who builds a Docker image from the Apache Spark binary via docker-image-tool.sh is not affected. So it would be a little strange to document this in the Apache Spark release notes. Instead, I'll try to document this somewhere on the Apache Spark website.

$ bin/docker-image-tool.sh --help | head -n2
Usage: bin/docker-image-tool.sh [options] [command]
Builds or pushes the built-in Spark Docker image.
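
For example, a user-built image (sketch; the repository and tag names below are placeholders) is produced entirely outside of spark-docker:

$ bin/docker-image-tool.sh -r myrepo -t my-tag build
$ docker run -it --rm myrepo/spark:my-tag /opt/spark/bin/spark-shell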

@dongjoon-hyun dongjoon-hyun merged commit 0f76cd1 into apache:master Jun 23, 2025
8 checks passed
@dongjoon-hyun dongjoon-hyun deleted the SPARK-52542 branch June 23, 2025 16:16
@DenWav

DenWav commented Jul 8, 2025

I think the documentation on https://hub.docker.com/r/apache/spark/ might need to be updated. If I run the example:

docker run -it apache/spark /opt/spark/bin/spark-shell

then

spark.range(1000 * 1000 * 1000).count()

it works, but I get:

25/07/08 21:07:43 WARN jline: Failed to save history
java.nio.file.AccessDeniedException: /nonexistent
	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:90)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
	at java.base/sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:462)
	at java.base/java.nio.file.Files.createDirectory(Files.java:700)
	at java.base/java.nio.file.Files.createAndCheckIsDirectory(Files.java:808)
	at java.base/java.nio.file.Files.createDirectories(Files.java:794)
	at org.jline.reader.impl.history.DefaultHistory.internalWrite(DefaultHistory.java:223)
	at org.jline.reader.impl.history.DefaultHistory.save(DefaultHistory.java:215)
	at org.jline.reader.impl.history.DefaultHistory.add(DefaultHistory.java:388)
	at org.jline.reader.impl.LineReaderImpl.finish(LineReaderImpl.java:1197)
	at org.jline.reader.impl.LineReaderImpl.finishBuffer(LineReaderImpl.java:1166)
	at org.jline.reader.impl.LineReaderImpl.readLine(LineReaderImpl.java:734)
	at org.jline.reader.impl.LineReaderImpl.readLine(LineReaderImpl.java:512)
	at scala.tools.nsc.interpreter.jline.Reader.readOneLine(Reader.scala:43)
	at scala.tools.nsc.interpreter.shell.InteractiveReader.readLine(InteractiveReader.scala:38)
	at scala.tools.nsc.interpreter.shell.InteractiveReader.readLine$(InteractiveReader.scala:38)
	at scala.tools.nsc.interpreter.jline.Reader.readLine(Reader.scala:33)
	at scala.tools.nsc.interpreter.shell.ILoop.readOneLine(ILoop.scala:453)
	at scala.tools.nsc.interpreter.shell.ILoop.loop(ILoop.scala:458)
	at scala.tools.nsc.interpreter.shell.ILoop.run(ILoop.scala:991)
	at org.apache.spark.repl.Main$.doMain(Main.scala:85)
	at org.apache.spark.repl.Main$.main(Main.scala:60)
	at org.apache.spark.repl.Main.main(Main.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:75)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:52)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1027)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:204)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:227)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:96)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1132)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1141)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

This happens for any command. Do I need to modify my Docker command to use it properly now with this change?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jul 8, 2025

To @DenWav: your question is unrelated to this PR, because the same situation already happened with the previous status, /home/spark.

$ docker run -it apache/spark:4.0.0-preview1 /opt/spark/bin/spark-shell
WARNING: Using incubator modules: jdk.incubator.vector
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.0.0-preview1
      /_/

Using Scala version 2.13.14 (OpenJDK 64-Bit Server VM, Java 17.0.12)
Type in expressions to have them evaluated.
Type :help for more information.
25/07/08 23:01:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://c919ce533784:4040
Spark context available as 'sc' (master = local[*], app id = local-1752015692659).
Spark session available as 'spark'.

scala> spark.range(1).count()
25/07/08 23:01:38 WARN jline: Failed to save history
java.nio.file.AccessDeniedException: /home/spark

@dongjoon-hyun
Member Author

Just FYI, you can also run as root:

docker run -it -u root apache/spark /opt/spark/bin/spark-shell

@tianon

tianon commented Jul 9, 2025

You can also add an anonymous volume or a tmpfs really easily with -v /nonexistent or --tmpfs /nonexistent. Or, if the name is off-putting (as it would be for me), it'd be even more reliable to explicitly set HOME, so you could do something even simpler like --env HOME=/tmp to point at a directory you already know exists.
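
Sketches of those workarounds applied to the example above (same image and entrypoint as in the earlier comments):

$ docker run -it --rm --env HOME=/tmp apache/spark /opt/spark/bin/spark-shell
$ docker run -it --rm --tmpfs /nonexistent apache/spark /opt/spark/bin/spark-shell
$ docker run -it --rm -v /nonexistent apache/spark /opt/spark/bin/spark-shell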
