
Conversation

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Jun 20, 2025

What changes were proposed in this pull request?

This PR aims to use /nonexistent explicitly instead of the non-existent /home/spark because the current status is misleading.

Please note that SPARK-40528 introduced useradd --system, which has created the spark user with a non-existent /home/spark home directory since the beginning of this repository, spark-docker.
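
For context, a minimal sketch of the useradd pattern in question (the flags and values here are assumed for illustration; the actual Dockerfile may differ):

# Before (sketch): --system creates no home directory, yet /etc/passwd
# still records /home/spark in the home field, which is misleading.
RUN useradd --system --uid=185 --gid=spark --home-dir=/home/spark spark

# After (sketch): point the home field at /nonexistent explicitly.
RUN useradd --system --uid=185 --gid=spark --home-dir=/nonexistent spark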

Rejected Alternatives

  • We can set HOME to /opt/spark, like Apache Spark's behavior. However, that is also different from the WORKDIR (/opt/spark/work-dir).
  • We can create /home/spark, but that could be more vulnerable than the as-is status. For system accounts, /nonexistent is frequently used as a security practice to prevent any side effects from the HOME directory.
$ docker run -it --rm apache/spark:4.0.0 cat /etc/passwd | grep /nonexistent
nobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin
_apt:x:100:65534::/nonexistent:/usr/sbin/nologin

Why are the changes needed?

Apache Spark 3.3.3

$ docker run -it --rm apache/spark:3.3.3 /opt/spark/bin/spark-sql
...
25/06/20 20:15:41 WARN SparkSQLCLIDriver: WARNING: Directory for Hive history file: /home/spark does not exist.   History will not be available during this session.
$ docker run -it --rm -uroot apache/spark:3.3.3 tail -1 /etc/passwd
spark:x:185:185::/home/spark:/bin/sh

$ docker run -it --rm -uroot apache/spark:3.3.3 ls -al /home/spark
ls: cannot access '/home/spark': No such file or directory

Apache Spark 3.4.4

$ docker run -it --rm -uroot apache/spark:3.4.4 tail -1 /etc/passwd
spark:x:185:185::/home/spark:/bin/sh

$ docker run -it --rm -uroot apache/spark:3.4.4 ls -al /home/spark
ls: cannot access '/home/spark': No such file or directory

Apache Spark 3.5.6

$ docker run -it --rm -uroot apache/spark:3.5.6 tail -1 /etc/passwd
spark:x:185:185::/home/spark:/bin/sh

$ docker run -it --rm -uroot apache/spark:3.5.6 ls /home/spark
ls: cannot access '/home/spark': No such file or directory

Apache Spark 4.0.0

$ docker run -it --rm -uroot apache/spark:4.0.0 tail -1 /etc/passwd
spark:x:185:185::/home/spark:/bin/sh

$ docker run -it --rm -uroot apache/spark:4.0.0 ls /home/spark
ls: cannot access '/home/spark': No such file or directory

Does this PR introduce any user-facing change?

No behavior change, because the /home/spark directory has never existed.

How was this patch tested?

Manual review.
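
For reference, a sketch of how the new passwd entry could be verified (the tag is a placeholder, and the expected line follows from this change rather than from a captured run):

$ docker run -it --rm -uroot apache/spark:<tag> tail -1 /etc/passwd
spark:x:185:185::/nonexistent:/bin/sh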

@dongjoon-hyun dongjoon-hyun changed the title from "[SPARK-52542] Use /nonexistent instead of nonexistent /opt/spark" to "[SPARK-52542] Use /nonexistent instead of nonexistent /home/spark" on Jun 20, 2025
@dongjoon-hyun
Member Author

It seems that the master branch is broken.

@dongjoon-hyun
Member Author

It's due to the flakiness of archive.apache.org.

0.064 + wget -nv -O spark.tgz https://archive.apache.org/dist/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3.tgz
133.8 failed: Connection timed out.
133.8 failed: Network is unreachable.

@dongjoon-hyun
Member Author

Rebased onto master to bring in the ASF mirror patch.

@dongjoon-hyun
Member Author

cc @Yikun , @yaooqinn , @viirya , @HyukjinKwon , @LuciferYang .

Member

@viirya viirya left a comment


Sounds reasonable. Just wondering if it could possibly break any downstream images?

Member

@Yikun Yikun left a comment


@dongjoon-hyun Thanks for catching this and fixing it. I'm OK with this change:

  • IIRC, this behavior was introduced in #11; I revisited that PR. It was addressing the root user problem and meeting the DOI requirements.
  • Considering that the home path (/home/spark) does not exist, it's safe to switch to /nonexistent; a non-existent home is also used in the nginx Docker official image.
  • The default workspace, /opt/spark/work-dir, is also unaffected.
  • This PR changes the behavior (see the sketch after this list):
      1. It will raise a different error (with different path info) on cd ~, which I think is OK.
      2. There will be a behavior change on cd ~ if a user created /home/spark via mkdir in their downstream Dockerfile based on the spark-docker base image. Such users should use cd /home/spark instead.
  • Maybe we should document this behavior change in the release notes or somewhere.
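
A hypothetical downstream sketch for that second case (the base-image tag is assumed; this is not taken from any real downstream image):

FROM apache/spark:4.0.0
USER root
# Recreate the old home directory and point HOME back at it explicitly.
RUN mkdir -p /home/spark && chown spark:spark /home/spark
ENV HOME=/home/spark
USER spark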

cc @yosifkit @tianon It would be better to make the DOI maintainers aware of this and get their blessing, to avoid unexpected reviews on the Spark Docker version-upgrade PR.

@dongjoon-hyun
Member Author

Thank you, @viirya , @HyukjinKwon , @yaooqinn , @LuciferYang , @Yikun .

This PR only affects Apache Spark-maintained images. A user who builds a Docker image from the Apache Spark binary via docker-image-tool.sh is not affected. So it would be a little strange to document this in the Apache Spark release notes. Instead, I'll try to document this somewhere on the Apache Spark website.

$ bin/docker-image-tool.sh --help | head -n2
Usage: bin/docker-image-tool.sh [options] [command]
Builds or pushes the built-in Spark Docker image.
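
For example, a user-built image (sketch; the repository and tag names below are placeholders) is produced entirely outside of spark-docker:

$ bin/docker-image-tool.sh -r myrepo -t my-tag build
$ docker run -it --rm myrepo/spark:my-tag /opt/spark/bin/spark-shell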

@dongjoon-hyun dongjoon-hyun merged commit 0f76cd1 into apache:master Jun 23, 2025
8 checks passed
@dongjoon-hyun dongjoon-hyun deleted the SPARK-52542 branch June 23, 2025 16:16
@DenWav

DenWav commented Jul 8, 2025

I think the documentation on https://hub.docker.com/r/apache/spark/ might need to be updated. If I run the example:

docker run -it apache/spark /opt/spark/bin/spark-shell

then

spark.range(1000 * 1000 * 1000).count()

it works, but I get:

25/07/08 21:07:43 WARN jline: Failed to save history
java.nio.file.AccessDeniedException: /nonexistent
	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:90)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
	at java.base/sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:462)
	at java.base/java.nio.file.Files.createDirectory(Files.java:700)
	at java.base/java.nio.file.Files.createAndCheckIsDirectory(Files.java:808)
	at java.base/java.nio.file.Files.createDirectories(Files.java:794)
	at org.jline.reader.impl.history.DefaultHistory.internalWrite(DefaultHistory.java:223)
	at org.jline.reader.impl.history.DefaultHistory.save(DefaultHistory.java:215)
	at org.jline.reader.impl.history.DefaultHistory.add(DefaultHistory.java:388)
	at org.jline.reader.impl.LineReaderImpl.finish(LineReaderImpl.java:1197)
	at org.jline.reader.impl.LineReaderImpl.finishBuffer(LineReaderImpl.java:1166)
	at org.jline.reader.impl.LineReaderImpl.readLine(LineReaderImpl.java:734)
	at org.jline.reader.impl.LineReaderImpl.readLine(LineReaderImpl.java:512)
	at scala.tools.nsc.interpreter.jline.Reader.readOneLine(Reader.scala:43)
	at scala.tools.nsc.interpreter.shell.InteractiveReader.readLine(InteractiveReader.scala:38)
	at scala.tools.nsc.interpreter.shell.InteractiveReader.readLine$(InteractiveReader.scala:38)
	at scala.tools.nsc.interpreter.jline.Reader.readLine(Reader.scala:33)
	at scala.tools.nsc.interpreter.shell.ILoop.readOneLine(ILoop.scala:453)
	at scala.tools.nsc.interpreter.shell.ILoop.loop(ILoop.scala:458)
	at scala.tools.nsc.interpreter.shell.ILoop.run(ILoop.scala:991)
	at org.apache.spark.repl.Main$.doMain(Main.scala:85)
	at org.apache.spark.repl.Main$.main(Main.scala:60)
	at org.apache.spark.repl.Main.main(Main.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:75)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:52)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1027)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:204)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:227)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:96)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1132)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1141)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

This happens for any command. Do I need to modify my Docker command to use it properly now with this change?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jul 8, 2025

To @DenWav: your question is unrelated to this PR, because the same situation already happened with the previous status, /home/spark.

$ docker run -it apache/spark:4.0.0-preview1 /opt/spark/bin/spark-shell
WARNING: Using incubator modules: jdk.incubator.vector
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.0.0-preview1
      /_/

Using Scala version 2.13.14 (OpenJDK 64-Bit Server VM, Java 17.0.12)
Type in expressions to have them evaluated.
Type :help for more information.
25/07/08 23:01:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://c919ce533784:4040
Spark context available as 'sc' (master = local[*], app id = local-1752015692659).
Spark session available as 'spark'.

scala> spark.range(1).count()
25/07/08 23:01:38 WARN jline: Failed to save history
java.nio.file.AccessDeniedException: /home/spark

@dongjoon-hyun
Member Author

Just FYI, you can also run as root:

docker run -it -u root apache/spark /opt/spark/bin/spark-shell

@tianon

tianon commented Jul 9, 2025

You can also add an anonymous volume or a tmpfs really easily with -v /nonexistent or --tmpfs /nonexistent. Or, if the name is off-putting (as it would be for me), it'd be even more reliable to explicitly set HOME, so you could do something even simpler like --env HOME=/tmp to point at a directory you already know exists.
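
Sketches of those workarounds applied to the example above (same image and entrypoint as in the earlier comments):

$ docker run -it --rm --env HOME=/tmp apache/spark /opt/spark/bin/spark-shell
$ docker run -it --rm --tmpfs /nonexistent apache/spark /opt/spark/bin/spark-shell
$ docker run -it --rm -v /nonexistent apache/spark /opt/spark/bin/spark-shell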
