HADOOP-19209. Update and optimize hadoop-runner #6910
RUN apt update -q \
    && DEBIAN_FRONTEND=noninteractive apt install -y --no-install-recommends \
       jq \
       krb5-user \
I remember there were some issues in my previous tests with Ubuntu 22 and secured Hadoop. As far as I recall, they were related to OpenSSL 1 being removed from the apt sources, while Hadoop does not work with OpenSSL 3.
Thanks @pan3793 for the info. We can tweak the image later if needed, based on bug reports.
@ayushtkn @jojochuang @smengcl could you please review, or help find someone who can review?
@ayushtkn @jojochuang @smengcl please take a look
ayushtkn
left a comment
What's the plan? How do we intend to publish the image, manually? I think it is high time we move to GitHub Actions to publish the Docker images...
x86_64) \
  sha256='e874b55f3279ca41415d290c512a7ba9d08f98041b28ae7c2acb19a545f1c4df'; \
  ;; \
aarch64) \
Does this take care of both aarch64 and arm64? In create-release we had to handle both:
hadoop/dev-support/bin/create-release
Line 208 in f000942
if [[ "$CPU_ARCH" = "aarch64" || "$CPU_ARCH" = "arm64" ]]; then
I don't have access to ARM64 hardware. This was taken from ozone-runner, where @smengcl added support for ARM-based Mac.
BTW, not sure we can try to cover all arm... architectures:
https://stackoverflow.com/questions/45125516/possible-values-for-uname-m
I think that change was done as part of https://issues.apache.org/jira/browse/HADOOP-19238, so it would be a Mac thing...
We can wait for @smengcl to confirm things, I am not very experienced in this
@ayushtkn Yup, aarch64 alone should do it here for the Dockerfile, because the arch command returns aarch64 on Linux (inside a Docker Desktop VM on macOS).
It is true that the arch command on macOS (M1 or later) gives arm64, but I don't see a case where the Dockerfile would be built natively (by docker build under macOS), which differs from #6962. So I don't think that is a problem here.
On a sidenote, MACHTYPE env variable is not a reliable way to give the current system architecture because for instance zsh always gives the compile-time system arch: https://apple.stackexchange.com/a/467854 . And this is happening to the zsh shipped in latest macOS builds for M1 and later (presumably because it was cross-compiled on a x86_64 box):
$ uname -mp
arm64 arm
$ echo $0
-zsh
$ echo $MACHTYPE
x86_64
while built-in bash behaves differently and gives the intended result.
This should prove the point that MACHTYPE env variable (alone) is not reliable.
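As a rough illustration of the point above, a script that needs the machine architecture could ask the running kernel via uname -m and fold the two ARM spellings into one name, instead of reading MACHTYPE. This is only a sketch: the helper name normalize_arch is hypothetical and does not come from the Dockerfile or from create-release, and the amd64 alias is an added assumption.

```shell
# Hypothetical helper: map raw `uname -m` output to one canonical name.
# Treats arm64 (macOS spelling) and aarch64 (Linux spelling) as the same thing.
normalize_arch() {
  case "$1" in
    x86_64|amd64) echo "x86_64" ;;   # amd64 alias is an assumption
    aarch64|arm64) echo "aarch64" ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Ask the running kernel rather than trusting $MACHTYPE,
# which may reflect the shell's compile-time architecture.
arch="$(normalize_arch "$(uname -m)")" || arch="unknown"
echo "detected: ${arch}"
```

Inside the Dockerfile this normalization is arguably unnecessary, since the build runs on Linux where only aarch64 is reported, which matches the conclusion in the comment above.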
Thanks @smengcl for the details and confirmation.
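For context, the per-architecture case block quoted earlier in this thread selects an expected SHA-256 digest for a download. A minimal sketch of the verification step that such a digest feeds into might look like the following; verify_sha256 is a hypothetical helper, not code from this PR, and it assumes GNU sha256sum is available (as on Ubuntu).

```shell
# Hypothetical helper: succeed only if FILE's SHA-256 digest equals EXPECTED.
verify_sha256() {
  file="$1"
  expected="$2"
  # sha256sum prints "<digest>  <file>"; keep only the digest.
  actual="$(sha256sum "$file" | cut -d' ' -f1)"
  if [ "$actual" != "$expected" ]; then
    echo "checksum mismatch for $file" >&2
    return 1
  fi
}

# Demo on an empty file, whose SHA-256 digest is a well-known constant.
demo="$(mktemp)"
verify_sha256 "$demo" "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" \
  && echo "checksum ok"
```

In a Dockerfile the same check is often written inline with sha256sum -c; the function form is used here only so the logic can be exercised on its own.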
Thanks @ayushtkn for taking a look.
AFAIK: Hadoop images are built by Docker Hub automation set up by Apache Infra, mapping from branch name to image tag. For the We can publish new tags by creating new branches.
That's why tags like 3.3.6 must be published manually as of now.
Should be fine then. I was thinking that sometime in the future, rather than relying on these branches & stuff, we start doing it in our code like
Will this update generate a new 2.10 (2.10.3?) release?
No, the Hadoop release is independent of this.
smengcl
left a comment
Thanks @adoroszlai. Looks fine by me.
ayushtkn
left a comment
LGTM
Thanks for the reviews. Pushed to branch
What changes were proposed in this pull request?
apache/hadoop-runner Docker image comes with software necessary to run Hadoop, as well as some nice-to-haves for testing it. apache/hadoop images add the Hadoop release binaries on top of that. (see HADOOP-14898 for details)
This PR updates the definition of the hadoop-runner image:
- [...] eclipse-temurin, an official Docker image with OpenJDK installed on top of Ubuntu (22.04 LTS in this case).
- [...] hadoop-runner image.
- [...] hadoop-runner with various versions.
- [...] aarch64 architecture (for Apple M1 and beyond)
It also improves build.sh (the helper script for developers):
- [...] dev instead of latest. This lets the developer keep using latest from Docker Hub while working on the image.
- [...] jdk11-dev.
Misc. improvements:
- [...] .dockerignore to reduce the size of context sent to Docker while building the image.
- [...] build (temp dir where Rat is downloaded) to .gitignore.
This PR targets the docker-hadoop-runner-jdk11 branch, so only the apache/hadoop-runner:jdk11 image would be rebuilt after merging the PR (see INFRA-18001 for the mapping). To avoid potential disruption for any existing users of this image, it would be useful to publish a new Docker image tag, which requires a new Git branch. If the changes in this PR are approved, we can create the new branch by pushing the commit directly instead of merging the PR.
https://issues.apache.org/jira/browse/HADOOP-19209
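To make the local tagging idea concrete, here is a small sketch of how a helper script could derive development tags such as jdk8-dev and jdk11-dev so that they never collide with tags pulled from Docker Hub. It is not the actual build.sh from this PR: the function name dev_tag and the JAVA_VERSION build argument are assumptions for illustration, and the docker build line is left commented out so the snippet runs without Docker.

```shell
# Hypothetical helper: compose a local-only tag, e.g. "apache/hadoop-runner:jdk11-dev".
dev_tag() {
  java_version="$1"
  echo "apache/hadoop-runner:jdk${java_version}-dev"
}

# Java versions 8 and 11 correspond to the jdk8-dev and jdk11-dev tags mentioned in this PR.
for v in 8 11; do
  echo "would build: $(dev_tag "$v")"
  # JAVA_VERSION is an assumed build-arg name, not taken from the real Dockerfile:
  # docker build --build-arg JAVA_VERSION="$v" -t "$(dev_tag "$v")" .
done
```

Keeping the -dev suffix separate from published tags means a developer can still pull the released image while iterating on a locally built one.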
How was this patch tested?
Built the image for various Java versions:
Image size is smaller than current apache/hadoop-runner:latest, despite having full JDK instead of just JRE:
Verified Java version:
Built Hadoop image for 3.3.6 on top of jdk8-dev by changing FROM in Dockerfile on docker-hadoop-3 branch.
Verified Hadoop version and being able to run hadoop command:
Tested using docker-compose (after editing docker-compose.yaml to use the specific image instead of re-building):
Also used the 3.3.6-dev image successfully in Apache Ozone's Docker-based tests for Hadoop integration.