@jiasheng55

What changes were proposed in this pull request?

Currently, the credentials file configuration is recovered from the checkpoint file when a Spark Streaming application is restarted, which leads to unwanted behavior. For example:

  1. Submit a Spark Streaming application using a keytab file, with checkpointing enabled, in yarn-cluster mode:

     spark-submit --master yarn-cluster --principal xxxx --keytab xxx ...

  2. Stop the Spark Streaming application.
  3. Resubmit the application after a period of time (e.g. one day).
  4. The credentials file configuration is recovered from the checkpoint file, so the value of "spark.yarn.credentials.file" points to the old staging directory (e.g. hdfs://xxxx/.sparkStaging/application_xxxx/credentials-xxxx, where application_xxxx is the application id of the previous application, which was stopped).
  5. When an executor is launched, ExecutorDelegationTokenUpdater immediately updates credentials from the credentials file. Because the credentials file was generated one day ago (or earlier), it has already expired, so after a period of time the executor keeps failing.
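The failure mode in the steps above, and the kind of fix this PR aims at, can be sketched with a small model. This is only an illustration under stated assumptions: plain dictionaries stand in for Spark's configuration, `restore_conf` and `keys_to_reload` are hypothetical names that do not come from Spark, and the `application_old` / `application_new` paths are made-up placeholders. The idea is that a restored configuration should take `spark.yarn.credentials.file` from the current submission rather than from the checkpoint.

```python
# Hypothetical model of restoring a configuration from a streaming checkpoint.
# Plain dicts stand in for Spark's configuration object; restore_conf and
# keys_to_reload are illustrative names, not part of Spark.

CREDENTIALS_FILE_KEY = "spark.yarn.credentials.file"


def restore_conf(checkpointed, current, keys_to_reload=(CREDENTIALS_FILE_KEY,)):
    """Return the checkpointed settings, except that every key listed in
    keys_to_reload takes its value from the current submission."""
    restored = dict(checkpointed)
    for key in keys_to_reload:
        # Drop the stale value first so it cannot survive the restart.
        restored.pop(key, None)
        if key in current:
            restored[key] = current[key]
    return restored


# Placeholder values: the checkpoint still points at the previous
# application's staging directory, the new submission at its own.
checkpointed = {
    "spark.app.name": "streaming-job",
    CREDENTIALS_FILE_KEY: "hdfs://xxxx/.sparkStaging/application_old/credentials-old",
}
current = {
    CREDENTIALS_FILE_KEY: "hdfs://xxxx/.sparkStaging/application_new/credentials-new",
}

restored = restore_conf(checkpointed, current)
print(restored[CREDENTIALS_FILE_KEY])
```

In this model, ordinary settings such as the application name still come from the checkpoint. Only the keys passed in `keys_to_reload` are refreshed, and a key that is absent from the current submission is dropped rather than left pointing at the old staging directory.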

Some useful logs are shown below:

2017-04-27,15:08:08,098 INFO org.apache.spark.executor.CoarseGrainedExecutorBackend: Will periodically update credentials from: hdfs://xxxx/application_xxxx/credentials-xxxx
2017-04-27,15:08:12,519 INFO org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater: Reading new delegation tokens from hdfs://xxxx/application_1xxxx/credentials-xxxx-xx
2017-04-27,15:08:12,661 INFO org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater: Tokens updated from credentials file.
...
2017-04-27,15:08:48,156 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token xxxx for xx) can't be found in cache

How was this patch tested?

manual tests

@AmplabJenkins

Can one of the admins verify this patch?

@jiasheng55
Author

See the comments on the previous PR, #17782.

@jerryshao
Contributor

@Victor-Wong can you please update the PR title like other PRs?

From your description, the log seems to be from an old Spark version. In the latest Spark there is no ExecutorDelegationTokenUpdater; it has been renamed to CredentialUpdater. Also, CredentialUpdater does not update the credentials immediately at start; that is controlled by spark.yarn.credentials.updateTime. Can you please check whether your problem still exists in the latest master code, and what exception is thrown?

Also, I would guess that some more internal configurations should be excluded from the checkpoint, such as "spark.yarn.credentials.renewalTime" and "spark.yarn.credentials.updateTime".

@jerryshao
Contributor

Besides, I guess this issue only exists in yarn-cluster mode. Can you also verify that?

@HyukjinKwon
Member

Ping @Victor-Wong, how is it going?

@gatorsmile
Member

We are closing this due to inactivity. Please do reopen it if you want to push it forward. Thanks!

@jerryshao
Contributor

This is already fixed in #18230. CC @gatorsmile.

@asfgit asfgit closed this in b32bd00 Jun 27, 2017
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
## What changes were proposed in this pull request?

This PR proposes to close stale PRs, mostly the same kind of cases as in apache#18017.

I believe the author in apache#14807 removed his account.

Closes apache#7075
Closes apache#8927
Closes apache#9202
Closes apache#9366
Closes apache#10861
Closes apache#11420
Closes apache#12356
Closes apache#13028
Closes apache#13506
Closes apache#14191
Closes apache#14198
Closes apache#14330
Closes apache#14807
Closes apache#15839
Closes apache#16225
Closes apache#16685
Closes apache#16692
Closes apache#16995
Closes apache#17181
Closes apache#17211
Closes apache#17235
Closes apache#17237
Closes apache#17248
Closes apache#17341
Closes apache#17708
Closes apache#17716
Closes apache#17721
Closes apache#17937

Added:
Closes apache#14739
Closes apache#17139
Closes apache#17445
Closes apache#18042
Closes apache#18359

Added:
Closes apache#16450
Closes apache#16525
Closes apache#17738

Added:
Closes apache#16458
Closes apache#16508
Closes apache#17714

Added:
Closes apache#17830
Closes apache#14742

## How was this patch tested?

N/A

Author: hyukjinkwon <[email protected]>

Closes apache#18417 from HyukjinKwon/close-stale-pr.