@jkremser
Contributor

@jkremser jkremser commented Nov 23, 2017

What changes were proposed in this pull request?

If an InvalidClassException is thrown in TransportRequestHandler.processOneWayMessage (as a consequence of differing serialVersionUIDs), send the connected client a message saying that there is probably a version mismatch between their spark-submit and the remote Spark master they are trying to contact.

How was this patch tested?

TODO:

@jerryshao
Contributor

Can you please explain more, and how to reproduce this issue? Spark's RPC is not designed to be version-compatible.

@jkremser
Contributor Author

jkremser commented Nov 24, 2017

Can you please explain more, and how to reproduce this issue?

Sure,

  1. Start the master and a worker (version 2.2.0, for instance):
     cat ./python/pyspark/version.py
     __version__='2.2.0'
     ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://`hostname -I | cut -d' ' -f1`:7077 -c 1 -m 1G &> /dev/null &
     ./bin/spark-class org.apache.spark.deploy.master.Master
  2. Then, in another terminal, run spark-submit from a different Spark version (say, 2.3.0-SNAPSHOT):
     cat ./python/pyspark/version.py
     ...
     __version__ = "2.3.0.dev0"
     ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
                        --master spark://`hostname -I | cut -d' ' -f1`:7077 \
                        --executor-memory 512M \
                        $PWD/examples/target/scala-2.11/jars/spark-examples_2.11-2.3.0-SNAPSHOT.jar \
                        10

Result: the Spark master log contains the following exception:

java.io.InvalidClassException: org.apache.spark.rpc.RpcEndpointRef; local class incompatible: stream classdesc serialVersionUID = -1329125091869941550, local class serialVersionUID = 1835832137613908542

On the spark-submit terminal, however, nothing hints that a different version might be running. It tries to connect a couple of times and then fails with: 17/11/24 13:13:58 ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up. Here is the complete log: https://pastebin.com/Wzs8vjBd
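The InvalidClassException above can be reproduced without Spark. The following is a minimal sketch, not code from the PR: Payload is a hypothetical stand-in for a class such as RpcEndpointRef, and the "other version" is simulated by flipping one bit of the serialVersionUID recorded in the serialized stream, so the reader sees a stream UID that differs from its local class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidClassException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class SerialUidMismatchDemo {

  /** Hypothetical stand-in for a serializable RPC class; not a Spark class. */
  static final class Payload implements Serializable {
    private static final long serialVersionUID = 1L;
    final String value = "hello";
  }

  /** Serializes a Payload, then alters the serialVersionUID stored in the stream. */
  static byte[] serializeWithForeignUid() throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
      out.writeObject(new Payload());
    }
    byte[] bytes = buffer.toByteArray();
    // Stream layout: magic (2) + version (2) + TC_OBJECT (1) + TC_CLASSDESC (1)
    // + class name length (2) + class name, followed by the 8-byte serialVersionUID.
    int uidOffset = 8 + Payload.class.getName().length();
    bytes[uidOffset + 7] ^= 0x01; // flip the lowest bit: 1L becomes 0L
    return bytes;
  }

  /** Tries to deserialize and returns the exception message if the UIDs disagree. */
  static String describeFailure(byte[] bytes) throws IOException, ClassNotFoundException {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      in.readObject();
      return "deserialized without error";
    } catch (InvalidClassException e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(describeFailure(serializeWithForeignUid()));
  }
}
```

The reader rejects the stream before any field is restored, which is why the receiving side only ever sees the "local class incompatible" text and nothing about versions.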

Spark's RPC is not designed to be version-compatible.

I hear you. On the other hand, the PR doesn't even try to make it compatible; all it does is translate a cryptic error into a more understandable one. I think it may be quite common to run an older spark-submit against an updated Spark master, or at least I've hit the issue a couple of times. Each time I had to google the exception and then figure out, on Stack Overflow or elsewhere, that it actually means there was a version discrepancy.

If this PR is merged, the "client" (the invoker of spark-submit) will get this message in the log:

17/11/23 16:28:31 WARN TransportResponseHandler: Ignoring response for RPC 0 from /10.43.17.5:7077
 (There is probably a version mismatch between client and server: java.io.InvalidClassException: org.apache.spark.rpc.RpcEndpointRef; local class incompatible: stream classdesc serialVersionUID = 1835832137613908542, local class serialVersionUID = -1329125091869941550
	at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1843)
...

Member

The problem is that this won't catch all or even most errors resulting from version incompatibility. Spark has never supported or contemplated mismatched versions internally. I don't think we should try to handle this, because it's fundamentally piecemeal and hacky.

Contributor Author

@jkremser jkremser Nov 24, 2017

Perhaps I should have picked a different name for the PR than "Handling spark-submit and master version mismatch". It doesn't try to solve the issue in a complex way, so that two different versions could talk to each other. All it does is tell the user: hey, you probably have a different version than the Spark master. I agree it's a little bit hacky. On the other hand, I see no option other than catching the InvalidClassException if the version is not part of the message. Perhaps some initial handshake in which the version is sent would be cleaner.

What about re-throwing the exception? This way it wouldn't change the semantics of the code, but the client would be informed. wdyt?

@jkremser
Contributor Author

jkremser commented Nov 24, 2017

To better explain the intention: it doesn't try to solve or somehow provide a compatibility layer between old and new versions of Spark. All it does is slightly improve the UX, because people are hitting the issue all the time (including me).

couple of instances of this issue:

more here:
https://www.google.com/search?ei=BxMYWt3xLJDdwQKLwr9w&q=java.io.InvalidClassException%3A+org.apache.spark.rpc.RpcEndpointRef%3B+local+class+incompatible

By the way, it's not only an issue with spark-submit. spark-shell connecting to a remote master has the same flaw, and with this change it would get the message that something is wrong. Currently it just hangs without any additional error.
I'd bet it also happens with any other kind of Spark driver, be it a Jupyter notebook or a JVM process trying to connect to a remote master.

@felixcheung
Member

I think there is some value in having a better experience with mismatched client/server versions. We discussed that it might be even more common when the client is an R package (or Python, perhaps, down the line) @shivaram

I agree, though, that focusing on one particular line is likely not sufficient. Perhaps we should design this out? Have this version check formally be part of the protocol handshake?
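To make the handshake idea concrete, here is a minimal sketch of the check a server could run once a client has announced its version. Everything in it is an assumption for illustration: Spark has no VersionHandshake class, the thread does not define how the version would be transmitted, and requiring exact equality is only one possible policy.

```java
import java.util.Optional;

final class VersionHandshake {
  private VersionHandshake() {}

  /**
   * Returns a message for the client when the versions differ, or empty when they match.
   * Exact string equality is an assumed policy for this sketch.
   */
  static Optional<String> check(String clientVersion, String serverVersion) {
    if (clientVersion.equals(serverVersion)) {
      return Optional.empty();
    }
    return Optional.of(String.format(
        "Version mismatch: the client runs Spark %s but the master runs Spark %s.",
        clientVersion, serverVersion));
  }

  public static void main(String[] args) {
    // Versions taken from the reproduction steps in this thread.
    System.out.println(check("2.3.0-SNAPSHOT", "2.2.0").orElse("versions match"));
  }
}
```

A check like this only helps when both sides already speak the handshake, so it would complement rather than replace handling of the deserialization failure.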

@jkremser jkremser changed the title from "[WIP][SPARK-22594][CORE] Handling spark-submit and master version mismatch" to "[SPARK-22594][CORE] Handling spark-submit and master version mismatch" on Dec 21, 2017
@jkremser
Contributor Author

jkremser commented Dec 21, 2017

I've also added tests and removed "WIP" from the title. Do you want the change as it is, or would you prefer something more elaborate, like an initial handshake?

One note about the "handshake approach": suppose a new initial message is sent from client to server, in which the client sends its version and the server decides whether it is backward compatible with that particular client version. This approach alone wouldn't solve the case where an old client (one that knows nothing about the handshake) tries to talk to a new server. This PR does cover that scenario.

Merry Christmas 🎄

@jkremser
Contributor Author

@srowen is there anything I can improve in the PR?

Member

@srowen srowen left a comment

I think it's probably OK if you also re-throw that exception for now. You're just giving more info then.

Member

These are JVM asserts, not JUnit asserts.
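For readers unfamiliar with the distinction, a short sketch follows. The `assert` keyword is checked only when the JVM runs with assertions enabled (the -ea flag), whereas JUnit's assert methods are ordinary method calls that always execute. The class and the assertTrue helper below are hypothetical and merely mimic the JUnit behavior so the example stays free of dependencies.

```java
final class AssertKinds {
  private AssertKinds() {}

  /** Reports whether the JVM `assert` keyword is active for this class. */
  static boolean jvmAssertsEnabled() {
    boolean enabled = false;
    assert enabled = true; // this side effect runs only when assertions are enabled
    return enabled;
  }

  /** An explicit check that always runs, which is how JUnit's assert methods behave. */
  static void assertTrue(String message, boolean condition) {
    if (!condition) {
      throw new AssertionError(message);
    }
  }

  public static void main(String[] args) {
    System.out.println("JVM asserts enabled: " + jvmAssertsEnabled());
    assertTrue("always evaluated, with or without -ea", 1 + 1 == 2);
  }
}
```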

Contributor Author

👍
ah, long time no Java 🤦‍♂️

Member

Nit: put static members first and separated by whitespace.

Member

2-space indent, not 4-space.

Member

Space after casts

Member

I don't think it's necessary to assert on the message text, because it could change.

@jkremser jkremser force-pushed the SPARK-22594 branch 2 times, most recently from 868e650 to ff0b7fb on January 10, 2018 15:05
@jkremser
Contributor Author

jkremser commented Jan 10, 2018

I think it's probably OK if you also re-throw that exception for now. You're just giving more info then.

Hmm, in the original code the exception wasn't re-thrown. Do you really think it is a good idea? InvalidClassException is a checked exception, so the signature of the processOneWayMessage() method would need to change as well.

This is the original 'catch-them-all' code :)

} catch (Exception e) {
  logger.error("Error while invoking RpcHandler#receive() for one-way message.", e);
}
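For comparison, a more specific catch clause can sit in front of the catch-all without re-throwing, so the method signature stays unchanged. The following is only a sketch of that shape, not the code in this PR: Receiver is a hypothetical stand-in for the RPC handler, and the method returns the text that would be logged or reported to the client instead of calling a logger.

```java
import java.io.InvalidClassException;
import java.util.Optional;

final class OneWayMessageSketch {
  private OneWayMessageSketch() {}

  /** Hypothetical stand-in for the handler invoked for a one-way message. */
  interface Receiver {
    void receive(byte[] message) throws Exception;
  }

  /** Returns the text to log (or report to the client), or empty on success. */
  static Optional<String> process(Receiver receiver, byte[] message) {
    try {
      receiver.receive(message);
      return Optional.empty();
    } catch (InvalidClassException e) {
      // Checked exception handled here, so no `throws` clause is needed.
      return Optional.of(
          "There is probably a version mismatch between client and server: " + e);
    } catch (Exception e) {
      return Optional.of(
          "Error while invoking RpcHandler#receive() for one-way message: " + e);
    }
  }

  public static void main(String[] args) {
    Optional<String> hint = process(m -> {
      throw new InvalidClassException("Example", "local class incompatible");
    }, new byte[0]);
    System.out.println(hint.orElse("no error"));
  }
}
```

Because InvalidClassException is a subclass of Exception, the narrower clause has to come first; every other failure still falls through to the original catch-all behavior.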

I think I addressed all your other inline comments. Thanks for the review.

@srowen
Member

srowen commented Jan 10, 2018

I see, yeah that's a fine point. Don't rethrow then.

@jkremser
Contributor Author

@srowen was Jenkins ok with the change? I can't see the results.

@SparkQA

SparkQA commented Jan 30, 2018

Test build #4086 has finished for PR 19802 at commit 4f79632.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jan 31, 2018

@vanzin do you have thoughts on this one?

@vanzin
Contributor

vanzin commented Jan 31, 2018

I think this is the wrong place for the fix, if one is desired. The transport library doesn't know anything about serialization; that's all handled by the code using it, in this case the 'RpcEnv' code in core. So I don't think it's correct to handle those errors in the transport library.

@miacobv

miacobv commented Feb 1, 2018

I get this when I start the master and worker, without running spark-submit, on both 2.2.0 and 2.2.1.

@AmplabJenkins

Can one of the admins verify this patch?

@vanzin
Contributor

vanzin commented May 11, 2018

@Jiri-Kremser if you're not planning to address the feedback you should probably close this PR.

@vanzin vanzin mentioned this pull request May 11, 2018
@jkremser jkremser closed this May 11, 2018