-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-27947 RegionServer OOM when outbound channel backed up #5350
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
💔 -1 overall
This message was automatically generated. |
dc2bebc to
4072564
Compare
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
hbase-hadoop-compat/src/main/java/org/apache/hadoop/hbase/ipc/MetricsHBaseServerWrapper.java
Show resolved
Hide resolved
| } | ||
|
|
||
| private boolean handleFatalThreshold(ChannelHandlerContext ctx) { | ||
| int fatalThreshold = rpcServer.getWriteBufferFatalThreshold(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We just need this threshold? Let's just pass the threshold in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean here. You mean instead of exposing a getter?
I want to use a getter here so that the fatal threshold can be live updated with update_config. If I passed it into the NettyRpcServerResponseEncoder, it would be static for the lifetime of a connection. I could pass in an IntSupplier instead, but is that much better than passing in the NettyRpcServer? I'm happy to do that if preferred
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's pass an IntSupplier and add comment to say we want to support update_config so we can not pass the threshold in directly.
| return false; | ||
| } | ||
|
|
||
| NettyServerRpcConnection conn = NettyServerRpcConnection.get(channel); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just pass the connection in when creating the encoder? Like what we have done in decoder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The decoder is created after the NettyServerRpcConnection is constructed. The encoder unfortunately is created in NettyRpcServer when the channel is first initialized, before the NettyServerRpcConnection is constructed. Maybe we can refactor a bit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's refactor. Using different pattern for encoder and decoder will make others confusing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| // when SslHandler is enabled, as it prefers to send a close_notify to the client first. | ||
| channel.config().setOption(ChannelOption.SO_LINGER, 0); | ||
| NettyUnsafeUtils.closeDirect(channel); | ||
| aborted = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thte above close, channel.isOpen could still returns true? Otherwise we do not need to test aborted in the below isConnectionOpen method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH I'm not 100% sure. I added this just to be sure, but I can try digging deeper in the code to see where in process isConnectionOpen() would turn false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I traced through the code.
- Since this must be called in the event loop, we don't need to worry about thread safety
- Since we set SO_LINGER to 0, the call to closeDirect should
socket.close() - This will cause
isConnectionOpen()to return false
So we should be ok to remove the aborted boolean. I'm going to run it through my test case and see if I noticed any duplicate logging or anything to indicate otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Didn't see any issues in tests.
hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/ServerRpcConnection.java
Outdated
Show resolved
Hide resolved
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
hbase-common/src/main/java/org/apache/hadoop/hbase/util/NettyUnsafeUtils.java
Outdated
Show resolved
Hide resolved
hbase-hadoop-compat/src/main/java/org/apache/hadoop/hbase/ipc/MetricsHBaseServerWrapper.java
Show resolved
Hide resolved
hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/NettyRpcServer.java
Outdated
Show resolved
Hide resolved
hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/NettyRpcServer.java
Outdated
Show resolved
Hide resolved
… the handlers rather than assert eventLoop in connection
|
💔 -1 overall
This message was automatically generated. |
|
@Apache9 I just pushed a change which includes a bit of a refactor:
Let me know what you think |
| @InterfaceAudience.Private | ||
| protected NettyRpcServerPreambleHandler createNettyRpcServerPreambleHandler() { | ||
| return new NettyRpcServerPreambleHandler(NettyRpcServer.this); | ||
| protected NettyServerRpcConnection createNettyServerRpcConnection(Channel channel) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this change is no longer strictly necessary, but i kept it because i think it's a cleaner way to do what we want to do. in all cases, tests were overriding createNettyRpcServerPreambleHandler() just so they could inject a special NettyServerRpcConnection. So it works better to simply override the creation of the NettyServerRpcConnection directly.
|
💔 -1 overall
This message was automatically generated. |
|
I went through and deployed this refactor into my test cluster. Still works as expected. One thing to note is that when we abort the connection, the This gets spammed for all of the requests that were in flight when the connection was closed. I wonder if I should add a |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
Apache9
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1.
Thanks for tuning this!
normanmaurer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left 2 comments related to issues that I noticed while reviewing this.
| public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) | ||
| throws Exception { | ||
| if (handleFatalThreshold(ctx)) { | ||
| promise.setFailure(EXCEPTION); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You must also release the msg via ReferenceCountUtil.release(msg) before return early as otherwise you might leak memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch on this, thank you
| public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) | ||
| throws Exception { | ||
| if (handleFatalThreshold(ctx)) { | ||
| promise.setFailure(EXCEPTION); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also as a side note reusing a static exception instance is only safe if the instance does not support addSuppressed(...) as otherwise it is possible that you will end up with a memory leak if someone add suppressed exceptions to the static instance, which is possible in this case as you dont know what listeners etc will do.
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
Signed-off-by: Duo Zhang <[email protected]> Reviewed-by: Norman Maurer <[email protected]>
Signed-off-by: Duo Zhang <[email protected]> Reviewed-by: Norman Maurer <[email protected]>
…5350) Signed-off-by: Duo Zhang <[email protected]> Reviewed-by: Norman Maurer <[email protected]>
…5350) Signed-off-by: Duo Zhang <[email protected]> Reviewed-by: Norman Maurer <[email protected]> (cherry picked from commit 29ecfc5) Change-Id: I7ab0cdcbb9c8ee615ad50aacad46145ab50c566f
If a client is not able to read response bytes from the channel faster than the server can serve them, netty's pending outbound bytes will build up. If it builds up long enough, it will result in an OOM. We protect against this with the following:
channel.setAutoRead(false). When the high watermark is exceeded, autoRead is disabled. When autoRead is disabled, the server will not continue to read incoming bytes from the client, thus will not enqueue more calls from the client.fatalthreshold, which should be higher than the high watermark. If exceeded, the connection will be forcibly closed so that we can reclaim the memory.Adds new configs:
All 3 default to 0. The setAutoRead management is disabled if both
highandloware 0. Enforcing of the fatal threshold is disabled iffatalis 0. Thus this new behavior is disabled by default.Adds new metrics:
A version of this has been tested internally at my company. We set the fatal limit to 100mb, and were able to avoid any OOMs in our test case.
I still need to add tests to AbstractTestIPC, but wanted to get feedback first.