Discussion:
[jetty-users] Jetty thread spins onIncompleteFlush / SslConnection$1.run forever
Steven Schlansker
2018-07-10 23:39:53 UTC
Permalink
Hi everyone,

We've observed runaway Jetty threads in our production environment.
The remaining threads seem to continue serving traffic just fine, but two of them are totally stuck, seemingly in a spinloop.

We run Jetty 9.4.11.v20180605 on Java 9.0.4.

You can see the difference by comparing a couple of our threads:



[ The "hung" thread has an order of magnitude more CPU time than any other worker thread ]
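
For what it's worth, the same per-thread CPU numbers can also be pulled from inside the JVM. Here is a generic sketch using ThreadMXBean (an illustration only, not the monitoring tool we actually used):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadCpuDump
{
    public static void main(String[] args)
    {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (long id : threads.getAllThreadIds())
        {
            ThreadInfo info = threads.getThreadInfo(id);
            long cpuNanos = threads.getThreadCpuTime(id); // -1 if CPU time measurement is unsupported or disabled
            if (info != null && cpuNanos >= 0)
                System.out.printf("%-40s %,d ms cpu%n", info.getThreadName(), cpuNanos / 1_000_000);
        }
    }
}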

By sampling the stacks, I was able to spot a couple of hot spots:

"default-pool-4665" #4665 prio=5 os_prio=0 tid=0x00000000013c5000 nid=0x1258 runnable [0x00007ff9eb1ef000]
java.lang.Thread.State: RUNNABLE
at org.eclipse.jetty.io.ssl.SslConnection$1.run(SslConnection.java:121)
at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onIncompleteFlush(SslConnection.java:545)
at org.eclipse.jetty.io.AbstractEndPoint$2.onIncompleteFlush(AbstractEndPoint.java:54)
at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:332)
at org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:380)
at org.eclipse.jetty.server.HttpConnection$SendCallback.process(HttpConnection.java:804)
at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)
at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:224)
at org.eclipse.jetty.server.HttpConnection.send(HttpConnection.java:538)
at org.eclipse.jetty.server.HttpChannel.sendResponse(HttpChannel.java:795)

"default-pool-4665" #4665 prio=5 os_prio=0 tid=0x00000000013c5000 nid=0x1258 runnable [0x00007ff9eb1ef000]
java.lang.Thread.State: RUNNABLE
at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(java.base@9.0.4/ThreadLocal.java:443)
at java.lang.ThreadLocal$ThreadLocalMap.getEntry(java.base@9.0.4/ThreadLocal.java:418)
at java.lang.ThreadLocal$ThreadLocalMap.access$000(java.base@9.0.4/ThreadLocal.java:298)
at java.lang.ThreadLocal.get(java.base@9.0.4/ThreadLocal.java:163)
at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onIncompleteFlush(SslConnection.java:547)
at org.eclipse.jetty.io.AbstractEndPoint$2.onIncompleteFlush(AbstractEndPoint.java:54)
at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:332)
at org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:380)
at org.eclipse.jetty.server.HttpConnection$SendCallback.process(HttpConnection.java:804)
at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)
at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:224)



In a series of dozens of samples, I never saw a stack that was outside AEP.write; every stack trace I got looked like this.
Note how 'ThreadLocal.get' accounts for about 1/3 of the time -- this must be a *hot* loop!

I attempted to raise the logging level to DEBUG for WriteFlusher and SslConnection, but unfortunately Jetty
uses this pattern in WriteFlusher to boost performance:

private static final boolean DEBUG = LOG.isDebugEnabled(); // Easy for the compiler to remove the code if DEBUG==false

which means my attempts to increase debugging without restarting were totally ineffective.
(Is the performance boost here worth the inability to change logging levels at runtime? Pretty annoying in this case! And the JIT ought to be able to elide it all the same if it is actually disabled.)
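
To illustrate the problem (a hypothetical class, not Jetty's WriteFlusher): the static final flag is a snapshot taken once at class initialization, whereas a per-call isDebugEnabled() check would honour a level change made at runtime:

import org.eclipse.jetty.util.log.Log;
import org.eclipse.jetty.util.log.Logger;

class DebugFlagExample
{
    private static final Logger LOG = Log.getLogger(DebugFlagExample.class);

    // Evaluated exactly once, when the class is initialized.
    private static final boolean DEBUG = LOG.isDebugEnabled();

    void flush()
    {
        if (DEBUG)                     // stays false even after the level is raised at runtime
            LOG.debug("flush attempt (cached flag)");

        if (LOG.isDebugEnabled())      // re-checked on every call, so a runtime change takes effect
            LOG.debug("flush attempt (live check)");
    }
}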

I dumped the heap and inspected the state of the thread directly.
I observed the following state at an arbitrary point in time:

WriteFlusher._state = IDLE
HttpConnectionOverHTTP.state = CLOSED
SslConnection$DecryptedEndPoint._state = OPEN
SSLEngineImpl.connectionState = 2 ( DATA )
SocketChannelImpl.closed = false

I verified (with lsof -i and netstat -a) that the socket is no longer known to the kernel (or, at least, nothing is bound to the HttpConnectionOverHTTP._endPoint._local.port).

To me, it looks like the thread is stuck in a hot loop: it sees an incomplete flush and "tries again", but the connection is no longer open, so no data is written;
it makes no forward progress, yet it also never exits.

I note that in DecryptedEndPoint.onIncompleteFlush, the loop preamble checks `isOutputShutdown()` but then we never re-check for a closed socket inside the __tryWriteAgain loop.
Inside SslConnection$1.run, we call WriteFlusher.completeWrite(), which checks things like:

if (previous.getType() != StateType.PENDING)
    return; // failure already handled.

If the WriteFlusher's state is IDLE, this returns immediately. (I can't know what the state was at the time, but I see a number of early-exit paths in the code.)
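
To sketch the shape of that early exit (hypothetical names, not the real WriteFlusher):

import java.util.concurrent.atomic.AtomicReference;

class FlusherSketch
{
    enum StateType { IDLE, WRITING, PENDING, FAILED }

    private final AtomicReference<StateType> _state = new AtomicReference<>(StateType.IDLE);

    void completeWrite()
    {
        StateType previous = _state.get();
        if (previous != StateType.PENDING)
            return; // early exit: nothing is completed and nothing is failed

        // ... otherwise re-attempt the flush and either complete the callback
        //     (state -> IDLE) or fail it (state -> FAILED) ...
    }
}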

So my current hypothesis is that the socket was asynchronously closed in the middle of a retry, in such a way that the write never succeeds and never fails, and the loop just burns CPU.

I'm not very familiar with this code, but hopefully someone here can help guide us to understand what is going on. I see that `try_again` is only set in a //TODO block, so I worry that we might have hit a bug or an unfortunate race condition. In particular,

// Keep running complete write until
__tryWriteAgain.set(Boolean.FALSE);
do
{
    _runCompleteWrite.run();
} while (Boolean.TRUE.equals(__tryWriteAgain.get()));

seems like it could never terminate, since we never set __tryWriteAgain inside the do/while, only in a finally outside it.
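
To make the hypothesis concrete, here is a tiny standalone sketch of how such a retry loop can spin (hypothetical names, not Jetty's actual code): once the flag has been armed by an earlier "try again" request and the socket then goes away, every completeWrite() exits early, so nothing inside the loop ever clears the flag, fails the write, or makes progress:

public class TryAgainLivelockSketch
{
    private static final ThreadLocal<Boolean> TRY_AGAIN = new ThreadLocal<>();

    // Stand-in for WriteFlusher.completeWrite(): the flusher is already IDLE
    // and the socket is gone, so (as in the sketch above) it returns
    // immediately without completing or failing anything.
    private static void completeWrite()
    {
    }

    public static void main(String[] args)
    {
        TRY_AGAIN.set(Boolean.TRUE); // armed earlier by a reentrant "try again" request

        long spins = 0;
        do
        {
            completeWrite(); // no forward progress, no failure
            spins++;
        } while (Boolean.TRUE.equals(TRY_AGAIN.get()) && spins < 10_000_000); // extra guard only so this demo terminates

        System.out.println("still wanted to retry after " + spins + " iterations");
    }
}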

Thanks for any guidance or suggestions!

Best,
Steven
Greg Wilkins
2018-07-11 08:08:24 UTC
Permalink
Steven,

The loop appears to be in code that we have added as a partial solution to
https://github.com/eclipse/jetty.project/issues/2233

Your analysis sounds plausible. Is this reproducible for you, or has it so far been a one-off?

regards
--
Greg Wilkins <***@webtide.com> CTO http://webtide.com
Steven Schlansker
2018-07-12 21:10:45 UTC
Permalink
Thanks for taking a look.
Post by Greg Wilkins
Steven,
The loop appears to be in code that we have added as a partial solution to https://github.com/eclipse/jetty.project/issues/2233
Your analysis sounds plausible. Is this reproducible for you, or has it so far been a one-off?
regards
So far, a one-off. I've subscribed to the issue, and we'll certainly keep an eye out to see whether it happens again.