Details
-
Type:
Bug
-
Status:
Resolved
-
Priority:
Major
-
Resolution: Won't Fix
-
Affects Version/s: 7.1.6
-
Fix Version/s: None
-
Component/s: Servlet
-
Labels:None
-
Environment:Sun Solaris 11. Jetty 7.1.6.
-
Number of attachments :0
Description
I'm using WebSockets in production. Over time threads will start to hang in the following place:
WebSocketConnection.handle() line: 168
SelectChannelEndPoint.run() line: 506
QueuedThreadPool$2.run() line: 436
Thread.run() line: 619
A new thread hangs about once a day and after I few days I'm forced to restart my server.
Please let me know if there's any other info I can provide to help diagnose this bug!
Issue Links
- is related to
-
JETTY-1391
Apparent Race Hazard / Flushing Problem In Websockets
-
Activity
I was able to take a couple of snapshots of the threads that seemed to be eating 100% CPU. They are below. I actually wasn't able to get this thread to stop in the same place as I first described (WebSocketConnection line 168) but I'm still guessing this is the same bug.
Thread [qtp19097823-1224] (Suspended)
FileDispatcher.read0(FileDescriptor, long, int) line: not available [native method]
SocketDispatcher.read(FileDescriptor, long, int) line: 21
IOUtil.readIntoNativeBuffer(FileDescriptor, ByteBuffer, long, NativeDispatcher, Object) line: 237
IOUtil.read(FileDescriptor, ByteBuffer, long, NativeDispatcher, Object) line: 210
SocketChannelImpl.read(ByteBuffer) line: 236
SelectChannelEndPoint(ChannelEndPoint).fill(Buffer) line: 149
WebSocketConnection.handle() line: 147
SelectChannelEndPoint.run() line: 506
QueuedThreadPool$2.run() line: 436
Thread.run() line: 619
Thread [qtp19097823-1224] (Suspended)
NativeThread.current() line: not available [native method]
SocketChannelImpl.read(ByteBuffer) line: 182
SelectChannelEndPoint(ChannelEndPoint).fill(Buffer) line: 149
WebSocketConnection.handle() line: 147
SelectChannelEndPoint.run() line: 506
QueuedThreadPool$2.run() line: 436
Thread.run() line: 619
thanks.
Another thing that would be useful to know is if the connection is still open or not. Are you able to use lsof or similar to tell if there are websockets open to the server?
I've built a snapshot with some defensive handling of IOExceptions.
Could you try the latest in: http://oss.sonatype.org/content/repositories/jetty-snapshots/org/eclipse/jetty/jetty-distribution/7.2.0-SNAPSHOT/
My server has several hundred connections open at a time, most of those are http long-polls using continuations, but many of those are websocket connections. (I try to use WebSockets and fall back to long-polling when needed). I don't appear to have lsof on my server. Is there a Solaris lsof equivalent that you know of that I can try?
I'll give this latest snapshot a try and report back. Thanks!
Solaris doesn't come with lsof typically.
I don't have extensive Solaris experience, but I can come up with 2 choices for you.
1) You can install lsof yourself from http://www.sunfreeware.com/
2) You can use the information from the utilities in /usr/proc/bin/* to get the socket details. (pfiles is a good place to start looking. grep for AF_INET on the output)
Greg - I tried the new snapshot you provided (thank you!), but it didn't appear to help. Three threads went into infinite loops pretty much immediately. (It's an 8 core server, and 3 of the cores were spinning away on those threads). I think I have seen these threads do this immediately in the past as well, so I don't think the fact that they died immediately is a change from past behaviour.
They all had stack traces that looked like this:
FileDispatcher.read0(FileDescriptor, long, int) line: not available [native method]
SocketDispatcher.read(FileDescriptor, long, int) line: 21
IOUtil.readIntoNativeBuffer(FileDescriptor, ByteBuffer, long, NativeDispatcher, Object) line: 237
IOUtil.read(FileDescriptor, ByteBuffer, long, NativeDispatcher, Object) line: 210
SocketChannelImpl.read(ByteBuffer) line: 236
SelectChannelEndPoint(ChannelEndPoint).fill(Buffer) line: 149
WebSocketConnection.handle() line: 183
SelectChannelEndPoint.handle() line: 510
SelectChannelEndPoint.access$000(SelectChannelEndPoint) line: 34
SelectChannelEndPoint$1.run() line: 40
QueuedThreadPool$2.run() line: 450
Thread.run() line: 619
I know this is a "sleep" state, but when I ran those threads again they went back to using up 100% CPU. What other info can I provide? Should I try to get lsof working and send you that info? (like I said there are a ton of open connections though).
Thanks for the info Joakim! I'll check those out.
I'm also getting a ton of these thrown in my error logs:
IDLE SCEP@7381979 [d=true,io=1,w=true,rb=false,wb=false]
IDLE SCEP@4493926 [d=true,io=1,w=true,rb=false,wb=false]
IDLE SCEP@24839132 [d=true,io=1,w=true,rb=false,wb=false]
IDLE SCEP@32968678 [d=true,io=1,w=true,rb=false,wb=false]
IDLE SCEP@14629295 [d=true,io=1,w=true,rb=false,wb=false]
IDLE SCEP@24423429 [d=true,io=1,w=true,rb=false,wb=false]
IDLE SCEP@14070433 [d=true,io=1,w=true,rb=false,wb=false]
IDLE SCEP@13205487 [d=true,io=1,w=true,rb=false,wb=false]
I don't know if they're related, but when this starts happening people can't connect to my server at all it appears.
What do these mean?
This might be total coincidence, but this exception was thrown right about when a thread went into its infinite loop:
org.eclipse.jetty.io.EofException
at org.eclipse.jetty.websocket.WebSocketGenerator.flushBuffer(WebSocketGenerator.java:134)
at org.eclipse.jetty.websocket.WebSocketGenerator.addFrame(WebSocketGenerator.java:78)
at org.eclipse.jetty.websocket.WebSocketGenerator.addFrame(WebSocketGenerator.java:112)
at org.eclipse.jetty.websocket.WebSocketConnection.sendMessage(WebSocketConnection.java:243)
at org.eclipse.jetty.websocket.WebSocketConnection.sendMessage(WebSocketConnection.java:238)
at env.server.comet.WebSockConnection.enqueue(WebSockConnection.java:129)
at env.server.comet.ClientConnData.enqueueStatement(ClientConnData.java:83)
at env.server.comet.ClientInstanceBase.enqueueStatement(ClientInstanceBase.java:58)
at env.server.TabClient.handleInitialZoneState(TabClient.java:183)
at env.siteserv.client.ZoneConnection$ZoneJoinedHandler.onResponse(ZoneConnection.java:95)
at env.siteserv.client.ZoneConnection$ZoneJoinedHandler.onResponse(ZoneConnection.java:1)
at env.interserver.ISConnectionPool.handleRequestResponse(ISConnectionPool.java:174)
at env.interserver.ISConnectionSocketHandler$RxHandler.run(ISConnectionSocketHandler.java:191)
at java.lang.Thread.run(Thread.java:619)
Andrew,
sorry for the slow response here. I've now finally got some cycles to investigate this problem.
Do you have any additional information?
What browsers and versions are you seeing in your log? Any suggestion that it is any particular browser version that is causing the problem?
regards
Andrew,
can you also give us some information about your ws usage. Are you sending lots messages, or a few? Are you sending big messages or little ones? What about receiving?
Can you only reproduce in production?
If so, I assume you are currently resetting the server? Could you perhaps try instead the next time this happens disconnecting the server from the network and waiting 90s to see if the 100% continues (this will close all connections and will tell us if the problem is with existing live connections or old closed ones).
Greg - We had to take websockets out of our production code. Our server just couldn't stay up. I unfortunately have only so far been able to reproduce this issue in production.
My usage looks like:
-Sending messages of varying size (a few bytes all the way up to maybe 50k)
-Receiving small messages < 1k
-Sending "lots" of messages I guess. A busy user would need to be sent several per second I think.
If I can reproduce this in testing I'll try what you describe about disconnecting the server from the network (it's a virtualized server though... so I'll have to figure out how to do that). If I'm able to do this I'll report back, though it may be a while.
I wonder if this relates to a bizarre bug I've been seeing...
1. WS onConnect
2. WS client sends a ~50 byte message
3. WS client sends a ~500 bytes message
4. WS client sends a ~50 byte message
5. WS client sends a ~100 byte message
If the client is 'fast' - i.e. a snippet of Javascript on localhost, the whole socket will block around step 4/5 and only flush when the client disconnects (at which point the remaining messages are suddenly handed to my WebSocket subclass).
Of interest is that the point at which the blocking occurs always seems to have a different thread ID to the previous step. So what if both my problem and the one that Andrew reported are to do with a non-synchronizing race hazard in something like WebSocketParser.parseNext(). Does, for example, the _state manipulation assume wrongly that a given thread will run to completion upon the receipt of some new data before the next thread's contribution is read in?
Additional note: by making my test app respond with a 50ish byte message after every one that's received, the blockage no longer occurs - perhaps because the synchronized blocks on the generator are creating a mutex-by-proxy effect?
Hiya - I wondered if there is a game-plan for attacking this bug? It's bitten us again in production (just came back from the weekend to discover most of our servers at max CPU with 0-2 active connections apiece). I could try and have a go myself but I'm concerned that, given the complexity of the state machine and the need to understand the Jetty acceptor / handling lifecycle that I will be far less effective than the code's original author.
Well we need a tame version of the bug that we can conduct scientific experiments on while in captivity.
Failing that, we need some thread dumps - about 1s apart - of suffering server.
Note that we have fixed a couple of issues in jetty's selector that may improve the websockets experience (we also have implemented draft 6 websockets).
So it may be worthwhile testing 7.3.1 (to be released in a few days), and try to replicate on that.
websockets has been substantially reworked in 7.4.x, so I'm closing this one.
Please open a new issue if you can reproduce on 7.4.1 or later.
what would be good to see is several thread dumps a few milliseconds apart, that way I can get a sense of the loop that the code is running through.
thanks