Details
Description
Setup:
======
I have a Server(running on ip:port) with its IO-subsystem managed by Jetty (running with NIO). The server, after it receives an HTTP request, processes the request and closes the client connection.
Under heavy load (described below), we are observing that Jetty threads no longer answer requests and appear to be wedged. Once the load subsides(connections terminate), Jetty recovers and normal operation resumes.
When the server is not responding, we observed that the server has 20K+ tcp connections in CLOSE_WAIT state.
Client test:
============
The client test is a simulation of DOS attack. Basically every thread establishes a TCP connection to a server and sends a bunch of requests for a specified time. It then increments the number of threads running in the system by a factor of 2. The request count you see are a cumulative sum of all the requests sent by all the threads on the client.
You'll notice that once the number of threads increase to 1920, the server throughput drops to 0 queries per sec. At this point, the server is completely unresponsive. All the Jetty threads are wedged. It is at this time, the number of CLOSE_WAIT connections on the Server explode. Once we dial down the number of client threads, the service is restored on the server.
tcpdump shows that the client is sending requests but the server is responding only to a few requests and sometimes none.
60 threads 60.00 seconds:
server: count 64859 fail 0 (0.00%) timeout 0 (0.00%) input 1080.96qps output 1080.96qps mean 54.95ms median 53.96ms 80% 89.61ms 90% 102.63ms 95% 111.11ms 98% 119.56ms 99% 124.70ms
120 threads 30.00 seconds:
server: count 32773 fail 0 (0.00%) timeout 0 (0.00%) input 1092.40qps output 1092.40qps mean 109.39ms median 108.50ms 80% 179.89ms 90% 205.07ms 95% 219.47ms 98% 232.91ms 99% 240.86ms
240 threads 30.00 seconds:
server: count 32364 fail 0 (0.00%) timeout 0 (0.00%) input 1078.77qps output 1078.77qps mean 222.10ms median 220.74ms 80% 367.91ms 90% 416.30ms 95% 445.22ms 98% 468.16ms 99% 480.68ms
480 threads 30.00 seconds:
server: count 32589 fail 2 (0.01%) timeout 3 (0.01%) input 1086.27qps output 1086.20qps mean 440.30ms median 439.49ms 80% 730.94ms 90% 828.91ms 95% 876.85ms 98% 911.92ms 99% 933.24ms
960 threads 30.00 seconds:
server: count 32574 fail 16892 (51.86%) timeout 16970 (52.10%) input 1085.77qps output 522.72qps mean 883.31ms median 999.80ms 80% 1000.67ms 90% 1000.80ms 95% 1000.90ms 98% 1001.11ms 99% 1001.27ms
1920 threads 30.00 seconds:
server: count 57555 fail 57555 (100.00%) timeout 57555 (100.00%) input 1918.40qps output 0.00qps mean 1000.13ms median 1000.08ms 80% 1000.48ms 90% 1000.64ms 95% 1000.75ms 98% 1000.91ms 99% 1001.15ms
1920 threads 30.00 seconds:
server: count 46489 fail 46489 (100.00%) timeout 32386 (69.66%) input 1549.56qps output 0.00qps mean 701.38ms median 1000.84ms 80% 1003.44ms 90% 1006.38ms 95% 1008.35ms 98% 1010.06ms 99% 1011.00ms
1920 threads 30.00 seconds:
server: count 46427 fail 46427 (100.00%) timeout 31232 (67.27%) input 1547.36qps output 0.00qps mean 678.65ms median 1000.12ms 80% 1003.13ms 90% 1006.29ms 95% 1008.44ms 98% 1010.42ms 99% 1011.58ms
960 threads 30.00 seconds:
server: count 28781 fail 28781 (100.00%) timeout 28781 (100.00%) input 959.34qps output 0.00qps mean 1000.36ms median 1000.50ms 80% 1000.80ms 90% 1000.88ms 95% 1000.98ms 98% 1001.27ms 99% 1001.37ms
480 threads 30.00 seconds:
server: count 14396 fail 14396 (100.00%) timeout 14396 (100.00%) input 479.84qps output 0.00qps mean 1000.24ms median 1000.25ms 80% 1000.71ms 90% 1000.79ms 95% 1000.86ms 98% 1000.89ms 99% 1001.04ms
240 threads 30.00 seconds:
server: count 29609 fail 786 (2.65%) timeout 787 (2.66%) input 986.94qps output 960.74qps mean 245.45ms median 228.65ms 80% 380.21ms 90% 431.68ms 95% 469.08ms 98% 1000.65ms 99% 1000.79ms
server: pass. After DOS attack, service restored.
System params on server:
========================
net.core.somaxconn = 4096
net.core.netdev_max_backlog = 30000
net.core.rmem_default = 2524287
net.core.wmem_default = 2524287
net.core.rmem_max = 240000000
net.core.wmem_max = 240000000
net.ipv4.tcp_congestion_control = cubic
net.ipv4.tcp_rmem = 4096 87380 240000000
net.ipv4.tcp_wmem = 4096 65536 240000000
net.ipv4.tcp_mem = 761280 1015040 1522560
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_syncookies = 1
net.ipv4.ip_local_port_range = 32768 61000
limits.conf on server:
======================
- soft nofile 8192
- hard nofile 450000
Jetty config on the server:
===========================
<Call name="addConnector">
<Arg>
<New class="org.eclipse.jetty.server.nio.SelectChannelConnector">
<Set name="host"><Property name="jetty.host" /></Set>
<Set name="port"><Property name="jetty.port" default="4001"/></Set>
<Set name="maxIdleTime">300000</Set>
<Set name="Acceptors">10</Set>
<Set name="AcceptQueueSize">3000</Set>
<Set name="statsOn">false</Set>
<Set name="confidentialPort">8443</Set>
<Set name="lowResourcesConnections">20000</Set>
<Set name="lowResourcesMaxIdleTime">5000</Set>
</New>
</Arg>
</Call>
<Set name="ThreadPool">
<!-- Default queued blocking threadpool -->
<New class="org.eclipse.jetty.util.thread.QueuedThreadPool">
<Set name="minThreads">10</Set>
<Set name="maxThreads">500</Set>
<Set name="detailedDump">false</Set>
</New>
</Set>
Server Config:
==============
Linux 2.6.18-238.0.0.0.1.el5 #1 SMP Tue Jan 4 09:29:38 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
Exceptions we see in the server code when Jetty threads are wedged:
===================================================================
2012-03-28 22:19:03,815 ERROR [qtp1442119391-1073] (Server.java:277) - error processing request to /foo
org.eclipse.jetty.io.EofException
at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:952)
at org.eclipse.jetty.http.AbstractGenerator.blockForOutput(AbstractGenerator.java:518)
at org.eclipse.jetty.http.AbstractGenerator.flush(AbstractGenerator.java:442)
at org.eclipse.jetty.server.HttpOutput.flush(HttpOutput.java:94)
at org.eclipse.jetty.server.AbstractHttpConnection$Output.flush(AbstractHttpConnection.java:1022)
at com.project.serve.Server.doPost(Server.java:273)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:755)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:594)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:486)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
at org.eclipse.jetty.server.Server.handle(Server.java:351)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:900)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:954)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:857)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:77)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:609)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:45)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:89)
at sun.nio.ch.IOUtil.write(IOUtil.java:46)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
at org.eclipse.jetty.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:292)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:344)
at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:888)
... 32 more
strace output on the server:
============================
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
98.34 13039.218984 74181 175775 29640 futex
1.49 197.349711 2223 88766 epoll_wait
0.03 3.797560 48 78371 32991 write
0.03 3.687008 50 74097 read
0.02 2.920068 47 62065 31223 epoll_ctl
0.01 1.720053 43 39766 fcntl
0.01 1.604560 49 33040 32975 rt_sigreturn
0.01 1.592088 51 31223 dup2
0.01 1.591610 51 31329 close
0.01 1.562989 50 31124 writev
0.01 1.227706 77 15914 mprotect
0.01 1.061175 40 26442 getsockname
0.01 1.028789 39 26440 setsockopt
0.00 0.612548 46 13220 accept
0.00 0.503092 45 11169 clk_gettime
0.00 0.149363 51 2915 106 stat
0.00 0.096893 43 2235 sched_yield
0.00 0.040313 45 894 brk
0.00 0.024888 45 549 465 shutdown
0.00 0.010789 51 212 getdents
0.00 0.007996 2665 3 madvise
0.00 0.005129 48 106 open
0.00 0.004593 124 37 mmap
0.00 0.004525 98 46 lstat
0.00 0.004373 729 6 munmap
0.00 0.000000 0 2 lseek
------ ----------- ----------- --------- --------- ----------------
100.00 13259.826803 745746 127400 total
Netstat on the server:
======================
on the server (we can see that there are a ton of connections in CLOSE_WAIT state while the test was running):
the server did not close the connections even though, the client did.
netstat -a | grep CLOSE_WAIT | wc -l
24124
example: a small snippet of connections in CLOSE_WAIT state:
tcp 190 0 ls88.com:newoak ls78.com:52615 CLOSE_WAIT
tcp 277 0 ls88.com:newoak ls78.com:52359 CLOSE_WAIT
tcp 200 0 ls88.com:newoak ls78.com:58247 CLOSE_WAIT
tcp 200 0 ls88.com:newoak ls78.com:57991 CLOSE_WAIT
tcp 198 0 ls88.com:newoak ls78.com:57735 CLOSE_WAIT
tcp 190 0 ls88.com:newoak ls78.com:57479 CLOSE_WAIT
tcp 199 0 ls88.com:newoak ls78.com:59271 CLOSE_WAIT
tcp 260 0 ls88.com:newoak ls78.com:59015 CLOSE_WAIT
tcp 260 0 ls88.com:newoak ls78.com:58759 CLOSE_WAIT
tcp 198 0 ls88.com:newoak ls78.com:58503 CLOSE_WAIT
tcp 196 0 ls88.com:newoak ls78.com:60295 CLOSE_WAIT
tcp 195 0 ls88.com:newoak ls78.com:60039 CLOSE_WAIT
tcp 556 0 ls88.com:newoak ls78.com:59783 CLOSE_WAIT
tcp 239 0 ls88.com:newoak ls78.com:60551 CLOSE_WAIT
tcp 240 0 ls88.com:newoak ls78.com:37510 CLOSE_WAIT
tcp 200 0 ls88.com:newoak ls78.com:37766 CLOSE_WAIT
:::::::::::::::::::::::
Activity
| Field | Original Value | New Value |
|---|---|---|
| Assignee | Jan Bartel [ janb ] |
| Status | Open [ 1 ] | Resolved [ 5 ] |
| Resolution | Not A Bug [ 6 ] |