Jetty / JETTY-1505

Jetty Configured with NIO hangs during Load test

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Bug
    • Affects Version/s: 8.1.2
    • Fix Version/s: None
    • Component/s: Client, Eclipse, NIO, Servlet
    • Labels: None
    • Environment:

      Linux 2.6.18-238.0.0.0.1.el5 #1 SMP Tue Jan 4 09:29:38 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
    • Number of attachments: 0

      Description

      Setup:
      ======
      I have a server (running on ip:port) whose I/O subsystem is managed by Jetty (using NIO). After the server receives an HTTP request, it processes the request and closes the client connection.
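      For illustration only, a minimal sketch of a servlet that behaves this way - answer the request, then have Jetty close the connection - might look like the code below. The class name and response body are hypothetical; this is not the reporter's actual server code (com.project.serve.Server).

      import java.io.IOException;
      import javax.servlet.ServletException;
      import javax.servlet.http.HttpServlet;
      import javax.servlet.http.HttpServletRequest;
      import javax.servlet.http.HttpServletResponse;

      // Hypothetical servlet: process the request, respond, and ask Jetty to
      // close the TCP connection once the response has been flushed.
      public class CloseAfterResponseServlet extends HttpServlet {
          @Override
          protected void doPost(HttpServletRequest req, HttpServletResponse resp)
                  throws ServletException, IOException {
              // ... process the request body here ...
              resp.setContentType("text/plain");
              // "Connection: close" causes Jetty to close the connection
              // after this response is sent, as described above.
              resp.setHeader("Connection", "close");
              resp.getWriter().println("ok");
              resp.getWriter().flush();
          }
      }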

      Under heavy load (described below), we observe that Jetty threads stop answering requests and appear to be wedged. Once the load subsides (connections terminate), Jetty recovers and normal operation resumes.

      While the server is unresponsive, we observed that it has 20K+ TCP connections in the CLOSE_WAIT state.

      Client test:
      ============

      The client test simulates a DoS attack. Every thread establishes a TCP connection to the server and sends a stream of requests for a specified time; the test then doubles the number of running threads. The request counts shown below are the cumulative sum of all requests sent by all client threads.
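      Purely as an illustration of the test shape, a thread-doubling client could be sketched as follows. This is not the actual test harness (which is built on the curl library); the URL, timeouts and phase length are assumptions.

      import java.net.HttpURLConnection;
      import java.net.URL;

      // Hypothetical sketch of a thread-doubling load client (not the real harness).
      public class DoublingLoadClient {

          static void runPhase(final int threads, final long durationMs) throws InterruptedException {
              Thread[] workers = new Thread[threads];
              for (int i = 0; i < threads; i++) {
                  workers[i] = new Thread(new Runnable() {
                      public void run() {
                          long end = System.currentTimeMillis() + durationMs;
                          while (System.currentTimeMillis() < end) {
                              try {
                                  // One request per connection; the server closes the connection after responding.
                                  HttpURLConnection c = (HttpURLConnection)
                                          new URL("http://server:4001/foo").openConnection();
                                  c.setConnectTimeout(1000);
                                  c.setReadTimeout(1000); // give up if the server is slow to respond
                                  c.getResponseCode();
                                  c.disconnect();
                              } catch (Exception e) {
                                  // counted as a failure/timeout in the real harness
                              }
                          }
                      }
                  });
                  workers[i].start();
              }
              for (Thread t : workers) {
                  t.join();
              }
          }

          public static void main(String[] args) throws InterruptedException {
              // 60, 120, 240, ... threads, doubling each phase as in the results below.
              for (int threads = 60; threads <= 1920; threads *= 2) {
                  runPhase(threads, 30000);
              }
          }
      }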

      You'll notice that once the number of threads increases to 1920, server throughput drops to 0 queries per second. At this point the server is completely unresponsive and all the Jetty threads are wedged. It is at this time that the number of CLOSE_WAIT connections on the server explodes. Once we dial down the number of client threads, service is restored.

      tcpdump shows that the client is sending requests but the server is responding only to a few requests and sometimes none.

      60 threads 60.00 seconds:
      server: count 64859 fail 0 (0.00%) timeout 0 (0.00%) input 1080.96qps output 1080.96qps mean 54.95ms median 53.96ms 80% 89.61ms 90% 102.63ms 95% 111.11ms 98% 119.56ms 99% 124.70ms
      120 threads 30.00 seconds:
      server: count 32773 fail 0 (0.00%) timeout 0 (0.00%) input 1092.40qps output 1092.40qps mean 109.39ms median 108.50ms 80% 179.89ms 90% 205.07ms 95% 219.47ms 98% 232.91ms 99% 240.86ms
      240 threads 30.00 seconds:
      server: count 32364 fail 0 (0.00%) timeout 0 (0.00%) input 1078.77qps output 1078.77qps mean 222.10ms median 220.74ms 80% 367.91ms 90% 416.30ms 95% 445.22ms 98% 468.16ms 99% 480.68ms
      480 threads 30.00 seconds:
      server: count 32589 fail 2 (0.01%) timeout 3 (0.01%) input 1086.27qps output 1086.20qps mean 440.30ms median 439.49ms 80% 730.94ms 90% 828.91ms 95% 876.85ms 98% 911.92ms 99% 933.24ms
      960 threads 30.00 seconds:
      server: count 32574 fail 16892 (51.86%) timeout 16970 (52.10%) input 1085.77qps output 522.72qps mean 883.31ms median 999.80ms 80% 1000.67ms 90% 1000.80ms 95% 1000.90ms 98% 1001.11ms 99% 1001.27ms
      1920 threads 30.00 seconds:
      server: count 57555 fail 57555 (100.00%) timeout 57555 (100.00%) input 1918.40qps output 0.00qps mean 1000.13ms median 1000.08ms 80% 1000.48ms 90% 1000.64ms 95% 1000.75ms 98% 1000.91ms 99% 1001.15ms
      1920 threads 30.00 seconds:
      server: count 46489 fail 46489 (100.00%) timeout 32386 (69.66%) input 1549.56qps output 0.00qps mean 701.38ms median 1000.84ms 80% 1003.44ms 90% 1006.38ms 95% 1008.35ms 98% 1010.06ms 99% 1011.00ms
      1920 threads 30.00 seconds:
      server: count 46427 fail 46427 (100.00%) timeout 31232 (67.27%) input 1547.36qps output 0.00qps mean 678.65ms median 1000.12ms 80% 1003.13ms 90% 1006.29ms 95% 1008.44ms 98% 1010.42ms 99% 1011.58ms
      960 threads 30.00 seconds:
      server: count 28781 fail 28781 (100.00%) timeout 28781 (100.00%) input 959.34qps output 0.00qps mean 1000.36ms median 1000.50ms 80% 1000.80ms 90% 1000.88ms 95% 1000.98ms 98% 1001.27ms 99% 1001.37ms
      480 threads 30.00 seconds:
      server: count 14396 fail 14396 (100.00%) timeout 14396 (100.00%) input 479.84qps output 0.00qps mean 1000.24ms median 1000.25ms 80% 1000.71ms 90% 1000.79ms 95% 1000.86ms 98% 1000.89ms 99% 1001.04ms
      240 threads 30.00 seconds:
      server: count 29609 fail 786 (2.65%) timeout 787 (2.66%) input 986.94qps output 960.74qps mean 245.45ms median 228.65ms 80% 380.21ms 90% 431.68ms 95% 469.08ms 98% 1000.65ms 99% 1000.79ms
      server: pass. After DOS attack, service restored.

      System params on server:
      ========================

      net.core.somaxconn = 4096
      net.core.netdev_max_backlog = 30000
      net.core.rmem_default = 2524287
      net.core.wmem_default = 2524287
      net.core.rmem_max = 240000000
      net.core.wmem_max = 240000000

      net.ipv4.tcp_congestion_control = cubic
      net.ipv4.tcp_rmem = 4096 87380 240000000
      net.ipv4.tcp_wmem = 4096 65536 240000000
      net.ipv4.tcp_mem = 761280 1015040 1522560
      net.ipv4.tcp_max_syn_backlog = 8192
      net.ipv4.tcp_tw_recycle = 1
      net.ipv4.tcp_syncookies = 1

      net.ipv4.ip_local_port_range = 32768 61000

      limits.conf on server:
      ======================

      * soft nofile 8192
      * hard nofile 450000

      Jetty config on the server:
      ===========================

      <Call name="addConnector">
        <Arg>
          <New class="org.eclipse.jetty.server.nio.SelectChannelConnector">
            <Set name="host"><Property name="jetty.host" /></Set>
            <Set name="port"><Property name="jetty.port" default="4001"/></Set>
            <Set name="maxIdleTime">300000</Set>
            <Set name="Acceptors">10</Set>
            <Set name="AcceptQueueSize">3000</Set>
            <Set name="statsOn">false</Set>
            <Set name="confidentialPort">8443</Set>
            <Set name="lowResourcesConnections">20000</Set>
            <Set name="lowResourcesMaxIdleTime">5000</Set>
          </New>
        </Arg>
      </Call>

      <Set name="ThreadPool">
        <!-- Default queued blocking threadpool -->
        <New class="org.eclipse.jetty.util.thread.QueuedThreadPool">
          <Set name="minThreads">10</Set>
          <Set name="maxThreads">500</Set>
          <Set name="detailedDump">false</Set>
        </New>
      </Set>
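      For reference, an embedded (programmatic) Jetty 8 setup equivalent to the XML above would look roughly like the sketch below. This is only an illustration of the same settings; the XML is what is actually deployed.

      import org.eclipse.jetty.server.Server;
      import org.eclipse.jetty.server.nio.SelectChannelConnector;
      import org.eclipse.jetty.util.thread.QueuedThreadPool;

      public class EmbeddedConfigSketch {
          public static void main(String[] args) throws Exception {
              // Thread pool matching the <Set name="ThreadPool"> block above.
              QueuedThreadPool pool = new QueuedThreadPool();
              pool.setMinThreads(10);
              pool.setMaxThreads(500);

              Server server = new Server();
              server.setThreadPool(pool);

              // Connector matching the SelectChannelConnector block above.
              SelectChannelConnector connector = new SelectChannelConnector();
              connector.setPort(4001);
              connector.setMaxIdleTime(300000);            // 5-minute idle timeout
              connector.setAcceptors(10);
              connector.setAcceptQueueSize(3000);
              connector.setStatsOn(false);
              connector.setConfidentialPort(8443);
              connector.setLowResourcesConnections(20000); // low-resources mode above 20k connections
              connector.setLowResourcesMaxIdleTime(5000);  // idle timeout while in low-resources mode
              server.addConnector(connector);

              server.start();
              server.join();
          }
      }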

      Server Config:
      ==============

      Linux 2.6.18-238.0.0.0.1.el5 #1 SMP Tue Jan 4 09:29:38 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

      Exceptions we see in the server code when Jetty threads are wedged:
      ===================================================================

      2012-03-28 22:19:03,815 ERROR [qtp1442119391-1073] (Server.java:277) - error processing request to /foo
      org.eclipse.jetty.io.EofException
      at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:952)
      at org.eclipse.jetty.http.AbstractGenerator.blockForOutput(AbstractGenerator.java:518)
      at org.eclipse.jetty.http.AbstractGenerator.flush(AbstractGenerator.java:442)
      at org.eclipse.jetty.server.HttpOutput.flush(HttpOutput.java:94)
      at org.eclipse.jetty.server.AbstractHttpConnection$Output.flush(AbstractHttpConnection.java:1022)
      at com.project.serve.Server.doPost(Server.java:273)
      at javax.servlet.http.HttpServlet.service(HttpServlet.java:755)
      at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
      at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:594)
      at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:486)
      at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
      at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
      at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
      at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
      at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
      at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
      at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
      at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
      at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
      at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
      at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
      at org.eclipse.jetty.server.Server.handle(Server.java:351)
      at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
      at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:900)
      at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:954)
      at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:857)
      at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
      at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:77)
      at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:609)
      at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:45)
      at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
      at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
      at java.lang.Thread.run(Thread.java:722)
      Caused by: java.io.IOException: Broken pipe
      at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
      at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
      at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:89)
      at sun.nio.ch.IOUtil.write(IOUtil.java:46)
      at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
      at org.eclipse.jetty.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:292)
      at org.eclipse.jetty.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:344)
      at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:888)
      ... 32 more

      strace output on the server:
      ============================

      % time seconds usecs/call calls errors syscall
      ------ ----------- ----------- --------- --------- ----------------
      98.34 13039.218984 74181 175775 29640 futex
      1.49 197.349711 2223 88766 epoll_wait
      0.03 3.797560 48 78371 32991 write
      0.03 3.687008 50 74097 read
      0.02 2.920068 47 62065 31223 epoll_ctl
      0.01 1.720053 43 39766 fcntl
      0.01 1.604560 49 33040 32975 rt_sigreturn
      0.01 1.592088 51 31223 dup2
      0.01 1.591610 51 31329 close
      0.01 1.562989 50 31124 writev
      0.01 1.227706 77 15914 mprotect
      0.01 1.061175 40 26442 getsockname
      0.01 1.028789 39 26440 setsockopt
      0.00 0.612548 46 13220 accept
      0.00 0.503092 45 11169 clk_gettime
      0.00 0.149363 51 2915 106 stat
      0.00 0.096893 43 2235 sched_yield
      0.00 0.040313 45 894 brk
      0.00 0.024888 45 549 465 shutdown
      0.00 0.010789 51 212 getdents
      0.00 0.007996 2665 3 madvise
      0.00 0.005129 48 106 open
      0.00 0.004593 124 37 mmap
      0.00 0.004525 98 46 lstat
      0.00 0.004373 729 6 munmap
      0.00 0.000000 0 2 lseek
      ------ ----------- ----------- --------- --------- ----------------
      100.00 13259.826803 745746 127400 total

      Netstat on the server:
      ======================

      On the server we can see a large number of connections in the CLOSE_WAIT state while the test was running; the server did not close these connections even though the client did.

      netstat -a | grep CLOSE_WAIT | wc -l
      24124

      Example: a small snippet of the connections in the CLOSE_WAIT state:
      tcp 190 0 ls88.com:newoak ls78.com:52615 CLOSE_WAIT
      tcp 277 0 ls88.com:newoak ls78.com:52359 CLOSE_WAIT
      tcp 200 0 ls88.com:newoak ls78.com:58247 CLOSE_WAIT
      tcp 200 0 ls88.com:newoak ls78.com:57991 CLOSE_WAIT
      tcp 198 0 ls88.com:newoak ls78.com:57735 CLOSE_WAIT
      tcp 190 0 ls88.com:newoak ls78.com:57479 CLOSE_WAIT
      tcp 199 0 ls88.com:newoak ls78.com:59271 CLOSE_WAIT
      tcp 260 0 ls88.com:newoak ls78.com:59015 CLOSE_WAIT
      tcp 260 0 ls88.com:newoak ls78.com:58759 CLOSE_WAIT
      tcp 198 0 ls88.com:newoak ls78.com:58503 CLOSE_WAIT
      tcp 196 0 ls88.com:newoak ls78.com:60295 CLOSE_WAIT
      tcp 195 0 ls88.com:newoak ls78.com:60039 CLOSE_WAIT
      tcp 556 0 ls88.com:newoak ls78.com:59783 CLOSE_WAIT
      tcp 239 0 ls88.com:newoak ls78.com:60551 CLOSE_WAIT
      tcp 240 0 ls88.com:newoak ls78.com:37510 CLOSE_WAIT
      tcp 200 0 ls88.com:newoak ls78.com:37766 CLOSE_WAIT
      :::::::::::::::::::::::

        Activity

        Jan Bartel added a comment -

        Rama,

        I'm not convinced this is a bug; at this point it seems more like an issue of how to configure the server correctly for load testing.

        Have you read and applied the suggestions here: http://wiki.eclipse.org/Jetty/Howto/High_Load

        From what you have said, it seems like the client is suddenly dropping the connection rather than waiting for the TCP close to complete properly - as witnessed by the stack trace you provided, and by the large number of connections in CLOSE_WAIT on the server side, which means the closing handshake has not been completed. Is the client perhaps configured to do this deliberately?

        If the rate at which the client opens new connections is high enough, you may run out of file descriptors on the server side. You may want to try lowering lowResourcesConnections and lowResourcesMaxIdleTime on the connector so that connections time out faster.

        Also, do you have 10 cores in the machine? Usually we recommend 1 acceptor per core.

        regards
        Jan

        rama added a comment -

        Jan,

        Yes most of the settings I used were from: http://wiki.eclipse.org/Jetty/Howto/High_Load

        Connection closes are initiated by the server, but the client will drop a connection if the server doesn't respond within a specified time (5ms), at which point the client tries to re-establish the connection. This is the nature of the test.

        However, it is surprising that the server got into that wedged state in the first place. The server normally has sub-millisecond response times. As the load increased, the server went into this unresponsive state, causing the client to time out and re-establish connections (this happens inside the curl library). This probably built up the CLOSE_WAIT connections on the server.

        It is unclear why the server stopped responding as the load increased.

        I think you are talking about this one:
        <Set name="lowResourcesMaxIdleTime">5000</Set>
        What do you think this value should be? Less than 5ms when the load is high, so the server can close connections faster?

        Actually, there are 16 cores on the box and I intend to increase the number of acceptors further; 10 was an experiment.

        I'll try to play with lowResourcesMaxIdleTime and see if it makes a difference.

        Thanks,
        Rama

        Greg Wilkins added a comment -

        Rama,

        the key question to answer is: are the CLOSE_WAIT connections a symptom or a cause? That is, did the server hang and a bunch of CLOSE_WAIT connections result, or did the client create so many CLOSE_WAIT connections that they exhausted the file descriptors and thus hung the server?

        I've never found CURL to produce a realistic load test for most circumstances, as most traffic is not short connections with only a single request.
        Also, 5ms is a pretty low threshold to start discarding connections - a single JVM GC could make all current requests 5ms late, at which point CURL will discard them all and create new connections, making more garbage, making another GC more likely, and you could be cascading down from there.

        Generally speaking, you want the server to give up on a connection before the client does; otherwise you leave the server hanging, attempting a graceful close with a client that no longer wants to speak. I would consider configuring curl with a max latency of 50ms and the server with a lowResourcesMaxIdleTime of 10ms.

        But even then, these are not realistic tests. Real networks have RTTs of 20-300ms, so load profiles are hugely different from a local client hammering away in a tight loop with a 1ms RTT.
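        In terms of the connector configuration shown in the description, Greg's suggestion would amount to a change along these lines (his suggested value, not a tested setting):

        <Set name="lowResourcesConnections">20000</Set>
        <Set name="lowResourcesMaxIdleTime">10</Set> <!-- was 5000; roughly 10ms as suggested -->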

        rama added a comment -

        Hi Greg,

        The test we are doing is a realistic test from our perspective. The server needs to close a client connection after it serves a request.

        The CLOSE_WAIT connections are a symptom. The server is taking too long to respond, causing the client to time out and re-establish a connection. I've set the client-side timeout to 1 second just to give the server a chance to respond.

        We have identified a potential problem on the server side that is causing the server to take too long (> 500ms) to respond - possibly a synchronization problem in our code. We'll let you know what we find, but at this point it looks like a problem in the server code rather than in Jetty.

        Thanks a lot for all the insight.

        Regards,
        Rama

        Jan Bartel added a comment -

        Rama,

        From your last comment it appears this is not related to Jetty, so I'm closing the issue.

        thanks
        Jan


          People

          • Assignee: Jan Bartel
          • Reporter: rama
          • Votes: 0
          • Watchers: 1


              Time Tracking

              Estimated: 1 week
              Remaining: 1 week
              Logged: Not Specified