Jetty / JETTY-748

Assist Hadoop to port to jetty-6 and optimize

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.1.26, 7.2.1
    • Component/s: HTTP
    • Labels:
      None
    • Number of attachments:
      2
    1. JETTY-748.patch
      5 kB
      Greg Wilkins
    2. OpenCloseTest.java
      2 kB
      Greg Wilkins

      Activity

      Greg Wilkins added a comment -

      It looks like we have the same failure in https://bugs.eclipse.org/bugs/show_bug.cgi?id=310634,
      so I'm definitely going to change the way we return the local port and instead capture it at the time we open the server socket.
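
      A minimal sketch of that idea (hypothetical names only, not the actual Jetty source) would cache the port as soon as the server socket is bound and return the cached value, so a later close can no longer turn it into -1:

      // Illustrative sketch: capture the local port when the server socket is opened,
      // instead of asking the (possibly already closed) socket for it later.
      import java.io.IOException;
      import java.net.ServerSocket;

      class SketchConnector {
          private ServerSocket serverSocket;
          private volatile int localPort = -1;   // -1 until the socket has been opened

          void open() throws IOException {
              serverSocket = new ServerSocket(0);        // port 0: let the OS pick a free port
              localPort = serverSocket.getLocalPort();   // capture it while the socket is known to be open
          }

          void close() throws IOException {
              if (serverSocket != null)
                  serverSocket.close();
              // localPort deliberately keeps the captured value instead of reverting to -1
          }

          int getLocalPort() {
              return localPort;   // safe even if the socket has since been closed
          }
      }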

      Sachin Bochare added a comment -

      Hi Greg,

      I am running a few tests on a 4-node Hadoop cluster and can hit this problem at will.
      You mentioned a snapshot build in your last comment. Could you please provide me with that snapshot build so that I can try it out on my cluster?

      The details of my environment are as follows:
      OS : Ubuntu 8.04.1
      Hadoop Version : 0.20.1
      Jetty Version : 6.1.14

      Error Messages:
      2010-05-16 12:32:53,282 WARN org.apache.hadoop.mapred.ReduceTask: java.net.ConnectException: Connection refused
      at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
      at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
      at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
      at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1368)
      at java.security.AccessController.doPrivileged(Native Method)
      at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1362)
      at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1016)
      at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1447)
      at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1349)
      at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
      at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
      Caused by: java.net.ConnectException: Connection refused
      at java.net.PlainSocketImpl.socketConnect(Native Method)
      at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
      at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
      at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
      at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
      at java.net.Socket.connect(Socket.java:525)
      at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
      at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
      at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
      at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
      at sun.net.www.http.HttpClient.New(HttpClient.java:306)
      at sun.net.www.http.HttpClient.New(HttpClient.java:323)
      at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:860)
      at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:801)
      at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:726)
      at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1049)
      ... 4 more

      Thanks,
      Sachin

      Joakim Erdfelt added a comment -

      We have started to see this port issue crop up in the past two months during our ever-increasing unit testing.

      The latest spate of problems was, however, fixed simply by configuring our Linux instances with a higher limit on file descriptors.

      On one of the systems showing this local-port issue, we noticed that the default limit on open files was a mere 2,000.

      [joakim@lapetus jetty]$ lsb_release -a
      No LSB modules are available.
      Distributor ID:	Ubuntu
      Description:	Ubuntu 10.04 LTS
      Release:	10.04
      Codename:	lucid
      
      [joakim@lapetus jetty]$ ulimit -n
      2000
      

      So we updated /etc/security/limits.conf to bump this number up to 20,000, and this solved our bad local-port issues (after a reboot):

      [joakim@lapetus jetty]$ grep nofile /etc/security/limits.conf 
      #        - nofile - max number of open files
      *	hard	nofile	40000
      *	soft	nofile	20000
      
      [joakim@lapetus jetty]$ ulimit -a | grep -i file
      core file size          (blocks, -c) 0
      file size               (blocks, -f) unlimited
      open files                      (-n) 20000
      file locks                      (-x) unlimited
      

      This new ulimit helped our unit testing on the systems having issues.
      Our analysis shows that our aggressive unit testing, which repeatedly starts and stops a Jetty server (on a system-assigned port, using the special port 0), consumes sockets faster than they can be recycled back under the "open files" ulimit, and this eventually caused all of our unit tests to fail with a "-1" local port.
      See examples of error messages at https://bugs.eclipse.org/bugs/show_bug.cgi?id=310634
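
      For reference, a loop of roughly this shape (a hypothetical sketch against the Jetty 6 API, not the attached OpenCloseTest.java) is enough to show the symptom once the descriptor limit is exhausted:

      // Hypothetical reproduction sketch: repeatedly start/stop a server on a
      // system-assigned port and check the port the connector reports.
      import org.mortbay.jetty.Server;
      import org.mortbay.jetty.bio.SocketConnector;

      public class PortZeroLoop {
          public static void main(String[] args) throws Exception {
              for (int i = 0; i < 10000; i++) {
                  Server server = new Server();
                  SocketConnector connector = new SocketConnector();
                  connector.setPort(0);               // special port 0: let the OS assign one
                  server.addConnector(connector);
                  server.start();

                  int port = connector.getLocalPort();
                  if (port <= 0) {
                      // The failure mode described above: once sockets are consumed faster
                      // than they are released back under the "open files" ulimit, the
                      // connector starts reporting -1 instead of a real port.
                      System.err.println("iteration " + i + ": bad local port " + port);
                  }

                  server.stop();
              }
          }
      }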

      Having a loop that keeps attempting to start the server while looking for a valid port number (as seen in the Hadoop codebase) will not help once the "open files" ulimit has been exhausted.
      The best option we have come up with is simply to increase the "open files" ulimit.

      • Joakim
      Joakim Erdfelt added a comment -

      Sachin,

      Greg has posted a patch against Jetty 6 (attachment JETTY-748.patch) that backports the fixes for this from Jetty 7.
      I can build a patched Jetty 6 for you to test against if you need it. (Note: Jetty 6 is in maintenance mode)

      Greg Wilkins added a comment -

      I think I've finally found the underlying cause of the -1 return for the local port.

      In the finally block of the run method of AbstractConnector, there is a check to see whether the thread is acceptor 0, and if so it closes the server socket. The problem is that the acceptor thread can apparently complete after the next open of the connector (in a stop/start scenario), so the socket is opened and then closed again a short time later.

      The quick fix is to remove this close (it is not needed, since doStop already closes the socket).

      A better fix will be to make sure that calls to Server.doStop wait until the thread pool has stopped all of its threads before returning.
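
      A simplified sketch of that shape (illustrative only, not the real AbstractConnector code) makes the race easier to see:

      // Each acceptor runs in its own thread; the original code closed the
      // server socket in the finally block of acceptor 0.
      class AcceptorSketch implements Runnable {
          private final int acceptorId;
          private final ConnectorSketch connector;

          AcceptorSketch(int acceptorId, ConnectorSketch connector) {
              this.acceptorId = acceptorId;
              this.connector = connector;
          }

          public void run() {
              try {
                  while (connector.isRunning())
                      connector.accept(acceptorId);
              } finally {
                  // Problem: this thread can still be winding down after the connector
                  // has been stopped and re-opened, so the close below may hit the *new*
                  // server socket and leave getLocalPort() returning -1.
                  //
                  // if (acceptorId == 0)
                  //     connector.close();
                  //
                  // Quick fix: drop the close here; doStop() already closes the socket.
              }
          }
      }

      interface ConnectorSketch {
          boolean isRunning();
          void accept(int acceptorId);
          void close();
      }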


        People

        • Assignee:
          Joakim Erdfelt
        • Reporter:
          Greg Wilkins
        • Votes:
          1
        • Watchers:
          6

          Dates

          • Created:
          • Updated:
          • Resolved: