Jetty

Assist Hadoop to port to jetty-6 and optimize

Details

  • Type: Task Task
  • Status: Resolved Resolved
  • Priority: Minor Minor
  • Resolution: Fixed
  • Affects Version/s: None
  • Fix Version/s: 6.1.26, 7.2.1
  • Component/s: HTTP
  • Labels:
    None
  • Number of attachments :
    2
  1. JETTY-748.patch
    13/Apr/10 8:33 AM
    5 kB
    Greg Wilkins
  2. OpenCloseTest.java
    26/Aug/09 8:51 AM
    2 kB
    Greg Wilkins

Activity

Hide
Greg Wilkins added a comment -

Added RandomAccessFileBuffer and test harness.
Updated AbstractGenerator so that such buffers passed with sendContent are efficiently sent.

now testing.... then will advise hadoop how to use.

Show
Greg Wilkins added a comment - Added RandomAccessFileBuffer and test harness. Updated AbstractGenerator so that such buffers passed with sendContent are efficiently sent. now testing.... then will advise hadoop how to use.
Hide
Chris Douglas added a comment -

We're hitting a wall with transferTo in Java SE 6 (HADOOP-4700). It looks like a particularly buggy Linux implementation, and most of the issues aren't fixed until JDK7. There are a few more things to try, but we may have to abandon this particular route for now. Has anyone had more success with this on Linux? Is it worth experimenting with blocking I/O, or is that a non-starter for Jetty6?

On the bright side, dropping the transferTo experiment means we can tune the Server and MapOutputServlet in more conventional ways. Where should we start?

There's currently an open issue (HADOOP-4744) that may be related to the Jetty6 migration. In allocating a medium/large cluster (~400-500 nodes), it looks like some of the jetty servers are failing to start. If there are strategies for troubleshooting this, any guidance would be appreciated.

Show
Chris Douglas added a comment - We're hitting a wall with transferTo in Java SE 6 (HADOOP-4700). It looks like a particularly buggy Linux implementation, and most of the issues aren't fixed until JDK7. There are a few more things to try, but we may have to abandon this particular route for now. Has anyone had more success with this on Linux? Is it worth experimenting with blocking I/O, or is that a non-starter for Jetty6? On the bright side, dropping the transferTo experiment means we can tune the Server and MapOutputServlet in more conventional ways. Where should we start? There's currently an open issue (HADOOP-4744) that may be related to the Jetty6 migration. In allocating a medium/large cluster (~400-500 nodes), it looks like some of the jetty servers are failing to start. If there are strategies for troubleshooting this, any guidance would be appreciated.
Hide
Greg Wilkins added a comment -

David,

can you have a quick look at these issues and summarize them to me.
Also can you make sure that the RandomAccessFileBuffer and it's support are
backported from jetty-7 to jetty-6.

Show
Greg Wilkins added a comment - David, can you have a quick look at these issues and summarize them to me. Also can you make sure that the RandomAccessFileBuffer and it's support are backported from jetty-7 to jetty-6.
Hide
Devaraj Das added a comment -

I recently commented on HADOOP-4744. Please have a look. Thanks!

Show
Devaraj Das added a comment - I recently commented on HADOOP-4744. Please have a look. Thanks!
Hide
Chris Douglas added a comment -

We're seeing this pair of exceptions frequently:

org.mortbay.jetty.EofException
        at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:787)
        at org.mortbay.jetty.AbstractGenerator$Output.blockForOutput(AbstractGenerator.java:548)
        at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:569)
        at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:946)
        at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:646)
        at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:577)
        at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2872)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:324)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
Caused by: java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcher.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
        at sun.nio.ch.IOUtil.write(IOUtil.java:75)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
        at org.mortbay.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:169)
        at org.mortbay.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:221)
        at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:721)
        ... 24 more
java.lang.IllegalStateException: Committed
        at org.mortbay.jetty.Response.resetBuffer(Response.java:994)
        at org.mortbay.jetty.Response.sendError(Response.java:240)
        at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2895)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:324)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)

This is running trunk without attempting to use RandomAccessFileBuffer. Have we misconfigured something? Is it worth worrying about?

Show
Chris Douglas added a comment - We're seeing this pair of exceptions frequently:
org.mortbay.jetty.EofException
        at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:787)
        at org.mortbay.jetty.AbstractGenerator$Output.blockForOutput(AbstractGenerator.java:548)
        at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:569)
        at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:946)
        at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:646)
        at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:577)
        at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2872)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:324)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
Caused by: java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcher.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
        at sun.nio.ch.IOUtil.write(IOUtil.java:75)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
        at org.mortbay.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:169)
        at org.mortbay.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:221)
        at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:721)
        ... 24 more
java.lang.IllegalStateException: Committed
        at org.mortbay.jetty.Response.resetBuffer(Response.java:994)
        at org.mortbay.jetty.Response.sendError(Response.java:240)
        at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2895)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:324)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
This is running trunk without attempting to use RandomAccessFileBuffer. Have we misconfigured something? Is it worth worrying about?
Hide
Greg Wilkins added a comment -

The Eof exception implies that your client is closing the connection while content is still being sent. Can this happen?
If so, then you can silently ignore(but log in debug) this exception (but best not to close while writing content).

The second exception looks like an attempt to generate an error page for the first exception - should not be a problem if you silently ignore the first exception.

Show
Greg Wilkins added a comment - The Eof exception implies that your client is closing the connection while content is still being sent. Can this happen? If so, then you can silently ignore(but log in debug) this exception (but best not to close while writing content). The second exception looks like an attempt to generate an error page for the first exception - should not be a problem if you silently ignore the first exception.
Hide
Jan Bartel added a comment -

Greg,

I checked in the change to AbstractGenerator.isBufferFull() see svn rev 4213. Please review.

Jan

Show
Jan Bartel added a comment - Greg, I checked in the change to AbstractGenerator.isBufferFull() see svn rev 4213. Please review. Jan
Hide
David Yu added a comment -

The invalid port returned from Connector.getLocalPort() was extremely random.
The accepted solution was simply to restart the connector and get the new valid port when its restarted.

Show
David Yu added a comment - The invalid port returned from Connector.getLocalPort() was extremely random. The accepted solution was simply to restart the connector and get the new valid port when its restarted.
Hide
Owen O'Malley added a comment -

That is an unfortunate response to an observed race condition.

Show
Owen O'Malley added a comment - That is an unfortunate response to an observed race condition.
Hide
Greg Wilkins added a comment -

Owen,

We are working to remove this close race.
It's not yet 100% fixed, but we will address it soon.

I'll reopen this issue so that we can close it when we have the 100% fix, but I'll reduce the priority because the exception thrown is essentially just noise (in the case that we understand it).

Show
Greg Wilkins added a comment - Owen, We are working to remove this close race. It's not yet 100% fixed, but we will address it soon. I'll reopen this issue so that we can close it when we have the 100% fix, but I'll reduce the priority because the exception thrown is essentially just noise (in the case that we understand it).
Hide
Greg Wilkins added a comment -

Oh my previous comment was about the wrong race condition... I was talking about the issue if somebody closes a connection as it is being written to.

David was discussion a getLocalPort issue that I believe was fixed.
We have certainly not observed getLocalPort returning incorrect values and we heavily use it in our test harnesses.

But I'll reinvestigate anyway.

Show
Greg Wilkins added a comment - Oh my previous comment was about the wrong race condition... I was talking about the issue if somebody closes a connection as it is being written to. David was discussion a getLocalPort issue that I believe was fixed. We have certainly not observed getLocalPort returning incorrect values and we heavily use it in our test harnesses. But I'll reinvestigate anyway.
Hide
Greg Wilkins added a comment -

https://issues.apache.org/jira/browse/HADOOP-4744 is closed, so I think the getLocalPort issue is indeed fixed.
The open issue associated with this is https://issues.apache.org/jira/browse/MAPREDUCE-258
Apparently int is not sufficient size for hadoop, so perhaps another approach is required.

Show
Greg Wilkins added a comment - https://issues.apache.org/jira/browse/HADOOP-4744 is closed, so I think the getLocalPort issue is indeed fixed. The open issue associated with this is https://issues.apache.org/jira/browse/MAPREDUCE-258 Apparently int is not sufficient size for hadoop, so perhaps another approach is required.
Hide
Chris Douglas added a comment -

As David said earlier, the "resolution" for HADOOP-4744 was to check if the race was lost, and retry. It looks like the race condition still exists.

Show
Chris Douglas added a comment - As David said earlier, the "resolution" for HADOOP-4744 was to check if the race was lost, and retry. It looks like the race condition still exists.
Hide
Greg Wilkins added a comment -

OK, so I've looked at this again, and I just cannot see how the localPort can be -1 unless the open fails somehow.

I've written a test harness to try to reproduce the problem, but it does not for me.
Could somebody who see the problem on hadoop run the OpenCloseTest and see if they can reproduce.

thanks

Show
Greg Wilkins added a comment - OK, so I've looked at this again, and I just cannot see how the localPort can be -1 unless the open fails somehow. I've written a test harness to try to reproduce the problem, but it does not for me. Could somebody who see the problem on hadoop run the OpenCloseTest and see if they can reproduce. thanks
Hide
Greg Wilkins added a comment -

pass this test a list of interface addresses on the command line

Show
Greg Wilkins added a comment - pass this test a list of interface addresses on the command line
Hide
Chris Douglas added a comment -

I couldn't reproduce this with OpenCloseTest and our default interface. I'll try to reproduce in our specific use of Jetty in the next few weeks.

Show
Chris Douglas added a comment - I couldn't reproduce this with OpenCloseTest and our default interface. I'll try to reproduce in our specific use of Jetty in the next few weeks.
Hide
Joakim Erdfelt added a comment -

Looks like https://issues.apache.org/jira/browse/HADOOP-6528 is a re-emergence of this problem.

Show
Joakim Erdfelt added a comment - Looks like https://issues.apache.org/jira/browse/HADOOP-6528 is a re-emergence of this problem.
Hide
Joakim Erdfelt added a comment -

Adding tracking for similar (now closed) issue at https://issues.apache.org/jira/browse/HADOOP-6386

Show
Joakim Erdfelt added a comment - Adding tracking for similar (now closed) issue at https://issues.apache.org/jira/browse/HADOOP-6386
Hide
Greg Wilkins added a comment -

Joakim,

can you have a look at this? I've poured over the code and tried numerous tests and I fail to see how we can return -1 there.

thanks

Show
Greg Wilkins added a comment - Joakim, can you have a look at this? I've poured over the code and tried numerous tests and I fail to see how we can return -1 there. thanks
Hide
Steve Loughran added a comment -

One possibility is that something needs to be marked as volatile but it isn't, so its being cached or prefetched -and the change isn't been picked up. I've encountered that in some projects.

Show
Steve Loughran added a comment - One possibility is that something needs to be marked as volatile but it isn't, so its being cached or prefetched -and the change isn't been picked up. I've encountered that in some projects.
Hide
Greg Wilkins added a comment -

Hi,

we've been looking at the jetty code again and again and can't see how a volatile or synchronized block or any memory barrier could help - unless it is an issue within the JVM libraries themselves.

We've been running stop/start tests but also cannot reproduce.

It is possible that something else in hadoop is stopping the server before the call to get the port?

Can we get full details of the platforms on which this fails - operating system, JVM version etc. etc.

Show
Greg Wilkins added a comment - Hi, we've been looking at the jetty code again and again and can't see how a volatile or synchronized block or any memory barrier could help - unless it is an issue within the JVM libraries themselves. We've been running stop/start tests but also cannot reproduce. It is possible that something else in hadoop is stopping the server before the call to get the port? Can we get full details of the platforms on which this fails - operating system, JVM version etc. etc.
Hide
Greg Wilkins added a comment -

Attached is a patch that changes the way localports are obtained.
Instead of calling socket.getLocalPort() when connector.getLocalPort() is called, this patch calls socket.getLocalPort() in the open method and holds the result in a volatile int

If close is called, the int is set to -2, so this will tell us if the port has not been opened, or has been closed before it was used.

This patch is not yet in a release, but we can produce a snapshot build for the hadoop guys to test if that helps.

Show
Greg Wilkins added a comment - Attached is a patch that changes the way localports are obtained. Instead of calling socket.getLocalPort() when connector.getLocalPort() is called, this patch calls socket.getLocalPort() in the open method and holds the result in a volatile int If close is called, the int is set to -2, so this will tell us if the port has not been opened, or has been closed before it was used. This patch is not yet in a release, but we can produce a snapshot build for the hadoop guys to test if that helps.
Hide
Greg Wilkins added a comment -

It looks like we have the same failure in https://bugs.eclipse.org/bugs/show_bug.cgi?id=310634
so I'm definitely going to change the way we return the local port and get it at the time we open the server socket.

Show
Greg Wilkins added a comment - It looks like we have the same failure in https://bugs.eclipse.org/bugs/show_bug.cgi?id=310634 so I'm definitely going to change the way we return the local port and get it at the time we open the server socket.
Hide
Sachin Bochare added a comment -

Hi Greg,

I am running few tests on a 4 node hadoop cluster and hitting this problem at will.
You mentioned about snapshot build in last comment. Could you please provide me snapshot build so that I can try it out on my cluster.

Following are details of my environment:
OS : Ubuntu 8.04.1
Hadoop Version : 0.20.1
Jetty Version : 6.1.14

Error Messages:
2010-05-16 12:32:53,282 WARN org.apache.hadoop.mapred.ReduceTask: java.net.ConnectException: Connection refused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at sun.net._www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1368)
at java.security.AccessController.doPrivileged(Native Method)
at sun.net._www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1362)
at sun.net._www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1016)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1447)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1349)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:525)
at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
at sun.net._www.http.HttpClient.openServer(HttpClient.java:394)
at sun.net._www.http.HttpClient.openServer(HttpClient.java:529)
at sun.net._www.http.HttpClient.<init>(HttpClient.java:233)
at sun.net._www.http.HttpClient.New(HttpClient.java:306)
at sun.net._www.http.HttpClient.New(HttpClient.java:323)
at sun.net._www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:860)
at sun.net._www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:801)
at sun.net._www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:726)
at sun.net._www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1049)
... 4 more

Thanks,
Sachin

Show
Sachin Bochare added a comment - Hi Greg, I am running few tests on a 4 node hadoop cluster and hitting this problem at will. You mentioned about snapshot build in last comment. Could you please provide me snapshot build so that I can try it out on my cluster. Following are details of my environment: OS : Ubuntu 8.04.1 Hadoop Version : 0.20.1 Jetty Version : 6.1.14 Error Messages: 2010-05-16 12:32:53,282 WARN org.apache.hadoop.mapred.ReduceTask: java.net.ConnectException: Connection refused at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at sun.net._www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1368) at java.security.AccessController.doPrivileged(Native Method) at sun.net._www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1362) at sun.net._www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1016) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1447) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1349) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195) Caused by: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333) at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195) at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366) at java.net.Socket.connect(Socket.java:525) at sun.net.NetworkClient.doConnect(NetworkClient.java:158) at sun.net._www.http.HttpClient.openServer(HttpClient.java:394) at sun.net._www.http.HttpClient.openServer(HttpClient.java:529) at sun.net._www.http.HttpClient.<init>(HttpClient.java:233) at sun.net._www.http.HttpClient.New(HttpClient.java:306) at sun.net._www.http.HttpClient.New(HttpClient.java:323) at sun.net._www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:860) at sun.net._www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:801) at sun.net._www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:726) at sun.net._www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1049) ... 4 more Thanks, Sachin
Hide
Joakim Erdfelt added a comment -

We have started to see this port issue crop up in the past 2 months during our ever increasing unit testing.

The latest spate of problems were however fixed by simply updating our Linux instances with a higher number of file descriptors.

On one of the systems showing this local port issue, we noticed that the default setup for open files was at a mere 2,000.

[joakim@lapetus jetty]$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 10.04 LTS
Release:	10.04
Codename:	lucid

[joakim@lapetus jetty]$ ulimit -n
2000

So we updated the /etc/security/limits.conf to bump this number up to 20,000, and this solved our bad local port issues (after a reboot)

[joakim@lapetus jetty]$ grep nofile /etc/security/limits.conf 
#        - nofile - max number of open files
*	hard	nofile	40000
*	soft	nofile	20000

[joakim@lapetus jetty]$ ulimit -a | grep -i file
core file size          (blocks, -c) 0
file size               (blocks, -f) unlimited
open files                      (-n) 20000
file locks                      (-x) unlimited

This new ulimit helped our unit testing on the systems having issues.
Our analysis shows that the aggressive unit testing that we do starts and stops a jetty server (on a system assigned port, using special port #0) consumes the socket at a rate faster than they can be recycled back into the "open files" ulimit, and caused our unit tests to eventually all fail due to a "-1" local port.
See examples of error messages at https://bugs.eclipse.org/bugs/show_bug.cgi?id=310634

Having a loop that continues to attempt to start the server while looking for a valid port number (as seen in the Hadoop codebase) will not help in this excessive "open files" ulimit condition.
The best choice we've been able to come up with is to simply increase the "open files" ulimit.

  • Joakim
Show
Joakim Erdfelt added a comment - We have started to see this port issue crop up in the past 2 months during our ever increasing unit testing. The latest spate of problems were however fixed by simply updating our Linux instances with a higher number of file descriptors. On one of the systems showing this local port issue, we noticed that the default setup for open files was at a mere 2,000.
[joakim@lapetus jetty]$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 10.04 LTS
Release:	10.04
Codename:	lucid

[joakim@lapetus jetty]$ ulimit -n
2000
So we updated the /etc/security/limits.conf to bump this number up to 20,000, and this solved our bad local port issues (after a reboot)
[joakim@lapetus jetty]$ grep nofile /etc/security/limits.conf 
#        - nofile - max number of open files
*	hard	nofile	40000
*	soft	nofile	20000

[joakim@lapetus jetty]$ ulimit -a | grep -i file
core file size          (blocks, -c) 0
file size               (blocks, -f) unlimited
open files                      (-n) 20000
file locks                      (-x) unlimited
This new ulimit helped our unit testing on the systems having issues. Our analysis shows that the aggressive unit testing that we do starts and stops a jetty server (on a system assigned port, using special port #0) consumes the socket at a rate faster than they can be recycled back into the "open files" ulimit, and caused our unit tests to eventually all fail due to a "-1" local port. See examples of error messages at https://bugs.eclipse.org/bugs/show_bug.cgi?id=310634 Having a loop that continues to attempt to start the server while looking for a valid port number (as seen in the Hadoop codebase) will not help in this excessive "open files" ulimit condition. The best choice we've been able to come up with is to simply increase the "open files" ulimit.
  • Joakim
Hide
Joakim Erdfelt added a comment -

Sachin,

Greg has posted a patch against Jetty 6 (attachment JETTY-748.patch) that backports the fixes for this from Jetty 7.
I can build a patched Jetty 6 for you to test against if you need it. (Note: Jetty 6 is in maintenance mode)

Show
Joakim Erdfelt added a comment - Sachin, Greg has posted a patch against Jetty 6 (attachment JETTY-748.patch) that backports the fixes for this from Jetty 7. I can build a patched Jetty 6 for you to test against if you need it. (Note: Jetty 6 is in maintenance mode)
Hide
Greg Wilkins added a comment -

I think I've finally found the underlying cause of the -1 return for the local port.

In the finally block of the run method of AbstractConnector, there is a check to see if the thread is for acceptor==0 and if so, it closes the server socket. The problem is that the acceptor thread apparently can complete after the next open of the connector (in a stop/start) scenario. So the socket is opened and then closed a short time later.

Quick fix is to remove this close (not needed as doStop does a close).

But a better fix will be to make sure that calls to Server.doStop wait until the thread pool has stopped all threads before returning.

Show
Greg Wilkins added a comment - I think I've finally found the underlying cause of the -1 return for the local port. In the finally block of the run method of AbstractConnector, there is a check to see if the thread is for acceptor==0 and if so, it closes the server socket. The problem is that the acceptor thread apparently can complete after the next open of the connector (in a stop/start) scenario. So the socket is opened and then closed a short time later. Quick fix is to remove this close (not needed as doStop does a close). But a better fix will be to make sure that calls to Server.doStop wait until the thread pool has stopped all threads before returning.

People

Vote (1)
Watch (6)

Dates

  • Created:
    Updated:
    Resolved: