Details
-
- JETTY-748.patch
- 13/Apr/10 8:33 AM
- 5 kB
- Greg Wilkins
-
- OpenCloseTest.java
- 26/Aug/09 8:51 AM
- 2 kB
- Greg Wilkins
Activity
We're hitting a wall with transferTo in Java SE 6 (HADOOP-4700). It looks like a particularly buggy Linux implementation, and most of the issues aren't fixed until JDK7. There are a few more things to try, but we may have to abandon this particular route for now. Has anyone had more success with this on Linux? Is it worth experimenting with blocking I/O, or is that a non-starter for Jetty6?
On the bright side, dropping the transferTo experiment means we can tune the Server and MapOutputServlet in more conventional ways. Where should we start?
There's currently an open issue (HADOOP-4744) that may be related to the Jetty6 migration. In allocating a medium/large cluster (~400-500 nodes), it looks like some of the jetty servers are failing to start. If there are strategies for troubleshooting this, any guidance would be appreciated.
David,
can you have a quick look at these issues and summarize them to me.
Also can you make sure that the RandomAccessFileBuffer and it's support are
backported from jetty-7 to jetty-6.
We're seeing this pair of exceptions frequently:
org.mortbay.jetty.EofException
at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:787)
at org.mortbay.jetty.AbstractGenerator$Output.blockForOutput(AbstractGenerator.java:548)
at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:569)
at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:946)
at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:646)
at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:577)
at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2872)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcher.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
at sun.nio.ch.IOUtil.write(IOUtil.java:75)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
at org.mortbay.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:169)
at org.mortbay.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:221)
at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:721)
... 24 more
java.lang.IllegalStateException: Committed
at org.mortbay.jetty.Response.resetBuffer(Response.java:994)
at org.mortbay.jetty.Response.sendError(Response.java:240)
at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2895)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
This is running trunk without attempting to use RandomAccessFileBuffer. Have we misconfigured something? Is it worth worrying about?
The Eof exception implies that your client is closing the connection while content is still being sent. Can this happen?
If so, then you can silently ignore(but log in debug) this exception (but best not to close while writing content).
The second exception looks like an attempt to generate an error page for the first exception - should not be a problem if you silently ignore the first exception.
Greg,
I checked in the change to AbstractGenerator.isBufferFull() see svn rev 4213. Please review.
Jan
Owen,
We are working to remove this close race.
It's not yet 100% fixed, but we will address it soon.
I'll reopen this issue so that we can close it when we have the 100% fix, but I'll reduce the priority because the exception thrown is essentially just noise (in the case that we understand it).
Oh my previous comment was about the wrong race condition... I was talking about the issue if somebody closes a connection as it is being written to.
David was discussion a getLocalPort issue that I believe was fixed.
We have certainly not observed getLocalPort returning incorrect values and we heavily use it in our test harnesses.
But I'll reinvestigate anyway.
https://issues.apache.org/jira/browse/HADOOP-4744 is closed, so I think the getLocalPort issue is indeed fixed.
The open issue associated with this is https://issues.apache.org/jira/browse/MAPREDUCE-258
Apparently int is not sufficient size for hadoop, so perhaps another approach is required.
As David said earlier, the "resolution" for HADOOP-4744 was to check if the race was lost, and retry. It looks like the race condition still exists.
OK, so I've looked at this again, and I just cannot see how the localPort can be -1 unless the open fails somehow.
I've written a test harness to try to reproduce the problem, but it does not for me.
Could somebody who see the problem on hadoop run the OpenCloseTest and see if they can reproduce.
thanks
pass this test a list of interface addresses on the command line
I couldn't reproduce this with OpenCloseTest and our default interface. I'll try to reproduce in our specific use of Jetty in the next few weeks.
Looks like https://issues.apache.org/jira/browse/HADOOP-6528 is a re-emergence of this problem.
Adding tracking for similar (now closed) issue at https://issues.apache.org/jira/browse/HADOOP-6386
Joakim,
can you have a look at this? I've poured over the code and tried numerous tests and I fail to see how we can return -1 there.
thanks
One possibility is that something needs to be marked as volatile but it isn't, so its being cached or prefetched -and the change isn't been picked up. I've encountered that in some projects.
Hi,
we've been looking at the jetty code again and again and can't see how a volatile or synchronized block or any memory barrier could help - unless it is an issue within the JVM libraries themselves.
We've been running stop/start tests but also cannot reproduce.
It is possible that something else in hadoop is stopping the server before the call to get the port?
Can we get full details of the platforms on which this fails - operating system, JVM version etc. etc.
Attached is a patch that changes the way localports are obtained.
Instead of calling socket.getLocalPort() when connector.getLocalPort() is called, this patch calls socket.getLocalPort() in the open method and holds the result in a volatile int
If close is called, the int is set to -2, so this will tell us if the port has not been opened, or has been closed before it was used.
This patch is not yet in a release, but we can produce a snapshot build for the hadoop guys to test if that helps.
It looks like we have the same failure in https://bugs.eclipse.org/bugs/show_bug.cgi?id=310634
so I'm definitely going to change the way we return the local port and get it at the time we open the server socket.
Hi Greg,
I am running few tests on a 4 node hadoop cluster and hitting this problem at will.
You mentioned about snapshot build in last comment. Could you please provide me snapshot build so that I can try it out on my cluster.
Following are details of my environment:
OS : Ubuntu 8.04.1
Hadoop Version : 0.20.1
Jetty Version : 6.1.14
Error Messages:
2010-05-16 12:32:53,282 WARN org.apache.hadoop.mapred.ReduceTask: java.net.ConnectException: Connection refused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at sun.net._www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1368)
at java.security.AccessController.doPrivileged(Native Method)
at sun.net._www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1362)
at sun.net._www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1016)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1447)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1349)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:525)
at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
at sun.net._www.http.HttpClient.openServer(HttpClient.java:394)
at sun.net._www.http.HttpClient.openServer(HttpClient.java:529)
at sun.net._www.http.HttpClient.<init>(HttpClient.java:233)
at sun.net._www.http.HttpClient.New(HttpClient.java:306)
at sun.net._www.http.HttpClient.New(HttpClient.java:323)
at sun.net._www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:860)
at sun.net._www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:801)
at sun.net._www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:726)
at sun.net._www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1049)
... 4 more
Thanks,
Sachin
We have started to see this port issue crop up in the past 2 months during our ever increasing unit testing.
The latest spate of problems were however fixed by simply updating our Linux instances with a higher number of file descriptors.
On one of the systems showing this local port issue, we noticed that the default setup for open files was at a mere 2,000.
[joakim@lapetus jetty]$ lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 10.04 LTS Release: 10.04 Codename: lucid [joakim@lapetus jetty]$ ulimit -n 2000
So we updated the /etc/security/limits.conf to bump this number up to 20,000, and this solved our bad local port issues (after a reboot)
[joakim@lapetus jetty]$ grep nofile /etc/security/limits.conf # - nofile - max number of open files * hard nofile 40000 * soft nofile 20000 [joakim@lapetus jetty]$ ulimit -a | grep -i file core file size (blocks, -c) 0 file size (blocks, -f) unlimited open files (-n) 20000 file locks (-x) unlimited
This new ulimit helped our unit testing on the systems having issues.
Our analysis shows that the aggressive unit testing that we do starts and stops a jetty server (on a system assigned port, using special port #0) consumes the socket at a rate faster than they can be recycled back into the "open files" ulimit, and caused our unit tests to eventually all fail due to a "-1" local port.
See examples of error messages at https://bugs.eclipse.org/bugs/show_bug.cgi?id=310634
Having a loop that continues to attempt to start the server while looking for a valid port number (as seen in the Hadoop codebase) will not help in this excessive "open files" ulimit condition.
The best choice we've been able to come up with is to simply increase the "open files" ulimit.
- Joakim
Sachin,
Greg has posted a patch against Jetty 6 (attachment JETTY-748.patch) that backports the fixes for this from Jetty 7.
I can build a patched Jetty 6 for you to test against if you need it. (Note: Jetty 6 is in maintenance mode)
I think I've finally found the underlying cause of the -1 return for the local port.
In the finally block of the run method of AbstractConnector, there is a check to see if the thread is for acceptor==0 and if so, it closes the server socket. The problem is that the acceptor thread apparently can complete after the next open of the connector (in a stop/start) scenario. So the socket is opened and then closed a short time later.
Quick fix is to remove this close (not needed as doStop does a close).
But a better fix will be to make sure that calls to Server.doStop wait until the thread pool has stopped all threads before returning.
Added RandomAccessFileBuffer and test harness.
Updated AbstractGenerator so that such buffers passed with sendContent are efficiently sent.
now testing.... then will advise hadoop how to use.