Jetty

continuation may cause 100%cpu occasionally!

Details

  • Type: Bug Bug
  • Status: Resolved Resolved
  • Priority: Major Major
  • Resolution: Fixed
  • Affects Version/s: 6.0.0beta8, 6.0.0beta9, 6.0.0beta10, 6.0.0beta11, 6.0.0beta12, 6.0.0beta14, 6.0.0beta15, 6.0.0beta16, 6.0.0beta17, 6.0.0beta18, 6.0.0RC0, 6.0.0rc1, 6.0.0rc2, 6.0.0rc3, 6.0.0rc4, 6.0.0, 6.0.1, 6.0.2, 6.0.3, 6.1.0pre0, 6.1.0pre1, 6.1.0pre2, 6.1.0pre3, 6.1.0rc0, 6.1.0rc1, 6.1.0rc2, 6.1.0rc3, 6.1.0, 6.1.1rc0, 6.1.1rc1, 6.1.1, 6.1.2rc0
  • Fix Version/s: 6.1.2rc2
  • Component/s: Continuations
  • Labels:
    None
  • Environment:
    jdk1.5 gentoo-linux
  • Number of attachments :
    1

Description

continuation may cause 100%cpu occasionally.

sun.nio.ch.SelectorImpl.select(long) invoked 314658 times in 16889ms ! (see the YourKit snapshot )

may be the blocking Selector.select method stop blocking occasionally.
CODE SelectorManager.doSelect [
if (wait > 0)
_selector.select(wait); <====================== may return immediately.
else if (wait == 0)
_selector.selectNow();
else
_selector.select();
]

suggest code [
if (wait > 0) { int ret = _selector.select(wait); if (ret == 0) Thread.sleep(1000); //do something }
else if (wait == 0)
_selector.selectNow();
else
_selector.select();

]

Activity

Hide
Jan Bartel added a comment -

Boyce,

What is the configuration on your SelectChannelConnector? Also, what kind of io load were you running during the profiling?

regards, Jan

Show
Jan Bartel added a comment - Boyce, What is the configuration on your SelectChannelConnector? Also, what kind of io load were you running during the profiling? regards, Jan
Hide
Boyce Lee added a comment -

the SelectChannelConnector configuration :
<Call name="addConnector">
<Arg>
<New class="org.mortbay.jetty.nio.SelectChannelConnector">
<Set name="port"><SystemProperty name="jetty.port" default="80"/></Set>
<Set name="maxIdleTime">30000</Set>
<Set name="Acceptors">1</Set>
<Set name="StatsOn">true</Set>
</New>
</Arg>
</Call>

the JAVA_OPTIONS:
-agentlib:yjpagent -Xmn256M -Xms512M -Xmx512M -Xss256k -XX:SurvivorRatio=6 -XX:+DisableExplicitGC -XX:MaxDirectMemorySize=128M

when i start jetty without yourkit, it still cause 100%cpu occasionally.

the duration of full cpu occupation(100% cpu) is close to timeout of 'Continuation.suspend(timeout)' by value,and after a while(the timeout) the cpu usage come down to the normal level(0%).

Show
Boyce Lee added a comment - the SelectChannelConnector configuration : <Call name="addConnector"> <Arg> <New class="org.mortbay.jetty.nio.SelectChannelConnector"> <Set name="port"><SystemProperty name="jetty.port" default="80"/></Set> <Set name="maxIdleTime">30000</Set> <Set name="Acceptors">1</Set> <Set name="StatsOn">true</Set> </New> </Arg> </Call> the JAVA_OPTIONS: -agentlib:yjpagent -Xmn256M -Xms512M -Xmx512M -Xss256k -XX:SurvivorRatio=6 -XX:+DisableExplicitGC -XX:MaxDirectMemorySize=128M when i start jetty without yourkit, it still cause 100%cpu occasionally. the duration of full cpu occupation(100% cpu) is close to timeout of 'Continuation.suspend(timeout)' by value,and after a while(the timeout) the cpu usage come down to the normal level(0%).
Hide
Greg Wilkins added a comment -

Can't reproduce.

Can you tell us more about how you are using continuations? any sample code?

It should be OK for select(wait) to return immediately - so long as it eventually sleeps
for the wait time.

Show
Greg Wilkins added a comment - Can't reproduce. Can you tell us more about how you are using continuations? any sample code? It should be OK for select(wait) to return immediately - so long as it eventually sleeps for the wait time.
Hide
Boyce Lee added a comment -

I found the reason why CPU usage was 100%.

When Continuation is in suspend, if users with some ie versions click the Stop button of ie toolbar or press the Esc key, CPU usage will be 100%.

Through analysing the tcp packet with net sniffer, i know that when the tcp packet with tcp flag RST and ACK set are sent to jetty, CPU usage will be 100%.

And i also found that the isResumed method of RetryContinuation class is not correct. See the followings:

public boolean isResumed()
{
return _pending;
}

My temporary resolution is:

Add rules to iptables: iptables -A INPUT -p tcp --dport 80 --tcp-flags RST,ACK RST,ACK -j DROP
Modify the isResumed method of RetryContinuation:

public boolean isResumed()
{
return _resumed;
}

Show
Boyce Lee added a comment - I found the reason why CPU usage was 100%. When Continuation is in suspend, if users with some ie versions click the Stop button of ie toolbar or press the Esc key, CPU usage will be 100%. Through analysing the tcp packet with net sniffer, i know that when the tcp packet with tcp flag RST and ACK set are sent to jetty, CPU usage will be 100%. And i also found that the isResumed method of RetryContinuation class is not correct. See the followings: public boolean isResumed() { return _pending; } My temporary resolution is: Add rules to iptables: iptables -A INPUT -p tcp --dport 80 --tcp-flags RST,ACK RST,ACK -j DROP Modify the isResumed method of RetryContinuation: public boolean isResumed() { return _resumed; }
Hide
Greg Wilkins added a comment -

I have fixed the isResumed method (thanks),

but I can't reproduce the 100% CPU.

Can you reproduce with the demo chat room?

Show
Greg Wilkins added a comment - I have fixed the isResumed method (thanks), but I can't reproduce the 100% CPU. Can you reproduce with the demo chat room?
Hide
Boyce Lee added a comment -

To reproduce the bug, please do as followings:

1. Write a simple servlet program, the code is:

public class TestServlet extends HttpServlet {
protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
Continuation continuation = ContinuationSupport.getContinuation(request,null);
if (continuation.isPending()) { continuation.reset(); System.out.println("============= resumed or timeout ================"); PrintWriter out = null; out = response.getWriter(); out.write("OK, resumed: " + continuation.isResumed()); out.flush(); }
else { System.out.println("============= suspend(60000) ================"); continuation.suspend(60000); }
}
}

2. Deploy the jetty server in Linux (the tested Linux versions are gentoo 2.6.17-gentoo-r8,redhat 2.6.9-5.EL, the bug may not be shown in Windows)

3. Use the IE browser in Windows 2003 to access to the servlet, and click the Stop button on ie toolbar during the process of waiting (servlet request is in suspend). (the ie behavior of winxp is different with that of win2003, it won't send the tcp packet which tcp flag RST and ACK are set to server)

Show
Boyce Lee added a comment - To reproduce the bug, please do as followings: 1. Write a simple servlet program, the code is: public class TestServlet extends HttpServlet { protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException { Continuation continuation = ContinuationSupport.getContinuation(request,null); if (continuation.isPending()) { continuation.reset(); System.out.println("============= resumed or timeout ================"); PrintWriter out = null; out = response.getWriter(); out.write("OK, resumed: " + continuation.isResumed()); out.flush(); } else { System.out.println("============= suspend(60000) ================"); continuation.suspend(60000); } } } 2. Deploy the jetty server in Linux (the tested Linux versions are gentoo 2.6.17-gentoo-r8,redhat 2.6.9-5.EL, the bug may not be shown in Windows) 3. Use the IE browser in Windows 2003 to access to the servlet, and click the Stop button on ie toolbar during the process of waiting (servlet request is in suspend). (the ie behavior of winxp is different with that of win2003, it won't send the tcp packet which tcp flag RST and ACK are set to server)
Hide
Greg Wilkins added a comment -

It is going to take me some time to find a windows 2003 setup.... if anybody has a legal CD they can sell me then I will run it on vmware.

In the mean time, I can suggest that you restructure your code a little bit:

<pre>

protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException
{
Continuation continuation = ContinuationSupport.getContinuation(request,null);

if (!continuation.isPending())

{ System.err.println("============= suspend(5000) ================"); continuation.suspend(5000); }

continuation.reset();
System.err.println("============= resumed or timeout ================");
PrintWriter out = null;
response.setContentType("text/plain");
out = response.getWriter();
out.write("OK, resumed: " + continuation.isResumed());
out.flush();

}

</pre>

This reflects the fact that suspend may actually block. It probably wont fix the 100% loop.
Can you put a println in the start of the doGet method to see if the 100% loop includes the servlet

Show
Greg Wilkins added a comment - It is going to take me some time to find a windows 2003 setup.... if anybody has a legal CD they can sell me then I will run it on vmware. In the mean time, I can suggest that you restructure your code a little bit: <pre> protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException { Continuation continuation = ContinuationSupport.getContinuation(request,null); if (!continuation.isPending()) { System.err.println("============= suspend(5000) ================"); continuation.suspend(5000); } continuation.reset(); System.err.println("============= resumed or timeout ================"); PrintWriter out = null; response.setContentType("text/plain"); out = response.getWriter(); out.write("OK, resumed: " + continuation.isResumed()); out.flush(); } </pre> This reflects the fact that suspend may actually block. It probably wont fix the 100% loop. Can you put a println in the start of the doGet method to see if the 100% loop includes the servlet
Hide
Boyce Lee added a comment -

i put a println in the start of the doGet method ,and the println appeared only once in console – so the 100% loop doesn't include the servlet invocation.

"It is going to take me some time to find a windows 2003 setup" . it's unnecessary to test it in win2003 ,you can test it in any operation system that the explorer can send the special packet i mentioned before(tcp packet which tcp flag RST and ACK are set to server)

Show
Boyce Lee added a comment - i put a println in the start of the doGet method ,and the println appeared only once in console – so the 100% loop doesn't include the servlet invocation. "It is going to take me some time to find a windows 2003 setup" . it's unnecessary to test it in win2003 ,you can test it in any operation system that the explorer can send the special packet i mentioned before(tcp packet which tcp flag RST and ACK are set to server)
Hide
Boyce Lee added a comment -

if you can put a linux server on internet,and give me url where the test servlet is running, i can help to test from my computer which causes the bug now.

Show
Boyce Lee added a comment - if you can put a linux server on internet,and give me url where the test servlet is running, i can help to test from my computer which causes the bug now.
Hide
Boyce Lee added a comment -

the console output:

1 2007-03-22 17:24:54.788::INFO: Started SelectChannelConnector @ 0.0.0.0:80
2 ========== doGet ==============
3 ============= suspend(5000) ================
4 ========== doGet ==============
5 ============= resumed or timeout ================

the busy loop occurred between the first two invocation of doGet method,and the
duration of cpu100% is exactly the timeout of suspend().

(2)>(3)>cpu100%( duration: 5000)>(4)>(5)

Show
Boyce Lee added a comment - the console output: 1 2007-03-22 17:24:54.788::INFO: Started SelectChannelConnector @ 0.0.0.0:80 2 ========== doGet ============== 3 ============= suspend(5000) ================ 4 ========== doGet ============== 5 ============= resumed or timeout ================ the busy loop occurred between the first two invocation of doGet method,and the duration of cpu100% is exactly the timeout of suspend(). (2)>(3)>cpu100%( duration: 5000)>(4)>(5)
Hide
Boyce Lee added a comment -

the console output:

1 2007-03-22 17:24:54.788::INFO: Started SelectChannelConnector @ 0.0.0.0:80
2 ========== doGet ==============
3 ============= suspend(5000) ================
4 ========== doGet ==============
5 ============= resumed or timeout ================

the busy loop occurred between the only two invocation of doGet method,and the
duration of cpu100% is exactly the timeout of suspend().

(2)>(3)>cpu100%( duration: 5000)>(4)>(5)

Show
Boyce Lee added a comment - the console output: 1 2007-03-22 17:24:54.788::INFO: Started SelectChannelConnector @ 0.0.0.0:80 2 ========== doGet ============== 3 ============= suspend(5000) ================ 4 ========== doGet ============== 5 ============= resumed or timeout ================ the busy loop occurred between the only two invocation of doGet method,and the duration of cpu100% is exactly the timeout of suspend(). (2)>(3)>cpu100%( duration: 5000)>(4)>(5)
Hide
Greg Wilkins added a comment -

Boyce,

OK this sounds like the closed channel gets into a state that is selected but not closed or something?

I will put up a test server later today and get you to hit it with your windows 2003 browser.

thanks for the offer.

Show
Greg Wilkins added a comment - Boyce, OK this sounds like the closed channel gets into a state that is selected but not closed or something? I will put up a test server later today and get you to hit it with your windows 2003 browser. thanks for the offer.
Hide
floyd.yao added a comment -

is it a bug caused by the new feature of jetty 6? recently i found it in redhat server also

Show
floyd.yao added a comment - is it a bug caused by the new feature of jetty 6? recently i found it in redhat server also
Hide
Greg Wilkins added a comment -

Thanks all for the information on this one! It does appear to be a JVM bug that is fixed in java6

I have finally been able to reproduce the problem and it is a real shocker!! It is like somebody designed a JVM
bug specifically to make life REALLY hard for the way Jetty uses NIO.

I can detect that the failure is occuring (select returns for no reason before timeout), and I can inspect every
channel in the select set. But unless I write something, they all look perfectly fine (and open) - so I have no
way of detecting which channel is the one that has been badly closed.

So the only solution that I have found is to cancel all keys that have zero interested ops and to recreate
them again when there are indeed needed again. This appears to be working, but the cancelled keys
appears to provoke a couple of issues in Jetty that assumed key valid == channel open

Anyway.... I am very close to a fix ( I think ) and currently I can't produce any failures. But I'm going to
sleep on it and test again in the morning to be doubly certain.

thanks again!

Show
Greg Wilkins added a comment - Thanks all for the information on this one! It does appear to be a JVM bug that is fixed in java6 I have finally been able to reproduce the problem and it is a real shocker!! It is like somebody designed a JVM bug specifically to make life REALLY hard for the way Jetty uses NIO. I can detect that the failure is occuring (select returns for no reason before timeout), and I can inspect every channel in the select set. But unless I write something, they all look perfectly fine (and open) - so I have no way of detecting which channel is the one that has been badly closed. So the only solution that I have found is to cancel all keys that have zero interested ops and to recreate them again when there are indeed needed again. This appears to be working, but the cancelled keys appears to provoke a couple of issues in Jetty that assumed key valid == channel open Anyway.... I am very close to a fix ( I think ) and currently I can't produce any failures. But I'm going to sleep on it and test again in the morning to be doubly certain. thanks again!
Hide
Greg Wilkins added a comment -

I have checked in what I think is a resolution for this issue.

However, some more stress testing of the server is not required as some additional
instabilities may have been introduced as a result!

This will put back 6.1.2rc2 to the end of the month

Show
Greg Wilkins added a comment - I have checked in what I think is a resolution for this issue. However, some more stress testing of the server is not required as some additional instabilities may have been introduced as a result! This will put back 6.1.2rc2 to the end of the month
Hide
Doug Daniels added a comment -

I'm running JETTY 7.0.0pre5 Continuations and I'm running into this issue when running in a Redhat Linux VMWare environment. Is the 6.1.2rc2 resolution merged into the 7.0.0pre5 code?

We believe we are seeing the RST, ACK issue on our network, and we're seeing 2 threads causing100% CPU utilization:

"qtp0-9 - Acceptor0 SelectChannelConnector@0.0.0.0:8080" - Thread t@28
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)

  • locked sun.nio.ch.Util$1@46f8d1d0
  • locked java.util.Collections$UnmodifiableSet@1cfc87b7
  • locked sun.nio.ch.EPollSelectorImpl@67454dc9
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    at org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:403)
    at org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:174)
    at org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:123)
    at org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:698)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:520)

"qtp0-8 - Acceptor1 SelectChannelConnector@0.0.0.0:8080" - Thread t@27
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)

  • locked sun.nio.ch.Util$1@f79455e
  • locked java.util.Collections$UnmodifiableSet@697ea809
  • locked sun.nio.ch.EPollSelectorImpl@17b49fcf
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    at org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:403)
    at org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:174)
    at org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:123)
    at org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:698)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:520)
Show
Doug Daniels added a comment - I'm running JETTY 7.0.0pre5 Continuations and I'm running into this issue when running in a Redhat Linux VMWare environment. Is the 6.1.2rc2 resolution merged into the 7.0.0pre5 code? We believe we are seeing the RST, ACK issue on our network, and we're seeing 2 threads causing100% CPU utilization: "qtp0-9 - Acceptor0 SelectChannelConnector@0.0.0.0:8080" - Thread t@28 java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
  • locked sun.nio.ch.Util$1@46f8d1d0
  • locked java.util.Collections$UnmodifiableSet@1cfc87b7
  • locked sun.nio.ch.EPollSelectorImpl@67454dc9 at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80) at org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:403) at org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:174) at org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:123) at org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:698) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:520)
"qtp0-8 - Acceptor1 SelectChannelConnector@0.0.0.0:8080" - Thread t@27 java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
  • locked sun.nio.ch.Util$1@f79455e
  • locked java.util.Collections$UnmodifiableSet@697ea809
  • locked sun.nio.ch.EPollSelectorImpl@17b49fcf at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80) at org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:403) at org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:174) at org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:123) at org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:698) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:520)
Hide
Greg Wilkins added a comment -

Doug,

the jetty-7 development has moved to eclipse and the latest release is a 7.0.0.RC1, which has handling for these JVM bugs.

Show
Greg Wilkins added a comment - Doug, the jetty-7 development has moved to eclipse and the latest release is a 7.0.0.RC1, which has handling for these JVM bugs.

People

Vote (0)
Watch (2)

Dates

  • Created:
    Updated:
    Resolved: