Jetty
  1. Jetty
  2. JETTY-515

Timer is missing scavenging Task in HashSessionManager

    Details

    • Type: Bug Bug
    • Status: Resolved Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 6.1.7
    • Fix Version/s: 6.2.0pre0, 6.1.8
    • Component/s: Servlet
    • Labels:
      None
    • Environment:
      Windows Server 2003, jdk1.6.0, jetty 6.1.7
    • Number of attachments :
      4

      Description

      Sporadically our production environment goes out of Heap, because expired sessions are not scavenged.
      To validate this we have performed a heap dump when the server is hitting its maximum heap usage, before crashing the server.
      In this .hprof heap dump we used SAP Memory Analyzer to do post-mortem analysis. (screenshot of problem in attachment)
      What happens is that somehow the java.util.Timer object (_timer in the HashSessionManager) ends up not having any TimerTask assigned to it (we are absolutely sure about this - in other heap dumps the TimerTask is clearly there - pointing to the anonymous inner class that calls the scavenge() method).
      Hence timed-out sessions are not garbage collected as they remain referenced through the SessionManager.
      Analyzing the Session objects in the heap dump reveals a very large amount of HashSessionManager$$Session objects (about 10 times more then normal system usage). Ffurthermore the oldest Session objects have lastAccessed times that were more than 24 hours before the system heap problems.

      Often the system can run weeks before experiencing this problem.

      If more information is needed please contact me at dimitry_dhondt [at] yahoo.com

      Dimitry.

      1. heap.jpg
        191 kB
      2. scavenge.jpg
        201 kB
      3. TimerTask.jpg
        175 kB
      4. TimerThread.jpg
        214 kB

        Activity

        Hide
        Jan Bartel added a comment -

        Dimitry,

        Was the scavenger timer thread still running?

        Jan

        Show
        Jan Bartel added a comment - Dimitry, Was the scavenger timer thread still running? Jan
        Hide
        Jan Bartel added a comment -

        Dimitry,

        Also, were there any exceptions in the log files?

        Jan

        Show
        Jan Bartel added a comment - Dimitry, Also, were there any exceptions in the log files? Jan
        Hide
        Jan Bartel added a comment -

        I've tentatively marked this one Resolved, but please let me know if it isn't.

        I found that the savenger TimerTask does not catch Exceptions, and an Exception thrown out of the TimerTask will cause it not be re-scheduled.

        So I am surmising that perhaps your jvm was running out of memory and that an exception was thrown in the scavenger TimerTask which then killed the timer thread, which allowed the sessions to back up.

        I've committed a change which logs any exceptions in the scavenge() method and prevents them from being thrown so the thread will stay alive.

        It would be great if you could test this out, as we are about to do release 1.6.8 and I'd like to have this issue fixed by then.

        cheers
        Jan

        Show
        Jan Bartel added a comment - I've tentatively marked this one Resolved, but please let me know if it isn't. I found that the savenger TimerTask does not catch Exceptions, and an Exception thrown out of the TimerTask will cause it not be re-scheduled. So I am surmising that perhaps your jvm was running out of memory and that an exception was thrown in the scavenger TimerTask which then killed the timer thread, which allowed the sessions to back up. I've committed a change which logs any exceptions in the scavenge() method and prevents them from being thrown so the thread will stay alive. It would be great if you could test this out, as we are about to do release 1.6.8 and I'd like to have this issue fixed by then. cheers Jan
        Hide
        Dimitry D'hondt added a comment -

        TimerTask in faulty heap dump

        Show
        Dimitry D'hondt added a comment - TimerTask in faulty heap dump
        Hide
        Dimitry D'hondt added a comment -

        Hi Jan,

        I was occupied with other stuff till now.

        I've been looking into this problem further. The theory about the exception in setScavangePeriod() does not pan out according to me. In the second screenshot I've included the state of the TimerTask that normally calls the scavenge() operation. As you can see the state variable contains 1, which is the SCHEDULED state. If the exception prevented the schedule operation, the TimerTask should have VIRGIN state (0). (Furthermore the nextExecutionTime is filled in - and more than 24 h before the heap problems described above. The period variable is also set, further indicating that this TimerTask was indeed scheduled at some point.).

        My best guess right now its that the BackgroundThread of the Timer is somehow broken. Maybe the scavenge() method threw an exception ?

        I'll keep you further informed about our progress.

        Kind regards,

        Dimitiry.

        Show
        Dimitry D'hondt added a comment - Hi Jan, I was occupied with other stuff till now. I've been looking into this problem further. The theory about the exception in setScavangePeriod() does not pan out according to me. In the second screenshot I've included the state of the TimerTask that normally calls the scavenge() operation. As you can see the state variable contains 1, which is the SCHEDULED state. If the exception prevented the schedule operation, the TimerTask should have VIRGIN state (0). (Furthermore the nextExecutionTime is filled in - and more than 24 h before the heap problems described above. The period variable is also set, further indicating that this TimerTask was indeed scheduled at some point.). My best guess right now its that the BackgroundThread of the Timer is somehow broken. Maybe the scavenge() method threw an exception ? I'll keep you further informed about our progress. Kind regards, Dimitiry.
        Hide
        Dimitry D'hondt added a comment -

        I've also included a graph of the heap usage of the application.
        The first red line is a restart of the server for a bugfix release.
        The second red line is 15:56:28, the time at which the TimerTask is found the have it's nextExecutionTime in post-mortem analysis.

        Dimitry.

        Show
        Dimitry D'hondt added a comment - I've also included a graph of the heap usage of the application. The first red line is a restart of the server for a bugfix release. The second red line is 15:56:28, the time at which the TimerTask is found the have it's nextExecutionTime in post-mortem analysis. Dimitry.
        Hide
        Dimitry D'hondt added a comment -

        Added info about TimerThread.

        Taking a look at the TimerThread that should be executing the Timer's Tasks (and thus the scavenge() method), reveales that this thread is probably broken :

        1) The variables underlined in red are null which indicates that likely the Thread.exit() method has run :

        private void exit() {
        if (group != null)

        { group.remove(this); group = null; }

        /* Aggressively null out all reference fields: see bug 4006245 */
        target = null;
        /* Speed the release of some of these resources */
        threadLocals = null;
        inheritableThreadLocals = null;
        inheritedAccessControlContext = null;
        blocker = null;
        uncaughtExceptionHandler = null;
        }

        And a very curious variable (not references from the java portion of Thread - probably from the JVM native code) stillborn indicates true.

        Dimitry.

        Show
        Dimitry D'hondt added a comment - Added info about TimerThread. Taking a look at the TimerThread that should be executing the Timer's Tasks (and thus the scavenge() method), reveales that this thread is probably broken : 1) The variables underlined in red are null which indicates that likely the Thread.exit() method has run : private void exit() { if (group != null) { group.remove(this); group = null; } /* Aggressively null out all reference fields: see bug 4006245 */ target = null; /* Speed the release of some of these resources */ threadLocals = null; inheritableThreadLocals = null; inheritedAccessControlContext = null; blocker = null; uncaughtExceptionHandler = null; } And a very curious variable (not references from the java portion of Thread - probably from the JVM native code) stillborn indicates true. Dimitry.
        Hide
        Jan Bartel added a comment -

        Hi Dimitry,

        Thanks for the thorough investigation!

        We released jetty-6.1.8 yesterday, which includes what I still believe is the fix for the scavenger thread dying. If you are able to try out 6.1.8 and report back, that would be most useful. If the problem re-occurs, can you also please attach the jetty log file.

        cheers
        Jan

        Show
        Jan Bartel added a comment - Hi Dimitry, Thanks for the thorough investigation! We released jetty-6.1.8 yesterday, which includes what I still believe is the fix for the scavenger thread dying. If you are able to try out 6.1.8 and report back, that would be most useful. If the problem re-occurs, can you also please attach the jetty log file. cheers Jan
        Hide
        Dimitry D'hondt added a comment -

        Jan,

        We're currently quite convinced that the Timer background thread was broken from the start of the application server. This would mean that Jetty had no influence on this (as it has no way to control the Timer's thread - which is started from the constructor of the Timer object - called from the HashSessionManager). The stillborn flag clearly indicated that this thread failed to start in the first place. This seems to be a JVM related issue. We're upgrading from JRE 1.6.0 to 1.6.0_04. Other events (unrelated to the problem described here) are requiring us to redeploy the application on Tomcat 5.5 (Some people believe that it would be easier on the people maintaining the production environment - if all applications were running on Tomcat).

        That is not to say that the fix you developed would most surely fix some cases. I'm also assuming that for some of the crashes we've had in the past this race condition may indeed have been a cause.

        Thanks for the support you've given us !

        Dimitry.

        Show
        Dimitry D'hondt added a comment - Jan, We're currently quite convinced that the Timer background thread was broken from the start of the application server. This would mean that Jetty had no influence on this (as it has no way to control the Timer's thread - which is started from the constructor of the Timer object - called from the HashSessionManager). The stillborn flag clearly indicated that this thread failed to start in the first place. This seems to be a JVM related issue. We're upgrading from JRE 1.6.0 to 1.6.0_04. Other events (unrelated to the problem described here) are requiring us to redeploy the application on Tomcat 5.5 (Some people believe that it would be easier on the people maintaining the production environment - if all applications were running on Tomcat). That is not to say that the fix you developed would most surely fix some cases. I'm also assuming that for some of the crashes we've had in the past this race condition may indeed have been a cause. Thanks for the support you've given us ! Dimitry.
        Hide
        Dimitry D'hondt added a comment -

        Jan,

        Apparently we're now back on jetty
        We'll deploy 6.1.8 this week, and keep you posted.

        Dimitry.

        Show
        Dimitry D'hondt added a comment - Jan, Apparently we're now back on jetty We'll deploy 6.1.8 this week, and keep you posted. Dimitry.
        Hide
        Dimitry D'hondt added a comment -

        Jan,

        We've been running for quite a while now in production (about 2 months), and the issue has never returned after your fix.

        Thx,

        Dimi.

        Show
        Dimitry D'hondt added a comment - Jan, We've been running for quite a while now in production (about 2 months), and the issue has never returned after your fix. Thx, Dimi.

          People

          • Assignee:
            Jan Bartel
            Reporter:
            Dimitry D'hondt
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: