Details

    • Type: Bug Bug
    • Status: Closed Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: JRuby 1.0.0
    • Fix Version/s: JRuby 1.1b1
    • Component/s: Core Classes/Modules
    • Labels:
      None
    • Environment:
      Suspected: any; tested on: Mac OS X 10.4.9, Java 1.5.0_07, JRuby 1.0.0 (Ruby 1.8.5).
    • Testcase included:
      yes
    • Patch Submitted:
      Yes
    • Number of attachments :
      6

      Description

      Summary: a bug in Thread.critical= causes the entire system to deadlock (all Ruby threads); this happens with very high probability if given sufficient concurrency (a large number of threads). The synchronization happening at the Java layer in the implementation of Thread.critical is incorrect, and results in this deadlock. The attached patch fixes it.

      I discovered this by running ~10 threads, each of which made many calls to the Logger (without actually outputting anything); Logger uses a Mutex, which in turn uses Thread.critical=. The threads deadlocked almost without fail after only about 10-20 seconds of runtime on my box. (Obviously, your results may vary wildly, as this is all about thread timing.)

      Specifically, org.jruby.internal.runtime.ThreadService.

      {getCritical(), setCritical(boolean)}

      have the following synchronization bug: they all hold the Java monitor on the singleton ThreadService object, then call into the ReentrantLock meant to implement Thread.critical. But if the call into the ReentrantLock ever blocks, nobody can ever unlock it, because the thread blocked on the ReentrantLock will hold the Java monitor on the ThreadService, which is required to be held in order to unlock the ReentrantLock. (Blocking on the ReentrantLock will unlock it, but not monitors held higher up the call chain – like the one on the ThreadService.)

      There was also an apparent race condition in ThreadService.waitForCritical() (it tested whether the current thread held the critical lock, then tried to lock and unlock it, but there was no guard around these; hence, the test could execute and pass, only to have the test no longer be true by the time it got to the lock/unlock clause); my reworked synchronization fixes this, too.

      Full description, from the comments in the supplied testcase:

      Background:

      • In MRI, setting 'Thread.critical = true' means no other threads can possibly
        be scheduled, period. (This is because MRI uses 'green threads', meaning it
        implements its own thread scheduler, and the OS knows nothing about the
        threads in the process; as such, it can guarantee this easily.) Thus it is
        simply not possible for another thread to set Thread.critical to anything,
        period, until the first thread sets 'Thread.critical = false'.
      • In JRuby, Ruby threads are mapped directly onto Java threads. Because of this,
        JRuby cannot actually guarantee that no other threads are scheduled once you
        set 'Thread.critical = true'; all it can do is block other threads as soon as
        they call one of a certain set of library functions that explicitly check the
        value of 'Thread.critical'.

      The bug involves two threads:

      1. Thread A calls 'Thread.critical = true'.
      2. Thread B is then selected to run by the OS's scheduler. In MRI, this is not
      possible, but in JRuby, it is.
      3. Thread B calls 'Thread.critical = true'. This correctly does the right thing,
      which is to block until Thread A sets 'Thread.critical = false'. However,
      here is where the bug occurred: this would grab a monitor (an ordinary Java
      monitor) that was required to do anything at all with the 'critical lock',
      and hold it while it was blocked on a different lock waiting for thread A
      to set 'Thread.critical = false'. (Because it is blocked on a different lock,
      Java's synchronization primitives do not automatically release the monitor
      this thread has acquited.)
      4. Thread A calls 'Thread.critical = false'. However, in order to change the
      value, it needs to acquire the monitor that thread A holds.
      5. Deadlock.

      The fix was to rework the synchronization so that the same monitor is used for
      both locking the value of Thread.critical and to block until it's set to 'false'
      when required, above. This test case demonstrates that the bug is fixed, and
      makes sure it never recurs.

      The bug was nasty because lots of synchronization objects (e.g., Mutex) use
      Thread.critical in their implementation, and it ends up just being a matter of
      time until the entire system deadlocks. (In particular, even just calls to
      Logger from different threads will do this, as it uses a Mutex, which, in
      turn, uses Thread.critical.)

      Unfortunately, this test case simply hangs when the bug is present, as that's
      the consequence of the bug (it's a deadlock). Further, we can't do anything
      about it, as using another thread to try to time out the test just exacerbates
      the problem and deadlocks itself. Alas, that's all we can do, but at least a
      hung test is likely to get noticed when compared to no test at all.

      1. deadlock_stack_trace.txt
        33 kB
        Nicolas Modrzyk
      2. jruby_1157_owscheduler.tgz
        9 kB
        John Mettraux
      3. jruby_dump.txt
        29 kB
        John Mettraux
      4. jruby-1157.patch
        7 kB
        Andrew Geweke
      5. stack_trace_with_java_util.txt
        13 kB
        Nicolas Modrzyk
      6. thread_service.patch
        1 kB
        Nicolas Modrzyk

        Activity

        Hide
        Andrew Geweke added a comment -

        Feel free to use the supplied test case or not; it sadly actually /hangs/ the tests if the bug exists, but there is no easy way around this AFAIK. (See the comment for more information.)

        Show
        Andrew Geweke added a comment - Feel free to use the supplied test case or not; it sadly actually /hangs/ the tests if the bug exists, but there is no easy way around this AFAIK. (See the comment for more information.)
        Hide
        Nick Sieger added a comment -

        Thanks for the exhaustive analysis, Andrew. I'll look at including your test case in the future when we have the ability to kill off builds if they run too long in our continuous integration server.

        Show
        Nick Sieger added a comment - Thanks for the exhaustive analysis, Andrew. I'll look at including your test case in the future when we have the ability to kill off builds if they run too long in our continuous integration server.
        Hide
        Michael Fellinger added a comment -

        The attached patch finally fixed it for me. Would be nice to integrate it into svn.

        Show
        Michael Fellinger added a comment - The attached patch finally fixed it for me. Would be nice to integrate it into svn.
        Hide
        Nicolas Modrzyk added a comment -

        Andrew's patch is working great here as well.

        Would definitely want this one to be in the SVN trunk.

        Show
        Nicolas Modrzyk added a comment - Andrew's patch is working great here as well. Would definitely want this one to be in the SVN trunk.
        Hide
        John Mettraux added a comment -

        Andrew's patch is essential. Please include it, it makes JRuby really 1.0.

        Show
        John Mettraux added a comment - Andrew's patch is essential. Please include it, it makes JRuby really 1.0.
        Hide
        Charles Oliver Nutter added a comment -

        Well here's an initial concern...it syncs every thread on every call to pollThreadEvents, which is almost certain to be a performance hit. I don't know how much, but ideally I want to find a way to check for criticalization without synchronizing. I'd rather that critical be not-quite-deterministic in when threads actually stop running rather than force every thread to sync multiple times per method.

        Show
        Charles Oliver Nutter added a comment - Well here's an initial concern...it syncs every thread on every call to pollThreadEvents, which is almost certain to be a performance hit. I don't know how much, but ideally I want to find a way to check for criticalization without synchronizing. I'd rather that critical be not-quite-deterministic in when threads actually stop running rather than force every thread to sync multiple times per method.
        Hide
        Nicolas Modrzyk added a comment -

        stack trace without patch when running our own thread test.

        Show
        Nicolas Modrzyk added a comment - stack trace without patch when running our own thread test.
        Hide
        John Mettraux added a comment -

        sonanda$ java -version
        java version "1.5.0_07"
        Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_07-164)
        Java HotSpot(TM) Client VM (build 1.5.0_07-87, mixed mode, sharing)

        sonanda$ ../jruby -v
        -bash: ../jruby: is a directory
        sonanda:~/openwfe-ruby mettraux$ ../jruby/bin/jruby -v
        ruby 1.8.5 (2007-07-13 rev 3876) [i386-jruby1.1]

        Dump CTRL BSLASH

        Show
        John Mettraux added a comment - sonanda$ java -version java version "1.5.0_07" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_07-164) Java HotSpot(TM) Client VM (build 1.5.0_07-87, mixed mode, sharing) sonanda$ ../jruby -v -bash: ../jruby: is a directory sonanda:~/openwfe-ruby mettraux$ ../jruby/bin/jruby -v ruby 1.8.5 (2007-07-13 rev 3876) [i386-jruby1.1] Dump CTRL BSLASH
        Hide
        Charles Oliver Nutter added a comment -

        Nicolas: For your trace I'd recommend running with the actual Java 5/6 concurrency libraries rather than the backport. All you need to do is change

        import edu.emory.....java.util.concurrent.whatever

        to

        import java.util.concurrent.whatever

        in RubyThread. Give that a try please.

        John: Can you post a comment describing what all the threads in your trace are expected to do? Your problem looks less like something wrong with critical= and more like a combination of multiple threads all doing different thread operations at the same time.

        Show
        Charles Oliver Nutter added a comment - Nicolas: For your trace I'd recommend running with the actual Java 5/6 concurrency libraries rather than the backport. All you need to do is change import edu.emory.....java.util.concurrent.whatever to import java.util.concurrent.whatever in RubyThread. Give that a try please. John: Can you post a comment describing what all the threads in your trace are expected to do? Your problem looks less like something wrong with critical= and more like a combination of multiple threads all doing different thread operations at the same time.
        Hide
        John Mettraux added a comment -

        Hi Charles,

        thanks a lot. I've been writing a reduced test to display the blocking behaviour, but it takes time.

        I always keep in mind that this could be a problem in my code, but it works fine with the MRI (1.8.4 and 1.8.5).

        I've been trying to remove some synchronization for a piece of my code and JRuby sometimes goes through my initial test, but unfortunately just "sometimes".

        I'm back to implementing a reduced test, it should be somehow easier than explaining in words the thread interactions.

        Maybe Andrew will come up with a better test than the one included in the patch.

        Thanks a lot, we're working on providing you something to work with.

        Best regards,

        John

        Show
        John Mettraux added a comment - Hi Charles, thanks a lot. I've been writing a reduced test to display the blocking behaviour, but it takes time. I always keep in mind that this could be a problem in my code, but it works fine with the MRI (1.8.4 and 1.8.5). I've been trying to remove some synchronization for a piece of my code and JRuby sometimes goes through my initial test, but unfortunately just "sometimes". I'm back to implementing a reduced test, it should be somehow easier than explaining in words the thread interactions. Maybe Andrew will come up with a better test than the one included in the patch. Thanks a lot, we're working on providing you something to work with. Best regards, John
        Hide
        John Mettraux added a comment -

        Hi,

        just attached a test which displays a blocking behaviour (haven't yet verified if it's really thread critical related).

        How to run the tests is described in the README.txt

        In short the suite blocks after 2 to 4 tests. If run with some debug for the synchronization, the suite doesn't block.

        sonanda:~ mettraux$ java -version
        java version "1.5.0_07"
        Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_07-164)
        Java HotSpot(TM) Client VM (build 1.5.0_07-87, mixed mode, sharing)

        MacBook Intel Core Duo.

        JRuby at revision 4030.

        I will test with Java 6 tomorrow with Hellonico (if the quakes allow it), and try to provide refinements.

        Cheers,

        John

        Show
        John Mettraux added a comment - Hi, just attached a test which displays a blocking behaviour (haven't yet verified if it's really thread critical related). How to run the tests is described in the README.txt In short the suite blocks after 2 to 4 tests. If run with some debug for the synchronization, the suite doesn't block. sonanda:~ mettraux$ java -version java version "1.5.0_07" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_07-164) Java HotSpot(TM) Client VM (build 1.5.0_07-87, mixed mode, sharing) MacBook Intel Core Duo. JRuby at revision 4030. I will test with Java 6 tomorrow with Hellonico (if the quakes allow it), and try to provide refinements. Cheers, John
        Hide
        Nicolas Modrzyk added a comment -

        This was asked by Charles.

        This is the stack trace when replacing backport with java.util

        The JVM is actually finding a deadlock by itself. I just reproduced it so will have a look in more details from now on.

        Show
        Nicolas Modrzyk added a comment - This was asked by Charles. This is the stack trace when replacing backport with java.util The JVM is actually finding a deadlock by itself. I just reproduced it so will have a look in more details from now on.
        Hide
        Nicolas Modrzyk added a comment -

        actually, why not suppress the synchronized calls to setCritical and getCritical ?

        the synchronization is done at the ReentrantLock level, and nothing else needs to be changed ...

        It actually runs fine by removing those synchronized keyword from those two methods.

        Attached is the patch to make it work.

        Show
        Nicolas Modrzyk added a comment - actually, why not suppress the synchronized calls to setCritical and getCritical ? the synchronization is done at the ReentrantLock level, and nothing else needs to be changed ... It actually runs fine by removing those synchronized keyword from those two methods. Attached is the patch to make it work.
        Hide
        John Mettraux added a comment -

        Nicolas' patch works fine for me with Java 5.

        Show
        John Mettraux added a comment - Nicolas' patch works fine for me with Java 5.
        Hide
        Charles Oliver Nutter added a comment -

        Actually removing synchronized seems to be pretty safe...I can't find any holes in it. We'll run with that and call it resolved for now. Thanks for the help guys.

        Show
        Charles Oliver Nutter added a comment - Actually removing synchronized seems to be pretty safe...I can't find any holes in it. We'll run with that and call it resolved for now. Thanks for the help guys.
        Hide
        Charles Oliver Nutter added a comment -

        Fixed in 4037.

        Show
        Charles Oliver Nutter added a comment - Fixed in 4037.

          People

          • Assignee:
            Charles Oliver Nutter
            Reporter:
            Andrew Geweke
          • Votes:
            5 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: