groovy

Found a deadlock when I have mutiple threads running a groovy script

Details

  • Type: Bug Bug
  • Status: Closed Closed
  • Priority: Blocker Blocker
  • Resolution: Fixed
  • Affects Version/s: 1.5.2
  • Fix Version/s: 1.5.5
  • Component/s: None
  • Labels:
    None
  • Patch Submitted:
    Yes
  • Number of attachments :
    3

Description

Found one Java-level deadlock:
=============================
"Thread-12":
waiting to lock monitor 0x0810b934 (object 0xaa61ddb8, a org.codehaus.groovy.reflection.CachedClass),
which is held by "Thread-11"
"Thread-11":
waiting to lock monitor 0x0810b974 (object 0x71b8ce88, a java.util.WeakHashMap),
which is held by "Thread-12"

Java stack information for the threads listed above:
===================================================
"Thread-12":
at org.codehaus.groovy.reflection.CachedClass.getCachedSuperClass(CachedClass.java:131)

  • waiting to lock <0xaa61ddb8> (a org.codehaus.groovy.reflection.CachedClass)
    at org.codehaus.groovy.reflection.CachedClass.<init>(CachedClass.java:125)
    at org.codehaus.groovy.reflection.ReflectionCache.getCachedClass(ReflectionCache.java:210)
  • locked <0x71b8ce88> (a java.util.WeakHashMap)
    at groovy.lang.MetaClassImpl.<init>(MetaClassImpl.java:106)
    at groovy.lang.MetaClassImpl.<init>(MetaClassImpl.java:114)
    at groovy.lang.MetaClassRegistry$MetaClassCreationHandle.createNormalMetaClass(MetaClassRegistry.java:102)
    at groovy.lang.MetaClassRegistry$MetaClassCreationHandle.create(MetaClassRegistry.java:92)
    at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl.getGlobalMetaClass(MetaClassRegistryImpl.java:247)
  • locked <0x6e29b0c0> (a java.lang.Class)
    at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl.access$100(MetaClassRegistryImpl.java:45)
    at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl$LocallyKnownClasses.getFromGlobal(MetaClassRegistryImpl.java:112)
    at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl$LocallyKnownClasses.getMetaClass(MetaClassRegistryImpl.java:88)
    at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl$MyThreadLocal.getMetaClass(MetaClassRegistryImpl.java:356)
    at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl.getMetaClass(MetaClassRegistryImpl.java:260)
    at org.codehaus.groovy.runtime.InvokerHelper.getMetaClass(InvokerHelper.java:728)
    at groovy.lang.GroovyObjectSupport.<init>(GroovyObjectSupport.java:32)
    at groovy.lang.Script.<init>(Script.java:40)
    at groovy.lang.Script.<init>(Script.java:37)
    at uc5.<init>(uc5.groovy)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
    at java.lang.Class.newInstance0(Class.java:350)
    at java.lang.Class.newInstance(Class.java:303)
    at org.codehaus.groovy.runtime.InvokerHelper.createScript(InvokerHelper.java:416)
    at groovy.lang.GroovyShell.parse(GroovyShell.java:584)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:472)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:508)
    at org.terracotta.droid.script.GroovyExec.exec(GroovyExec.java:27)
    at org.terracotta.droid.worker.Worker$1.run(Worker.java:53)
    at java.lang.Thread.run(Thread.java:595)
    "Thread-11":
    at org.codehaus.groovy.reflection.ReflectionCache.getCachedClass(ReflectionCache.java:165)
  • waiting to lock <0x71b8ce88> (a java.util.WeakHashMap)
    at org.codehaus.groovy.reflection.ParameterTypes.getParametersTypes0(ParameterTypes.java:77)
  • locked <0xaa621940> (a org.codehaus.groovy.reflection.CachedMethod)
    at org.codehaus.groovy.reflection.ParameterTypes.getParameterTypes(ParameterTypes.java:63)
    at org.codehaus.groovy.reflection.CachedMethod.compareToCachedMethod(CachedMethod.java:181)
    at org.codehaus.groovy.reflection.CachedMethod.compareTo(CachedMethod.java:163)
    at java.util.Arrays.mergeSort(Arrays.java:1156)
    at java.util.Arrays.mergeSort(Arrays.java:1168)
    at java.util.Arrays.sort(Arrays.java:1080)
    at org.codehaus.groovy.reflection.CachedClass.getMethods(CachedClass.java:171)
  • locked <0xaa61ddb8> (a org.codehaus.groovy.reflection.CachedClass)
    at groovy.lang.MetaClassImpl.populateMethods(MetaClassImpl.java:268)
    at groovy.lang.MetaClassImpl.fillMethodIndex(MetaClassImpl.java:215)
    at groovy.lang.MetaClassImpl.initialize(MetaClassImpl.java:2469)
  • locked <0xaa61d920> (a groovy.lang.MetaClassImpl)
    at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl.getGlobalMetaClass(MetaClassRegistryImpl.java:248)
  • locked <0x6e293f78> (a java.lang.Class)
    at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl.access$100(MetaClassRegistryImpl.java:45)
    at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl$LocallyKnownClasses.getFromGlobal(MetaClassRegistryImpl.java:112)
    at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl$LocallyKnownClasses.getMetaClass(MetaClassRegistryImpl.java:88)
    at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl$MyThreadLocal.getMetaClass(MetaClassRegistryImpl.java:356)
    at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl.getMetaClass(MetaClassRegistryImpl.java:260)
    at org.codehaus.groovy.runtime.InvokerHelper.getMetaClass(InvokerHelper.java:728)
    at groovy.lang.GroovyObjectSupport.<init>(GroovyObjectSupport.java:32)
    at groovy.lang.Script.<init>(Script.java:40)
    at groovy.lang.Script.<init>(Script.java:37)
    at uc5.<init>(uc5.groovy)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
    at java.lang.Class.newInstance0(Class.java:350)
    at java.lang.Class.newInstance(Class.java:303)
    at org.codehaus.groovy.runtime.InvokerHelper.createScript(InvokerHelper.java:416)
    at groovy.lang.GroovyShell.parse(GroovyShell.java:584)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:472)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:508)
    at org.terracotta.droid.script.GroovyExec.exec(GroovyExec.java:27)
    at org.terracotta.droid.worker.Worker$1.run(Worker.java:53)
    at java.lang.Thread.run(Thread.java:595)
  1. groovy-2596-exp1.txt
    18/Feb/08 4:55 AM
    7 kB
    Jim White
  2. groovy-2596-exp2.txt
    18/Feb/08 5:53 AM
    12 kB
    Jim White
  3. groovy-2596-fix1.txt
    18/Feb/08 7:10 AM
    14 kB
    Jim White

Activity

Hide
Guillaume Laforge added a comment -

Could you please try with Groovy 1.5.4 as such deadlocks should have been fixed in that version?

Show
Guillaume Laforge added a comment - Could you please try with Groovy 1.5.4 as such deadlocks should have been fixed in that version?
Hide
Antonio Si added a comment -

I am using 1.5.4.

Thanks.

Show
Antonio Si added a comment - I am using 1.5.4. Thanks.
Hide
Guillaume Laforge added a comment -

So now, we'd need a sample program that could help reproduce that deadlock.
Also, if you could tell us a bit more about the context of your application, and the details of the hardware and OS you are running this on, that would be great.

Show
Guillaume Laforge added a comment - So now, we'd need a sample program that could help reproduce that deadlock. Also, if you could tell us a bit more about the context of your application, and the details of the hardware and OS you are running this on, that would be great.
Hide
Guillaume Laforge added a comment -

Also, if you don't run this program on top of Terracotta, do you still encounter this deadlock?

Show
Guillaume Laforge added a comment - Also, if you don't run this program on top of Terracotta, do you still encounter this deadlock?
Hide
Jim White added a comment -

I started studying this bug and after a short while it is pretty obvious what the deadlock is.

A couple threads are fired off naturally each making a org.codehaus.groovy.runtime.InvokerHelper.getMetaClass call early in the game and they use some interdependent classes.

Eventually Thread-11 calls the synchronized method CachedClass.getMethods(). This is a reasonably safe (and I'm pretty sure redundant) synchronization call because the thread already holds the lock on the corresponding Class.

The CachedClass.getMethods() function includes a sort on an array of CachedMethod objects (CachedClass.java:171). The comparator for CachedMethod makes many calls to our fickle friend ReflectionCache.getCachedClass(Class).

The call by Thread-11 is not a safe call because ReflectionCache.getCachedClass tries to get a global lock and that is done while the thread holds several locks, and fatally the Class/CachedClass pair.

The race is hit here because while Thread-11 was messing around, Thread-12 was also loading a class that has as a superclass the class that Thread-11 holds a lock on.

So when Thread-12 is running his CachedClass.<init> method and calling ReflectionCache.getCachedClass(Class) on each of his superclasses (CachedClass.java:125) he gets stuck when calling the synchronized method CachedClass.getCachedSuperClass() on the Class/CachedClass that Thread-11 is loading. And it's deadlock because he grabbed the global lock before doing that.

At first I thought this would fairly easy to fix, but after wandering about in the code for a while I realize that there is a lot of random synchronization calls and getting a proper lock strategy worked out will take a fair bit of time and I don't have it right now.

My first thought was to move the call to the CachedClass constructor outside of the global lock in ReflectionCache.getCachedClass. That makes the lock local and simply becomes a double-check put into CACHED_CLASS_MAP. But then I notice ReflectionCache.assignableMap is modified without synchronization and appears to piggyback off only being done while inside the global lock during the CachedClass construction.

So while it is clear we need to eventually make sure no global lock is held while making any non-local calls, I don't have a patch for that at the moment. Note that any non-local synchronization should pretty much be done strictly with locks on Class objects. Any synchronization in CachedClass, ReflectionCache, CachedMethod, et al that cannot be analyzed as strictly local are almost bound to run into trouble.

For a detailed explanation of a locking strategy that works for classloading, see section 2.17.5 "Detailed Initialization Procedure" of the JVMS.
http://java.sun.com/docs/books/jvms/second_edition/html/Concepts.doc.html#19075

Now, for the good news. This particular deadlock can be avoided by eliminating the superfluous lock by CachedClass.getCachedSuperClass(). This is a safe change because it is only going to hurt to get the CachedClass lock without the Class lock first. And as for the local effect of CachedClass.getCachedSuperClass(), it is idempotent and therefore doesn't mind a race. But this won't really be a fix if CachedClass.getMethods() is called before the class is cached by ReflectionCache.getCachedClass (although if that is the case then this would be a definite improvement!).

While a better solution might be getting a lock on the Class/CachedClass before getting the global lock (and I attach a possible fix for ReflectionCache), I don't have time to do further analysis now and I don't like making changes I can't test. This experimental patch includes the fix for GROOVY-2568 which has been long completed and is completely correct and tested and way overdue to be committed on the 1_5_X branch.

Although unit testing for correct concurrency implementation is mostly pointless, I do have an idea for an approach using AOP that could be used for creating test cases for JIRA issues like this one.

More thorough debugging and testing though will still need to be done with tools like JTest and PMD. We should also make use of Coverity. I saw their community manager David Maxwell talk at SCALE 6x last weekend and they support Java and are eager to work with any OSS project that is willing to make use of their analysis.
http://scan.coverity.com/

Show
Jim White added a comment - I started studying this bug and after a short while it is pretty obvious what the deadlock is. A couple threads are fired off naturally each making a org.codehaus.groovy.runtime.InvokerHelper.getMetaClass call early in the game and they use some interdependent classes. Eventually Thread-11 calls the synchronized method CachedClass.getMethods(). This is a reasonably safe (and I'm pretty sure redundant) synchronization call because the thread already holds the lock on the corresponding Class. The CachedClass.getMethods() function includes a sort on an array of CachedMethod objects (CachedClass.java:171). The comparator for CachedMethod makes many calls to our fickle friend ReflectionCache.getCachedClass(Class). The call by Thread-11 is not a safe call because ReflectionCache.getCachedClass tries to get a global lock and that is done while the thread holds several locks, and fatally the Class/CachedClass pair. The race is hit here because while Thread-11 was messing around, Thread-12 was also loading a class that has as a superclass the class that Thread-11 holds a lock on. So when Thread-12 is running his CachedClass.<init> method and calling ReflectionCache.getCachedClass(Class) on each of his superclasses (CachedClass.java:125) he gets stuck when calling the synchronized method CachedClass.getCachedSuperClass() on the Class/CachedClass that Thread-11 is loading. And it's deadlock because he grabbed the global lock before doing that. At first I thought this would fairly easy to fix, but after wandering about in the code for a while I realize that there is a lot of random synchronization calls and getting a proper lock strategy worked out will take a fair bit of time and I don't have it right now. My first thought was to move the call to the CachedClass constructor outside of the global lock in ReflectionCache.getCachedClass. That makes the lock local and simply becomes a double-check put into CACHED_CLASS_MAP. But then I notice ReflectionCache.assignableMap is modified without synchronization and appears to piggyback off only being done while inside the global lock during the CachedClass construction. So while it is clear we need to eventually make sure no global lock is held while making any non-local calls, I don't have a patch for that at the moment. Note that any non-local synchronization should pretty much be done strictly with locks on Class objects. Any synchronization in CachedClass, ReflectionCache, CachedMethod, et al that cannot be analyzed as strictly local are almost bound to run into trouble. For a detailed explanation of a locking strategy that works for classloading, see section 2.17.5 "Detailed Initialization Procedure" of the JVMS. http://java.sun.com/docs/books/jvms/second_edition/html/Concepts.doc.html#19075 Now, for the good news. This particular deadlock can be avoided by eliminating the superfluous lock by CachedClass.getCachedSuperClass(). This is a safe change because it is only going to hurt to get the CachedClass lock without the Class lock first. And as for the local effect of CachedClass.getCachedSuperClass(), it is idempotent and therefore doesn't mind a race. But this won't really be a fix if CachedClass.getMethods() is called before the class is cached by ReflectionCache.getCachedClass (although if that is the case then this would be a definite improvement!). While a better solution might be getting a lock on the Class/CachedClass before getting the global lock (and I attach a possible fix for ReflectionCache), I don't have time to do further analysis now and I don't like making changes I can't test. This experimental patch includes the fix for GROOVY-2568 which has been long completed and is completely correct and tested and way overdue to be committed on the 1_5_X branch. Although unit testing for correct concurrency implementation is mostly pointless, I do have an idea for an approach using AOP that could be used for creating test cases for JIRA issues like this one. More thorough debugging and testing though will still need to be done with tools like JTest and PMD. We should also make use of Coverity. I saw their community manager David Maxwell talk at SCALE 6x last weekend and they support Java and are eager to work with any OSS project that is willing to make use of their analysis. http://scan.coverity.com/
Hide
Jim White added a comment -

Possible tweaks discussed in my comment above.

Show
Jim White added a comment - Possible tweaks discussed in my comment above.
Hide
Jim White added a comment -

Another experimental patch showing a double-check removing the global lock from around the per-class locks. Make a few other notes and since I didn't see MetaClassImpl calling RC.getCachedClass before CachedClass.getMethods, I show making that call.

Still not a real solution, but (probably) much less vulnerable than the current code.

Show
Jim White added a comment - Another experimental patch showing a double-check removing the global lock from around the per-class locks. Make a few other notes and since I didn't see MetaClassImpl calling RC.getCachedClass before CachedClass.getMethods, I show making that call. Still not a real solution, but (probably) much less vulnerable than the current code.
Hide
Jim White added a comment -

Doing the double-check put as in exp2 looks pretty good and is vastly superior to trying to lock the Class in RC.getCachedClass like exp1 shows (because, among other things, it almost certainly just shuffles the deadlock around).

I do see a problem with the patch as shown which is that it will leak the extra CachedClass instances that get constructed but not put because the CachedClass constructor puts them into the assignableMap. That is better by far than deadlocking, but reinforces the point that the initialization scheme still needs work.

So I totally retract exp1, but leave it here just for reference.

Show
Jim White added a comment - Doing the double-check put as in exp2 looks pretty good and is vastly superior to trying to lock the Class in RC.getCachedClass like exp1 shows (because, among other things, it almost certainly just shuffles the deadlock around). I do see a problem with the patch as shown which is that it will leak the extra CachedClass instances that get constructed but not put because the CachedClass constructor puts them into the assignableMap. That is better by far than deadlocking, but reinforces the point that the initialization scheme still needs work. So I totally retract exp1, but leave it here just for reference.
Hide
Jim White added a comment -

This patch resolves the issue with leaking on the double-check put. There is still a race between the put and the CachedClass.initialize call, but my current (limited) understanding of the code is that is OK.

It also cleans out a few clearly unnecessary (and potentially harmful) synchronization calls in CachedClass.

The remaining synchronized calls in CachedClass need to be eliminated (or changed to wait as the JVMS intialization procedure explains), but this is a plausible fix, at least for SVN if not yet ready for release.

Now I must get back to work on IFCX Wings...

Show
Jim White added a comment - This patch resolves the issue with leaking on the double-check put. There is still a race between the put and the CachedClass.initialize call, but my current (limited) understanding of the code is that is OK. It also cleans out a few clearly unnecessary (and potentially harmful) synchronization calls in CachedClass. The remaining synchronized calls in CachedClass need to be eliminated (or changed to wait as the JVMS intialization procedure explains), but this is a plausible fix, at least for SVN if not yet ready for release. Now I must get back to work on IFCX Wings...
Hide
Jim White added a comment -

The patch has been applied to 1_5_X.

As noted in the previous comment, this is only the first cut at a fix and may need further adjustments (for example, that race may not be OK). I'm also concerned about the case in which the SoftReference CachedClass has been collected and must be reinitialized.

What about the trunk? Shall I apply the patch there as well?

Show
Jim White added a comment - The patch has been applied to 1_5_X. As noted in the previous comment, this is only the first cut at a fix and may need further adjustments (for example, that race may not be OK). I'm also concerned about the case in which the SoftReference CachedClass has been collected and must be reinitialized. What about the trunk? Shall I apply the patch there as well?
Hide
Jim White added a comment -

Unless there is a reason not to, I will be commiting this change to the trunk soon.

Show
Jim White added a comment - Unless there is a reason not to, I will be commiting this change to the trunk soon.
Hide
blackdrag blackdrag added a comment -

first... your patch seems not to fit 1.6, ReflectionCache does not have 211 lines in 1.6.

Then about synchronizing the access to assignableMap... As I see it the functions accessing the map do not need synchronization, because they are idempotent. The problem is probably more the map itself, since the map is not threadsafe. But is synchronizing always before accessing the map the best solution to this? (this goes for 1.6 as well as 1.5.5)

Show
blackdrag blackdrag added a comment - first... your patch seems not to fit 1.6, ReflectionCache does not have 211 lines in 1.6. Then about synchronizing the access to assignableMap... As I see it the functions accessing the map do not need synchronization, because they are idempotent. The problem is probably more the map itself, since the map is not threadsafe. But is synchronizing always before accessing the map the best solution to this? (this goes for 1.6 as well as 1.5.5)
Hide
Jim White added a comment -

Yes, the patch is for 1.5.5 and was committed a couple weeks ago without incident. I will do the appropriate merge for 1.6.

Moving the synchronization to a higher level would be more performant if it is possible, but during my initial examination of the code it was not clear whether that can be done. Furthermore the current style of the code does not effectively employ that approach.

As I say above, it is my intention/desire that the whole synchronization design be redone, but I have no idea when that might happen. One of the things that is currently lacking for such work is an effective testing method (also I've got a scad of other things to focus on).

So this change is conservative and I have fair confidence that it does not make anything worse than it is, although the failure may reappear in some other part of the code now. I've asked Antonio to check the 1.5.5 snapshot and see whether it works for them now. If it does, then that is sufficient reason to close this specific case (so 1.5.5 can move ahead) and leave any further consideration of synchronization issues with 1.6.

Show
Jim White added a comment - Yes, the patch is for 1.5.5 and was committed a couple weeks ago without incident. I will do the appropriate merge for 1.6. Moving the synchronization to a higher level would be more performant if it is possible, but during my initial examination of the code it was not clear whether that can be done. Furthermore the current style of the code does not effectively employ that approach. As I say above, it is my intention/desire that the whole synchronization design be redone, but I have no idea when that might happen. One of the things that is currently lacking for such work is an effective testing method (also I've got a scad of other things to focus on). So this change is conservative and I have fair confidence that it does not make anything worse than it is, although the failure may reappear in some other part of the code now. I've asked Antonio to check the 1.5.5 snapshot and see whether it works for them now. If it does, then that is sufficient reason to close this specific case (so 1.5.5 can move ahead) and leave any further consideration of synchronization issues with 1.6.
Hide
blackdrag blackdrag added a comment -

before doing such a change, shouldn't you confirm that the deadlock still happens in 1.6? Also I would like to review the change first... by myself and I wold like to have Alex looking at it too. It is not that I don't trust you to write good code here, it is more that both Alex and myself need to see that first hand to make it better in the future.

Show
blackdrag blackdrag added a comment - before doing such a change, shouldn't you confirm that the deadlock still happens in 1.6? Also I would like to review the change first... by myself and I wold like to have Alex looking at it too. It is not that I don't trust you to write good code here, it is more that both Alex and myself need to see that first hand to make it better in the future.
Hide
Jim White added a comment -

I wouldn't need to check that the deadlock occurs in 1.6 (and can't currently since as I say we don't have a suitable test tool) if the code was the same as it was when I fixed this a couple weeks ago. The fault is apparent by inspection though as I've explained.

But I see that Alex has rewritten the 1.6 code to use java.util.concurrent. So we're just back to not knowing whether the synchronization is correct or not and have no tests. Also makes the performance issue moot since that locking code has more overhead.

Show
Jim White added a comment - I wouldn't need to check that the deadlock occurs in 1.6 (and can't currently since as I say we don't have a suitable test tool) if the code was the same as it was when I fixed this a couple weeks ago. The fault is apparent by inspection though as I've explained. But I see that Alex has rewritten the 1.6 code to use java.util.concurrent. So we're just back to not knowing whether the synchronization is correct or not and have no tests. Also makes the performance issue moot since that locking code has more overhead.
Hide
Jim White added a comment -

This fix isn't relevant to 1.6 (even though the bug may be).

Show
Jim White added a comment - This fix isn't relevant to 1.6 (even though the bug may be).
Hide
Jim White added a comment -

Closing as fixed so 1.5.5 can proceed. Will reopen if there are any reports of the problem still occurring.

Show
Jim White added a comment - Closing as fixed so 1.5.5 can proceed. Will reopen if there are any reports of the problem still occurring.

People

Vote (0)
Watch (1)

Dates

  • Created:
    Updated:
    Resolved: