Notwithstanding the fix for this deadlock, the problem (or at least a similar deadlock) still exists, and it is almost certainly due to a serious bug in ChannelImpl.doRemove.
Note that our load-test environment involves many (~100) Bayeux channels and around 100 clients repeatedly connecting and disconnecting simultaneously. However, the problems can be seen by examining the code, and much simpler tests can be devised.
Starting with the issue that is causing the deadlock, and decreasing in seriousness, here are the problems I've found in the doRemove method:
1. The first conditional, "_children.containsKey(key)", is incorrect and produces false positives because it only looks at the last segment of the channel name, which is not guaranteed to be unique. Quite likely the deadlock occurs because the wrong ChannelImpl instance is locked in the subsequent "synchronized" blocks.
2. Memory leak: the parents of the removed child channel are never released. Remember that ChannelImpl.addChild allocates new channels to fill out the tree structure (one level per segment).
3. Seemingly unnecessary locks: I can't see any reason for needing both synchronized(this) and synchronized(child), especially not the former. Neither lock protects against _children.remove(key) being called twice for the same parent/child pair, removing a channel it shouldn't (this could happen if doRemove is called with the same arguments on two threads).
4. doRemove should do a per-segment (logarithmic) search rather than a depth-first search, since the n-ary tree structure already exists.
5. doRemove should be a private method, since it is an implementation detail. The problem with it being public is that you may need extra locks you wouldn't need if it were private, since a public method can be called in unexpected ways.
6. ChannelImpl.addChild() has code that does nothing: "ChannelImpl branch = _children.get(next);"
I have some suggestions and further explanations around these issues...
1 & 4. These could be solved with the same fix: branch on the children based on the current segment, which would be both faster and correct.
If you want an example that clearly illustrates the problem with only looking at the last segment, consider these 2 channels: /x/y/123 and /a/b/123.
Unsubscribing from /x/y/123 may have the right result, but it may also cause the /a/b/123 child to be removed from the /a/b channel. The outcome is pseudo-random, depending on how the hash table orders the channels.
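To make the suggested fix concrete, here is a minimal, self-contained sketch (not the real ChannelImpl; the class and method shapes are hypothetical) showing doRemove branching on the segment at the current depth, so removing /x/y/123 can never touch /a/b/123:

```java
import java.util.HashMap;
import java.util.Map;

class SegmentTreeDemo {
    private final Map<String, SegmentTreeDemo> _children = new HashMap<>();

    // Walk/create one node per segment, as ChannelImpl.addChild does.
    SegmentTreeDemo addChild(String[] segs, int depth) {
        if (depth == segs.length) return this;
        SegmentTreeDemo child =
                _children.computeIfAbsent(segs[depth], k -> new SegmentTreeDemo());
        return child.addChild(segs, depth + 1);
    }

    // Branch on the segment at the current depth instead of testing
    // _children.containsKey(lastSegment): only the child on the matching
    // path is ever touched.
    boolean doRemove(String[] segs, int depth) {
        SegmentTreeDemo child = _children.get(segs[depth]);
        if (child == null) return false;            // no such channel
        if (depth + 1 == segs.length) {
            _children.remove(segs[depth]);          // remove the target leaf
            return true;
        }
        return child.doRemove(segs, depth + 1);
    }

    boolean contains(String[] segs, int depth) {
        if (depth == segs.length) return true;
        SegmentTreeDemo child = _children.get(segs[depth]);
        return child != null && child.contains(segs, depth + 1);
    }

    public static void main(String[] args) {
        SegmentTreeDemo root = new SegmentTreeDemo();
        String[] xy = {"x", "y", "123"};
        String[] ab = {"a", "b", "123"};
        root.addChild(xy, 0);
        root.addChild(ab, 0);
        root.doRemove(xy, 0);
        if (root.contains(xy, 0))
            throw new AssertionError("/x/y/123 should be gone");
        if (!root.contains(ab, 0))
            throw new AssertionError("/a/b/123 must survive; it only shares its last segment");
        System.out.println("ok");
    }
}
```

Because the lookup at each level is a direct map get on the current segment, this is also the per-segment search suggested in issue 4.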
2. Remember that a full tree structure of channels is created. When you subscribe to /a/b/c, ChannelImpl.addChild creates these channels: /a, /a/b, and /a/b/c.
A simple fix ought to go in the tail of the doRemove recursion: if the result of child.doRemove is true, then also try to remove the child (i.e., check its persistence flag, subscriber count, and child-channel count).
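A sketch of that pruning step, under the assumption that removability means "not persistent, no subscribers, no children" (the _subscribers and _persistent fields here are hypothetical stand-ins for the real state in ChannelImpl):

```java
import java.util.HashMap;
import java.util.Map;

class PruneDemo {
    private final Map<String, PruneDemo> _children = new HashMap<>();
    int _subscribers;        // hypothetical stand-in for the subscriber set
    boolean _persistent;     // hypothetical stand-in for the persistence flag

    PruneDemo addChild(String[] segs, int depth) {
        if (depth == segs.length) return this;
        PruneDemo child =
                _children.computeIfAbsent(segs[depth], k -> new PruneDemo());
        return child.addChild(segs, depth + 1);
    }

    boolean isRemovable() {
        return !_persistent && _subscribers == 0 && _children.isEmpty();
    }

    // Returns true once the target channel has been found and removed.
    boolean doRemove(String[] segs, int depth) {
        PruneDemo child = _children.get(segs[depth]);
        if (child == null) return false;
        boolean removed = (depth + 1 == segs.length)
                || child.doRemove(segs, depth + 1);
        // The leak fix: on the way back up, also drop the child when
        // nothing keeps it alive (not persistent, no subscribers,
        // no remaining children).
        if (removed && child.isRemovable())
            _children.remove(segs[depth]);
        return removed;
    }

    boolean isEmpty() { return _children.isEmpty(); }

    public static void main(String[] args) {
        PruneDemo root = new PruneDemo();
        root.addChild(new String[]{"a", "b", "c"}, 0);
        root.doRemove(new String[]{"a", "b", "c"}, 0);
        if (!root.isEmpty())
            throw new AssertionError("intermediate channels /a and /a/b leaked");
        System.out.println("ok");
    }
}
```

After removing /a/b/c, the now-empty /a/b and /a are pruned on the way back up the recursion, which is exactly the leak described in issue 2.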
3. I may be wrong here, but I can't find any purpose for the "synchronized(this)" lock in doRemove, and I'm not convinced that "synchronized(child)" is taken in the right place. If you can refactor the locks out, this code will be much simpler, and it will be clearer that there are no race conditions.
5. This is related to 3: in order to remove the locks from doRemove, you will probably need to make the method private, which would allow you to put a single lock outside the recursive doRemove call. This would be especially good to do while fixing the memory leak (#2).
Currently AbstractBayeux seems to be the only other class that calls doRemove, so you should be able to start with default (package-private) scoping and then refactor the call sequence of unsubscribe and removeChannel so that doRemove becomes entirely private.
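The shape I have in mind is roughly the following. This is only an illustrative sketch, not the real AbstractBayeux/ChannelImpl API; in particular, the single tree-wide lock object is an assumption about how the locking could be restructured:

```java
import java.util.HashMap;
import java.util.Map;

class LockedTreeDemo {
    // Illustrative: one lock guarding the whole tree, taken only at the
    // public entry points, never inside the recursion.
    private static final Object _lock = new Object();
    private final Map<String, LockedTreeDemo> _children = new HashMap<>();

    public void addChannel(String id) {
        synchronized (_lock) {
            LockedTreeDemo node = this;
            for (String seg : id.substring(1).split("/"))
                node = node._children.computeIfAbsent(seg, k -> new LockedTreeDemo());
        }
    }

    public boolean removeChannel(String id) {
        synchronized (_lock) {       // the single lock, outside the recursion
            return doRemove(id.substring(1).split("/"), 0);
        }
    }

    // Private and unsynchronized: it can only run while _lock is held, so
    // two threads can no longer race to remove the same child twice.
    private boolean doRemove(String[] segs, int depth) {
        LockedTreeDemo child = _children.get(segs[depth]);
        if (child == null) return false;
        if (depth + 1 == segs.length) {
            _children.remove(segs[depth]);
            return true;
        }
        return child.doRemove(segs, depth + 1);
    }

    public static void main(String[] args) {
        LockedTreeDemo root = new LockedTreeDemo();
        root.addChannel("/a/b");
        if (!root.removeChannel("/a/b"))
            throw new AssertionError("first remove should succeed");
        if (root.removeChannel("/a/b"))
            throw new AssertionError("second remove should fail cleanly");
        System.out.println("ok");
    }
}
```

With this shape, a duplicate doRemove on two threads simply serializes on the entry lock, and the second call returns false instead of mutating the tree twice.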
6. This has nothing to do with ChannelImpl.doRemove, but I noticed this code while debugging and it would be a trivial cleanup:
ChannelImpl branch = _children.get(next); // result is not used
The initialization of branch via _children.get() is not needed, since the next line is what gets or creates the branch.