|
[
Permlink
| « Hide
]
Ian Rogers added a comment - 19/Mar/08 09:28 AM
I've tried to repeat this error but gcstress either passes or dies with different errors for me. I'm continuing to look.
Latest results [1] seem to show the problem has gone away (more passes than a week ago). I'm not sure what happened but, Steve, if you're happy can you close the issue. Thanks.
[1] http://jikesrvm.anu.edu.au/cattrack/results/rvmx86lnx32b.anu.edu.au/stress/3147/regression_report Hmmm. That's quite strange. I guess we should just keep an eye on it and close this issue if it doesn't resurface in the next day or two :-/
This might not be the right issue to note this, but eclipse stress test has failed 2 days in a row (http://jikesrvm.anu.edu.au/cattrack/results/rvmx86lnx32b.anu.edu.au/stress/3156/gcstress/Opt_1/dacapo/eclipse/default/Output.txt
Wondering if the problem is related to GC mapping of exception handling blocks that we are jumping do directly due to get/set caught exception handler optimizations or other stackwalking changes. No real evidence, just a vague suspicion that this was one of the areas where we had a number of bug fixes in the past to get eclipse to run solidly. The previous errors have now disappeared for two regression runs in a row. This is weird, but I suppose it's good news. :-/
I agree with Dave's conjecture. The eclipse failure is a new one, thus far is consistent, and it coincides with changes to exception handling blocks. Dave is right---getting that correct was a huge headache when we first made an effort to get eclipse running correctly. So three things have changed:
1) make stack tracing simpler 2) use simple optimizations to get rid of redundant stack traces 3) fix a latent bug in normal out computation. Now all normal outs are returned by calls rather than previously eliding outs which were to exception handlers (see I'd say that things haven't been stable until r14061, so we have one eclipse stress failure which we should watch. I believe there could be 2 causes: 1) the changes in IR have broken some other fundamental assumptions. I can't see this in the instruction elimination cases. The code is removing dead instructions (those that have no uses), the worst thing is if the calculation of dead instructions were wrong - but it's deliberately trivial so hopefully fine. The normal out change is more fundamental but I believe the new behavior is correct. Code that assumed an exception handling block was only reachable by an exception edge was always wrong. 2) latent bugs are getting exposed by having more scope for optimization. The normal out change was one example of a latent bug. What's happening is we can now reach normal outs to exception blocks without intervening "new Throwable, call Throwable.init, set_caught_exception", this can mean we can fold branches together so that branches that previously wouldn't reach exception handlers now can. This may effect assumptions within the code, for example if the register allocator assumed or relied upon certain branches not reaching exception handler blocks. This could then effect the reference maps. Was thinking similar thoughts on the drive into work this morning. I think (1) is the most likely problem. It used to be (ie, back when we initially implemented the FCFG 7 years ago....so maybe this is out of date), that a given block was only reachable via either an exception edge or a normal edge from another block. I'm not sure how the various FCFG-aware dataflow analyses (eg liveness used for reg alloc and to compute GC maps) will react when a block is directly reachable via both normal * exceptional edges.
My instinct is that if in this set of changes something happened to allow this to happen, then that is what needs to be looked at more closely. Most likely the best fix is to prevent this from happening (ie, don't try to collapse basic blocks together if doing so will result in making a block that was directly reachable via an exeptional edge from block A also directly reachable via a normal edge from block A). In most cases, this shouldn't have too much impact on the generated code because the code reorganization phase should place basic blocks in a code order that favors normal control flow, and that in turn should cause the goto instructions to drop our of the MIR. According to the history of BC2IR athrows have always been optimized to gotos thereby creating a normal edge to an exception handling block. Eliminating unnecessary stack traces has just increased the number of cases where this can occur.
I wrote the athrow optimization in bc2ir , so yes I know that.
What I was remembering (apparently backwards) was the distinction between the entryBlock and the block fields in HandlerBlockLE. At one point, we used to enter differently (entryBlock vs. block) depending on whether we were coming from the exception delivery runtime routine or from a throw converted to a goto but I guess we eventually massaged the runtime service routine that sets up the registers to deliver the exception to the handler block to establish the same invariants as the set_caught_exception operator so both cases now enter via entryBlock. Just to verify: the only call to the new method you added (setExceptionHandlerWithNormalIn) is in the athrow optimization clause of BC2IR. We are not doing a similar conversion anywhere else in the system? (I thought I read one of the emails about this set of changes to imply that a later optimization pass was making a similar conversion). If I'm not remembering that correctly, that's fine. Just want to doublecheck to make sure I understand before moving on. GC map bugs that only show up in small pieces of exception handling code are evil... So my code hasn't changed the exception handler blocks that are reached by normal edges, they remain the goto optimization of athrow from pre-2001. The change is that now when we have an athrow that is optimized to a set_caught_exception and goto it can mean the new and call to Throwable.<init> are removed (and the set_caught_exception). Having a normal block with just a goto in it can mean this block is merged, giving rise to ifcmps, etc. to exception handler blocks. The only block reachable in this way should remain the block that the original goto was to though. Without the change to basic block we incorrectly see basic blocks with an ifcmp to an exception handler and a normal block as just having 1 out.. so we try to merge it with the following block and eliminate the branch instruction (which is very bad).
It'd be nice to clarify the assumptions for the edges on the FCFG. This code is extra evil in that it tends to eliminate exceptions in hot code that relies on previous execution to cause the appropriate inlining... I've been particularly focusing on lusearch. Btw: it'd be nice to have a better elimination of set_caught_exception than the one that's in dead code elimination of Simple. We can eliminate any set_caught_exception that doesn't dominate a get_caught_exception. Are we running any dominator phases currently to tie this into? ok. i understand now. thanks.
The dominate test isn't quite right (in theory 2 set_caught_exceptions could feed into a get_caught_exception, neither one dominates it, but they both are needed). If we wanted to model this, I think we should think about a pseudo-location/variable CURRENT_EXCEPTION that is set by SET_CAUGHT_EXCEPTION and by all "taken" PEIs (the exception out-edge from a PEI) and was read by GET_CAUGHT_EXCEPTION. In effect, there's a hidden variable in the program state that we aren't really modeling as dataflow right now. If we made the dataflow explicit, then the normal def/use based optimizations would do the right thing I thnk. I like the idea of a variable that is set by PEIs and read in exception handler blocks. In particular this would allow us to propagate type and other information into the exception handler blocks. I imagine this could complicate SSA construction? Modeling the caught exception as a load/store may be more appealing (which is what set/get caught exception in effect are). I agree about dominance, it'd be nice to have a simple test that shows the set is redundant. Maybe if we can model the current exception as a variable we can do this. I wouldn't want to start re-engineering something as fundamental as this without an idea of how it would impact other compiler phases, in particular the SSA phases that are currently disabled. We're also not currently running any O3 regression tests.
As of 14605 eclipse stress is failing again.
http://jikesrvm.anu.edu.au/cattrack/results/rvmx86lnx32b.anu.edu.au/stress/3196/regression_report I think you mean r14065 which is odd as that commit reduces the amount of stack trace elimination that occurs - which is possibly responsible for the regression.
I'm not really following the details here, but if you look at the regression report it seems that this particular eclipse problem comes and goes. For example, compare 3176, 3186 and 3196, each of which is against 14065. In 3176 eclipse passed, but in the other two it failed. Likewise if you compare 3156 and 3166, they only differ by 14062, which seems unrelated, yet 3156 fails and 3166 passes.
I went and looked back a bit further at eclipse's stress history, and it seems that the bug we're seeing here is indeed a new one. In short, I think we have a new non-deterministic bug, apparently introduced between 3135 and 3147. This suggests that the problem arose in one of 14054, 14055 and 14056. r14054 was backed out and a different, simpler, change put in r14060. r14055 and r14056 are minor changes. The changes are already covered in this tracker. I can try disabling r14060 to see if stress tests clear up.
If you can easily do that, I think that would be very helpful diagnostically.
Given how noisy the failures have been thus far, it might be wise to leave it disabled for at least three days to gain confidence that disabling it has removed the source of the problem. Moving to 3.0.1 so we remember sort this out. In general, I think we should really only have features targeted at 1000. bugs should be targeted to a more concrete release.
Am I right in thinking this issue is out-of-date?
I guess it depends. r14068 disabled an optimization to avoid a bug. If we want to get that optimization back on, then we need a tracker item to remind us of that. We could either morph this one into that by changing the title, or one could open a new one to enable the optimization and backpoint to this item to remind us what stress test/issue needs to be looked at before deciding the optimization is safe to re-enable.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||