This Perl script parses the root map the same way boot image scanning does and dumps a list of all the references in the root map. It may be informative to run it on the problematic root map.
btw: if anyone thinks it's worth distributing the rmap decompressor Perl script, I'd support that. It's rough and ready at the moment. The uploaded version has constants from an x86_64 build.
The system appears fairly stable with 1 virtual processor (can reliably run 5 size100 iterations of _213_javac).
With 2 virtual processors, it reliably dies in the first major GC with the above stack trace. Both tests were run on a prototype image.
I have looked at this a bit further with Dave and Daniel today. No big insights yet, I'm afraid.
Some info: The LOS references on the failed stacks are probably red herrings, given that traceObject will throw high addresses at the PLOS, so if the address were corrupted high it would end up in the LOS code.
It seems that the code is dereferencing a mis-aligned 8 byte address (30cc7c5800000000). It seems that this is happening during boot image scanning, and therefore is most likely happening due to one of two bugs:
For b) to happen, it seems that the value would have to be misaligned back at line 129 and earlier of ScanBootImage, since after that point it is only incremented by BYTES_IN_ADDRESS. However, we don't fail on line 131, which suggests that if b) is true, we're getting a little lucky (perhaps reading zero at 129?).
It would be a good idea to throw a few assertions into ScanBootImage.processChunk(). When I get the chance I'll do that. I'm still underwater right now though.
One more thing...
Dave mentioned that the system ran OK with one virtual processor, but would reliably fail with > 1 virtual processor using the same build. This suggests that the boot image and map are correctly built, but who knows. I ran with the assertions Steve checked in last night. No assertions tripped before the crash, which with 2 virtual processors happens in the first major collect.
[excalibur:/homes/excalibur/dgrove/SPECjvm98] ../buildit/rvm-trunk/dist/prototype_ppc64-aix/rvm -X:processors=2 -X:verbose -verbose:gc SpecApplication -s100 -m5 -M5 -a _213_javac
======= _213_javac Starting =======
Fatal error: Unknown hardware trap within uninterruptible region.
– Stack –
Another thing Steve and I noticed last night: sometimes the crash is while scanning the bootimage, other times it isn't. So it seems probable that the bootimage map is actually ok. There's something else going wrong that mysteriously is linked with having multiple virtual processors. It's unlikely to be PPC weak memory problems, since ppc32-aix is running just fine with multiple virtual processors. So perhaps something in the load balancing/work queue aspect of the GC isn't 64-bit safe??
Those assertions should give us a very strong clue.
It seems that in the case of the stack trace which includes the boot image, only three things could now have happened (since we know the slot was aligned and yet the contents of it is misaligned when used a little while later):
a) *slot has bad stuff inside it
I can try throwing in a few more assertions to narrow this down further. --Steve
Dave, can you please test against 14230 (just committed) and report what you see? I just added some assertions which should help further narrow things down.
Using 14230 and a BaseBaseSemiSpace image, with 2 processors it crashes as below on the first GC. With 1 processor it runs 5 iterations to completion. I'm going to try a BaseBaseMarkSweep next to see if it has something to do with the fixup after movement.
[excalibur:/homes/excalibur/dgrove/SPECjvm98] ../buildit/rvm-trunk/dist/BaseBaseSemiSpace_ppc64-aix/rvm -verbose:gc -X:processors=2 SpecApplication -s100 -m5 -M5 -a _213_javac
– Stack – – Processors – – Stack –
MarkSweep results are the same as SemiSpace. I broke the 2 parts of the assertion at line 140 into line 140 and 141. It's the second part that is failing (slot.loadObjectReference is a validRef).
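To illustrate why splitting the assertion helps, here is a minimal sketch in plain Java, not the actual ScanBootImage code. It checks the slot address and the loaded value separately, so the failure identifies which condition broke. The class name, the `firstFailedCheck` helper, the range-based notion of a valid reference, and the address bounds are all hypothetical and chosen only for illustration; the loaded value is the one reported earlier in this thread.

```java
class SlotCheckSketch {
    static final int BYTES_IN_ADDRESS = 8; // assumption: 64-bit build

    /** True if the address is a multiple of the reference size. */
    static boolean isAligned(long address) {
        return (address & (BYTES_IN_ADDRESS - 1)) == 0;
    }

    /** Hypothetical validity test: null, or inside [start, end). */
    static boolean isValidRef(long ref, long start, long end) {
        return ref == 0
            || (Long.compareUnsigned(ref, start) >= 0
                && Long.compareUnsigned(ref, end) < 0);
    }

    /** Runs the two checks in order and names the first one that fails. */
    static String firstFailedCheck(long slot, long loaded, long start, long end) {
        if (!isAligned(slot)) return "slot is misaligned";
        if (!isValidRef(loaded, start, end)) return "loaded value is not a valid reference";
        return "ok";
    }

    public static void main(String[] args) {
        long start = 0x30000000L;           // hypothetical lower bound
        long end = 0x50000000L;             // hypothetical upper bound
        long slot = 0x40001008L;            // hypothetical, 8-byte aligned slot
        long loaded = 0x30cc7c5800000000L;  // bad value reported in this thread
        System.out.println(firstFailedCheck(slot, loaded, start, end));
        // prints "loaded value is not a valid reference"
    }
}
```

With the checks separated, an aligned slot holding a bad value is reported as a bad value rather than as a generic assertion failure, which matches the observation that the second part is the one that fails.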
[excalibur:/homes/excalibur/dgrove/SPECjvm98] ../buildit/rvm-trunk/dist/BaseBaseMarkSweep_ppc64-aix/rvm -verbose:gc -X:processors=2 SpecApplication -s100 -m5 -M5 -a _213_javac
– Stack –
It might be a red herring, but according to the bootimage maps the upper 4 bytes of the bad reference correspond to the same string literal.
in the BaseBaseMarkSweep image: and in the BaseBaseSemiSpace image:
hmmm...and the only place that string literal shows up in our source code base is:
private void transferThread(VM_GreenThread t) {
Wonder if perhaps a magic being used to manipulate the mutex isn't right in 64 bits. Seems really far-fetched, but it might explain why the crash only shows up when we have more than 1 virtual processor in the mix.
Bang: bogus code found in VM_Processor
private final String[] lockReasons = VM.VerifyAssertions ? new String[100] : null;
public void registerLock(String reason) {
  VM_Magic.setObjectAtOffset(lockReasons, Offset.fromIntSignExtend(lockCount<<2), reason);
  lockCount ++;
}
Fix committed in r14231.
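One plausible reading of why `lockCount<<2` is bogus on a 64-bit build: the shift assumes 4-byte references, so an 8-byte store lands halfway across two slots. The sketch below simulates that with a plain `java.nio.ByteBuffer`, not with VM_Magic. It assumes big-endian memory (as on PPC) and 8-byte reference slots. The class and method names are hypothetical, and the reference value is simply the upper half of the bad address reported earlier in this thread. The thread does not show the actual change in r14231, so the address-sized shift here is an assumption about the kind of fix, not a copy of it.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class LockReasonStoreSketch {
    static final int LOG_BYTES_IN_ADDRESS = 3; // assumption: 64-bit build

    /** Models the body of a reference array as raw big-endian memory. */
    static ByteBuffer newArray(int slots) {
        return ByteBuffer.allocate(slots << LOG_BYTES_IN_ADDRESS)
                         .order(ByteOrder.BIG_ENDIAN);
    }

    /** Mimics the quoted code: byte offset computed with a hard-coded <<2. */
    static void storeWithHardCodedShift(ByteBuffer array, int index, long ref) {
        array.putLong(index << 2, ref);
    }

    /** Offset scaled by the reference size instead (assumed style of fix). */
    static void storeWithAddressShift(ByteBuffer array, int index, long ref) {
        array.putLong(index << LOG_BYTES_IN_ADDRESS, ref);
    }

    /** Reads slot `index` the way an 8-byte-slot scanner would. */
    static long loadSlot(ByteBuffer array, int index) {
        return array.getLong(index << LOG_BYTES_IN_ADDRESS);
    }

    public static void main(String[] args) {
        long reason = 0x30cc7c58L; // hypothetical address of the string literal

        ByteBuffer bad = newArray(4);
        storeWithHardCodedShift(bad, 1, reason);
        System.out.printf("<<2 store, slot 1: %016x%n", loadSlot(bad, 1));
        // prints "<<2 store, slot 1: 30cc7c5800000000"

        ByteBuffer good = newArray(4);
        storeWithAddressShift(good, 1, reason);
        System.out.printf("<<3 store, slot 1: %016x%n", loadSlot(good, 1));
        // prints "<<3 store, slot 1: 0000000030cc7c58"
    }
}
```

In this model the hard-coded shift writes the reference at byte offset 4, so its low 4 bytes end up in the upper half of slot 1. That is consistent with the observation that the upper 4 bytes of the bad reference match the string literal's address.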
Thanks for the help, Steve. Once it became clear from the assertions you added that the problem was that the value in the bootimage itself was wrong, I had the idea of looking at RVM.map and then got lucky.... I left the 14230 assertions in place; not sure if you wanted to pull them back out or leave them in.
Reopening so I can modify fix target to 3.0
at [0x000000004001a168] Lorg/jikesrvm/mm/mmtk/ScanBootImage; processChunk(Lorg/vmmagic/unboxed/Address;Lorg/vmmagic/unboxed/Address;Lorg/vmmagic/unboxed/Address;Lorg/vmmagic/unboxed/Address;Lorg/mmtk/plan/TraceLocal;)V at line 140
at [0x000000004001a248] Lorg/jikesrvm/mm/mmtk/ScanBootImage; scanBootImage(Lorg/mmtk/plan/TraceLocal;)V at line 79
at [0x000000004001a308] Lorg/jikesrvm/mm/mmtk/Scanning; computeBootImageRoots(Lorg/mmtk/plan/TraceLocal;)V at line 333
So, I think there's a very good chance that the code that builds up references from the encoded bootimage map is not correct on 64-bit platforms. There are a couple of suspicious 4's and int/word conversions in org.jikesrvm.mm.mmtk.ScanBootImage.