Details
-
Type:
Bug
-
Status:
Closed
-
Priority:
Critical
-
Resolution: Fixed
-
Affects Version/s: None
-
Fix Version/s: 3.0.1
-
Component/s: Compiler: Optimizing
-
Labels:None
-
Number of attachments :1
Description
This is the most frequent failure on the perf runs, and therefore should be high priority to resolve.
For example, see: http://jikesrvm.anu.edu.au/cattrack/results/habanero.anu.edu.au/perf/5047/production/default/perf-jvm98/db-10.
The relevant stacktrace is:
– Stack –
at Lorg/jikesrvm/runtime/RuntimeEntrypoints; deliverHardwareException(II)V at line 685
at <hardware trap>
at Lorg/jikesrvm/mm/mmtk/ObjectModel; copy(Lorg/vmmagic/unboxed/ObjectReference;I)Lorg/vmmagic/unboxed/ObjectReference; at line 50
at Lorg/mmtk/policy/CopySpace; traceObject(Lorg/mmtk/plan/TransitiveClosure;Lorg/vmmagic/unboxed/ObjectReference;I)Lorg/vmmagic/unboxed/ObjectReference; at line 187
at Lorg/mmtk/plan/generational/GenNurseryTraceLocal; traceObject(Lorg/vmmagic/unboxed/ObjectReference;)Lorg/vmmagic/unboxed/ObjectReference; at line 87
at Lorg/mmtk/plan/TraceLocal; traceObject(Lorg/vmmagic/unboxed/ObjectReference;Z)Lorg/vmmagic/unboxed/ObjectReference; at line 300
at Lorg/mmtk/plan/TraceLocal; processRootEdge(Lorg/vmmagic/unboxed/Address;Z)V at line 122
at Lorg/mmtk/plan/TraceLocal; processRoots()V at line 467
at Lorg/mmtk/plan/TraceLocal; completeTrace()V at line 479
at Lorg/mmtk/plan/generational/GenCollector; collectionPhase(SZ)V at line 101
at Lorg/mmtk/plan/generational/marksweep/GenMSCollector; collectionPhase(SZ)V at line 144
at Lorg/mmtk/plan/Phase; processPhaseStack(Z)Z at line 477
at Lorg/mmtk/plan/Phase; beginNewPhaseStack(I)Z at line 390
at Lorg/mmtk/plan/StopTheWorldCollector; collect()V at line 39
at Lorg/jikesrvm/mm/mminterface/CollectorThread; run()V at line 383
at Lorg/jikesrvm/scheduler/RVMThread; startoff()V at line 627
Virtual machine state:
-
- RVM-645.out
- 04/Sep/08 12:33 PM
- 2.79 MB
- Ian Rogers
Activity
Setting VALIDATE_REFS to true in MMTk/ext/vm/jikesrvm/org/jikesrvm/mm/mmtk/ScanThread.java and then running a profiled production image dies at the start of run 9 of a -s100 SPEC jvm 98 _209_db with:
Reading database input/db6 ... validRef: TIB outside heap, ref = 0x9fc284b4 tib = 0x00000000
TIB=0x00000000 STATUS=0x00000000
Invalid ref reported while scanning stack
— METHOD (OPT) Lspec/benchmarks/_209_db/Database;.read_db (Ljava/lang/String;)V
— fp = 0x716c4c10 code base = 0x7086e80c code offset = 0x000003af
line number = 0
0x716c4b20:REF=0x9fc284b4 TIB=0x00000000 STATUS=0x00000000 (INVALID TIB: CLASS NOT ACCESSIBLE)
0x716c4b60 (0x00000000): 0x716c4c10
0x716c4b64 (0x00000004): 0x7086ebbb
0x716c4b68 (0x00000008): 0x00005285
0x716c4b6c (0x0000000c): 0x0000000c
0x716c4b70 (0x00000010): 0x00000004
0x716c4b74 (0x00000014): 0x00000001
0x716c4b78 (0x00000018): 0x601022f4
0x716c4b7c (0x0000001c): 0x0000000c
0x716c4b80 (0x00000020): 0x7086eb34
0x716c4b84 (0x00000024): 0x00005142
0x716c4b88 (0x00000028): 0x71cc8c4c
0x716c4b8c (0x0000002c): 0x7083d00c
0x716c4b90 (0x00000030): 0x616c0be4
0x716c4b94 (0x00000034): 0x64039ee2
0x716c4b98 (0x00000038): 0x00000004
0x716c4b9c (0x0000003c): 0x00000009
0x716c4ba0 (0x00000040): 0x9fc27e04
0x716c4ba4 (0x00000044): 0x60100438
0x716c4ba8 (0x00000048): 0x719e6520
0x716c4bac (0x0000004c): 0x00000004
0x716c4bb0 (0x00000050): 0x9fc27df0
0x716c4bb4 (0x00000054): 0x60103f5c
0x716c4bb8 (0x00000058): 0x9fc27df0
0x716c4bbc (0x0000005c): 0x9fc27e44
0x716c4bc0 (0x00000060): 0x616c0be4
0x716c4bc4 (0x00000064): 0x716c4c0c
0x716c4bc8 (0x00000068): 0x0011dc38
0x716c4bcc (0x0000006c): 0x00000009
0x716c4bd0 (0x00000070): 0x9fc27e04
0x716c4bd4 (0x00000074): 0x00000004
0x716c4bd8 (0x00000078): 0x719dbdd0
0x716c4bdc (0x0000007c): 0x9fc284c0
0x716c4be0 (0x00000080): 0x9fc27e70
0x716c4be4 (0x00000084): 0x9fc244a8
0x716c4be8 (0x00000088): 0x00000009
0x716c4bec (0x0000008c): 0x9fc284c0
0x716c4bf0 (0x00000090): 0x9fc27e5c
0x716c4bf4 (0x00000094): 0x00000000
0x716c4bf8 (0x00000098): 0x9fc23ba4
0x716c4bfc (0x0000009c): 0x9fc27e44
0x716c4c00 (0x000000a0): 0x7083d00c
0x716c4c04 (0x000000a4): 0x00004b42
0x716c4c08 (0x000000a8): 0x00004b11
0x716c4c0c (0x000000ac): 0x00005209
Dumping stack starting at frame with bad ref:
– Stack –
at Lspec/benchmarks/_209_db/Database; read_db(Ljava/lang/String;)V
at Lspec/benchmarks/_209_db/Database; <init>(Ljava/lang/String;)V
at Lspec/benchmarks/_209_db/Main; run([Ljava/lang/String;)V
at Lspec/benchmarks/_209_db/Main; inst_main([Ljava/lang/String;)J
at Lspec/benchmarks/_209_db/Main; runBenchmark([Ljava/lang/String;)J
at Lspec/benchmarks/_209_db/Main; harnessMain([Ljava/lang/String;)J
at Lspec/harness/ProgramRunner; runOnce(Ljava/lang/Object;IJILjava/util/Properties;)Lspec/harness/BenchmarkTime; at line 382
at Lspec/harness/ProgramRunner; runBenchmark2()Ljava/util/Properties; at line 305
at Lspec/harness/ProgramRunner; runBenchmark()V at line 238
at Lspec/harness/ProgramRunner; run()V at line 206
at Lspec/harness/RunProgram; run(Ljava/lang/String;ZLjava/util/Properties;Lspec/harness/BenchmarkDone;)V at line 60
at LSpecApplication; runBenchmark(Ljava/lang/String;Z)V at line 255
at LSpecApplication; main([Ljava/lang/String;)V at line 171
at <invisible method>
at Lorg/jikesrvm/runtime/Reflection; outOfLineInvoke(Lorg/jikesrvm/classloader/RVMMethod;Ljava/lang/Object;[Ljava/lang/Object;Z)Ljava/lang/Object; at line 187
at Lorg/jikesrvm/runtime/Reflection; invoke(Lorg/jikesrvm/classloader/RVMMethod;Lorg/jikesrvm/runtime/ReflectionBase;Ljava/lang/Object;[Ljava/lang/Object;Z)Ljava/lang/Object; at line 78
at Lorg/jikesrvm/scheduler/MainThread; run()V at line 200
at Lorg/jikesrvm/scheduler/RVMThread; run()V at line 602
at Lorg/jikesrvm/scheduler/RVMThread; startoff()V at line 627
– Stack –
at Lorg/jikesrvm/scheduler/greenthreads/GreenProcessor; dispatch(Z)V at line 325
at Lorg/jikesrvm/scheduler/greenthreads/GreenThread; morph(Z)V at line 478
at Lorg/jikesrvm/scheduler/greenthreads/GreenThread; yield()V at line 399
at Lorg/jikesrvm/scheduler/greenthreads/GreenScheduler; yieldInternal()V at line 637
at Lorg/jikesrvm/scheduler/Scheduler; yield()V at line 275
at Lorg/jikesrvm/mm/mminterface/Handshake; requestAndAwaitCompletion(I)V at line 72
at Lorg/jikesrvm/mm/mminterface/CollectorThread; collect(Lorg/jikesrvm/mm/mminterface/Handshake;I)V at line 272
at Lorg/jikesrvm/mm/mmtk/Collection; triggerCollectionStatic(I)V at line 126
at Lorg/jikesrvm/mm/mmtk/Collection; triggerCollection(I)V at line 78
at Lorg/mmtk/plan/Plan; poll(ZLorg/mmtk/policy/Space;)Z at line 917
at Lorg/mmtk/policy/Space; acquire(I)Lorg/vmmagic/unboxed/Address; at line 397
at Lorg/mmtk/utility/alloc/LargeObjectAllocator; allocSlowOnce(III)Lorg/vmmagic/unboxed/Address; at line 99
at Lorg/mmtk/utility/alloc/Allocator; allocSlowInline(III)Lorg/vmmagic/unboxed/Address; at line 229
at Lorg/mmtk/utility/alloc/Allocator; allocSlow(III)Lorg/vmmagic/unboxed/Address; at line 209
at Lorg/mmtk/utility/alloc/LargeObjectAllocator; alloc(III)Lorg/vmmagic/unboxed/Address; at line 77
at Lorg/mmtk/plan/MutatorContext; alloc(IIIII)Lorg/vmmagic/unboxed/Address; at line 188
at Lorg/mmtk/plan/generational/GenMutator; alloc(IIIII)Lorg/vmmagic/unboxed/Address; at line 90
at Lorg/mmtk/plan/generational/marksweep/GenMSMutator; alloc(IIIII)Lorg/vmmagic/unboxed/Address; at line 90
at Lorg/jikesrvm/mm/mminterface/MemoryManager; allocateSpace(Lorg/jikesrvm/mm/mminterface/Selected$Mutator;IIIII)Lorg/vmmagic/unboxed/Address; at line 743
at Lorg/jikesrvm/mm/mminterface/MemoryManager; allocateArrayInternal(IILorg/jikesrvm/objectmodel/TIB;IIII)Ljava/lang/Object; at line 718
at Lorg/jikesrvm/mm/mminterface/MemoryManager; allocateArray(IIILorg/jikesrvm/objectmodel/TIB;IIII)Ljava/lang/Object; at line 684
at Lorg/jikesrvm/runtime/RuntimeEntrypoints; resolvedNewArray(IIILorg/jikesrvm/objectmodel/TIB;IIII)Ljava/lang/Object; at line 378
at Lspec/benchmarks/_209_db/Database; read_db(Ljava/lang/String;)V
at Lspec/benchmarks/_209_db/Database; <init>(Ljava/lang/String;)V
at Lspec/benchmarks/_209_db/Main; run([Ljava/lang/String;)V
at Lspec/benchmarks/_209_db/Main; inst_main([Ljava/lang/String;)J
at Lspec/benchmarks/_209_db/Main; runBenchmark([Ljava/lang/String;)J
at Lspec/benchmarks/_209_db/Main; harnessMain([Ljava/lang/String;)J
at Lspec/harness/ProgramRunner; runOnce(Ljava/lang/Object;IJILjava/util/Properties;)Lspec/harness/BenchmarkTime; at line 382
at Lspec/harness/ProgramRunner; runBenchmark2()Ljava/util/Properties; at line 305
at Lspec/harness/ProgramRunner; runBenchmark()V at line 238
at Lspec/harness/ProgramRunner; run()V at line 206
at Lspec/harness/RunProgram; run(Ljava/lang/String;ZLjava/util/Properties;Lspec/harness/BenchmarkDone;)V at line 60
at LSpecApplication; runBenchmark(Ljava/lang/String;Z)V at line 255
at LSpecApplication; main([Ljava/lang/String;)V at line 171
at <invisible method>
at Lorg/jikesrvm/runtime/Reflection; outOfLineInvoke(Lorg/jikesrvm/classloader/RVMMethod;Ljava/lang/Object;[Ljava/lang/Object;Z)Ljava/lang/Object; at line 187
at Lorg/jikesrvm/runtime/Reflection; invoke(Lorg/jikesrvm/classloader/RVMMethod;Lorg/jikesrvm/runtime/ReflectionBase;Ljava/lang/Object;[Ljava/lang/Object;Z)Ljava/lang/Object; at line 78
at Lorg/jikesrvm/scheduler/MainThread; run()V at line 200
at Lorg/jikesrvm/scheduler/RVMThread; run()V at line 602
at Lorg/jikesrvm/scheduler/RVMThread; startoff()V at line 627
Died in GC:
cool. can it be reproduced using opt test harness (instead of relying on the adaptive system to profile/compile things). Most likely you'll need the profiled call graph advice file to feed into it.
The next step is to get the GC map verbose info and see what is being improperly reported as a ref in the GC map at
— METHOD (OPT) Lspec/benchmarks/_209_db/Database;.read_db (Ljava/lang/String;)V
— fp = 0x716c4c10 code base = 0x7086e80c code offset = 0x000003af
Typically, once one knows that much of the hard detective work is over....
I can't reproduce it with OTH at the moment. I run with -X:irc:verbose=true and -X:opt:verbose=true then feed the OTH arguments back as a long command line, and the benchmark runs successfully. This may be due to not compiling proxy or reflection methods that aren't available to OTH. If I just opt compile read_db the benchmark runs successfully. Looking at the GC maps at the failing point:
Instruction: 61, EG ia32_call EAX([B) AF CF OF PF ZF ESP = <0+1611142780>DW,
static"< BootstrapCL, Lorg/jikesrvm/runtime/RuntimeEntrypoints; >.resolvedNewArray (IIILorg/jikesrvm/objectmodel/TIB;IIII)Ljava/lang/Object;",
EAX(I), EDX(Z) ESP (l0psa, Reg: 0) (l2psa, Reg: 0) (l5pa, Reg: 0) (l283pa, Reg: 0)
the l0, l2, l5 and l283 registers all appear to have legitimate objects within them.
I had to hand code the GC map debugging, what are the normal options for generating the profile data, feeding it back in and dumping the GC maps?
It's possible that to get it to reproduce under OTH you need to feedback in the call graph as well (use -X:aos:enable_profiling=true, -X:aos:final_report_level=2 to dump the call graph to stderr in normal profiling run). Then in the OTH run take the DCG portion of the input and feed it back to the system by using -X:aos:dcfi=<file>
To debug opt GC maps, assuming you can get it to fail with a deterministic compilation, then one would flip the variables below in LiveAnalysis to true.
private static final boolean dumpFinalLiveIntervals = false;
private static final boolean dumpFinalMaps = false;
private static final boolean dumpFixedPointResults = false;
There's also a DUMP_MAPS static final boolean defined in OptMachineCodeMap.
In order for the debug dumps to be useful, you almost always have to be able to get a reliably crash with a stable program point in the opt compiled code. Turning on the debug dumps won't change the code/GC maps that are generated but it will greatly perturb the GC timing so the program might not actually hit the bad GC map anymore (even though it is still bad...)
I think the first job is to turn these flags into an opt compiler option (with fuzzy matching and so on). I'll then try to get to the dependable failure with OTH using the profiling. I think the crash is dependable but seems to change with analysis (I've seen 3 different machine code offsets for the crash but the crash is always the same). The maps print out what are thought to be live registers, presumably the next task is to find out why one of those registers is referring to something that is either not an object or dead. The info with the crash seems to suggest that the reference is to an object that has been previously freed, what's the best way to look at the IR to work this out?
It could be the dead reference is coming from a different method (that should have marked the register as live), reducing the set of opt compiled methods implicated in the crash to one is obviously a priority but I think the number may be some what higher. Maybe automatic opt compiler validation of what are thought to be live references would be useful? (ie inject a call to validate references where ever we have a GC map). I'd like to tackle this problem in two ways, one fix the problem but also two introduce automated assertion testing and debug aids so that we rule out this kind of problem in the future. It seems injecting validation code should be quite straightforward (but more work than the validate reference check that was added to ScanThread).
injecting validation code isn't going to be that simple.....the GC maps are computed after register allocation so you can't easily inject calls (or even instructions). I guess one could try to invoke a SaveVolatile wrapper, but it's not entirely trivial.
on the options question. sure, one could make them more full fledged options.
if the crash is for the same stack location at multiple program points (offsets), then that makes it likely that the stack location is either dead, uninitialized, or got smashed by something else. There isn't very good automation for this; the main debugging approach would be to backtrace through the IR/machine code to see what is happening "around" the problem location.
So the error can be reproduced with the aos system disabled and it doesn't require a particular profile to trigger the fault. To reproduce the error the following will do it quickly:
$RVM_ROOT/dist/production_x86_64-linux/rvm -Xms26M -Xmx26M -X:gc:verbose=4 -X:aos:enable_recompilation=false org.jikesrvm.tools.oth.OptTestHarness -class spec/harness/Context -class spec/io/FileInputStream -class spec/benchmarks/_209_db/Entry -oc:O1 -oc:inline_write_barrier=false -oc:inline=false -oc:inline_new=true -methodOpt spec/benchmarks/_209_db/Database read_db - -main SpecApplication -m100 -M100 -a -s10 _209_db
disabling the inlining of new (or using O0) will fix the problem, however, the point of failure isn't an inlined allocation but a new resolved array that hasn't been inlined as the length of the array is unknown. NB remember to set the VALIDATE_REFS in ScanThread to true.
Prior to the bad reference dumping out what is happening with Memory.zero shows:
Zeroing from 0x9fb58000 to 0x9fb60000
– Stack –
at Lorg/jikesrvm/mm/mmtk/Memory; zero(Lorg/vmmagic/unboxed/Address;Lorg/vmmagic/unboxed/Extent;)V at line 153
at Lorg/mmtk/utility/heap/MonotonePageResource; allocPages(I)Lorg/vmmagic/unboxed/Address; at line 176
at Lorg/mmtk/utility/heap/PageResource; getNewPages(I)Lorg/vmmagic/unboxed/Address; at line 234
at Lorg/mmtk/policy/Space; acquire(I)Lorg/vmmagic/unboxed/Address; at line 404
at Lorg/mmtk/utility/alloc/BumpPointer; allocSlowOnce(III)Lorg/vmmagic/unboxed/Address; at line 263
at Lorg/mmtk/utility/alloc/Allocator; allocSlowInline(III)Lorg/vmmagic/unboxed/Address; at line 229
at Lorg/mmtk/utility/alloc/BumpPointer; allocSlow(Lorg/vmmagic/unboxed/Address;Lorg/vmmagic/unboxed/Address;II)Lorg/vmmagic/unboxed/Address; at line 172
at Lorg/mmtk/utility/alloc/BumpPointer; alloc(III)Lorg/vmmagic/unboxed/Address; at line 146
at Lorg/mmtk/plan/generational/GenMutator; alloc(IIIII)Lorg/vmmagic/unboxed/Address; at line 88
at Lorg/mmtk/plan/generational/marksweep/GenMSMutator; alloc(IIIII)Lorg/vmmagic/unboxed/Address; at line 90
at Lorg/jikesrvm/mm/mminterface/MemoryManager; allocateSpace(Lorg/jikesrvm/mm/mminterface/Selected$Mutator;IIIII)Lorg/vmmagic/unboxed/Address; at line 743
at Lorg/jikesrvm/mm/mminterface/MemoryManager; allocateArrayInternal(IILorg/jikesrvm/objectmodel/TIB;IIII)Ljava/lang/Object; at line 718
at Lorg/jikesrvm/mm/mminterface/MemoryManager; allocateArray(IIILorg/jikesrvm/objectmodel/TIB;IIII)Ljava/lang/Object; at line 684
at Lorg/jikesrvm/runtime/RuntimeEntrypoints; resolvedNewArray(IIILorg/jikesrvm/objectmodel/TIB;IIII)Ljava/lang/Object; at line 378
at Ljava/lang/String; <init>(Ljava/lang/StringBuffer;)V at line 505
at Ljava/lang/StringBuffer; toString()Ljava/lang/String; at line 1038
at Lspec/io/ValidityCheckOutputStream; strip(ILjava/io/InputStream;)Ljava/lang/String; at line 256
at Lspec/io/ValidityCheckOutputStream; validityCheck2(Ljava/util/Properties;I)V at line 229
at Lspec/io/ValidityCheckOutputStream; validityCheck(Ljava/util/Properties;I)Z at line 157
at Lspec/harness/ProgramRunner; runOnce(Ljava/lang/Object;IJILjava/util/Properties;)Lspec/harness/BenchmarkTime; at line 399
at Lspec/harness/ProgramRunner; runBenchmark2()Ljava/util/Properties; at line 331
at Lspec/harness/ProgramRunner; runBenchmark()V at line 238
at Lspec/harness/ProgramRunner; run()V at line 206
at Lspec/harness/RunProgram; run(Ljava/lang/String;ZLjava/util/Properties;Lspec/harness/BenchmarkDone;)V at line 60
at LSpecApplication; runBenchmark(Ljava/lang/String;Z)V at line 255
at LSpecApplication; main([Ljava/lang/String;)V at line 171
at <invisible method>
at Lorg/jikesrvm/runtime/Reflection; outOfLineInvoke(Lorg/jikesrvm/classloader/RVMMethod;Ljava/lang/Object;[Ljava/lang/Object;Z)Ljava/lang/Object; at line 187
at Lorg/jikesrvm/runtime/Reflection; invoke(Lorg/jikesrvm/classloader/RVMMethod;Lorg/jikesrvm/runtime/ReflectionBase;Ljava/lang/Object;[Ljava/lang/Object;Z)Ljava/lang/Object; at line 78
at Lorg/jikesrvm/tools/oth/OptTestHarness; executeCommand()V at line 414
at Lorg/jikesrvm/tools/oth/OptTestHarness; main([Ljava/lang/String;)V at line 434
at <invisible method>
at Lorg/jikesrvm/runtime/Reflection; outOfLineInvoke(Lorg/jikesrvm/classloader/RVMMethod;Ljava/lang/Object;[Ljava/lang/Object;Z)Ljava/lang/Object; at line 187
at Lorg/jikesrvm/runtime/Reflection; invoke(Lorg/jikesrvm/classloader/RVMMethod;Lorg/jikesrvm/runtime/ReflectionBase;Ljava/lang/Object;[Ljava/lang/Object;Z)Ljava/lang/Object; at line 78
at Lorg/jikesrvm/scheduler/MainThread; run()V at line 200
at Lorg/jikesrvm/scheduler/RVMThread; run()V at line 602
at Lorg/jikesrvm/scheduler/RVMThread; startoff()V at line 627
r14952 added some verbose debug output for the bad references and it shows that they are at the following addresses:
Failing iterators:
Iterator 0: 0x71572a30: REF=0x9fb5f1dc TIB=0x00000000 STATUS=0x00000000 (INVALID TIB: CLASS NOT ACCESSIBLE)
Iterator 1: 0x71572acc: REF=0x9fb5eb6c TIB=0x601250cc STATUS=0x00000000 TYPE=0x6126b004 CLASS=Ljava/lang/String;
Iterator 2: 0x71572ac8: REF=0x9fb5b1d4 TIB=0x7101664c STATUS=0x00000000 (INVALID TIB: CLASS NOT ACCESSIBLE)
Iterator 3: 0x71572ac4: REF=0x9fb5f1e8 TIB=0x7100924c STATUS=0x00000000 (INVALID TIB: CLASS NOT ACCESSIBLE)
which were zeroed in the previous Memory.zero. The Memory.zero (the memset it calls via a sysCall) was found to be zeroing the TIB by adding hardware watch points in gdb.
Dumping the objects and arrays as they are allocated doesn't mask the bug and shows the following:
the bad ref is:
validRef: TIB outside heap, ref = 0x9fcf7504 tib = 0x00000000
the preceeding mem zeroes latest first:
Zeroing from 0x70573000 to 0x70574000
Zeroing from 0x9fcf0000 to 0x9fcf8000 <-- zeroes space supposedly containing ref
Zeroing from 0x9fce8000 to 0x9fcf0000
preceeding allocations to bad ref but after zero of bad zeroes memory:
...
Reading database input/db2 ... New array: 0x9fcf728c
New array: 0x9fcf729c
New array: 0x9fcf72b0
New array: 0x9fcf72c4
New array: 0x9fcf72d0
New array: 0x9fcf72e0
New object: 0x9fcf72f0
New array: 0x9fcf7304
New object: 0x9fcf7510
New object: 0x9fcf7528
New array: 0x9fcf753c
New object: 0x9fcf7598
New object: 0x9fcf75b0
New object: 0x9fcf75c0
New object: 0x9fcf75cc
New object: 0x9fcf75f8
New object: 0x9fcf7608
New object: 0x9fcf7614
New object: 0x9fcf762c
New object: 0x9fcf763c
New object: 0x9fcf7650
New object: 0x9fcf765c
New object: 0x9fcf7688
New object: 0x9fcf7694
The nearest object is at 0x9fcf7510 which is allocated after an array at 0x9fcf7304, but the reference is off by 12 (0x9fcf7510 - 0x9fcf7504).
IR for failing method (also includes GC, benchmark and crash output)
So looking at the iterators:
(l0psa, Reg: 0) contains Database (l2psa, Reg: 0) contains String (l5pa, Reg: 0) contains [B (l21pa, Reg: 0) contains FileInputStream
in the prologue l5 is initialized as:
6 ia32_mov l5pa(Lorg/jikesrvm/classloader/TypeReference$NULL
= 0
following spill code this becomes:
6 ia32_mov EDI(Lorg/jikesrvm/classloader/TypeReference$NULL
= 0
however, in the inlined allocation of the spec.io.FileInputStream:
42 LABEL7 Frequency: 1.0
43 ia32_mov EDI(Lorg/vmmagic/unboxed/Address
= <[EAX(Lorg/mmtk/policy/CopyLocal;,d,p)]+-4>DW (<mem loc: Lorg/mmtk/utility/alloc/BumpPointer;.cursor>, t427sv(GUARD))
-2 ia32_mov EDI(Lorg/vmmagic/unboxed/Address
= EDI(Lorg/vmmagic/unboxed/Address![]()
-1 ia32_jmp LABEL9
-1 bbend BB7
so some how the register allocator is smashing EDI with an address, that happens to be the cursor (hence being 12 off the allocated address). The original code for BB7 is:
42 LABEL7 Frequency: 1.0
43 ia32_mov t452sa(Lorg/vmmagic/unboxed/Address
= <[t425psa(Lorg/mmtk/policy/CopyLocal;,d,p)]+-4>DW (<mem loc: Lorg/mmtk/utility/al
loc/BumpPointer;.cursor>, t427sv(GUARD))
-2 ia32_mov l453pa(Lorg/vmmagic/unboxed/Address
= t452sa(Lorg/vmmagic/unboxed/Address![]()
-1 ia32_jmp LABEL9
-1 bbend BB7
So there is something very rotten for l453 and l5 to get allocated to the same physical register (r7 - EDI).
A hypothesis is that l5 is defined in the prologue.. as it is killed on all paths the register allocator uses EDI to hold l5 and l453 in BB7, their live ranges don't overlap. However, the GC maps are naive and expect l5 to be live in the call to resolved new array. When we inspect the frame during GC we read EDI expecting to get l5 but it is holding l452, an address and therefore a bad reference.
looking at l5 over the call to resolved new array (at bytecode offset 61) the prologue definition is:
6 ia32_mov l5pa(Lorg/jikesrvm/classloader/TypeReference$NULL
= 0
immediately following the call to resolve new array is:
61 EG ia32_call EAX([B) AF CF OF PF ZF ESP = <0+1611141836>DW,
static"< BootstrapCL, Lorg/jikesrvm/runtime/RuntimeEntrypoints; >.resolvedNewArray (IIILorg/jikesrvm/objectmodel/TIB;IIII)Ljava/lang/Object;", EAX(I), EDX(Z) ESP
61 advise_esp 0
61 ia32_mov t1152a([B,x,p) = EAX([B)
61 ia32_mov l5pa([B,x,p) = t1152a([B)
so the prologue definition of l5 isn't live, it is killed on all routes through 61.. we allocate l453 to EDI and l5 to EDI but we don't kill the prologue definition of l5 and we don't remove l5 from the GC map of the call.
l5 is defined outside of a try block and therefore live at the resolved new array allocation, that is within the try block, as an exception would mean it could be read with the value of null. Linear scan needs to allocate for l453 but doesn't see l5's use of EDI as live. Is this because linear scan isn't considering the exception edges? In both normal and exceptional control flow l5 is never actually viewed having the value of null. Is linear scan smart enough to see this and ignore l5's assignment? If so we need to make the GC maps smarter so that unread but live variables don't get into the map.
So the failing code is like this (l5 == buffer):
public void read_db(String filename)
{
...
byte buffer[] = null;
...
try
catch (IOException ioe)
{
... no uses of buffer
buffer is defined at the beginning but the assignment is never used, it's clobbered in the try-block and never used in the catch block. Liveness however considers it to be live at the point "buffer = new byte[n];" and puts it in the GC map, this causes the failing stack scan as linear scan doesn't consider it to be live. It seems there are 2 fixes:
1) don't consider l5/buffer live and put it in the GC map
2) in linear scan, consider GC map uses along with uses from the GC point instructions
it seems 2 is the safest, linear scan is isolated from faults in liveness - which isn't easy to get right due to the FCFG and the intersection of PEIs and the exception class hierarchy. If we do 1 as well then any conservativeness from 2 is lost, 2 just becomes an overhead in linear scan. If we're confident that 1 is right we can disable 2 but it would be nice to have the code (at least in the SVN history).
The error can be reproduced with less opt code and on the 1st SPEC jvm iteration with the following:
$RVM_ROOT/dist/production_x86_64-linux/rvm -X:gc:stressFactor=128k -X:aos:enable_recompilation=false org.jikesrvm.tools.oth.OptTestHarness -class spec/io/FileInputStream -oc:O1 -oc:inline_write_barrier=false -oc:inline=false -oc:print_all_ir=true -oc:print_gc_maps=true -methodOpt spec/benchmarks/_209_db/Database read_db - -main SpecApplication _209_db
Summary of chat with Ian: it seems likely that the liveness information being used for GC maps is slightly wrong. The preferred fix is to get it to be more precise since the LS liveness calculation is actually correct for this code. I am going to take a look at this next week to see if I can pick up where Ian left off.
I've looked at it more carefully, and the liveness information being used for the GC maps is correct. There aren't any uses in the catch block itself, but there are uses downstream of the catch block. that appear to the opt compiler as reachable (due to imprecise modelling of checked exceptions at calls...see RVM-575: http://jira.codehaus.org/browse/RVM-575) that mean that the analysis is assuming that l5 is coming in live on the exception edge. Since if the call raises a PEI, it won't execute the def of l5, it is correct for it to have not been killed during local propagation.
So, the fix has to be figuring out why reg alloc isn't respecting this.
l5 has two live intervals in this block.
From entry to the call instruction (due to PEI liveness) and from the def of the call instruction (or actually a couple move instructions after the call) to the end of the block. The GC map at the call instruction is picking up the second one of these (EBX). It needs to pick up the first one. This is the problem.
Ignore the above comment about the two live intervals. We assign a symbolic to the same physical/spill location throughout it's lifetime.
The bug is actually in live analysis!
In LiveAnalysis.computeBlockGenAndKill we carefully compute which instruction is the first PEI in the reverse direction in the basic block, but then never actually use that information to properly build the firstPEIKillSet. The firstPEIKillSet is always empty, which means we are not properly modeling the dataflow through the FCFG for live analysis!
sigh. time to go to bed. There is something wrong with live analysis, because firstPEIKillSet is always empty (for every BB in this method), but it wasn't the simple mistake I thought it was. Got fooled by trusting eclipse search to easily...
continuing to make progress. there's a flaw in how we're doing the merging of the IN set from exception edges in the FCFG liveness calculation. The code isn't doing exactly what we described in the paper (it's trying to short circuit a few computations)...
don't have a fix yet, but I'm closing in on it.
I'm pretty confident that the bug is in the liveness analysis and that I see what it is doing wrong algorithmically.
got it. will take a while to clean up all the debug code, but I have a fix. hopefully check in tonight or tomorrow.
The problem was that in LiveAnalysis.processBlock the code was calling getContainsPEIWithHandler() before it called computeBlockGenAndKill. Unfortunately, the value for getContainsPEIWithHandler is actually computed by computeBlockGenAndKill, so it was always returning false when it was called to soon. As a result, the code was treating all IN edges from exception edges as if they were IN edges from normal control flow. This resulted in a too aggressive killing of values that were live via exception edges.
The fix is simply to move the call to computeBlockGenAndKill to the top of processBlock, then all of the info about the block is computed in time and the code works as intended.
Can also be observed on eclipse: http://jikesrvm.anu.edu.au/cattrack/results/habanero.anu.edu.au/perf/5060/production/default/perf-dacapo/eclipse-3