Key: RVM-236
Type: Improvement
Status: Open
Priority: Minor
Assignee: Unassigned
Reporter: Ian Rogers
Votes: 0
Watchers: 0
Detect write barriers in uninterruptible code and handle overflow gracefully

Created: 19/Sep/07 05:01 AM   Updated: 11/Apr/08 09:29 AM
Component/s: MMTk
Affects Version/s: None
Fix Version/s: 1000

Time Tracking:
Not Specified


Description
Daniel recently fixed a problem caused by this following the scheduler refactoring. That fix avoided the write barrier. The original tracker (1147477) went:

I actually looked at a couple of stack dumps from the failing generational images today and convinced myself that the root cause of the failures is that we are executing write barriers in uninterruptible code. This is bad because in JMTk a write barrier overflow can trigger a GC, which is exactly what uninterruptible code is trying to avoid in many cases. This is very similar to the problem of stack overflow checks in uninterruptible code that we found and killed years ago. Write barrier overflow was never an issue with the Watson collectors because they cheated and kept their write buffers in the C heap (so they could grow without triggering a GC). Note that while a few of the problem write barriers are in JMTk itself, most are in other parts of the VM.

The particular crashes I was looking at looked like a write barrier overflow from a write in VM_Processor.dispatch. I generated a list of the offenders (attached) in a prototype (BaseBaseGenMS) image by tweaking the code in VM_BaselineCompiler to consider a putfield of a reference type to be a violation of uninterruptibility.
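
The check described above can be sketched roughly as follows. This is only an illustration, not the VM_BaselineCompiler change itself: the class BarrierCheck and both method names are hypothetical. It relies on the JVM convention that field descriptors for reference types begin with 'L' (class or interface) or '[' (array).

```java
// Hypothetical helper; none of these names come from the Jikes RVM source.
final class BarrierCheck {
  /** JVM field descriptors for reference types start with 'L' or '['. */
  static boolean isReferenceDescriptor(String descriptor) {
    if (descriptor.isEmpty()) {
      return false;
    }
    char first = descriptor.charAt(0);
    return first == 'L' || first == '[';
  }

  /**
   * A putfield is flagged only when the enclosing method is uninterruptible
   * and the stored field has a reference type (so it would need a barrier).
   */
  static boolean isViolation(boolean methodIsUninterruptible, String fieldDescriptor) {
    return methodIsUninterruptible && isReferenceDescriptor(fieldDescriptor);
  }
}
```

For example, a store to an Object field in an uninterruptible method would be reported, while a store to an int field, or any store in interruptible code, would not.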

It seems to me that the possible fixes are:
(1) declare that a putfield of a reference type in uninterruptible code is a programming error and rewrite the code to avoid such stores.
(2) allow these putfields, but don't emit a write barrier for them.
(3) allow them, but call a different write barrier routine that either uses slack in the barrier or grows the buffer without triggering GC by stealing space from an emergency slack space.
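
To make option (3) concrete, here is a minimal sketch of a write buffer with a reserved emergency region. It assumes a fixed-size array of slot addresses; the class SlackWriteBuffer and its methods are hypothetical and do not correspond to JMTk code. The normal path reports "full" so the caller can trigger a GC, while the uninterruptible path spills into the reserved slack instead.

```java
// Hypothetical sketch of option (3); not JMTk's actual write buffer.
final class SlackWriteBuffer {
  private final long[] slots;
  private final int normalCapacity;
  private int count = 0;

  SlackWriteBuffer(int normalCapacity, int emergencySlack) {
    this.normalCapacity = normalCapacity;
    // The slack is allocated up front so it can be used without allocating.
    this.slots = new long[normalCapacity + emergencySlack];
  }

  /** Interruptible path: returns false when full so the caller may trigger a GC. */
  boolean record(long slotAddress) {
    if (count >= normalCapacity) {
      return false;
    }
    slots[count++] = slotAddress;
    return true;
  }

  /** Uninterruptible path: never triggers a GC; may spill into the slack. */
  void recordUninterruptible(long slotAddress) {
    if (count >= slots.length) {
      // A real VM would have to fail hard here; the slack was sized too small.
      throw new IllegalStateException("emergency slack exhausted");
    }
    slots[count++] = slotAddress;
  }

  /** True once entries have spilled past the normal capacity. */
  boolean inSlack() {
    return count > normalCapacity;
  }

  int size() {
    return count;
  }
}
```

The open design question with this approach is how large the slack must be, since nothing in the sketch bounds the number of uninterruptible writes between collections.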



Ian Rogers added a comment - 19/Sep/07 05:03 AM
Comment copied from original tracker:

Steve B suggested the fix was to use an asynch triggering of GC on write buffer "near-overflow"

Perry observed later that this is probably going to force us to modify the compilers to inject yieldpoints to ensure that no more than a fixed number of pointer writes occur before the next yieldpoint.

On the plus side, Perry also observed that if we can make write barriers a non-GC point, then we could eliminate a large number of points that currently have GC maps (thus reducing space impact of generational GC on machine code maps).
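
The interaction between the "near-overflow" trigger and the yieldpoint bound can be sketched as below. This is an illustration under stated assumptions, not MMTk code: the class NearOverflowBuffer, its threshold parameter, and the yieldpoint method are all hypothetical. The barrier only sets a flag, and the collection (represented here by simply emptying the buffer) runs at the next yieldpoint.

```java
// Hypothetical sketch of asynchronous triggering on near-overflow.
final class NearOverflowBuffer {
  private final long[] slots;
  private final int threshold;
  private int count = 0;
  private boolean gcRequested = false;

  NearOverflowBuffer(int capacity, int threshold) {
    this.slots = new long[capacity];
    this.threshold = threshold;
  }

  /** Write barrier: records the slot and at most sets a flag; never collects here. */
  void barrier(long slotAddress) {
    // If more writes occur before a yieldpoint than the headroom allows,
    // this store runs off the end of the array; that is the case the
    // yieldpoint bound is meant to rule out.
    slots[count++] = slotAddress;
    if (count >= threshold) {
      gcRequested = true;
    }
  }

  /** Yieldpoint: the only place a requested collection runs. Returns true if it ran. */
  boolean yieldpoint() {
    if (!gcRequested) {
      return false;
    }
    count = 0; // stand-in for the collector draining the buffer
    gcRequested = false;
    return true;
  }

  /** Number of further barrier writes that fit before the buffer is truly full. */
  int headroom() {
    return slots.length - count;
  }
}
```

In this sketch the gap between the threshold and the capacity must be at least the maximum number of pointer writes between two yieldpoints, which is the constraint Perry's observation points at.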


Daniel Frampton added a comment - 19/Sep/07 05:45 AM
The current system (according to my understanding) has implemented the async trigger on 'near-overflow' solution for a long time.

There are still barriers during thread scheduling; the fix for the scheduling issue was to remove the potential recursion when triggering an async GC.

The triggering of an async event involved scheduling threads, and as the write barrier could happen during scheduling, this caused an invalid recursive use of locks.
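
One common way to break this kind of recursion is a re-entrancy guard, sketched below. This is not necessarily how the actual fix was done (the comment above only says the potential recursion was removed); the class AsyncTrigger and its methods are hypothetical, and the scheduling step is simulated by a method that immediately runs the barrier path again.

```java
// Hypothetical sketch of a re-entrancy guard; not the actual Jikes RVM fix.
final class AsyncTrigger {
  private boolean triggering = false;
  private boolean pending = false;
  int scheduleCalls = 0;

  /** Called from the write barrier on near-overflow. */
  void requestCollection() {
    if (triggering) {
      // Re-entered from inside the scheduling code: just remember the
      // request instead of taking the scheduler locks a second time.
      pending = true;
      return;
    }
    triggering = true;
    try {
      scheduleCollectorThread();
    } finally {
      triggering = false;
    }
  }

  /** Stand-in for scheduling; its own reference write re-enters the barrier. */
  private void scheduleCollectorThread() {
    scheduleCalls++;
    requestCollection();
  }

  boolean isPending() {
    return pending;
  }
}
```

With the guard in place, a single request results in exactly one pass through the scheduling code, and the nested request is recorded rather than acted on.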