RVM-236

Detect write barriers in uninterruptible code and handle overflow gracefully

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1000
    • Component/s: MMTk
    • Labels: None
    • Number of attachments: 0

      Description

      Daniel recently fixed a problem caused by this issue after the scheduler refactoring; that fix avoided the write barrier. The original tracker entry (1147477) read:

      I actually looked at a couple of stackdumps from the failing generational images today and convinced myself that the root cause of the failures is that we are executing write barriers in uninterruptible code. This is bad because in JMTk a write barrier overflow can trigger a GC, which is exactly what uninterruptible code is trying to avoid in many cases. This is very similar to the problem of stack overflow checks in uninterruptible code that we found and killed years ago. Write barrier overflow was never an issue with the Watson collectors because they cheated and kept their write buffers in the C heap (so they could grow without triggering a GC). Notice that a few of the problem write barriers are in JMTk itself, but most are in other parts of the VM.

      The particular crashes I was looking at looked like a write barrier overflow from a write in VM_Processor.dispatch. I generated a list of the offenders (attached) in a prototype (BaseBaseGenMS) image by tweaking the code in VM_BaselineCompiler to consider a putfield of a reference type to be a violation of uninterruptibility.

      It seems to me that the possible fixes are:
      (1) declare that putfields of reference types in uninterruptible code are programming errors and rewrite the code to avoid them.
      (2) allow these putfields, but don't apply a write barrier to them.
      (3) allow them, but call a different write barrier routine that either uses slack in the barrier or grows the buffer without triggering GC by stealing space from an emergency slack space.
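
      Option (3) can be illustrated with a small model. This is only a sketch under stated assumptions: the class SlackWriteBuffer and all of its methods are hypothetical names invented for illustration, not JMTk or VM code, and "collecting" is reduced to emptying the buffer. The uninterruptible barrier spills into reserved slack and defers the GC to a later safe point instead of triggering it on overflow.

```java
// Hypothetical sketch of option (3); none of these names are JMTk or VM APIs.
final class SlackWriteBuffer {
    private final long[] slots;        // normal capacity followed by emergency slack
    private final int normalCapacity;
    private int count = 0;
    private boolean gcRequested = false;
    private int collections = 0;

    SlackWriteBuffer(int normalCapacity, int emergencySlack) {
        this.normalCapacity = normalCapacity;
        this.slots = new long[normalCapacity + emergencySlack];
    }

    /** Barrier for interruptible code: may collect when the normal region is full. */
    void recordInterruptible(long slotAddress) {
        if (count >= normalCapacity) {
            collect();
        }
        slots[count++] = slotAddress;
    }

    /** Barrier for uninterruptible code: never collects, spills into the slack. */
    void recordUninterruptible(long slotAddress) {
        if (count >= slots.length) {
            // The slack was sized too small for the uninterruptible region.
            throw new IllegalStateException("emergency slack exhausted");
        }
        slots[count++] = slotAddress;
        if (count > normalCapacity) {
            gcRequested = true; // defer the GC to the next safe point
        }
    }

    /** Called at a point where GC is allowed. */
    void safePoint() {
        if (gcRequested) {
            collect();
        }
    }

    /** Stand-in for a real collection: just empties the buffer. */
    private void collect() {
        count = 0;
        gcRequested = false;
        collections++;
    }

    int size() { return count; }
    boolean gcRequested() { return gcRequested; }
    int collections() { return collections; }
}
```

      In this model the slack has to be large enough for the longest run of reference stores in any uninterruptible region; how to bound that run is not addressed here.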

        Activity

        Ian Rogers added a comment -

        Comment copied from original tracker:

        Steve B suggested that the fix was to trigger GC asynchronously when the write buffer is "near-overflow".

        Perry observed later that this will probably force us to modify the compilers to inject yieldpoints, ensuring that no more than a fixed number of pointer writes occur before the next yieldpoint.

        On the plus side, Perry also observed that if we can make write barriers a non-GC point, then we could eliminate a large number of points that currently have GC maps (thus reducing space impact of generational GC on machine code maps).
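
        The interaction between the "near-overflow" trigger and the yieldpoint bound can be sketched as follows. The names (NearOverflowBarrier, pointerWrite, yieldpoint) are hypothetical and chosen only for illustration. The assumption is that the compiler guarantees at most maxWritesBetweenYieldpoints pointer writes between two yieldpoints; the threshold then leaves exactly that much headroom, so the barrier only sets a flag and never has to collect.

```java
// Hypothetical model of an asynchronous "near-overflow" trigger; not VM code.
final class NearOverflowBarrier {
    private final int capacity;
    private final int threshold;   // capacity minus the headroom needed between yieldpoints
    private int used = 0;
    private boolean gcPending = false;
    private int collections = 0;

    NearOverflowBarrier(int capacity, int maxWritesBetweenYieldpoints) {
        if (maxWritesBetweenYieldpoints < 1 || maxWritesBetweenYieldpoints >= capacity) {
            throw new IllegalArgumentException("headroom must be in [1, capacity)");
        }
        this.capacity = capacity;
        this.threshold = capacity - maxWritesBetweenYieldpoints;
    }

    /** Write barrier: not a GC point, it only records the write and sets a flag. */
    void pointerWrite() {
        if (used >= capacity) {
            // Only reachable if the yieldpoint bound was violated.
            throw new IllegalStateException("write buffer overflow");
        }
        used++;
        if (used >= threshold) {
            gcPending = true;
        }
    }

    /** Yieldpoint: the only place where the pending GC actually runs. */
    void yieldpoint() {
        if (gcPending) {
            used = 0;          // stand-in for processing the buffer during GC
            gcPending = false;
            collections++;
        }
    }

    int used() { return used; }
    int collections() { return collections; }
}
```

        If no GC is pending at a yieldpoint, fewer than threshold entries are in use, so the next maxWritesBetweenYieldpoints writes cannot exceed capacity. That is the invariant the injected yieldpoints would need to preserve.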

        Daniel Frampton added a comment -

        The current system (according to my understanding) has implemented the async trigger on 'near-overflow' solution for a long time.

        Write barriers still occur while scheduling threads; the fix for the scheduling issue was to remove the potential recursion when triggering an async GC.

        The triggering of an async event involved scheduling threads, and as the write barrier could happen during scheduling, this caused an invalid recursive use of locks.
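
        The recursion described above can be modelled in a few lines. This is a hypothetical, single-threaded sketch and not the actual fix: AsyncGcTrigger, onNearOverflow and the boolean standing in for a non-reentrant scheduler lock are all invented for illustration. It shows one way to keep a barrier that fires during scheduling from re-entering the trigger.

```java
// Hypothetical single-threaded model; not the actual scheduler code.
final class AsyncGcTrigger {
    private boolean schedulerLockHeld = false; // models a non-reentrant lock
    private boolean triggering = false;        // guard against recursive triggers
    private int collectorWakeups = 0;

    /** Called by the write barrier when the buffer is near overflow. */
    void onNearOverflow() {
        if (triggering) {
            return; // a trigger is already in progress; do not re-enter the scheduler
        }
        triggering = true;
        try {
            scheduleCollector();
        } finally {
            triggering = false;
        }
    }

    private void scheduleCollector() {
        lockScheduler();
        try {
            collectorWakeups++;
            // Scheduling stores references, so a write barrier can fire here
            // and report near-overflow again while the lock is still held.
            onNearOverflow();
        } finally {
            schedulerLockHeld = false;
        }
    }

    private void lockScheduler() {
        if (schedulerLockHeld) {
            throw new IllegalStateException("recursive use of scheduler lock");
        }
        schedulerLockHeld = true;
    }

    int collectorWakeups() { return collectorWakeups; }
}
```

        Without the triggering guard, the nested call in this model would reach lockScheduler a second time and fail, which mirrors the invalid recursive use of locks.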


          People

          • Assignee: Unassigned
          • Reporter: Ian Rogers
          • Votes: 0
          • Watchers: 0
