RVM
  1. RVM
  2. RVM-755

Tweaks to optimizing compilation to improve the performance of write barriers

    Details

    • Type: Improvement Improvement
    • Status: Open Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 3.2
    • Component/s: Compiler: Optimizing
    • Labels:
      None
    • Number of attachments :
      3

      Description

      We don't allow copy propagation of physical registers, however, in the case of the processor register this appears overly cautious. In the current generational putfield write barrier we get:

      t1 = PR // define t1 to PR
      if ....
      return // likely
      else
      ... = t1 // unlikely - use of t1

      as t1 is alive in the slow path the definition must occur at the head of the method, so we copy PR redundantly for it only to be used if we get into the write barrier slow path. We should just copy propagate PR and save the allocation of t1.

      Similarly the test at the head of the put field write barrier is often

      ...
      t2 = l1 + constant_offset
      if t2 < constant_start_of_nursery
      ...

      which can be folded in BURS to:

      if l1 < constant_start_of_nursery - constant_offset

      this is already performed in expression folding, but currently disabled.

      1. all_ir_copy_prop.out
        263 kB
        Ian Rogers
      2. all_ir_no_copy_prop.out
        328 kB
        Ian Rogers
      3. copy-prop-pr.patch
        3 kB
        Ian Rogers

        Activity

        Hide
        David Grove added a comment -

        copy prop of physical registers is disabled because you can't copy prop the value of PR across a potential thread switch point.

        we "discovered" this the hard way in 2000 when we had a low probability bug that cost about 2 person months to find.....if a thread switch actually gets taken, then the value of PR has to change to be re-read otherwise the thread ends up using the "old" value of PR and you get two threads accessing thread-local data structures hanging off the PR. This is a real nasty low probability bug to find....

        I think it would be much safer to not try to optimize these and write the barrier code such that PR is explicitly referenced each time it is needed (avoid the assignment of PR to t1).

        Show
        David Grove added a comment - copy prop of physical registers is disabled because you can't copy prop the value of PR across a potential thread switch point. we "discovered" this the hard way in 2000 when we had a low probability bug that cost about 2 person months to find.....if a thread switch actually gets taken, then the value of PR has to change to be re-read otherwise the thread ends up using the "old" value of PR and you get two threads accessing thread-local data structures hanging off the PR. This is a real nasty low probability bug to find.... I think it would be much safer to not try to optimize these and write the barrier code such that PR is explicitly referenced each time it is needed (avoid the assignment of PR to t1).
        Hide
        Ian Rogers added a comment -

        I'm not sure I follow. The PR is only ever going to be a RHS as it is never defined. If we replace uses of t1 with PR, we are causing PR to be read more frequently (ie in all the places t1 was previously read). If anything the copying of the processor register appears to improve the guarantee of processor locality and not weakening it.

        In other words, I'm not sure of the situation you are describing as you are introducing copies of the processor register whereas I am trying to eliminate them.

        Show
        Ian Rogers added a comment - I'm not sure I follow. The PR is only ever going to be a RHS as it is never defined. If we replace uses of t1 with PR, we are causing PR to be read more frequently (ie in all the places t1 was previously read). If anything the copying of the processor register appears to improve the guarantee of processor locality and not weakening it. In other words, I'm not sure of the situation you are describing as you are introducing copies of the processor register whereas I am trying to eliminate them.
        Hide
        David Grove added a comment -

        JIRA dropped my first (longer) comment.

        Yes, what I'm describing is the dual of what you propose. But, I still think modifying ("optimizing") life times of physical registers in the opt compiler that were introduced by magic is a risky idea and that we should instead re-write the barrier code. If the programmer is writing code that mentions PR explicitly we should do exactly what they tell us to.

        Show
        David Grove added a comment - JIRA dropped my first (longer) comment. Yes, what I'm describing is the dual of what you propose. But, I still think modifying ("optimizing") life times of physical registers in the opt compiler that were introduced by magic is a risky idea and that we should instead re-write the barrier code. If the programmer is writing code that mentions PR explicitly we should do exactly what they tell us to.
        Hide
        Ian Rogers added a comment -

        Patch allowing the copy propagation of the processor register. Passes pre-commit tests on Intel and PPC (single processor).

        Show
        Ian Rogers added a comment - Patch allowing the copy propagation of the processor register. Passes pre-commit tests on Intel and PPC (single processor).
        Hide
        Ian Rogers added a comment -

        All IR traces showing the problem of not having copy propagation.

        Show
        Ian Rogers added a comment - All IR traces showing the problem of not having copy propagation.
        Hide
        David Grove added a comment -

        single processor tests won't show anything obviously....bugs in "optimizing" PR usage only show up with multiple virtual processors and usually also require fairly large degrees of real concurrency (SMP) as well.

        Show
        David Grove added a comment - single processor tests won't show anything obviously....bugs in "optimizing" PR usage only show up with multiple virtual processors and usually also require fairly large degrees of real concurrency (SMP) as well.
        Hide
        Ian Rogers added a comment -

        I think there are 2 points of view:

        1) the processor register is special and shouldn't behave like a GPR, people should explicitly use it and take care that optimizations like copy propagation won't occur with it (this means everywhere you can you'll need to write Magic.getProcessorRegister)

        2) the processor register is special (different from other registers) and can be optimized by the compiler in a small number of cases where the optimization should be valid (ie. copy propagation is sound, but say spilling the processor register isn't)

        neither view is perfect. View 1 is contrary to how most of the code base is currently written, we bottom out in a call that does something magic. We don't really want long methods of hand crafted magic where a more OO style may be more convenient, adaptable... View 2 isn't perfect as there is a possibility for more bugs in the opt compiler, we need to remember the PR (soon to be thread register) can be optimized in some ways and not in others.

        My feeling is that our code base, in particular MMTk where this problem was highlighted, assumes view 2 more than view 1. Indeed I think a change to make view 1 held would be too pervasive and expose part of Jikes RVM in MMTk that'd be undesirable. Maybe the best way to move this forward is to apply the patch on the quarantine branch and monitor its effect.

        There should be some urgency to dealing with this issue as we are carrying around a dead register definition per write barrier currently. With the patch or with an implementation of view 1, we could allocate that register for something useful and we can save compilation time as well.

        Show
        Ian Rogers added a comment - I think there are 2 points of view: 1) the processor register is special and shouldn't behave like a GPR, people should explicitly use it and take care that optimizations like copy propagation won't occur with it (this means everywhere you can you'll need to write Magic.getProcessorRegister) 2) the processor register is special (different from other registers) and can be optimized by the compiler in a small number of cases where the optimization should be valid (ie. copy propagation is sound, but say spilling the processor register isn't) neither view is perfect. View 1 is contrary to how most of the code base is currently written, we bottom out in a call that does something magic. We don't really want long methods of hand crafted magic where a more OO style may be more convenient, adaptable... View 2 isn't perfect as there is a possibility for more bugs in the opt compiler, we need to remember the PR (soon to be thread register) can be optimized in some ways and not in others. My feeling is that our code base, in particular MMTk where this problem was highlighted, assumes view 2 more than view 1. Indeed I think a change to make view 1 held would be too pervasive and expose part of Jikes RVM in MMTk that'd be undesirable. Maybe the best way to move this forward is to apply the patch on the quarantine branch and monitor its effect. There should be some urgency to dealing with this issue as we are carrying around a dead register definition per write barrier currently. With the patch or with an implementation of view 1, we could allocate that register for something useful and we can save compilation time as well.
        Hide
        David Grove added a comment -

        I don't see that as being this urgent.

        I also see it becoming much less of a problem with native threads. In native threads, the PR won't change for the thread in the same ways as it does with green-threading. Let's get native threads working and then worry about this.

        Show
        David Grove added a comment - I don't see that as being this urgent. I also see it becoming much less of a problem with native threads. In native threads, the PR won't change for the thread in the same ways as it does with green-threading. Let's get native threads working and then worry about this.
        Hide
        David Grove added a comment -

        Duplicating comment from core list dicusssion:

        I'm objecting to letting the opt compiler copy propagate physical registers as I think it has some potential for introducing bugs in code that is using Magic to manipulate the processor register. The problem is that MMTk gets hold of the PR register through a series of function calls. By the time we inline all of those, we are left with an assignment of PR to a temporary variable (the return value from the inlined function). Ian is right that if we allow copy propagation to be applied to this IR, then we will get better code.

        My reluctance to do this optimization is that the opt compiler will also copy propagate in cases where the programmer has explicitly assigned PR to a local variable because they want to hold on to a particular value in that register and use it later.

        It is awkward for the opt compiler to be able to distinguish between code where the programmer is explicitly putting PR in a local variable vs. where it is only in a local variable as a result of inlining.

        The invasive change is that I'd prefer to force the Magic programmer to invoke the magic to get the PR everywhere they want it in the inline sequence instead of wrapping layers of function calls.

        To be clear, I'm not saying this change is a good idea (or one that we should do), but having been burned badly in the past by the opt compiler trying to be to smart and optimize code sequences that arise from VM magic I would like us to not enable any such optimization unless we really understand and document what implications it has for the correct and bug free use of magic.

        I'm also putting this mail in the tracker...

        --dave

        Show
        David Grove added a comment - Duplicating comment from core list dicusssion: I'm objecting to letting the opt compiler copy propagate physical registers as I think it has some potential for introducing bugs in code that is using Magic to manipulate the processor register. The problem is that MMTk gets hold of the PR register through a series of function calls. By the time we inline all of those, we are left with an assignment of PR to a temporary variable (the return value from the inlined function). Ian is right that if we allow copy propagation to be applied to this IR, then we will get better code. My reluctance to do this optimization is that the opt compiler will also copy propagate in cases where the programmer has explicitly assigned PR to a local variable because they want to hold on to a particular value in that register and use it later. It is awkward for the opt compiler to be able to distinguish between code where the programmer is explicitly putting PR in a local variable vs. where it is only in a local variable as a result of inlining. The invasive change is that I'd prefer to force the Magic programmer to invoke the magic to get the PR everywhere they want it in the inline sequence instead of wrapping layers of function calls. To be clear, I'm not saying this change is a good idea (or one that we should do), but having been burned badly in the past by the opt compiler trying to be to smart and optimize code sequences that arise from VM magic I would like us to not enable any such optimization unless we really understand and document what implications it has for the correct and bug free use of magic. I'm also putting this mail in the tracker... --dave
        Hide
        Ian Rogers added a comment -

        Can we resolve this by documenting that copy propagation of the processor register relies on the processor being @NonMoving and then also document that in the Processor (soon to be RVMThread, but it still holds) that the NonMoving allows amongst other things copy propagation of the processor register? Looking through the history, Processor was made NonMoving during the merge of read barriers.

        Show
        Ian Rogers added a comment - Can we resolve this by documenting that copy propagation of the processor register relies on the processor being @NonMoving and then also document that in the Processor (soon to be RVMThread, but it still holds) that the NonMoving allows amongst other things copy propagation of the processor register? Looking through the history, Processor was made NonMoving during the merge of read barriers.
        Hide
        David Grove added a comment -

        With green threads, Non-Moving doesn't help. IThe opt compiler has to understand that all potential thread switch points can cause the value of PR to change.

        With native threads, things are simpler and we can probably document usage and do something like this.

        Show
        David Grove added a comment - With green threads, Non-Moving doesn't help. IThe opt compiler has to understand that all potential thread switch points can cause the value of PR to change. With native threads, things are simpler and we can probably document usage and do something like this.
        Hide
        Ian Rogers added a comment -

        Then we're back to the issue that it would be invalid to have fewer reads of the PR but having more reads is fine. As r15297 causes more reads of PR (not less) then it is valid. The only time it'd be unexpected would be for code like (NB method names not strictly adhered to):

        Processor p = Magic.getProcessor();
        RemSet r = p.remset;
        ... // possible yield
        assert (r == p.remset);

        But this code would be broken at other levels.

        Show
        Ian Rogers added a comment - Then we're back to the issue that it would be invalid to have fewer reads of the PR but having more reads is fine. As r15297 causes more reads of PR (not less) then it is valid. The only time it'd be unexpected would be for code like (NB method names not strictly adhered to): Processor p = Magic.getProcessor(); RemSet r = p.remset; ... // possible yield assert (r == p.remset); But this code would be broken at other levels.
        Hide
        David Grove added a comment -
        Processor p = Magic.getProcessor();
        ....code that might have a yieldpoint...
        if (p != Magic.getProcessor()) {
         ....code that wants to do something if there has been a switch of virtual processor...
        }
        

        I agree it is contrived, but if the opt compiler copy propagated PR, there would be no reliable way to write this code.
        This is why I'm conservative about enabling this optimization.

        Show
        David Grove added a comment - Processor p = Magic.getProcessor(); ....code that might have a yieldpoint... if (p != Magic.getProcessor()) { ....code that wants to do something if there has been a switch of virtual processor... } I agree it is contrived, but if the opt compiler copy propagated PR, there would be no reliable way to write this code. This is why I'm conservative about enabling this optimization.

          People

          • Assignee:
            Unassigned
            Reporter:
            Ian Rogers
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated: