RVM-341

Improved copying in VM_Memory

    Details

    • Number of attachments: 3

      Description

      r13857 improved memory copying on Intel with SSE2 by using 64-bit copies rather than 32-bit copies. This produced a large number of speedups:

      http://jikesrvm.anu.edu.au/cattrack/results/rvmx86lnx32.anu.edu.au/perf/1790/performance_report

      most notably on SPECjbb2000. There is low-hanging fruit to improve this further, for example by using 128-bit copies and by using more than one register to do the copying.
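
The gain from wider copies comes from moving more bytes per load/store pair. The sketch below is a minimal illustration in plain Java, not the VM_Memory code: the names WideCopySketch and copyWide are hypothetical, and it uses 64-bit long accesses through a ByteBuffer because plain Java cannot express the 128-bit SSE2 register copies suggested above. It shows only the bulk-then-tail structure such a copy loop typically has, and it assumes the source and destination ranges do not overlap.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration; not part of Jikes RVM.
final class WideCopySketch {
  /**
   * Copies len bytes from src to dst, 8 bytes at a time where possible.
   * Assumes the two ranges do not overlap.
   */
  static void copyWide(byte[] src, int srcPos, byte[] dst, int dstPos, int len) {
    ByteBuffer in = ByteBuffer.wrap(src).order(ByteOrder.nativeOrder());
    ByteBuffer out = ByteBuffer.wrap(dst).order(ByteOrder.nativeOrder());
    int i = 0;
    // Bulk phase: one 64-bit load and one 64-bit store per 8 bytes.
    for (; i + 8 <= len; i += 8) {
      out.putLong(dstPos + i, in.getLong(srcPos + i));
    }
    // Tail phase: the remaining 0 to 7 bytes, one byte at a time.
    for (; i < len; i++) {
      dst[dstPos + i] = src[srcPos + i];
    }
  }
}
```

Using a second register, as the description suggests, would correspond to unrolling the bulk loop so that two independent load/store pairs are in flight per iteration.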

      1. arraycopy-options.patch
        7 kB
        Steve Blackburn
      2. memcpyTest.java
        0.5 kB
        Filip Pizlo
      3. memcpyTestC.c
        2 kB
        Filip Pizlo

        Activity

        Filip Pizlo added a comment -

        That is disturbing. Note that I got a speedup in fVM (on Fedora 10, x86_64) from just having a check that selects either memcpy or memmove depending on whether trg==src. But in fVM there is zero overhead in making the call to memcpy/memmove, since I'm emitting C code. In RVM, sysCalls may not be so cheap, so minute differences in performance between memcpy and memmove may not matter. Or are you calling memmove using some even lower-level approach? On ia32 it should be possible to call it directly in some cases... In general, if we're running the compiler while hosted, it seems that sysCalls don't have to do the address lookup from BootRecord. I don't know whether we do that already, or whether it would matter at all.

        Have you committed? If not, can you send me a patch? I'm playing around with this as well and doing my own perf comparisons. It would be interesting if we could compare results.

        -F
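
The check described in this comment (choose the memcpy-style or the memmove-style routine depending on whether target and source are the same array) can be sketched in plain Java as follows. This is an illustrative assumption, not fVM or RVM code: CopyDispatchSketch, copyForward and copyOverlapSafe are hypothetical names standing in for memcpy and memmove.

```java
// Hypothetical illustration; not fVM or Jikes RVM code.
final class CopyDispatchSketch {
  /** Picks the cheap forward copy unless source and target are the same array. */
  static void copy(int[] src, int srcPos, int[] dst, int dstPos, int len) {
    if (src != dst) {
      // Distinct Java arrays can never overlap: memcpy-style copy is safe.
      copyForward(src, srcPos, dst, dstPos, len);
    } else {
      // Same array: the ranges may overlap, so use the memmove-style copy.
      copyOverlapSafe(src, srcPos, dstPos, len);
    }
  }

  /** memcpy stand-in: plain ascending loop, valid only without overlap. */
  static void copyForward(int[] src, int srcPos, int[] dst, int dstPos, int len) {
    for (int i = 0; i < len; i++) {
      dst[dstPos + i] = src[srcPos + i];
    }
  }

  /** memmove stand-in: chooses the direction that never reads overwritten data. */
  static void copyOverlapSafe(int[] a, int srcPos, int dstPos, int len) {
    if (dstPos <= srcPos) {
      for (int i = 0; i < len; i++) {
        a[dstPos + i] = a[srcPos + i];
      }
    } else {
      for (int i = len - 1; i >= 0; i--) {
        a[dstPos + i] = a[srcPos + i];
      }
    }
  }
}
```

The reference comparison costs a single branch, which is why it can pay off when the call itself is cheap; whether it still pays off behind a more expensive sysCall is the open question raised above.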

        Steve Blackburn added a comment -

        All I did was a filthy hack to test the above. I have no intention of committing it; I heard Tony & Daniel discussing it and thought I'd throw some numbers into the mix.

        For the memmove numbers, all I did was:
        a) change sysCopy to use memmove (for the real deal I'd add a sysMove call, but I was lazy),
        b) change all conditional calls to sysCopy to be unconditional.

        Steve Blackburn added a comment -

        Here's the very simple patch for the stuff I did. Obviously I undid some of the changes to generate the 3 variations on the head I described above.

        Steve Blackburn added a comment -

        Ugh. On reviewing the patch I see that I did not do it right :-/ Sorry. I forgot to change Memory.java. Will do so now and have results in a few hours.

        Apols for the misleading info.

        Steve Blackburn added a comment -

        The new numbers are appearing here:

        http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/arraycopy-i7/bmtime.jikes.html

        Dave was completely right. jess shows naive use of memmove to be a bad choice.

        The "memmove 512" column is the same broken setup as shown earlier. In that setup we treat all copies as non-overlapping, but only call memmove when the copy is smaller than 512. While memmove is safe for overlapping arrays, the other Java code is not (nor is it intended to be), so it is surprising that only jython crashes.

        Sorry for wasting your time with these bogus numbers! Fortunately we can still derive something interesting from the data:

        a) primitive array copy performance can significantly affect the bottom line in real benchmarks.
        b) by looking at the "naive" numbers, we can now see which benchmarks are sensitive to array copy performance.
        c) javac appears to do a lot of overlapping copies, so we win considerably by bypassing that logic (though of course doing so is incorrect).
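
Why bypassing the overlap logic is incorrect can be shown with a small sketch. It is a hedged illustration only: OverlapHazardSketch and its methods are hypothetical, and the ascending loop merely stands in for any copy path that assumes non-overlapping ranges; it is not the code from Memory.java. System.arraycopy is used as the reference because its specification requires same-array copies to behave as if they went through a temporary array.

```java
// Hypothetical illustration; not the code from Memory.java.
final class OverlapHazardSketch {
  /** Ascending copy that silently assumes the ranges do not overlap. */
  static void naiveForwardCopy(int[] a, int srcPos, int dstPos, int len) {
    for (int i = 0; i < len; i++) {
      a[dstPos + i] = a[srcPos + i];
    }
  }

  /** Shifts {1,2,3,4,5} right by one using the naive loop. */
  static int[] demoNaive() {
    int[] a = {1, 2, 3, 4, 5};
    naiveForwardCopy(a, 0, 1, 4); // each store clobbers the next element to be read
    return a; // {1, 1, 1, 1, 1}
  }

  /** Performs the same shift with the overlap-aware library routine. */
  static int[] demoArraycopy() {
    int[] a = {1, 2, 3, 4, 5};
    System.arraycopy(a, 0, a, 1, 4);
    return a; // {1, 1, 2, 3, 4}
  }
}
```

A benchmark that performs many such same-array shifts would run faster with the naive path, but it would be operating on corrupted data.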


          People

          • Assignee: Unassigned
          • Reporter: Ian Rogers