RVM-885

Alternative performance counter implementation

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.4
    • Component/s: Infrastructure
    • Labels:
      None
    • Environment:
      Newer Linux kernels (>= 2.6.31)
    • Patch Submitted:
      Yes
    • Number of attachments:
      3

      Description

      This patch provides alternative performance counter support using the Linux kernel Performance Counter subsystem and the libpfm4 library. This patch does not remove the existing perfctr performance counter support, but it does offer a number of benefits to the JikesRVM community.

      Libpfm4 leverages the new kernel support (>= 2.6.31) for a common performance monitoring framework. This framework is in the mainline kernel and extends beyond hardware monitoring events to include kernel events. Perfctr, in comparison, is an external patch that only provides support for hardware performance counters.

      Perfctr requires JikesRVM to keep a translation table from easy-to-read mnemonics (e.g. L1_MISS) to hardware-specific event IDs. Each hardware update potentially requires this table to be updated (currently there are no translations for Intel Nehalem processors). Validating the table is an onerous task and, due to its lack of widespread use, there is the possibility of mistakes. Libpfm4 removes the maintenance burden of such tables from JikesRVM by providing the necessary translation services, and with its wider user base it gives higher confidence that mnemonics are translated to event IDs correctly.

      Perfctr exposes hardware counter overflows to software via Linux signals; currently JikesRVM has no support for processing these signals. I have witnessed counter overflows whilst using perfctr, but there was no user-visible indication that this had happened. Support for both overflowing hardware counters and the overflowing internal representation of these counters needs to be stronger to ensure the community can have faith in these values. The new kernel perf_event support offers a less intrusive approach than adding perfctr signal handling to JikesRVM: all kernel-managed event counters are virtual 64-bit counters that handle underlying hardware overflows.

      JikesRVM perfctr support only ever exposed one hardware counter, even if more were supported. This patch exposes multiple counters; events may be multiplexed onto the available counters.

      Some usage instructions are below:

      Prerequisites:
      1. A modern (>= 2.6.31) Linux kernel with the Performance Counter subsystem compiled in. This code is in the mainline kernel, and some Linux distributions compile it in by default (e.g. Ubuntu 10.04).
      2. The libpfm4 library available from http://perfmon2.sourceforge.net/
      For the impatient...
      {{
      $ git clone git://perfmon2.git.sourceforge.net/gitroot/perfmon2/libpfm4
      $ cd libpfm4
      $ make
      $ sudo make install
      }}
      If you are building on x86_64, see the notes at the end of this message.

      Usage guide:
      1. Meet the prerequisites above.
      2. Pass -Dconfig.include.pfm=true to ant along with the rest of your build options.
      3. Pass a comma-separated list of the performance events you are interested in to the RVM, like so:
      -X:gc:pfmMetrics=L1I_MISSES,LLC_MISSES.

      To find a complete list and description of the events available on your hardware and kernel version, run the showevtinfo program located in <libpfm4-src-dir>/examples.

      4. Also specify the -X:gc:harnessAll=true option to the RVM.

      Interpreting the results:
      If all goes well, you should find something like this at the very end of the RVM invocation:

      === MMTk Statistics Totals ==
      L1I_MISSES.mu: 1629940088073
      L1I_MISSES.gc: 953482739268

      The first line (with ".mu") is the number of events recorded whilst the mutator was running. The second line (with ".gc") is the number of events whilst the GC was running.

      If you specify more performance events than your hardware supports, the events may be multiplexed onto the available counters. If you do this, your output will look like this:

      === MMTk Statistics Totals ==
      L1I_MISSES.mu: 1629940088073 (SCALED)
      L1I_MISSES.gc: 953482739268 (SCALED)
      LLC_MISSES.mu: 1546188225840 (SCALED)
      LLC_MISSES.gc: 942745321033 (SCALED)
      PERF_COUNT_HW_CACHE_L1D:MISS.mu 19434083 (SCALED)
      PERF_COUNT_HW_CACHE_L1D:MISS.gc 5600456 (SCALED)

      Events that have been multiplexed, and therefore scaled based on time, have "(SCALED)" printed after the value. It is not possible to directly compare scaled values, as there is no guarantee that the events measured the same parts of the program execution (this can be achieved using counter groups if really needed; see the libpfm4 documentation for more details).

      Some events cannot be multiplexed and might not be measured during an invocation. In that case the word "CONTENDED" is printed after the counter name and no further information is printed.

      Best efforts are made to detect when the kernel's virtual 64-bit counter exceeds 63 bits (and so becomes a negative number in Java). Where this is detected, the word "OVERFLOWED" is printed next to the counter name and no further information is printed.
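
      For reference, when the kernel multiplexes events it also reports how long each event was enabled and how long it actually ran on a hardware counter, and a scaled estimate can be derived from those times. The sketch below is illustrative only: it assumes the standard PERF_FORMAT_TOTAL_TIME_ENABLED/PERF_FORMAT_TOTAL_TIME_RUNNING read format, and the struct and function names are hypothetical, not taken from the patch's native code.
      {{
      /* Illustrative sketch: assumes the kernel's standard scaling data is requested
       * (PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING).
       * Names here are hypothetical and not taken from the patch. */
      #include <stdint.h>

      struct read_format {           /* layout returned by read() with both flags set */
          uint64_t value;            /* raw event count                               */
          uint64_t time_enabled;     /* ns the event was enabled                      */
          uint64_t time_running;     /* ns the event was actually on a counter        */
      };

      /* Returns a time-scaled estimate of the full count, or -1 if the event was
       * never scheduled (the "CONTENDED" case above). A result with the top bit set
       * would appear negative in Java (the "OVERFLOWED" case above). */
      static int64_t scaled_count(const struct read_format *r, int *was_scaled)
      {
          if (r->time_running == 0)
              return -1;
          *was_scaled = (r->time_running != r->time_enabled);
          return (int64_t)((double)r->value *
                           (double)r->time_enabled / (double)r->time_running);
      }
      }}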

      Known limitations:
      At present libpfm4 is rather unforgiving of incorrect event mnemonics; if you make a typo in an event name, expect a segfault!

      Notes on building libpfm4 on x86_64:
      This applies at least to version 79c9a0d (April 19th, 2010) of libpfm4. On x86_64, JikesRVM builds 32-bit programs, but by default libpfm4 builds 64-bit libraries, which JikesRVM cannot link against. As a quick hack, before building libpfm4 add the following two lines to the end of <libpfm4-src-dir>/config.mk:
      {{
      CFLAGS= -m32
      LDFLAGS= -m32
      }}

      1. pfmCtrSupport.patch
        30 kB
        Laurence Hellyer
      2. RVM-885-perThread001.patch
        41 kB
        Laurence Hellyer
      3. statementOfContribution.txt
        0.5 kB
        Laurence Hellyer

        Activity

        Daniel Frampton added a comment -

        Sorry for the delay in responding. This looks excellent, and I am happy to test/review it for inclusion. I was thinking I might have had to write this from scratch in the near future!

        Several points to think about from my cursory glance:

        1. It appears this has some overlap with the discussion of printing out the MMTk statistics (this patch also includes the changes discussed in RVM-884). This system has additional requirements, not mentioned there, to return extra information indicating when counters are scaled or contended.

        2. I am concerned/curious about licensing issues. Your patch includes the comment "This code is in part based on perf_examples/self.c as distributed by libpfm4", which is a part-GPL, part-MIT-licensed project, so we need to check that this portion is MIT licensed and do the right thing.

        3. We need to decide if the historical perfcounter support should remain.

        4. Does the patch include an option to show the currently available counters? (If not, this could be useful, relatively simple, and would improve the error checking.)

        5. Is it sensible to include the building of libpfm4 into our build process to make it easier for people to work with this?

        6. How much is libpfm4 doing for us? Is it feasible to access the counters more directly?

        I am also curious about ways we could attribute some counters back to specific hardware threads. This may make the general infrastructure more useful for concurrent gc, etc.

        Laurence Hellyer added a comment -

        Hi Daniel,

        Thanks for the positive comments. Addressing your points in turn...

        1) If anyone has any alternative suggestions about how to better present information such as a counter being Contended or Scaled then please feel free to suggest them. Whilst this patch does not depend on RVM-884, as you point out it would be wise to consider the need for counters to present additional information in any refactoring of the MMTk output.

        2) perf_examples/self.c as distributed by libpfm4 is licensed under the MIT licence.

        3) Agreed; does anyone have strong reasons to continue to support perfCounters? (libpfm4 supports a host of architectures, including PowerPC.) Now that the kernel has a performance counter interface, it is unclear whether the maintainer of the perfctr patches will continue to provide patches for new kernels.

        4) No, but this can be easily added. Whilst it should be possible to pass an array of counter names back to Jikes and then sanity-check the command-line arguments, I would prefer to patch libpfm4 so that it is a little more graceful in how it fails.

        5) Currently libpfm4 is only available via git clone; there is no official release, so it would mean adding another build dependency.

        6) The new kernel performance event subsystem is initialised via a system call which returns a special file descriptor. A standard read() on the FD returns the current counter value. Pfm.C opens the counter and reads the value. What libpfm4 provides is a translation from a mnemonic such as "L1I_MISSES" to the kernel- and hardware-specific event encoding that needs to be passed to sys_perf_event_open(). These encodings could be maintained within Jikes, but that seems to lose many of the maintenance benefits this has over perfCounters.
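
        For concreteness, here is a minimal sketch of that kernel interface. It is illustrative only: the event is hard-coded (PERF_COUNT_HW_INSTRUCTIONS) purely for demonstration, whereas the patch has libpfm4 supply the perf_event_attr encoding from a mnemonic such as L1I_MISSES.
        {{
        /* Minimal sketch of the perf_event interface described above (illustrative
         * only; Pfm.C obtains the event encoding from libpfm4 rather than using the
         * hard-coded PERF_COUNT_HW_INSTRUCTIONS below). */
        #include <linux/perf_event.h>
        #include <sys/ioctl.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
            struct perf_event_attr attr;
            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_HARDWARE;
            attr.config = PERF_COUNT_HW_INSTRUCTIONS;  /* stand-in for a pfm encoding */
            attr.disabled = 1;

            /* sys_perf_event_open() has no glibc wrapper; it returns a special fd. */
            int fd = syscall(__NR_perf_event_open, &attr,
                             0 /* this task */, -1 /* any cpu */, -1 /* no group */, 0);
            if (fd < 0) { perror("perf_event_open"); return 1; }

            ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);    /* start counting                */
            /* ... code being measured ... */
            ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);   /* stop counting                 */

            uint64_t count = 0;                     /* kernel-managed 64-bit counter */
            if (read(fd, &count, sizeof(count)) == sizeof(count))  /* plain read()   */
                printf("count = %llu\n", (unsigned long long)count);
            close(fd);
            return 0;
        }
        }}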

        Before deciding on libpfm4 I did a quick survey of various performance counter support mechanisms. I don't recall any providing per-thread counters, but I might be mistaken (I shall investigate a bit further...)

        Ian Rogers added a comment -

        MRP-32 (http://jira.codehaus.org/browse/MRP-32) adds support to MRP for OProfile. Below is the top of an opreport from a profile of a prototype configuration running DaCapo fop. Mutator and GC samples can be distinguished by package name.

        Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
        samples % image name symbol name
        30523 6.5961 28214.jo org.jikesrvm.VM._assert(ZLjava/lang/String;Ljava/lang/String;)V
        27091 5.8544 28214.jo org.jikesrvm.mm.mmtk.Assert._assert(Z)V
        22701 4.9057 28214.jo org.jikesrvm.VM._assert(Z)V
        14315 3.0935 28214.jo org.mmtk.utility.alloc.Allocator.alignAllocation(Lorg/vmmagic/unboxed/Address;IIIZ)Lorg/vmmagic/unboxed/Address;
        10551 2.2801 28214.jo org.mmtk.policy.immix.Block.getDefragStateAddress(Lorg/vmmagic/unboxed/Address;)Lorg/vmmagic/unboxed/Address;
        10305 2.2269 28214.jo java.lang.String.<init>([BIII)V
        9288 2.0071 28214.jo org.mmtk.utility.heap.Map.getChunkIndex(Lorg/vmmagic/unboxed/Address;)I
        9097 1.9659 28214.jo org.jikesrvm.objectmodel.JavaHeader.getPointerInMemoryRegion(Lorg/vmmagic/unboxed/ObjectReference;)Lorg/vmmagic/unboxed/Address;
        6803 1.4701 28214.jo org.mmtk.utility.heap.Map.getSpaceForAddress(Lorg/vmmagic/unboxed/Address;)Lorg/mmtk/policy/Space;
        5987 1.2938 28214.jo org.mmtk.policy.Space.getSpaceForObject(Lorg/vmmagic/unboxed/ObjectReference;)Lorg/mmtk/policy/Space;
        5785 1.2501 28214.jo org.jikesrvm.mm.mmtk.ObjectModel.refToAddress(Lorg/vmmagic/unboxed/ObjectReference;)Lorg/vmmagic/unboxed/Address;
        5649 1.2208 28214.jo org.jikesrvm.classloader.NormalMethod.getLineNumberForBCIndex(I)I
        5053 1.0920 28214.jo org.jikesrvm.runtime.Memory.alignUp(II)I

        Laurence Hellyer added a comment -

        Ian,

        Thank you for pointing this out; do you have any overhead figures for these performance counters? As I understand it, OProfile adds sampling overhead (although I am willing to be re-educated!)

        Just to keep everyone up to date I am currently reworking my pfmCounter patch to count stats per task (i.e. hopefully per thread). Just squashing the last few bugs and then I shall post a prototype patch.

        Kind regards
        Laurence

        Ian Rogers added a comment -

        I've updated MRP-32 with some performance data; I can't perceive significant overhead in using OProfile. OProfile allows you to drill down from source to assembly to see where cycles are being spent, and it is also integral to AMD's CodeAnalyst for Linux, which provides pipeline models for AMD processors.

        Laurence Hellyer added a comment -

        Ian: Perhaps I am reading your output incorrectly, but the OProfile stats do not seem to provide the calling context. You state that mutator and GC samples can be distinguished by package name, but take, for example, the first sample of your output, "org.jikesrvm.VM._assert(ZLjava/lang/String;Ljava/lang/String;)V" - how can you tell whether these are GC assertions or RVM assertions? The OProfile results seem to provide an approximation of "hotness", so how is this different from using the AOS system? (I guess you can sample code running in C land?)

        Please find attached a reworking of my pfmCounter patch that now counts events on a per-thread basis. This patch is not yet ready for trunk:
        i) It's only a proof of concept; several safety features need to be added
        ii) It has not been extensively tested against all configurations and probably won't pass CheckStyle
        iii) Only 1 counter per thread is currently implemented
        iv) The timer thread does not yet support event counters

        This patch uses the existing MMTk event counter support, but as Daniel has hinted this might need to be extended to better support concurrent GC.
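
        Per-task attribution follows from the same kernel interface: a counter opened with pid = 0 is attached to the calling thread only, so each thread can own its own file descriptor. The sketch below is illustrative only and uses plain pthreads; the patch itself hooks the same idea into RVM threads and the existing MMTk counters.
        {{
        /* Illustrative sketch of per-task counting with plain pthreads; the patch
         * wires this into RVM threads rather than pthreads. */
        #include <linux/perf_event.h>
        #include <pthread.h>
        #include <sys/ioctl.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        static void *worker(void *arg)
        {
            struct perf_event_attr attr;
            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_HARDWARE;
            attr.config = PERF_COUNT_HW_CPU_CYCLES;   /* stand-in event */

            /* pid = 0, cpu = -1: count only the calling thread, on any CPU.
             * The counter starts enabled, so it counts from here on. */
            int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
            if (fd < 0) { perror("perf_event_open"); return NULL; }

            volatile uint64_t x = 0;                  /* some per-thread work */
            for (uint64_t i = 0; i < 10000000; i++) x += i;

            uint64_t count = 0;
            if (read(fd, &count, sizeof(count)) == sizeof(count))
                printf("thread %ld: %llu cycles\n",
                       (long)(intptr_t)arg, (unsigned long long)count);
            close(fd);
            return NULL;
        }

        int main(void)
        {
            pthread_t t[2];
            for (long i = 0; i < 2; i++)
                pthread_create(&t[i], NULL, worker, (void *)(intptr_t)i);
            for (int i = 0; i < 2; i++)
                pthread_join(t[i], NULL);
            return 0;
        }
        }}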

        The current output of this patch is something like this:

        Running Jython with production using default heap size:

        PERF_COUNT_HW_CACHE_L1D:MISS-4-daemon-1-RUNNABLE: 5233
        PERF_COUNT_HW_CACHE_L1D:MISS-3-daemon-collector-4-RUNNABLE: 18219786
        PERF_COUNT_HW_CACHE_L1D:MISS-5-daemon-1-RUNNABLE: 424678
        PERF_COUNT_HW_CACHE_L1D:MISS-9-daemon-1-RUNNABLE: 33140
        PERF_COUNT_HW_CACHE_L1D:MISS-10-daemon-1-RUNNABLE: 27684473
        PERF_COUNT_HW_CACHE_L1D:MISS-6-daemon-1-RUNNABLE: 266274
        PERF_COUNT_HW_CACHE_L1D:MISS-8-daemon-1-RUNNABLE: 509236
        PERF_COUNT_HW_CACHE_L1D:MISS-7-daemon-1-RUNNABLE: 1030
        PERF_COUNT_HW_CACHE_L1D:MISS-11-main-1-RUNNABLE: 107885270

        Increasing the heap size to 500M I get:

        PERF_COUNT_HW_CACHE_L1D:MISS-4-daemon-1-RUNNABLE: 3649
        PERF_COUNT_HW_CACHE_L1D:MISS-3-daemon-collector-4-RUNNABLE: 7225928
        PERF_COUNT_HW_CACHE_L1D:MISS-5-daemon-1-RUNNABLE: 283850
        PERF_COUNT_HW_CACHE_L1D:MISS-9-daemon-1-RUNNABLE: 6461
        PERF_COUNT_HW_CACHE_L1D:MISS-8-daemon-1-RUNNABLE: 435316
        PERF_COUNT_HW_CACHE_L1D:MISS-10-daemon-1-RUNNABLE: 27383861
        PERF_COUNT_HW_CACHE_L1D:MISS-6-daemon-1-RUNNABLE: 162644
        PERF_COUNT_HW_CACHE_L1D:MISS-7-daemon-1-RUNNABLE: 992
        PERF_COUNT_HW_CACHE_L1D:MISS-11-main-1-RUNNABLE: 107982968

        Note that the L1D misses for the collector thread have gone down, which makes sense as the number of collections went from 226 GCs to 46 GCs.

        A 500M heap but with 10 collector threads yields:

        PERF_COUNT_HW_CACHE_L1D:MISS-12-daemon-collector-1-RUNNABLE: 4034222
        PERF_COUNT_HW_CACHE_L1D:MISS-3-daemon-collector-4-RUNNABLE: 626224
        PERF_COUNT_HW_CACHE_L1D:MISS-13-daemon-1-RUNNABLE: 3882
        PERF_COUNT_HW_CACHE_L1D:MISS-4-daemon-collector-1-RUNNABLE: 421876
        PERF_COUNT_HW_CACHE_L1D:MISS-5-daemon-collector-1-RUNNABLE: 390525
        PERF_COUNT_HW_CACHE_L1D:MISS-11-daemon-collector-1-RUNNABLE: 564460
        PERF_COUNT_HW_CACHE_L1D:MISS-10-daemon-collector-1-RUNNABLE: 604514
        PERF_COUNT_HW_CACHE_L1D:MISS-9-daemon-collector-1-RUNNABLE: 412718
        PERF_COUNT_HW_CACHE_L1D:MISS-8-daemon-collector-1-RUNNABLE: 423487
        PERF_COUNT_HW_CACHE_L1D:MISS-7-daemon-collector-1-RUNNABLE: 219457
        PERF_COUNT_HW_CACHE_L1D:MISS-6-daemon-collector-1-RUNNABLE: 399383
        PERF_COUNT_HW_CACHE_L1D:MISS-14-daemon-1-RUNNABLE 153939
        PERF_COUNT_HW_CACHE_L1D:MISS-18-daemon-1-RUNNABLE: 7196
        PERF_COUNT_HW_CACHE_L1D:MISS-19-daemon-1-RUNNABLE: 17785067
        PERF_COUNT_HW_CACHE_L1D:MISS-15-daemon-1-RUNNABLE: 52125
        PERF_COUNT_HW_CACHE_L1D:MISS-17-daemon-1-RUNNABLE: 323621
        PERF_COUNT_HW_CACHE_L1D:MISS-16-daemon-1-RUNNABLE: 886
        PERF_COUNT_HW_CACHE_L1D:MISS-20-main-1-RUNNABLE: 113463519

        The L1D misses by each collector thread are reported separately. The sum of the L1D misses for the collector threads is about 12% higher than the misses by just one collector thread, which seems reasonable at first glance.

        I'd be happy to collaborate in any discussions about refactoring the MMTk event counters to better support concurrent collectors, and I shall work to tidy this patch up in the next week or so.

        Kind regards
        Laurence

        Ian Rogers added a comment -

        Laurence, in production code assert isn't called (we strongly assert for this), and note that MMTk and the RVM have different assert flavours. The results are from a prototype build, as I have a 64-bit desktop machine, the 64-bit x86 opt compiler isn't quite working for MRP yet, and I didn't have a 32-bit OProfile agent compiled. In any case, OProfile has a mode of operation where it gathers call graphs and therefore context. I doubt this will work out of the box, but it is likely to with the GDB 7 support going into MRP (MRP-31). The real difference between the two approaches is that one is just a total count and the other is a count that lets you drill down to the assembly instruction (going via the source code if you so desire) and see which statements are causing you a performance penalty.

        Daniel Frampton added a comment -

        I have committed a related patch in -r15865.

        The differences:

        1. I renamed a bunch of things.
        2. I did not use the code that came from libpfm's examples (I just call the pfm libraries and Linux syscalls now).
        3. I also removed the old perfctr code.

        David Grove added a comment -

        bulk defer open issues to 3.1.2

        David Grove added a comment -

        Bulk defer to 3.1.3; not essential to address for 3.1.2.

        David Grove added a comment -

        bulk defer issues to 3.1.4

        Erik Brangs added a comment -

        Removing assignees of all issues where the assignee has not contributed to Jikes RVM for quite some time or where the assignee is no longer a member of the core team.

        If you are affected and would like to work on one of the issues, please drop me a note (or change the assignee yourself if you have the necessary permissions).


          People

          • Assignee: Unassigned
          • Reporter: Laurence Hellyer
          • Votes: 0
          • Watchers: 1
