RVM

Investigate performance regression in Compress

Details

  • Type: Bug
  • Status: Open
  • Priority: Major
  • Resolution: Unresolved
  • Affects Version/s: None
  • Fix Version/s: 1000
  • Component/s: None
  • Labels:
    None
  • Number of attachments :
    0

Description

Not sure if there's going to be time for this pre-2.9.3 as I'm currently out of the office. Recent changes (r14130 to r14134) added information on whether the type in a register operand is precise/extant. This extra information can enable some extra simplification because we can consider not-null cases, etc. The effect of these flags should be to improve performance. However, SpecJVM's _201_compress has lost between 10 and 12% of its performance following these changes [1]. r14127 to r14129 just restructured code and didn't change anything, so their effect should be performance neutral.

_201_compress largely operates on primitive types. It does have a number of final classes, so r13134 (a this operand in a method of a final class must be precise) may be influencing things. This needs investigation, as improving the information in the compiler shouldn't be linked with such a marked performance degradation. My suspicion is that the type information may be causing more (probably) or less inlining, and we may be seeing the effect of code bloat. I'm assigning this to Dave as I believe he is very knowledgeable about the inlining and about this benchmark.

[1] http://jikesrvm.anu.edu.au/cattrack/results/rvmx86lnx32.anu.edu.au/perf/3503/performance_report

Activity

Ian Rogers added a comment -

Looking at inline reports, before precise this:

-methodOpt spec.benchmarks._201_compress.Compressor compress ()V
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Input_Buffer; >.getbyte ()I into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 6
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor$Hash_Table; >.clear ()V into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 66
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Input_Buffer; >.getbyte ()I into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 297
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor$Hash_Table; >.clear ()V into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 66
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Code_Table; >.of (I)I into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress()V at bytecode 126
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Code_Table; >.of (I)I into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress()V at bytecode 192
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Code_Table; >.set (II)V into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 251
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Input_Buffer; >.getbyte ()I into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 297
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor$Hash_Table; >.clear ()V into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 66
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Code_Table; >.of (I)I into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 126
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Code_Table; >.of (I)I into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 192
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Code_Table; >.set (II)V into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 251

after precise this:

-methodOpt spec.benchmarks._201_compress.Compressor compress ()V
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Input_Buffer; >.getbyte ()I into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 6
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor$Hash_Table; >.clear ()V into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 66
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Input_Buffer; >.getbyte ()I into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 297
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor$Hash_Table; >.clear ()V into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 66
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Code_Table; >.of (I)I into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 126
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Code_Table; >.of (I)I into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 192
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.output (I)V into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 208
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Output_Buffer; >.putbyte (B)V into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.output (I)V at bytecode 175
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Output_Buffer; >.writebytes ([BI)V into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.output (I)V at bytecode 227
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Output_Buffer; >.writebytes ([BI)V into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.output (I)V at bytecode 347
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Output_Buffer; >.putbyte (B)V into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.output (I)V at bytecode 175
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Output_Buffer; >.putbyte (B)V into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.output (I)V at bytecode 175
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Code_Table; >.set (II)V into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 251
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Input_Buffer; >.getbyte ()I into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 297
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor$Hash_Table; >.clear ()V into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 66
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Code_Table; >.of (I)I into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 126
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Code_Table; >.of (I)I into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 192
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.output (I)V into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 208
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Output_Buffer; >.putbyte (B)V into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.output (I)V at bytecode 175
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Output_Buffer; >.writebytes ([BI)V into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.output (I)V at bytecode 227
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Output_Buffer; >.writebytes ([BI)V into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.output (I)V at bytecode 347
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Output_Buffer; >.putbyte (B)V into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.output (I)V at bytecode 175
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Output_Buffer; >.putbyte (B)V into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.output (I)V at bytecode 175
Inline < SystemAppCL, Lspec/benchmarks/_201_compress/Code_Table; >.set (II)V into < SystemAppCL, Lspec/benchmarks/_201_compress/Compressor; >.compress ()V at bytecode 251

In other words, making this precise for final classes causes a lot more inlining in Compressor.compress, and that extra inlining is costing ~10% of execution time.
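As a rough illustration of why a precise this can unlock more inlining, here is a minimal sketch. The class and method names are hypothetical and are not taken from the SPEC sources. The point is only that, inside a method of a final class, the runtime type of this is exactly the declaring class, so a call on this has a single possible target and becomes a candidate for inlining without a guard.

```java
// Illustrative only: this class is hypothetical, not part of _201_compress.
final class FinalReceiverSketch {
    private int bytesWritten = 0;

    void output(int code) {
        // The implicit receiver is `this`. Because the class is final, no
        // subclass can override putByte, so the call target is known exactly.
        putByte((byte) code);
    }

    void putByte(byte b) {
        bytesWritten++;
    }

    int bytesWritten() {
        return bytesWritten;
    }

    public static void main(String[] args) {
        FinalReceiverSketch sketch = new FinalReceiverSketch();
        for (int i = 0; i < 3; i++) {
            sketch.output(i);
        }
        System.out.println("bytes written: " + sketch.bytesWritten());
    }
}
```

Whether inlining such a call actually helps depends on the size of the resulting method, which is the concern raised in this issue.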

David Grove added a comment -

This blocks 2.9.3 release.

Ian Rogers added a comment -

Recent changes (r14140-r14142) have pulled some compress performance back, but db performance has now crashed [1].

[1] http://jikesrvm.anu.edu.au/cattrack/results/rvmx86lnx32.anu.edu.au/perf/3529/performance_report

David Grove added a comment -

Sigh. This stuff is so touchy. The db drop is because with the new values when we're compiling shell_sort, we now estimate the cost of java.util.Vector.elementAt at 24. We used to estimate it at 23. The threshold is 23. So, before we would inline it, now we don't.
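The sensitivity described here can be sketched as a simple threshold test. This is a hypothetical model, not the actual Jikes RVM inlining code: the names are invented, and the "less than or equal" comparison is an assumption inferred from the statement that a cost of 23 was inlined against a threshold of 23.

```java
// Hypothetical sketch: the names and the "<=" comparison are assumptions,
// not the real inlining oracle.
final class InlineThresholdSketch {
    // Threshold value quoted in the comment above.
    static final int SIZE_THRESHOLD = 23;

    // Inline only when the estimated cost does not exceed the threshold.
    static boolean shouldInline(int estimatedCost) {
        return estimatedCost <= SIZE_THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println("cost 23 -> inline? " + shouldInline(23));
        System.out.println("cost 24 -> inline? " + shouldInline(24));
    }
}
```

With a hard cutoff like this, a one-unit change in the estimate for a hot callee is enough to flip the decision.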

Ian Rogers added a comment -

It's a comment on db's age that it's using Vector rather than the much more popular ArrayList. In r14144 I've changed the test run for the profiled-image to DaCapo's lusearch, which should touch Vector and warm it up for the subsequent recompilation.

David Grove added a comment - - edited

Please let's not pick the test run for the profiled-image to make _209_db work better.

I'm pursuing another fix anyways and doing this is really putting the cart way before the horse.

David Grove added a comment -

Also, profiling Vector in lusearch isn't going to change behavior in db. The hot call edge that would need to be in the profile is from Database.shell_sort to Vector.elementAt.

We do detect this as a hot call edge, just slightly too late for it to help us (i.e., right after the O1 compilation happens in a typical run).

An indirect benefit of getting O2 back online is that we'd naturally tend to have an additional shot at compiling methods after they've been executing for a while (so they have profile data). With the current optimization levels, we're reaching the max opt level too quickly for the really hot spots in program execution.

Ian Rogers added a comment -

Last time I checked, what was O2 is now slower than O1, although I have since pulled more things out of O1, in particular simple expression folding.

I've reverted the profiled-image test run to fop. Given that elementAt has only two operations (a call to a method that checks the array bound, and an array access), I figured that not touching Vector in the profile was somehow skewing the computed size. Maybe the computed size can be scaled back to ignore exception paths, e.g. by not including bytecodes in catch blocks and by weighting string buffer methods as cheap.
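The idea of leaving catch-block bytecodes out of the size estimate could look roughly like the sketch below. Everything in it is hypothetical: the Bytecode type, the per-bytecode costs, and the ignoreCatchBlocks flag are invented for illustration and do not correspond to real Jikes RVM classes or measured values.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a size estimate that can skip exception paths.
final class SizeEstimateSketch {
    // Invented model of one bytecode: a cost and whether it lies in a catch block.
    static final class Bytecode {
        final int cost;
        final boolean inCatchBlock;

        Bytecode(int cost, boolean inCatchBlock) {
            this.cost = cost;
            this.inCatchBlock = inCatchBlock;
        }
    }

    // Sum the costs, optionally leaving out bytecodes that sit in catch blocks.
    static int estimate(List<Bytecode> body, boolean ignoreCatchBlocks) {
        int total = 0;
        for (Bytecode b : body) {
            if (ignoreCatchBlocks && b.inCatchBlock) {
                continue;
            }
            total += b.cost;
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up costs, chosen only to show the two estimates differing.
        List<Bytecode> body = Arrays.asList(
            new Bytecode(10, false),
            new Bytecode(8, false),
            new Bytecode(6, true));
        System.out.println("full estimate: " + estimate(body, false));
        System.out.println("ignoring catch blocks: " + estimate(body, true));
    }
}
```

A smaller estimate makes a callee more likely to fall under a fixed inlining threshold, which is why the way exception paths are weighted matters.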

Given we want a benchmark for the profiled-image that will give good coverage, fop is almost certainly not the optimal benchmark choice.

Ian Rogers added a comment -

Pushing to 2.9.4. We've not fixed the performance problem, but this kind of issue is well known with this benchmark.

Ian Rogers added a comment -

This issue seems to have no possible resolution and we're happy with the current inlining parameters. I'd suggest marking it resolved+closed.

David Grove added a comment -

Had a discussion at PLDI with Raj from Rice, who's looking at register allocation. He told me that the current spill heuristics, which put a very high cost on spilling a register that is live in a catch block, are really hurting compress. We should look to see if we can fix that without impacting performance on other benchmarks.
