XTENLANG-1229

multi-threaded scaling of Array and DistArray initialization

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: X10 2.0.2, X10 2.0.3, X10 2.0.4
    • Fix Version/s: X10 2.6
    • Labels:
      None
    • Environment:
      Linux x86-64, Intel Q6600 @ 2.40GHz (quad core)
    • Testcase included:
      yes
    • Number of attachments:
      2

      Description

      Against X10 2.0.2

      The following code (BenchmarkForEach.x10, attached) is intended as a test of the multi-threaded scaling of foreach. It does not scale as expected with X10_NTHREADS; that is, on a multi-core system, performance is roughly the same for X10_NTHREADS=1..4:

              val a = Array.make[Double](Dist.makeBlock([0..N-1], 0), ((i) : Point) => i as Double);
              finish foreach ((i) in a) {
                  a(i) += 2.0;
              }

      This was compiled with -O -NO_CHECKS. When run on an Intel Q6600 @ 2.40GHz (quad core), the following timings were observed:

      X10_NTHREADS   kop/s
      1                339
      2                281
      3                346
      4                360

      The modified benchmark (BenchmarkSplitThreads.x10, attached) is intended to eliminate the overhead of foreach by creating only as many activities as there are threads. However, it does not scale well either. According to top, CPU usage is as expected (100% for X10_NTHREADS=1, 200% for X10_NTHREADS=2, and so on), but the ops count increases by only a small fraction (a sketch of the split pattern follows the table):

      X10_NTHREADS   Mop/s
      1               3.37
      2               3.95
      3               4.31
      4               4.11
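
      For reference, a minimal sketch of the split-threads pattern (the actual BenchmarkSplitThreads.x10 attachment may differ; N and the array a are assumed to be in scope, and N is assumed divisible by nThreads):

              // Hypothetical sketch, not the attached file: one activity per
              // worker thread, each walking its own contiguous chunk.
              val nThreads = 4;                  // assumed to match X10_NTHREADS
              val chunk = N / nThreads;
              finish foreach ((t) in 0..nThreads-1) {
                  for ((i) in t*chunk..(t+1)*chunk-1) {
                      a(i) += 2.0;               // sequential work within the chunk
                  }
              }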

      Against SVN HEAD

      The base performance is better against SVN HEAD, but the scaling is similar. If the first benchmark is modified to change the foreach to an ateach statement, the performance against SVN HEAD improves markedly with the number of places.

      Places   X10_NTHREADS   kop/s
      1        1                561
      1        2                426
      1        3                483
      1        4                457
      2        1                846
      4        1               1470

      (This comparison cannot be made for X10 2.0.2, as it does not include Igor's fix for XTENLANG-1143.)
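
      For reference, a minimal sketch of the ateach variant described above, assuming a is the block-distributed array from the first benchmark and that its distribution is available as a.dist (2.0-era syntax; both are assumptions):

              // Hypothetical sketch: one activity per point, executed at the
              // place that owns that element of the distribution.
              finish ateach ((i) in a.dist) {
                  a(i) += 2.0;
              }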

      All this seems to suggest a problem with the way threads of execution are utilized by foreach.

      Attachments

      1. BenchmarkForEach.x10
        0.7 kB
        Josh Milthorpe
      2. BenchmarkSplitThreads.x10
        0.9 kB
        Josh Milthorpe


          Activity

          David Grove added a comment -

          In your spare time

          I may try to take a look this weekend, but if I don't get to it quickly, it probably makes more sense for you to look into this.

          Xing Liu added a comment -

          I did a similar experiment. It seems foreach is even slower than the sequential code.

          I wrote the following code to test the performance of foreach. add1() is sequential code; in add2() we use foreach and let X10 partition the workload; and in add3() we partition the workload ourselves.

          I used the C++ backend and ran the code as:

          env X10_NTHREADS=2 runx10 ./Test_foreach

          The performance:

          time of add1() = 32.3 ms
          time of add2() = 3277.98 ms
          time of add3() = 18.33 ms

          // Test_foreach.x10
          // (loop variables restored; the tracker's markup swallowed the (i) patterns)

          def add1() {
              for ((i) in 0..size-1) {
                  data(i) += 5;
              }
          }

          def add2() {
              finish foreach ((i) in 0..size-1) {
                  data(i) += 5;
              }
          }

          def add3() {
              val numThreads = 2;
              val mySize = size/numThreads;
              finish foreach ((p) in 0..numThreads-1) {
                  for ((i) in p*mySize..(p+1)*mySize-1) {
                      data(i) += 5;
                  }
              }
          }

          David Grove added a comment -

          Qi Ming did some profiling, and what actually appears to be happening is that the array initialization isn't scaling. The for loops themselves are actually scaling OK.

          We can rewrite the array constructor to use a parallel loop, which will help scalability (sketched below).

          However, sequential performance of array construction is still fairly poor because the X10 compiler doesn't implement the necessary optimizations to inline and optimize the initialization loop. We also can't hand specialize the constructor due to Java backend problems (can't overload on closure types: XTENLANG-979).
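
          A rough sketch of what such a parallel initialization loop could look like (hypothetical; raw, size, nThreads, and init are illustrative stand-ins, not the actual x10.array constructor internals):

              // Hypothetical sketch, not the actual runtime code: split the
              // backing storage into one chunk per worker and fill each chunk
              // in its own activity.
              def initParallel(raw:Rail[Double], size:Int, nThreads:Int, init:(Int)=>Double) {
                  val chunk = size / nThreads;
                  finish foreach ((t) in 0..nThreads-1) {
                      // the last chunk absorbs the remainder when size % nThreads != 0
                      val hi = (t == nThreads-1) ? size-1 : (t+1)*chunk-1;
                      for ((i) in t*chunk..hi) {
                          raw(i) = init(i);
                      }
                  }
              }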

          David Grove added a comment -

          bump all user-opened issues to critical for 2.0.4 triage purposes.

          David Grove added a comment -

          Since we now believe the behavior of this test program is about scalability of array initialization, I'm going to defer to 2.1.0.

          There are two things that need to be done: improve the sequential efficiency of array initialization, and then parallelize the initialization loop. The key item for the first is enabling inlining of constructors. We need to be able to inline the array constructor to avoid the closure call on every loop iteration, and also to discover that the region is rectangular (and the rank of the region), thus letting us optimize the for loop in the array constructor (illustrated below). Until we can get that inlining done, the sequential performance is bad enough that I'm not sure it's worth the effort of attempting to parallelize the initialization loop.
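
          To illustrate the closure-call overhead (a hypothetical sketch; initViaClosure and initInlined are illustrative names, not runtime APIs):

              // General path: a closure invocation per element, through a
              // generic Point, because neither the initializer nor the region
              // shape is statically known inside the constructor.
              def initViaClosure(raw:Rail[Double], size:Int, init:(Point)=>Double) {
                  for ((i) in 0..size-1) {
                      raw(i) = init(Point.make(i)); // closure call every iteration
                  }
              }

              // What constructor inlining should eventually yield for a rank-1
              // rectangular region: a plain counted loop with the body inlined.
              def initInlined(raw:Rail[Double], size:Int) {
                  for ((i) in 0..size-1) {
                      raw(i) = i as Double;         // initializer inlined, no call
                  }
              }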

          David Grove added a comment -

          Pushing this out until workstealing is expected to become available; then re-evaluate how best to parallelize things like the initialization loop.

          David Grove added a comment -

          bulk defer of unresolved 2.1.0 bugs to 2.1.1.

          Vijay Saraswat added a comment -

          Hai Chuan, Olivier: with work-stealing, what is the performance of async for loops and array initializers?

          David Grove added a comment -

          bulk move of all unresolved issues from 2.1.1 to 2.1.2.

          David Grove added a comment -

          defer all non-critical 2.1.2 issues to 2.2.

          David Grove added a comment -

          bulk defer of open issues to 2.2.2.

          David Grove added a comment -

          bulk defer of issues to 2.2.3.

          David Grove added a comment -

          bulk defer of 2.3.0 open issues to 2.3.1.

          David Grove added a comment -

          bulk defer to 2.3.2.

          David Grove added a comment -

          defer to 2.4.1.

          David Grove added a comment -

          bulk defer of features from 2.5 to 2.6.


            People

            • Assignee:
              David Grove
            • Reporter:
              Josh Milthorpe
            • Votes:
              0
            • Watchers:
              3
