The current implementation of collecting finish, while correct and general, is inefficient for large result objects and SPMD-style programs.
Suppose I have something like the following code. Each place computes a partial contribution to a large matrix; these are then summed to form the complete matrix. (This example is taken from the Hartree-Fock code in ANUChem.)
This could be expressed more succinctly with collecting finish:
However, with the current implementation of collecting finish, this is efficient in neither time nor memory. Collecting finish creates Place.MAX_PLACES arrays in which to store the results from each place, as well as further temporary arrays to hold the partially reduced results. If A is very large, this could be a significant cost.
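To make the cost concrete, here is a toy model of the buffering described above. It is plain Java, not X10 and not the actual runtime code: the class `BufferedCollectModel`, the counter `rootAllocations`, and the `contribution` callback are all hypothetical names invented for illustration, and the places are simulated sequentially. It only counts the n-element arrays the root would allocate under the assumption of one zero-initialised slot per place plus one array for the final reduction.

```java
import java.util.function.IntFunction;

/** Toy model (plain Java, NOT the X10 runtime) of per-place result buffering. */
final class BufferedCollectModel {
    /** Number of n-element arrays allocated at the root by the last call. */
    static int rootAllocations;

    /**
     * Simulates a collecting finish over `places` places, each offering one
     * n-element array, and returns the element-wise sum of all offers.
     */
    static double[] collect(int places, int n, IntFunction<double[]> contribution) {
        rootAllocations = 0;

        // Assumption: one zero-initialised result slot per place at the root.
        double[][] results = new double[places][];
        for (int p = 0; p < places; p++) {
            results[p] = new double[n];
            rootAllocations++;
        }

        // Each place's offer is reduced into that place's own slot.
        for (int p = 0; p < places; p++) {
            double[] offer = contribution.apply(p);
            if (offer.length != n) {
                throw new IllegalArgumentException("offer has wrong length");
            }
            for (int i = 0; i < n; i++) {
                results[p][i] += offer[i];
            }
        }

        // Final reduction across all slots into one more array.
        double[] total = new double[n];
        rootAllocations++;
        for (double[] slot : results) {
            for (int i = 0; i < n; i++) {
                total[i] += slot[i];
            }
        }
        return total;
    }
}
```

In this model the root allocates `places + 1` arrays of the full result size on every call, so when the collecting finish sits inside an outer loop the cost is paid again on each iteration.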
There may be several possibilities for improvement, depending on the specific case:
- Allow pre-allocation of the results rail (one element per place) at place 0. The allocation could then be hoisted above the outer loop.
- Support stack allocation of intermediate results, e.g. for small arrays.
- Where there is only one offer per place ("SPMD collecting finish"?), each place need not hold a rail of results for each thread; the offer can be sent immediately to the root place. Additionally, there is no need to allocate an initial zero result for each place in the results rail at place 0.
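The first and last suggestions can be sketched together as another toy model. Again this is plain Java, not X10, and not a proposed runtime implementation: `SpmdCollectModel`, `collectInto`, and the `contribution` callback are hypothetical names, and the places are simulated sequentially (a real runtime would have to synchronise concurrent arrivals at the root). It assumes exactly one offer per place and a result buffer that the caller allocates once.

```java
import java.util.function.IntFunction;

/** Toy model (plain Java, NOT the X10 runtime) of a single-offer-per-place collect. */
final class SpmdCollectModel {
    /**
     * Adds each place's single offer directly into a caller-supplied
     * accumulator. No per-place slots and no per-place zero results are
     * allocated; the caller owns (and may reuse) the accumulator.
     */
    static void collectInto(double[] accumulator, int places,
                            IntFunction<double[]> contribution) {
        for (int p = 0; p < places; p++) {
            double[] offer = contribution.apply(p);
            if (offer.length != accumulator.length) {
                throw new IllegalArgumentException("offer has wrong length");
            }
            // The offer is reduced into the root's buffer as soon as it arrives.
            for (int i = 0; i < accumulator.length; i++) {
                accumulator[i] += offer[i];
            }
        }
    }
}
```

With this shape the caller can allocate the accumulator once above the outer loop and reset it (for example with `java.util.Arrays.fill(accumulator, 0.0)`) at the start of each iteration, so the root makes no result-sized allocations inside the loop.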