Speed-up of nonblocking collectives
I had the opportunity to measure the effect of Torsten Höfler's nonblocking collectives (http://www.unixer.de/research/nbcoll/libnbc/) at MareNostrum. They are supposed to improve the performance of applying the Hamiltonian by
- exchanging ghost points asynchronously,
- calculating the potential parts of the Hamiltonian, and
- applying the Laplacian.
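The three steps above can be sketched as follows. This is only a minimal illustration, not the code that was measured: it assumes a 1D grid with a three-point Laplacian, atomic units (H = -1/2 Laplacian + V), and one plausible ordering of the steps. The ghost point exchange is mimicked by a background thread that sleeps, where a real code would start a nonblocking collective. All function names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def exchange_ghost_points(left_value, right_value, latency_s=0.05):
    """Hypothetical stand-in for the ghost point exchange with neighbours.

    A real code would start a nonblocking collective here; the sleep only
    mimics network latency.
    """
    time.sleep(latency_s)
    return left_value, right_value


def second_difference(left, centre, right, spacing=1.0):
    """Three-point finite-difference Laplacian in one dimension."""
    return (left - 2.0 * centre + right) / spacing**2


def apply_hamiltonian_overlapped(psi, potential, left_ghost, right_ghost):
    """Return H psi = -1/2 Laplacian(psi) + V psi on a 1D grid (>= 2 points)."""
    n = len(psi)
    laplacian = [0.0] * n
    with ThreadPoolExecutor(max_workers=1) as pool:
        # 1. Start the ghost point exchange in the background.
        pending = pool.submit(exchange_ghost_points, left_ghost, right_ghost)
        # 2. While it is in flight, do the work that needs no ghost points:
        #    the potential part and the Laplacian on interior points.
        v_psi = [v * p for v, p in zip(potential, psi)]
        for i in range(1, n - 1):
            laplacian[i] = second_difference(psi[i - 1], psi[i], psi[i + 1])
        # 3. Wait for the exchange to complete.
        left, right = pending.result()
    # 4. Finish the two points next to the domain boundary with the ghost values.
    laplacian[0] = second_difference(left, psi[0], psi[1])
    laplacian[-1] = second_difference(psi[-2], psi[-1], right)
    return [-0.5 * lap + vp for lap, vp in zip(laplacian, v_psi)]


if __name__ == "__main__":
    # psi = x^2 sampled at x = 1..4, ghost values at x = 0 and x = 5.
    print(apply_hamiltonian_overlapped([1.0, 4.0, 9.0, 16.0], [1.0] * 4, 0.0, 25.0))
```

The interior work in step 2 is what hides the latency: the longer it takes relative to the exchange, the less time is spent idle in step 3.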
Here are the results for a grid of 412929 inner points (589785 with mesh enlargement included). The table and the plot show the time the code spent inside the HPSI profiling tag. The first column gives the number of processors, the second the time using the nonblocking collectives, the third the time using the standard blocking ones, and the fourth the resulting improvement.
|Processors||Nonblocking||Blocking||Improvement|
|2||463 s||509 s||9 %|
|4||355 s||306 s||none|
|8||232 s||222 s||none|
|16||150 s||170 s||12 %|
|32||154 s||222 s||31 %|
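The percentages in the last column appear to be the time saved relative to the blocking run, rounded to whole percent. That formula is inferred from the numbers rather than stated anywhere, so treat the following as a sketch for checking the table. The variable and function names are made up for the illustration.

```python
# HPSI times in seconds, copied from the table: processors -> (nonblocking, blocking)
HPSI_SECONDS = {
    2: (463, 509),
    4: (355, 306),
    8: (232, 222),
    16: (150, 170),
    32: (154, 222),
}


def improvement_percent(nonblocking_s, blocking_s):
    """Time saved by the nonblocking run, in percent of the blocking time.

    Returns None when the nonblocking run is not faster ("none" in the table).
    """
    if nonblocking_s >= blocking_s:
        return None
    return round(100 * (blocking_s - nonblocking_s) / blocking_s)


if __name__ == "__main__":
    for procs, (nonblocking_s, blocking_s) in HPSI_SECONDS.items():
        print(procs, improvement_percent(nonblocking_s, blocking_s))
```

With this definition the computed values agree with the table: 9, none, none, 12, and 31.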
One can clearly see that using more than 16 processors does not make sense for this grid size: going from 16 to 32 processors does not reduce the run time with either implementation. The runs with 4 and 8 processors are actually slower than the standard implementation, but this might be due to process placement in the cluster; I have not investigated this. In general, the nonblocking collectives seem to be okay to use, especially for larger numbers of processors, where the latency-hiding effect of the nonblocking communication comes more into play.
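One way to make the scaling statement concrete is to compute the parallel efficiency relative to the smallest run. The sketch below takes the 2-processor run as the baseline, which is an assumption, since no serial timing is given. The names are hypothetical.

```python
# HPSI times in seconds, copied from the table: processors -> (nonblocking, blocking)
HPSI_SECONDS = {
    2: (463, 509),
    4: (355, 306),
    8: (232, 222),
    16: (150, 170),
    32: (154, 222),
}

NONBLOCKING, BLOCKING = 0, 1


def parallel_efficiency(column):
    """Efficiency per processor count, relative to the smallest run.

    efficiency(p) = (T(p0) * p0) / (T(p) * p), where p0 is the smallest
    processor count in the table; 1.0 would mean perfect scaling.
    """
    base_procs = min(HPSI_SECONDS)
    base_time = HPSI_SECONDS[base_procs][column]
    return {
        procs: (base_time * base_procs) / (times[column] * procs)
        for procs, times in HPSI_SECONDS.items()
    }


if __name__ == "__main__":
    for label, column in (("nonblocking", NONBLOCKING), ("blocking", BLOCKING)):
        for procs, eff in parallel_efficiency(column).items():
            print(f"{label:12s} {procs:3d} processors: efficiency {eff:.2f}")
```

For both implementations the efficiency drops by roughly half or more between 16 and 32 processors, because the run time no longer decreases while the processor count doubles.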