Speed-up of nonblocking collectives

I had the opportunity to measure the effect of Torsten Höfler's nonblocking collectives (LibNBC, http://www.unixer.de/research/nbcoll/libnbc/) on MareNostrum. They are supposed to improve the performance of the $H|\psi\rangle$ operation by

  1. exchanging ghost points asynchronously,
  2. calculating the potential parts of the Hamiltonian while the exchange is in progress, and
  3. applying the Laplacian once the ghost points have arrived.
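
The three steps above can be sketched in a few lines of Python. This is only an illustration of the overlap pattern, not the Octopus implementation: a background thread stands in for the nonblocking collective, the grid is one-dimensional, and the Hamiltonian is assumed to be -1/2 Laplacian + V with a three-point stencil. The names exchange_ghosts and apply_hamiltonian are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def exchange_ghosts(left_neighbor, right_neighbor):
    """Hypothetical stand-in for the nonblocking ghost-point exchange.

    Returns the boundary values of the two neighbouring partitions.
    """
    return left_neighbor[-1], right_neighbor[0]


def apply_hamiltonian(psi, v, left_neighbor, right_neighbor, h=1.0):
    """Apply H = -1/2 Laplacian + V to the local part of psi (assumed form)."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # 1. Start the ghost-point exchange in the background.
        pending = pool.submit(exchange_ghosts, left_neighbor, right_neighbor)

        # 2. The potential part needs only local points, so it can be
        #    computed while the exchange is still running.
        hpsi = [v_i * psi_i for v_i, psi_i in zip(v, psi)]

        # Wait for the ghost points before touching the stencil.
        ghost_left, ghost_right = pending.result()

    # 3. Apply the Laplacian (three-point stencil) using the ghost points.
    padded = [ghost_left] + list(psi) + [ghost_right]
    for i in range(len(psi)):
        laplacian = (padded[i] - 2.0 * padded[i + 1] + padded[i + 2]) / h**2
        hpsi[i] -= 0.5 * laplacian
    return hpsi


if __name__ == "__main__":
    print(apply_hamiltonian([1.0, 2.0, 4.0], [0.0, 1.0, 2.0],
                            left_neighbor=[0.0, 0.5],
                            right_neighbor=[8.0, 0.0]))
```

The benefit of the pattern depends on how much work step 2 offers: the exchange can only be hidden behind the potential part if that part takes at least as long as the communication.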

Here are the results for a grid of 412929 inner points (589785 with mesh enlargement included). The table and the plot show the time the code spent inside the HPSI profiling tag. The second column lists the time with the nonblocking collective and the third the time with the standard blocking one.

Processors   NBC_Ialltoallv   MPI_Alltoallv   Improvement
         2            463 s           509 s           9 %
         4            355 s           306 s          none
         8            232 s           222 s          none
        16            150 s           170 s          12 %
        32            154 s           222 s          31 %
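
The Improvement column is consistent with the time saved relative to the blocking run, i.e. (blocking - nonblocking) / blocking. That reading is an assumption inferred from the numbers, and the helper names below are hypothetical. A short check:

```python
# Timings in seconds from the table:
# processors -> (NBC_Ialltoallv, MPI_Alltoallv)
timings = {
    2: (463, 509),
    4: (355, 306),
    8: (232, 222),
    16: (150, 170),
    32: (154, 222),
}


def improvement_percent(t_nonblocking, t_blocking):
    """Time saved by the nonblocking run, in percent of the blocking time."""
    return 100.0 * (t_blocking - t_nonblocking) / t_blocking


def improvement_label(t_nonblocking, t_blocking):
    """Format the gain as in the table; a slower nonblocking run gives 'none'."""
    gain = improvement_percent(t_nonblocking, t_blocking)
    return f"{round(gain)} %" if gain > 0 else "none"


if __name__ == "__main__":
    for procs, (t_nbc, t_mpi) in timings.items():
        print(f"{procs:>2} processors: {improvement_label(t_nbc, t_mpi)}")
```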

[Figure: Nbc speed up.png]

More than 16 processors clearly do not pay off for this grid size: going from 16 to 32 processors does not reduce the run time with either variant. With 4 and 8 processors the nonblocking runs are actually slower than the standard implementation. This might be due to process placement in the cluster, which I have not investigated. In general, the nonblocking collectives seem fine to use, especially for larger numbers of processors, where the latency-hiding effect of nonblocking communication comes more into play.
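
The scaling statement can be read directly off the measured times. The sketch below uses only the numbers from the table. Taking the 2-processor run as the baseline is my choice for illustration, and the function names are hypothetical.

```python
# Measured times in seconds, per number of processors.
nbc = {2: 463, 4: 355, 8: 232, 16: 150, 32: 154}  # NBC_Ialltoallv
mpi = {2: 509, 4: 306, 8: 222, 16: 170, 32: 222}  # MPI_Alltoallv


def fastest(timings):
    """Processor count with the shortest measured run time."""
    return min(timings, key=timings.get)


def relative_speedup(timings, base=2):
    """Speed-up of every run relative to the run on `base` processors."""
    return {procs: timings[base] / t for procs, t in timings.items()}


if __name__ == "__main__":
    for name, timings in (("nonblocking", nbc), ("blocking", mpi)):
        speedups = relative_speedup(timings)
        print(f"{name}: fastest on {fastest(timings)} processors")
        for procs, s in speedups.items():
            print(f"  {procs:>2} processors: {s:.2f}x relative to 2")
```

Both variants are fastest on 16 processors. The step to 32 costs the blocking version 52 s but the nonblocking version only 4 s.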