Speed-up of nonblocking collectives
From OctopusWiki
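I had the opportunity to measure the effect of Torsten Höfler's nonblocking collectives (http://www.unixer.de/research/nbcoll/libnbc/) at MareNostrum. They are supposed to improve the performance of applying the Hamiltonian (the part of the code profiled under the HPSI tag) by overlapping communication with computation:
- exchanging ghost points asynchronously,
- calculating the potential parts of the Hamiltonian while the exchange is in progress, and
- applying the Laplacian once the ghost points have arrived.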
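The three steps can be illustrated with a toy, single-process sketch. It is not Octopus or libNBC code: the helper `exchange_ghosts`, the 1D three-point stencil, and the form -1/2 Laplacian + V are assumptions made for illustration, and a background thread stands in for the nonblocking collective.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def exchange_ghosts(left_neighbour, right_neighbour):
    """Hypothetical stand-in for the nonblocking ghost-point exchange."""
    time.sleep(0.01)  # pretend network latency
    # The ghost points are the boundary values of the neighbouring domains.
    return left_neighbour[-1], right_neighbour[0]


def apply_hamiltonian(psi, potential, left_neighbour, right_neighbour, spacing=1.0):
    """Toy 1D version of H psi = -1/2 Laplacian(psi) + V psi (assumed form)."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Step 1: start the exchange; submit() returns immediately.
        pending = pool.submit(exchange_ghosts, left_neighbour, right_neighbour)
        # Step 2: the potential part needs no ghost points, so it can be
        # computed while the exchange is still in flight.
        hpsi = [v * p for v, p in zip(potential, psi)]
        # Step 3: wait for the ghost points, which the stencil requires.
        ghost_left, ghost_right = pending.result()

    padded = [ghost_left] + list(psi) + [ghost_right]
    for i in range(len(psi)):
        laplacian = (padded[i] - 2.0 * padded[i + 1] + padded[i + 2]) / spacing**2
        hpsi[i] += -0.5 * laplacian
    return hpsi


if __name__ == "__main__":
    # A linear function has zero Laplacian, so only V * psi remains.
    print(apply_hamiltonian([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], [0.0], [4.0]))
```

The point of the ordering is that the time spent in step 2 hides part of the communication latency; with a blocking collective the same work would only start after the exchange has finished.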
Here are the results for a grid of 412929 inner points (589785 with mesh enlargement included). The table and the plot show the time the code spent inside the HPSI profiling tag: the second column lists the time using the nonblocking collective (NBC_Ialltoallv), the third the time using the standard blocking one (MPI_Alltoallv), and the fourth the relative improvement.
| Processors | NBC_Ialltoallv | MPI_Alltoallv | Improvement |
|---|---|---|---|
| 2 | 463 s | 509 s | 9 % |
| 4 | 355 s | 306 s | none |
| 8 | 232 s | 222 s | none |
| 16 | 150 s | 170 s | 12 % |
| 32 | 154 s | 222 s | 31 % |
One can clearly see that using more than 16 processors does not make sense for this grid size: going from 16 to 32 processors no longer reduces the run time (150 s to 154 s with the nonblocking collective, 170 s to 222 s with the blocking one). The nonblocking runs with 4 and 8 processors are actually slower than the standard implementation (by about 16 % and 5 %), but this might be due to process placement in the cluster; I have not investigated this. In general, the nonblocking collectives seem to be okay to use, especially for larger numbers of processors, where the latency-hiding effect of the nonblocking communication comes more into play.
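The figures in the table can be checked with a few lines of Python. This sketch assumes that the Improvement column is the time saved relative to the blocking run, and that speed-up is measured against the smallest run (2 processors); both definitions are assumptions, not something stated by the measurements themselves.

```python
# Timings in seconds, copied from the table:
# processors -> (NBC_Ialltoallv, MPI_Alltoallv)
timings = {
    2: (463, 509),
    4: (355, 306),
    8: (232, 222),
    16: (150, 170),
    32: (154, 222),
}


def improvement_percent(nonblocking, blocking):
    """Time saved by the nonblocking run in percent (negative means slower)."""
    return 100.0 * (blocking - nonblocking) / blocking


def speedup_vs_smallest(table, column):
    """Speed-up of each run relative to the run with the fewest processors."""
    base_time = table[min(table)][column]
    return {procs: base_time / times[column] for procs, times in table.items()}


nbc_speedup = speedup_vs_smallest(timings, 0)
for procs, (nbc, blocking) in sorted(timings.items()):
    print(procs, round(improvement_percent(nbc, blocking)), round(nbc_speedup[procs], 2))
```

Under these assumptions the script reproduces the 9 %, 12 % and 31 % entries and reports negative values for the 4- and 8-processor runs. The speed-up of the nonblocking runs does not grow from 16 to 32 processors.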


