Profiling
From OctopusWiki
I have started profiling the parallelization. Because of numerical problems (an increased number of SCF cycles) when computing the Hartree term, I started with non-interacting electrons. Although this is not a very realistic example, it does show the scaling of the non-local operator and the dot-product.
The profiling was done on 1, 2, 4, 6, 8, and 10 AMD64 processors at the FU Berlin Physics department. I would have liked to use more nodes, but no more than 10 were available.
I did benchmarking with two different input files:
- uniform grid, cg eigensolver, about 700000 points, stencil size 4 points in each direction in 3D, which looks like this:
    BoxShape = parallelepiped
    %Lsize
     2 | 2 | 2
    %
    %Spacing
     0.05 | 0.05 | 0.05
    %
    LCAOStart = no
    ProfilingMode = yes
    Dimensions = 3
    XFunctional = no
    CFunctional = no
    %Species
     "HO" | 2 | 1 | 1 | "0.5*(x^2+y^2+z^2)"
    %
    %Coordinates
     "HO" | 0.00 | 0.00 | 0.00
     "HO" | 0.50 | 0.00 | 0.00
     "HO" | -0.50 | 0.00 | 0.00
    %
    NonInteractingElectrons = yes
    OutputKSPotential = yes
    OutputDensity = yes
    OutputWfs = yes
    OutputELF = no
    OutputGeometry = no
- non-uniform grid, Lanczos eigensolver, about 350000 points, stencil size 4 points in each direction in 3D, which looks like this:
    BoxShape = parallelepiped
    %Lsize
     2 | 2 | 2
    %
    %Spacing
     0.2 | 0.2 | 0.2
    %
    LCAOStart = no
    ProfilingMode = yes
    Dimensions = 3
    CurvMethod = curv_gygi
    CurvGygiA = 1.0
    CurvGygiAlpha = 3.0
    CurvGygiBeta = 7.0
    DerivativesStencil = stencil_starplus
    EigenSolver = lanczos
    EigenSolverInitTolerance = 5e-3
    EigenSolverFinalTolerance = 5e-7
    EigenSolverFinalToleranceIteration = 10
    EigenSolverMaxIter = 200
    XFunctional = no
    CFunctional = no
    %Species
     "HO" | 2 | 1 | 1 | "0.5*(x^2+y^2+z^2)"
    %
    %Coordinates
     "HO" | 0.00 | 0.00 | 0.00
     "HO" | 0.50 | 0.00 | 0.00
     "HO" | -0.50 | 0.00 | 0.00
    %
    NonInteractingElectrons = yes
    OutputKSPotential = yes
    OutputDensity = yes
    OutputWfs = yes
    OutputELF = no
    OutputGeometry = no
Uniform grid
The x-axis is the number of processors and the y-axis is the average time for one SCF cycle, for one application of the non-local operator, or for one dot-product, respectively. In the diagrams for the non-local operator and the dot-product, the time consumed by communication is shown in green.
- Scaling of the SCF cycles.
- The non-local operator. It seems that it only scales well up to a certain point.
- The dot-product. This looks rather wild.
The reason for this irregular scaling is that the allreduce operation is performed in a binary-tree fashion, which works particularly well for 2^n nodes. It is important to keep this in mind, because it implies that it is better to run on four nodes than on six (or on eight, if available, of course).
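To make the 2^n argument concrete, here is a minimal sketch in Python that simulates a tree-based allreduce (a reduce to rank 0 followed by a broadcast) and counts the communication rounds. It is a simplified model, not the algorithm of any particular MPI library; the function name `tree_allreduce` and the round-counting model are assumptions made for illustration only.

```python
def tree_allreduce(values):
    """Simulate a sum-allreduce over len(values) ranks.

    Simplified model (an assumption, not a real MPI implementation):
    a binary-tree reduce to rank 0, then a binary-tree broadcast.
    Returns (list of results, one per rank; number of communication rounds).
    """
    p = len(values)
    partial = list(values)
    rounds = 0

    # Reduce phase: in each round, rank r + step sends its partial sum to rank r.
    step = 1
    while step < p:
        for r in range(0, p, 2 * step):
            partner = r + step
            if partner < p:
                partial[r] += partial[partner]
        step *= 2
        rounds += 1

    # Broadcast phase: the same tree, traversed from the root downwards.
    result = [None] * p
    result[0] = partial[0]
    step //= 2
    while step >= 1:
        for r in range(0, p, 2 * step):
            partner = r + step
            if partner < p:
                result[partner] = result[r]
        step //= 2
        rounds += 1

    return result, rounds


if __name__ == "__main__":
    for p in (1, 2, 4, 6, 8, 10):
        _, rounds = tree_allreduce([1.0] * p)
        print(f"{p:2d} processes: {rounds} rounds")
```

In this model, six processes need as many rounds as eight, because the tree depth is set by the next power of two. For a dot-product, where each message is a single number and latency dominates, that would fit the observation that the extra nodes between two powers of two do not pay off.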
Non-uniform grid
Please keep in mind that these runs were done on fewer points than those with the uniform grid.
- Scaling of the SCF cycles.
- The non-local operator.
- The dot-product.
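To compare the two grids despite their different sizes, the timings can be converted to speedup and parallel efficiency relative to the single-processor run. The sketch below uses the standard definitions (speedup = T(1)/T(p), efficiency = speedup/p). The helper name `scaling_table` is hypothetical, and the numbers in the example are placeholders, not measurements from these runs.

```python
def scaling_table(timings):
    """Compute speedup and efficiency from average timings.

    timings: dict mapping number of processors -> average time (seconds).
             Must contain an entry for 1 processor as the baseline.
    Returns a list of (processors, speedup, efficiency), sorted by processors.
    """
    t1 = timings[1]
    rows = []
    for p in sorted(timings):
        speedup = t1 / timings[p]
        rows.append((p, speedup, speedup / p))
    return rows


if __name__ == "__main__":
    # Placeholder values for illustration only, NOT measured data.
    example = {1: 8.0, 2: 4.0, 4: 2.5}
    print("procs  speedup  efficiency")
    for p, s, e in scaling_table(example):
        print(f"{p:5d}  {s:7.2f}  {e:10.2f}")
```

An efficiency close to 1 means near-ideal scaling; a drop at a given processor count points to where communication starts to dominate.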
To do
- Profiling with more processors.
- Profiling with the Hartree-term.
- More measurements to get an idea of the errors in these results...
- Scaling of the non-local operator depending on the stencil-size.
- _Please add whatever you think is necessary._






