Profiling

From OctopusWiki


I have started profiling the parallelization. Because computing the Hartree term causes numerical problems (an increased number of SCF cycles), I started with non-interacting electrons. Although this is not a very realistic example, it does show the scaling of the non-local operator and the dot-product.

The profiling was done on 1, 2, 4, 6, 8, and 10 AMD64 processors at the FU Berlin Physics department. I would have liked to use more nodes, but no more than 10 were available.

I benchmarked two different input files:

  • uniform grid, cg eigensolver, about 700000 points, stencil size 4 points in each direction in 3D, which looks like this:
BoxShape = parallelepiped

%Lsize
 2 | 2 | 2
%

%Spacing
  0.05 | 0.05 | 0.05
%

LCAOStart = no
ProfilingMode = yes
Dimensions = 3
XFunctional = no
CFunctional = no

%Species
  "HO" | 2 | 1 | 1 | "0.5*(x^2+y^2+z^2)"
%

%Coordinates
  "HO" |  0.00 | 0.00 | 0.00
  "HO" |  0.50 | 0.00 | 0.00
  "HO" | -0.50 | 0.00 | 0.00
%

NonInteractingElectrons = yes

OutputKSPotential = yes
OutputDensity     = yes
OutputWfs         = yes
OutputELF         = no
OutputGeometry    = no
  • non-uniform grid, lanczos eigensolver, about 350000 points, stencil size 4 points in each direction in 3D, which looks like this:
BoxShape = parallelepiped

%Lsize
 2 | 2 | 2
%

%Spacing
  0.2 | 0.2 | 0.2
%

LCAOStart = no
ProfilingMode = yes
Dimensions = 3

CurvMethod = curv_gygi
CurvGygiA = 1.0
CurvGygiAlpha = 3.0
CurvGygiBeta = 7.0
DerivativesStencil = stencil_starplus

EigenSolver = lanczos
EigenSolverInitTolerance = 5e-3
EigenSolverFinalTolerance = 5e-7
EigenSolverFinalToleranceIteration = 10
EigenSolverMaxIter = 200

XFunctional = no
CFunctional = no

%Species
  "HO" | 2 | 1 | 1 | "0.5*(x^2+y^2+z^2)"
%

%Coordinates
  "HO" |  0.00 | 0.00 | 0.00
  "HO" |  0.50 | 0.00 | 0.00
  "HO" | -0.50 | 0.00 | 0.00
%

NonInteractingElectrons = yes

OutputKSPotential = yes
OutputDensity     = yes
OutputWfs         = yes
OutputELF         = no
OutputGeometry    = no

Uniform grid

The x-axis is the number of processors and the y-axis is the average time for one SCF cycle, for one application of the non-local operator, or for one dot-product, respectively. In the diagrams for the non-local operator and the dot-product, the time consumed by communication is shown in green.
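To read scaling off such timing curves, it can help to convert the average times into speedup and parallel efficiency relative to the single-processor run. The following is a minimal sketch of that conversion. The function name `speedup_and_efficiency` and the timings in `placeholder` are assumptions made up for illustration; they are not measurements from the runs on this page.

```python
def speedup_and_efficiency(times):
    """Convert average times into speedup and parallel efficiency.

    times: dict mapping processor count -> average seconds per operation
           (e.g. per SCF cycle). Must contain an entry for 1 processor.
    Returns a dict mapping processor count -> (speedup, efficiency).
    """
    t1 = times[1]
    result = {}
    for p, t in sorted(times.items()):
        speedup = t1 / t            # how much faster than the serial run
        efficiency = speedup / p    # 1.0 would be ideal linear scaling
        result[p] = (speedup, efficiency)
    return result


# Placeholder timings, invented only to exercise the function.
# They are NOT taken from the profiling runs described here.
placeholder = {1: 100.0, 2: 50.0, 4: 25.0, 8: 20.0}

for p, (s, e) in speedup_and_efficiency(placeholder).items():
    print(f"{p:2d} processors: speedup {s:.3f}, efficiency {e:.3f}")
```

An efficiency that drops well below 1 as processors are added is what "only scales well up to a certain point" looks like in numbers.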

  • Scaling of the SCF cycles.

Image:profile_scf.jpg

  • The non-local operator. It seems that it only scales well up to a certain point.

Image:profile_nlop.jpg

  • This looks rather wild - the dot-product.

The reason for this irregular scaling is that an allreduce operation is performed in a binary-tree fashion, which works particularly well for 2^n nodes. It is important to keep this in mind because it implies that it is better to run on four nodes than on six (or on eight, if available, of course).

Image:profile_dotp.jpg
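The power-of-two effect can be illustrated with a toy model of a tree-based allreduce. The sketch below is a plain-Python simulation, not the MPI implementation used in these runs: it assumes a binomial-tree reduce to rank 0 followed by a broadcast that mirrors it, and it only counts communication rounds. The function name `tree_allreduce` is hypothetical, and real MPI libraries may choose other algorithms.

```python
def tree_allreduce(values):
    """Simulate a sum-allreduce done as a tree reduce plus a mirrored broadcast.

    values: one number per (simulated) process.
    Returns (list with the result as seen by every process,
             total number of communication rounds).
    """
    p = len(values)
    partial = list(values)
    rounds = 0
    step = 1
    # Reduce phase: in each round, rank r + step sends its partial sum to rank r.
    while step < p:
        for r in range(0, p, 2 * step):
            partner = r + step
            if partner < p:          # ranks without a partner sit idle this round
                partial[r] += partial[partner]
        step *= 2
        rounds += 1
    # Broadcast phase: assumed to take the same number of rounds as the reduce.
    rounds *= 2
    return [partial[0]] * p, rounds


# Processor counts used in the profiling runs; each process contributes 1.0.
for p in (1, 2, 4, 6, 8, 10):
    result, rounds = tree_allreduce([1.0] * p)
    print(f"{p:2d} processes: sum = {result[0]:4.1f}, rounds = {rounds}")
```

In this model, six processes need as many rounds as eight (6 each), while four need only 4. The six-process run therefore pays the communication cost of the next power of two, with some ranks idle in part of the rounds, which is consistent with the irregular curve.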

Non-uniform grid

Please keep in mind that these runs were done on fewer points than those with the uniform grid.

  • Scaling of the SCF cycles.

Image:profile_scf_c.jpg

  • The non-local operator.

Image:profile_nlop_c.jpg

  • The dot-product.

Image:profile_dotp_c.jpg

To do

  • Profiling with more processors.
  • Profiling with the Hartree-term.
  • More measurements to get an idea of the errors in these results...
  • Scaling of the non-local operator depending on the stencil-size.
  • _Please add whatever you think is necessary._