Profiling

From OctopusWiki

I have started profiling the parallelization. Because computing the Hartree term currently causes numerical problems (an increased number of SCF cycles), I started with non-interacting electrons. Although this is not a very realistic example, it does show the scaling of the non-local operator and the dot-product.

The profiling was done on 1, 2, 4, 6, 8, and 10 AMD64 processors at the FU Berlin Physics department. I would have liked to use more nodes, but no more than 10 were available.

I did benchmarking with two different input files:

  • uniform grid, cg eigensolver, about 700000 points, stencil size 4 points in each direction in 3D, which looks like this:
BoxShape = parallelepiped

%Lsize
 2 | 2 | 2
%

%Spacing
  0.05 | 0.05 | 0.05
%

LCAOStart = no
ProfilingMode = yes
Dimensions = 3
XFunctional = no
CFunctional = no

%Species
  "HO" | 2 | 1 | 1 | "0.5*(x^2+y^2+z^2)"
%

%Coordinates
  "HO" |  0.00 | 0.00 | 0.00
  "HO" |  0.50 | 0.00 | 0.00
  "HO" | -0.50 | 0.00 | 0.00
%

NonInteractingElectrons = yes

OutputKSPotential = yes
OutputDensity     = yes
OutputWfs         = yes
OutputELF         = no
OutputGeometry    = no
  • non-uniform grid, lanczos eigensolver, about 350000 points, stencil size 4 points in each direction in 3D, which looks like this:
BoxShape = parallelepiped

%Lsize
 2 | 2 | 2
%

%Spacing
  0.2 | 0.2 | 0.2
%

LCAOStart = no
ProfilingMode = yes
Dimensions = 3

CurvMethod = curv_gygi
CurvGygiA = 1.0
CurvGygiAlpha = 3.0
CurvGygiBeta = 7.0
DerivativesStencil = stencil_starplus

EigenSolver = lanczos
EigenSolverInitTolerance = 5e-3
EigenSolverFinalTolerance = 5e-7
EigenSolverFinalToleranceIteration = 10
EigenSolverMaxIter = 200

XFunctional = no
CFunctional = no

%Species
  "HO" | 2 | 1 | 1 | "0.5*(x^2+y^2+z^2)"
%

%Coordinates
  "HO" |  0.00 | 0.00 | 0.00
  "HO" |  0.50 | 0.00 | 0.00
  "HO" | -0.50 | 0.00 | 0.00
%

NonInteractingElectrons = yes

OutputKSPotential = yes
OutputDensity     = yes
OutputWfs         = yes
OutputELF         = no
OutputGeometry    = no

Uniform grid

The x-axis shows the number of processors and the y-axis the average time for one SCF cycle, one computation of the non-local operator, or one dot-product, respectively. In the diagrams for the non-local operator and the dot-product, the time consumed by communication is shown in green.
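Average times like those plotted here can be turned into speedup and parallel efficiency, which makes the scaling easier to compare across processor counts. The following is a minimal sketch, not part of Octopus: the function name `scaling_table` is hypothetical, and the timings in the usage example are invented placeholders, not the measurements shown in the figures.

```python
def scaling_table(mean_times):
    """Return (procs, speedup, efficiency) rows from a {procs: mean seconds} dict.

    Speedup is measured relative to the smallest processor count present;
    efficiency is the speedup divided by the relative increase in processors.
    """
    base_procs = min(mean_times)
    base_time = mean_times[base_procs]
    rows = []
    for procs in sorted(mean_times):
        speedup = base_time / mean_times[procs]
        efficiency = speedup * base_procs / procs
        rows.append((procs, speedup, efficiency))
    return rows


# Placeholder timings in seconds (invented for illustration only).
example_times = {1: 8.0, 2: 4.0, 4: 2.5, 8: 2.0}

for procs, speedup, efficiency in scaling_table(example_times):
    print(f"{procs:2d} procs  speedup {speedup:5.2f}  efficiency {efficiency:5.2f}")
```

An efficiency close to 1 means the extra processors are used well; a falling efficiency indicates that communication or serial work starts to dominate.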

  • Scaling of the SCF cycles.

Profile scf.jpg

  • The non-local operator. It seems that it only scales well up to a certain point.

Profile nlop.jpg

  • The dot-product. This looks rather wild.

Profile dotp.jpg

The reason for this irregular scaling is that an allreduce operation is performed in a binary tree fashion, which works particularly well for 2^n nodes. This is important to keep in mind because it implies that it is better to run on four nodes than on six (or on eight, if available, of course).
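The effect of the binary tree can be illustrated with a small counting model. This is a simplified sketch under the assumption that values are combined pairwise in rounds and that an unpaired value simply carries over to the next round; the function name `tree_reduce_rounds` is hypothetical, and a real MPI library may use a different allreduce algorithm.

```python
def tree_reduce_rounds(num_procs):
    """Count the pairwise-combine rounds needed to reduce num_procs values to one."""
    rounds = 0
    active = num_procs
    while active > 1:
        # Each round pairs up the active values; an odd one out carries over.
        active = (active + 1) // 2
        rounds += 1
    return rounds


# Processor counts used in the benchmarks on this page.
for procs in (1, 2, 4, 6, 8, 10):
    print(procs, "processors:", tree_reduce_rounds(procs), "rounds")
```

In this model six processors need as many rounds as eight, so the communication cost of the dot-product grows in steps at each power of two while the work per processor shrinks smoothly.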

Non-uniform grid

Please keep in mind that these runs were done on fewer points than those with the uniform grid.

  • Scaling of the SCF cycles.

Profile scf c.jpg

  • The non-local operator.

Profile nlop c.jpg

  • The dot-product.

Profile dotp c.jpg

To do

  • Profiling with more processors.
  • Profiling with the Hartree-term.
  • More measurements to get an idea of the errors in these results...
  • Scaling of the non-local operator depending on the stencil-size.
  • _Please add whatever you think is necessary._
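For the item on measurement errors, one simple approach is to repeat each run several times and report the mean together with its standard error. The sketch below assumes the per-run timings have already been collected into a list; the function name `mean_and_stderr` is hypothetical and nothing here is provided by Octopus itself.

```python
import math
import statistics


def mean_and_stderr(samples):
    """Return the mean of the samples and the standard error of that mean."""
    mean = statistics.mean(samples)
    if len(samples) < 2:
        # A single run gives no information about the spread.
        return mean, float("nan")
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    stderr = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, stderr
```

Plotting the mean with error bars of one standard error would show whether differences between neighbouring processor counts are significant.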