Parallelization and performance
This tutorial page was set up for the Benasque TDDFT school 2014. The specific references to the supercomputer used at that time will have to be adapted for others to use this tutorial.
In this tutorial we will see how to run a relatively big system on the Hopper supercomputer (at NERSC in California), and how to measure its performance. There are a few key things you need to know about how to interact with the machine. To log in, run
ssh trainX@hopper.nersc.gov
in your terminal, substituting the actual name of your training account. Be aware that since this machine is far away, you should not try running X-Windows programs! You submit jobs with the qsub command, e.g. qsub job.scr
, which puts them in the queue for execution when there is free space. You can see what jobs you currently have in the queue by executing qstat -u $USER
, and thus tell when your job finishes. A status code is shown for each job: Q = waiting in the queue, R = running, C = complete. You can cancel a job with qdel
followed by the job number, as printed by qstat
. The job script (e.g. job.scr
) specifies parameters to the PBS/Torque queuing system: how many cores to use, what commands to run, etc.
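For example, a typical session with the queue might look like this (the job number given to qdel is hypothetical):
qsub job.scr
qstat -u $USER
qdel 123456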
Running the ground state
We will need the input file, job submission script, coordinates file, and a pseudopotential for Mg (for the other elements we will use the default ones that come with Octopus). The pseudopotential is available in the Octopus directory at jube/input/Mg.fhi
You can copy the files on Hopper directly from /global/homes/d/dstrubbe/octopus_tutorial to your scratch directory as follows:
cd $SCRATCH
cp -r /global/homes/d/dstrubbe/octopus_tutorial .
The input file (inp):
CalculationMode = gs
#### System size and parameters
Spacing = 0.20
Radius = 4.0
Units = ev_angstrom
XYZCoordinates = "xyz"
%Species
"Mg" | 24.305 | spec_ps_fhi | 12 | 3 | 2
%
ExcessCharge = 0
XCFunctional = gga_x_pbe + gga_c_pbe
ExtraStates = 18
Eigensolver = rmmdiis
LCAOAlternative = yes
SmearingFunction = fermi_dirac
Smearing = 0.1
Mixing = 0.15
#### GS
MaximumIter = 300
EigensolverTolerance = 1e-8
ConvRelDens = 5e-8
#### Saving memory
SymmetriesCompute = no
PartitionPrint = no
MeshPartitionPackage = metis
# Additional options
ExperimentalFeatures = yes
The submission script job.scr, using 24 cores:
#!/bin/bash
#PBS -q regular
#PBS -l mppwidth=24
#PBS -l advres=benasque.348
#PBS -l walltime=0:30:00
#PBS -N testing_chl
#PBS -V
module load octopus/4.1.2
cd $PBS_O_WORKDIR
aprun -n 24 octopus_mpi &> output_gs_24
To run:
qsub job.scr
The coordinates file is xyz. Take a look at it (on your local machine) with visualization software such as xcrysden to see what kind of molecule we are dealing with.
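For instance, you might copy the file to your own machine and open it as follows (the remote path is illustrative; adjust it to wherever you placed the files in your scratch directory):
scp trainX@hopper.nersc.gov:/path/to/octopus_tutorial/xyz .
xcrysden --xyz xyz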
When your job finishes, take a look at the output to see what happened and make sure it completed successfully. Then we can do time-propagation.
Profiling the time-dependent run
Now we modify the input file accordingly: change the CalculationMode from gs
to td
, and add the following lines:
#### TD
T = 18
dt = 0.003
TDPropagator = aetrs
TDTimeStep = dt
# Profiling
ProfilingMode = prof_memory
TDMaxSteps = 30
FromScratch = yes
Now it is time to do exactly the same TD run while changing the number of cores. Change XXX to powers of 2 (2^x): start at 64 (which will be fastest) and divide by 2 at each step, down to 4. (Running on 1 or 2 cores may not work.)
#!/bin/bash
#PBS -q regular
#PBS -l mppwidth=XXX
#PBS -l advres=benasque.348
#PBS -l walltime=0:30:00
#PBS -N testing_chl
#PBS -V
module load octopus/4.1.2
cd $PBS_O_WORKDIR
aprun -n XXX octopus_mpi &> output_td_XXX
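To avoid editing the script by hand for every core count, you could generate and submit one job per value with a small loop like this sketch (it assumes the template above is saved as job_td.template, with the literal XXX placeholders left in):
#!/bin/bash
# Substitute the core count into the template and submit one job per value.
for n in 64 32 16 8 4; do
  sed "s/XXX/$n/g" job_td.template > job_td_$n.scr
  qsub job_td_$n.scr
done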
A separate profiling.000xxx
folder will be created by each execution. We need to process them, mainly to be able to plot the information they contain. For that we can run the script below. It runs fine without any arguments, but we can control which files it processes by passing up to three arguments, e.g. analyze.sh 64 000004 2: the first argument is the largest number of cores to consider; the second (optional) argument is the number of the profiling folder used as reference; the third is the starting value, i.e. the smallest number of cores to consider.
#!/bin/bash
## Copyright (C) 2012,2014 J. Alberdi-Rodriguez
##
## This program is free software; you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 2, or (at your option)
## any later version.
##
## This program is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
## GNU General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with this program; if not, write to the Free Software
## Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
## 02111-1307, USA.
##
## analyze.sh
# Define the biggest number of processors.
if [ -z "$1" ]; then
last_proc=64
else
last_proc=$1
fi
# Define the reference file/folder
if [ -z "$2" ]; then
ref=000004
else
ref=$2
fi
# Define the starting value
if [ -z "$3" ]; then
start=1
else
start=$3
fi
# Initialise the output file with the list of process counts
echo "- " > profile_$start
for ((num=$start;num<=$last_proc;num*=2)); do
echo $num >> profile_$start
done
rm -f tmp1
# Analyze all profiling.XXXXXX/time.000000 to get the time per subroutine
count=0
for function_name in $(awk '{print $1}' profiling.$ref/time.000000)
do
if [ $count -lt 4 ]; then # skip the four header lines of time.000000
count=$((count + 1))
else
echo $function_name >> tmp1
# Iterate over the power-of-two profiling folders
for ((num=$start;num<=$last_proc;num*=2)); do
folder=`printf 'profiling.%06d\n' $num `
x=$(grep "^$function_name " $folder/time.000000 | awk '{print $3}')
# Record the time, or 0 if this function does not appear in this run
if [ -n "$x" ]; then
echo $x >> tmp1
else
echo "0" >> tmp1
fi
done
paste profile_$start tmp1 > tmp2
rm tmp1
cp tmp2 profile_$start
fi
done
echo "The result is in the \"profile_$start\" file"
At this point we should run analyze.sh 64 000004 2, which will create a file named profile_2. You can take a look at the following columns in the profiling data:
- TIME_STEP: the time per iteration. It scales well.
- COMPLETE_DATASET: the total execution time. In general it decreases with the number of processes; this is more obvious in a real execution, where the initialization time stays the same while the propagation time is much larger.
- SYSTEM_INIT: the initialization time. It is now almost constant, independent of the number of processes.
- POISSON_SOLVER: the execution time of the Poisson solver. It is roughly constant in this case; compare it later with the other solvers and with domain parallelization.
- RESTART_WRITE: the time for writing the restart files. It depends more on the state of the filesystem than on the number of running processes, and could be greatly reduced by writing to a local drive.
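The resulting profile_2 file is a plain whitespace-separated table: the first column is the number of MPI processes, and each subsequent column is the time spent in one subroutine, with the subroutine names in a header row. Schematically (with placeholders instead of actual timings):
-    TIME_STEP    COMPLETE_DATASET    ...
2    <time>       <time>              ...
4    <time>       <time>              ...
...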
Now we can plot it using the following script:
#!/bin/bash
## Copyright (C) 2014 J. Alberdi-Rodriguez
##
## This program is free software; you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 2, or (at your option)
## any later version.
##
## This program is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
## GNU General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with this program; if not, write to the Free Software
## Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
## 02111-1307, USA.
##
## plot_function.sh
# Default to TIME_STEP when no function name is given as an argument
if [ $# -eq 0 ]; then
function="TIME_STEP"
else
function=$1
fi
echo $function
# Find the data column of the requested function: field i in the header
# row ("- FUNC1 FUNC2 ...") corresponds to data column i of profile_2.
column_number=$( awk -v fun=$function '
{ for(i=1;i<=NF;i++){
if ($i == fun)
{print i}
}
}' profile_2 )
script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
sed "s/REF/$column_number/g" $script_dir/plot_ref > plot_base
sed -i "s/FUNCTION/$function/g" plot_base
gnuplot plot_base
We also need this auxiliary file:
## Copyright (C) 2014 J. Alberdi-Rodriguez
##
## This program is free software; you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 2, or (at your option)
## any later version.
##
## This program is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
## GNU General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with this program; if not, write to the Free Software
## Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
## 02111-1307, USA.
##
## plot_ref
set t postscript eps enhanced color solid
set output "gnuplot.eps"
set xlabel "MPI processes"
set ylabel "t (s)"
set logscale xy 2
plot "profile_2" u 1:REF w linespoint t "FUNCTION 2^x"
Something else you can try is 12, 24, 48, and 96 cores, because each of Hopper's nodes has 24 cores. In this case you would need analyze.sh 96 000003 3 to make profile_3, and then in the plotting script,
plot "profile_2" u 1:REF w linespoint t "FUNCTION 2^x", "profile_3" u 1:REF w lp t "FUNCTION 3ยท(2^x)"
Parallelization in domains vs states
We can divide the work among the processors in different ways: by splitting the grid points into domains, one per processor; by splitting the states into groups, one per processor; or by a combination of both. Try out different combinations by adding to your input file
ParStates = 2
ParDomains = 12
and run on 24 cores, trying different values for these two variables such that their product is the total number of processors (e.g. 6 × 4, 3 × 8, …).
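For instance, another valid combination on 24 cores would be the following (any pair whose product is 24 works; these particular values are just an example):
ParStates = 4
ParDomains = 6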
PFFT Poisson solver
Another thing you can try is to compare the PFFT (parallel FFT) Poisson solver against the one we were using before (look in the output file to see which one it was). You will need to use this aprun line in your job script instead of the previous one:
aprun -n XXX /global/homes/j/joseba/octopus/bin/octopus_mpi &> output_td_XXX
in order to use a different Octopus executable compiled with that library, and add these lines to your input file:
PoissonSolver = fft
FFTLibrary = pfft
Compare some runs against previous ones on the same number of processors. How does the time for this solver compare? You can also try the ordinary (non-parallel) FFTW with
FFTLibrary = fftw
Parallelization of the ground state
We can also try different parameters and algorithms to see their effect on the speed of the ground-state calculation, using 24, 48, or 96 processors. Look each variable up in the variable reference to see what it means, and check which of the options you were using in the previous runs.
- parallelization in domains vs states (as above)
- Eigensolver = rmmdiis, plan, cg, cg_new, lobpcg.
- StatesOrthogonalization = cholesky_serial, cholesky_parallel, mgs, qr
- SubspaceDiagonalization = standard, scalapack
- linear combination of atomic orbitals (LCAO) for the initial guess: LCAOAlternative = yes, no. In this case, add MaximumIter = 0 to run just the LCAO rather than the whole calculation (see the sketch below).
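As a sketch, an input fragment for timing just the LCAO step with a different eigensolver and orthogonalization might look like this (the particular values chosen here are only an example; combine them with the ground-state input above):
Eigensolver = cg
StatesOrthogonalization = mgs
LCAOAlternative = no
MaximumIter = 0   # stop right after the LCAO initial guess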