Parallelization and performance

This tutorial page was set up for the Benasque TDDFT school 2014. The specific references to the supercomputer used at that time will have to be adapted by anyone following this tutorial elsewhere.

In this tutorial we will see how to run a relatively big system on the Hopper supercomputer (at NERSC in California), and how to measure its performance. There are a few key things you need to know about how to interact with the machine. To log in, run ssh trainX@hopper.nersc.gov in your terminal, substituting the actual name of your training account. Be aware that since this machine is far away, you should not try to run X-Windows programs over the connection! You submit jobs with the qsub command, e.g. qsub job.scr, which puts them in the queue for execution when there is free space. You can see which jobs you currently have in the queue by executing qstat -u $USER, so you can tell when your job finishes. A status code will be shown: Q = waiting in the queue, R = running, C = complete. You can cancel a job with qdel followed by the job number, as shown by qstat. The job script (e.g. job.scr) tells the PBS/Torque queuing system how many cores to use, what commands to run, etc.
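
For reference, a typical session looks like this (trainX and the job number are placeholders for your own account and job):

 ssh trainX@hopper.nersc.gov   # log in with your training account
 qsub job.scr                  # submit the job; qsub prints the job number
 qstat -u $USER                # check the queue: Q = queued, R = running, C = complete
 qdel 1234567                  # cancel a job, using the number shown by qstat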

Running the ground state

We will need the input file, job submission script, coordinates file, and a pseudopotential for Mg (for the other elements we will use the default ones that come with Octopus). The pseudopotential is available in the Octopus directory at jube/input/Mg.fhi. You can copy the files on Hopper directly from /global/homes/d/dstrubbe/octopus_tutorial to your scratch directory as follows:

 cd $SCRATCH
 cp -r /global/homes/d/dstrubbe/octopus_tutorial .

The input file (inp):


 CalculationMode = gs

 #### System size and parameters

 Spacing = 0.20
 Radius = 4.0

 Units = ev_angstrom
 XYZCoordinates = "xyz"
 %Species
  "Mg" | 24.305 | spec_ps_fhi | 12 | 3 | 2
 %
 ExcessCharge = 0

 XCFunctional = gga_x_pbe + gga_c_pbe

 ExtraStates = 18
 Eigensolver = rmmdiis
 LCAOAlternative = yes
 SmearingFunction = fermi_dirac
 Smearing = 0.1
 Mixing = 0.15

 #### GS
 MaximumIter = 300
 EigensolverTolerance = 1e-8
 ConvRelDens = 5e-8

 #### Saving memory

 SymmetriesCompute = no
 PartitionPrint = no
 MeshPartitionPackage = metis

 # Additional options
 ExperimentalFeatures = yes

Submission script job.scr for 24 processor cores:

 #!/bin/bash
 #PBS -q regular
 #PBS -l mppwidth=24
 #PBS -l advres=benasque.348
 #PBS -l walltime=0:30:00
 #PBS -N testing_chl
 #PBS -V

 module load octopus/4.1.2
 cd $PBS_O_WORKDIR
 aprun -n 24 octopus_mpi &> output_gs_24

To run:

 qsub job.scr

Coordinates file: xyz. Take a look at it (on your local machine) with visualization software such as xcrysden to see what kind of molecule we are dealing with.
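
To view it locally, you can copy the file from Hopper first. This is only a sketch: replace /path/to with the actual location of the copied tutorial files on Hopper.

 # run these on your local machine, not on Hopper
 scp trainX@hopper.nersc.gov:/path/to/octopus_tutorial/xyz .
 xcrysden --xyz xyz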


When your job finishes, take a look at the output to see what happened and make sure it completed successfully. Then we can do time-propagation.
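
A quick sanity check, assuming the output file name used in the job script above (the exact convergence message depends on the Octopus version):

 tail output_gs_24                 # the end of the file should report a finished run
 grep -i converged output_gs_24    # non-empty output indicates the SCF cycle converged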

Running the time-dependent profiling

Next we modify the input file: change the CalculationMode from gs to td, and add the following lines:


 ##### TD

 T = 18
 dt = 0.003
 TDPropagator = aetrs
 TDTimeStep = dt

 # Profiling

 ProfilingMode = prof_memory
 TDMaxSteps = 30
 FromScratch = yes

Now it is time to do exactly the same TD run while changing the number of processor cores. Change the XXX in the script below to powers of 2: start at 64 (which will be the fastest) and divide by 2 in steps down to 4. (Running on 2 or 1 cores may not work.) A small loop to generate and submit one script per core count is sketched after the template.

 #!/bin/bash
 #PBS -q regular
 #PBS -l mppwidth=XXX
 #PBS -l advres=benasque.348
 #PBS -l walltime=0:30:00
 #PBS -N testing_chl
 #PBS -V

 module load octopus/4.1.2
 cd $PBS_O_WORKDIR
 aprun -n XXX octopus_mpi &> output_td_XXX
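
If you prefer not to edit the script by hand for every core count, a minimal sketch such as the following will do it (assuming the template above is saved under the hypothetical name job_td.scr):

 for n in 64 32 16 8 4; do
     sed "s/XXX/$n/g" job_td.scr > job_td_$n.scr   # fill in the core count
     qsub job_td_$n.scr                            # submit one job per count
 done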

A different profiling.000xxx folder will be created by each execution. We need to process them, mainly so we can plot the information they contain. For that we can run the script below. It runs fine without any arguments, but we can control which files it processes by passing arguments, e.g. "analyze.sh 64 000004 2". The first argument is the largest number of processor cores to consider. The second (optional) argument is the number of the reference profiling folder. The third is the starting value, i.e. the smallest number of processor cores to consider.

 #!/bin/bash

 ## Copyright (C) 2012,2014 J. Alberdi-Rodriguez
 ##
 ## This program is free software; you can redistribute it and/or modify
 ## it under the terms of the GNU General Public License as published by
 ## the Free Software Foundation; either version 2, or (at your option)
 ## any later version.
 ##
 ## This program is distributed in the hope that it will be useful,
 ## but WITHOUT ANY WARRANTY; without even the implied warranty of
 ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 ## GNU General Public License for more details.
 ##
 ## You should have received a copy of the GNU General Public License
 ## along with this program; if not, write to the Free Software
 ## Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
 ## 02111-1307, USA.
 ##
 ## analyze.sh

 # Define the biggest number of processors.
 if [ -z "$1" ]; then
     last_proc=64
 else
     last_proc=$1
 fi

 # Define the reference file/folder
 if [ -z "$2" ]; then
     ref=000004
 else
     ref=$2
 fi

 # Define the starting value
 if [ -z "$3" ]; then
     start=1
 else
     start=$3
 fi

 #Initialise the output file
 echo "-  " > profile_$start
 for ((num=$start;num<=$last_proc;num*=2)); do
     echo $num >> profile_$start
 done

 rm -f tmp1

 # Analyze all profiling.XXXXXX/time.000000 to get the time per subroutine
 count=0
 for function_name in $(awk '{print $1}' profiling.$ref/time.000000)
 do
     if [ $count -lt 4 ]; then
 	count=$((count + 1))
     else
 	echo $function_name >> tmp1
 	# iterate over the power-of-two profiling folders
 	for ((num=$start;num<=$last_proc;num*=2)); do
 	    folder=`printf 'profiling.%06d\n' $num `
 	    x=$(grep "^$function_name " $folder/time.000000 | awk '{print $3}')
 	    zero=_"$x"_
 	    if [ "$zero" != "__" ]; then
 		echo $x >> tmp1
 	    else
 		echo "0" >> tmp1
 	    fi
 	done
 	paste profile_$start tmp1 > tmp2
 	rm tmp1
 	cp tmp2 profile_$start
     fi
 done

 echo "The result is in the \"profile_$start\" file"

At this point we should run "analyze.sh 64 000004 2". This creates a file named "profile_2", whose first column is the number of cores and whose remaining columns hold the time spent in each profiled routine. Take a look at the columns in the profiling data before plotting.
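
To get a quick, aligned view of the resulting table (the exact set of columns depends on which routines the Octopus profiler reports), you can for instance run:

 column -t profile_2 | less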

Now we can plot it using the following script:

 #!/bin/bash

 ## Copyright (C) 2014 J. Alberdi-Rodriguez
 ##
 ## This program is free software; you can redistribute it and/or modify
 ## it under the terms of the GNU General Public License as published by
 ## the Free Software Foundation; either version 2, or (at your option)
 ## any later version.
 ##
 ## This program is distributed in the hope that it will be useful,
 ## but WITHOUT ANY WARRANTY; without even the implied warranty of
 ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 ## GNU General Public License for more details.
 ##
 ## You should have received a copy of the GNU General Public License
 ## along with this program; if not, write to the Free Software
 ## Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
 ## 02111-1307, USA.
 ##
 ## plot_function.sh

 if [ $# -eq 0 ]; then
    function="TIME_STEP"
 else
    function=$1
 fi
 echo $function

 column_number=$( awk -v fun=$function '
 { for(i=1;i<=NF;i++){
    if ($i == fun)
       {print i+1 }
    }
 }' profile_2 )

 script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"

 sed  "s/REF/$column_number/g" $script_dir/plot_ref > plot_base
 sed -i "s/FUNCTION/$function/g" plot_base
 gnuplot plot_base

We also need this auxiliary file:

 ## Copyright (C) 2014 J. Alberdi-Rodriguez
 ##
 ## This program is free software; you can redistribute it and/or modify
 ## it under the terms of the GNU General Public License as published by
 ## the Free Software Foundation; either version 2, or (at your option)
 ## any later version.
 ##
 ## This program is distributed in the hope that it will be useful,
 ## but WITHOUT ANY WARRANTY; without even the implied warranty of
 ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 ## GNU General Public License for more details.
 ##
 ## You should have received a copy of the GNU General Public License
 ## along with this program; if not, write to the Free Software
 ## Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
 ## 02111-1307, USA.
 ##
 ## plot_ref

 set t postscript eps enhanced color solid
 set output "gnuplot.eps"

 set xlabel "MPI processes"
 set ylabel "t (s)"
 set logscale yx 2

 plot "profile_2" u 1:REF w linespoint t "FUNCTION 2^x"

Something else you can try is 12, 24, 48 and 96 cores, since each node has 24 cores. In this case you would need "analyze.sh 96 000003 3" to make "profile_3", and then, in the plotting script, use:

 plot "profile_2" u 1:REF w linespoint t "FUNCTION 2^x", "profile_3" u 1:REF w lp t "FUNCTION 3ยท(2^x)"

Parallelization in domains vs states

We can divide the work among the processors in different ways: by splitting the grid points into domains (one per processor), by splitting the states into groups, or by a combination of both. Try out different combinations by adding to your input file


 ParStates = 2
 ParDomains = 12

and run on 24 cores, with different values for the two variables such that their product equals the total number of processors (e.g. 6 x 4, 3 x 8, …).
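
For instance, one of the other combinations mentioned above for a 24-core run would be:

 # 3 groups of states x 8 domains = 24 processors
 ParStates = 3
 ParDomains = 8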

PFFT Poisson solver

Another thing you can try is to compare the PFFT (parallel FFT) Poisson solver against the one we were using before (look in the output file to see which one it was). You will need to use this aprun line in your job script instead of the previous one:

 aprun -n XXX /global/homes/j/joseba/octopus/bin/octopus_mpi &> output_td_XXX

in order to use a different Octopus build that is compiled against that library, and add these lines to your input file:


 PoissonSolver = fft
 FFTLibrary = pfft

Compare some runs against ones you did previously on a similar number of processors. How does the time for this solver compare? You can also try ordinary FFTW (not parallel) with


 FFTLibrary = fftw
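
One quick way to compare the cost of the Poisson solver between runs is to look at the Poisson-related rows of the profiling tables; the exact region names depend on the Octopus version, so adjust the pattern if needed:

 grep -i poisson profiling.000024/time.000000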

Parallelization of the ground state

We can also try different parameters and algorithms to see their effect on the speed of the ground-state calculation, for 24, 48, or 96 processors. Look up each variable in the variable reference to see what it means, and check which of the options you were using in the previous runs.