Tutorial:Running Octopus on Graphical Processing Units (GPUs)

From OctopusWiki

Recent versions of Octopus support execution on graphics processing units (GPUs). In this tutorial we explain how the GPU support works and how it can be used to speed up calculations.

Considerations before using a GPU

Calculations that will be accelerated by GPUs

A GPU is a massively parallel processor; to work efficiently it needs large amounts of data to process. This means that using a GPU to simulate small systems, such as molecules with fewer than 20 atoms or low-dimensional systems, will probably be slower than using the CPU.

Not all the calculations you can do with Octopus will be effectively accelerated by GPUs. Essentially, the simulations that use the GPU most effectively are ground-state calculations with the rmmdiis eigensolver (and calculations based on ground-state calculations, like geometry optimization) and time propagation with the etrs and aetrs propagators. For other types of calculations you may still see some improvement in performance. Moreover, some Octopus features do not support GPUs at all. Currently these are:

  • Periodic systems.
  • HGH or relativistic pseudopotentials.
  • Non-collinear spin.
  • Curvilinear coordinates.

In these cases Octopus will silently run on the CPU.
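As a sketch, the GPU-friendly options named above would be selected with input lines like the following. The variable names Eigensolver and TDPropagator are an assumption of this example (they are not given elsewhere in this tutorial), so check them against the variable reference for your Octopus version; use etrs or aetrs for the propagator.

 Eigensolver = rmmdiis
 TDPropagator = etrs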

Supported GPUs

Octopus GPU support is based on the OpenCL framework, so it is vendor independent; currently it has been tested on GPUs from Nvidia and AMD (formerly ATI).

To run Octopus, the GPU must support double precision. For AMD/ATI cards this is only available in high-end models: Radeon 58xx, 69xx, 78xx and 79xx, and FirePro/FireStream cards. All recent Nvidia cards (4xx series and newer) support double precision; however, low- and mid-range cards have very limited processing power and will probably run Octopus more slowly than a CPU.

The other important factor is GPU memory: you need a GPU with at least 1.5 GiB of RAM to run Octopus, but 3 GiB or more is recommended.

You can also run the OpenCL version of Octopus on CPUs; however, this is much slower than running Octopus on the CPU without OpenCL.

Activating GPU support

Octopus has had GPU support since version 4.0. To activate it you need a version compiled with OpenCL support enabled (see the manual for details). If your version of Octopus was compiled with OpenCL support, you should see 'opencl' among the configuration options in the header printed at the start of the run:

                           Running octopus

Version                : superciliosus
Revision               :
Build time             : Thu Feb 14 22:19:10 EST 2013
Configuration options  : max-dim=3 openmp opencl sse2 avx
Optional libraries     : netcdf clamdfft clamdblas

Ideally your Octopus version should also be compiled with the optional libraries 'clamdfft' and 'clamdblas', which are part of the AMD APPML package (these libraries also work on hardware from other vendors, including Nvidia).

Starting the tutorial: running without OpenCL

If you have a version of Octopus compiled with OpenCL support, you can tell Octopus not to use OpenCL by setting the DisableOpenCL variable to yes.
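For example, the line to add to your input file is the following (only this variable is shown; the rest of the input stays unchanged):

 DisableOpenCL = yes

Running the same input once with this line and once without it gives you a CPU reference timing to compare against the GPU run.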

Selecting the OpenCL device

OpenCL is designed to work with multiple devices from different vendors. Most likely you will have a single OpenCL device, but in some cases you need to select the OpenCL device that Octopus should use.

In OpenCL a 'platform' identifies an OpenCL implementation; you can have multiple platforms installed on a system. Octopus will list the available platforms at the beginning of the execution, for example:

Info: Available CL platforms: 2
    * Platform 0 : NVIDIA CUDA (OpenCL 1.1 CUDA 4.2.1)
      Platform 1 : AMD Accelerated Parallel Processing (OpenCL 1.2 AMD-APP (1084.4))

By default Octopus will use the first platform. You can select a different platform with the OpenCLPlatform variable. You can select the platform by name, for example:

 OpenCLPlatform = amd

or you can use numeric values that select the platform by its index in the list:

 OpenCLPlatform = 1

The first form should be preferred as it is independent of the order of the platforms.

Each OpenCL platform can contain several devices; Octopus prints the devices available on the selected platform, like this:

Info: Available CL devices: 2
      Device 0 : Tahiti
      Device 1 : Intel(R) Core(TM) i7-3820 CPU @ 3.60GHz

Each OpenCL device has a type, which can be 'cpu', 'gpu', or 'accelerator'. By default Octopus tries to use the first device of type 'gpu'; if this fails, it tries to use the system's default device, as defined by the OpenCL implementation. To select a different device, use the OpenCLDevice variable, which accepts either the device index or the device type ('cpu', 'gpu', or 'accelerator').
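Using the device list printed above as an example, either of the following lines should select the Tahiti GPU: the first by type (assuming, as for the default behaviour, that the first device of that type is taken) and the second by its index in the list:

 OpenCLDevice = gpu

 OpenCLDevice = 0

As with OpenCLPlatform, selecting by type does not depend on the order in which the devices are listed.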

Note: the Octopus OpenCL implementation is currently designed to execute efficiently on GPUs. Octopus will run on CPUs using OpenCL, but it will be slower than using the CPU directly.

After Octopus has selected an OpenCL device it will print some information about it, for example:

Selected CL device:
     Device vendor          : Advanced Micro Devices, Inc.
     Device name            : Tahiti
     Driver version         : 1084.4 (VM)
     Compute units          : 32
     Clock frequency        : 925 GHz
     Device memory          : 2048 MiB
     Max alloc size         : 512 MiB
     Device cache           : 16 KiB
     Local memory           : 32 KiB
     Constant memory        : 64 KiB
     Max. workgroup size    : 256
     Extension cl_khr_fp64  : yes
     Extension cl_amd_fp64  : yes


The block size

To obtain good performance on the GPU (and CPUs), Octopus uses blocks of states as the basic data object. The size of these blocks is controlled by the StatesBlockSize variable. For GPUs the value must be a power of 2 (the default is 32), and for optimal performance it should be as large as possible. The catch is that for large values you might run out of GPU memory.
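As an illustration, the following line doubles the block size from its default of 32. The value 64 is only an example; if you run out of GPU memory, go back to a smaller power of 2 such as 32 or 16.

 StatesBlockSize = 64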

StatesPack

The Poisson solver

If you want to use a GPU-accelerated Poisson solver, you need to use the 'fft' solver and have Octopus compiled with support for the 'clamdfft' library. To select the solver, add

 PoissonSolver = fft
 FFTLibrary = clfft

to your input file.


