# Developers Manual: LOBPCG

# Description of the algorithm

The following is a basic description of the locally optimal block preconditioned conjugate gradient algorithm as implemented in the file **eigen_lobpcg_inc.F90**. Locking of already converged eigenvectors is omitted.

For more details, see [1].

Given a preconditioner and initial real or complex vectors, we form the column matrix

and solve the eigenvalue problem

for the lowest eigenpairs by the following iterative procedure:
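In one common notation (the symbols $A$ for the operator, $x_1, \dots, x_m$ for the $m$ initial vectors, $\lambda_i$ for the eigenvalues, and $X$ for the column matrix are assumed here for illustration), the setup reads:

$$
X = \begin{pmatrix} x_1 & x_2 & \cdots & x_m \end{pmatrix} ,
\qquad
A\, x_i = \lambda_i\, x_i , \quad i = 1, \dots, m .
$$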

1. Make initial vectors orthonormal:

(The adjoint used here means simple transposition for real vectors and Hermitian conjugation for complex vectors.)

Calculate the upper triangular matrix of the Cholesky factorization.

2. Get initial Ritz-vectors and -values:

Solve the following eigenvalue problem for all eigenvalues and -vectors:

Save the initial eigenvalues in a vector:

Multiply the initial vectors by the column matrix of the Ritz-vectors

and set the conjugate directions to zero:

3. **for** **do**

Calculate residuals:

4. Terminate the convergence loop if all residual norms are below a given threshold.

5. Apply the preconditioner to the residuals:

6. Make residuals orthogonal to eigenvectors:

7. Orthonormalize residuals:

8. **if** **then**

Orthonormalize conjugate directions:

**end if**

9. Rayleigh-Ritz procedure on the combined subspace.

Set up symmetric Gram-matrices:

**if** **then**

**else**

**end if**

10. Calculate the lowest eigenpairs of the generalized eigenvalue problem set up from these Gram matrices; it is of twice the block size in the first iteration and of three times the block size in all later iterations.

11. Partition the column matrix of the resulting eigenvectors into two block matrices (in the first iteration) or three block matrices (in all later iterations).

12. Calculate new eigenvectors and, after the first iteration, new conjugate directions:

**if** **then**

**else**

**end if**

**done**
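The loop above can be condensed into a short reference sketch. The following Python/NumPy code is only an illustration of the control flow in a generic notation (`A` for the operator, `T` for the preconditioner, blocks `X`, `W`, `P`); it is not the Fortran implementation in **eigen_lobpcg_inc.F90** and, like the description above, it omits locking.

```python
import numpy as np
from scipy.linalg import cholesky, eigh


def chol_orthonormalize(V):
    """Return V R^{-1}, with R the upper Cholesky factor of V^H V."""
    R = cholesky(V.conj().T @ V, lower=False)
    return np.linalg.solve(R.conj().T, V.conj().T).conj().T


def lobpcg_sketch(A, T, X0, maxiter=100, tol=1e-8):
    """Sketch of the steps above; A is the Hermitian operator (dense here),
    T the preconditioner, X0 the (n, m) block of initial vectors."""
    n, m = X0.shape

    # Step 1: make the initial vectors orthonormal (Cholesky factorization).
    X = chol_orthonormalize(X0)

    # Step 2: initial Ritz vectors and values in span(X).
    lam, C = eigh(X.conj().T @ (A @ X))
    X = X @ C
    P = np.zeros_like(X)                      # conjugate directions start at zero

    for it in range(maxiter):                 # step 3: convergence loop
        R = A @ X - X * lam                   # residuals

        # Step 4: terminate if all residual norms are below the threshold.
        if np.all(np.linalg.norm(R, axis=0) < tol):
            break

        W = T @ R                             # step 5: precondition the residuals
        W = W - X @ (X.conj().T @ W)          # step 6: orthogonalize against X
        W = chol_orthonormalize(W)            # step 7: orthonormalize the residuals
        if it > 0:
            P = chol_orthonormalize(P)        # step 8: orthonormalize P

        # Steps 9/10: Rayleigh-Ritz on span([X, W]) or span([X, W, P]) via the
        # generalized problem (S^H A S) y = theta (S^H S) y.
        S = np.hstack([X, W] if it == 0 else [X, W, P])
        theta, Y = eigh(S.conj().T @ (A @ S), S.conj().T @ S)

        # Steps 11/12: partition the lowest m Ritz vectors and update X and P.
        lam = theta[:m]
        CX, CW, CP = Y[:m, :m], Y[m:2 * m, :m], Y[2 * m:, :m]
        P = W @ CW if it == 0 else W @ CW + P @ CP
        X = X @ CX + P
    return lam, X
```

With a dense symmetric `A` and `T` set to the identity this converges to the lowest eigenpairs of `A`, which is enough to follow the control flow; the actual routine applies the Hamiltonian and the preconditioner as operators on the real-space grid rather than as dense matrices.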

## Notes

Despite solving only a standard eigenvalue problem, it does not work to use the usual Rayleigh-Ritz procedure, i.e. to solve a standard eigenvalue problem in the subspace, because this is unstable. The reason is that not all the vectors in the subspace are orthogonal, i.e. the overlap (Gram) matrix of the subspace is **not** the identity. One could perform an orthogonalization step before the Rayleigh-Ritz procedure, but this is computationally more expensive. (Thanks to Andrew Knyazev, the inventor of the algorithm, for explaining this to me.)
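Writing $S$ for the column matrix spanning the trial subspace and $A$ for the operator (both symbols assumed here for illustration), the stable choice is the generalized problem

$$
S^\dagger A S \, y = \theta \, S^\dagger S \, y
$$

rather than the standard problem $S^\dagger A S \, y = \theta \, y$, which would only be equivalent if $S^\dagger S$ were the identity.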

# Parallel implementation

The LOBPCG implementation in Octopus is parallelized in states and in domains. In other words, the column matrix of states is distributed over the nodes in rectangular blocks: the real-space points are distributed along the row direction (via the usual domain decomposition) and the states are distributed along the column direction (by the `states_distribute_nodes` routine).
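As an illustration of this layout (the block names $X_{ij}$ and the grid dimensions, $p_d$ domain groups times $p_s$ state groups, are assumed names, not taken from the code):

$$
X =
\begin{pmatrix}
X_{11} & X_{12} & \cdots & X_{1 p_s} \\
X_{21} & X_{22} & \cdots & X_{2 p_s} \\
\vdots &        & \ddots & \vdots \\
X_{p_d 1} & X_{p_d 2} & \cdots & X_{p_d p_s}
\end{pmatrix} ,
$$

where block $X_{ij}$ holds the grid points of domain group $i$ for the states of state group $j$ and lives on the corresponding node.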

In LOBPCG we have to perform two operations on these matrices (a serial sketch of both follows the list):

- Multiplication of two big matrices, distributed as explained above, where the small result matrix is required on all nodes. We need this, for example, to set up the subspace Gram matrices.
- Multiplication of a distributed matrix by a small matrix from the right. The resulting (big) matrix in turn has to be distributed over the nodes like the input matrix. This operation is used, for example, to calculate the new conjugate directions.
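As a purely serial reference for what the two operations compute (ignoring the distribution), here is a minimal Python sketch; the function names are placeholders, not the Octopus interfaces:

```python
import numpy as np

def blockt_mul_reference(X, Y):
    """Small matrix X^H Y; in the parallel code it is needed on all nodes."""
    return X.conj().T @ Y

def block_matr_mul_add_reference(X, C, R=None):
    """Big matrix X C, optionally added to R (the `_add` in the routine name
    suggests this, but it is an assumption here); in the parallel code the
    result stays distributed over the nodes in the same way as X."""
    return X @ C if R is None else R + X @ C

# Tiny usage example with random data.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 8))        # grid points x states
Y = rng.standard_normal((1000, 8))
C = rng.standard_normal((8, 8))           # small matrix, known on every node
G = blockt_mul_reference(X, Y)            # 8 x 8, needed everywhere
Z = block_matr_mul_add_reference(X, C)    # 1000 x 8, stays distributed
```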

We now describe the communication and computation pattern used for the parallel implementation of the two operations.

## `Xstates_blockt_mul`

Consider the matrix product of two block-distributed matrices of states, whose result is the small matrix needed on all nodes.

Each node can at first calculate the product of its two local blocks and write the result to the corresponding block of the small matrix. Then the blocks of the second factor are rotated to the left, i.e. each node sends its block to its left neighbour and receives a new one from its right neighbour.

Now, each node again computes the product of its two local blocks, but this time writes the result into the block right next to the previous one; if there is no block to the right, we wrap around to the first block column. This scheme is iterated until all blocks of the small matrix have received an entry.

The intermediate local result on each node now consists of exactly one summand for each block it has written.

To obtain the full result, we simply do an `MPI_Allreduce` to sum up all contributions.

The algorithm used here is basically a variation of Cannon's algorithm [2]. In the original algorithm, the blocks of the first matrix are additionally rotated upwards. That scheme leads to a result which is itself distributed over the nodes and would therefore require an additional `MPI_Alltoall`, because we need the result matrix on all nodes. By omitting this second rotation and replacing the `MPI_Alltoall` by an `MPI_Allreduce`, we obtain a faster implementation.

To have everything explicit, we write down the algorithm in terms of the local copy of the result matrix and the two blocks currently living on the node with the given grid coordinates:

- **for** each of the rotation steps **do**
  - multiply the two local blocks and store the product in the corresponding block of the local result matrix
  - rotate the blocks columnwise to the left
- **done**
- perform an allreduce on the local result matrices
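A serial simulation of this scheme (with assumed block and grid names; node $(i, j)$ holds the blocks with domain group $i$ and state group $j$) shows that the local partial results indeed sum to the full product:

```python
import numpy as np

# Assumed sizes for illustration: p_d domain groups, p_s state groups,
# each block has rows_per_node grid points and m states.
rng = np.random.default_rng(1)
p_d, p_s, rows_per_node, m = 3, 4, 50, 8
n = p_d * rows_per_node
X = rng.standard_normal((n, p_s * m))
Y = rng.standard_normal((n, p_s * m))

def block(M, i, j):
    """Block of M held by node (i, j)."""
    return M[i * rows_per_node:(i + 1) * rows_per_node, j * m:(j + 1) * m]

# Each "node" accumulates its own partial copy of the small result matrix.
partial = [[np.zeros((p_s * m, p_s * m)) for _ in range(p_s)] for _ in range(p_d)]
for step in range(p_s):
    for i in range(p_d):
        for j in range(p_s):
            k = (j + step) % p_s      # state block of Y currently held by node (i, j)
            partial[i][j][j * m:(j + 1) * m, k * m:(k + 1) * m] = \
                block(X, i, j).T @ block(Y, i, k)

# The MPI_Allreduce corresponds to summing all the partial copies.
W = sum(partial[i][j] for i in range(p_d) for j in range(p_s))
assert np.allclose(W, X.T @ Y)
```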

To complicate things, we observe that we can overlap the rotation of the blocks with the calculation of the local products. That is, while the product is being calculated, the new block is already being received and the old one sent away. This leads to the following algorithmic formulation with a send buffer and a receive buffer:

- **for** each of the rotation steps **do**
  - **if** this is not the first step **then** wait for the previous message transmission to complete **end if**
  - **if** this is not the last step **then** asynchronously receive the next block from the right neighbour into the receive buffer and asynchronously send the current block from the send buffer to the left neighbour **end if**
  - multiply the two local blocks and store the product in the corresponding block of the local result matrix
- **done**
- perform an allreduce on the local result matrices

This is how the operation is implemented in the code, using `MPI_Isend` and `MPI_Irecv` and the Cartesian topology facility of MPI to obtain the rank number of the neighbour.
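For illustration, here is a hedged mpi4py sketch of the same pattern; the grid shape, block sizes, tags and variable names are assumptions made here, not the interfaces of the Fortran routine, and equal block sizes are assumed, unlike in the real code:

```python
import numpy as np
from mpi4py import MPI

# Sketch of the overlapped rotate-multiply-allreduce pattern.
# Run with a rank count divisible by the assumed two domain groups,
# e.g.: mpiexec -n 4 python sketch.py
comm = MPI.COMM_WORLD
p_d = 2                                      # assumed domain groups
p_s = comm.Get_size() // p_d                 # assumed state groups
cart = comm.Create_cart(dims=[p_d, p_s], periods=[True, True])
i, j = cart.Get_coords(cart.Get_rank())      # i = domain group, j = state group
src, dst = cart.Shift(1, -1)                 # receive from the right, send to the left

rows, m = 100, 8                             # assumed local block dimensions
rng = np.random.default_rng(100 * i + j)
X_loc = rng.standard_normal((rows, m))       # local block of the first matrix
Y_loc = rng.standard_normal((rows, m))       # local block of the second matrix (rotates)
recv_buf = np.empty_like(Y_loc)
W = np.zeros((p_s * m, p_s * m))             # local copy of the small result matrix

for step in range(p_s):
    reqs = []
    if step < p_s - 1:
        # Start rotating the block to the left while we compute with it.
        reqs = [cart.Isend(Y_loc, dest=dst, tag=step),
                cart.Irecv(recv_buf, source=src, tag=step)]
    k = (j + step) % p_s                     # state block currently held locally
    W[j * m:(j + 1) * m, k * m:(k + 1) * m] = X_loc.T @ Y_loc
    if reqs:
        MPI.Request.Waitall(reqs)            # wait for the transfer to complete
        Y_loc, recv_buf = recv_buf, Y_loc    # swap the double buffers

# Summing the partial copies gives every node the full small matrix.
cart.Allreduce(MPI.IN_PLACE, W, op=MPI.SUM)
```

The two arrays `Y_loc` and `recv_buf` play the role of the send and receive buffers mentioned above; swapping them after each completed transfer avoids any copying.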

One subtle issue is that the blocks need not all be of the same size, but this is handled by providing communication buffers large enough to hold the largest block.

## `Xstates_block_matr_mul_add`

There remains not much to say about this routine. It uses the same communication pattern as the `Xstates_blockt_mul` operation, except that it rotates the blocks of the distributed matrix, since the small matrix is known on all nodes. As the result is distributed, no need for an allreduce step arises.
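A short serial check of this variant (all names and sizes are again assumed for illustration): rotating the blocks of the distributed matrix while the small matrix is known everywhere reproduces the product block by block, and the result is already distributed, so no allreduce is needed.

```python
import numpy as np

rng = np.random.default_rng(2)
p_d, p_s, rows, m = 2, 3, 40, 5               # assumed grid and block sizes
X = rng.standard_normal((p_d * rows, p_s * m))
C = rng.standard_normal((p_s * m, p_s * m))   # small matrix, known on every node
Z = np.zeros_like(X)

for step in range(p_s):
    for i in range(p_d):
        for j in range(p_s):
            k = (j + step) % p_s              # state block of X currently held by node (i, j)
            Z[i * rows:(i + 1) * rows, j * m:(j + 1) * m] += (
                X[i * rows:(i + 1) * rows, k * m:(k + 1) * m]
                @ C[k * m:(k + 1) * m, j * m:(j + 1) * m])

assert np.allclose(Z, X @ C)                  # each node already holds its block of the result
```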

# References

[1] A. V. Knyazev, *Toward the Optimal Preconditioned Eigensolver: Locally Optimal Block Preconditioned Conjugate Gradient Method*, SIAM Journal on Scientific Computing **23**, 517-541 (2001).

[2] L. E. Cannon, *A cellular computer to implement the Kalman filter algorithm*, PhD thesis (1969).