# Developers Manual:LOBPCG

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

# Description of the algorithm

The following is a basic description of the locally optimal block preconditioned conjugate gradient algorithm as implemented in the file eigen_lobpcg_inc.F90. Locking of already converged eigenvectors is omitted.

For more details, see [1].

Given a preconditioner and initial real or complex vectors , we form the column matrix

and solve the eigenvalue problem

for the lowest eigenpairs by the following iterative procedure:

1. Make initial vectors orthonormal:

(The denotes for real and for complex .)

Calculate the upper triangular matrix of the Cholesky factorization.

2. Get initial Ritz-vectors and -values:

Solve the following eigenvalue problem for all eigenvalues and -vectors:

Save the initial eigenvalues in a vector:

Multiply the initial vectors by the column matrix of the Ritz-vectors

and set the conjugate directions to zero:

3. for do

Calculate residuals:

4. Terminate convergence loop if all residuals are below a certain threshold.

6. Apply the preconditioner to the residuals:

6. Make residuals orthogonal to eigenvectors:

7. Orthonormalize residuals:

8. if then

Orthonormalize conjugate directions:

end if

9. Rayleigh-Ritz procedure on subspace .

Set up symmetric Gram-matrices:

if then

else

end if

10. Calculate the lowest eigenpairs of the (for ) or (for ) generalized eigenvalue problem

11. Partition the column-matrix of the results into two or three block matrices of size:

, for , and

, for

12. Calculate new eigenvectors and, if , new conjugate directions:

if then

else

end if

done

## Notes

Despite solving only a standard eigenvalue problem , it does not work to use the usual Rayleigh-Ritz procedure, i. e. solving , because this is unstable. The reason for this is that not all the vectors in the subspace are orthogonal, i. e. the matrix is not identity. One could do an orthogonalization step before doing the Rayleigh-Ritz procedure but this is more computationally expensive (Thanks for Andrew Knyazev, the inventor of the algorithm, for explaining this to me).

# Parallel implementation

The LOBPCG implementation in Octopus is parallelized in states and in domains. In other word, the column matrix of states

is distributed on the nodes as

,

where we have nodes and the are blocks like

.

So, we see the real-space points are distributed in the direction (via the usual domain decomposition) and the states are distributed in the direction (by the states_distribute_nodes routine).

In LOBPCG we have to operations to perform two operations on these matrices:

• Multiplication of two big matrices with , distributed as explained above, where the small result matrix is required on all nodes. We need this to calculate the subspaces for example.
• Multiplication of a distributed matrix by a small matrix from the right. The resulting (big) matrix in turn has to be distributed on all nodes like the input matrix . This operation, for example, performs the calculation of the new conjugate directions.

We now describe the communication and computation pattern used for the parallel implementation of the two operations.

## by Xstates_blockt_mul

Consider the matrix product

.

Each node can at first calculate the product of its two local blocks and write the result to the corresponding block. Then the blocks of are rotated to the left, i. e. the distribution now looks like this:

Now, each node again does the product of its two local blocks but this time writes the result right next to the previous result. If there is no right block, we circle to the first block. This is scheme is iterated times until all blocks have entries.

The intermediate local results on the nodes now consist of one summand for each block, for example for node hosting the , block:

To obtain the full result, we simply do an MPI_Allreduce to sum up all contributions.

The algorithm used here is basically a variation of Cannon's algorithm. In the original algorithm the blocks of the first matrix are rotated upwards. This scheme leads to a distribution of the result and would require an additional MPI_Alltoall because we need the result matrix on all nodes. By omitting the downward rotation and replacing the MPI_Alltoall by an MPI_Allreduce we obtain a faster implementation.

To have everything explicit, we write down the algorithm. Let be the result matrix and , the blocks currently living on node with coordinates , with and :



for  do
rotate  columnwise to the left

done
perform an allreduce on  with


To complicate things, we observe that we can overlap the rotating of with the calculation of . That is, while calculating the product, the new block for is being received and the old one sent away. This leads to the following algorithmic formulation with send and receive buffers , respectively:



for  do

if  then
wait for message transmission to complete

end if
if  then
asynchronoulsy receive block from right neighbour in  and
asynchronously send  to left neighbour
end if

done
perform an allreduce  with


This is how the operation is implemented in the code using MPI_Isend and MPI_Irecv and the cartesian topology facility of MPI to obtain the rank number of the neighbour.

One subtle issue is that the blocks need not be of the same size but this is handled by providing a communication buffers the largest block fits in.

## by Xstates_block_matr_mul_add

There remains not much to say about this routine. It uses the same communication pattern as the Xstates_blockt_mul operation except that it rotates the blocks of the matrix since is known on all nodes. As the result is distributed, no need for an allreduce steps arises.

# References

A. V. Knyazev, Toward the Optimal Preconditioned Eigensolver: Locally Optimal Block Preconditioned Conjugate Gradient Method, SIAM Journal on Scientific Computing 23 517-541 (2001)

L. E. Cannon, A cellular computer to implement the Kalman filter algorithm, PhD thesis (1969)