Description of the algorithm
The following is a basic description of the locally optimal block preconditioned conjugate gradient algorithm as implemented in the file eigen_lobpcg_inc.F90. Locking of already converged eigenvectors is omitted.
For more details, see the paper by Knyazev cited in the references below.
Given a preconditioner $T$ and initial real or complex vectors $x_1, \dots, x_m$, we form the column matrix
$X = (x_1, \dots, x_m)$
and solve the eigenvalue problem
$H x_j = \lambda_j x_j$
for the lowest $m$ eigenpairs by the following iterative procedure:
1. Make initial vectors orthonormal: $X \leftarrow X U^{-1}$, where $X^H X = U^H U$.
(The $^H$ denotes $^T$ for real $X$ and $^\dagger$ for complex $X$.)
Calculate the upper triangular matrix $U$ of this Cholesky factorization.
2. Get initial Ritz-vectors and -values:
Solve the following eigenvalue problem for all eigenvalues and -vectors:
$(X^H H X)\, c_j = \lambda_j c_j$
Save the initial eigenvalues in a vector: $\Lambda \leftarrow (\lambda_1, \dots, \lambda_m)$.
Multiply the initial vectors by the column matrix $C = (c_1, \dots, c_m)$ of the Ritz-vectors, $X \leftarrow X C$,
and set the conjugate directions to zero: $P \leftarrow 0$.
3. for $i = 1, 2, \dots$ do
4. Calculate the residuals $R \leftarrow H X - X \operatorname{diag}(\Lambda)$ and terminate the convergence loop if all residual norms are below a certain threshold.
5. Apply the preconditioner to the residuals: $R \leftarrow T R$
6. Make residuals orthogonal to eigenvectors: $R \leftarrow R - X (X^H R)$
7. Orthonormalize residuals: $R \leftarrow R U^{-1}$ with the Cholesky factorization $R^H R = U^H U$.
8. if $i > 1$ then
Orthonormalize conjugate directions: $P \leftarrow P U^{-1}$ with $P^H P = U^H U$.
9. Rayleigh-Ritz procedure on the subspace $\operatorname{span}\{X, R, P\}$ ($\operatorname{span}\{X, R\}$ for $i = 1$).
Set up symmetric Gram-matrices:
$A = (X\,R\,P)^H H (X\,R\,P), \qquad B = (X\,R\,P)^H (X\,R\,P)$
10. Calculate the lowest $m$ eigenpairs of the $2m \times 2m$ (for $i = 1$) or $3m \times 3m$ (for $i > 1$) generalized eigenvalue problem
$A c = \lambda B c$
11. Partition the column-matrix $C = (c_1, \dots, c_m)$ of the results into two or three block matrices of size $m \times m$:
$C^X$, $C^R$ for $i = 1$, and additionally $C^P$ for $i > 1$.
12. Calculate new eigenvectors and, if $i > 1$, new conjugate directions:
$P \leftarrow R\,C^R + P\,C^P$ (for $i = 1$: $P \leftarrow R\,C^R$), $\quad X \leftarrow X\,C^X + P$.
Despite solving only a standard eigenvalue problem $H x = \lambda x$, it does not work to use the usual Rayleigh-Ritz procedure, i.e. solving $A c = \lambda c$, because this is unstable. The reason for this is that not all the vectors in the subspace are orthogonal to each other, i.e. the matrix $B$ is not the identity. One could do an orthogonalization step before the Rayleigh-Ritz procedure, but this is computationally more expensive. (Thanks to Andrew Knyazev, the inventor of the algorithm, for explaining this to me.)
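The iteration described above can be condensed into a short serial sketch. The following is a minimal NumPy mock-up, not the actual Fortran implementation in eigen_lobpcg_inc.F90; the names `lobpcg`, `ortho`, and `eigh_gen` are illustrative, locking is omitted as in the description, and the generalized eigenvalue problem is reduced to a standard one via a Cholesky factorization of $B$:

```python
import numpy as np

def ortho(X):
    # Cholesky-based orthonormalization: X <- X U^{-1} with X^H X = U^H U.
    U = np.linalg.cholesky(X.conj().T @ X).conj().T
    return X @ np.linalg.inv(U)

def eigh_gen(A, B):
    # Eigenpairs of A c = lambda B c, reduced to a standard problem
    # through a Cholesky factorization B = L L^H (ascending eigenvalues).
    L = np.linalg.cholesky(B)
    Li = np.linalg.inv(L)
    w, V = np.linalg.eigh(Li @ A @ Li.conj().T)
    return w, Li.conj().T @ V

def lobpcg(H, X, T=None, maxiter=100, tol=1e-5):
    # Minimal LOBPCG without locking; T is the preconditioner (None = identity).
    n, m = X.shape
    X = ortho(X)
    lam, C = np.linalg.eigh(X.conj().T @ H @ X)   # initial Rayleigh-Ritz
    X = X @ C
    P = np.zeros_like(X)
    for i in range(maxiter):
        R = H @ X - X * lam                        # column-wise residuals
        if np.all(np.linalg.norm(R, axis=0) < tol):
            break
        if T is not None:
            R = T @ R                              # apply preconditioner
        R = ortho(R - X @ (X.conj().T @ R))        # project out X, orthonormalize
        S = np.hstack([X, R, ortho(P)]) if i > 0 else np.hstack([X, R])
        w, C = eigh_gen(S.conj().T @ H @ S, S.conj().T @ S)  # Gram matrices
        lam, C = w[:m], C[:, :m]                   # lowest m eigenpairs
        P = S[:, m:] @ C[m:, :]                    # new conjugate directions
        X = S[:, :m] @ C[:m, :] + P                # new eigenvector estimates
    return lam, X

# Small self-test against a dense diagonalization.
rng = np.random.default_rng(1)
n, m = 40, 2
Q = rng.standard_normal((n, n))
H = np.diag(np.arange(1.0, n + 1)) + 0.01 * (Q + Q.T)
lam, X = lobpcg(H, rng.standard_normal((n, m)))
exact = np.linalg.eigh(H)[0][:m]
```

For a well-conditioned test matrix this converges in a few dozen iterations; note how `eigh_gen` works with the non-identity matrix $B$ instead of orthogonalizing the whole subspace, mirroring the remark above.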
The LOBPCG implementation in Octopus is parallelized in states and in domains. In other words, the column matrix of states
$X = (x_1, \dots, x_m)$
is distributed on the nodes as
$X = \begin{pmatrix} X_{1,1} & \cdots & X_{1,p_c} \\ \vdots & & \vdots \\ X_{p_r,1} & \cdots & X_{p_r,p_c} \end{pmatrix}$
where we have $p_r \cdot p_c$ nodes and the $X_{r,c}$ are blocks containing the local real-space points of the local states. So, we see the real-space points are distributed in the row direction (via the usual domain decomposition) and the states are distributed in the column direction (by the states parallelization).
In LOBPCG we have to perform two operations on these matrices:
- Multiplication $X^H Y$ of two big matrices $X$ and $Y$, distributed as explained above, where the small $m \times m$ result matrix is required on all nodes. We need this to calculate the subspace Gram matrices, for example.
- Multiplication $X A$ of a distributed matrix $X$ by a small $m \times m$ matrix $A$ from the right. The resulting (big) matrix in turn has to be distributed over the nodes like the input matrix $X$. This operation, for example, performs the calculation of the new conjugate directions.
We now describe the communication and computation pattern used for the parallel implementation of the two operations.
Consider the matrix product
$Z = X^H Y.$
Each node can at first calculate the product $X_{r,c}^H Y_{r,c}$ of its two local blocks and write the result to the corresponding block of $Z$. Then the blocks of $Y$ are rotated columnwise to the left, i.e. node $(r, c)$ now holds $Y_{r,c+1}$ (with the column index taken cyclically).
Now, each node again does the product of its two local blocks but this time writes the result right next to the previous result block. If there is no block to the right, we circle around to the first block. This scheme is iterated $p_c$ times until all blocks of $Z$ have entries.
The intermediate local results on the nodes now consist of one summand for each block; for example, the node hosting the $X_{1,1}$ block holds the summands $X_{1,1}^H Y_{1,c}$ for $c = 1, \dots, p_c$.
To obtain the full result, we simply do an MPI_Allreduce to sum up all contributions.
The algorithm used here is basically a variation of Cannon's algorithm. In the original algorithm, the blocks of the first matrix are additionally rotated upwards. This scheme leads to a distribution of the result over the nodes and would require an additional MPI_Alltoall because we need the result matrix on all nodes. By omitting the upward rotation and replacing the MPI_Alltoall by an MPI_Allreduce, we obtain a faster implementation.
To have everything explicit, we write down the algorithm. Let $Z$ be the result matrix and $X_{r,c}$, $Y_{r,c}$ the blocks currently living on the node with coordinates $(r, c)$, with $1 \le r \le p_r$ and $1 \le c \le p_c$:

for $s = 1, \dots, p_c$ do
    add the local block product $X_{r,c}^H Y_{r,c}$ to the local copy of $Z$
    rotate the $Y_{r,c}$ columnwise to the left
done
perform an allreduce on $Z$ with MPI_SUM
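As a sanity check of this scheme, here is a serial Python mock-up (no actual MPI: the "nodes" are loop indices and the concluding allreduce becomes a plain sum; the name `blockt_mul_sim` is illustrative, not the routine name in the code):

```python
import numpy as np

def blockt_mul_sim(Xb, Yb):
    # Serial simulation of the distributed Z = X^H Y product.  Xb[r][c] and
    # Yb[r][c] are the blocks held by node (r, c).  The Y blocks are rotated
    # columnwise to the left once per step; each node accumulates one summand
    # per result block, and the final "allreduce" is a sum over all nodes.
    rows, cols = len(Xb), len(Xb[0])
    widths = [b.shape[1] for b in Xb[0]]
    off = np.cumsum([0] + widths)                 # column offsets of the blocks
    m = off[-1]
    local = [[np.zeros((m, m)) for _ in range(cols)] for _ in range(rows)]
    Y = [row[:] for row in Yb]                    # working copy to rotate
    for s in range(cols):
        for r in range(rows):
            for c in range(cols):
                k = (c + s) % cols                # origin column of rotated Y block
                local[r][c][off[c]:off[c+1], off[k]:off[k+1]] += \
                    Xb[r][c].conj().T @ Y[r][c]
        for r in range(rows):                     # rotate the Y blocks to the left
            Y[r] = Y[r][1:] + Y[r][:1]
    return sum(local[r][c] for r in range(rows) for c in range(cols))

# Check against the direct product on a small 2 x 2 node grid
# with unequal block sizes.
rng = np.random.default_rng(2)
X = rng.standard_normal((10, 5))
Y = rng.standard_normal((10, 5))
rs, cs = [slice(0, 4), slice(4, 10)], [slice(0, 2), slice(2, 5)]
Xb = [[X[r, c] for c in cs] for r in rs]
Yb = [[Y[r, c] for c in cs] for r in rs]
Z = blockt_mul_sim(Xb, Yb)
```

Note that the blocks are deliberately chosen with different sizes here, matching the remark below that the blocks need not all be equally large.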
To complicate things, we observe that we can overlap the rotation of $Y$ with the calculation of the block products. That is, while calculating the product, the new block for $Y_{r,c}$ is being received and the old one sent away. This leads to the following algorithmic formulation with send and receive buffers $B_\mathrm{s}$ and $B_\mathrm{r}$, respectively:
for $s = 1, \dots, p_c$ do
    if $s > 1$ then
        wait for the message transmission to complete
    end if
    if $s < p_c$ then
        asynchronously receive the next block from the right neighbour in $B_\mathrm{r}$ and asynchronously send the current block in $B_\mathrm{s}$ to the left neighbour
    end if
    add the local block product to the local copy of $Z$
done
perform an allreduce on $Z$ with MPI_SUM
This is how the operation is implemented in the code, using MPI_Irecv and the Cartesian topology facility of MPI to obtain the rank number of the neighbour.
One subtle issue is that the blocks need not all be of the same size, but this is handled by providing communication buffers large enough to hold the largest block.
There remains not much to say about this routine. It uses the same communication pattern as the Xstates_blockt_mul operation, except that it rotates the blocks of the distributed matrix $X$, since the small matrix $A$ is known on all nodes. As the result is distributed, no need for an allreduce step arises.
A. V. Knyazev, Toward the Optimal Preconditioned Eigensolver: Locally Optimal Block Preconditioned Conjugate Gradient Method, SIAM Journal on Scientific Computing 23, 517-541 (2001).
L. E. Cannon, A Cellular Computer to Implement the Kalman Filter Algorithm, PhD thesis, Montana State University (1969).