Developers Manual:LOBPCG


Description of the algorithm

The following is a basic description of the locally optimal block preconditioned conjugate gradient algorithm as implemented in the file eigen_lobpcg_inc.F90. Locking of already converged eigenvectors is omitted.

For more details, see [1].

Given a preconditioner T\,\! and N\,\! initial real or complex vectors \vec\varphi_i^{(1)}\,\!, i=1,\ldots,N\,\!, we form the column matrix


\Phi^{(1)} \leftarrow \left[\vec\varphi_1^{(1)} \cdots \vec\varphi_N^{(1)}\right]
\,\!

and solve the eigenvalue problem


H\vec\varphi_i = \varepsilon_i\vec\varphi_i
\,\!

for the N\,\! lowest eigenpairs (\varepsilon_i, \vec\varphi_i)\,\! by the following iterative procedure:

1. Make initial vectors orthonormal:


u \leftarrow \Phi^{(1)+}\Phi^{(1)}
\,\!

(Here A^+\,\! denotes A^T\,\! for real A\,\! and A^\dagger\,\! for complex A\,\!.)

Calculate the upper triangular matrix U\,\! of the Cholesky factorization.


U \leftarrow \mathrm{chol}\,u
\,\!


\Phi^{(1)} \leftarrow \Phi^{(1)} U^{-1}
\,\!
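As an illustration, here is a minimal NumPy sketch of this Cholesky-based orthonormalization (the same kernel reappears in steps 7 and 8 below). The function name and the dense-matrix setting are assumptions made for the example and are not taken from eigen_lobpcg_inc.F90.

import numpy as np
from scipy.linalg import solve_triangular

def orthonormalize(Phi):
    """Return Phi U^{-1}, where U is the upper triangular Cholesky factor
    of the overlap matrix u = Phi^+ Phi (steps 1, 7 and 8)."""
    u = Phi.conj().T @ Phi                     # u <- Phi^+ Phi
    U = np.linalg.cholesky(u).conj().T         # u = U^+ U with U upper triangular
    # Phi <- Phi U^{-1}, done as a triangular solve instead of an explicit inverse
    return solve_triangular(U.T, Phi.T, lower=True).T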

2. Get initial Ritz-vectors and -values:


G_H \leftarrow \Phi^{(1)+}H\Phi^{(1)}
\,\!

Solve the following N \times N\,\! eigenvalue problem for all N\,\! eigenvalues and eigenvectors:


G_H \vec v_i = \lambda_i \vec v_i
\,\!

Save the initial eigenvalues in a vector:


\vec\varepsilon^{(1)} \leftarrow \begin{bmatrix}
\lambda_1 \\
\vdots \\
\lambda_N
\end{bmatrix}
\,\!

Multiply the initial vectors by the column matrix of the Ritz-vectors


\Phi^{(1)} \leftarrow \Phi^{(1)} \left[\vec v_1 \cdots \vec v_N\right]
\,\!

and set the conjugate directions to zero:


P^{(1)} = \left[\vec p^{(1)}_1 \cdots \vec p^{(1)}_N\right] \leftarrow 0
\,\!
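A corresponding NumPy sketch of this initialization, reusing the orthonormalize helper from the sketch above; H is assumed to be available as a dense Hermitian matrix, which is of course not how the Hamiltonian is applied in Octopus.

import numpy as np

def initial_ritz(H, Phi):
    """Steps 1-2: orthonormalize Phi and rotate it into the Ritz basis of H
    restricted to span(Phi); returns the rotated Phi, the initial eigenvalue
    estimates and the zero conjugate directions."""
    Phi = orthonormalize(Phi)                  # step 1
    G_H = Phi.conj().T @ (H @ Phi)             # G_H <- Phi^+ H Phi
    eps, V = np.linalg.eigh(G_H)               # all N eigenpairs, ascending order
    Phi = Phi @ V                              # Phi <- Phi [v_1 ... v_N]
    P = np.zeros_like(Phi)                     # P^{(1)} <- 0
    return Phi, eps, P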

3. for k\leftarrow 1, \ldots, \mathit{max\_iter}\,\! do

Calculate residuals:


\vec w_i^{(k)} \leftarrow H\vec\varphi^{(k)}_i - \varepsilon^{(k)}_i\vec\varphi^{(k)}_i, \quad i = 1, \ldots, N
\,\!


W^{(k)} \leftarrow \left[\vec w^{(k)}_1 \cdots \vec w^{(k)}_N\right]
\,\!

4. Terminate convergence loop if all residuals are below a certain threshold.

5. Apply the preconditioner to the residuals:

W^{(k)} \leftarrow T W^{(k)} \,\!

6. Make residuals orthogonal to eigenvectors:


W^{(k)} \leftarrow W^{(k)} - \Phi^{(k)}\left(\Phi^{(k)+} W^{(k)}\right)
\,\!

7. Orthonormalize residuals:


u \leftarrow W^{(k)+}W^{(k)}
\,\!


U \leftarrow \mathrm{chol}\,u
\,\!


W^{(k)} \leftarrow W^{(k)} U^{-1}
\,\!
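The residual-related steps 3-7 of one iteration could look as follows in the same dense NumPy setting; the preconditioner T is assumed to be passed in as a callable acting on a block of vectors, and the convergence threshold is an arbitrary example value.

import numpy as np

def residual_step(H, Phi, eps, T, tol=1e-6):
    """Steps 3-7: residuals, convergence test, preconditioning, projection
    against the current eigenvectors and orthonormalization.  Returns None
    if all residual norms are below the threshold."""
    W = H @ Phi - Phi * eps                    # w_i = H phi_i - eps_i phi_i
    if np.all(np.linalg.norm(W, axis=0) < tol):
        return None                            # step 4: converged
    W = T(W)                                   # step 5: apply the preconditioner
    W = W - Phi @ (Phi.conj().T @ W)           # step 6: project out Phi
    return orthonormalize(W)                   # step 7: Cholesky orthonormalization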

8. if k>1\,\! then

Orthonormalize conjugate directions:


u \leftarrow P^{(k)+}P^{(k)}
\,\!


U \leftarrow \mathrm{chol}\,u
\,\!


P^{(k)} \leftarrow P^{(k)} U^{-1}
\,\!

end if

9. Rayleigh-Ritz procedure on the subspace \mathrm{span}\left\{\vec\varphi^{(k)}_1, \ldots, \vec\varphi^{(k)}_N, \vec w^{(k)}_1, \ldots, \vec w^{(k)}_N, \vec p^{(k)}_1, \ldots, \vec p^{(k)}_N\right\}\,\!.

Set up symmetric Gram-matrices:

if k=1\,\! then


G_H \leftarrow \begin{bmatrix}
\begin{matrix}
\varepsilon^{(k)}_1 & & 0 \\
& \ddots \\
0 & & \varepsilon^{(k)}_N
\end{matrix} & \left(H\Phi^{(k)}\right)^+W^{(k)} \\
\cdot & W^{(k)+}\left(HW^{(k)}\right)
\end{bmatrix}
\,\!


G_I \leftarrow \begin{bmatrix}
\quad I \quad & \Phi^{(k)+}W^{(k)} \\
\cdot & I
\end{bmatrix}\,\!

else


G_H \leftarrow \begin{bmatrix}
\begin{matrix}
\varepsilon^{(k)}_1 & & 0 \\
& \ddots \\
0 & & \varepsilon^{(k)}_N
\end{matrix} & \left(H\Phi^{(k)}\right)^+W^{(k)} & \left(H\Phi^{(k)}\right)^+P^{(k)} \\
\cdot & W^{(k)+}\left(HW^{(k)}\right) & \left(H W^{(k)}\right)^+ P^{(k)} \\
\cdot & \cdot & P^{(k)+}\left(H P^{(k)}\right)
\end{bmatrix}
\,\!


G_I \leftarrow \begin{bmatrix}
\quad I \quad & \Phi^{(k)+}W^{(k)} & \Phi^{(k)+}P^{(k)} \\
\cdot & I & W^{(k)+}P^{(k)} \\
\cdot & \cdot & I
\end{bmatrix}\,\!

end if
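Before moving on to step 10, here is how these Gram matrices could be assembled in the dense NumPy setting of the sketches above (the 2N x 2N case of the first iteration simply drops the P blocks); the helper name is again an assumption for the example.

import numpy as np

def gram_matrices(H, Phi, W, P, eps):
    """Step 9 for k > 1: assemble the 3N x 3N Gram matrices G_H and G_I
    over the subspace spanned by the columns of [Phi, W, P]."""
    HPhi, HW, HP = H @ Phi, H @ W, H @ P
    N = Phi.shape[1]
    I = np.eye(N)

    def block3(A11, A12, A13, A22, A23, A33):
        # the lower triangle (the "." entries in the text) follows by symmetry
        return np.block([[A11,          A12,          A13],
                         [A12.conj().T, A22,          A23],
                         [A13.conj().T, A23.conj().T, A33]])

    G_H = block3(np.diag(eps),
                 HPhi.conj().T @ W, HPhi.conj().T @ P,
                 W.conj().T @ HW,   HW.conj().T @ P,
                 P.conj().T @ HP)
    G_I = block3(I,
                 Phi.conj().T @ W,  Phi.conj().T @ P,
                 I,                 W.conj().T @ P,
                 I)
    return G_H, G_I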

10. Calculate the lowest N\,\! eigenpairs of the 2N \times 2N\,\! (for k=1\,\!) or 3N \times 3N\,\! (for k>1\,\!) generalized eigenvalue problem


G_H\vec v_i = \lambda_i G_I \vec v_i
\,\!

11. Partition the column matrix of the resulting eigenvectors into two or three block matrices of size N\times N\,\!:


\left[\vec v_1 \cdots \vec v_N\right] \leftarrow \begin{bmatrix}
V_\phi \\
V_w
\end{bmatrix}
\,\!, for k=1\,\!, and


\left[\vec v_1 \cdots \vec v_N\right] \leftarrow \begin{bmatrix}
V_\phi \\
V_w \\
V_p
\end{bmatrix}
\,\!, for k>1\,\!

12. Calculate new eigenvectors and, if k>1\,\!, new conjugate directions:

if k=1\,\! then

P^{(k+1)} = W^{(k)}V_w\,\!

else

\,\!P^{(k+1)} = P^{(k)} V_p + W^{(k)}V_w

end if

\Phi^{(k+1)} = \Phi^{(k)} V_\phi + P^{(k+1)}\,\!

done
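Steps 10-12 could then be sketched as follows, using scipy.linalg.eigh for the generalized eigenvalue problem; the variable names are again assumptions for the example, not identifiers from the Octopus source.

import numpy as np
from scipy.linalg import eigh

def update(Phi, W, P, G_H, G_I, first_iteration):
    """Steps 10-12: solve the generalized eigenvalue problem, partition the
    eigenvector matrix into N x N blocks and build the new Phi and P."""
    N = Phi.shape[1]
    lam, V = eigh(G_H, G_I)                    # ascending eigenvalues
    eps, V = lam[:N], V[:, :N]                 # keep the lowest N eigenpairs
    V_phi, V_w = V[:N, :], V[N:2 * N, :]
    if first_iteration:
        P_new = W @ V_w                        # no conjugate directions yet
    else:
        V_p = V[2 * N:, :]
        P_new = P @ V_p + W @ V_w
    Phi_new = Phi @ V_phi + P_new
    return Phi_new, eps, P_new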

Notes

Although only a standard eigenvalue problem H\vec\varphi_i = \varepsilon_i\vec\varphi_i\,\! is being solved, it does not work to use the ordinary Rayleigh-Ritz procedure, i.e. to solve G_H \vec v_i = \lambda_i \vec v_i\,\!, because it is numerically unstable: not all vectors in the subspace are orthogonal to each other, so the G_I\,\! matrix is not the identity. One could orthogonalize the subspace basis before the Rayleigh-Ritz procedure, but that is computationally more expensive. (Thanks to Andrew Knyazev, the inventor of the algorithm, for explaining this.)
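A quick way to see the role of G_I\,\! is the following purely illustrative NumPy snippet with random matrices: for a non-orthonormal subspace basis, the ordinary eigenvalues of G_H differ from the Ritz values obtained from the generalized problem.

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
H = rng.standard_normal((50, 50)); H = (H + H.T) / 2   # random symmetric "Hamiltonian"
S = rng.standard_normal((50, 6))                       # non-orthonormal subspace basis

G_H = S.T @ H @ S
G_I = S.T @ S                                          # overlap matrix, not the identity

ritz = eigh(G_H, G_I, eigvals_only=True)               # correct Ritz values on span(S)
naive = eigh(G_H, eigvals_only=True)                   # ignores the overlap: different values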

Parallel implementation

The LOBPCG implementation in Octopus is parallelized in states and in domains. In other words, the column matrix of states

\left[\vec\varphi_1 \cdots \vec\varphi_N\right]\,\!

is distributed on the nodes as

\begin{bmatrix}
\Phi_{11} & \cdots & \Phi_{1m} \\
\vdots & \ddots & \vdots \\
\Phi_{n1} & \cdots & \Phi_{nm}
\end{bmatrix}\!\,,

where we have P=nm\,\! nodes and the \Phi_{ij}\,\! are blocks like

\Phi_{ij} =
\begin{bmatrix}
\vec\varphi_{i,\mathit{st\_begin}(j)} \cdots \vec\varphi_{i,\mathit{st\_end}(j)}
\end{bmatrix}\,\!.

So we see that the real-space points are distributed in the n\,\! direction (via the usual domain decomposition) and the states in the m\,\! direction (by the states_distribute_nodes routine).

In LOBPCG we have to perform two operations on these matrices:

  • Multiplication of two big matrices, A^+ B\,\!, with A\,\! and B\,\! distributed as explained above, where the small result matrix is required on all nodes. We need this, for example, to set up the Gram matrices of the subspace.
  • Multiplication of a distributed matrix A\,\! by a small matrix C\,\! from the right. The resulting (big) matrix has to be distributed over the nodes in the same way as the input matrix A\,\!. This operation is used, for example, to calculate the new conjugate directions.

We now describe the communication and computation pattern used for the parallel implementation of the two operations.

Q\leftarrow A^+B\,\! by Xstates_blockt_mul

Consider the matrix product


\begin{bmatrix}
A_{11} & \cdots & A_{1n} \\
\vdots & \ddots & \vdots \\
A_{m1} & \cdots & A_{mn}
\end{bmatrix}\cdot\begin{bmatrix}
B_{11} & \cdots & B_{1m} \\
\vdots & \ddots & \vdots \\
B_{n1} & \cdots & B_{nm}
\end{bmatrix} =
\begin{bmatrix}
A_{11}B_{11} + \cdots + A_{1n}B_{n1} & \cdots & A_{11}B_{1m} + \cdots + A_{1n}B_{nm} \\
\vdots & \ddots & \vdots \\
A_{m1}B_{11} + \cdots + A_{mn}B_{n1} & \cdots & A_{m1}B_{1m} + \cdots + A_{mn}B_{nm}
\end{bmatrix}
\,\!.

Each node first calculates the product of its two local blocks and writes the result to the corresponding block of the result matrix. Then the blocks of B\,\! are rotated to the left, i.e. the distribution now looks like this:

\begin{bmatrix}
A_{11} & \cdots & A_{1n} \\
\vdots & \ddots & \vdots \\
A_{m1} & \cdots & A_{mn}
\end{bmatrix}\cdot\begin{bmatrix}
B_{12} & \cdots & B_{1m} & B_{11} \\
\vdots & \ddots & \vdots & \vdots \\
B_{n2} & \cdots & B_{nm} & B_{n1}
\end{bmatrix}\,\!

Now each node again multiplies its two local blocks, but this time writes the result into the block right next to the previous one. If there is no block to the right, we wrap around to the first column. This scheme is iterated m-1\,\! times until all blocks of the result have an entry.

The intermediate local result on each node now consists of one summand for each block of one row of the result, for example on the node hosting the k\,\!, l\,\! block:

\begin{bmatrix}
& & \mathbf{0} \\
A_{kl}B_{l1} & \cdots & A_{kl}B_{lk} & \cdots & A_{kl}B_{lm} \\
& & \mathbf{0}
\end{bmatrix}\,\!

To obtain the full result, we simply do an MPI_Allreduce to sum up all contributions.

The algorithm used here is basically a variation of Cannon's algorithm [2]. In the original algorithm the blocks of the first matrix are rotated as well (upwards). That scheme leaves the result distributed and would require an additional MPI_Alltoall, because we need the result matrix on all nodes. By omitting the rotation of the first matrix and replacing the MPI_Alltoall by an MPI_Allreduce we obtain a faster implementation.

To make everything explicit, we write down the algorithm. Let Q\,\! be the result matrix and A\,\!, B\,\! the blocks currently living on the node with coordinates p\,\!, q\,\!, where p=0,\ldots,m-1\,\! and q=0,\ldots,n-1\,\!:

Q \leftarrow 0\,\!
Q_{p+1,p+1} \leftarrow AB\,\!
for i \leftarrow 1, \ldots, m-1\,\! do
  rotate B\,\! columnwise to the left
  k \leftarrow (p+i) \mod m\,\!
  Q_{p+1,k+1} \leftarrow AB\,\!
done
perform an allreduce on Q\,\! with + \,\!
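A minimal mpi4py sketch of this rotate-and-allreduce pattern, assuming for simplicity a single domain group (n = 1), equally sized real-valued blocks and one state block per node; none of the names are taken from the Octopus source.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
m, p = comm.Get_size(), comm.Get_rank()     # m state groups, this node is p

np_local, nst = 512, 4                      # local grid points and states per block
rng = np.random.default_rng(p)
A = rng.standard_normal((np_local, nst))    # local block of A
B = rng.standard_normal((np_local, nst))    # local block of B

Q = np.zeros((m * nst, m * nst))            # small result matrix, replicated
left, right = (p - 1) % m, (p + 1) % m

for i in range(m):
    k = (p + i) % m                         # B currently holds the block of node k
    Q[p*nst:(p+1)*nst, k*nst:(k+1)*nst] = A.T @ B
    if i < m - 1:
        comm.Sendrecv_replace(B, dest=left, source=right)   # rotate B to the left

comm.Allreduce(MPI.IN_PLACE, Q, op=MPI.SUM) # sum the per-node contributions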

To complicate things a little, we observe that we can overlap the rotation of B\,\! with the calculation of AB\,\!: while the product is being calculated, the new block of B\,\! is already being received and the old one sent away. This leads to the following formulation with send and receive buffers S\,\! and R\,\!, respectively:

Q \leftarrow 0\,\!
S \leftarrow B\,\!
for i \leftarrow 0, \ldots, m-1\,\! do
  k \leftarrow (p+i) \mod m\,\!
  if i > 0\,\! then
    wait for message transmission to complete
    S \leftrightarrow R\,\!
  end if
  if i < m-1\,\! then
    asynchronously receive a block from the right neighbour into R\,\! and
    asynchronously send S\,\! to left neighbour
  end if
  Q_{p+1,k+1} \leftarrow AS\,\!
done
perform an allreduce on Q\,\! with + \,\!

This is how the operation is implemented in the code, using MPI_Isend and MPI_Irecv and the Cartesian topology facility of MPI to obtain the rank numbers of the neighbours.

One subtle issue is that the blocks need not all be of the same size; this is handled by providing communication buffers large enough to hold the largest block.
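The overlapped variant could be sketched with mpi4py roughly like this, again assuming a single domain group and equally sized blocks so that a single receive buffer suffices; it follows the pseudocode above rather than the actual Fortran implementation.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
m, p = comm.Get_size(), comm.Get_rank()
np_local, nst = 512, 4

rng = np.random.default_rng(p)
A = rng.standard_normal((np_local, nst))
B = rng.standard_normal((np_local, nst))

Q = np.zeros((m * nst, m * nst))
left, right = (p - 1) % m, (p + 1) % m

S, R = B.copy(), np.empty_like(B)           # send and receive buffers
reqs = []
for i in range(m):
    k = (p + i) % m
    if i > 0:
        MPI.Request.Waitall(reqs)           # wait for the previous rotation
        S, R = R, S                         # the received block becomes the current one
    if i < m - 1:                           # start the next rotation ...
        reqs = [comm.Irecv(R, source=right),
                comm.Isend(S, dest=left)]
    # ... and overlap it with the local block product
    Q[p*nst:(p+1)*nst, k*nst:(k+1)*nst] = A.T @ S

comm.Allreduce(MPI.IN_PLACE, Q, op=MPI.SUM)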

Q\leftarrow AC+Q\,\! by Xstates_block_matr_mul_add

Not much remains to be said about this routine. It uses the same communication pattern as the Xstates_blockt_mul operation, except that it rotates the blocks of the A\,\! matrix, since C\,\! is known on all nodes. As the result is distributed, no allreduce step is needed.
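A matching mpi4py sketch of this pattern, under the same simplifying assumptions as above (one domain group, equal block sizes, C replicated on all nodes):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
m, p = comm.Get_size(), comm.Get_rank()
np_local, nst = 512, 4

rng = np.random.default_rng(p)
A = rng.standard_normal((np_local, nst))                       # local state block of A
C = np.random.default_rng(0).standard_normal((m*nst, m*nst))   # small matrix, same on all nodes
Q = np.zeros((np_local, nst))                                  # local result block (the "+ Q" part)

buf = A.copy()
left, right = (p - 1) % m, (p + 1) % m
for i in range(m):
    l = (p + i) % m                                 # buf holds the A block of node l
    Q += buf @ C[l*nst:(l+1)*nst, p*nst:(p+1)*nst]  # accumulate A_l C_{l p}
    if i < m - 1:
        comm.Sendrecv_replace(buf, dest=left, source=right)    # rotate A to the left
# each node now owns its block of A C + Q; no allreduce is needed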


References

[1] A. V. Knyazev, Toward the Optimal Preconditioned Eigensolver: Locally Optimal Block Preconditioned Conjugate Gradient Method, SIAM Journal on Scientific Computing 23, 517-541 (2001).

[2] L. E. Cannon, A cellular computer to implement the Kalman filter algorithm, PhD thesis (1969).