Developers:LOBPCG
From OctopusWiki
Contents |
Description of the algorithm
The following is a basic description of the locally optimal block preconditioned conjugate gradient algorithm as implemented in the file eigen_lobpcg_inc.F90. Locking of alreeady converged eigenvectors is omitted.
For more details, see [1].
Given a preconditioner
and
initial real or complex vectors
,
we form the column matrix
and solve the eigenvalue problem
for the
lowest eigenpairs
by the following iterative procedure:
1. Make initial vectors orthonormal:
(The
denotes
for real
and
for complex
.)
Calculate the upper triangular matrix
of the Cholesky factorization.
2. Get initial Ritz-vectors and -values:
Solve the following
eigenvalue problem for all
eigenvalues and -vectors:
Save the initial eigenvalues in a vector:
Multiply the initial vectors by the column matrix of the Ritz-vectors
and set the conjugate directions to zero:
3. for
do
Calculate residuals:
4. Terminate convergence loop if all residuals are below a certain threshold.
6. Apply the preconditioner to the residuals:
6. Make residuals orthogonal to eigenvectors:
7. Orthonormalize residuals:
8. if
then
Orthonormalize conjugate directions:
end if
9. Rayleigh-Ritz procedure on subspace
.
Set up symmetric Gram-matrices:
if
then
else
end if
10. Calculate the lowest
eigenpairs of the
(for
) or
(for
) generalized eigenvalue problem
11. Partition the column-matrix of the results into two or three block matrices of
size:
, for
, and
, for
12. Caclulate new eigenvectors and, if
, new conjugate directions:
if
then
else
end if
done
Notes
Despite solving only a standard eigenvalue problem
, it does not work to use the usual Rayleigh-Ritz procedure, i. e. solving
, because this is unstable. The reason for this is that not all the vectors in the subspace are orthogonal, i. e. the
matrix is not identity. One could do an orthogonalization step before doing the Rayleigh-Ritz procedure but this is more computationally expensive (Thanks for Andrew Knyazev, the inventor of the algorithm, for explaining this to me).
Parallel implementation
The LOBPCG implementation in Octopus is parallelized in states and in domains. In other word, the column matrix of states
is distributed on the nodes as
,
where we have
nodes and the
are blocks like
.
So, we see the real space points are distributed in the
direction (via the usual domain decomposition) and the states are distributed in the
direction (by the states_distribute_nodes routine).
In LOBPCG we have to operations to perform to operations on these matrices:
- Multiplication of two big matrices
with
,
distributed as explainded above where the small result matrix is required on all nodes. We need this to calculate the supaspaces for example.
- Multiplication of a distributed matrix
by a small matrix
from the right. The resulting (big) matrix in turn has to distributed on all nodes like the input matrix
. Thise operation for example performs the calculation of the new conjugate directions.
We now describe the communication and computation pattern used for the parallel implementation of the two operations.
by Xstates_blockt_mul
Consider the matrix product
.
Each node can at first calculate the product of its two local blocks and write the result to the corresponding block. Then the blocks of
are rotated to the left, i. e. the distribution now looks like this:
Now, each node again does the product of its two local blocks but this time writes the result right next to the previous result. If there is no right block, we circle to the first block. This is scheme is iterated
times until all blocks have entries.
The intermediate local results on the nodes now consist of one summand for each block, for example for node hosting the
,
block:
To obtain the full result, we simply do an MPI_Allreduce to sum up all contributions.
The algorithm used here is basically a variation of Cannon's algorithm. In the original algorithm the blocks of the first matrix are rotated upwards. This scheme leads to a distribution of the result and would require an additional MPI_Alltoall because we need the result matrix on all nodes. By omitting the downward rotation and replacing the MPI_Alltoall by an MPI_Allreduce we obtain a faster implementation.
To have everything explicit, we write down the algorithm. Let
be the result matrix and
,
the blocks currently living on node with coordinates
,
with
and
:
![]()
for
do rotate
columnwise to the left
![]()
done perform an allreduce on
with
![]()
To complicate things, we observe that we can overlap the rotating of
with the calculation of
. That is, while calculating the product, the new block for
is being received and the old one sended away. This leads to the following algorithmic formulation with send and receive buffers
,
respectively:
![]()
for
do
if
then wait for message transmission to complete
end if if
then asynchronoulsy receive block from right neighbour in
and asynchronously send
to left neighbour end if
done perform an allreduce
with
![]()
This is how the operation is implemented in the code using MPI_Isend and MPI_Irecv and the cartesian topology facility of MPI to obtain the rank number of the neighbour.
One subtle issue is that the blocks need not be of the same size but this is handled by providing a communication buffers the largest block fits in.
by Xstates_block_matr_mul_add
There remains not much to say about this routine. It uses the same communication pattern as the Xstates_blockt_mul operation except that it rotates the blocks of the
matrix since
is known on all nodes. As the result is distributed, no need for an allreduce steps arises.
References
A. V. Knyazev, Toward the Optimal Preconditioned Eigensolver: Locally Optimal Block Preconditioned Conjugate Gradient Method, SIAM Journal on Scientific Computing 23 517-541 (2001)
L. E. Cannon, A cellular computer to implement the Kalman filter algorithm, PhD thesis (1969)
for
do
rotate
done
perform an allreduce on
for
do
then
wait for message transmission to complete
end if
if
then
asynchronoulsy receive block from right neighbour in
done
perform an allreduce 
