ViennaCL - The Vienna Computing Library  1.6.0
Free open-source GPU-accelerated linear algebra and solver library.
Algorithms

This chapter gives an overview of the algorithms available in ViennaCL. The focus of ViennaCL is on iterative solvers, for which generic implementations are provided that allow the use of the same code on the CPU (either using Boost.uBLAS, Eigen, MTL4, or ViennaCL types) and on the GPU (using ViennaCL types).

Direct Solvers

ViennaCL provides triangular solvers and LU factorization without pivoting for the solution of dense linear systems. The interface is similar to that of Boost.uBLAS.

using namespace viennacl::linalg; //to keep solver calls short
// Set up matrix and vectors here
// solution of an upper triangular system:
vcl_result = solve(vcl_matrix, vcl_rhs, upper_tag());
// solution of a lower triangular system:
vcl_result = solve(vcl_matrix, vcl_rhs, lower_tag());
// in-place solution of a full system; the result overwrites the right-hand-side vector vcl_rhs:
lu_factorize(vcl_matrix);
lu_substitute(vcl_matrix, vcl_rhs);

ViennaCL does not include pivoting in the LU factorization process, hence the computation may break down or yield results of poor accuracy. However, for certain classes of matrices (such as diagonally dominant matrices), good results can be obtained without pivoting.

It is also possible to solve for multiple right hand sides:

using namespace viennacl::linalg; //to keep solver calls short
viennacl::matrix<float> vcl_rhs_matrix;
// Set up matrices here
// solution of an upper triangular system:
vcl_result = solve(vcl_matrix, vcl_rhs_matrix, upper_tag());
// solution of a lower triangular system:
vcl_result = solve(vcl_matrix, vcl_rhs_matrix, lower_tag());

Iterative Solvers

ViennaCL provides iterative solvers for several classes of matrices; an overview is given in the table below.

Unlike direct solvers, the convergence of iterative solvers relies on certain properties of the system matrix. Keep in mind that an iterative solver may fail to converge, especially if the matrix is ill-conditioned or an unsuitable solver is chosen.

Note
The iterative solvers can also be used for Boost.uBLAS, Eigen and MTL4 objects! Have a look at Interfacing Other Libraries and the respective tutorials in the examples/tutorials/ folder.
// Set up matrix and vectors here
// solution using conjugate gradient solver:
vcl_result = viennacl::linalg::solve(vcl_matrix, vcl_rhs, viennacl::linalg::cg_tag());
// solution using BiCGStab solver:
vcl_result = viennacl::linalg::solve(vcl_matrix, vcl_rhs, viennacl::linalg::bicgstab_tag());
// solution using GMRES solver:
vcl_result = viennacl::linalg::solve(vcl_matrix, vcl_rhs, viennacl::linalg::gmres_tag());
Method                                Matrix class                 ViennaCL routine
Conjugate Gradient (CG)               symmetric positive definite  y = solve(A, x, cg_tag());
Stabilized Bi-CG (BiCGStab)           non-symmetric                y = solve(A, x, bicgstab_tag());
Generalized Minimum Residual (GMRES)  general                      y = solve(A, x, gmres_tag());

Linear solver routines in ViennaCL for the computation of $ y $ in the expression $ Ay = x $ with given $ A $, $ x $.

Customized error tolerances can be set in the solver tags. The convention is that solver tags take the relative error tolerance as first argument and the maximum number of iteration steps as second argument. Furthermore, after the solver run the number of iterations and the estimated error can be obtained from the solver tags as follows:

// conjugate gradient solver with tolerance 1e-10
// and at most 100 iterations:
viennacl::linalg::cg_tag custom_cg(1e-10, 100);
vcl_result = viennacl::linalg::solve(vcl_matrix, vcl_rhs, custom_cg);
// print number of iterations taken and estimated error:
std::cout << "No. of iters: " << custom_cg.iters() << std::endl;
std::cout << "Est. error: " << custom_cg.error() << std::endl;

The BiCGStab solver tag can be customized in exactly the same way. The GMRES solver tag takes as third argument the dimension of the Krylov space. Thus, a tag for GMRES(30) with tolerance $ 10^{-10} $ and at most 100 total iterations (hence, up to three restarts) can be set up by

viennacl::linalg::gmres_tag custom_gmres(1e-10, 100, 30);

Preconditioners

ViennaCL ships with generic implementations of several preconditioners. Except for the simple diagonal preconditioners, the preconditioner setup is always carried out on the CPU host due to the need for dynamically allocating memory. Thus, one may not obtain an overall performance benefit if too much time is spent on the preconditioner setup.

Note
The preconditioners also work for Boost.uBLAS types!

An overview of preconditioners available for the various sparse matrix types is as follows:

Matrix Type        ICHOL  (Block-)ILU[0/T]  Jacobi  Row-scaling  AMG  SPAI
compressed_matrix  yes    yes               yes     yes          yes  yes
coordinate_matrix  no     no                yes     yes          no   no
ell_matrix         no     no                no      no           no   no
hyb_matrix         no     no                no      no           no   no

Broader support of preconditioners, particularly for ell_matrix and hyb_matrix, is scheduled for future releases. AMG and SPAI preconditioners are described in the Additional Algorithms section.

Incomplete LU Factorization with Threshold (ILUT)

The incomplete LU factorization preconditioner aims at computing sparse lower and upper triangular matrices $ L $ and $ U $ such that the sparse system matrix is approximately given by $ A \approx LU $. In order to control the sparsity pattern of $ L $ and $ U $, a threshold strategy is used (ILUT) [21]. Due to the serial nature of the preconditioner, the setup of ILUT is always computed on the CPU using the respective ViennaCL backend.

// configure and compute ILUT preconditioner:
viennacl::linalg::ilut_tag ilut_config; // default parameters, see below
viennacl::linalg::ilut_precond< SparseMatrix > vcl_ilut(vcl_matrix, ilut_config);
// solve (e.g. using conjugate gradient solver)
vcl_result = viennacl::linalg::solve(vcl_matrix, vcl_rhs,
                                     vcl_ilut); // preconditioner here

The triangular substitutions may be applied in parallel on GPUs by enabling level-scheduling [21] via the member function call use_level_scheduling(true) in the ilut_config object.

Three parameters can be passed to the constructor of ilut_tag: the first specifies the maximum number of entries per row in $ L $ and $ U $, the second specifies the drop tolerance, and the third is a boolean specifying whether level scheduling should be used.
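
For illustration, a tag limiting $ L $ and $ U $ to 20 entries per row with a drop tolerance of $ 10^{-4} $ might be set up as follows (the values are arbitrary and chosen for the example only):

// ILUT tag: at most 20 entries per row, drop tolerance 1e-4 (illustrative values):
viennacl::linalg::ilut_tag my_ilut_config(20, 1e-4);
// enable level scheduling after construction (disabled by default):
my_ilut_config.use_level_scheduling(true);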

Note
The performance of level scheduling depends strongly on the matrix pattern and is thus disabled by default.

Incomplete LU Factorization with Static Pattern (ILU0)

Similar to ILUT, ILU0 computes an approximate LU factorization with sparse factors $ L $ and $ U $. While ILUT determines the location of the nonzero entries on the fly, ILU0 uses the sparsity pattern of $ A $ for the sparsity pattern of $ L $ and $ U $ [21]. Due to the serial nature of the preconditioner, the setup of ILU0 is computed on the CPU.

// configure and compute ILU0 preconditioner:
viennacl::linalg::ilu0_tag ilu0_config; // default parameters, see below
viennacl::linalg::ilu0_precond< SparseMatrix > vcl_ilu0(vcl_matrix, ilu0_config);
// solve (e.g. using conjugate gradient solver)
vcl_result = viennacl::linalg::solve(vcl_matrix, vcl_rhs,
                                     vcl_ilu0); // preconditioner here

The triangular substitutions may be applied in parallel on GPUs by enabling level-scheduling [21] via the member function call use_level_scheduling(true) in the ilu0_config object.

One parameter can be passed to the constructor of ilu0_tag: a boolean specifying whether level scheduling should be used.
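
For example, an ilu0_tag with level scheduling enabled may be created as follows:

// ILU0 tag with level scheduling enabled (disabled by default):
viennacl::linalg::ilu0_tag my_ilu0_config(true);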

Note
The performance of level scheduling depends strongly on the matrix pattern and is thus disabled by default.

Block-ILU

To overcome the serial nature of ILUT and ILU0 applied to the full system matrix, a parallel variant is to apply ILU to diagonal blocks of the system matrix. This is accomplished by the block_ilu preconditioner, which takes the system matrix type as first template argument and the respective ILU-tag type as second template argument (either ilut_tag or ilu0_tag). Support for accelerators using CUDA or OpenCL is provided.

// compute block-ILU preconditioner using ILU0 for each block:
viennacl::linalg::block_ilu_precond<SparseMatrix, viennacl::linalg::ilu0_tag>
                  vcl_block_ilu0(vcl_matrix, viennacl::linalg::ilu0_tag());
// solve
vcl_result = viennacl::linalg::solve(vcl_matrix, vcl_rhs,
vcl_block_ilu0);

A third argument can be passed to the constructor of block_ilu_precond: either the number of blocks to be used (defaults to 8), or an index vector giving fine-grained control over the blocks.
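
For illustration, a block-ILU0 preconditioner using 16 blocks instead of the default may be set up as follows (the block count is arbitrary and chosen for the example only):

// block-ILU0 preconditioner with 16 diagonal blocks (default: 8):
viennacl::linalg::block_ilu_precond<SparseMatrix, viennacl::linalg::ilu0_tag>
                  vcl_block_ilu0_16(vcl_matrix, viennacl::linalg::ilu0_tag(), 16);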

Note
The number of blocks is a design parameter for the sparse linear system at hand. A higher number of blocks leads to better memory bandwidth utilization on GPUs, but may increase the number of solver iterations.

Jacobi Preconditioner

A Jacobi preconditioner is a simple diagonal preconditioner given by the reciprocals of the diagonal entries of the system matrix. Use the preconditioner as follows:

// compute Jacobi preconditioner:
viennacl::linalg::jacobi_precond< SparseMatrix > vcl_jacobi(vcl_matrix, viennacl::linalg::jacobi_tag());
// solve (e.g. using conjugate gradient solver)
vcl_result = viennacl::linalg::solve(vcl_matrix, vcl_rhs,
                                     vcl_jacobi);

Row Scaling Preconditioner

A row scaling preconditioner is a simple diagonal preconditioner given by the reciprocals of the norms of the rows of the system matrix. Use the preconditioner as follows:

// compute row scaling preconditioner:
viennacl::linalg::row_scaling< SparseMatrix > vcl_row_scaling(vcl_matrix, viennacl::linalg::row_scaling_tag());
// solve (e.g. using conjugate gradient solver)
vcl_result = viennacl::linalg::solve(vcl_matrix, vcl_rhs,
                                     vcl_row_scaling);

The tag viennacl::linalg::row_scaling_tag() can be supplied with a parameter denoting the norm to be used: a value of $ 1 $ specifies the $ l^1 $-norm, while a value of $ 2 $ selects the $ l^2 $-norm (default).
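
For example, a row scaling preconditioner using the $ l^1 $-norm may be set up as follows:

// row scaling preconditioner using the l^1-norm of each row:
viennacl::linalg::row_scaling< SparseMatrix > vcl_row_scaling_l1(vcl_matrix,
                                              viennacl::linalg::row_scaling_tag(1));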

Eigenvalue Computations

Two algorithms for the computation of the eigenvalues of a sparse matrix are implemented in ViennaCL:

  • The Power Iteration [10]
  • The Lanczos Algorithm [22]

The algorithms are called for a matrix object A by

// Lanczos: returns the largest eigenvalues (ltag is a lanczos_tag, see below)
std::vector<double> largest_eigenvalues = viennacl::linalg::eig(A, ltag);
// power iteration: returns the eigenvalue of largest absolute value (ptag is a piter_tag)
double largest_eigenvalue = viennacl::linalg::eig(A, ptag);

Depending on the tag supplied as second argument, one of the two methods is called. Both algorithms can be used for either Boost.uBLAS or ViennaCL compressed matrices. In order to obtain the eigenvalue with the largest absolute value, the power iteration should be called. The Lanczos algorithm returns a vector of the largest eigenvalues, with the same type as the entries of the matrix.

Power Iteration

The power iteration aims at computing the eigenvalue of a matrix with the largest absolute value. It repeatedly multiplies the matrix with a vector, where the resulting vector is used for the next multiplication, and so on. The computation stops as soon as the norm of the vector converges; the final vector is the eigenvector corresponding to the eigenvalue with the largest absolute value. To call this algorithm, piter_tag must be used. This tag has only one parameter: terminationfactor defines the accuracy of the computation, i.e. if the norm of the eigenvector changes by less than this value, the computation stops and the corresponding eigenvalue is returned (default: $ 10^{-10} $). The call of the constructor may look as follows:

viennacl::linalg::piter_tag ptag(1e-8);
Note
Example code can be found in examples/tutorial/power-iter.cpp

The Lanczos Algorithm

The Lanczos algorithm can be used to compute the eigenvalues of a sparse high-dimensional matrix. It transforms the given high-dimensional matrix into a tridiagonal matrix of much lower dimension, whose eigenvalues correspond to the largest eigenvalues of the original matrix and are computed using the bisection method [10]. To call the Lanczos algorithm, lanczos_tag must be used. This tag has several parameters that can be passed to the constructor:

  • The exponent of epsilon for the tolerance of the reorthogonalization, defined by the parameter factor (default: 0.75)
  • The method of the Lanczos algorithm: 0 uses partial reorthogonalization, 1 uses full reorthogonalization, and 2 does not use reorthogonalization (default: 0)
  • The number of eigenvalues that are returned is specified by num_eigenvalues (default: 10)
  • The size of the Krylov space used for the computations can be set by the parameter krylov_size (default: 100). The maximum number of iterations is equal to or less than this parameter.

The call of the constructor may look like the following:

// factor 0.85, 15 eigenvalues, partial reorthogonalization, Krylov size 200:
viennacl::linalg::lanczos_tag ltag(0.85, 15, 0, 200);
Note
Example code can be found in examples/tutorial/lanczos.cpp

QR Factorization

Note
The current QR factorization implementation depends on Boost.uBLAS.

A matrix $ A \in \mathbb{R}^{n\times m} $ can be factored into $ A = Q R $, where $ Q \in \mathbb{R}^{n\times n}$ is an orthogonal matrix and $ R \in \mathbb{R}^{n \times m}$ is upper triangular. This so-called QR-factorization is important for eigenvalue computations as well as for the solution of least-squares problems [10] . ViennaCL provides a generic implementation of the QR-factorization using Householder reflections in file viennacl/linalg/qr.hpp. An example application can be found in examples/tutorial/qr.cpp.

The Householder reflectors $ v_i $ defining the Householder reflection $ I - \beta_i v_i v_i^{\mathrm{T}} $ are stored in the columns below the diagonal of the input matrix $ A $ [10] . The normalization coefficients $ \beta_i $ are returned by the worker function inplace_qr. The upper triangular matrix $ R $ is directly written to the upper triangular part of $ A $.

// in-place QR factorization of A using a block (panel) size of 12:
std::vector<ScalarType> betas = viennacl::linalg::inplace_qr(A, 12);

If $ A $ is a dense matrix from Boost.uBLAS, the calculation is carried out on the CPU using a single thread. If $ A $ is a viennacl::matrix, a hybrid implementation is used: The panel factorization is carried out using Boost.uBLAS, while expensive BLAS level 3 operations are computed on the OpenCL device using multiple threads.

Typically, the orthogonal matrix $ Q $ is kept in implicit form for reasons of computational efficiency. However, if $ Q $ and $ R $ have to be computed explicitly, the function recoverQ can be used:
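
// recover Q and R explicitly from the in-place factored A and the betas:
viennacl::linalg::recoverQ(A, betas, Q, R);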

Here, A is the in-place QR-factored matrix and betas are the coefficients of the Householder reflectors as returned by inplace_qr, while Q and R are the destination matrices. However, the explicit formation of $ Q $ is expensive and usually avoided. For a number of applications of the QR factorization it is only required to apply $ Q^{\mathrm{T}} $ to a vector $ b $. This is accomplished by
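
// apply Q^T to a vector b using the implicitly stored Householder reflectors
// (helper from viennacl/linalg/qr.hpp):
viennacl::linalg::inplace_qr_apply_trans_Q(A, betas, b);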

without setting up $ Q $ (or $ Q^{\mathrm{T}} $) explicitly.

Note
Have a look at examples/tutorial/least-squares.cpp for a least-squares computation using QR factorizations.