Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RBKI #61

Merged
merged 56 commits into from
Apr 1, 2024
Merged

RBKI #61

merged 56 commits into from
Apr 1, 2024

Conversation

TeachRaccooon
Copy link
Contributor

@TeachRaccooon TeachRaccooon commented Nov 17, 2023

This PR adds a new driver algorithm - RBKI (Randomized Block Krylov Iteration) (formal name subject to change).
RBKI algorithm is a method for finding truncated SVD based on block Krylov iterations.
RBKI is applicable for matrices with a slowly-decaying spectrum; such cases are canonically considered ''hard`` for a standard randomized SVD.

This algorithm is a version of Algorithm A.1 from https://arxiv.org/pdf/2306.12418.pdf
The main difference is in the fact that an economy SVD is performed only once at the very end
of the algorithm run and that the termination criteria are not based on singular vector residual evaluation.
Instead, the scheme terminates if:
1. ||R||_F > sqrt(1 - eps^2) ||A||_F, which ensures that we've exhausted all vectors and doing more
iterations would bring no benefit or that ||A - hat(A)||_F < eps * ||A||_F.
2. Stop if the bottom right entry of R or S is numerically close to zero (up to the square root of machine eps).

The main cost of this algorithm comes from large GEMMs with the input matrix A.
The algorithm optionally times all its subcomponents through a user-defined 'timing' parameter.
Some things about the algorithm to keep in mind:
1. The need to iteratively allocate space for Krylov subspace buffers (Y_od, X_ev), as well as for the matrices that are to
be decomposed via SVD at the end of the alg (R, S). Additional allocation is done via realloc(), followed up by memset()
call to ensure that the newly allocated space has been zeroed out.
2. Matrix R needs to be kept in a transposed format at all times, as the final decomposition via SVD is to be done on R'.
3. Reorthogonalization steps are a must. These require additional (n + b_sz) by b_sz and n by b_sz computational buffers.
4. 'Q-factors' from QR factorization at every iteration form the Krylov subspace -> we need explicit versions of those. Hence,
using ungqr() is required.
5. Even though the algorithm accepts a "tolerance" parameter, we can control its termination using the "max_iters"
parameter.

In addition to the RBKI, the following is also introduced in this PR:

  1. Switch from using std::vector to pointers in the /misc/rl_gen.hh and /misc/rl_util.hh
  2. Addition of process_input_mat() functionality in /misc/rl_gen.hh - allows to read a given input matrix into RandLAPACK (without knowing its size in advance) from a file.
  3. Gemm vs ormqr speed benchmark.
  4. RBKI runtime breakdown benchmark - assesses the time taken by each subcomponent of RBKI.
  5. RBKI speed comparison benchmark - technically only runs RBKI, but has an option to run SVD (gesdd()) to be compared against RBKI (direct SVD is WAY slower than RBKI). The user is required to provide a matrix file to be read, set min and max numbers of large gemms (Krylov iterations) that the algorithm is allowed to perform min and max block sizes that RBKI is to use; furthermore, the user is to provide a 'custom rank' parameter (number of singular vectors to approximate by RBKI). The benchmark outputs the basic data of a given run, as well as the RBKI runtime and singular vector residual error, which is computed as "sqrt(||AV - SU||^2_F + ||A'U - VS||^2_F / sqrt(custom_rank)" (for "custom rank" singular vectors and values).

@TeachRaccooon
Copy link
Contributor Author

@rileyjmurray I've added all the major changes to RBKI that I wanted to. Please give me comments on this PR when you have time. Also, LMK if we will be updating RandBLAS submodule in this PR. I will hold off on working on the GPU PR until we complete this.

P.S.
The reason why the compilation on GIT fails is the HQRRP calls to LAPACK with/without additional parameters. The current git version of MKL is different from that on Elephant and ISAAC. This is a super quick fix that I will apply before we merge the PR.

Copy link
Contributor

@rileyjmurray rileyjmurray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the significant stuff looks good. I still left lots of comments though. I also have some recurring comments that I didn't bother mentioning for each individual file (see below).

benchmark documentation

I'd like for all benchmark scripts to have a comment towards the top that explains what it tests. Right now only a few files have this. Just a couple of sentences should suffice. I want to be able to easily infer what the differences are between things like RBKI_runtime_benchmark.cc and RBKI_speed_comparisons.cc. (I know you can explain it to be here, or in an email, or on a call, but I want it written in the files themselves.)

benchmarks on macOS

Several benchmarks have a pragma like #if !defined(__APPLE__) inside main which determines whether or not the executable does something nontrivial. This difference in behavior across platforms should be easier to identify at a glance. I suggest you use a pattern like the following:

#ifdef(__APPLE__)
#include <iostream>
int main(int argc, char **argc) {
    std::cout << "This benchmark cannot run on Apple machines." << std::endl;
    return 1;
}
#else
// all of your actual code ....
#endif

You can take care of this now, or you can open a new issue with this as a TODO and handle it in another PR.

template parameters in benchmarks

Most benchmark scripts include a function definition like

template <typename T, typename RNG>
static void 
test_speed(int64_t m, int64_t n, int64_t runs, RandBLAS::RNGState<RNG> const_state) 

which you invoke with lines like

test_speed<double, r123::Philox4x32>(std::pow(2, 10), std::pow(2, 5),  10, state);

Specifying r123::Philox4x32 as a template parameter shouldn't actually be needed. The compiler should be able to deduce it from the type of state. If you run into issues with compilers automatically deducing types then I'd prefer you change the function definition to have a default type, along the following lines:

template <typename T, typename RNG = r123::Philox4x32>
static void 
test_speed(int64_t m, int64_t n, int64_t runs, RandBLAS::RNGState<RNG> const_state)

I'd like for you to make these changes throughout all benchmarks. You can take care of this now, or you can open a new issue with this as a TODO and handle it in another PR.

@TeachRaccooon
Copy link
Contributor Author

All the significant stuff looks good. I still left lots of comments though. I also have some recurring comments that I didn't bother mentioning for each individual file (see below).

benchmark documentation

I'd like for all benchmark scripts to have a comment towards the top that explains what it tests. Right now only a few files have this. Just a couple of sentences should suffice. I want to be able to easily infer what the differences are between things like RBKI_runtime_benchmark.cc and RBKI_speed_comparisons.cc. (I know you can explain it to be here, or in an email, or on a call, but I want it written in the files themselves.)

benchmarks on macOS

Several benchmarks have a pragma like #if !defined(__APPLE__) inside main which determines whether or not the executable does something nontrivial. This difference in behavior across platforms should be easier to identify at a glance. I suggest you use a pattern like the following:

#ifdef(__APPLE__)
#include <iostream>
int main(int argc, char **argc) {
    std::cout << "This benchmark cannot run on Apple machines." << std::endl;
    return 1;
}
#else
// all of your actual code ....
#endif

You can take care of this now, or you can open a new issue with this as a TODO and handle it in another PR.

template parameters in benchmarks

Most benchmark scripts include a function definition like

template <typename T, typename RNG>
static void 
test_speed(int64_t m, int64_t n, int64_t runs, RandBLAS::RNGState<RNG> const_state) 

which you invoke with lines like

test_speed<double, r123::Philox4x32>(std::pow(2, 10), std::pow(2, 5),  10, state);

Specifying r123::Philox4x32 as a template parameter shouldn't actually be needed. The compiler should be able to deduce it from the type of state. If you run into issues with compilers automatically deducing types then I'd prefer you change the function definition to have a default type, along the following lines:

template <typename T, typename RNG = r123::Philox4x32>
static void 
test_speed(int64_t m, int64_t n, int64_t runs, RandBLAS::RNGState<RNG> const_state)

I'd like for you to make these changes throughout all benchmarks. You can take care of this now, or you can open a new issue with this as a TODO and handle it in another PR.

I applied all the explicitly-requested fixed & added small comments to each benchmark.
I have also created new issues per your MacOS and RNG templating comments.
The former I anticipate fixing in the next PR, the latter may be fixable once I update the version of RandBLAS used (also next PR).

Copy link
Contributor

@rileyjmurray rileyjmurray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm marking as approved, but I do have two tiny comments that it would be nice if you addressed.

Merge at your discretion.

@TeachRaccooon
Copy link
Contributor Author

Looks like a bunch of QB & SVD tests are failing on Git (they are all perfectly fine on Elephant). Investigating.

@TeachRaccooon
Copy link
Contributor Author

There was a memory leak in the condition number estimation utility function, as I forgot to free the memory allocated there.
Also, looked like there was some issue with OpenMP usage in MacOS; I applied a temporary fix and opened an issue for myself to resolve in the next PR.

@TeachRaccooon TeachRaccooon merged commit 5432731 into main Apr 1, 2024
2 checks passed
@TeachRaccooon TeachRaccooon deleted the RBKI branch April 1, 2024 21:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants