RBKI #61

TeachRaccooon · 2023-11-17T20:39:49Z

This PR adds a new driver algorithm, RBKI (Randomized Block Krylov Iteration; formal name subject to change).
RBKI computes a truncated SVD using block Krylov iterations.
It is applicable to matrices with a slowly decaying spectrum; such cases are canonically considered "hard" for a standard randomized SVD.

This algorithm is a version of Algorithm A.1 from https://arxiv.org/pdf/2306.12418.pdf.
The main difference is that an economy SVD is performed only once, at the very end
of the algorithm run, and that the termination criteria are not based on evaluating singular vector residuals.
Instead, the scheme terminates if either of the following holds:
1. ||R||_F > sqrt(1 - eps^2) ||A||_F. This ensures either that we've exhausted all vectors, so doing more
iterations would bring no benefit, or that ||A - hat(A)||_F < eps * ||A||_F.
2. The bottom right entry of R or S is numerically close to zero (up to the square root of machine eps).
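The two stopping rules can be sketched as a standalone C++ predicate. The function name rbki_should_stop and its arguments are hypothetical (this is not the RandLAPACK interface), and rule 2 is read here as an absolute threshold of sqrt(machine eps) on the bottom-right entries, which is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper mirroring the two stopping rules.
//   norm_R         : Frobenius norm of R accumulated so far
//   norm_A         : Frobenius norm of the input matrix A
//   tol            : user-supplied tolerance eps, assumed to lie in [0, 1]
//   r_last, s_last : bottom-right entries of R and S
// Assumption: rule 2 uses an absolute threshold of sqrt(machine eps).
template <typename T>
bool rbki_should_stop(T norm_R, T norm_A, T tol, T r_last, T s_last) {
    assert(tol >= 0 && tol <= 1);
    // Rule 1: ||R||_F > sqrt(1 - eps^2) * ||A||_F
    bool norm_converged = norm_R > std::sqrt(1 - tol * tol) * norm_A;
    // Rule 2: bottom-right entry of R or S is numerically zero
    const T thresh = std::sqrt(std::numeric_limits<T>::epsilon());
    bool breakdown = std::abs(r_last) < thresh || std::abs(s_last) < thresh;
    return norm_converged || breakdown;
}
```

For example, with ||A||_F = 1 and eps = 0.1, rule 1 fires once ||R||_F exceeds sqrt(0.99), roughly 0.995.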

The main cost of this algorithm comes from large GEMMs with the input matrix A.
The algorithm optionally times all its subcomponents through a user-defined 'timing' parameter.
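One way an optional timing switch like this could be structured is a per-subcomponent accumulator that does nothing when timing is off. The SubcomponentTimer class below is a hypothetical illustration built on std::chrono, not the timing code in the PR.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical accumulator: one slot per subcomponent (e.g. allocation,
// GEMM with A, QR, ungqr, reorthogonalization, final SVD). When `enabled`
// is false, start()/stop() return immediately and nothing is recorded.
class SubcomponentTimer {
public:
    SubcomponentTimer(bool enabled, std::size_t slots)
        : enabled_(enabled), totals_(slots, 0) {}

    void start() { if (enabled_) t0_ = Clock::now(); }

    // Adds the time elapsed since the last start() to the given slot.
    void stop(std::size_t slot) {
        if (!enabled_) return;
        assert(slot < totals_.size());
        totals_[slot] += std::chrono::duration_cast<std::chrono::microseconds>(
                             Clock::now() - t0_).count();
    }

    int64_t total_us(std::size_t slot) const { return totals_.at(slot); }

private:
    using Clock = std::chrono::steady_clock;
    bool enabled_;
    std::vector<int64_t> totals_;
    Clock::time_point t0_;
};
```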
Some things about the algorithm to keep in mind:
1. Space must be allocated iteratively for the Krylov subspace buffers (Y_od, X_ev), as well as for the matrices that are
decomposed via SVD at the end of the algorithm (R, S). Additional allocation is done via realloc(), followed by a memset()
call to ensure that the newly allocated space is zeroed out.
2. Matrix R needs to be kept in a transposed format at all times, as the final decomposition via SVD is to be done on R'.
3. Reorthogonalization steps are a must. These require additional (n + b_sz) by b_sz and n by b_sz computational buffers.
4. The Q-factors from the QR factorization at every iteration form the Krylov subspace, so we need explicit versions of them. Hence,
using ungqr() is required.
5. Although the algorithm accepts a "tolerance" parameter, its termination can also be controlled via the "max_iters"
parameter.
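The realloc()/memset() growth pattern from item 1 can be sketched as follows, assuming column-major storage where growing a buffer means appending whole columns, so the existing data stays in place and only the new tail needs zeroing. grow_zeroed is a hypothetical helper name, not a RandLAPACK function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: grow a column-major buffer with `rows` rows from
// `old_cols` to `new_cols` columns. realloc() preserves the existing
// columns (they are contiguous at the front); memset() zeroes only the
// newly added tail, since realloc() leaves that memory uninitialized.
// On failure returns nullptr; the original `buf` is still valid and must
// be freed by the caller.
template <typename T>
T* grow_zeroed(T* buf, int64_t rows, int64_t old_cols, int64_t new_cols) {
    assert(rows >= 0 && new_cols >= old_cols && old_cols >= 0);
    std::size_t old_elems = static_cast<std::size_t>(rows * old_cols);
    std::size_t new_elems = static_cast<std::size_t>(rows * new_cols);
    T* out = static_cast<T*>(std::realloc(buf, new_elems * sizeof(T)));
    if (out == nullptr) return nullptr;
    std::memset(out + old_elems, 0, (new_elems - old_elems) * sizeof(T));
    return out;
}
```

Passing nullptr as `buf` makes the first call behave like a zeroed allocation, since realloc(nullptr, size) acts like malloc(size).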

In addition to the RBKI, the following is also introduced in this PR:

1. Switch from using std::vector to pointers in /misc/rl_gen.hh and /misc/rl_util.hh.
2. Addition of process_input_mat() functionality in /misc/rl_gen.hh, which allows reading a given input matrix into RandLAPACK from a file without knowing its size in advance.
3. A GEMM vs. ormqr speed benchmark.
4. An RBKI runtime breakdown benchmark, which assesses the time taken by each subcomponent of RBKI.
5. An RBKI speed comparison benchmark. It technically only runs RBKI, but has an option to also run a direct SVD (gesdd()) for comparison (direct SVD is WAY slower than RBKI). The user is required to provide a matrix file to be read, to set the min and max numbers of large GEMMs (Krylov iterations) that the algorithm is allowed to perform and the min and max block sizes that RBKI is to use, and to provide a 'custom rank' parameter (the number of singular vectors for RBKI to approximate). The benchmark outputs the basic data of a given run, as well as the RBKI runtime and the singular vector residual error, which is computed as "sqrt(||AV - US||^2_F + ||A'U - VS||^2_F) / sqrt(custom_rank)" (over "custom rank" singular vectors and values).
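A plain-loop sketch of the benchmark's residual metric, assuming the parenthesization sqrt(||AV - US||^2_F + ||A'U - VS||^2_F) / sqrt(k), with column-major A (m x n), U (m x k), V (n x k), and S = diag(sigma). The function sv_residual is hypothetical; a real benchmark would compute AV and A'U with GEMMs rather than triple loops.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: sqrt(||A V - U S||_F^2 + ||A' U - V S||_F^2) / sqrt(k).
// All arrays are column-major; sigma holds the k diagonal entries of S.
template <typename T>
T sv_residual(int64_t m, int64_t n, int64_t k,
              const T* A, const T* U, const T* V, const T* sigma) {
    assert(k > 0);
    T sum = 0;
    for (int64_t j = 0; j < k; ++j) {
        // Column j of A V - U S
        for (int64_t i = 0; i < m; ++i) {
            T av = 0;
            for (int64_t p = 0; p < n; ++p) av += A[i + p * m] * V[p + j * n];
            T d = av - U[i + j * m] * sigma[j];
            sum += d * d;
        }
        // Column j of A' U - V S
        for (int64_t p = 0; p < n; ++p) {
            T atu = 0;
            for (int64_t i = 0; i < m; ++i) atu += A[i + p * m] * U[i + j * m];
            T d = atu - V[p + j * n] * sigma[j];
            sum += d * d;
        }
    }
    return std::sqrt(sum) / std::sqrt(static_cast<T>(k));
}
```

Dividing by sqrt(k) normalizes the error by the number of approximated triplets, so runs with different custom ranks are comparable.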

TeachRaccooon · 2024-03-21T21:39:12Z

@rileyjmurray I've added all the major changes to RBKI that I wanted to. Please give me comments on this PR when you have time. Also, LMK if we will be updating the RandBLAS submodule in this PR. I will hold off on working on the GPU PR until we complete this.

P.S.
Compilation on GitHub fails because of the HQRRP calls to LAPACK with/without additional parameters: the MKL version used on GitHub differs from the one on Elephant and ISAAC. This is a super quick fix that I will apply before we merge the PR.

rileyjmurray

All the significant stuff looks good. I still left lots of comments though. I also have some recurring comments that I didn't bother mentioning for each individual file (see below).

benchmark documentation

I'd like for all benchmark scripts to have a comment towards the top that explains what it tests. Right now only a few files have this. Just a couple of sentences should suffice. I want to be able to easily infer what the differences are between things like RBKI_runtime_benchmark.cc and RBKI_speed_comparisons.cc. (I know you can explain it to me here, or in an email, or on a call, but I want it written in the files themselves.)

benchmarks on macOS

Several benchmarks have a preprocessor check like #if !defined(__APPLE__) inside main, which determines whether or not the executable does anything nontrivial. This difference in behavior across platforms should be easier to identify at a glance. I suggest you use a pattern like the following:

#ifdef __APPLE__
#include <iostream>
int main(int argc, char **argv) {
    std::cout << "This benchmark cannot run on Apple machines." << std::endl;
    return 1;
}
#else
// all of your actual code ....
#endif

You can take care of this now, or you can open a new issue with this as a TODO and handle it in another PR.

template parameters in benchmarks

Most benchmark scripts include a function definition like

template <typename T, typename RNG>
static void 
test_speed(int64_t m, int64_t n, int64_t runs, RandBLAS::RNGState<RNG> const_state)

which you invoke with lines like

test_speed<double, r123::Philox4x32>(std::pow(2, 10), std::pow(2, 5),  10, state);

Specifying r123::Philox4x32 as a template parameter shouldn't actually be needed. The compiler should be able to deduce it from the type of state. If you run into issues with compilers automatically deducing types then I'd prefer you change the function definition to have a default type, along the following lines:

template <typename T, typename RNG = r123::Philox4x32>
static void 
test_speed(int64_t m, int64_t n, int64_t runs, RandBLAS::RNGState<RNG> const_state)

I'd like for you to make these changes throughout all benchmarks. You can take care of this now, or you can open a new issue with this as a TODO and handle it in another PR.
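The deduction point can be checked with a self-contained illustration that uses mock types in place of r123::Philox4x32 and RandBLAS::RNGState (MockPhilox and MockRNGState are hypothetical stand-ins). Only T has to be spelled out, because it does not appear in the parameter list; RNG is deduced from the type of the state argument.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-ins for r123::Philox4x32 and RandBLAS::RNGState<RNG>.
struct MockPhilox {};
template <typename RNG> struct MockRNGState {};

// T cannot be deduced (it is not used in any parameter), so the caller
// supplies it; RNG is deduced from the type of `const_state`.
// Returns true when RNG was deduced as MockPhilox.
template <typename T, typename RNG>
static bool test_speed(int64_t m, int64_t n, int64_t runs, MockRNGState<RNG> const_state) {
    assert(m > 0 && n > 0 && runs > 0);
    (void) const_state;
    return std::is_same<RNG, MockPhilox>::value;
}
```

With the real types, the call would correspondingly shrink to test_speed<double>(std::pow(2, 10), std::pow(2, 5), 10, state);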

RandLAPACK/comps/rl_orth.hh

RandLAPACK/drivers/rl_cqrrp.hh

RandLAPACK/drivers/rl_rbki.hh

benchmark/bench_RBKI/RBKI_runtime_benchmark.cc

test/drivers/test_rbki.cc

TeachRaccooon · 2024-04-01T18:37:27Z

I applied all the explicitly requested fixes and added short comments to each benchmark.
I have also created new issues per your macOS and RNG templating comments.
I anticipate fixing the former in the next PR; the latter may be fixable once I update the version of RandBLAS used (also in the next PR).

rileyjmurray

I'm marking as approved, but I do have two tiny comments that it would be nice if you addressed.

Merge at your discretion.

benchmark/bench_CQRRP/CQRRP_runtime_breakdown.cc

benchmark/bench_CQRRP/CQRRP_pivot_quality.cc

test/drivers/test_rbki.cc

TeachRaccooon · 2024-04-01T20:28:05Z

Looks like a bunch of QB & SVD tests are failing on GitHub (they all pass on Elephant). Investigating.

TeachRaccooon · 2024-04-01T21:25:13Z

There was a memory leak in the condition number estimation utility function: I forgot to free the memory allocated there.
Also, there appeared to be an issue with OpenMP usage on macOS; I applied a temporary fix and opened an issue for myself to resolve in the next PR.
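The comment does not say what the temporary fix was. As a hedged illustration only: one common way to keep OpenMP-dependent code building with toolchains that lack OpenMP support (such as a default Apple Clang setup) is to guard it with the standard _OPENMP macro. available_threads is a hypothetical helper.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: number of threads OpenMP regions may use, falling
// back to 1 when the translation unit is compiled without OpenMP.
inline int available_threads() {
#ifdef _OPENMP
    int n = omp_get_max_threads();
#else
    int n = 1;
#endif
    assert(n >= 1);
    return n;
}
```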

TeachRaccooon added 30 commits November 17, 2023 12:34

Adding RBKI files and a Gemm vs ormqr benchmark

26e3952

R not transposed

dd5dd14

Works for small cases, print statements in

c2fa698

Seems to be working

bf2ccb1

Cleanup

4bd772b

Cleanup

5964258

Update

e3cc36f

Update

78cca0b

Benchmark update

d5b4d2c

Trying to add matrix files processing capability; having an issue with get_line function locally.

21c8a1c

Added capabilities to read input matrix.

56574e3

Reworking matrix generators.

28bcd20

All that is left is to change mat_gen signature

cc56005

Need to fix matrix read order.

96df795

Ready for RBKI benchmarking

583d8a6

Faced an issue with accuracy. Need to check Rob's implementation.

3f4430e

Tuned the benchmark for dataset 1

de2ca8d

Benchmark fux

5ee730f

Update

77a44bc

Small RBKI bug fix

ccb097f

Debugging

96c727a

Bug fixed

dd5f190

Added detailed (maybe too much) time profiling in RBKI, fixed openmp bug, fixed norm bug.

963fa0e

Update

93c072f

Isolating GEQR in CQRRPT speed benchmark

76e4084

Update

c99a49a

Update

a477e3f

Update

0fd485e

Update

a148550

Adding a benchmark for apple project

ef1bfad

TeachRaccooon added 13 commits February 28, 2024 01:18

RBKI benchmark update, print statements in

d8a1fd2

Ready to benchmark on large matrices

01b1f8a

Update

7b81e1d

Benchmarking ICQRRP with QP3

e15cfd5

Finished yet another RBKI debug. prints in

e93299d

Ready for benchmarking

60f50b5

Reworked RBKI benchmark to be based on num matmuls rather than target rank.

6390469

Update before reworking RBKI

2a3694f

Fix

b0e0001

Update before RBKI rewwork

b115bda

Reworkd allocation logic in RBKI. Old logic commented out

fc50a55

Removed commented out logic

8e6211e

Removed commented out logic

b824aa0

rileyjmurray requested changes Mar 28, 2024

TeachRaccooon added 2 commits April 1, 2024 11:42

Update per Riley's comments

109edae

Git compilation update

5cf905a

rileyjmurray approved these changes Apr 1, 2024

TeachRaccooon added 2 commits April 1, 2024 13:12

Updae

41a8117

Update

9865d28

TeachRaccooon added 5 commits April 1, 2024 13:32

Messing with the generator

5d869fc

Messing with the generator

e5dd024

Forgot to free the memory in condition number check.

9b8da93

Fixing omp mac issue

ecf60d3

Fixing omp mac issue

9e1e036

TeachRaccooon merged commit 5432731 into main Apr 1, 2024
2 checks passed

TeachRaccooon deleted the RBKI branch April 1, 2024 21:26
