Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RBKI #61

Merged
merged 56 commits into from
Apr 1, 2024
Merged

RBKI #61

Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
56 commits
Select commit Hold shift + click to select a range
26e3952
Adding RBKI files and a Gemm vs ormqr benchmark
TeachRaccooon Oct 23, 2023
dd5dd14
R not transposed
TeachRaccooon Oct 27, 2023
c2fa698
Works for small cases, print statements in
TeachRaccooon Oct 31, 2023
bf2ccb1
Seems to be working
TeachRaccooon Nov 6, 2023
4bd772b
Cleanup
TeachRaccooon Nov 6, 2023
5964258
Cleanup
TeachRaccooon Nov 6, 2023
e3cc36f
Update
TeachRaccooon Nov 7, 2023
78cca0b
Update
TeachRaccooon Nov 13, 2023
d5b4d2c
Benchmark update
TeachRaccooon Nov 17, 2023
21c8a1c
Trying to add matrix files processing capability; having an issue wit…
TeachRaccooon Nov 17, 2023
56574e3
Added capabilities to read input matrix.
TeachRaccooon Nov 20, 2023
28bcd20
Reworking matrix generators.
TeachRaccooon Nov 20, 2023
cc56005
All that is left is to change mat_gen signature
TeachRaccooon Nov 20, 2023
96df795
Need to fix matrix read order.
TeachRaccooon Nov 20, 2023
583d8a6
Ready for RBKI benchmarking
TeachRaccooon Nov 20, 2023
3f4430e
Faced an issue with accuracy. Need to check Rob's implementation.
TeachRaccooon Nov 21, 2023
de2ca8d
Tuned the benchmark for dataset 1
TeachRaccooon Nov 28, 2023
5ee730f
Benchmark fux
TeachRaccooon Dec 7, 2023
77a44bc
Update
TeachRaccooon Dec 7, 2023
ccb097f
Small RBKI bug fix
TeachRaccooon Jan 4, 2024
96c727a
Debugging
TeachRaccooon Jan 16, 2024
dd5f190
Bug fixed
TeachRaccooon Jan 16, 2024
963fa0e
Added detailed (maybe too much) time profiling in RBKI, fixed openmp …
TeachRaccooon Jan 16, 2024
93c072f
Update
TeachRaccooon Jan 18, 2024
76e4084
Isolating GEQR in CQRRPT speed benchmark
TeachRaccooon Jan 19, 2024
c99a49a
Update
TeachRaccooon Jan 29, 2024
a477e3f
Update
TeachRaccooon Jan 30, 2024
0fd485e
Update
TeachRaccooon Jan 30, 2024
a148550
Update
TeachRaccooon Feb 1, 2024
ef1bfad
Adding a benchmark for apple project
TeachRaccooon Feb 9, 2024
c354fd7
Update
TeachRaccooon Feb 9, 2024
958fb6b
Added RBKI runtime benchmark
TeachRaccooon Feb 10, 2024
1089aa4
Reworking of RBKI speed benchmark.
TeachRaccooon Feb 12, 2024
28ee6ca
Update
TeachRaccooon Feb 12, 2024
d8a1fd2
RBKI benchmark update, print statements in
TeachRaccooon Feb 28, 2024
01b1f8a
Ready to benchmark on large matrices
TeachRaccooon Feb 28, 2024
7b81e1d
Update
TeachRaccooon Mar 5, 2024
e15cfd5
Benchmarking ICQRRP with QP3
TeachRaccooon Mar 6, 2024
e93299d
Finished yet another RBKI debug. prints in
TeachRaccooon Mar 11, 2024
60f50b5
Ready for benchmarking
TeachRaccooon Mar 11, 2024
6390469
Reworked RBKI benchmark to be based on num matmuls rather than target…
TeachRaccooon Mar 13, 2024
2a3694f
Update before reworking RBKI
TeachRaccooon Mar 20, 2024
b0e0001
Fix
TeachRaccooon Mar 20, 2024
b115bda
Update before RBKI rewwork
TeachRaccooon Mar 20, 2024
fc50a55
Reworkd allocation logic in RBKI. Old logic commented out
TeachRaccooon Mar 21, 2024
8e6211e
Removed commented out logic
TeachRaccooon Mar 21, 2024
b824aa0
Removed commented out logic
TeachRaccooon Mar 21, 2024
109edae
Update per Riley's comments
TeachRaccooon Apr 1, 2024
5cf905a
Git compilation update
TeachRaccooon Apr 1, 2024
41a8117
Updae
TeachRaccooon Apr 1, 2024
9865d28
Update
TeachRaccooon Apr 1, 2024
5d869fc
Messing with the generator
TeachRaccooon Apr 1, 2024
e5dd024
Messing with the generator
TeachRaccooon Apr 1, 2024
9b8da93
Forgot to free the memory in condition number check.
TeachRaccooon Apr 1, 2024
ecf60d3
Fixing omp mac issue
TeachRaccooon Apr 1, 2024
9e1e036
Fixing omp mac issue
TeachRaccooon Apr 1, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions RandLAPACK.hh
Original file line number Diff line number Diff line change
Expand Up @@ -21,5 +21,6 @@
#include "RandLAPACK/drivers/rl_cqrrpt.hh"
#include "RandLAPACK/drivers/rl_cqrrp.hh"
#include "RandLAPACK/drivers/rl_revd2.hh"
#include "RandLAPACK/drivers/rl_rbki.hh"

#endif
1 change: 1 addition & 0 deletions RandLAPACK/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@

set(RandLAPACK_cxx_sources
rl_rbki.hh
rl_lapackpp.hh
rl_cqrrpt.hh
rl_cqrrp.hh
Expand Down
4 changes: 2 additions & 2 deletions RandLAPACK/comps/rl_orth.hh
Original file line number Diff line number Diff line change
Expand Up @@ -95,8 +95,8 @@ int CholQRQ<T>::call(

// Scheme may succeed, but output garbage
if(this->cond_check) {
if(util::cond_num_check(k, k, Q_gram, this->Q_gram_cpy, this->s, this->verbosity) > (1 / std::sqrt(std::numeric_limits<T>::epsilon()))){
// return 1;
if(util::cond_num_check(k, k, Q_gram.data(), (this->Q_gram_cpy).data(), (this->s).data(), this->verbosity) > (1 / std::sqrt(std::numeric_limits<T>::epsilon()))){
return 1;
}
}

Expand Down
2 changes: 1 addition & 1 deletion RandLAPACK/comps/rl_rf.hh
Original file line number Diff line number Diff line change
Expand Up @@ -129,7 +129,7 @@ int RF<T, RNG>::call(

if(this->cond_check)
// Writes into this->cond_nums
this->cond_nums.push_back(util::cond_num_check(m, k, Q, this->Q_cpy, this->s, this->verbosity));
this->cond_nums.push_back(util::cond_num_check(m, k, Q.data(), (this->Q_cpy).data(), (this->s).data(), this->verbosity));

if(this->Orth_Obj.call(m, k, Q))
return 2; // Orthogonalization failed
Expand Down
4 changes: 2 additions & 2 deletions RandLAPACK/comps/rl_rs.hh
Original file line number Diff line number Diff line change
Expand Up @@ -160,7 +160,7 @@ int RS<T, RNG>::call(
++ p_done;

if(this->cond_check)
this->cond_nums.push_back(util::cond_num_check(m, k, Omega_1, this->Omega_1_cpy, this->s, this->verbosity));
this->cond_nums.push_back(util::cond_num_check(m, k, Omega_1.data(), (this->Omega_1_cpy).data(), (this->s).data(), this->verbosity));

if ((p_done % q == 0) && (this->Stab_Obj.call(m, k, Omega_1)))
return 1;
Expand All @@ -170,7 +170,7 @@ int RS<T, RNG>::call(
++ p_done;

if (this->cond_check)
this->cond_nums.push_back(util::cond_num_check(n, k, Omega, this->Omega_cpy, this->s, this->verbosity));
this->cond_nums.push_back(util::cond_num_check(n, k, Omega.data(), (this->Omega_cpy).data(), (this->s).data(), this->verbosity));

if ((p_done % q == 0) && (this->Stab_Obj.call(n, k, Omega)))
return 1;
Expand Down
2 changes: 1 addition & 1 deletion RandLAPACK/comps/rl_syrf.hh
Original file line number Diff line number Diff line change
Expand Up @@ -139,7 +139,7 @@ int SYRF<T, RNG>::call(
util::upsize(m * k, this->cond_work_mat);
util::upsize(k, this->cond_work_vec);
this->cond_nums.push_back(
util::cond_num_check(m, k, Q, this->cond_work_mat, this->cond_work_vec, this->verbose)
util::cond_num_check(m, k, Q.data(), (this->cond_work_mat).data(), (this->cond_work_vec).data(), this->verbose)
);
}
if(this->Orth_Obj.call(m, k, Q))
Expand Down
64 changes: 29 additions & 35 deletions RandLAPACK/drivers/rl_cqrrp.hh
Original file line number Diff line number Diff line change
Expand Up @@ -85,6 +85,9 @@ class CQRRP_blocked : public CQRRPalg<T, RNG> {
/// @param[in] tau
/// Pointer to a vector of size n. On entry, is empty.
///
/// @param[in] state
/// RNG state parameter, required for sketching operator generation.
///
/// @param[out] A
/// Overwritten by Implicit Q and explicit R factors.
///
Expand Down Expand Up @@ -117,7 +120,7 @@ class CQRRP_blocked : public CQRRPalg<T, RNG> {
int64_t rank;
int64_t block_size;

// 11 entries - logs time for different portions of the algorithm
// 12 entries - logs time for different portions of the algorithm
std::vector<long> times;
// Times each iteration of the algorithm, divides size of a processed matrix by the time it took to process.
// At each iteration, the algorithm will process rows by b_sz matrix; rows -= b_sz.
Expand Down Expand Up @@ -145,50 +148,39 @@ int CQRRP_blocked<T, RNG>::call(
//-------TIMING VARS--------/
high_resolution_clock::time_point preallocation_t_stop;
high_resolution_clock::time_point preallocation_t_start;
long preallocation_t_dur = 0;

high_resolution_clock::time_point saso_t_stop;
high_resolution_clock::time_point saso_t_start;
long saso_t_dur = 0;

high_resolution_clock::time_point qrcp_t_start;
high_resolution_clock::time_point qrcp_t_stop;
long qrcp_t_dur = 0;

high_resolution_clock::time_point cholqr_t_start;
high_resolution_clock::time_point cholqr_t_stop;
long cholqr_t_dur = 0;

high_resolution_clock::time_point reconstruction_t_start;
high_resolution_clock::time_point reconstruction_t_stop;
long reconstruction_t_dur = 0;

high_resolution_clock::time_point preconditioning_t_start;
high_resolution_clock::time_point preconditioning_t_stop;
long preconditioning_t_dur = 0;

high_resolution_clock::time_point r_piv_t_start;
high_resolution_clock::time_point r_piv_t_stop;
long r_piv_t_dur = 0;

high_resolution_clock::time_point updating1_t_start;
high_resolution_clock::time_point updating1_t_stop;
long updating1_t_dur = 0;

high_resolution_clock::time_point updating2_t_start;
high_resolution_clock::time_point updating2_t_stop;
long updating2_t_dur = 0;

high_resolution_clock::time_point updating3_t_start;
high_resolution_clock::time_point updating3_t_stop;
long updating3_t_dur = 0;

high_resolution_clock::time_point total_t_start;
high_resolution_clock::time_point total_t_stop;
long total_t_dur = 0;

high_resolution_clock::time_point iter_t_start;
high_resolution_clock::time_point iter_t_stop;
long preallocation_t_dur = 0;
long saso_t_dur = 0;
long qrcp_t_dur = 0;
long cholqr_t_dur = 0;
long reconstruction_t_dur = 0;
long preconditioning_t_dur = 0;
long r_piv_t_dur = 0;
long updating1_t_dur = 0;
long updating2_t_dur = 0;
long updating3_t_dur = 0;
long total_t_dur = 0;

if(this -> timing) {
total_t_start = high_resolution_clock::now();
Expand Down Expand Up @@ -298,7 +290,7 @@ int CQRRP_blocked<T, RNG>::call(

RandBLAS::sketch_general(
Layout::ColMajor, Op::NoTrans, Op::NoTrans,
d, n, m, 1.0, S, 0, 0, A, lda, 0.0, A_sk, d
d, n, m, (T) 1.0, S, 0, 0, A, lda, (T) 0.0, A_sk, d
);

if(this -> timing) {
Expand All @@ -317,15 +309,17 @@ int CQRRP_blocked<T, RNG>::call(
// Zero-out data - may not be necessary
std::fill(&J_buffer[0], &J_buffer[n], 0);
std::fill(&J_buffer_lu[0], &J_buffer_lu[std::min(d, n)], 0);
std::fill(&Work2[0], &Work2[n], 0.0);
std::fill(&Work2[0], &Work2[n], (T) 0.0);

if(this -> timing)
qrcp_t_start = high_resolution_clock::now();


//lapack::geqp3(sampling_dimension, cols, A_sk, d, J_buffer, Work2);

// Perform pivoted LU on A_sk', follow it up by unpivoted QR on a permuted A_sk.
// Get a transpose of A_sk
for(i = 0; i < cols; ++i)
blas::copy(sampling_dimension, &A_sk[i * d], 1, &A_sk_trans[i], n);
util::transposition(sampling_dimension, cols, A_sk, d, A_sk_trans, n, 0);

// Perform a row-pivoted LU on a transpose of A_sk
lapack::getrf(cols, sampling_dimension, A_sk_trans, n, J_buffer_lu);
// Fill the pivot vector, apply swaps found via lu on A_sk'.
Expand All @@ -339,7 +333,7 @@ int CQRRP_blocked<T, RNG>::call(
util::col_swap(sampling_dimension, cols, cols, A_sk, d, J_buf);
// Perform an unpivoted QR on A_sk
lapack::geqrf(sampling_dimension, cols, A_sk, d, Work2);

if(this -> timing) {
qrcp_t_stop = high_resolution_clock::now();
qrcp_t_dur += duration_cast<microseconds>(qrcp_t_stop - qrcp_t_start).count();
Expand Down Expand Up @@ -370,7 +364,7 @@ int CQRRP_blocked<T, RNG>::call(

// A_pre = AJ(:, 1:b_sz) * inv(R_sk)
// Performing preconditioning of the current matrix A.
blas::trsm(Layout::ColMajor, Side::Right, Uplo::Upper, Op::NoTrans, Diag::NonUnit, rows, b_sz, 1.0, R_sk, d, A_work, lda);
blas::trsm(Layout::ColMajor, Side::Right, Uplo::Upper, Op::NoTrans, Diag::NonUnit, rows, b_sz, (T) 1.0, R_sk, d, A_work, lda);

if(this -> timing) {
preconditioning_t_stop = high_resolution_clock::now();
Expand All @@ -381,11 +375,11 @@ int CQRRP_blocked<T, RNG>::call(
cholqr_t_start = high_resolution_clock::now();

// Performing Cholesky QR
blas::syrk(Layout::ColMajor, Uplo::Upper, Op::Trans, b_sz, rows, 1.0, A_work, lda, 0.0, R_cholqr, b_sz_const);
blas::syrk(Layout::ColMajor, Uplo::Upper, Op::Trans, b_sz, rows, (T) 1.0, A_work, lda, (T) 0.0, R_cholqr, b_sz_const);
lapack::potrf(Uplo::Upper, b_sz, R_cholqr, b_sz_const);

// Compute Q_econ from Cholesky QR
blas::trsm(Layout::ColMajor, Side::Right, Uplo::Upper, Op::NoTrans, Diag::NonUnit, rows, b_sz, 1.0, R_cholqr, b_sz_const, A_work, lda);
blas::trsm(Layout::ColMajor, Side::Right, Uplo::Upper, Op::NoTrans, Diag::NonUnit, rows, b_sz, (T) 1.0, R_cholqr, b_sz_const, A_work, lda);

if(this -> timing) {
cholqr_t_stop = high_resolution_clock::now();
Expand Down Expand Up @@ -447,7 +441,7 @@ int CQRRP_blocked<T, RNG>::call(
// Alternatively, instead of trmm + copy, we could perform a single gemm.
// Compute R11 = R11_full(1:b_sz, :) * R_sk
// R11_full is stored in R_cholqr space, R_sk is stored in A_sk space.
blas::trmm(Layout::ColMajor, Side::Right, Uplo::Upper, Op::NoTrans, Diag::NonUnit, b_sz, b_sz, 1.0, R_sk, d, R_cholqr, b_sz_const);
blas::trmm(Layout::ColMajor, Side::Right, Uplo::Upper, Op::NoTrans, Diag::NonUnit, b_sz, b_sz, (T) 1.0, R_sk, d, R_cholqr, b_sz_const);
// Need to copy R11 over form R_cholqr into the appropriate space in A.
// We cannot avoid this copy, since trmm() assumes R_cholqr is a square matrix.
// In a global sense, this is identical to:
Expand Down Expand Up @@ -546,11 +540,11 @@ int CQRRP_blocked<T, RNG>::call(
// trsm (R_sk, R11) -> R_sk
// Clearing the lower-triangular portion here is necessary, if there is a more elegant way, need to use that.
RandLAPACK::util::get_U(b_sz, b_sz, R_sk, d);
blas::trsm(Layout::ColMajor, Side::Right, Uplo::Upper, Op::NoTrans, Diag::NonUnit, b_sz, b_sz, 1.0, R11, lda, R_sk, d);
blas::trsm(Layout::ColMajor, Side::Right, Uplo::Upper, Op::NoTrans, Diag::NonUnit, b_sz, b_sz, (T) 1.0, R11, lda, R_sk, d);
// R_sk_12 - R_sk_11 * inv(R_11) * R_12
// Side note: might need to be careful when d = b_sz.
// Cannot perform trmm here as an alternative, since matrix difference is involved.
blas::gemm(Layout::ColMajor, Op::NoTrans, Op::NoTrans, b_sz, cols - b_sz, b_sz, -1.0, R_sk, d, R12, lda, 1.0, &R_sk[d * b_sz], d);
blas::gemm(Layout::ColMajor, Op::NoTrans, Op::NoTrans, b_sz, cols - b_sz, b_sz, (T) -1.0, R_sk, d, R12, lda, (T) 1.0, &R_sk[d * b_sz], d);

// Changing the sampling dimension parameter
sampling_dimension = std::min(sampling_dimension, cols);
Expand Down
3 changes: 3 additions & 0 deletions RandLAPACK/drivers/rl_cqrrpt.hh
Original file line number Diff line number Diff line change
Expand Up @@ -92,6 +92,9 @@ class CQRRPT : public CQRRPTalg<T, RNG> {
/// Represents the upper-triangular R factor of QR factorization.
/// On entry, is empty and may not have any space allocated for it.
///
/// @param[in] state
/// RNG state parameter, required for sketching operator generation.
///
/// @param[out] A
/// Overwritten by an m-by-k orthogonal Q factor.
/// Matrix is stored explicitly.
Expand Down
Loading
Loading