Gpu bvls #316

craigwarner-ufastro · 2024-07-26T19:21:09Z

Implemented a CuPy-based GPU version of the BVLS algorithm from
scipy.optimize.lsq_linear.  It lives in the optimize folder and will be
submitted to CuPy for inclusion, at which point it will be removed from
Redrock.

zscan has been updated to use this GPU-based version instead of our NNLS
trick of adding negative Legendre terms and using NNLS to emulate BVLS.
That trick is still available via self._solver_method =
"bvls_via_nnls" in archetypes.py, where archetype files are allowed to
specify their preferred solver method (pca, nnls, bvls, or
bvls_via_nnls).
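
For readers unfamiliar with the NNLS trick, here is a minimal sketch of why it reproduces BVLS when the archetype coefficients must be non-negative and the Legendre coefficients are unbounded. It uses SciPy on the CPU with toy data, not the CuPy implementation in this PR; the function names, array shapes, and coefficients are assumptions made for illustration only.

```python
import numpy as np
from scipy.optimize import lsq_linear, nnls


def solve_bvls(arch, leg, flux):
    """Bounded fit: archetype coefficients >= 0, Legendre coefficients free."""
    narch, nleg = arch.shape[1], leg.shape[1]
    A = np.hstack([arch, leg])
    lower = np.concatenate([np.zeros(narch), np.full(nleg, -np.inf)])
    upper = np.full(narch + nleg, np.inf)
    return lsq_linear(A, flux, bounds=(lower, upper), method="bvls").x


def solve_bvls_via_nnls(arch, leg, flux):
    """Same fit with NNLS: append -1 * each Legendre column to the basis."""
    narch, nleg = arch.shape[1], leg.shape[1]
    A_aug = np.hstack([arch, leg, -leg])
    coeff, _ = nnls(A_aug, flux)
    # A signed Legendre coefficient is (positive part) - (negative part).
    leg_coeff = coeff[narch:narch + nleg] - coeff[narch + nleg:]
    return np.concatenate([coeff[:narch], leg_coeff])


# Toy data (hypothetical): 3 archetype columns plus Legendre P0 and P1.
rng = np.random.default_rng(0)
wave = np.linspace(-1.0, 1.0, 200)
arch = rng.uniform(0.1, 1.0, size=(wave.size, 3))
leg = np.polynomial.legendre.legvander(wave, 1)
truth = np.array([1.0, 0.0, 2.0, -0.5, 0.3])
flux = np.hstack([arch, leg]) @ truth + 0.01 * rng.standard_normal(wave.size)

x_bvls = solve_bvls(arch, leg, flux)
x_trick = solve_bvls_via_nnls(arch, leg, flux)
print(np.allclose(x_bvls, x_trick, atol=1e-6))
```

The augmented basis has twice as many Legendre columns, which is why operations downstream of the fit work on larger arrays when the trick is used.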

This branch is forked from the branch legendre_nnls_merge_with_main,
which was ready to merge with main and had addressed conflicts that had
arisen during parallel development by Abhijeet and me.

zfind is modified slightly to print the total time spent in each solver
method.
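
As a rough illustration of this kind of per-method timing, the sketch below accumulates wall-clock time per solver method and formats it like the log lines in the timing results. It is not the code added to zfind; `timed`, `report`, and `fit_time` are hypothetical names.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per solver method (hypothetical global).
fit_time = defaultdict(float)


@contextmanager
def timed(method):
    """Add the time spent inside the with-block to fit_time[method]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        fit_time[method] += time.perf_counter() - start


def report(times):
    """Format one summary line per solver method."""
    return [
        f"Total fitting time (method {method}): {seconds:.1f} seconds"
        for method, seconds in times.items()
    ]


with timed("BVLS"):
    sum(i * i for i in range(10_000))  # stand-in for a solver call

for line in report(fit_time):
    print(line)
```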

Timing Results:
With 4 GPUs and 4 CPUs, BVLS on GPU takes ~17s of fitting, giving a
fine redshift scan time of 72s.  The NNLS trick takes ~14s of fitting
for a fine redshift scan time of 76s; the fitting is faster but the
overall scan is slower because other operations run on larger arrays
that include the added negative Legendre terms.  For comparison, scipy's
CPU-based BVLS called from the GPU run takes ~63s for the fitting and
119s for the fine redshift scan.

Full timing results:
GPU BVLS (4 CPU, 4 GPU)

time srun -n 4 -c 2 --gpu-bind=map_gpu:3,2,1,0 rrdesi_mpi --gpu --max-gpuprocs 4 --archetypes new-archetypes/ -i $CFS/desi/spectro/redux/fuji/tiles/cumulative/100/20210505/coadd-0-100-thru20210505.fits     -o /pscratch/sd/c/cdwarner/abhijeet_bvls_gpu_3.fits
Finding best redshift: 72.0 seconds
Total fitting time (method PCA): 1.8 seconds
Total fitting time (method BVLS): 17.4 seconds
Total run time: 82.3 seconds

GPU BVLS via NNLS trick on CPU (4 CPU, 4 GPU)


time srun -n 4 -c 2 --gpu-bind=map_gpu:3,2,1,0 rrdesi_mpi --gpu --max-gpuprocs 4 --archetypes new-archetypes/ -i $CFS/desi/spectro/redux/fuji/tiles/cumulative/100/20210505/coadd-0-100-thru20210505.fits     -o /pscratch/sd/c/cdwarner/abhijeet_nnls_gpu_3.fits

Finding best redshift: 75.7 seconds
Total fitting time (method PCA): 1.9 seconds
Total fitting time (method NNLS): 13.8 seconds
Total run time: 84.8 seconds

GPU BVLS via CPU BVLS (4 CPU, 4 GPU)

time srun -n 4 -c 2 --gpu-bind=map_gpu:3,2,1,0 rrdesi_mpi --gpu --max-gpuprocs 4 --archetypes new-archetypes/ -i $CFS/desi/spectro/redux/fuji/tiles/cumulative/100/20210505/coadd-0-100-thru20210505.fits     -o /pscratch/sd/c/cdwarner/abhijeet_bvls_gpu_cpu_1.fits
Finding best redshift: 119.3 seconds
Total fitting time (method PCA): 1.8 seconds
Total fitting time (method BVLS): 63.4 seconds
Total run time: 129.4 seconds

CPU BVLS (64 CPU, 0 GPU)


time srun -n 64 -c 2 rrdesi_mpi --archetypes new-archetypes/ -i $CFS/desi/spectro/redux/fuji/tiles/cumulative/100/20210505/coadd-0-100-thru20210505.fits     -o /pscratch/sd/c/cdwarner/abhijeet_bvls_cpu_3.fits
Finding best redshift: 91.2 seconds
Total fitting time (method PCA): 0.6 seconds
Total fitting time (method BVLS): 10.4 seconds
Total run time: 132.4 seconds

CPU BVLS via NNLS trick (64 CPU, 0 GPU)

time srun -n 64 -c 2 rrdesi_mpi --archetypes new-archetypes/ -i $CFS/desi/spectro/redux/fuji/tiles/cumulative/100/20210505/coadd-0-100-thru20210505.fits     -o /pscratch/sd/c/cdwarner/abhijeet_nnls_cpu_3.fits

Finding best redshift: 142.3 seconds
Total fitting time (method PCA): 0.7 seconds
Total fitting time (method NNLS): 1.9 seconds
Computing redshifts: 171.8 seconds

For comparison
PCA only, GPU (4 GPU, 4 CPU):

time srun -n 4 -c 2 --gpu-bind=map_gpu:3,2,1,0 rrdesi_mpi --gpu --max-gpuprocs 4 -i $CFS/desi/spectro/redux/fuji/tiles/cumulative/100/20210505/coadd-0-100-thru20210505.fits     -o /pscratch/sd/c/cdwarner/pca_3.fits
Finding best redshift: 11.4 seconds
Total fitting time (method PCA): 1.0 seconds
Total run time: 19.3 seconds

PCA only, CPU (64 CPU, 0 GPU)

time srun -n 64 -c 2 rrdesi_mpi -i $CFS/desi/spectro/redux/fuji/tiles/cumulative/100/20210505/coadd-0-100-thru20210505.fits     -o /pscratch/sd/c/cdwarner/pca_cpu_3.fits
Finding best redshift: 9.2 seconds
Total fitting time (method PCA): 0.6 seconds
Total run time: 45.4 seconds

Output verification results:
All 3 GPU methods produce identical output (==).  Compared with the CPU results, they differ only at the level of floating-point noise and pass np.allclose.
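
The comparison script itself is not part of this PR. A minimal sketch of an equivalent check is shown below, assuming each output has already been loaded into a mapping from column name to array, and assuming `std` is the standard deviation of the per-row differences; `compare_columns` is a hypothetical name.

```python
import numpy as np


def compare_columns(a, b, columns=("Z", "ZERR", "CHI2")):
    """Report exact and approximate agreement for each column."""
    lines, is_equal, is_close = [], True, True
    for col in columns:
        diff = np.asarray(a[col]) - np.asarray(b[col])
        ndiff = int(np.count_nonzero(diff))  # rows that are not bit-identical
        lines.append(f"{col=}  {ndiff=}  std={diff.std()}")
        is_equal &= ndiff == 0
        is_close &= bool(np.allclose(a[col], b[col]))
    return lines, is_equal, is_close


# Toy inputs (hypothetical): CHI2 differs by floating-point noise only.
cpu = {"Z": np.array([0.5, 1.2]), "ZERR": np.array([1e-4, 2e-4]),
       "CHI2": np.array([100.0, 250.0])}
gpu = {"Z": cpu["Z"].copy(), "ZERR": cpu["ZERR"].copy(),
       "CHI2": cpu["CHI2"] + 1e-9}

lines, is_equal, is_close = compare_columns(cpu, gpu)
print("\n".join(lines))
print(f"isEqual={is_equal}")
print(f"isClose={is_close}")
```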

cdwarner@nid200305:/global/cfs/cdirs/desi/users/cdwarner/code> python compare_redrock_output.py /pscratch/sd/c/cdwarner/abhijeet_bvls_cpu_3.fits /pscratch/sd/c/cdwarner/abhijeet_bvls_gpu_3.fits
col='Z'  ndiff=425  std=4.280722910930986e-14
col='ZERR'  ndiff=446  std=5.190012193925673e-13
col='CHI2'  ndiff=438  std=1.293574168179979e-08
isEqual=False
isClose=True
cdwarner@nid200305:/global/cfs/cdirs/desi/users/cdwarner/code> python compare_redrock_output.py /pscratch/sd/c/cdwarner/abhijeet_bvls_cpu_3.fits /pscratch/sd/c/cdwarner/abhijeet_nnls_gpu_3.fits
col='Z'  ndiff=425  std=4.280722910930986e-14
col='ZERR'  ndiff=446  std=5.190012193925673e-13
col='CHI2'  ndiff=438  std=1.293574168179979e-08
isEqual=False
isClose=True
cdwarner@nid200305:/global/cfs/cdirs/desi/users/cdwarner/code> python compare_redrock_output.py /pscratch/sd/c/cdwarner/abhijeet_bvls_cpu_3.fits /pscratch/sd/c/cdwarner/abhijeet_bvls_gpu_cpu_1.fits
col='Z'  ndiff=425  std=4.280722910930986e-14
col='ZERR'  ndiff=446  std=5.190012193925673e-13
col='CHI2'  ndiff=438  std=1.293574168179979e-08
isEqual=False
isClose=True
cdwarner@nid200305:/global/cfs/cdirs/desi/users/cdwarner/code> python compare_redrock_output.py /pscratch/sd/c/cdwarner/abhijeet_bvls_cpu_3.fits /pscratch/sd/c/cdwarner/abhijeet_nnls_cpu_3.fits
col='Z'  ndiff=0  std=0.0
col='ZERR'  ndiff=0  std=0.0
col='CHI2'  ndiff=349  std=5.094350830929183e-11
isEqual=False
isClose=True
cdwarner@nid200305:/global/cfs/cdirs/desi/users/cdwarner/code> python compare_redrock_output.py /pscratch/sd/c/cdwarner/abhijeet_bvls_gpu_3.fits /pscratch/sd/c/cdwarner/abhijeet_nnls_gpu_3.fits
col='Z'  ndiff=0  std=0.0
col='ZERR'  ndiff=0  std=0.0
col='CHI2'  ndiff=0  std=0.0
isEqual=True
isClose=True
cdwarner@nid200305:/global/cfs/cdirs/desi/users/cdwarner/code> python compare_redrock_output.py /pscratch/sd/c/cdwarner/abhijeet_bvls_gpu_3.fits /pscratch/sd/c/cdwarner/abhijeet_bvls_gpu_cpu_1.fits
col='Z'  ndiff=0  std=0.0
col='ZERR'  ndiff=0  std=0.0
col='CHI2'  ndiff=0  std=0.0
isEqual=True
isClose=True

negative legendre terms. Refactored a little to clean up. For archetypes with legendre terms, NNLS is now actually used with additional terms of -1*each legendre term so that we can use the faster NNLS to get the same result as BVLS.

negative legendre terms on the GPU in lieu of BVLS into per_camera_coeff_with_least_square_batch. Moved prior_on_coeffs into zscan.py, called from within per_camera_coeff_with_least_square_batch. Pass prior_sigma, a scalar, instead of the array from fitz to archetypes. Bug fix on case where only some object types, not all, have an archetype.

terms with 0 coefficients and make the coeffs negative for negative terms. Refactored to remove the last bit of the BVLS-NNLS trick from archetypes.py and zfind.py by passing binned instead of tdata to per_camera_coeff_with_least_square_batch.

coveralls · 2024-07-26T19:23:38Z

coverage: 35.804% (-2.8%) from 38.554%
when pulling a55c2ec on gpu_bvls
into 0428a3a on main.

craigwarner-ufastro and others added 10 commits February 27, 2024 13:48

Modified archetypes using BVLS to add -1*all legendre terms to the basis vector (in tdata) so that NNLS can be used instead of BVLS. Fixed bug when ValueError excepted in rebinning

c4132a5

Only use additional negative legendre terms and NNLS when in GPU mode because on CPU, the additional computation time for the larger array sizes dominates any time savings of NNLS vs BVLS

7911c5b

Merge branch 'main' into legendre_nnls

bbf187b

Refactor to bring more in-line with main where the conflicts mostly involve using a list of bands instead of ncam to be more general.

fd76617

Created a new branch to handle the merging of legendre_nnls and main because results were different and timing as well.

a4a9531

Merge branch 'main' of https://github.com/desihub/redrock into legendre_nnls_merge_with_main Resolved conflicts

425d3fc

craigwarner-ufastro added 3 commits July 26, 2024 12:37

Updated requirements.txt to bring inline with PR313, numpy<2.0

12ab910

Updated GPU BVLS implementation to bug fix inner loop that is not entered in redrock test data but bug was found when developing a unit test prior to submitting to CuPy.

37ad671

Found and fixed one more bug in inner loop of BVLS that happened every few thousand random tests and caused a mismatch in array dimensions.

a55c2ec
