Gpu bvls #316

Open
wants to merge 13 commits into
base: main

craigwarner-ufastro
Contributor

@craigwarner-ufastro craigwarner-ufastro commented Jul 26, 2024

Implemented a CuPy-based GPU version of the BVLS algorithm from
scipy.optimize.lsq_linear. It lives in the optimize folder and will be
submitted to CuPy for inclusion, at which point it will be removed from
Redrock.
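
For reference, this is roughly what the CPU call being ported looks like. It is a minimal sketch using scipy's own lsq_linear with method="bvls"; the matrix shapes and bounds are toy values chosen for illustration, not the ones Redrock uses, and the GPU version's function name and signature are not shown here.

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(0)

# Toy design matrix: 50 "pixels" by 3 template columns (illustrative sizes only).
A = rng.normal(size=(50, 3))
b = rng.normal(size=50)

# First coefficient must be non-negative; the other two are unbounded.
lower = np.array([0.0, -np.inf, -np.inf])
upper = np.array([np.inf, np.inf, np.inf])

# Bounded-variable least squares: minimize ||A x - b||^2 subject to the bounds.
res = lsq_linear(A, b, bounds=(lower, upper), method="bvls")
coeffs = res.x
print(coeffs)
```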

zscan has been updated to use this GPU-based version instead of our NNLS
trick (adding negative Legendre terms and using NNLS to emulate BVLS).
The trick is still available by setting self._solver_method =
"bvls_via_nnls" in archetypes.py, where archetype files are allowed to
specify their preferred solver method (pca, nnls, bvls, or
bvls_via_nnls).
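
To illustrate why the NNLS trick reproduces BVLS, here is a small sketch on synthetic data. It assumes one non-negative archetype column plus three unbounded Legendre columns; all names, sizes, and coefficients are made up for the example and are not taken from zscan.py. Each unbounded coefficient c is written as c_pos - c_neg with both parts non-negative, which is what appending the -1 * Legendre columns achieves.

```python
import numpy as np
from scipy.optimize import lsq_linear, nnls

rng = np.random.default_rng(1)
npix, nleg = 40, 3                      # toy sizes, not Redrock's
x = np.linspace(-1.0, 1.0, npix)

# One positive "archetype" column and Legendre columns of degree 0..nleg-1.
archetype = 0.5 + rng.random(npix)
legendre = np.polynomial.legendre.legvander(x, nleg - 1)   # shape (npix, nleg)
A = np.column_stack([archetype, legendre])
flux = A @ np.array([2.0, 0.3, -0.7, 0.4]) + 0.01 * rng.normal(size=npix)

# BVLS: archetype coefficient >= 0, Legendre coefficients unbounded.
lower = np.concatenate([[0.0], np.full(nleg, -np.inf)])
upper = np.full(1 + nleg, np.inf)
coeffs_bvls = lsq_linear(A, flux, bounds=(lower, upper), method="bvls").x

# NNLS trick: append -1 * each Legendre column, solve with all coefficients >= 0,
# then recombine each (positive, negative) pair into one signed coefficient.
A_aug = np.column_stack([archetype, legendre, -legendre])
c, _ = nnls(A_aug, flux)
coeffs_via_nnls = np.concatenate([c[:1], c[1:1 + nleg] - c[1 + nleg:]])

print(coeffs_bvls)
print(coeffs_via_nnls)
```

The augmented matrix has nleg extra columns, which is the source of the larger arrays mentioned in the timing discussion.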

This branch is forked from the branch legendre_nnls_merge_with_main,
which was ready to merge with main and had addressed conflicts that had
arisen during parallel development by Abhijeet and me.

zfind is modified slightly to add a timing print statement for the time
spent in each solver method.
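
A per-method timer of this kind can be sketched as follows. This is an assumed pattern, not the code in zfind: fit_time and timed_solve are hypothetical names, and the two "solvers" are stand-in built-ins so the snippet runs on its own. The printed line mimics the format seen in the logs.

```python
import time
from collections import defaultdict

# Seconds accumulated per solver method (hypothetical bookkeeping).
fit_time = defaultdict(float)

def timed_solve(method, solver, *args):
    """Run solver(*args) and add its wall-clock time to fit_time[method]."""
    t0 = time.perf_counter()
    result = solver(*args)
    fit_time[method] += time.perf_counter() - t0
    return result

# Stand-in work; real code would call the PCA / NNLS / BVLS fitting routines.
timed_solve("PCA", sum, [1, 2, 3])
timed_solve("BVLS", sorted, [3, 1, 2])

for method, seconds in fit_time.items():
    print(f"Total fitting time (method {method}): {seconds:.1f} seconds")
```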

Timing Results:
With 4 GPUs and 4 CPUs, BVLS on the GPU spends ~17s fitting, giving a
fine redshift scan time of 72s. The NNLS trick spends ~14s fitting but
gives a fine redshift scan time of 76s: the fit itself is faster, but
the scan is slower overall because other operations run on larger
arrays with the added negative Legendre terms. For comparison, scipy's
CPU-based BVLS takes ~63s for the fitting and 119s for the fine
redshift scan.
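
As a quick worked comparison, the ratios implied by these numbers can be computed directly. The values are copied from the 4 CPU / 4 GPU timing logs in this description; the dictionary labels are just shorthand for this example.

```python
# Seconds from the 4 CPU / 4 GPU runs reported in this description.
fit_seconds = {"GPU BVLS": 17.4, "NNLS trick": 13.8, "scipy CPU BVLS": 63.4}
scan_seconds = {"GPU BVLS": 72.0, "NNLS trick": 75.7, "scipy CPU BVLS": 119.3}

baseline = "scipy CPU BVLS"
for name in fit_seconds:
    fit_speedup = fit_seconds[baseline] / fit_seconds[name]
    scan_speedup = scan_seconds[baseline] / scan_seconds[name]
    print(f"{name}: fit {fit_speedup:.1f}x, scan {scan_speedup:.1f}x vs {baseline}")
```

For GPU BVLS this gives about 3.6x on the fit and about 1.7x on the fine redshift scan relative to scipy's CPU BVLS.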

Full timing results:
GPU BVLS (4 CPU, 4 GPU)

time srun -n 4 -c 2 --gpu-bind=map_gpu:3,2,1,0 rrdesi_mpi --gpu --max-gpuprocs 4 --archetypes new-archetypes/ -i $CFS/desi/spectro/redux/fuji/tiles/cumulative/100/20210505/coadd-0-100-thru20210505.fits     -o /pscratch/sd/c/cdwarner/abhijeet_bvls_gpu_3.fits
Finding best redshift: 72.0 seconds
Total fitting time (method PCA): 1.8 seconds
Total fitting time (method BVLS): 17.4 seconds
Total run time: 82.3 seconds

GPU BVLS via NNLS trick on CPU (4 CPU, 4 GPU)


time srun -n 4 -c 2 --gpu-bind=map_gpu:3,2,1,0 rrdesi_mpi --gpu --max-gpuprocs 4 --archetypes new-archetypes/ -i $CFS/desi/spectro/redux/fuji/tiles/cumulative/100/20210505/coadd-0-100-thru20210505.fits     -o /pscratch/sd/c/cdwarner/abhijeet_nnls_gpu_3.fits

Finding best redshift: 75.7 seconds
Total fitting time (method PCA): 1.9 seconds
Total fitting time (method NNLS): 13.8 seconds
Total run time: 84.8 seconds

GPU BVLS via CPU BVLS (4 CPU, 4 GPU)

time srun -n 4 -c 2 --gpu-bind=map_gpu:3,2,1,0 rrdesi_mpi --gpu --max-gpuprocs 4 --archetypes new-archetypes/ -i $CFS/desi/spectro/redux/fuji/tiles/cumulative/100/20210505/coadd-0-100-thru20210505.fits     -o /pscratch/sd/c/cdwarner/abhijeet_bvls_gpu_cpu_1.fits
Finding best redshift: 119.3 seconds
Total fitting time (method PCA): 1.8 seconds
Total fitting time (method BVLS): 63.4 seconds
Total run time: 129.4 seconds

CPU BVLS (64 CPU, 0 GPU)


time srun -n 64 -c 2 rrdesi_mpi --archetypes new-archetypes/ -i $CFS/desi/spectro/redux/fuji/tiles/cumulative/100/20210505/coadd-0-100-thru20210505.fits     -o /pscratch/sd/c/cdwarner/abhijeet_bvls_cpu_3.fits
Finding best redshift: 91.2 seconds
Total fitting time (method PCA): 0.6 seconds
Total fitting time (method BVLS): 10.4 seconds
Total run time: 132.4 seconds

CPU BVLS via NNLS trick (64 CPU, 0 GPU)

time srun -n 64 -c 2 rrdesi_mpi --archetypes new-archetypes/ -i $CFS/desi/spectro/redux/fuji/tiles/cumulative/100/20210505/coadd-0-100-thru20210505.fits     -o /pscratch/sd/c/cdwarner/abhijeet_nnls_cpu_3.fits

Finding best redshift: 142.3 seconds
Total fitting time (method PCA): 0.7 seconds
Total fitting time (method NNLS): 1.9 seconds
Computing redshifts: 171.8 seconds

For comparison
PCA only, GPU (4 GPU, 4 CPU):

time srun -n 4 -c 2 --gpu-bind=map_gpu:3,2,1,0 rrdesi_mpi --gpu --max-gpuprocs 4 -i $CFS/desi/spectro/redux/fuji/tiles/cumulative/100/20210505/coadd-0-100-thru20210505.fits     -o /pscratch/sd/c/cdwarner/pca_3.fits
Finding best redshift: 11.4 seconds
Total fitting time (method PCA): 1.0 seconds
Total run time: 19.3 seconds

PCA only, CPU (64 CPU, 0 GPU)

time srun -n 64 -c 2 rrdesi_mpi -i $CFS/desi/spectro/redux/fuji/tiles/cumulative/100/20210505/coadd-0-100-thru20210505.fits     -o /pscratch/sd/c/cdwarner/pca_cpu_3.fits
Finding best redshift: 9.2 seconds
Total fitting time (method PCA): 0.6 seconds
Total run time: 45.4 seconds

Output verification results:
All 3 GPU methods produce identical output (==), and each agrees with the CPU results to within np.allclose.

cdwarner@nid200305:/global/cfs/cdirs/desi/users/cdwarner/code> python compare_redrock_output.py /pscratch/sd/c/cdwarner/abhijeet_bvls_cpu_3.fits /pscratch/sd/c/cdwarner/abhijeet_bvls_gpu_3.fits
col='Z'  ndiff=425  std=4.280722910930986e-14
col='ZERR'  ndiff=446  std=5.190012193925673e-13
col='CHI2'  ndiff=438  std=1.293574168179979e-08
isEqual=False
isClose=True
cdwarner@nid200305:/global/cfs/cdirs/desi/users/cdwarner/code> python compare_redrock_output.py /pscratch/sd/c/cdwarner/abhijeet_bvls_cpu_3.fits /pscratch/sd/c/cdwarner/abhijeet_nnls_gpu_3.fits
col='Z'  ndiff=425  std=4.280722910930986e-14
col='ZERR'  ndiff=446  std=5.190012193925673e-13
col='CHI2'  ndiff=438  std=1.293574168179979e-08
isEqual=False
isClose=True
cdwarner@nid200305:/global/cfs/cdirs/desi/users/cdwarner/code> python compare_redrock_output.py /pscratch/sd/c/cdwarner/abhijeet_bvls_cpu_3.fits /pscratch/sd/c/cdwarner/abhijeet_bvls_gpu_cpu_1.fits
col='Z'  ndiff=425  std=4.280722910930986e-14
col='ZERR'  ndiff=446  std=5.190012193925673e-13
col='CHI2'  ndiff=438  std=1.293574168179979e-08
isEqual=False
isClose=True
cdwarner@nid200305:/global/cfs/cdirs/desi/users/cdwarner/code> python compare_redrock_output.py /pscratch/sd/c/cdwarner/abhijeet_bvls_cpu_3.fits /pscratch/sd/c/cdwarner/abhijeet_nnls_cpu_3.fits
col='Z'  ndiff=0  std=0.0
col='ZERR'  ndiff=0  std=0.0
col='CHI2'  ndiff=349  std=5.094350830929183e-11
isEqual=False
isClose=True
cdwarner@nid200305:/global/cfs/cdirs/desi/users/cdwarner/code> python compare_redrock_output.py /pscratch/sd/c/cdwarner/abhijeet_bvls_gpu_3.fits /pscratch/sd/c/cdwarner/abhijeet_nnls_gpu_3.fits
col='Z'  ndiff=0  std=0.0
col='ZERR'  ndiff=0  std=0.0
col='CHI2'  ndiff=0  std=0.0
isEqual=True
isClose=True
cdwarner@nid200305:/global/cfs/cdirs/desi/users/cdwarner/code> python compare_redrock_output.py /pscratch/sd/c/cdwarner/abhijeet_bvls_gpu_3.fits /pscratch/sd/c/cdwarner/abhijeet_bvls_gpu_cpu_1.fits
col='Z'  ndiff=0  std=0.0
col='ZERR'  ndiff=0  std=0.0
col='CHI2'  ndiff=0  std=0.0
isEqual=True
isClose=True
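
The comparison script itself is not part of this description. The sketch below shows one way such a check could be written with NumPy, assuming ndiff counts elements that differ and std is the standard deviation of the element-wise difference; compare_columns and the toy tables are hypothetical, and real use would read the Z, ZERR, and CHI2 columns from the two FITS files.

```python
import numpy as np

def compare_columns(a, b, cols=("Z", "ZERR", "CHI2")):
    """Report exact and approximate agreement of selected columns."""
    is_equal = True
    is_close = True
    for col in cols:
        diff = np.asarray(a[col]) - np.asarray(b[col])
        ndiff = int(np.count_nonzero(diff))          # elements that differ at all
        print(f"col={col!r}  ndiff={ndiff}  std={diff.std()}")
        is_equal = is_equal and ndiff == 0
        is_close = is_close and bool(np.allclose(a[col], b[col]))
    print(f"isEqual={is_equal}")
    print(f"isClose={is_close}")
    return is_equal, is_close

# Toy stand-ins for two redshift tables that differ only at round-off level.
cpu = {"Z": np.array([0.5, 1.2, 2.3]),
       "ZERR": np.array([1e-4, 2e-4, 3e-4]),
       "CHI2": np.array([9000.0, 8500.0, 9100.0])}
gpu = {"Z": cpu["Z"] + np.array([1e-14, 0.0, 0.0]),
       "ZERR": cpu["ZERR"].copy(),
       "CHI2": cpu["CHI2"] + np.array([0.0, 1e-8, 0.0])}

result = compare_columns(cpu, gpu)
```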

craigwarner-ufastro and others added 10 commits February 27, 2024 13:48
vector (in tdata) so that NNLS can be used instead of BVLS.

Fixed bug when ValueError excepted in rebinning
negative legendre terms.  Refactored a little to clean up.

For archetypes with legendre terms, NNLS is now actually used with
additional terms of -1*each legendre term so that we can use the faster
NNLS to get the same result as BVLS.
because on CPU, the additional computation time for the larger array
sizes dominates any time savings of NNLS vs BVLS
negative legendre terms on the GPU in lieu of BVLS into
per_camera_coeff_with_least_square_batch.  Moved prior_on_coeffs into
zscan.py, called from within per_camera_coeff_with_least_square_batch.
Pass prior_sigma, a scalar, instead of the array from fitz to archetypes.

Bug fix on case where only some object types, not all, have an
archetype.
terms with 0 coefficients and make the coeffs negative for negative
terms.

Refactored to remove the last bit of the BVLS-NNLS trick from
archetypes.py and zfind.py by passing binned instead of tdata to
per_camera_coeff_with_least_square_batch.
involve using a list of bands instead of ncam to be more general.
because results were different and timing as well.
…re_nnls_merge_with_main

Resolved conflicts
@coveralls

coveralls commented Jul 26, 2024

Coverage Status

coverage: 35.804% (-2.8%) from 38.554%
when pulling a55c2ec on gpu_bvls
into 0428a3a on main.

entered in redrock test data but bug was found when developing a unit
test prior to submitting to CuPy.
few thousand random tests and caused a mismatch in array dimensions.

3 participants