Improve Dataproc instructions #305

Closed
jacobtomlinson opened this issue Dec 6, 2023 · 2 comments · Fixed by #330
Assignees
Labels
cloud/gcp Google Cloud doc Improvements or additions to documentation

Comments

@jacobtomlinson
Member

jacobtomlinson commented Dec 6, 2023

Following the latest Dataproc docs works, but doesn't give the best experience. We could improve the docs in a few ways.

  1. By default you get RAPIDS 22.10. You can override this with the rapids-version metadata key; we should document this.
...
--metadata gpu-driver-provider=NVIDIA,dask-runtime=$DASK_RUNTIME,rapids-runtime=DASK,rapids-version=23.12\
...
  2. Once things are set up, if you use Jupyter Lab, RAPIDS is not installed in the base environment. The RAPIDS environment also isn't registered as a kernel that you can select in the notebook.

  3. We should show how to connect to the Dask cluster in a notebook and run some work.
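For the last point, a minimal sketch of what a notebook cell could look like. The scheduler address `localhost:8786` is an assumption (8786 is the default Dask scheduler port; the actual address depends on how the Dataproc initialization action starts the scheduler), and the helper names `connect` and `run_sample_work` are hypothetical.

```python
def connect(scheduler_address="localhost:8786"):
    """Create a Dask client for an already running scheduler.

    The default address is an assumption, not taken from the Dataproc docs.
    """
    # Imported lazily so run_sample_work can be tried without dask installed
    from dask.distributed import Client

    return Client(scheduler_address)


def run_sample_work(client):
    """Submit a small computation to the cluster and return its result."""
    # client.submit sends the callable and its arguments to a worker
    # and returns a future; result() blocks until the value is ready
    future = client.submit(sum, range(10))
    return future.result()
```

In a notebook this would be used as `client = connect()` followed by `run_sample_work(client)`. Displaying `client` in a cell also shows the workers the scheduler knows about, which is a quick way to confirm the connection.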

@jacobtomlinson jacobtomlinson added doc Improvements or additions to documentation cloud/gcp Google Cloud labels Dec 6, 2023
@jacobtomlinson jacobtomlinson changed the title from "Improve Databroc instructions" to "Improve Dataproc instructions" Jan 4, 2024
@skirui-source
Contributor

skirui-source commented Feb 8, 2024

UPDATES:

CC @jacobtomlinson

  1. I was able to launch a Dataproc cluster with RAPIDS 23.12 (stable) and CUDA 11.8, both passed via the metadata flags --rapids-version and --cuda-version.

  2. The install_gpu_driver.sh script downloads Ubuntu 18.04 and has outdated CUDA and GPU driver versions: it installs CUDA 11.2 by default and doesn't offer CUDA 12 as an option.

  3. The latest RAPIDS release, 24.02, is only compatible with Ubuntu 20.04 and 22.04, so the install scripts need to be updated accordingly, with newer drivers as well.

  4. To test RAPIDS libraries in the notebook environment, we need to edit the rapids.sh script to activate the conda environment (dask-rapids) and register it as a kernel in Jupyter Lab/Notebook.

  5. For now, users will have to manually conda activate the dask-rapids environment and register it as a kernel from the terminal in Jupyter.

  6. Alternatively, users can set the dataproc:conda.env.config.uri property, which is the absolute path to a Conda environment YAML config file located in Cloud Storage. This file is used to create and activate a new Conda environment on the cluster. However, this option is redundant because you first have to export the conda env to a .yaml file.
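The manual kernel registration from points 4 and 5 might be sketched as follows. This assumes `conda` is on the PATH and that `ipykernel` is installed in the dask-rapids environment; neither is confirmed in this thread. The helper names are hypothetical, and `conda run -n` stands in for an explicit `conda activate`.

```python
import subprocess


def kernel_install_command(env_name="dask-rapids", display_name=None):
    """Build the command that registers a conda env as a Jupyter kernel."""
    display_name = display_name or f"Python ({env_name})"
    return [
        # Run the following command inside the named conda environment
        "conda", "run", "-n", env_name,
        # Register that environment's interpreter as a user-level kernel
        "python", "-m", "ipykernel", "install",
        "--user", "--name", env_name, "--display-name", display_name,
    ]


def register_kernel(env_name="dask-rapids"):
    """Run the registration; raises CalledProcessError if it fails."""
    subprocess.run(kernel_install_command(env_name), check=True)
```

Calling `register_kernel()` from a terminal Python session on the cluster should make the environment selectable in the Jupyter kernel list after a page refresh. The same command could later be moved into rapids.sh if the script is updated.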


@skirui-source
Contributor

skirui-source commented Feb 8, 2024

Refer to #1137; this issue is also being tracked on the Google Cloud Dataproc GitHub.

2 participants