From 2e2525f8c463185497329b981953eadd9e282fcb Mon Sep 17 00:00:00 2001
From: skirui-source
Date: Wed, 24 Jan 2024 02:13:53 -0800
Subject: [PATCH] add missing links

---
 .../xgboost-dask-databricks/notebook.ipynb | 24 +++++++++----------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/source/examples/xgboost-dask-databricks/notebook.ipynb b/source/examples/xgboost-dask-databricks/notebook.ipynb
index 2a55bd32..ef42a5cf 100644
--- a/source/examples/xgboost-dask-databricks/notebook.ipynb
+++ b/source/examples/xgboost-dask-databricks/notebook.ipynb
@@ -36,7 +36,7 @@
     "\n",
     "This notebook shows how to deploy Dask RAPIDS workflow in Databricks. We will focus on the HIGGS dataset, a moderately sized classification problem from the [UCI Machine Learning repository.](https://archive.ics.uci.edu/dataset/280/higgs)\n",
     "\n",
-    "In the following sections, we will begin by loading the dataset from Delta Lake and performing preprocessing with Dask. Then train an XGBoost model with various configurations and explore techniques for optimizing inference.\n"
+    "In the following sections, we will begin by loading the dataset from Delta Lake and performing preprocessing with [Dask](https://github.com/dask/dask). Then we will train an [XGBoost](https://xgboost.readthedocs.io/en/stable/) model with various configurations and explore techniques for optimizing inference.\n"
    ]
   },
   {
@@ -65,10 +65,10 @@
     "\n",
     "This workflow example can be ran on GPU, and you don't even need to have the GPU locally since Databricks can provide one for you. Whereas Dask enables users to easily distribute or scale up computation tasks within a single GPU or across multiple GPUs.\n",
     "\n",
-    "Dask recently introduced [**dask-databricks**](https://github.com/dask-contrib/dask-databricks) (available via [conda](https://github.com/conda-forge/dask-databricks-feedstock) and [pip](https://pypi.org/project/dask-databricks/)). With this CLI tool, the `dask databricks run --cuda` command will launch a Dask scheduler in the driver node and `cuda` workers in the remaining nodes.\n",
+    "Dask recently introduced [**dask-databricks**](https://github.com/dask-contrib/dask-databricks) (available via [conda](https://github.com/conda-forge/dask-databricks-feedstock) and [pip](https://pypi.org/project/dask-databricks/)). With this CLI tool, the `dask databricks run --cuda` command will launch a Dask scheduler in the driver node and [`cuda` workers](https://dask-cuda.readthedocs.io/en/stable/worker.html) in the remaining nodes.\n",
     "\n",
     "From a high level, we could break down this section into the following steps:\n",
-    "* Create a new init script that installs RAPIDS and runs `dask-databricks`\n",
+    "* Create a new [init script](https://docs.databricks.com/en/init-scripts/index.html) that installs [RAPIDS](https://rapids.ai/) and runs `dask-databricks`\n",
     "* Create a new multi-node cluster that uses the init script\n",
     "* Once the cluster is running upload this notebook to Databricks and continue running these cells on there\n",
     "\n",
@@ -453,7 +453,7 @@
     "\n",
     "## Download dataset\n",
     "\n",
-    "First we download the dataset to Databrick File Storage (DBFS). Alternatively, you could also use cloud storage (S3, Google Cloud, Azure Data Lake). \n",
+    "First we download the dataset to Databricks File System (DBFS). Alternatively, you could also use cloud storage ([S3](https://aws.amazon.com/s3/), [Google Cloud](https://cloud.google.com/storage?hl=en), [Azure Data Lake](https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction)). \n",
     "Refer to [docs](https://docs.databricks.com/en/storage/index.html#:~:text=Databricks%20uses%20cloud%20object%20storage,storage%20locations%20in%20your%20account.) for more information\n"
    ]
   },
   {
@@ -530,12 +530,12 @@
    "source": [
     "## Integrating Dask and Delta Lake\n",
     "\n",
-    "[Delta Lake](https://docs.databricks.com/en/delta/index.html) is an optimized storage layer within the Databricks lakehouse that provides a foundational platform for storing data and tables. This open-source software extends Parquet data files by incorporating a file-based transaction log to support [ACID transactions](https://docs.databricks.com/en/lakehouse/acid.html) and scalable metadata handling. \n",
+    "[**Delta Lake**](https://docs.databricks.com/en/delta/index.html) is an optimized storage layer within the Databricks lakehouse that provides a foundational platform for storing data and tables. This open-source software extends Parquet data files by incorporating a file-based transaction log to support [ACID transactions](https://docs.databricks.com/en/lakehouse/acid.html) and scalable metadata handling. \n",
     "\n",
     "Delta Lake is the default storage format for all operations on Databricks, i.e (unless otherwise specified, all tables on Databricks are Delta tables). \n",
-    "Check out [tutorial](https://docs.databricks.com/en/delta/tutorial.html) for examples with basic Delta Lake operations.\n",
+    "Check out [tutorial for examples with basic Delta Lake operations](https://docs.databricks.com/en/delta/tutorial.html).\n",
     "\n",
-    "Let's explore step-by-step how we can leverage Data Lake tables with Dask to accelerate data pre-processing with RAPIDS."
+    "Let's explore step-by-step how we can leverage Delta Lake tables with Dask to accelerate data pre-processing with RAPIDS."
    ]
   },
   {
@@ -555,7 +555,7 @@
    "source": [
     "## Read from Delta table with Dask\n",
     "\n",
-    "With Dask's [**dask-deltatable**](https://github.com/dask-contrib/dask-deltatable/tree/main), we can write the `.csv` file into a Delta table using spark then read and parallelize with Dask. "
+    "With Dask's [**dask-deltatable**](https://github.com/dask-contrib/dask-deltatable/tree/main), we can write the `.csv` file into a Delta table using [**Spark**](https://spark.apache.org/docs/latest/), then read and parallelize with [**Dask**](https://docs.dask.org/en/stable/). "
    ]
   },
   {
@@ -783,7 +783,7 @@
    }
   },
   "source": [
-    "Calling `dask_deltalake.read_deltalake()` will return a `dask dataframe`. However, our objective is to utilize GPU acceleration for the entire ML pipeline, including data processing, model training and inference. For this reason, we will read the dask dataframe into a `cUDF dask-dataframe` using `dask_cudf.from_dask_dataframe()`\n",
+    "Calling `dask_deltalake.read_deltalake()` will return a `dask dataframe`. However, our objective is to utilize GPU acceleration for the entire ML pipeline, including data processing, model training and inference. For this reason, we will read the dask dataframe into a `cuDF dask-dataframe` using `dask_cudf.from_dask_dataframe()`\n",
    "\n",
    "**Note** that these operations will automatically leverage the Dask client we created, ensuring optimal performance boost through parallelism with dask."
    ]
   },
   {
@@ -1225,7 +1225,7 @@
    "source": [
     "## Split data\n",
     "\n",
-    "In the preceding step, we used `dask-cudf` for loading data from the Delta table's, now use `train_test_split()` function from `dask-ml` to split up the dataset. \n",
+    "In the preceding step, we used [`dask-cudf`](https://docs.rapids.ai/api/dask-cudf/stable/) to load data from the Delta table; now we use the `train_test_split()` function from [`dask-ml`](https://ml.dask.org/modules/api.html) to split up the dataset. \n",
     "\n",
     "Most of the time, the GPU backend of Dask works seamlessly with utilities in `dask-ml` and we can accelerate the entire ML pipeline as such: \n"
    ]
@@ -1554,7 +1554,7 @@
    "source": [
     "## Model training\n",
     "\n",
-    "There are two things to notice here. Firstly, we specify the number of rounds to trigger early stopping for training. XGBoost will stop the training process once the validation metric fails to improve in consecutive X rounds, where **X** is the number of rounds specified for early \n",
+    "There are two things to notice here. Firstly, we specify the number of rounds to trigger early stopping for training. [XGBoost](https://xgboost.readthedocs.io/en/release_1.7.0/) will stop the training process once the validation metric fails to improve for **X** consecutive rounds, where **X** is the number of rounds specified for early \n",
     "stopping. \n",
     "\n",
     "Secondly, we use a data type called `DaskDeviceQuantileDMatrix` for training but `DaskDMatrix` for validation. `DaskDeviceQuantileDMatrix` is a drop-in replacement of `DaskDMatrix` for GPU-based training inputs that avoids extra data copies."
@@ -1848,7 +1848,7 @@
     "\n",
     "When finished, be sure to destroy your cluster to avoid incurring extra costs for idle resources.\n",
     "\n",
-    "If you forget to destroy the cluster manually, it's important to note that Databricks clusters will automatically time out after a period (specified during cluster creation)."
+    "**Note**: If you forget to destroy the cluster manually, Databricks clusters will automatically time out after a period (specified during cluster creation)."
    ]
   },
   {
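For reference, the pipeline these notebook cells describe can be sketched end to end with the libraries linked above. This is a minimal sketch rather than the notebook's exact code: the Delta table path, the `label` column name, and the training parameters are placeholders, and it assumes a Dask client obtained with `dask_databricks.get_client()` on a cluster started via `dask databricks run --cuda`.

```python
import dask_databricks
import dask_deltatable as ddt
import dask_cudf
import xgboost as xgb
from dask_ml.model_selection import train_test_split

# Connect to the Dask cluster launched by `dask databricks run --cuda`
client = dask_databricks.get_client()

# Read the Delta table into a Dask DataFrame, then move it to GPU memory with dask_cudf
# ("/dbfs/higgs_delta_table" and the "label" column are placeholder names)
ddf = ddt.read_deltalake("/dbfs/higgs_delta_table")
gdf = dask_cudf.from_dask_dataframe(ddf)

# Split into train/test sets with dask-ml
X, y = gdf.drop(columns=["label"]), gdf["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

# DaskDeviceQuantileDMatrix avoids extra data copies for GPU-based training inputs;
# the validation set uses the regular DaskDMatrix
dtrain = xgb.dask.DaskDeviceQuantileDMatrix(client, X_train, y_train)
dvalid = xgb.dask.DaskDMatrix(client, X_test, y_test)

# Train on the GPU workers with early stopping on the validation metric
output = xgb.dask.train(
    client,
    {"objective": "binary:logistic", "tree_method": "gpu_hist", "eval_metric": "auc"},
    dtrain,
    num_boost_round=1000,
    evals=[(dvalid, "valid")],
    early_stopping_rounds=10,
)
booster = output["booster"]
```

Here `DaskDeviceQuantileDMatrix` wraps only the training data, mirroring the note above that it is a drop-in replacement for `DaskDMatrix` that avoids extra data copies on GPU inputs.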