add missing links
skirui-source committed Jan 24, 2024
1 parent 52c1d8e commit 2e2525f
Showing 1 changed file with 12 additions and 12 deletions.
24 changes: 12 additions & 12 deletions source/examples/xgboost-dask-databricks/notebook.ipynb
@@ -36,7 +36,7 @@
"\n",
"This notebook shows how to deploy Dask RAPIDS workflow in Databricks. We will focus on the HIGGS dataset, a moderately sized classification problem from the [UCI Machine Learning repository.](https://archive.ics.uci.edu/dataset/280/higgs)\n",
"\n",
"In the following sections, we will begin by loading the dataset from Delta Lake and performing preprocessing with Dask. Then train an XGBoost model with various configurations and explore techniques for optimizing inference.\n"
"In the following sections, we will begin by loading the dataset from Delta Lake and performing preprocessing with [Dask](https://github.com/dask/dask). Then train an [XGBoost](https://xgboost.readthedocs.io/en/stable/) model with various configurations and explore techniques for optimizing inference.\n"
]
},
{
@@ -65,10 +65,10 @@
"\n",
"This workflow example can be ran on GPU, and you don't even need to have the GPU locally since Databricks can provide one for you. Whereas Dask enables users to easily distribute or scale up computation tasks within a single GPU or across multiple GPUs.\n",
"\n",
"Dask recently introduced [**dask-databricks**](https://github.com/dask-contrib/dask-databricks) (available via [conda](https://github.com/conda-forge/dask-databricks-feedstock) and [pip](https://pypi.org/project/dask-databricks/)). With this CLI tool, the `dask databricks run --cuda` command will launch a Dask scheduler in the driver node and `cuda` workers in the remaining nodes.\n",
"Dask recently introduced [**dask-databricks**](https://github.com/dask-contrib/dask-databricks) (available via [conda](https://github.com/conda-forge/dask-databricks-feedstock) and [pip](https://pypi.org/project/dask-databricks/)). With this CLI tool, the `dask databricks run --cuda` command will launch a Dask scheduler in the driver node and [`cuda` workers](https://dask-cuda.readthedocs.io/en/stable/worker.html) in the remaining nodes.\n",
"\n",
"From a high level, we could break down this section into the following steps:\n",
"* Create a new init script that installs RAPIDS and runs `dask-databricks`\n",
"* Create a new [init script](https://docs.databricks.com/en/init-scripts/index.html) that installs [RAPIDS](https://rapids.ai/) and runs `dask-databricks`\n",
"* Create a new multi-node cluster that uses the init script\n",
"* Once the cluster is running upload this notebook to Databricks and continue running these cells on there\n",
"\n",
@@ -453,7 +453,7 @@
"\n",
"## Download dataset\n",
"\n",
"First we download the dataset to Databrick File Storage (DBFS). Alternatively, you could also use cloud storage (S3, Google Cloud, Azure Data Lake). \n",
"First we download the dataset to Databrick File Storage (DBFS). Alternatively, you could also use cloud storage ([S3](https://aws.amazon.com/s3/), [Google Cloud](https://cloud.google.com/storage?hl=en), [Azure Data Lake](https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction)\n",
"Refer to [docs](https://docs.databricks.com/en/storage/index.html#:~:text=Databricks%20uses%20cloud%20object%20storage,storage%20locations%20in%20your%20account.) for more information\n"
]
},
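A rough sketch of this step. The DBFS target path below is an arbitrary choice, and the UCI download URL is the long-standing one for HIGGS, so verify both before reusing them:

```python
import os
import urllib.request

# Hypothetical DBFS location; any writable path under /dbfs works.
data_dir = "/dbfs/databricks/higgs"
os.makedirs(data_dir, exist_ok=True)

# Download the gzipped HIGGS CSV from the UCI repository (several GB).
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz"
urllib.request.urlretrieve(url, os.path.join(data_dir, "HIGGS.csv.gz"))
```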
@@ -530,12 +530,12 @@
"source": [
"## Integrating Dask and Delta Lake\n",
"\n",
"[Delta Lake](https://docs.databricks.com/en/delta/index.html) is an optimized storage layer within the Databricks lakehouse that provides a foundational platform for storing data and tables. This open-source software extends Parquet data files by incorporating a file-based transaction log to support [ACID transactions](https://docs.databricks.com/en/lakehouse/acid.html) and scalable metadata handling. \n",
"[**Delta Lake**](https://docs.databricks.com/en/delta/index.html) is an optimized storage layer within the Databricks lakehouse that provides a foundational platform for storing data and tables. This open-source software extends Parquet data files by incorporating a file-based transaction log to support [ACID transactions](https://docs.databricks.com/en/lakehouse/acid.html) and scalable metadata handling. \n",
"\n",
"Delta Lake is the default storage format for all operations on Databricks, i.e (unless otherwise specified, all tables on Databricks are Delta tables). \n",
"Check out [tutorial](https://docs.databricks.com/en/delta/tutorial.html) for examples with basic Delta Lake operations.\n",
"Check out [tutorial for examples with basic Delta Lake operations](https://docs.databricks.com/en/delta/tutorial.html).\n",
"\n",
"Let's explore step-by-step how we can leverage Data Lake tables with Dask to accelerate data pre-processing with RAPIDS."
"Let's explore step-by-step how we can leverage Delta Lake tables with Dask to accelerate data pre-processing with RAPIDS."
]
},
{
@@ -555,7 +555,7 @@
"source": [
"## Read from Delta table with Dask\n",
"\n",
"With Dask's [**dask-deltatable**](https://github.com/dask-contrib/dask-deltatable/tree/main), we can write the `.csv` file into a Delta table using spark then read and parallelize with Dask. "
"With Dask's [**dask-deltatable**](https://github.com/dask-contrib/dask-deltatable/tree/main), we can write the `.csv` file into a Delta table using [**Spark**](https://spark.apache.org/docs/latest/) then read and parallelize with [**Dask**](https://docs.dask.org/en/stable/). "
]
},
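A sketch of both halves of that sentence. The table location is hypothetical, the `spark` session is the one Databricks provides in every notebook, and `dask_deltatable` is imported under the `dask_deltalake` alias used in the next cell:

```python
import dask_deltatable as dask_deltalake

delta_path = "/databricks/higgs/delta"  # hypothetical DBFS table location

# Write the downloaded CSV into a Delta table with Spark. Paths without a
# scheme are resolved against the DBFS root by Spark on Databricks.
sdf = spark.read.csv("/databricks/higgs/HIGGS.csv.gz", header=False, inferSchema=True)
sdf.write.format("delta").mode("overwrite").save(delta_path)

# Read the Delta table back as a partitioned Dask DataFrame; dask-deltatable
# reads through the local filesystem, hence the /dbfs FUSE prefix.
ddf = dask_deltalake.read_deltalake("/dbfs" + delta_path)
```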
{
@@ -783,7 +783,7 @@
}
},
"source": [
"Calling `dask_deltalake.read_deltalake()` will return a `dask dataframe`. However, our objective is to utilize GPU acceleration for the entire ML pipeline, including data processing, model training and inference. For this reason, we will read the dask dataframe into a `cUDF dask-dataframe` using `dask_cudf.from_dask_dataframe()`\n",
"Calling `dask_deltalake.read_deltalake()` will return a `dask dataframe`. However, our objective is to utilize GPU acceleration for the entire ML pipeline, including data processing, model training and inference. For this reason, we will read the dask dataframe into a `cuDF dask-dataframe` using `dask_cudf.from_dask_dataframe()`\n",
"\n",
"**Note** that these operations will automatically leverage the Dask client we created, ensuring optimal performance boost through parallelism with dask."
]
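A sketch of that conversion, with `ddf` being the Dask DataFrame returned by `read_deltalake()` above:

```python
import dask_cudf

# Convert the pandas-backed Dask DataFrame into a cuDF-backed one so that
# downstream preprocessing runs on the GPU workers.
gdf = dask_cudf.from_dask_dataframe(ddf)
gdf = gdf.persist()  # optionally materialize the GPU partitions in worker memory
```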
@@ -1225,7 +1225,7 @@
"source": [
"## Split data\n",
"\n",
"In the preceding step, we used `dask-cudf` for loading data from the Delta table's, now use `train_test_split()` function from `dask-ml` to split up the dataset. \n",
"In the preceding step, we used [`dask-cudf`](https://docs.rapids.ai/api/dask-cudf/stable/) for loading data from the Delta table's, now use `train_test_split()` function from [`dask-ml`](https://ml.dask.org/modules/api.html) to split up the dataset. \n",
"\n",
"Most of the time, the GPU backend of Dask works seamlessly with utilities in `dask-ml` and we can accelerate the entire ML pipeline as such: \n"
]
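A hedged sketch of the split; the `label` column name and the 80/20 split are assumptions, not the notebook's exact values:

```python
from dask_ml.model_selection import train_test_split

# Separate features from the target column (column name is an assumption).
X = gdf.drop(columns=["label"])
y = gdf["label"]

# dask-ml splits lazily per partition, so dask-cudf inputs stay on the GPU.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)
```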
@@ -1554,7 +1554,7 @@
"source": [
"## Model training\n",
"\n",
"There are two things to notice here. Firstly, we specify the number of rounds to trigger early stopping for training. XGBoost will stop the training process once the validation metric fails to improve in consecutive X rounds, where **X** is the number of rounds specified for early \n",
"There are two things to notice here. Firstly, we specify the number of rounds to trigger early stopping for training. [XGBoost](https://xgboost.readthedocs.io/en/release_1.7.0/) will stop the training process once the validation metric fails to improve in consecutive X rounds, where **X** is the number of rounds specified for early \n",
"stopping. \n",
"\n",
"Secondly, we use a data type called `DaskDeviceQuantileDMatrix` for training but `DaskDMatrix` for validation. `DaskDeviceQuantileDMatrix` is a drop-in replacement of `DaskDMatrix` for GPU-based training inputs that avoids extra data copies."
@@ -1848,7 +1848,7 @@
"\n",
"When finished, be sure to destroy your cluster to avoid incurring extra costs for idle resources.\n",
"\n",
"If you forget to destroy the cluster manually, it's important to note that Databricks clusters will automatically time out after a period (specified during cluster creation)."
"**Note** If you forget to destroy the cluster manually, it's important to note that Databricks clusters will automatically time out after a period (specified during cluster creation)."
]
},
{