diff --git a/docs/user-guide/KubernetesCurator.rst b/docs/user-guide/KubernetesCurator.rst
index b3647271..57ad3b89 100644
--- a/docs/user-guide/KubernetesCurator.rst
+++ b/docs/user-guide/KubernetesCurator.rst
@@ -125,9 +125,9 @@ use ``kubectl cp``, but ``exec`` has fewer surprises regarding compressed files:
     # Replace <...> with a path on your local machine
     LOCAL_WORKSPACE=<...>
 
-    # This copies $LOCAL_WORKSPACE/big_english to /big_english within the PVC.
-    # Change foobar to the directory or file you wish to upload.
-    ( cd $LOCAL_WORKSPACE; tar cf - big_english | kubectl exec -i nemo-workspace-busybox -- tar xf - -C /nemo-workspace )
+    # This copies $LOCAL_WORKSPACE/my_dataset to /my_dataset within the PVC.
+    # Change my_dataset to the directory or file you wish to upload.
+    ( cd $LOCAL_WORKSPACE; tar cf - my_dataset | kubectl exec -i nemo-workspace-busybox -- tar xf - -C /nemo-workspace )
 
 .. note::
     See :ref:`data-curator-download` for an example of how to download local data that can be uploaded to the PVC
 
@@ -186,178 +186,180 @@ After creating a cluster, you should be able to proceed after confirming the sch
 
     # rapids-dask-default-worker-f8ff963886-5577fff76b-qmvcd   1/1   Running   3 (52m ago)   57m
     # rapids-dask-scheduler-654799869d-9bw4z                   1/1   Running   0             57m
 
-.. tab-set::
+(Opt #1) Running Existing Module
+--------------------------------
 
-    .. tab-item:: Running Existing Module
 
+Here is an example of running the existing ``gpu_exact_dedup`` Curator module. The arguments and script name
+will need to be changed according to the module you wish to run:
 
-        Here is an example of running the existing `gpu_exact_dedup` Curator module. The arguments and script name
-        will need to be changed according to the module you wish to run:
+.. code-block:: bash
 
-        .. code-block:: bash
-
-            # Set DASK_CLUSTER_NAME to the value of --name
-            DASK_CLUSTER_NAME=rapids-dask
-            SCHEDULER_POD=$(kubectl get pods -l "dask.org/cluster-name=$DASK_CLUSTER_NAME,dask.org/component=scheduler" -o name)
-            # Starts an interactive shell session in the scheduler pod
-            kubectl exec -it $SCHEDULER_POD -- bash
-
-            ########################
-            # Inside SCHEDULER_POD #
-            ########################
-            # Run the following inside the interactive shell to launch script in the background and
-            # tee the logs to the /nemo-workspace PVC that was mounted in for persistence.
-            # The command line flags will need to be replaced with whatever the module script accepts.
-            # Recall that the PVC is mounted at /nemo-workspace, so any outputs should be written
-            # to somewhere under /nemo-workspace.
-
-            mkdir -p /nemo-workspace/curator/{output,log,profile}
-            # Write logs to script.log and to a log file with a date suffix
-            LOGS="/nemo-workspace/curator/script.log /nemo-workspace/curator/script.log.$(date +%y_%m_%d-%H-%M-%S)"
-            (
-                echo "Writing to: $LOGS"
-                gpu_exact_dedup \
-                    --input-data-dirs /nemo-workspace/big_english \
-                    --output-dir /nemo-workspace/curator/output \
-                    --hash-method md5 \
-                    --log-dir /nemo-workspace/curator/log \
-                    --num-files -1 \
-                    --files-per-partition 1 \
-                    --profile-path /nemo-workspace/curator/profile \
-                    --log-frequency 250 \
-                    --scheduler-address localhost:8786 \
-                    2>&1
-                echo "Finished!"
-            ) | tee $LOGS &
-
-            # At this point, feel free to disconnect the shell via Ctrl+D or simply
-            exit
-
-        At this point you can tail the logs and look for ``Finished!`` in ``/nemo-workspace/curator/script.log``:
-
-        .. code-block:: bash
-
-            # Command will follow the logs of the running module (Press ctrl+C to close)
-            kubectl exec -it $SCHEDULER_POD -- tail -f /nemo-workspace/curator/script.log
-
-            # Writing to: /nemo-workspace/curator/script.log /nemo-workspace/curator/script.log.24_03_27-15-52-31
-            # Computing hashes for /nemo-workspace/big_english
-            #    adlr_id                _hashes
-            # 0  cc-2023-14-0397113620  91b77eae49c10a65d485ac8ca18d6c43
-            # 1  cc-2023-14-0397113621  a266f0794cc8ffbd431823e6930e4f80
-            # 2  cc-2023-14-0397113622  baee533e2eddae764de2cd6faaa1286c
-            # 3  cc-2023-14-0397113623  87dd52a468448b99078f97e76f528eab
-            # 4  cc-2023-14-0397113624  a17664daf4f24be58e0e3a3dcf81124a
-            # Finished!
-
-
-    .. tab-item:: Running Custom Module
-
-        In this example, we'll demonstrate how to run a NeMo Curator module that you have defined locally.
-
-        Since your curator module may depend on version of the Curator that differs from what is in the
-        container, we will need to build a custom image with your code installed:
-
-        .. code-block:: bash
-
-            # Clone your repo. This example uses the official repo
-            git clone https://github.com/NVIDIA/NeMo-Curator.git NeMo-Curator-dev
-
-            # Checkout specific ref. This example uses a commit in the main branch
-            git -C NeMo-Curator-dev checkout fc167a6edffd38a55c333742972a5a25b901cb26
-
-            # Example NeMo base image. Change it according to your requirements
-            BASE_IMAGE=nvcr.io/nvidian/bignlp-train:nemofw-nightly
-            docker build -t nemo-curator-custom ./NeMo-Curator-dev -f - <<EOF
-            FROM ${BASE_IMAGE}
-            COPY . /NeMo-Curator-dev
-            RUN pip install /NeMo-Curator-dev
-            EOF
-
-            # Push the image to your registry. Fill in <registry>/<image>:<tag> accordingly
-            docker tag nemo-curator-custom <registry>/<image>:<tag>
-            docker push <registry>/<image>:<tag>
-
-        .. note::
-            When using a custom image, you will likely need to create a different secret unless you pushed to a public registry:
-
-            .. code-block:: bash
-
-                # Fill in <registry>/<username>/<password>
-                kubectl create secret docker-registry my-private-registry --docker-server=<registry> --docker-username=<username> --docker-password=<password>
-
-            And with this new secret, you create your new dask cluster:
-
-            .. code-block:: bash
-
-                # Fill in <registry>/<image>/<tag>
-                python create_dask_cluster.py \
-                    --name rapids-dask \
-                    --n_workers 2 \
-                    --n_gpus_per_worker 1 \
-                    --image <registry>/<image>:<tag> \
-                    --image_pull_secret my-private-registry \
-                    --pvcs nemo-workspace:/nemo-workspace
-
-        After the Dask cluster is deployed, you can proceed to run your module. In this example we'll use
-        the ``NeMo-Curator/nemo_curator/scripts/find_exact_duplicates.py`` module, but you can find other templates
-        in `NeMo-Curator/examples <https://github.com/NVIDIA/NeMo-Curator/tree/main/examples>`__:
-
-        .. code-block:: bash
-
-            # Set DASK_CLUSTER_NAME to the value of --name
-            DASK_CLUSTER_NAME=rapids-dask
-            SCHEDULER_POD=$(kubectl get pods -l "dask.org/cluster-name=$DASK_CLUSTER_NAME,dask.org/component=scheduler" -o name)
-            # Starts an interactive shell session in the scheduler pod
-            kubectl exec -it $SCHEDULER_POD -- bash
-
-            ########################
-            # Inside SCHEDULER_POD #
-            ########################
-            # Run the following inside the interactive shell to launch script in the background and
-            # tee the logs to the /nemo-workspace PVC that was mounted in for persistence.
-            # The command line flags will need to be replaced with whatever the module script accepts.
-            # Recall that the PVC is mounted at /nemo-workspace, so any outputs should be written
-            # to somewhere under /nemo-workspace.
- - mkdir -p /nemo-workspace/curator/{output,log,profile} - # Append logs to script.log and write to a log file with a date suffix - LOGS="/nemo-workspace/curator/script.log /nemo-workspace/curator/script.log.$(date +%y_%m_%d-%H-%M-%S)" - ( - echo "Writing to: $LOGS" - # Recall that /NeMo-Curator-dev was copied and installed in the Dockerfile above - python3 -u /NeMo-Curator-dev/nemo_curator/scripts/find_exact_duplicates.py \ - --input-data-dirs /nemo-workspace/big_english \ - --output-dir /nemo-workspace/curator/output \ - --hash-method md5 \ - --log-dir /nemo-workspace/curator/log \ - --files-per-partition 1 \ - --profile-path /nemo-workspace/curator/profile \ - --log-frequency 250 \ - --scheduler-address localhost:8786 \ - 2>&1 - echo "Finished!" - ) | tee $LOGS & - - # At this point, feel free to disconnect the shell via Ctrl+D or simply - exit - - At this point you can tail the logs and look for ``Finished!`` in ``/nemo-workspace/curator/script.log``: - - .. code-block:: bash + # Set DASK_CLUSTER_NAME to the value of --name + DASK_CLUSTER_NAME=rapids-dask + SCHEDULER_POD=$(kubectl get pods -l "dask.org/cluster-name=$DASK_CLUSTER_NAME,dask.org/component=scheduler" -o name) + # Starts an interactive shell session in the scheduler pod + kubectl exec -it $SCHEDULER_POD -- bash + + ######################## + # Inside SCHEDULER_POD # + ######################## + # Run the following inside the interactive shell to launch script in the background and + # tee the logs to the /nemo-workspace PVC that was mounted in for persistence. + # The command line flags will need to be replaced with whatever the module script accepts. + # Recall that the PVC is mounted at /nemo-workspace, so any outputs should be written + # to somewhere under /nemo-workspace. + + mkdir -p /nemo-workspace/curator/{output,log,profile} + # Write logs to script.log and to a log file with a date suffix + LOGS="/nemo-workspace/curator/script.log /nemo-workspace/curator/script.log.$(date +%y_%m_%d-%H-%M-%S)" + ( + echo "Writing to: $LOGS" + gpu_exact_dedup \ + --input-data-dirs /nemo-workspace/my_dataset \ + --output-dir /nemo-workspace/curator/output \ + --hash-method md5 \ + --log-dir /nemo-workspace/curator/log \ + --num-files -1 \ + --files-per-partition 1 \ + --profile-path /nemo-workspace/curator/profile \ + --log-frequency 250 \ + --scheduler-address localhost:8786 \ + 2>&1 + echo "Finished!" + ) | tee $LOGS & + + # At this point, feel free to disconnect the shell via Ctrl+D or simply + exit + +At this point you can tail the logs and look for ``Finished!`` in ``/nemo-workspace/curator/script.log``: + +.. code-block:: bash + + # Command will follow the logs of the running module (Press ctrl+C to close) + kubectl exec -it $SCHEDULER_POD -- tail -f /nemo-workspace/curator/script.log + + # Writing to: /nemo-workspace/curator/script.log /nemo-workspace/curator/script.log.24_03_27-15-52-31 + # Computing hashes for /nemo-workspace/my_dataset + # adlr_id _hashes + # 0 cc-2023-14-0397113620 91b77eae49c10a65d485ac8ca18d6c43 + # 1 cc-2023-14-0397113621 a266f0794cc8ffbd431823e6930e4f80 + # 2 cc-2023-14-0397113622 baee533e2eddae764de2cd6faaa1286c + # 3 cc-2023-14-0397113623 87dd52a468448b99078f97e76f528eab + # 4 cc-2023-14-0397113624 a17664daf4f24be58e0e3a3dcf81124a + # Finished! + + +(Opt #2) Running Custom Module +------------------------------ + +In this example, we'll demonstrate how to run a NeMo Curator module that you have defined locally. 
+
+Since your Curator module may depend on a version of NeMo Curator that differs from what is in the
+container, we will need to build a custom image with your code installed:
+
+.. code-block:: bash
+
+    # Clone your repo. This example uses the official repo
+    git clone https://github.com/NVIDIA/NeMo-Curator.git NeMo-Curator-dev
+
+    # Check out a specific ref. This example uses a commit in the main branch
+    git -C NeMo-Curator-dev checkout fc167a6edffd38a55c333742972a5a25b901cb26
+
+    # Example NeMo base image. Change it according to your requirements
+    BASE_IMAGE=nvcr.io/nvidian/bignlp-train:nemofw-nightly
+    docker build -t nemo-curator-custom ./NeMo-Curator-dev -f - <<EOF
+    FROM ${BASE_IMAGE}
+    COPY . /NeMo-Curator-dev
+    RUN pip install /NeMo-Curator-dev
+    EOF
+
+    # Push the image to your registry. Fill in <registry>/<image>:<tag> accordingly
+    docker tag nemo-curator-custom <registry>/<image>:<tag>
+    docker push <registry>/<image>:<tag>
+
+.. note::
+    When using a custom image, you will likely need to create a different secret unless you pushed to a public registry:
+
+    .. code-block:: bash
-
-            # Command will follow the logs of the running module (Press ctrl+C to close)
-            kubectl exec -it $SCHEDULER_POD -- tail -f /nemo-workspace/curator/script.log
+
+        # Fill in <registry>/<username>/<password>
+        kubectl create secret docker-registry my-private-registry --docker-server=<registry> --docker-username=<username> --docker-password=<password>
-
-            # Writing to: /nemo-workspace/curator/script.log /nemo-workspace/curator/script.log.24_03_27-20-52-07
-            # Reading 2 files
-            # /NeMo-Curator-dev/nemo_curator/modules/exact_dedup.py:157: UserWarning: Output path f/nemo-workspace/curator/output/_exact_duplicates.parquet already exists and will be overwritten
-            # warnings.warn(
-            # Finished!
+
+    And with this new secret, you can create your new Dask cluster:
+
+    .. code-block:: bash
+
+        # Fill in <registry>/<image>/<tag>
+        python create_dask_cluster.py \
+            --name rapids-dask \
+            --n_workers 2 \
+            --n_gpus_per_worker 1 \
+            --image <registry>/<image>:<tag> \
+            --image_pull_secret my-private-registry \
+            --pvcs nemo-workspace:/nemo-workspace
+
+After the Dask cluster is deployed, you can proceed to run your module. In this example, we'll use
+the ``NeMo-Curator/nemo_curator/scripts/find_exact_duplicates.py`` module, but you can find other templates
+in `NeMo-Curator/examples <https://github.com/NVIDIA/NeMo-Curator/tree/main/examples>`__:
+
+.. code-block:: bash
+
+    # Set DASK_CLUSTER_NAME to the value of --name
+    DASK_CLUSTER_NAME=rapids-dask
+    SCHEDULER_POD=$(kubectl get pods -l "dask.org/cluster-name=$DASK_CLUSTER_NAME,dask.org/component=scheduler" -o name)
+    # Starts an interactive shell session in the scheduler pod
+    kubectl exec -it $SCHEDULER_POD -- bash
+
+    ########################
+    # Inside SCHEDULER_POD #
+    ########################
+    # Run the following inside the interactive shell to launch script in the background and
+    # tee the logs to the /nemo-workspace PVC that was mounted in for persistence.
+    # The command line flags will need to be replaced with whatever the module script accepts.
+    # Recall that the PVC is mounted at /nemo-workspace, so any outputs should be written
+    # to somewhere under /nemo-workspace.
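+
+    # Optional sanity check: confirm the scheduler already sees its workers
+    # before launching. A minimal sketch, assuming the dask.distributed Python
+    # package is importable inside the scheduler pod (it runs the Dask
+    # scheduler itself, so it should be); the Client repr prints a summary of
+    # the connected workers.
+    python3 -c "from dask.distributed import Client; print(Client('localhost:8786'))"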
+
+    mkdir -p /nemo-workspace/curator/{output,log,profile}
+    # Write logs to script.log and to a log file with a date suffix
+    LOGS="/nemo-workspace/curator/script.log /nemo-workspace/curator/script.log.$(date +%y_%m_%d-%H-%M-%S)"
+    (
+        echo "Writing to: $LOGS"
+        # Recall that /NeMo-Curator-dev was copied and installed in the Dockerfile above
+        python3 -u /NeMo-Curator-dev/nemo_curator/scripts/find_exact_duplicates.py \
+            --input-data-dirs /nemo-workspace/my_dataset \
+            --output-dir /nemo-workspace/curator/output \
+            --hash-method md5 \
+            --log-dir /nemo-workspace/curator/log \
+            --files-per-partition 1 \
+            --profile-path /nemo-workspace/curator/profile \
+            --log-frequency 250 \
+            --scheduler-address localhost:8786 \
+            2>&1
+        echo "Finished!"
+    ) | tee $LOGS &
+
+    # At this point, feel free to disconnect the shell via Ctrl+D or simply
+    exit
+
+At this point you can tail the logs and look for ``Finished!`` in ``/nemo-workspace/curator/script.log``:
+
+.. code-block:: bash
+
+    # Command will follow the logs of the running module (Press ctrl+C to close)
+    kubectl exec -it $SCHEDULER_POD -- tail -f /nemo-workspace/curator/script.log
+
+    # Writing to: /nemo-workspace/curator/script.log /nemo-workspace/curator/script.log.24_03_27-20-52-07
+    # Reading 2 files
+    # /NeMo-Curator-dev/nemo_curator/modules/exact_dedup.py:157: UserWarning: Output path f/nemo-workspace/curator/output/_exact_duplicates.parquet already exists and will be overwritten
+    # warnings.warn(
+    # Finished!
 
 Deleting Cluster
 ----------------
 
-After you have finished using the created dask cluster, you can delete it to release the resources::
+After you have finished using the created Dask cluster, you can delete it to release its resources:
+
+.. code-block:: bash
 
     # Where <name> is the flag passed to create_dask_cluster.py. Example: `--name <name>`
     kubectl delete daskcluster <name>
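+
+To verify that the resources were released, you can list any remaining pods for the cluster. This is
+a minimal check, assuming the cluster was created in your current namespace; it reuses the
+``dask.org/cluster-name`` label from the earlier steps:
+
+.. code-block:: bash
+
+    # Once deletion finishes, this should report that no resources were found
+    kubectl get pods -l "dask.org/cluster-name=<name>"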