Add details about CUDA extensions
tridao committed Oct 7, 2022
1 parent f6f82e9 commit 747f905
Showing 1 changed file (README.md) with 12 additions and 3 deletions.
@@ -1,12 +1,14 @@
We use the template from `https://github.com/ashleve/lightning-hydra-template`.
Please read the instructions there to understand the repo structure.

## GPT2 training
To train GPT2 on OpenWebText with 8 GPUs:
```sh
python run.py experiment=owt/gpt2s-flash trainer.devices=8
python run.py experiment=owt/gpt2m-flash trainer.devices=8
python run.py experiment=owt/gpt2l-flash trainer.devices=8
```
To train with bf16 instead of fp16, add `trainer.precision=bf16`.
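
For example, the device-count and precision overrides compose with any of the experiment configs above; a bf16 run of the small model on 8 GPUs should look like:
```sh
# Same run.py entry point as above, with the bf16 override added
python run.py experiment=owt/gpt2s-flash trainer.devices=8 trainer.precision=bf16
```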

## Requirements

@@ -15,10 +17,17 @@ We recommend CUDA 11.8 (e.g., using Nvidia's PyTorch Docker image from https

We provide a Dockerfile that lists all the required packages.
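
As a rough sketch (the image tag and mount path below are illustrative, not defined by the repo), building and entering that image might look like:
```sh
# Build an image from the provided Dockerfile (tag name is illustrative)
docker build -t gpt2-training-env .
# Start a GPU-enabled container with the repo mounted (paths are illustrative)
docker run --gpus all -it -v "$PWD":/workspace gpt2-training-env
```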

This repo includes the following CUDA extensions:
1. Fused dropout + residual + LayerNorm, adapted from Apex's [FastLayerNorm](https://github.com/NVIDIA/apex/tree/master/apex/contrib/layer_norm).
```sh
cd csrc/layer_norm && pip install .
```
2. Fused matmul + bias (forward and backward), and fused matmul + bias + gelu
(forward and backward), adapted from Apex's [FusedDense](https://github.com/NVIDIA/apex/tree/master/apex/fused_dense).
```sh
cd csrc/fused_dense_lib && pip install .
```
3. Optimized cross-entropy loss, adapted from Apex's [Xentropy](https://github.com/NVIDIA/apex/tree/master/apex/contrib/xentropy).
```sh
cd csrc/xentropy && pip install .
```
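
To install all three extensions in one go, a small loop over the same `csrc/` paths should work (a convenience sketch, run from the repo root; not a script shipped with the repo):
```sh
# Install each CUDA extension from its subdirectory under csrc/
for ext in layer_norm fused_dense_lib xentropy; do
  (cd "csrc/$ext" && pip install .)
done
```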
