From 747f905c432d9183a0f1b9951e7e94c0d4079f8b Mon Sep 17 00:00:00 2001
From: Tri Dao
Date: Fri, 7 Oct 2022 13:07:10 -0700
Subject: [PATCH] Add details about CUDA extensions

---
 README.md | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index c13c184..f7167b0 100644
--- a/README.md
+++ b/README.md
@@ -1,12 +1,14 @@
 We use the template from `https://github.com/ashleve/lightning-hydra-template`.
 Please read the instructions there to understand the repo structure.
 
+## GPT2 training
 To train GPT2 on Openwebtext with 8 GPUs:
 ```sh
 python run.py experiment=owt/gpt2s-flash trainer.devices=8
 python run.py experiment=owt/gpt2m-flash trainer.devices=8
 python run.py experiment=owt/gpt2l-flash trainer.devices=8
 ```
+To train with bf16 instead of fp16, add `trainer.precision=bf16`.
 
 ## Requirements
 
@@ -15,10 +17,17 @@ We recommend CUDA 11.8 (e.g., using the Nvidia's Pytorch Docker image from https
 
 We provide a Dockerfile that lists all the required packages.
 
-To install the CUDA extensions:
+This repo includes the following CUDA extensions:
+1. Fused dropout + residual + LayerNorm, adapted from Apex's [FastLayerNorm](https://github.com/NVIDIA/apex/tree/master/apex/contrib/layer_norm).
 ```sh
-cd csrc/xentropy && pip install .
 cd csrc/layer_norm && pip install .
+```
+2. Fused matmul + bias (forward and backward), and fused matmul + bias + gelu
+(forward and backward), adapted from Apex's [FusedDense](https://github.com/NVIDIA/apex/tree/master/apex/fused_dense).
+```sh
 cd csrc/fused_dense_lib && pip install .
-cd csrc/cauchy && pip install .
+```
+3. Optimized cross-entropy loss, adapted from Apex's [Xentropy](https://github.com/NVIDIA/apex/tree/master/apex/contrib/xentropy).
+```sh
+cd csrc/xentropy && pip install .
 ```
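
Note on extension 1: the patch describes it as fusing dropout + residual + LayerNorm into one kernel. For reference, a minimal unfused PyTorch sketch of that pattern is below; the function name, shapes, and dropout probability are illustrative and not the extension's actual API.
```python
import torch
import torch.nn.functional as F

def dropout_add_layer_norm_ref(x, residual, weight, bias, p=0.1, training=True):
    """Unfused reference for the pattern the extension fuses:
    LayerNorm(dropout(x) + residual), run here as three separate ops."""
    out = F.dropout(x, p=p, training=training) + residual
    return F.layer_norm(out, (x.shape[-1],), weight, bias)

# Illustrative shapes (batch, seqlen, hidden)
x = torch.randn(8, 512, 768)
residual = torch.randn(8, 512, 768)
w, b = torch.ones(768), torch.zeros(768)
y = dropout_add_layer_norm_ref(x, residual, w, b)
```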
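
Note on extension 2: the fused matmul + bias (+ gelu) kernels cover the standard MLP projection. A sketch of the unfused baseline, again with illustrative names and shapes:
```python
import torch
import torch.nn.functional as F

def fused_dense_gelu_ref(x, weight, bias):
    """Unfused reference for fused matmul + bias + gelu:
    a GEMM followed by separate bias-add and GELU kernels."""
    return F.gelu(F.linear(x, weight, bias))

# Illustrative GPT2-style MLP shapes: hidden 768, intermediate 3072
x = torch.randn(8, 512, 768)
w, b = torch.randn(3072, 768), torch.randn(3072)
y = fused_dense_gelu_ref(x, w, b)  # shape (8, 512, 3072)
```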
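
Note on extension 3: the optimized cross-entropy loss targets the softmax + negative-log-likelihood computation over a large vocabulary. The plain PyTorch baseline it speeds up is simply:
```python
import torch
import torch.nn.functional as F

# Unfused baseline: cross-entropy over a GPT2-sized vocabulary (50257).
# Shapes are illustrative: 8 sequences of 512 tokens, flattened.
logits = torch.randn(8 * 512, 50257)
labels = torch.randint(0, 50257, (8 * 512,))
loss = F.cross_entropy(logits, labels)
```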