From 2cfc91577ede8526067f29c75a24a95fd7bfd619 Mon Sep 17 00:00:00 2001
From: Binyang Li
Date: Fri, 17 Jan 2025 23:16:42 +0000
Subject: [PATCH] address comments

---
 docs/design/mscclpp-dsl.md                   | 34 +++++++++++---------
 python/examples/allreduce_allpairs.py        |  4 +--
 python/examples/allreduce_allpairs_packet.py |  7 ++--
 3 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/docs/design/mscclpp-dsl.md b/docs/design/mscclpp-dsl.md
index eaa7e724..8d13e172 100644
--- a/docs/design/mscclpp-dsl.md
+++ b/docs/design/mscclpp-dsl.md
@@ -1,6 +1,6 @@
 # MSCCL++ DSL
 ## MSCCLPPLang Introduction
-MSCCLPPLang is a Python moudule for writing high-performance commnunication algorithms. It is designed to be easy to use and efficient, while providing a high-level interface for writing communication algorithms. MSCCLPPLang program will be compiled to json based execution plan, which can be executed by MSCCLPP executor.
+MSCCLPPLang is a Python module for writing high-performance communication algorithms. It is designed to be easy to use and efficient, while providing a high-level interface for writing communication algorithms. A MSCCLPPLang program is compiled to a JSON-based execution plan, which can be executed by the MSCCL++ executor.
 ## How to use MSCCLPPLang
 ### Install mscclpp package
@@ -40,15 +40,15 @@ A MSCCLPPProgram provides the context to write MSCCLPPLang program, which can be
 - `name`: Name of this program.
 - `collective`: Collective type of this program, should be from `mscclpp.language.collectives`.
-- `instances`: Number of parallel instances of this program.
+- `instances`: Number of parallel instances of this program. Please see the [Instance](#instance) section for more details.
 - `protocol`: Data transmission protocol used in this program, can be `LL` or `Simple`. Optional, default is `Simple`.
 - `instr_fusion`: Whether low-level instruction fusion is enabled. Optional, default is `True`.
-- `replication_policy`: Data replication policy, should be from `mscclpp.language.types.ReplicationPolicy`. Optional, default is `duplicated`.
+- `replication_policy`: Data replication policy, should be from `mscclpp.language.types.ReplicationPolicy`. Optional, default is `duplicated`. Please see the [Instance](#instance) section for more details.
 - `num_threads_per_block`: Thread block size. Optional, default is `1024`.
 - `use_double_scratch_buffer`: Whether requires double scratch buffer during execution. Optional, default is `False`.
 ### Collective:
-A collective is a communication operation that involves multiple GPUs. We provide a set of collective operations for users to utilize. For example, the `AllGather` operation gathers data from all GPUs to all GPUs. To instantiate a collective, the user needs to specify the number of ranks, the chunk factor (how many chunks each rank will be split into), and whether the operation is in-place.
+A collective is a communication operation that involves multiple GPUs. We provide a set of collective operations for users to utilize. For example, the `AllGather` operation gathers data from all GPUs to all GPUs. To instantiate a collective, the user needs to specify the number of ranks, the chunk factor (how many chunks the input buffer will be split into), and whether the operation is in-place.
 #### Chunk
 A chunk is a piece of data that is sent between GPUs. It is the basic unit of data in MSCCLPPLang. Chunk can be a piece of data from input buffer, output buffer or intermediate buffer.
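
The sketch below illustrates how the program parameters documented in this hunk and the chunk concept fit together. It is a minimal sketch, not code from this patch: the import paths and the `MSCCLPPProgram` constructor are assumed to match the shipped examples (including an explicit number-of-ranks argument), and the collective, program name, and sizes are made up for illustration.

```python
# Hypothetical sketch of setting up a MSCCLPPLang program context and a chunk.
# Verify the constructor signature and imports against the shipped examples.
from mscclpp.language import *
from mscclpp.language.buffer import Buffer
from mscclpp.language.collectives import AllGather
from mscclpp.language.types import ReplicationPolicy

num_ranks = 8
chunk_factor = 1
# A collective takes the number of ranks, the chunk factor, and whether it is in-place.
collective = AllGather(num_ranks, chunk_factor, False)

with MSCCLPPProgram(
    "allgather_sketch",        # name
    collective,                # from mscclpp.language.collectives
    num_ranks,                 # number of ranks (assumed positional argument)
    instances=2,               # parallel instances; see the Instance section
    protocol="Simple",         # "LL" or "Simple"
    instr_fusion=True,
    replication_policy=ReplicationPolicy.interleaved,
):
    # A chunk is addressed by (rank, buffer, index, size in unit chunks).
    c = chunk(0, Buffer.input, 0, 1)
    # Operations on `c` (put/get/signal/wait/...) would be written here.
    # The shipped examples end by emitting the JSON execution plan.
```
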
@@ -59,7 +59,7 @@ c = chunk(rank, Buffer.input, index, size)
 - rank: the rank of the GPU that the chunk belongs to.
 - buffer: the buffer that the chunk belongs to. It can be Buffer.input, Buffer.output or Buffer.scratch.
 - index: the index of the chunk in the buffer.
-- size: the size of the chunk.
+- size: the number of unit chunks this chunk spans.
 Assume we split the input data in the buffer into 4 chunks. On GPU rank 0, we can retrieve the chunks from indices 0 to 2 using the following command:
 ```python
@@ -74,29 +74,29 @@ The operation can only be applied to the chunks. We provide a set of communicati
 #### Channel
 A channel is a communication channel between two GPUs. It is used to send and receive data between GPUs. We supports three types of channel: `ChannelType.sm`, `ChannelType.proxy` and `ChannelType.nvls`.
-`ChannelType.sm` is used for communication between GPUs on the same node. This channel will using GPU processors to transfer data.
+`ChannelType.sm` is used for communication between GPUs on the same node. This channel uses GPU processors to transfer data.
 `ChannelType.proxy` is used for communication between GPUs, whether they are on different nodes or the same node. This channel will offload the data transfer to CPU processors, which can provide better throughput compared to `ChannelType.sm`. However, this comes at the cost of higher latency compared to `ChannelType.sm`.
-`ChannelType.nvls` is used for communication between GPUs on the same node. This feature offloads the data processing task to the switch, requiring specific hardware support.
+`ChannelType.nvls` is used for communication between GPUs on the same node. This feature offloads the data processing task to the switch, requiring specific hardware support. Refer to the [NVIDIA documentation](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MULTICAST.html) for more details.
 #### Thread Block
 We can assign operations to a thread block. The thread block is a group of threads that are executed together on the GPU. In the operation function, we can specify the thread block that the operation belongs to via `sendtb` or `recvtb` parameter.
 #### Instance
 An instance is a parallel execution of the program. For example, if a collective algorithm is designed to run on `n` chunks with `m` thread blocks, setting the instance to 2 will run the algorithm on `2n` chunks with `2m` thread blocks. Serveral replication policies are supported, including `duplicated` and `interleaved`.
-- `duplicated`: Divide chunks equally among the number of instances. For example, ChunkA and ChunkB are duplicated to ChunkA0, ChunkA1, ChunkB0, and ChunkB1. ChunkA0 and ChunkA1 belong to Instance 0, while ChunkB0 and ChunkB1 belong to Instance 1.
+- `duplicated`: Each chunk is split into smaller parts based on the number of instances, duplicating the same instructions for all parts. For example, ChunkA is split into ChunkA0 and ChunkA1, while ChunkB is split into ChunkB0 and ChunkB1. Both ChunkA0 and ChunkA1 belong to Instance 0, and both ChunkB0 and ChunkB1 belong to Instance 1.
-- `interleaved`: Assign chunks to instances in an interleaved manner. For example, ChunkA and ChunkB are interleaved to ChunkA0, ChunkA1, ChunkB0, and ChunkB1. ChunkA0 and ChunkB0 belong to Instance 0, while ChunkA1 and ChunkB1 belong to Instance 1.
+- `interleaved`: Assign chunks to instances in an interleaved manner. For example, ChunkA and ChunkB are split into ChunkA0, ChunkA1, ChunkB0, and ChunkB1. ChunkA0 and ChunkB0 belong to Instance 0, while ChunkA1 and ChunkB1 belong to Instance 1.
 #### Instruction Fusion
-MSCCLPPLang provides the instruction fusion mechanism to fuse multiple operations into a single kernel. This can reduce the overhead of launching multiple instructions. When user create the MSCCLPPLang program, it can specify the `instr_fusion` parameter to enable the instruction fusion. By default, the instruction fusion is enabled.
+MSCCLPPLang provides the instruction fusion mechanism to fuse multiple operations into a single kernel. This can reduce the overhead of launching multiple instructions. When users create the MSCCLPPLang program, they can specify the `instr_fusion` parameter to enable instruction fusion. By default, instruction fusion is enabled.
 ## MSCCLPPLang APIs
 ### Basic APIs
 - `chunk(rank, buffer, index, size)`: create a chunk.
-- `put(self, dst, chunk, index, sendtb, chan_type)`: send the data from one GPU to another GPU. User can specify the index of the chunk in the destination buffer, the sendtb and the channel type.
-- `get(self, src, chunk, index, recvtb, chan_type)`: receive the data from another GPU. User can specify the index of the chunk in the destination buffer, the recvtb and the channel type.
+- `put(self, dst, buffer, index, sendtb, chan_type)`: send the data from one GPU to another GPU. The user can specify the index of the chunk in the destination buffer, the sendtb, and the channel type.
+- `get(self, src, buffer, index, recvtb, chan_type)`: receive the data from another GPU. The user can specify the index of the chunk in the destination buffer, the recvtb, and the channel type.
 - `signal(self, dst, buffer, index, sendtb, chan_type)`: send a signal to another GPU.
 - `wait(self, src, buffer, index, recvtb, chan_type)`: wait for a signal from another GPU.
 - `flush(self, dst, buffer, index, sendtb, chan_type)`: flush the data in the buffer to the destination GPU. This is used to make sure the data is sent to the destination GPU.
@@ -104,7 +104,11 @@ MSCCLPPLang provides the instruction fusion mechanism to fuse multiple operation
 - `reduce(self, other_chunkref, recvtb, channel_type)`: Reduces the chunk(s) referenced by other_chunkref into the chunk(s) referenced by this chunkref
 ### Packet APIs
-Packet APIs are used when user wants to use LL algorithm. The packet APIs are similar to the basic APIs, it will packet the data and flags into a packet and send the packet to the destination GPU. The destination GPU will unpack the packet and get the data and flags. So no synchronization is needed when using packet APIs.
-- `packet_put(self, dst, chunk, index, sendtb, chan_type)`: send the data from one GPU to another GPU using packet.
+Packet APIs are used when the user wants to use the LL algorithm. The packet APIs are similar to the basic APIs, but they pack the data and flags into a packet and send the packet to the destination GPU. The destination GPU unpacks the packet and retrieves the data and flags, so no extra synchronization is needed when using packet APIs. (`ChannelType.nvls` is not supported for packet APIs.)
+- `packet_put(self, dst, buffer, index, sendtb, chan_type)`: send the data from one GPU to another GPU using packets.
 - `copy_packet(self, dst, buffer, index, sendtb)`: copy the data from one buffer to another buffer in the same GPU using packet.
-- `reduce_packet(self, other_chunkref, recvtb)`: Reduces the chunk(s) referenced by other_chunkref into the chunk(s) referenced by this chunkref using packet.
\ No newline at end of file
+- `reduce_packet(self, other_chunkref, recvtb)`: Reduces the chunk(s) referenced by other_chunkref into the chunk(s) referenced by this chunkref using packets.
+
+
+### Examples
+We provide several examples demonstrating how to use the MSCCL++ DSL to write collective communication algorithms. For more details, please refer to the [examples](https://github.com/microsoft/mscclpp/tree/main/mscclpp-lang/python/examples) folder.
\ No newline at end of file
diff --git a/python/examples/allreduce_allpairs.py b/python/examples/allreduce_allpairs.py
index 86210634..100e0052 100644
--- a/python/examples/allreduce_allpairs.py
+++ b/python/examples/allreduce_allpairs.py
@@ -9,10 +9,10 @@ def allreduce_allpairs(gpus, instances, protocol):
     """
-    Demostrate allreduce with all pairs algorithm with put semantics.
+    Demonstrate allreduce with all pairs algorithm using put semantics.
     Steps:
     1. Sync all ranks to ensure the data is ready.
-    2. Each rank read chunks from all peers and reduces the data.
+    2. Each rank reads chunks from all peers and reduces the data.
     3. Put the reduced data to all peers.
     4. Sync all ranks to ensure the data is received.
     """
diff --git a/python/examples/allreduce_allpairs_packet.py b/python/examples/allreduce_allpairs_packet.py
index 74c9ca5c..db35565b 100644
--- a/python/examples/allreduce_allpairs_packet.py
+++ b/python/examples/allreduce_allpairs_packet.py
@@ -11,9 +11,10 @@ def allreduce_allpairs(gpus, instances):
     """
     AllReduce with all pairs algorithm using packets format.
     Steps:
-    1. Each rank sends the nth chunk to the nth rank into scratch space.
-    2. Each rank performs a local reduction on the nth chunk. Then sends the reduced data to all other ranks.
-    3. Each rank retrieves the final result from scratch space.
+    1. Each rank sends its nth chunk to the nth rank's scratch space.
+    2. Each rank performs a local reduction on its nth chunk using data from all other ranks' scratch spaces.
+    3. Each rank sends the reduced data to all other ranks' scratch spaces.
+    4. Each rank retrieves the final reduced result from the scratch space.
     """
     size = gpus
     chunksperloop = gpus * gpus
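
The Basic APIs listed in the documentation hunk above can be combined into a simple put-based exchange. The sketch below is illustrative only: the operation signatures follow the bullet list in this patch, the program setup mirrors the shipped examples (including an assumed explicit number-of-ranks argument to `MSCCLPPProgram`), and the program name and thread-block indices are made up.

```python
# Hypothetical sketch: a put-based exchange between two ranks using the Basic APIs.
# Verify argument names (sendtb, recvtb, chan_type) against the installed package.
from mscclpp.language import *
from mscclpp.language.buffer import Buffer
from mscclpp.language.collectives import AllGather
from mscclpp.language.types import ChannelType

num_ranks = 2
collective = AllGather(num_ranks, 1, True)  # ranks, chunk factor, in-place

with MSCCLPPProgram("allgather_put_sketch", collective, num_ranks, instances=1):
    for rank in range(num_ranks):
        peer = 1 - rank
        # Each rank owns the chunk at index `rank` of the (in-place) output buffer.
        c = chunk(rank, Buffer.output, rank, 1)
        # Synchronize: tell the peer we are ready, then wait for the peer's signal.
        c.signal(peer, Buffer.output, rank, sendtb=0, chan_type=ChannelType.sm)
        c.wait(peer, Buffer.output, peer, recvtb=0, chan_type=ChannelType.sm)
        # Write our chunk into the peer's output buffer at the same index.
        c.put(peer, Buffer.output, index=rank, sendtb=0, chan_type=ChannelType.sm)
        # A second signal/wait round (as in allreduce_allpairs.py step 4) would
        # normally follow to confirm the data has been received.
```
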
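Similarly, the Packet APIs can express the packet-based flow whose steps the updated allreduce_allpairs_packet.py docstring describes. The sketch below is a simplified two-rank illustration, not the code in that example: the buffer indices, thread-block numbers, and scratch layout are invented, and the argument names are taken from the API bullets above, so they should be checked against the installed package.

```python
# Hypothetical sketch: packet_put / reduce_packet / copy_packet for two ranks.
# No signal/wait is needed; the flag carried in each packet provides the synchronization.
from mscclpp.language import *
from mscclpp.language.buffer import Buffer
from mscclpp.language.collectives import AllReduce
from mscclpp.language.types import ChannelType

gpus = 2
collective = AllReduce(gpus, gpus, True)  # ranks, chunk factor (one chunk per rank), in-place

with MSCCLPPProgram("allreduce_packet_sketch", collective, gpus, instances=1, protocol="LL"):
    # Step 1: each rank sends the chunk owned by its peer into the peer's scratch space.
    for rank in range(gpus):
        peer = 1 - rank
        chunk(rank, Buffer.input, peer, 1).packet_put(
            peer, Buffer.scratch, index=0, sendtb=0, chan_type=ChannelType.sm
        )
    # Steps 2-3: reduce the received packet into the owned chunk, then send the result back.
    for rank in range(gpus):
        peer = 1 - rank
        c = chunk(rank, Buffer.input, rank, 1)
        c.reduce_packet(chunk(rank, Buffer.scratch, 0, 1), recvtb=0)
        c.packet_put(peer, Buffer.scratch, index=1, sendtb=0, chan_type=ChannelType.sm)
    # Step 4: copy the reduced chunk received from the peer into its final position.
    for rank in range(gpus):
        peer = 1 - rank
        chunk(rank, Buffer.scratch, 1, 1).copy_packet(rank, Buffer.input, index=peer, sendtb=0)
```
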