[Do not merge] Overlap benchmark: AG+GEMM distributed matmul with HostIr and ParallelType::Stream #3719

Open: wants to merge 24 commits into base: main
Commits (24, all by samnordmann):
- 2c04df9 working simple benchmark (Oct 23, 2024)
- af36cf1 minor (Oct 25, 2024)
- 68b858a test script (Oct 25, 2024)
- 0c3493b minor (Oct 28, 2024)
- b30b44b add nsight profiling (Oct 29, 2024)
- 0592a13 nsight and tl/nccl/ sync mode (Oct 31, 2024)
- 0037b1e add cuStreamWriteValue but linkage error (Nov 4, 2024)
- ec71e23 multiple pgs (Nov 4, 2024)
- a15fdfc reenable cuStreamValue32 (Nov 4, 2024)
- 6682a33 add tl/cuda and ec/cuda flags in bash test script (Nov 4, 2024)
- b01f1f4 add option to unfuse loops (Nov 4, 2024)
- ea7fd37 add cuda graphs. Only working for NCCL and S1 bc there is a syncStrea… (Nov 5, 2024)
- 9dddac2 write matmul to sliced output (Nov 26, 2024)
- faf8bbe wip cuStreamWriteValue not working (Nov 28, 2024)
- a6b5fd7 dummy benchmark (Dec 2, 2024)
- 8d927bf add pre post comms option (Dec 2, 2024)
- d9c581c add pre post comms option (Dec 2, 2024)
- bfc7fa6 cleanup test script (Dec 6, 2024)
- 1a1138c update (Jan 8, 2025)
- 743185d Merge branch 'overlap_bench/first_experiments' of github.com:samnordm… (Jan 16, 2025)
- a2b1650 Merge branch 'main' of github.com:NVIDIA/Fuser into overlap_bench/fir… (Jan 16, 2025)
- e037ee5 test with stream parallel type and host IR (Jan 16, 2025)
- 8328c28 add support for other dtypes (Jan 20, 2025)
- 2fecf02 remove trace print (Jan 22, 2025)
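For context on what the benchmark measures: the AG+GEMM overlap splits the allgathered operand into S slices and pipelines them across several CUDA streams, so the collective for one slice runs while the matmuls of other slices keep the SMs busy. The sketch below illustrates the pattern only; allgatherSliceAsync and gemmSliceAsync are hypothetical placeholders, not the HostIr / ParallelType::Stream implementation this PR actually benchmarks.

```cpp
// Pipelined AG+GEMM pattern (illustrative sketch, not the PR's HostIr code).
#include <cuda_runtime.h>
#include <vector>

// Hypothetical placeholders for the UCC/NCCL allgather and the cuBLAS GEMM
// restricted to slice `i` of the gathered operand / output.
void allgatherSliceAsync(int i, cudaStream_t s);
void gemmSliceAsync(int i, cudaStream_t s);

void pipelinedAgMatmul(int numSlices, int numStreams) {
  std::vector<cudaStream_t> streams(numStreams);
  for (auto& s : streams) {
    cudaStreamCreate(&s);
  }
  for (int i = 0; i < numSlices; ++i) {
    // Round-robin slices over streams: while slice i's allgather is on the
    // wire, GEMMs of earlier slices still execute on the other streams.
    cudaStream_t s = streams[i % numStreams];
    allgatherSliceAsync(i, s);
    gemmSliceAsync(i, s);  // writes slice i of the output ("write matmul to sliced output")
  }
  for (auto& s : streams) {
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
  }
}
```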
7 changes: 7 additions & 0 deletions bench/process_outputs
@@ -0,0 +1,7 @@
#!/bin/bash

# $1 is the log directory name created by bench/test (${EXPERIMENT}_${BACKEND}_${DATE}).
# Note: bench/test writes the log as info.txt, so read that file (the original
# script pointed at "info").
FILE="/opt/pytorch/Fuser/bench/logs/${1}/info.txt"

grep "rank 0: " "$FILE" # | awk '{print $4}'

# | grep -E 'Streams32\b'
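A hypothetical invocation, assuming the `${EXPERIMENT}_${BACKEND}_${DATE}` directory naming used by bench/test below: `bench/process_outputs StreamParallelType_tests_UCC_20250122-1030` prints the rank-0 timing lines from that run.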
89 changes: 89 additions & 0 deletions bench/test
@@ -0,0 +1,89 @@
#!/bin/bash
EXPERIMENT=StreamParallelType_tests
DATE=$(date +%Y%m%d-%H%M)
LOG_BASE="/opt/pytorch/Fuser/bench/logs"

NP=8
BACKEND=UCC
M=32768
K=32768
N=1024

DTYPE="__half" # alternatives: float, __bfloat

S=8
Streams=3
Pgs=1
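# S = number of pipeline slices, Streams = number of CUDA streams,
# Pgs = number of process groups (meanings inferred from the gtest
# parameter string assembled below).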

# M=131072 #32768
# K=131072
# N=32768 #1024
# L=1048576 #268435456 #67108864 #131072
# PRE_COMM="_pre_comm"
# POST_COMM="_post_comm"
# UNFUSE="_unfused"
# GRAPH="_WithCudaGraph"
# cuStreamWrite=WithcuStreamWriteValue32_
# GTEST_PREFIX="OverlapBenchmark.PipelinedAGMatmulBenchmark/"
# GTEST_PREFIX="DummyOverlapBenchmark.PipelinedAGMatmulBenchmark/"
GTEST_PREFIX="OverlapBenchmark.PipelinedAGMatmulBenchmarkStreamParallelType/"
GTEST_POSTFIX="${BACKEND}_S${S}_M${M}_K${K}_N${N}_Streams${Streams}_${DTYPE}_${cuStreamWrite}Pgs${Pgs}${UNFUSE}${GRAPH}"
# GTEST_POSTFIX="${BACKEND}_M${M}_K${K}_N${N}_L${L}${PRE_COMM}${POST_COMM}"
export GTEST_FILTER="${GTEST_PREFIX}${GTEST_POSTFIX}"
echo "gtest filter: $GTEST_FILTER" | tee -a $LOG_FILE_INFO

MPIFLAGS=" -np $NP"
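# The commented -x flags below are experiment toggles for the NCCL, UCC
# (TL/EC), and UCX backends; uncomment to sweep backend and tuning configs.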

# MPIFLAGS+=" -x NCCL_P2P_NET_CHUNKSIZE=2MB"
# MPIFLAGS+=" -x NCCL_DEBUG=TRACE" #INFO
# MPIFLAGS+=" -x NCCL_MAX_NCHANNELS=1"

MPIFLAGS+=" -x UCC_CL_BASIC_TLS=nccl"
# MPIFLAGS+=" -x UCC_TL_NCCL_SYNC=event"

# MPIFLAGS+=" -x UCC_CL_BASIC_TLS=cuda"
# MPIFLAGS+=" -x UCC_TL_CUDA_SCRATCH_SIZE=32mb"
# MPIFLAGS+=" -x UCC_TL_CUDA_ALLGATHER_RING_MAX_RINGS=32"
# MPIFLAGS+=" -x UCC_TL_CUDA_ALLGATHER_RING_NUM_CHUNKS=32"

# MPIFLAGS+=" -x UCC_EC_CUDA_EXEC_NUM_WORKERS=8"
# MPIFLAGS+=" -x UCC_EC_CUDA_USE_COOPERATIVE_LAUNCH=0"
# MPIFLAGS+=" -x UCC_EC_CUDA_STREAM_TASK_MODE=driver"
# MPIFLAGS+=" -x UCC_EC_CUDA_STREAM_TASK_MODE=kernel"
# MPIFLAGS+=" -x UCC_EC_CUDA_EXEC_COPY_LARGE_THRESH=1M"
# MPIFLAGS+=" -x UCC_EC_CUDA_EXEC_NUM_THREADS=512"

# MPIFLAGS+=" -x UCC_CL_BASIC_TLS=ucp"
# MPIFLAGS+=" -x UCX_RNDV_THRESH=0 -x UCX_TLS=ib,cuda_copy"
# MPIFLAGS+=" -x UCX_RNDV_SCHEME=put_zcopy"
# MPIFLAGS+=" -x UCX_RNDV_SCHEME=get_zcopy"


MPIFLAGS+=" -x UCX_NET_DEVICES=mlx5_0:1"
# MPIFLAGS+=" -x UCC_CL_BASIC_TLS=^sharp,mlx5"
# MPIFLAGS+=" -x UCC_COLL_TRACE=info"
# MPIFLAGS+=" -x UCC_LOG_LEVEL=debug"
# MPIFLAGS+=" -x TORCH_NCCL_AVOID_RECORD_STREAMS=1"
# MPIFLAGS+=" -x CUDA_DEVICE_MAX_CONNECTIONS=2"


export LOGS="${LOG_BASE}/${EXPERIMENT}_${BACKEND}_${DATE}"
mkdir -p $LOGS
export LOG_FILE_INFO="${LOGS}/info.txt"
echo "Writing to $LOG_FILE_INFO" | tee -a $LOG_FILE_INFO

echo "mpi flags: $MPIFLAGS" | tee -a $LOG_FILE_INFO

TEST_CMD="$BUILD_DIRECTORY/test_multidevice --gtest_filter=${GTEST_FILTER}"
echo "test cmd: $TEST_CMD" | tee -a $LOG_FILE_INFO

MPICMD="mpirun $MPIFLAGS $TEST_CMD"
echo $MPICMD | tee -a $LOG_FILE_INFO

NSYSCMD="nsys profile --stats=false -w true -t cublas,cuda,nvtx,osrt,mpi,ucx -o ${LOGS}/${GTEST_POSTFIX} --capture-range-end stop --capture-range=cudaProfilerApi --cudabacktrace=memory,sync,kernel,other"

CMD="${NSYSCMD} ${MPICMD}"
sudo /bin/sh -c "echo '1' > /proc/sys/kernel/perf_event_paranoid"
echo $CMD | tee -a ${LOG_FILE_INFO}
$CMD | tee -a ${LOG_FILE_INFO}
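A note on the profiling setup: with `--capture-range=cudaProfilerApi`, nsys records nothing until the profiled process calls `cudaProfilerStart()`, and `--capture-range-end stop` ends the capture at `cudaProfilerStop()`. A minimal sketch of how a benchmark brackets the timed region (the actual call sites live in the test binary, not in this diff):

```cpp
#include <cuda_profiler_api.h>

void runBenchmarkIterations() {
  cudaProfilerStart();  // nsys begins capturing here (--capture-range=cudaProfilerApi)
  // ... warmed-up AG+GEMM iterations to be profiled ...
  cudaProfilerStop();   // capture ends here (--capture-range-end stop)
}
```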

1 change: 1 addition & 0 deletions csrc/driver_api.h
@@ -32,6 +32,7 @@ namespace nvfuser {
fn(cuModuleGetFunction); \
fn(cuModuleLoadDataEx); \
fn(cuModuleUnload); \
fn(cuStreamWriteValue32); \
fn(cuOccupancyMaxActiveBlocksPerMultiprocessor)

#if (CUDA_VERSION >= 12000)
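For context on the driver_api.h change: `cuStreamWriteValue32` enqueues a 32-bit write to device-accessible memory as a stream operation, which, paired with `cuStreamWaitValue32`, gives lightweight producer/consumer ordering between streams without CUDA events. Only the write entry point is added in this diff. A minimal sketch, assuming `flag` points to memory valid for stream memory operations (e.g. allocated via `cuMemHostAlloc` with `CU_MEMHOSTALLOC_DEVICEMAP`):

```cpp
#include <cuda.h>

// Producer stream publishes a flag; consumer stream stalls until it appears.
void crossStreamSignal(CUstream producer, CUstream consumer, CUdeviceptr flag) {
  // Enqueued on `producer`: after prior work on that stream, set *flag = 1.
  cuStreamWriteValue32(producer, flag, 1, CU_STREAM_WRITE_VALUE_DEFAULT);
  // Enqueued on `consumer`: block the stream until *flag >= 1.
  cuStreamWaitValue32(consumer, flag, 1, CU_STREAM_WAIT_VALUE_GEQ);
}
```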