Add CI/CD for unit tests #41

Merged
107 commits merged on Feb 16, 2024

Commits
1c79951
add CI/CD for unit tests
xrsrke Jan 19, 2024
04491d3
fix
xrsrke Jan 19, 2024
fdd5d1e
fix syntax
xrsrke Jan 19, 2024
91208dd
fix
xrsrke Jan 19, 2024
8da087d
fix
xrsrke Jan 19, 2024
00875c0
update actions/checkout
xrsrke Jan 19, 2024
cca7e56
new runner label
glegendre01 Jan 19, 2024
338c042
fix typo
glegendre01 Jan 19, 2024
0c6433c
add workflow dispatch
glegendre01 Jan 19, 2024
6de2472
remove path filter for triggering
glegendre01 Jan 19, 2024
79b22d8
test ci
xrsrke Jan 23, 2024
c73623b
update python version
xrsrke Jan 23, 2024
5efc135
add code quality
xrsrke Jan 23, 2024
4fb80a4
refactor
xrsrke Jan 23, 2024
ceb21c2
only check src
xrsrke Jan 23, 2024
05aa557
fix
xrsrke Jan 23, 2024
0010cfa
use docker image
xrsrke Jan 23, 2024
dba1eed
fix
xrsrke Jan 23, 2024
b2af5d0
use python 10
xrsrke Jan 23, 2024
8914de7
change docker image
xrsrke Jan 24, 2024
368beba
fix pip install
xrsrke Jan 24, 2024
565e081
add fa2-related tests
xrsrke Jan 24, 2024
7b38326
fix
xrsrke Jan 24, 2024
906477b
update FA2 version
xrsrke Jan 24, 2024
4491ce7
add on push
xrsrke Jan 24, 2024
5b22ede
update FA2 to flash-attn>=2.5.0
xrsrke Jan 24, 2024
5f3ce67
Merge branch 'main' of github.com:huggingface/nanotron into xrsrke/se…
xrsrke Jan 29, 2024
9a03a04
add searching for free ports in unit tests
xrsrke Jan 29, 2024
1cf4da2
remove searching port
xrsrke Jan 29, 2024
f6d9847
move searching ports to distributed
xrsrke Jan 29, 2024
f675daf
Update 3d_parallelism_unit_tests.yaml
xrsrke Jan 29, 2024
0908b74
Update 3d_parallelism_unit_tests.yaml
xrsrke Jan 29, 2024
df7cb9d
Update distributed.py
xrsrke Jan 29, 2024
839677a
Update 3d_parallelism_unit_tests.yaml
xrsrke Jan 29, 2024
b631186
Update 3d_parallelism_unit_tests.yaml
xrsrke Jan 30, 2024
128eea5
Update distributed.py
xrsrke Jan 30, 2024
f96808a
Refactor test_clip_grads_with_tp parameters
NouamaneTazi Jan 31, 2024
d123d1b
Skip test cases for ALL_REDUCE mode with async communication
NouamaneTazi Jan 31, 2024
b899564
Update init_method to use env://localhost:port
NouamaneTazi Jan 31, 2024
ff32ddb
tests run for all PRs
NouamaneTazi Jan 31, 2024
abe42c6
Update branch filter in GitHub workflows
NouamaneTazi Jan 31, 2024
0a754a1
skip ALL_REDUCE with async comm
NouamaneTazi Jan 31, 2024
5d822bb
make sure total_norm in clip grad is a scalar
NouamaneTazi Jan 31, 2024
e5e2045
Merge branch 'main' of github.com:huggingface/nanotron into xrsrke/se…
xrsrke Jan 31, 2024
5d9652a
refactor
xrsrke Jan 31, 2024
063020a
zeros([]
NouamaneTazi Feb 1, 2024
741966b
Merge pull request #52 from huggingface/nouamane/fix_ci
NouamaneTazi Feb 1, 2024
e2ed85f
exclude sanity_checks.py from CoL
xrsrke Feb 1, 2024
91234fa
exclude sanity_checks.py from CoL
xrsrke Feb 1, 2024
a57cb9b
Merge branch 'main' of github.com:huggingface/nanotron into xrsrke/se…
xrsrke Feb 10, 2024
8a98cfc
fix expectation
xrsrke Feb 10, 2024
29672db
remove empty context manager in tp tests
xrsrke Feb 10, 2024
0a34e65
add reruning a tests if a port is in used
xrsrke Feb 10, 2024
e3c3d11
fix checking total_norm should be a scalar
xrsrke Feb 10, 2024
63ca0d2
fix
xrsrke Feb 10, 2024
44c0e05
add more retrying
xrsrke Feb 10, 2024
b8eeb1e
fix clip grads
xrsrke Feb 10, 2024
b553c4e
remove testing dim in clip grads
xrsrke Feb 10, 2024
0b97c38
fuk
xrsrke Feb 10, 2024
8c7355e
refactor
xrsrke Feb 10, 2024
2a4e735
run tests in parallel
xrsrke Feb 10, 2024
d47555e
not run fa2
xrsrke Feb 10, 2024
3b70271
only run 5 tests in parallel
xrsrke Feb 10, 2024
30b8004
only run a test at a time
xrsrke Feb 10, 2024
51a804c
add forking RNG
xrsrke Feb 10, 2024
cec0c04
fix circular import
xrsrke Feb 10, 2024
f42a43e
fix rng
xrsrke Feb 10, 2024
5b375f5
remove parallel tests
xrsrke Feb 10, 2024
081b17d
add python random seed
xrsrke Feb 11, 2024
4dce881
remove dist test, and add destroying process group after running a test
xrsrke Feb 11, 2024
00bb0bf
fix
xrsrke Feb 11, 2024
957826e
edit
xrsrke Feb 11, 2024
dc65581
fix
xrsrke Feb 11, 2024
0fe7bdd
fix
xrsrke Feb 11, 2024
de52fc6
removing destroy pg
xrsrke Feb 11, 2024
f2afea3
add destroying parallel_context in unit tests
xrsrke Feb 11, 2024
97ebff4
ignore layer norm
xrsrke Feb 11, 2024
6a5fd81
wtf is going on
xrsrke Feb 11, 2024
9c7e1a7
add small run
xrsrke Feb 13, 2024
b2c71b0
run small with dist test
xrsrke Feb 13, 2024
0d21bba
debug missing destroy
xrsrke Feb 13, 2024
6bb69ff
fuck
xrsrke Feb 13, 2024
b39c831
f
xrsrke Feb 13, 2024
3bd346d
.
NouamaneTazi Feb 13, 2024
dd0079e
.
NouamaneTazi Feb 13, 2024
91cf7e3
try timeout-minutes and --rm
NouamaneTazi Feb 13, 2024
7e0fcce
try -v
NouamaneTazi Feb 13, 2024
6dcb73d
try
NouamaneTazi Feb 13, 2024
b64f04f
bring back parallel_context.destroy()
NouamaneTazi Feb 13, 2024
2d44ec7
add 3d tests
xrsrke Feb 14, 2024
5d03579
add all cicd
xrsrke Feb 14, 2024
ab09576
run parallel tests
xrsrke Feb 14, 2024
77e0764
only run 1 test
xrsrke Feb 14, 2024
f43687f
add directly spawning processes
xrsrke Feb 15, 2024
004e7f4
refactor spawn function as init_distributed
xrsrke Feb 15, 2024
558b341
please work
xrsrke Feb 15, 2024
98046f8
catch overlaping port from find_free_port
xrsrke Feb 15, 2024
d96c7fa
clean up
xrsrke Feb 15, 2024
f56f8a7
fix circular import
xrsrke Feb 15, 2024
a48b7bf
skip fp8 tests in FA2
xrsrke Feb 15, 2024
033aca9
update code quality
xrsrke Feb 15, 2024
d4c27e7
fix
xrsrke Feb 15, 2024
39e5846
fix
xrsrke Feb 15, 2024
6f7e4b2
remove uncessary files
xrsrke Feb 15, 2024
cd51bd9
fix search free poorts
xrsrke Feb 15, 2024
6c30d2c
set ParallelContext in wrapper
xrsrke Feb 16, 2024
c705f4d
remove uncessary comments
xrsrke Feb 16, 2024
63 changes: 63 additions & 0 deletions .github/workflows/3d_parallelism_unit_tests.yaml
@@ -0,0 +1,63 @@
name: Run non-FA2-related unit tests

on:
  push:
    branches: [ main ]
    # Only run tests if we modify the following files
    paths:
      - "src/**/*.py"
      - "examples/**/*.py"
      - "tests/**/*.py"

  pull_request:
    branches: [ '**' ]
    paths:
      - "src/**/*.py"
      - "examples/**/*.py"
      - "tests/**/*.py"

jobs:
  tests:
    runs-on: [multi-gpu, nvidia-gpu, 8-t4, ci]
    container:
      image: runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04
      ports:
        - 80
      options: --gpus all --shm-size "8G"
    steps:
      - uses: actions/checkout@v3
      - name: Python environment
        run: |
          which python
          python --version

      - name: Check Pytorch version
        run: |
          nvidia-smi
          python -c "import torch; print('torch:', torch.__version__, torch)"
          python -c "import torch; print('CUDA available:', torch.cuda.is_available())"

      - name: Install nanotron
        run: |
          python -m pip install --upgrade pip
          pip install packaging
          pip install wheel
          pip install -e .
          pip install -e .[dev]
          pip install -e .[test]

      - name: Show installed libraries and their versions
        run: pip freeze | tee installed.txt

      - name: Run tests
        # NOTE: -m "not fa2" runs all the unit tests that don't have the mark
        # "fa2" (these are FA2-related tests, which we can't run on T4)
        run: |
          pytest \
            -m "not fa2" \
            --color=yes \
            --durations=0 \
            --ignore tests/kernels \
            --ignore tests/fp8 \
            --verbose \
            tests/
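
Note: the `-m "not fa2"` filter above relies on a custom pytest mark. As a rough sketch of how such a mark is typically declared and applied (the test names and marker registration below are illustrative assumptions, not nanotron's actual tests):

import pytest

# The mark would normally be registered in pytest.ini / pyproject.toml, e.g.
#   markers = ["fa2: tests that require flash-attn 2 (cannot run on T4 runners)"]

@pytest.mark.fa2
def test_flash_attention_kernel():
    flash_attn = pytest.importorskip("flash_attn")  # skip cleanly if FA2 is not installed
    assert flash_attn is not None

def test_plain_unit():
    # Unmarked tests are the ones selected by `pytest -m "not fa2"` on the T4 runners.
    assert 1 + 1 == 2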
26 changes: 26 additions & 0 deletions .github/workflows/code_quality.yaml
@@ -0,0 +1,26 @@
name: Code Quality

on:
  workflow_dispatch:
  push:
    branches: [ main ]
    # Only run tests if we modify the following files
    paths:
      - "src/**/*.py"

  pull_request:
    branches: [ '**' ]
    paths:
      - "src/**/*.py"

jobs:
  cloc:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Count Lines of Code (cloc)
        uses: djdefi/cloc-action@6
        with:
          options: --by-file-by-lang --exclude-dir=docs,tests,examples --exclude-lang=YAML,Markdown,TOML --exclude-list-file=sanity_checks.py
58 changes: 58 additions & 0 deletions .github/workflows/fa2_unit_tests.yaml
@@ -0,0 +1,58 @@
name: Run FA2-related unit tests

on:
  workflow_dispatch:
  push:
    branches: [ main ]
    # Only run tests if we modify the following files
    paths:
      - "src/**/*.py"
      - "examples/**/*.py"
      - "tests/**/*.py"

  pull_request:
    branches: [ '**' ]
    paths:
      - "src/**/*.py"
      - "examples/**/*.py"
      - "tests/**/*.py"

jobs:
  tests:
    runs-on: [single-gpu, nvidia-gpu, a10, ci]
    container:
      image: runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04
      ports:
        - 80
      options: --gpus all --shm-size "8G"
    steps:
      - uses: actions/checkout@v3

      - name: Python environment
        run: |
          which python
          python --version

      - name: Check Pytorch version
        run: |
          nvidia-smi
          python -c "import torch; print('torch:', torch.__version__, torch)"
          python -c "import torch; print('CUDA available:', torch.cuda.is_available())"

      - name: Install nanotron
        run: |
          python -m pip install --upgrade pip
          pip install packaging
          pip install wheel
          pip install "flash-attn>=2.5.0" --no-build-isolation
          pip install -e .
          pip install -e .[dev]
          pip install -e .[test]

      - name: Show installed libraries and their versions
        run: pip freeze | tee installed.txt

      - name: Run tests
        # NOTE: -m fa2 only runs the unit tests that have the mark
        # "fa2" (these are FA2-related tests)
        run: pytest -m fa2 --color=yes --durations=0 --ignore tests/fp8 --verbose tests/
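
Note: this job pins an A10 runner because flash-attn 2 kernels require Ampere-class or newer GPUs, which is also why the T4 job above excludes the `fa2` mark. A minimal, hypothetical guard that FA2-marked tests could add (function and test names are assumptions, not nanotron code):

import pytest
import torch

def require_fa2_capable_gpu():
    if not torch.cuda.is_available():
        pytest.skip("CUDA is not available")
    major, _minor = torch.cuda.get_device_capability()
    if major < 8:
        pytest.skip("flash-attn 2 needs compute capability >= 8.0 (e.g. A10/A100)")

@pytest.mark.fa2
def test_fa2_only_on_supported_gpu():
    require_fa2_capable_gpu()
    flash_attn = pytest.importorskip("flash_attn")
    assert flash_attn is not None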
1 change: 0 additions & 1 deletion .gitignore
@@ -160,6 +160,5 @@ cython_debug/
#.idea/

.vscode
.github

checkpoints/
14 changes: 13 additions & 1 deletion src/nanotron/distributed.py
@@ -9,6 +9,8 @@
from torch.distributed import * # noqa
from torch.distributed.distributed_c10d import ProcessGroup

from nanotron.utils import find_free_port

torch_version_above_1_13 = version.parse(torch.__version__) >= version.parse("1.13.0")
Work = dist.Work if torch_version_above_1_13 else dist._Work
default_pg_timeout = datetime.timedelta(minutes=10)
@@ -257,5 +259,15 @@ def initialize_torch_distributed():
    backend = "gloo"

    # Call the init process.
    dist.init_process_group(backend=backend, world_size=world_size, rank=rank, timeout=dist.default_pg_timeout)

    port = os.getenv("MASTER_PORT")
    if port is None:
        port = find_free_port()
    else:
        port = int(port)

    init_method = f"env://localhost:{port}"
    dist.init_process_group(
        init_method=init_method, backend=backend, world_size=world_size, rank=rank, timeout=dist.default_pg_timeout
    )
    return True
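
Note: the change above makes `initialize_torch_distributed` fall back to `find_free_port()` whenever `MASTER_PORT` is not set, so unit tests don't collide on a hard-coded rendezvous port. A minimal single-process sketch of the same pattern (illustrative only, assuming nanotron is installed; it uses the gloo backend so it runs without a GPU):

import os
import torch.distributed as dist
from nanotron.utils import find_free_port

port = os.getenv("MASTER_PORT")
port = find_free_port() if port is None else int(port)

os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = str(port)  # the default env:// rendezvous reads these variables

dist.init_process_group(backend="gloo", rank=0, world_size=1)
assert dist.is_initialized()
dist.destroy_process_group()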
4 changes: 2 additions & 2 deletions src/nanotron/optim/clip_grads.py
@@ -56,7 +56,7 @@ def clip_grad_norm(
                torch.stack([torch.linalg.vector_norm(g.detach(), ord=torch.inf, dtype=torch.float) for g in grads])
            )
        else:
            total_norm = torch.zeros(1, dtype=torch.float, device=torch.device("cuda"))
            total_norm = torch.zeros([], dtype=torch.float, device=torch.device("cuda"))
        dist.all_reduce(total_norm, group=mp_pg, op=dist.ReduceOp.MAX)

    else:
@@ -68,7 +68,7 @@ def clip_grad_norm(
                dtype=torch.float,
            ).pow(norm_type)
        else:
            total_norm = torch.zeros(1, dtype=torch.float, device=torch.device("cuda"))
            total_norm = torch.zeros([], dtype=torch.float, device=torch.device("cuda"))
        dist.all_reduce(total_norm, group=mp_pg, op=dist.ReduceOp.SUM)
        total_norm.pow_(1.0 / norm_type)
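
Note: the fix above replaces a one-element tensor with a 0-d scalar so that `total_norm` has the same shape regardless of which branch produced it (the non-empty branches return 0-d results from `torch.max` / `torch.linalg.vector_norm`). A quick illustration of the difference:

import torch

vec = torch.zeros(1)      # shape (1,): a one-element vector
scalar = torch.zeros([])  # shape (): a 0-d scalar tensor

print(vec.shape, scalar.shape)  # torch.Size([1]) torch.Size([])
print(torch.linalg.vector_norm(torch.randn(4)).shape)  # torch.Size([]): norms are 0-d too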
9 changes: 8 additions & 1 deletion src/nanotron/parallel/context.py
Expand Up @@ -35,7 +35,7 @@ def __init__(
)

if not dist.is_available():
raise ValueError("`torch.distributed is not available as a package, please install it.")
raise ValueError("torch.distributed is not available as a package, please install it.")

self.tensor_parallel_size = tensor_parallel_size
self.pipeline_parallel_size = pipeline_parallel_size
@@ -148,3 +148,10 @@ def get_3d_ranks(self, world_rank: int) -> Tuple[int, int, int]:
        dp_rank = (world_rank // self.tp_pg.size()) % self.dp_pg.size()
        tp_rank = world_rank % self.tp_pg.size()
        return (pp_rank, dp_rank, tp_rank)

    def destroy(self):
        if not dist.is_initialized():
            return

        dist.barrier()
        dist.destroy_process_group()
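
Note: `destroy()` is what lets the CI run many distributed tests back to back in one job; each test tears down its process group before the next one initializes. A hypothetical sketch of a per-test wrapper using it (the helper name and constructor arguments are assumptions based on the attributes visible in this diff, not nanotron's actual test utilities):

from nanotron.parallel import ParallelContext

def run_with_parallel_context(test_fn, tp: int, dp: int, pp: int, *args, **kwargs):
    parallel_context = ParallelContext(
        tensor_parallel_size=tp,
        data_parallel_size=dp,
        pipeline_parallel_size=pp,
    )
    try:
        test_fn(parallel_context, *args, **kwargs)
    finally:
        # Barrier and tear down the process group so the next test can
        # re-initialize torch.distributed cleanly.
        parallel_context.destroy()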
14 changes: 14 additions & 0 deletions src/nanotron/utils.py
@@ -2,6 +2,8 @@
import inspect
import math
import os
import random
import socket
from contextlib import ExitStack, contextmanager
from typing import Callable, ContextManager, List, Optional

@@ -147,3 +149,15 @@ def tensor_from_untyped_storage(untyped_storage: torch.UntypedStorage, dtype: to
    tensor = torch.empty([], dtype=dtype, device=device)
    tensor.set_(source=untyped_storage)
    return tensor


def find_free_port(min_port: int = 2000, max_port: int = 65000) -> int:
    while True:
        port = random.randint(min_port, max_port)
        try:
            with socket.socket() as sock:
                sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                sock.bind(("localhost", port))
                return port
        except OSError:
            continue
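
Note: probing a random port this way can still race with another process that binds it between the probe and the actual rendezvous, which is why later commits add rerunning a test when a port is already in use. A hedged sketch of a retry wrapper a caller could use (names are illustrative, not nanotron's API):

from nanotron.utils import find_free_port

def launch_with_port_retries(launch, max_attempts: int = 3):
    """`launch` is any callable that starts a process group on the given port."""
    last_error = None
    for _ in range(max_attempts):
        port = find_free_port()
        try:
            return launch(port)
        except OSError as exc:  # e.g. "address already in use"
            last_error = exc
    raise RuntimeError("could not bind a rendezvous port") from last_error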