Adapt to tt-system-tools hugepages configuration (#14396)
### Tickets

#### metal-internal-workflows
Update provisioning to use new hugepages flow from tt-system-tools
- https://github.com/tenstorrent-metal/metal-internal-workflows/issues/278
- https://github.com/tenstorrent-metal/metal-internal-workflows/pull/327

#### tt-metal
Many metal machines don't have correct hugepages config
#15675

### Problem description
The old hugepages configuration was janky, and the plan of record is to
move to the tt-system-tools method.

### What's changed
- Remove the unconditional invocation of `sudo /etc/rc.local` in the mount
weka script; it is kept only as a fallback when
`tenstorrent-hugepages.service` is not found.
- Use our custom action `ensure-active-weka-mount` consistently
everywhere we try to mount weka.
- Remove `setup_hugepages.py`.
- Add forced garbage collection after every pytest execution to reduce
pressure on system memory.
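
In the `mount_weka.sh` diff, the choice between the new service and the old `rc.local` path depends on the exit status of `systemctl status`, captured without tripping `set -e`. The following is a minimal sketch of that capture-and-branch pattern, not the script itself: `probe_service` and `choose_hugepages_method` are hypothetical names, and `probe_service` only stands in for the real `systemctl status tenstorrent-hugepages.service` call so the logic can be run on a machine without the service.

```shell
#!/usr/bin/env bash
set -eo pipefail

# Hypothetical stand-in for `sudo systemctl status tenstorrent-hugepages.service`.
# It simply returns the exit code it is given, so no systemd is required.
probe_service() {
  return "$1"
}

# Pick the hugepages method from a status exit code.
# Per the commit's own comment, exit code 4 from systemctl means "not found".
choose_hugepages_method() {
  local status=0
  # `|| status=$?` records a failure instead of letting `set -e` abort.
  probe_service "$1" || status=$?
  if [ "$status" -eq 4 ]; then
    echo "rc.local"
  else
    echo "systemd"
  fi
}

choose_hugepages_method 4   # unit not found: fall back to rc.local
choose_hugepages_method 0   # unit found: restart the service
choose_hugepages_method 3   # any other code is also treated as "found"
```

Treating every code other than 4 as "found" mirrors the diff, which restarts the service regardless of whether it is currently active.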

### Checklist
- [ ] Post commit CI passes
- [ ] Blackhole Post commit (if applicable)
- [x] [Model regression CI testing
passes](https://github.com/tenstorrent/tt-metal/actions/runs/12290324750)
- [ ] Device performance regression CI testing passes (if applicable)
- [ ] New/Existing tests provide coverage for changes

---------

Co-authored-by: Raymond Kim <[email protected]>
2 people authored and mcw-anasuya committed Jan 2, 2025
1 parent b7427b2 commit a6b2786
Showing 8 changed files with 26 additions and 443 deletions.
4 changes: 1 addition & 3 deletions .github/actions/ensure-active-weka-mount/action.yml
@@ -12,6 +12,4 @@ runs:
 - name: Ensure active weka mount
 shell: bash
 run: |
-sudo systemctl restart mnt-MLPerf.mount
-sudo /etc/rc.local
-ls -al /mnt/MLPerf/bit_error_tests
+timeout --preserve-status 300 ./.github/scripts/cloud_utils/mount_weka.sh
22 changes: 21 additions & 1 deletion .github/scripts/cloud_utils/mount_weka.sh
@@ -3,5 +3,25 @@
 set -eo pipefail

 sudo systemctl restart mnt-MLPerf.mount
-sudo /etc/rc.local
 ls -al /mnt/MLPerf/bit_error_tests
+
+check_hugepages_service_status=0 && ( sudo systemctl status tenstorrent-hugepages.service ) || check_hugepages_service_status=$?
+# Exit code 4 for systemctl means not found
+if [ $check_hugepages_service_status -eq 4 ]; then
+    echo "::warning title=weka-mount-hugepages-service-not-found::Hugepages service not found. Using old rc.local method"
+    sudo /etc/rc.local
+else
+    echo "::notice title=weka-mount-hugepages-service-found::Hugepages service found. Command returned with exit code $check_hugepages_service_status. Restarting it so we can ensure hugepages are available"
+    sudo systemctl restart tenstorrent-hugepages.service
+fi
+
+# Wait until the hugepages are written as the above are not blocking
+hugepages_check_start=$(date +%s)
+hugepages_check_timeout=60
+while [[ "$(cat "/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages")" -eq 0 ]]; do
+    sleep 1
+    if (( $(date +%s) - hugepages_check_start > hugepages_check_timeout )); then
+        echo "::error title=weka-mount-hugepages-not-set::nr_hugepages is still 0 after $hugepages_check_timeout seconds. Please let infra team know via issue."
+        exit 1
+    fi
+done
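
The wait loop added to `mount_weka.sh` is a poll-with-deadline pattern: read a counter, sleep, and give up once a time budget is spent. Below is a small sketch of the same idea as a reusable function. The name `wait_for_nonzero` and the temporary demo file are assumptions made for illustration; the real script reads the sysfs `nr_hugepages` file directly and exits instead of returning.

```shell
#!/usr/bin/env bash
set -eo pipefail

# Poll a counter file until it holds a non-zero value or the timeout expires.
# Arguments: <file> <timeout_seconds>. Returns 0 on success, 1 on timeout.
wait_for_nonzero() {
  local file=$1 timeout=$2
  local start
  start=$(date +%s)
  while [[ "$(cat "$file")" -eq 0 ]]; do
    if (( $(date +%s) - start >= timeout )); then
      echo "value in $file is still 0 after $timeout seconds" >&2
      return 1
    fi
    sleep 1
  done
}

# Demo against a temporary file rather than the real sysfs path.
demo_file=$(mktemp)
echo 0 > "$demo_file"
( sleep 1; echo 2 > "$demo_file" ) &   # simulate a service writing the count later
wait_for_nonzero "$demo_file" 5 && echo "ready"
wait
rm -f "$demo_file"
```

Unlike the script in the diff, this variant checks the deadline before sleeping and returns a status instead of calling `exit`, which makes it easier to reuse and to test.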
6 changes: 1 addition & 5 deletions .github/workflows/full-regressions-and-models.yaml
@@ -28,11 +28,7 @@ jobs:
 runs-on: ["model-runner-${{ matrix.arch }}", "in-service"]
 steps:
 - uses: tenstorrent/tt-metal/.github/actions/checkout-with-submodule-lfs@main
-- name: Ensure weka mount is active
-run: |
-sudo systemctl restart mnt-MLPerf.mount
-sudo /etc/rc.local
-ls -al /mnt/MLPerf/bit_error_tests
+- uses: ./.github/actions/ensure-active-weka-mount
 - name: Set up dyanmic env vars for build
 run: |
 echo "TT_METAL_HOME=$(pwd)" >> $GITHUB_ENV
1 change: 0 additions & 1 deletion .github/workflows/package-and-release.yaml
@@ -177,7 +177,6 @@ jobs:
 CHANGELOG.txt
 README.md
 INSTALLING.md
-infra/machine_setup/scripts/setup_hugepages.py
 ttnn-*+*.whl
 fail_on_unmatched_files: true
 create-docker-release-image:
6 changes: 1 addition & 5 deletions .github/workflows/perf-device-models-impl.yaml
@@ -27,12 +27,8 @@ jobs:
 LD_LIBRARY_PATH: ${{ github.workspace }}/build/lib
 runs-on: ${{ matrix.test-info.runs-on }}
 steps:
-- name: Ensure weka mount is active
-run: |
-sudo systemctl restart mnt-MLPerf.mount
-sudo /etc/rc.local
-ls -al /mnt/MLPerf/bit_error_tests
 - uses: tenstorrent/tt-metal/.github/actions/checkout-with-submodule-lfs@main
+- uses: ./.github/actions/ensure-active-weka-mount
 - name: Set up dynamic env vars for build
 run: |
 echo "TT_METAL_HOME=$(pwd)" >> $GITHUB_ENV
6 changes: 1 addition & 5 deletions .github/workflows/single-card-demo-tests-impl.yaml
@@ -41,11 +41,7 @@ jobs:
 if: ${{ matrix.test-group.name == 'N300_performance' }}
 run: |
 sudo cpupower frequency-set -g performance
-- name: Ensure weka mount is active
-run: |
-sudo systemctl restart mnt-MLPerf.mount
-sudo /etc/rc.local
-ls -al /mnt/MLPerf/bit_error_tests
+- uses: ./.github/actions/ensure-active-weka-mount
 - name: Set up dynamic env vars for build
 run: |
 echo "TT_METAL_HOME=$(pwd)" >> $GITHUB_ENV
6 changes: 1 addition & 5 deletions .github/workflows/test-dispatch.yaml
@@ -67,13 +67,9 @@ jobs:
 runs-on: ${{ fromJSON(inputs.runner-label) }}
 steps:
 - uses: tenstorrent/tt-metal/.github/actions/checkout-with-submodule-lfs@main
-- name: Ensure weka mount is active
+- uses: ./.github/actions/ensure-active-weka-mount
 timeout-minutes: 3
 if: ${{ inputs.arch != 'blackhole' }}
-run: |
-sudo systemctl restart mnt-MLPerf.mount
-sudo /etc/rc.local
-ls -al /mnt/MLPerf/bit_error_tests
 - name: Set up dyanmic env vars for build
 run: |
 echo "TT_METAL_HOME=$(pwd)" >> $GITHUB_ENV