test_convnet_mnist.py in ttnn integration tests is starting to fail on all cards #16824
First noticed it failing on run 6924 of nightly, but re-runs of 6923 and 6922 fail as well, so it started failing more often around this time. Unsure what's happening. Is there a model cache for this?
I can't run this model locally.
@tt-rkim let's disable this on WH and GS everywhere; it's blocking the device perf pipeline.
Oh, device perf? OK, one sec.
skipped (cherry picked from commit 76ffe7d)
mnist is also causing issues despite the skip. Refer to this re-run: https://github.com/tenstorrent/tt-metal/actions/runs/12777133171. It was passing on attempt 1, then everything fails on attempt 2. Something odd is going on in the environment.
I've skipped both mnist and convnet mnist. @eyonland @boris-drazic I believe either or both of you manage the third-party members who work on the convnet mnist and mnist ttnn integration tests. Please triage and ask the authors to take a look at this and fix it.
I have skipped both tests.
@kkeerthana0573 can you please take a look? From the schedules, it seems you are working on Convnet_MNIST.
Any update on this?
@boris-drazic, @tt-rkim,
I am retriggering the CIs to debug further.
Please check out what's in main. Also note that we only seem to see this behaviour if you run the whole ttnn integration test suite:
Hmm... if you're able to run a couple more times to show it's not non-deterministic on a recent base of main, I will be more comfortable re-enabling the test.
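For repeating a test locally to check for non-determinism, a minimal sketch along these lines may help. It is not part of the tt-metal tooling: the helper names `rerun` and `is_flaky` and the test path in the usage comment are hypothetical, and it only compares process exit codes across attempts.

```python
import subprocess
import sys

# Hypothetical base command: invoke pytest through the current interpreter.
PYTEST = [sys.executable, "-m", "pytest"]


def rerun(cmd, attempts=3):
    """Run `cmd` `attempts` times and return the list of exit codes."""
    return [subprocess.run(cmd).returncode for _ in range(attempts)]


def is_flaky(codes):
    """Return True if the attempts disagree (some passed, some failed)."""
    return len({code == 0 for code in codes}) > 1


# Example usage (the path is a placeholder, not the real location of the test):
#   codes = rerun(PYTEST + ["path/to/test_convnet_mnist.py"], attempts=5)
#   print(codes, "flaky" if is_flaky(codes) else "consistent")
```

Since the failure reportedly shows up only when the whole ttnn integration suite runs, repeating a single test file in isolation may not reproduce it; the same helper can be pointed at the full suite command instead.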
I've triggered (Single-card) Nightly model and ttnn tests three times, and the model performed as expected in each run. Similarly, the model worked fine in the (Single-card) Device perf regressions. I've also verified the model in (Single-card) Tests for new models.
I see... OK, thank you. Please open up a PR for this so we can re-enable it. And did you take a look at regular mnist?
I've checked ConvMnist today. I will verify Mnist and then create a PR.
Thank you, please keep us updated.
@tt-rkim,
Note: Device perf CI is currently unstable, so the links for mnist and convnet_mnist have been highlighted and shared above. Corresponding PR: #16965
Not sure what's going on... there seems to be some interplay issue between tests.
For example, we see this often: https://github.com/tenstorrent/tt-metal/actions/runs/12782446765/job/35685828403
but stress tests are ok:
e150: https://github.com/tenstorrent/tt-metal/actions/runs/12796843809
n150: https://github.com/tenstorrent/tt-metal/actions/runs/12793625104
So not sure what's happening 🤷
Skipping for now
cc: @mywoodstock @pavlejosipovic