
test_convnet_mnist.py in ttnn integration tests is starting to fail on all cards #16824

Open
tt-rkim opened this issue Jan 16, 2025 · 19 comments
Labels: bug (Something isn't working), ci-bug (bugs found in CI), CNN_bug, P1

Comments

tt-rkim (Collaborator) commented Jan 16, 2025:

Not sure what's going on... there seem to be some interplay issues between tests.

For example, we see this often: https://github.com/tenstorrent/tt-metal/actions/runs/12782446765/job/35685828403

but stress tests are ok:

e150: https://github.com/tenstorrent/tt-metal/actions/runs/12796843809
n150: https://github.com/tenstorrent/tt-metal/actions/runs/12793625104

So not sure what's happening 🤷

Skipping for now

cc: @mywoodstock @pavlejosipovic

tt-rkim added the bug, ci-bug, CNN_bug, and P1 labels on Jan 16, 2025
tt-rkim added a commit that referenced this issue Jan 16, 2025
tt-rkim (Collaborator, Author) commented Jan 16, 2025:

First noticed it failing on run 6924 of the nightly pipeline, but re-runs of 6923 and 6922 are no good either... so it started failing more often around this time. Unsure what's happening. Is there a model cache for this?

pavlejosipovic (Contributor) commented:
I can't run this model locally.
First, one needs to figure out how to get convnet_mnist.pt, which isn't available on the IRD machines I'm working on (or any IRD machine?). Fortunately, I have the file stashed locally; I got it over Slack from @vigneshkeerthivasanx.
Second, the model tries to download the MNIST dataset from http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz, and that download is just failing for me.
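For illustration, one possible workaround, assuming the test fetches MNIST via torchvision (the thread doesn't confirm which loader it uses): torchvision exposes the MNIST mirror list as a class attribute, so the download can be pointed away from yann.lecun.com, which frequently rejects automated requests. The mirror URL below is an assumption about what is reachable from the CI/IRD machines.

```python
# Minimal sketch of a possible workaround, assuming torchvision is used to fetch MNIST.
# The mirror URL is an assumption; any reachable MNIST mirror would do.
from torchvision import datasets

# torchvision tries mirrors in order; replacing the list avoids the
# frequently failing yann.lecun.com host.
datasets.MNIST.mirrors = [
    "https://ossci-datasets.s3.amazonaws.com/mnist/",
]

train_set = datasets.MNIST(root="./data", train=True, download=True)
print(len(train_set))  # 60000 training images if the download succeeded
```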

pavlejosipovic (Contributor) commented:
@tt-rkim let's disable this on WH and GS everywhere; it's blocking the device perf pipeline.

tt-rkim (Collaborator, Author) commented Jan 17, 2025:

Oh device perf? OK one sec

tt-rkim added a commit that referenced this issue Jan 17, 2025
tt-rkim (Collaborator, Author) commented Jan 17, 2025:

mnist is also causing issues despite the skip.

Refer to this re-run: https://github.com/tenstorrent/tt-metal/actions/runs/12777133171 - it was passing on attempt 1, then it all failed on attempt 2.

Something whack is going on in the environment.

tt-rkim added a commit that referenced this issue Jan 17, 2025
tt-rkim added a commit that referenced this issue Jan 17, 2025
tt-rkim (Collaborator, Author) commented Jan 20, 2025:

I've skipped both mnist and convnet mnist.

@eyonland @boris-drazic I believe one or both of you manage the third-party contributors who work on the convnet mnist and mnist ttnn integration tests.

Please triage and ask the test authors to take a look at this and fix it.

tt-rkim (Collaborator, Author) commented Jan 20, 2025:

I have skipped both tests.
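As an illustration of what the skip might look like (a hedged sketch, not the exact change that landed; the function names, reason text, and omitted device fixtures are assumptions):

```python
import pytest

# Hypothetical sketch of skipping the affected tests; the real tt-metal tests
# take device fixtures and parametrizations that are omitted here.
@pytest.mark.skip(reason="#16824: failing on all cards when run in the nightly ttnn suite")
def test_convnet_mnist():
    ...


@pytest.mark.skip(reason="#16824: failing on all cards when run in the nightly ttnn suite")
def test_mnist():
    ...
```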

boris-drazic (Contributor) commented:
@kkeerthana0573 can you please take a look? From the schedules it seems you are working on Convnet_MNIST.

tt-rkim (Collaborator, Author) commented Jan 21, 2025:

Any update on this?

kkeerthana0573 (Contributor) commented:
@boris-drazic, @tt-rkim,
ConvMnist works as expected on N150 and N300 machines locally.
However, I noticed that the model has been failing in the nightly CIs over the past few days. Interestingly, the ConvMnist tests were passing in the CIs yesterday.

I am retriggering the CIs to debug further.

tt-rkim (Collaborator, Author) commented Jan 21, 2025:

Please check out what's in main.

Also note that we only seem to see this behaviour when the whole ttnn integration test suite is run: tests/scripts/single_card/nightly/run_ttnn.sh
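To make the isolated-vs-suite distinction concrete, a hedged reproduction sketch; the test paths below are assumptions based on the file names mentioned in this thread, and the real shell script may collect a different set of tests:

```python
import pytest

# Running the convnet_mnist test file on its own (reportedly passes).
# Path is assumed from the issue title and the usual tests/ttnn layout.
pytest.main(["tests/ttnn/integration_tests/convnet_mnist/test_convnet_mnist.py"])

# Running the whole nightly ttnn integration suite, roughly what
# tests/scripts/single_card/nightly/run_ttnn.sh drives; this is where
# the failure shows up.
pytest.main(["tests/ttnn/integration_tests"])
```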

kkeerthana0573 (Contributor) commented:
On the respective main commits:

  • Last week, the model passed here.
  • Five days ago, it failed here.
  • Yesterday, on the vignesh/ttnn_convnet_mnist_data_parallel branch, the model passed again here.

tt-rkim (Collaborator, Author) commented Jan 21, 2025:

Hmm... if you're able to run it a couple more times to show it's not non-deterministic on a recent base of main, I'll be more comfortable re-enabling the test.

kkeerthana0573 (Contributor) commented:
I've triggered the (Single-card) Nightly model and ttnn tests three times, and the model performed as expected in each run. Similarly, the model worked fine in the (Single-card) Device perf regressions. I've also verified the model in the (Single-card) Tests for new models pipeline.

tt-rkim (Collaborator, Author) commented Jan 21, 2025:

I see... ok, thank you.

Please open a PR for this so we can re-enable the test.

And did you take a look at regular mnist?

kkeerthana0573 (Contributor) commented:
I've checked ConvMnist today. I will verify Mnist and then create a PR.

Sudharsan-V (Contributor) commented:
cc: @mbahnasTT @saichandax

tt-rkim (Collaborator, Author) commented Jan 21, 2025:

Thank you, please keep us updated

Sudharsan-V (Contributor) commented Jan 22, 2025:

@tt-rkim,
Both the Nightly and Device perf CIs are passing successfully for MNIST and ConvNet MNIST.

Note: the Device perf CI is currently unstable, so the links for mnist and convnet_mnist have been highlighted and shared above.

Corresponding PR: #16965

hschoi4448 pushed a commit that referenced this issue Feb 20, 2025