
test_convnet_mnist.py in ttnn integration tests is starting to fail on all cards #16824

Open
tt-rkim opened this issue Jan 16, 2025 · 19 comments
Labels: bug (Something isn't working), ci-bug (bugs found in CI), CNN_bug, P1

Comments

tt-rkim (Collaborator) commented Jan 16, 2025:

Not sure what's going on... there seem to be some interplay issues between tests.

For example, we see this often: https://github.com/tenstorrent/tt-metal/actions/runs/12782446765/job/35685828403

but stress tests are ok:

e150: https://github.com/tenstorrent/tt-metal/actions/runs/12796843809
n150: https://github.com/tenstorrent/tt-metal/actions/runs/12793625104

So not sure what's happening 🤷

Skipping for now

cc: @mywoodstock @pavlejosipovic

tt-rkim added the bug, ci-bug, CNN_bug, and P1 labels on Jan 16, 2025
tt-rkim added a commit that referenced this issue Jan 16, 2025
tt-rkim (Collaborator, Author) commented Jan 16, 2025:

First noticed it failing on run 6924 of the nightly pipeline, but re-runs of 6923 and 6922 are no good either... so it started failing more often around this time. Unsure what's happening. Is there a model cache for this?

pavlejosipovic (Contributor) commented:
I can't run this model locally.
First, one needs to figure out how to get convnet_mnist.pt, which isn't available on the IRD machines I'm working on (or any IRD machine?). Fortunately, I have the file stashed locally; I got it over Slack from @vigneshkeerthivasanx.
Second, the model tries to download the MNIST dataset from http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz, and that download is just failing for me.
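For illustration, one possible workaround, assuming the test fetches MNIST via torchvision (the thread doesn't confirm which loader it uses): torchvision exposes the MNIST mirror list as a class attribute, so the download can be pointed away from yann.lecun.com, which frequently rejects automated requests. The mirror URL below is an assumption about what is reachable from the CI/IRD machines.

```python
# Minimal sketch of a possible workaround, assuming torchvision is used to fetch MNIST.
# The mirror URL is an assumption; any reachable MNIST mirror would do.
from torchvision import datasets

# torchvision tries mirrors in order; replacing the list avoids the
# frequently failing yann.lecun.com host.
datasets.MNIST.mirrors = [
    "https://ossci-datasets.s3.amazonaws.com/mnist/",
]

train_set = datasets.MNIST(root="./data", train=True, download=True)
print(len(train_set))  # 60000 training images if the download succeeded
```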

pavlejosipovic (Contributor) commented:
@tt-rkim let's disable this on WH and GS everywhere; it's blocking the device perf pipeline.

tt-rkim (Collaborator, Author) commented Jan 17, 2025:

Oh device perf? OK one sec

tt-rkim added a commit that referenced this issue Jan 17, 2025
tt-rkim (Collaborator, Author) commented Jan 17, 2025:

mnist is also causing issues despite the skip.

Refer to this re-run: https://github.com/tenstorrent/tt-metal/actions/runs/12777133171 - it was passing on attempt 1, then it all failed on attempt 2.

Something whack is going on in the environment.

tt-rkim added a commit that referenced this issue Jan 17, 2025
tt-rkim added a commit that referenced this issue Jan 17, 2025
tt-rkim (Collaborator, Author) commented Jan 20, 2025:

I've skipped both mnist and convnet mnist.

@eyonland @boris-drazic I believe one or both of you manage the third-party contributors who work on the convnet mnist and mnist ttnn integration tests.

Please triage and ask the test authors to take a look at this and fix it.

tt-rkim (Collaborator, Author) commented Jan 20, 2025:

I have skipped both tests.
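As an illustration of what the skip might look like (a hedged sketch, not the exact change that landed; the function names, reason text, and omitted device fixtures are assumptions):

```python
import pytest

# Hypothetical sketch of skipping the affected tests; the real tt-metal tests
# take device fixtures and parametrizations that are omitted here.
@pytest.mark.skip(reason="#16824: failing on all cards when run in the nightly ttnn suite")
def test_convnet_mnist():
    ...


@pytest.mark.skip(reason="#16824: failing on all cards when run in the nightly ttnn suite")
def test_mnist():
    ...
```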

boris-drazic (Contributor) commented:
@kkeerthana0573 can you please take a look? From the schedules it seems you are working on Convnet_MNIST.

tt-rkim (Collaborator, Author) commented Jan 21, 2025:

Any update on this?

kkeerthana0573 (Contributor) commented:
@boris-drazic, @tt-rkim,
ConvMnist works as expected on N150 and N300 machines locally.
However, I noticed that the model has been failing in the nightly CIs over the past few days. Interestingly, the ConvMnist tests were passing in the CIs yesterday.

I am retriggering the CIs to debug further.

tt-rkim (Collaborator, Author) commented Jan 21, 2025:

Please check out what's in main.

Also note that we only seem to see this behaviour when the whole ttnn integration test suite is run: tests/scripts/single_card/nightly/run_ttnn.sh
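To make the isolated-vs-suite distinction concrete, a hedged reproduction sketch; the test paths below are assumptions based on the file names mentioned in this thread, and the real shell script may collect a different set of tests:

```python
import pytest

# Running the convnet_mnist test file on its own (reportedly passes).
# Path is assumed from the issue title and the usual tests/ttnn layout.
pytest.main(["tests/ttnn/integration_tests/convnet_mnist/test_convnet_mnist.py"])

# Running the whole nightly ttnn integration suite, roughly what
# tests/scripts/single_card/nightly/run_ttnn.sh drives; this is where
# the failure shows up.
pytest.main(["tests/ttnn/integration_tests"])
```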

kkeerthana0573 (Contributor) commented:
On the respective main commits:

  • Last week, the model passed here.
  • Five days ago, it failed here.
  • Yesterday, on the vignesh/ttnn_convnet_mnist_data_parallel branch, the model passed again here.

tt-rkim (Collaborator, Author) commented Jan 21, 2025:

Hmm... if you're able to run it a couple more times to show it's not non-deterministic on a recent base of main, I'll be more comfortable re-enabling the test.

kkeerthana0573 (Contributor) commented:
I've triggered the (Single-card) Nightly model and ttnn tests three times, and the model performed as expected in each run. Similarly, the model worked fine in the (Single-card) Device perf regressions. I've also verified the model in the (Single-card) Tests for new models pipeline.

tt-rkim (Collaborator, Author) commented Jan 21, 2025:

I see... ok, thank you.

Please open a PR for this so we can re-enable the test.

And did you take a look at regular mnist?

kkeerthana0573 (Contributor) commented:
I've checked ConvMnist today. I will verify Mnist and then create a PR.

Sudharsan-V (Contributor) commented:
cc: @mbahnasTT @saichandax

tt-rkim (Collaborator, Author) commented Jan 21, 2025:

Thank you, please keep us updated

Sudharsan-V (Contributor) commented Jan 22, 2025:

@tt-rkim,
Both the Nightly and Device perf CIs are passing successfully for MNIST and ConvNet MNIST.

Note: the Device perf CI is currently unstable, so the links for mnist and convnet_mnist have been highlighted and shared above.

Corresponding PR: #16965

hschoi4448 pushed a commit that referenced this issue Feb 20, 2025