-
Hi, I coded and tested the MNIST example, and it works on my Linux laptop and my colleague's Windows laptop. The tests were run on the CPU (1.7.0-backport-mkl-linux-x86_64). When I execute it on a compute node (Linux, CentOS), the process crashes. At startup I get this log:
All seems fine, but then I get:
So my first question is: are there any OS dependencies? I don't recall installing MKL on the initial test machines. The log file shows:
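For context, the MNIST example in question is essentially DJL's stock training flow. A rough sketch of what that looks like, assuming a DJL release from the 1.7.0-backport era of the MXNet engine (the Mnist dataset and Mlp helper live in different packages across DJL releases, so this is illustrative rather than the exact code from this thread):

```java
import java.io.IOException;

import ai.djl.Model;
import ai.djl.basicdataset.Mnist; // later releases: ai.djl.basicdataset.cv.classification.Mnist
import ai.djl.basicmodelzoo.basic.Mlp;
import ai.djl.ndarray.types.Shape;
import ai.djl.training.DefaultTrainingConfig;
import ai.djl.training.EasyTrain;
import ai.djl.training.Trainer;
import ai.djl.training.evaluator.Accuracy;
import ai.djl.training.listener.TrainingListener;
import ai.djl.training.loss.Loss;
import ai.djl.translate.TranslateException;

public class MnistTraining {

    public static void main(String[] args) throws IOException, TranslateException {
        int batchSize = 32;

        // Download and prepare the MNIST training set.
        Mnist mnist = Mnist.builder().setSampling(batchSize, true).build();
        mnist.prepare();

        try (Model model = Model.newInstance("mlp")) {
            // Small multilayer perceptron: 784 inputs, 10 classes, two hidden layers.
            model.setBlock(new Mlp(Mnist.IMAGE_HEIGHT * Mnist.IMAGE_WIDTH, 10, new int[] {128, 64}));

            DefaultTrainingConfig config =
                    new DefaultTrainingConfig(Loss.softmaxCrossEntropyLoss())
                            .addEvaluator(new Accuracy())
                            .addTrainingListeners(TrainingListener.Defaults.logging());

            try (Trainer trainer = model.newTrainer(config)) {
                trainer.initialize(new Shape(1, Mnist.IMAGE_HEIGHT * Mnist.IMAGE_WIDTH));
                EasyTrain.fit(trainer, 2, mnist, null);
            }
        }
    }
}
```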
-
This should work out of the box; CentOS 7 is supported. Can you share the entire crash log for us to check?
-
@hmf Can this crash be consistently reproduced? Do you see the same error with Python?
-
Unfortunately, I was not able to try out MXNet directly via Java/Scala. MXNet's Scala/Java API is built against Scala 2.11; Scala is now at 2.13 and I am using Scala 3 (the next version), so the library is binary incompatible and I cannot use it. The admins of the compute cluster have not responded, so the next step is to try Python.
-
It seems the error is due to the set-up of the compute node, something to do with the JDK. It is still not working, but I can now start the training session.
-
@hmf Can you provide a minimal reproducible case and share your code, if this can be reproduced?
-
The issue has been solved. The compute cluster uses a cluster-management and job-scheduling framework that by default assigns 5333 MB to each job. We had used JVM parameters to reduce the JVM's maximum heap to 129 MB; however, the rest was being used by DJL/MXNet, hence the error. We can now run the MNIST example by assigning a little over 6 GB. That seems a little excessive, and I assume it is because the full data set is being loaded into memory.
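For anyone who hits the same wall: -Xmx only caps the Java heap, while the MXNet engine allocates tensor storage natively through JNI, outside the heap but inside the scheduler's per-job memory limit. A small illustration of the distinction (the 129 MB figure is just this thread's setting, not a recommendation):

```java
public class HeapVsNative {
    public static void main(String[] args) {
        // Reports the heap cap the JVM was started with (e.g. java -Xmx129m HeapVsNative).
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("JVM max heap: " + maxHeapMb + " MB");

        // Anything the MXNet engine allocates for tensors goes through JNI into
        // native memory and is NOT counted in the number above. The scheduler's
        // per-job limit therefore has to cover heap + native allocations, which is
        // why the MNIST run needed roughly 6 GB even with a tiny -Xmx.
    }
}
```

Running this with `java -Xmx129m HeapVsNative` prints a heap cap of roughly 129 MB, while the process itself can still grow far beyond that through native allocations.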