-
Hi, I coded and tested the MNIST example, and it works on my Linux laptop and my colleague's Windows laptop. The tests were run on the CPU (1.7.0-backport-mkl-linux-x86_64). When I execute it on a compute node (Linux, CentOS), the process crashes. At startup I get this log:
All seems fine, but then I get:
So my first question is: are there any OS dependencies? I don't recall installing MKL on the initial test machines. The log file shows:
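For context, the MNIST example in question is essentially DJL's stock training flow. A rough sketch of what that looks like, assuming a DJL release from the 1.7.0-backport era of the MXNet engine (the Mnist dataset and Mlp helper live in different packages across DJL releases, so this is illustrative rather than the exact code from this thread):

```java
import java.io.IOException;

import ai.djl.Model;
import ai.djl.basicdataset.Mnist; // later releases: ai.djl.basicdataset.cv.classification.Mnist
import ai.djl.basicmodelzoo.basic.Mlp;
import ai.djl.ndarray.types.Shape;
import ai.djl.training.DefaultTrainingConfig;
import ai.djl.training.EasyTrain;
import ai.djl.training.Trainer;
import ai.djl.training.evaluator.Accuracy;
import ai.djl.training.listener.TrainingListener;
import ai.djl.training.loss.Loss;
import ai.djl.translate.TranslateException;

public class MnistTraining {

    public static void main(String[] args) throws IOException, TranslateException {
        int batchSize = 32;

        // Download and prepare the MNIST training set.
        Mnist mnist = Mnist.builder().setSampling(batchSize, true).build();
        mnist.prepare();

        try (Model model = Model.newInstance("mlp")) {
            // Small multilayer perceptron: 784 inputs, 10 classes, two hidden layers.
            model.setBlock(new Mlp(Mnist.IMAGE_HEIGHT * Mnist.IMAGE_WIDTH, 10, new int[] {128, 64}));

            DefaultTrainingConfig config =
                    new DefaultTrainingConfig(Loss.softmaxCrossEntropyLoss())
                            .addEvaluator(new Accuracy())
                            .addTrainingListeners(TrainingListener.Defaults.logging());

            try (Trainer trainer = model.newTrainer(config)) {
                trainer.initialize(new Shape(1, Mnist.IMAGE_HEIGHT * Mnist.IMAGE_WIDTH));
                EasyTrain.fit(trainer, 2, mnist, null);
            }
        }
    }
}
```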
-
This should work out of the box; CentOS 7 is supported. Can you share the entire crash log for us to check?
-
@hmf Can this crash be consistently reproduced? Do you see the same error with Python?
-
Unfortunately, I was not able to try out MXNet directly via Java/Scala. MXNet's Scala/Java API is built against Scala 2.11; Scala is now at 2.13 and I am using Scala 3 (the next version), so the library is binary incompatible and I cannot use it. The admins of the compute cluster have not responded, so the next step is to try Python.
-
It seems the error is due to the set-up of the compute node, something to do with the JDK. It is still not working, but I can now start the training session.
-
@hmf Can you provide a minimal reproducible case and share your code, if this can be reproduced?
-
The issue has been solved. The compute cluster uses a cluster-management and job-scheduling framework that by default assigns 5333 MB to each job. We had used JVM parameters to reduce the JVM's maximum heap to 129 MB; however, the rest was being used by DJL/MXNet, hence the error. We can now run the MNIST example by assigning a little over 6 GB. That seems a little excessive, and I assume it is because the full data set is being loaded into memory.
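For anyone who hits the same wall: -Xmx only caps the Java heap, while the MXNet engine allocates tensor storage natively through JNI, outside the heap but inside the scheduler's per-job memory limit. A small illustration of the distinction (the 129 MB figure is just this thread's setting, not a recommendation):

```java
public class HeapVsNative {
    public static void main(String[] args) {
        // Reports the heap cap the JVM was started with (e.g. java -Xmx129m HeapVsNative).
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("JVM max heap: " + maxHeapMb + " MB");

        // Anything the MXNet engine allocates for tensors goes through JNI into
        // native memory and is NOT counted in the number above. The scheduler's
        // per-job limit therefore has to cover heap + native allocations, which is
        // why the MNIST run needed roughly 6 GB even with a tiny -Xmx.
    }
}
```

Running this with `java -Xmx129m HeapVsNative` prints a heap cap of roughly 129 MB, while the process itself can still grow far beyond that through native allocations.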