Handle CUDA crash in DFL #1

Open
Poikilos opened this issue Dec 1, 2020 · 2 comments

Poikilos commented Dec 1, 2020

What happens:

  • Press p to update the preview.
  • Windows reports "Python has stopped working" (click Close to close the preview window).
  • Press a key in the console window.
  • The console window still has to be closed forcibly.

Example:

Starting. Press "Enter" to stop training and save model.

Trying to do the first iteration. If an error occurs, reduce the model parameters.

You are training the model from scratch. It is strongly recommended to use a pretrained model to speed up the training and improve the quality.

[16:11:19][#000002][0635ms][3.9305][5.1394]
2020-12-01 16:18:07.554484: E tensorflow/stream_executor/cuda/cuda_driver.cc:1011] could not synchronize on CUDA context: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure ::
2020-12-01 16:18:07.554484: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 16:18:07.571921: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1
Press any key to continue . . .

Longer example:

error
[17:07:32][#003960][0564ms][0.4501][0.4884]
2020-12-01 17:10:03.624275: E tensorflow/stream_executor/cuda/cuda_driver.cc:1011] could not synchronize on CUDA context: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure ::
Error: GPU sync failed
Traceback (most recent call last):
  File "C:\PortableApps\DeepFaceLab\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1334, in _do_call
    return fn(*args)
  File "C:\PortableApps\DeepFaceLab\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\PortableApps\DeepFaceLab\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: GPU sync failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\PortableApps\DeepFaceLab\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\mainscripts\Trainer.py", line 123, in trainerThread
    iter, iter_time = model.train_one_iter()
  File "C:\PortableApps\DeepFaceLab\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\models\ModelBase.py", line 462, in train_one_iter
    losses = self.onTrainOneIter()
  File "C:\PortableApps\DeepFaceLab\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\models\Model_SAEHD\Model.py", line 636, in onTrainOneIter
    src_loss, dst_loss = self.src_dst_train (warped_src, target_src, target_srcm_all, warped_dst, target_dst, target_dstm_all)
  File "C:\PortableApps\DeepFaceLab\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\models\Model_SAEHD\Model.py", line 503, in src_dst_train
    self.target_dstm_all:target_dstm_all,
  File "C:\PortableApps\DeepFaceLab\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
    run_metadata_ptr)
  File "C:\PortableApps\DeepFaceLab\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\PortableApps\DeepFaceLab\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run
    run_metadata)
  File "C:\PortableApps\DeepFaceLab\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: GPU sync failed
Done.
2020-12-01 17:10:05.238839: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 17:10:05.239041: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 17:10:05.240316: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 17:10:05.240804: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 17:10:05.241408: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 17:10:05.241955: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 17:10:05.242391: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 17:10:05.242900: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 17:10:05.243282: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 17:10:05.243671: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 17:10:05.244093: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 17:10:05.244562: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 17:10:05.244931: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 17:10:05.245296: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 17:10:05.245666: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 17:10:05.246011: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 17:10:05.246581: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 17:10:05.246836: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 17:10:05.247205: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 17:10:05.247541: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 17:10:05.247952: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 17:10:05.248310: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-01 17:10:05.248695: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0000027549A0C1F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
Press any key to continue . . .

Poikilos commented

In at least the following case, pressing Enter two or more times (after one initial press to wake CMD if it is paused) does not close the window:

2020-12-13 13:06:13.366743: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-13 13:06:13.366993: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1
Press any key to continue . . .
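
The log above ends with an `F` (fatal) line. As far as I know, TensorFlow aborts the whole process at that point from its C++ side, so a Python `try`/`except` inside the trainer would never see it. One possible workaround is to start training from a small supervisor process that only inspects the exit code. This is a rough sketch, not DFL code: `run_supervised` and the command passed to it are hypothetical names for illustration.

```python
import subprocess
import sys


def run_supervised(cmd, max_restarts=0):
    """Run cmd as a child process and return the list of its exit codes.

    The child is started again after a non-zero exit code, at most
    max_restarts extra times. A hard abort in the child only shows up
    here as a non-zero return code.
    """
    codes = []
    for _ in range(max_restarts + 1):
        code = subprocess.run(cmd).returncode
        codes.append(code)
        if code == 0:
            break  # the child finished normally, so stop here
    return codes


if __name__ == "__main__":
    # Hypothetical usage: the child command is given on the command line.
    if len(sys.argv) > 1:
        exit_codes = run_supervised(sys.argv[1:], max_restarts=1)
        print("exit codes:", exit_codes)
        sys.exit(exit_codes[-1])
```

This would not by itself dismiss the Windows "Python has stopped working" dialog. It only gives the calling script a place to notice the crash and decide what to do next.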

Poikilos commented

At least the crashes that end with dozens of "error destroying CUDA event ... unspecified launch failure" messages are preceded by a catchable Python exception. However, a second exception, tensorflow.python.framework.errors_impl.InternalError, is raised while the first one is being handled, and that second exception is not handled:

2020-12-18 02:02:47.626717: E tensorflow/stream_executor/cuda/cuda_driver.cc:1011] could not synchronize on CUDA context: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure ::
Error: GPU sync failed
Traceback (most recent call last):
  File "C:\PortableApps\DeepFaceLab\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1334, in _do_call
    return fn(*args)
  File "C:\PortableApps\DeepFaceLab\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\PortableApps\DeepFaceLab\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: GPU sync failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\PortableApps\DeepFaceLab\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\mainscripts\Trainer.py", line 123, in trainerThread
    iter, iter_time = model.train_one_iter()
  File "C:\PortableApps\DeepFaceLab\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\models\ModelBase.py", line 462, in train_one_iter
    losses = self.onTrainOneIter()
  File "C:\PortableApps\DeepFaceLab\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\models\Model_SAEHD\Model.py", line 636, in onTrainOneIter
    src_loss, dst_loss = self.src_dst_train (warped_src, target_src, target_srcm_all, warped_dst, target_dst, target_dstm_all)
  File "C:\PortableApps\DeepFaceLab\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\models\Model_SAEHD\Model.py", line 503, in src_dst_train
    self.target_dstm_all:target_dstm_all,
  File "C:\PortableApps\DeepFaceLab\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
    run_metadata_ptr)
  File "C:\PortableApps\DeepFaceLab\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\PortableApps\DeepFaceLab\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run
    run_metadata)
  File "C:\PortableApps\DeepFaceLab\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: GPU sync failed
Done.
2020-12-18 02:02:49.788062: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-18 02:02:49.788633: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-18 02:02:49.791128: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-18 02:02:49.792440: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-18 02:02:49.793279: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-18 02:02:49.794154: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-18 02:02:49.795138: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-18 02:02:49.795996: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-18 02:02:49.796840: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-18 02:02:49.797798: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-18 02:02:49.798452: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-18 02:02:49.798953: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-18 02:02:49.799513: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-18 02:02:49.800099: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-18 02:02:49.800524: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-18 02:02:49.800982: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-18 02:02:49.801434: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-18 02:02:49.801877: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-18 02:02:49.802329: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-18 02:02:49.802781: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-18 02:02:49.803228: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-18 02:02:49.803696: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-12-18 02:02:49.804144: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 000001E56CD838A0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
Press any key to continue . . .
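
A minimal sketch of how the catchable exception described above might be handled around the training step. It assumes the loop can simply stop once the GPU has failed. `train_loop` and `FakeInternalError` are hypothetical names. The stand-in class is there only so the sketch runs without TensorFlow; real code would pass `(tf.errors.InternalError,)` as `gpu_errors`.

```python
import sys
import traceback


class FakeInternalError(Exception):
    """Stand-in for tensorflow.python.framework.errors_impl.InternalError,
    so this sketch runs without TensorFlow or a GPU."""


def train_loop(train_one_iter, max_iters, gpu_errors=(FakeInternalError,)):
    """Call train_one_iter() up to max_iters times.

    Returns 0 if every iteration finished, or 1 as soon as one of
    gpu_errors is raised. No further iterations run after a failure.
    """
    for i in range(max_iters):
        try:
            train_one_iter()
        except gpu_errors:
            # Print the traceback once, then stop instead of retrying
            # on a CUDA context that has already failed.
            traceback.print_exc(file=sys.stderr)
            print("GPU failure at iteration %d; stopping." % i, file=sys.stderr)
            return 1
    return 0
```

The return value could be passed to `sys.exit()` so that the calling batch file sees a non-zero exit code instead of waiting on a key press.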
