
Fails to run training notebook #2 #4

Open
natalieadye opened this issue Feb 20, 2023 · 16 comments

@natalieadye

Hi guys,
I tried retraining our model and I'm running into memory allocation problems. Can you help?? Not sure if this is the proper place to ask, but I thought I'd try, since it worked before in a different environment.


ResourceExhaustedError Traceback (most recent call last)
Cell In[46], line 2
1 median_size = calculate_extents(Y, np.median)
----> 2 fov = np.array(model._axes_tile_overlap('ZYX'))
3 print(f"median object size: {median_size}")
4 print(f"network field of view : {fov}")

File ~/miniconda3/envs/stardist-linux/lib/python3.8/site-packages/stardist/models/base.py:1084, in StarDistBase._axes_tile_overlap(self, query_axes)
1082 self._tile_overlap
1083 except AttributeError:
-> 1084 self._tile_overlap = self._compute_receptive_field()
1085 overlap = dict(zip(
1086 self.config.axes.replace('C',''),
1087 tuple(max(rf) for rf in self._tile_overlap)
1088 ))
1089 return tuple(overlap.get(a,0) for a in query_axes)

File ~/miniconda3/envs/stardist-linux/lib/python3.8/site-packages/stardist/models/base.py:1069, in StarDistBase._compute_receptive_field(self, img_size)
1067 z = np.zeros_like(x)
1068 x[(0,)+mid+(slice(None),)] = 1
-> 1069 y = self.keras_model.predict(x)[0][0,...,0]
1070 y0 = self.keras_model.predict(z)[0][0,...,0]
1071 grid = tuple((np.array(x.shape[1:-1])/np.array(y.shape)).astype(int))

File ~/miniconda3/envs/stardist-linux/lib/python3.8/site-packages/keras/utils/traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)
67 filtered_tb = _process_traceback_frames(e.__traceback__)
68 # To get the full stack trace, call:
69 # tf.debugging.disable_traceback_filtering()
---> 70 raise e.with_traceback(filtered_tb) from None
71 finally:
72 del filtered_tb

File ~/miniconda3/envs/stardist-linux/lib/python3.8/site-packages/tensorflow/python/eager/execute.py:54, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
52 try:
53 ctx.ensure_initialized()
---> 54 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
55 inputs, attrs, num_outputs)
56 except core._NotOkStatusException as e:
57 if name is not None:

ResourceExhaustedError: Graph execution error:

SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;f411f7a4e10f780d;/job:localhost/replica:0/task:0/device:GPU:0;edge_33_IteratorGetNext;0:0
[[{{node IteratorGetNext/_2}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[Op:__inference_predict_function_2811]

@thawn

thawn commented Feb 20, 2023

Hi Natalie,
thank you very much for posting this issue. Your computer is running out of graphics memory during training.
The first thing I would try in this situation would be to reduce the batch size. However, if that does not help, you may need to split your data into smaller tiles.
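
For example, in the StarDist 3D training notebook both settings are passed to the model configuration; a minimal sketch (the rays, grid and channel values below are placeholders, keep whatever your notebook already uses):

from stardist import Rays_GoldenSpiral
from stardist.models import Config3D, StarDist3D

conf = Config3D(
    rays             = Rays_GoldenSpiral(96),  # placeholder, keep your own value
    grid             = (1,2,2),                # placeholder, keep your own value
    n_channel_in     = 1,                      # placeholder, keep your own value
    train_patch_size = (32,96,96),             # smaller patches -> the data is split into smaller tiles
    train_batch_size = 1,                      # smaller batch size -> less GPU memory per training step
)
model = StarDist3D(conf, name='stardist', basedir='models')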

For more detailed help, it would be great to have access to the code and a small sample data set. I have created a private repository for this on the TU Gitlab, where you can upload your code:

https://gitlab.mn.tu-dresden.de/bia-pol/stardist-training

I am also tagging @lazigu and @zoccoler , since they have more experience with stardist than me.

@lazigu

lazigu commented Feb 20, 2023

Hi, Natalie (@natalieadye),

You mentioned that it worked before in a different environment without out-of-memory errors. Could you confirm that the same dataset and Stardist parameters were used? If so, most likely some other processes are occupying part of the memory, leaving Stardist less of it. As Till mentioned, the parameters that can be changed to avoid OOM errors are the batch size (I typically used a batch size of 1 on a laptop with 32 GB RAM and an 8 GB GPU) and the patch size, which determines into how many overlapping patches the data is tiled.
Also, if Stardist was previously running on the CPU and is now running on the GPU, that can result in OOM errors, because GPUs typically have less memory.

@zoccoler
Contributor

Hi @natalieadye,

I agree with @thawn: you may have to use a different batch size. Sending the whole notebook would be ideal.

If you are running the Stardist example notebooks, three cells above the one where you get the error you should find the batch size and patch size, like this:


    train_patch_size = (48,96,96),
    train_batch_size = 2,

Could you send what you have in your notebook? I would try halving these numbers.
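
For instance (a sketch only, keeping all other configuration arguments as they are in your notebook), halving them would look like this:

    train_patch_size = (24,48,48),  # halved patch size
    train_batch_size = 1,           # halved batch size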

@natalieadye
Author

natalieadye commented Feb 20, 2023

Thanks all for your comments.
@lazigu yes, that's correct - I used the exact same data and parameters in this notebook in a different environment and it worked previously, but not now. That's the troubling thing.
To all: I tried reducing the patch size to 32,96,96 and the batch size to 1 --> still no good.
I uploaded my code to the nataliesdata branch of the repository Till started for me above.
Note - I just uploaded some test data - there are a lot more files in the train data, but I didn't think it was necessary to upload them all. Let me know if you want more.

@natalieadye
Author

Ahh, one more thing - I did install gputools in this stardist-linux env: pip install gputools

@thawn

thawn commented Feb 20, 2023

I tried reducing the patch size to 32,96,96 and the batch size to 1 --> still no good

That indeed sounds like another process (maybe gputools) is using up the GPU memory.

Looking through the training notebook, I did not find any obvious candidates, so I suspect it is some other process or notebook.

You can check the GPU memory usage in a terminal with nvidia-smi:

  1. press ctrl+alt+t to open a terminal
  2. type nvidia-smi

The output looks something like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:84:00.0 Off |                    0 |
| N/A   30C    P8    26W / 175W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 | <== look here
+-----------------------------------------------------------------------------+

At the bottom (where I wrote <== look here), you will see a list of processes and how much GPU memory they use.

Then shut down these processes (such as other notebooks, Firefox, or even other users that are logged in to the workstation).

Or just post the output here and we may be able to help you choose which process to shut down.

If all of the above does not help (e.g. because the process that occupies the memory is an important system process), you can also change the following cell in the notebook:

if use_gpu:
    from csbdeep.utils.tf import limit_gpu_memory
    # adjust as necessary: limit GPU memory to be used by TensorFlow to leave some to OpenCL-based computations
    limit_gpu_memory(0.8,total_memory=24)
    # alternatively, try this:
    #limit_gpu_memory(None, allow_growth=True)

You could try changing 0.8 to 0.7 in the line limit_gpu_memory(0.8, total_memory=24), or comment that line out entirely and use the line limit_gpu_memory(None, allow_growth=True) instead.
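
Concretely, the adjusted cell could look like this (a sketch; keep only one of the two limit_gpu_memory calls active):

if use_gpu:
    from csbdeep.utils.tf import limit_gpu_memory
    # option 1: reserve a smaller fraction of the GPU memory for TensorFlow
    limit_gpu_memory(0.7, total_memory=24)
    # option 2 (instead of option 1): let TensorFlow allocate memory on demand rather than upfront
    # limit_gpu_memory(None, allow_growth=True)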

@natalieadye
Author

Hi all. I had already suspected the same, but there was really nothing running on the GPU. In fact, I had restarted the computer several times just to make sure nothing was holding up the GPU - still no go. I just removed the old environment and started a fresh one (WITHOUT gputools), and it works. So the problem was really with my installation of gputools - good to know!

@thawn

thawn commented Feb 21, 2023

This is indeed good to know. I recommend writing that info about gputools at the top of the training notebook (where you already have the information about closing other programs and logging out other users).

I am closing this issue as resolved.

@zoccoler
Contributor

I have some updates, and I will probably create an issue in the stardist repository to check this, but first I need to understand what the problem actually is.

  1. I believe this line in the notebook is wrong (Cell 10 of this notebook):
    use_gpu = False and gputools_available()
    This will always set use_gpu to False (see the quick check after this list).

  2. Even with use_gpu being False, it looks like it uses the GPU. Here is the output of my test on Windows while training:
    (screenshot: nvidia-smi output)
    @thawn, can you check if this is right? Maybe I can't see this in Windows (check this)?

  3. I then added gputools to the .yml file, so that it installs with conda, and changed the line mentioned above to use_gpu = True and gputools_available(), which now, in the environment with gputools, evaluates to True. Then, while training, I get the same output:
    (screenshot: nvidia-smi output)
    I also had to use this line limit_gpu_memory(None, allow_growth=True) instead of this limit_gpu_memory(0.8) in Cell 11, or limit_gpu_memory(0.8, total_memory=4096) in my case (because it asks for total_memory).
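
A quick check of the boolean from item 1 (illustrative snippet only):

from stardist import gputools_available

use_gpu = False and gputools_available()
print(use_gpu)  # always False: `False and ...` short-circuits, so gputools_available() is never even called

use_gpu = True and gputools_available()
print(use_gpu)  # True only if gputools can actually be imported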

To sum up, either everything is working without gputools and with use_gpu being False (weird...), or it is not using the GPU properly or not at all. In the latter case, I would like to test a new .yml file that includes gputools.

@natalieadye if you notice something is still off, can you please send me an email so that I can test these options in place?

@natalieadye
Author

Well, it's strange, because in my notebook 2 https://gitlab.mn.tu-dresden.de/dyelabatpol/organoids/stardist3d/-/blob/main/2_training.ipynb, that configuration cell says "use_gpu = True and gputools_available()", so that's what I was doing and it wasn't working. (Maybe we changed this together last year when I first started using Stardist???) I didn't change the line about limit_gpu_memory, though.

When I now run the notebook without gputools, the GPU is definitely occupied according to nvidia-smi, so I presume it's being used.

@haesleinhuepf
Member

Hi all,

just a minor side-note. I just had exactly the same error on my laptop. Plugging in an external GPU with more memory fixed the problem. When I run

from stardist import gputools_available
gputools_available()

The output is:

False

which makes sense, as I do NOT have gputools installed. My conda environment is very similar to Natalie's, and thus I'd say it's a purely internal StarDist problem, not related to gputools.

@thawn
Copy link

thawn commented Feb 22, 2023

@zoccoler the output of nvidia-smi looks suspicious, because the GPU usage is low (only 33-52%) and it does not show that any GPU memory is used. On the other hand, I have only used nvidia-smi on Mac/Linux so far, so this behavior may be normal on Windows.

@zoccoler
Contributor

@thawn, I was running on the default example dataset; that's why the GPU percentage was low. I was curious about the N/A in GPU memory usage (which may be normal on Windows?).

@natalieadye yes, we probably changed that line at some point.

@haesleinhuepf thanks for these extra tests; they seem to show the GPU gets used anyway.

I created an issue in the stardist repository here; let's see what the developers can tell us about it.

@haesleinhuepf
Member

A hint from Till (@thawn) might be worth a try: we can limit the memory used by TensorFlow. This might help with various kinds of memory-related errors.

import pyclesperanto_prototype as cle
from csbdeep.utils.tf import limit_gpu_memory

# configure GPU
gpu_memory_in_mb = int(cle.get_device().device.global_mem_size / 1024 / 1024)
# adjust as necessary: limit GPU memory to be used by 
# TensorFlow to leave some to OpenCL-based computations
limit_gpu_memory(0.5, total_memory=gpu_memory_in_mb)

@uschmidt83

There seems to be quite some confusion about GPU use in general and the use_gpu = True and gputools_available() expression specifically.

  • StarDist uses TensorFlow, and TensorFlow uses the GPU if you installed it with GPU support (CUDA, cuDNN).
  • The use_gpu flag is only about GPU-based data generation for training (via OpenCL). Note that this is mentioned in the help and a comment in the notebook. (Using GPU-based data generation can in some cases speed up training substantially.)
  • If use_gpu is true, we need to keep TensorFlow from grabbing all the GPU memory upfront (which it does by default). Hence the limit_gpu_memory stuff in our notebooks. If this isn't done, out-of-memory errors are to be expected.
  • I think our provided notebook contains use_gpu = False and gputools_available(), meaning that we disable use_gpu by default. If a user wants to enable it and uses use_gpu = True and gputools_available(), the and gputools_available() part acts as a "guard" that disables the flag whenever gputools is not installed (see the sketch below).
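
Putting that together, a typical configuration cell looks roughly like this (a sketch only; the memory value is a placeholder in MB, following @haesleinhuepf's example above):

from stardist import gputools_available
from csbdeep.utils.tf import limit_gpu_memory

# opt in to OpenCL-based data generation, guarded so it stays off when gputools is missing
use_gpu = True and gputools_available()

if use_gpu:
    # keep TensorFlow from grabbing all GPU memory upfront,
    # leaving some for the OpenCL-based data generation
    limit_gpu_memory(0.8, total_memory=24000)  # adjust total_memory (MB) to your GPU
    # alternatively: limit_gpu_memory(None, allow_growth=True)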

@zoccoler
Contributor

Thanks @uschmidt83, I think that is clear now.

Then, I suspect we were getting OOM errors because we were running out of GPU memory at the data generation step, but not afterwards.
