Fails to run training notebook #2 #4
Comments
Hi Natalie, for more detailed help it would be great to have access to the code and a small sample data set. I have created a private repository for this on the TU Gitlab, where you can upload your code: https://gitlab.mn.tu-dresden.de/bia-pol/stardist-training

I am also tagging @lazigu and @zoccoler, since they have more experience with StarDist than me.
Hi Natalie (@natalieadye), you mentioned that it worked before in a different environment without out-of-memory errors. Could you confirm that the same dataset and StarDist parameters were used? If so, most likely some other process is occupying part of the memory, leaving less of it for StarDist. As Till mentioned, the parameters that can be changed to avoid OOM errors are the batch size (I typically used a batch size of 1 on a laptop with 32 GB RAM and an 8 GB GPU) and the patch size, which determines into how many overlapping patches the data is tiled.
Hi @natalieadye, I agree with @thawn, you may have to use a different batch size. Sending the whole notebook would be ideal. If you are running the StarDist example notebooks, three cells above the one where you get the error you should find the cell that sets the batch size and patch size (see the sketch below).
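For reference, here is a rough, self-contained sketch of what such a configuration cell looks like in the StarDist 3D example notebook. It is not Natalie's actual notebook; the number of rays, `grid`, `train_patch_size`, and `train_batch_size` are placeholder values to adapt to your data and GPU memory.

```python
# Sketch of a StarDist 3D configuration cell; all concrete values are example placeholders.
from stardist import Rays_GoldenSpiral, gputools_available
from stardist.models import Config3D

use_gpu = False and gputools_available()   # OpenCL-based data generation via gputools (optional)
rays    = Rays_GoldenSpiral(96)            # number of rays used for the star-convex shapes
grid    = (1, 2, 2)                        # subsampling grid; coarser grid needs less memory

conf = Config3D(
    rays             = rays,
    grid             = grid,
    use_gpu          = use_gpu,
    n_channel_in     = 1,
    train_patch_size = (48, 96, 96),       # halve these if you run out of GPU memory
    train_batch_size = 2,                  # try 1 on GPUs with little memory
)
print(conf)
```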
Could you send what you have in your notebook? I would try to halve these numbers.
Thanks all for your comments.
Ahh, one more thing: I did install gputools in this stardist-linux env with `pip install gputools`.
That indeed sounds like another process (maybe gputools) is using up the GPU memory. Looking through the training notebook, I did not find any obvious candidates, so I suspect it is some other process or notebook. You can check the GPU memory usage in a terminal with `nvidia-smi`. The processes using the GPU are listed at the bottom of its output; shut down those processes (such as other notebooks, Firefox, or even other users that are logged in to the workstation), or just post the output here and we may be able to help you choose which process to shut down.

If all of the above does not help (e.g. because the process that occupies the memory is an important system process), you can also change the following cell in the notebook:

```python
if use_gpu:
    from csbdeep.utils.tf import limit_gpu_memory
    # adjust as necessary: limit GPU memory to be used by TensorFlow to leave some to OpenCL-based computations
    limit_gpu_memory(0.8, total_memory=24)
    # alternatively, try this:
    # limit_gpu_memory(None, allow_growth=True)
```

For example, you could try to change 0.8 to 0.7 in the `limit_gpu_memory` line.
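If it is easier to check from inside a notebook than from a terminal, here is a minimal sketch (not from the original thread) that shells out to `nvidia-smi` and parses its CSV query output; the helper name `gpu_memory_usage` is made up for illustration and assumes the NVIDIA driver tools are on the PATH.

```python
# Minimal sketch: query current GPU memory usage by calling nvidia-smi from Python.
import subprocess

def gpu_memory_usage():
    """Return a list of (used_MiB, total_MiB) tuples, one entry per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [tuple(int(v) for v in line.split(",")) for line in out.strip().splitlines()]

for i, (used, total) in enumerate(gpu_memory_usage()):
    print(f"GPU {i}: {used} / {total} MiB in use")
```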
Hi all. I had already suspected the same, but there was really nothing running on the GPU. In fact, I had restarted the computer several times just to make sure nothing was holding up the GPU; still no go. I just removed the old environment and started afresh (WITHOUT gputools), and now it works. So the problem was really with my installation of gputools. Good to know!
This is indeed good to know. I recommend writing that info about gputools into the head of the training notebook (where you already have the information about closing other programs and logging out other users). I am closing this issue as resolved.
I have some updates, and I will probably create an issue in the stardist repository to check this, but first I need to understand what the problem actually is.
To sum up, everything seems to be working without gputools. @natalieadye, if you notice something is still off, can you please send me an email so that I can test these options in place?
Well, it's strange, because in my notebook 2 (https://gitlab.mn.tu-dresden.de/dyelabatpol/organoids/stardist3d/-/blob/main/2_training.ipynb) that configuration cell says `use_gpu = True and gputools_available()`, so that's what I was doing, and it wasn't working. (Maybe we changed this together last year when I first started using StarDist?) I didn't change the line about `limit_gpu_memory`, though. When I now run the notebook without gputools, the GPU is definitely occupied according to nvidia-smi, so I presume it's being used.
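As a quick sanity check (a generic TensorFlow 2.x sketch, not part of the original thread), one can also confirm from inside the notebook that TensorFlow itself sees and uses the GPU, rather than inferring it only from nvidia-smi:

```python
# Sketch: verify that TensorFlow detects the GPU and places computations on it.
import tensorflow as tf

print("Visible GPUs:", tf.config.list_physical_devices("GPU"))

a = tf.random.uniform((1024, 1024))
b = tf.matmul(a, a)
print("Test matmul ran on:", b.device)   # expect a device string ending in GPU:0 if the GPU is used
```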
Hi all, just a minor side note. I just had exactly the same error on my laptop; plugging in an external GPU with more memory fixed the problem. This happens even though I do NOT have gputools installed. My conda environment is very similar to Natalie's, and thus I'd say it's a purely internal StarDist problem, not related to gputools.
@zoccoler the output of nvidia-smi looks suspicious, because the GPU usage is low (only 33-52%) and it does not show that any GPU memory is used. On the other hand, I have only used nvidia-smi on Mac/Linux so far, so this behavior may be normal on Windows.
@thawn, I was running on the default example dataset, that's why the GPU percentage was low.

@natalieadye yes, we probably changed that line at some point.

@haesleinhuepf thanks for these extra tests, they seem to show that the GPU gets used anyway. I created an issue at the stardist repository here; let's see what the developers can tell us about it.
A hint from Till (@thawn) might be worth a try: we can limit the memory used by TensorFlow. This might help with various kinds of memory-related errors.
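For completeness, here is a minimal sketch of how this can be done with the plain TensorFlow 2.x API, as an alternative to csbdeep's `limit_gpu_memory` helper quoted earlier in this thread; the 8192 MiB cap is an arbitrary example value.

```python
# Sketch: restrict TensorFlow's GPU memory use (must run before any TF op touches the GPU).
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Option 1: let TensorFlow grow its allocation on demand instead of grabbing all memory.
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    # Option 2 (use instead of option 1): hard-cap the first GPU at e.g. 8192 MiB,
    # leaving the remainder for OpenCL/gputools or other processes.
    # tf.config.set_logical_device_configuration(
    #     gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=8192)]
    # )
```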
There seems to be quite some confusion about GPU use in general and the `use_gpu` option in particular.
Thanks @uschmidt83, I think that is clear now. Then I suspect we were getting OOM errors because we were running out of GPU memory at the data generation step, but not afterwards.
Hi guys,
I tried retraining our model and I'm running into memory allocation problems. Can you help?? Not sure if this is the proper place to ask, but I thought I'd try, since it worked before in a different environment.
ResourceExhaustedError Traceback (most recent call last)
Cell In[46], line 2
1 median_size = calculate_extents(Y, np.median)
----> 2 fov = np.array(model._axes_tile_overlap('ZYX'))
3 print(f"median object size: {median_size}")
4 print(f"network field of view : {fov}")
File ~/miniconda3/envs/stardist-linux/lib/python3.8/site-packages/stardist/models/base.py:1084, in StarDistBase._axes_tile_overlap(self, query_axes)
1082 self._tile_overlap
1083 except AttributeError:
-> 1084 self._tile_overlap = self._compute_receptive_field()
1085 overlap = dict(zip(
1086 self.config.axes.replace('C',''),
1087 tuple(max(rf) for rf in self._tile_overlap)
1088 ))
1089 return tuple(overlap.get(a,0) for a in query_axes)
File ~/miniconda3/envs/stardist-linux/lib/python3.8/site-packages/stardist/models/base.py:1069, in StarDistBase._compute_receptive_field(self, img_size)
1067 z = np.zeros_like(x)
1068 x[(0,)+mid+(slice(None),)] = 1
-> 1069 y = self.keras_model.predict(x)[0][0,...,0]
1070 y0 = self.keras_model.predict(z)[0][0,...,0]
1071 grid = tuple((np.array(x.shape[1:-1])/np.array(y.shape)).astype(int))
File ~/miniconda3/envs/stardist-linux/lib/python3.8/site-packages/keras/utils/traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)
67 filtered_tb = _process_traceback_frames(e.__traceback__)
68 # To get the full stack trace, call:
69 # `tf.debugging.disable_traceback_filtering()`
---> 70 raise e.with_traceback(filtered_tb) from None
71 finally:
72 del filtered_tb
File ~/miniconda3/envs/stardist-linux/lib/python3.8/site-packages/tensorflow/python/eager/execute.py:54, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
52 try:
53 ctx.ensure_initialized()
---> 54 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
55 inputs, attrs, num_outputs)
56 except core._NotOkStatusException as e:
57 if name is not None:
ResourceExhaustedError: Graph execution error:
SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;f411f7a4e10f780d;/job:localhost/replica:0/task:0/device:GPU:0;edge_33_IteratorGetNext;0:0
[[{{node IteratorGetNext/_2}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[Op:__inference_predict_function_2811]