
Cuda out of memory issue #15

Open

LaFeuilleMorte opened this issue Dec 13, 2024 · 4 comments

Comments

LaFeuilleMorte commented Dec 13, 2024

Hi, I've run into a CUDA OOM issue even with a small dataset of 126 images. I use MCMC Gaussian splatting and set cap_max=150,000 to reduce the memory footprint, but the process still crashed with an OOM error on my A100 GPU.

| File "/aistudio/workspace/system-default/envs/droidsplat/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap |  
-- | -- | --
  |   | 2024-12-13 16:15:09.130 | self.run() |  
  |   | 2024-12-13 16:15:09.130 | File "/aistudio/workspace/system-default/envs/droidsplat/lib/python3.10/multiprocessing/process.py", line 108, in run |  
  |   | 2024-12-13 16:15:09.130 | self._target(*self._args, **self._kwargs) |  
  |   | 2024-12-13 16:15:09.130 | File "/aistudio/workspace/aigc/wangqihang013/aigc3d/repos/neural_rendering/sfm/DROID-Splat/src/slam.py", line 310, in tracking |  
  |   | 2024-12-13 16:15:09.130 | self.frontend(timestamp, image, depth, intrinsic, gt_pose, static_mask=static_mask) |  
  |   | 2024-12-13 16:15:09.130 | File "/aistudio/workspace/system-default/envs/droidsplat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl |  
  |   | 2024-12-13 16:15:09.130 | return self._call_impl(*args, **kwargs) |  
  |   | 2024-12-13 16:15:09.130 | File "/aistudio/workspace/system-default/envs/droidsplat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl |  
  |   | 2024-12-13 16:15:09.130 | return forward_call(*args, **kwargs) |  
  |   | 2024-12-13 16:15:09.130 | File "/aistudio/workspace/system-default/envs/droidsplat/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context |  
  |   | 2024-12-13 16:15:09.130 | return func(*args, **kwargs) |  
  |   | 2024-12-13 16:15:09.130 | File "/aistudio/workspace/aigc/wangqihang013/aigc3d/repos/neural_rendering/sfm/DROID-Splat/src/frontend.py", line 39, in forward |  
  |   | 2024-12-13 16:15:09.130 | self.optimizer() # Local Bundle Adjustment |  
  |   | 2024-12-13 16:15:09.130 | File "/aistudio/workspace/aigc/wangqihang013/aigc3d/repos/neural_rendering/sfm/DROID-Splat/src/frontend.py", line 220, in call |  
  |   | 2024-12-13 16:15:09.130 | self.__update() |  
  |   | 2024-12-13 16:15:09.130 | File "/aistudio/workspace/aigc/wangqihang013/aigc3d/repos/neural_rendering/sfm/DROID-Splat/src/frontend.py", line 100, in __update |  
  |   | 2024-12-13 16:15:09.130 | self.graph.rm_factors(self.graph.age > self.max_age, store=True) |  
  |   | 2024-12-13 16:15:09.130 | File "/aistudio/workspace/system-default/envs/droidsplat/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast |  
  |   | 2024-12-13 16:15:09.130 | return func(*args, **kwargs) |  
  |   | 2024-12-13 16:15:09.130 | File "/aistudio/workspace/aigc/wangqihang013/aigc3d/repos/neural_rendering/sfm/DROID-Splat/src/factor_graph.py", line 178, in rm_factors |  
  |   | 2024-12-13 16:15:09.130 | self.corr = self.corr[~mask] |  
  |   | 2024-12-13 16:15:09.130 | File "/aistudio/workspace/aigc/wangqihang013/aigc3d/repos/neural_rendering/sfm/DROID-Splat/src/modules/corr.py", line 72, in getitem |  
  |   | 2024-12-13 16:15:09.130 | self.corr_pyramid[i] = self.corr_pyramid[i][index] |  
  |   | 2024-12-13 16:15:09.130 | torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 334.00 MiB. GPU 0 has a total capacty of 39.42 GiB of which 289.06 MiB is free. Process 48412 has 31.97 GiB memory in use. Process 65823 has 2.55 GiB memory in use. Process 67515 has 1.78 GiB memory in use. Process 68107 has 416.00 MiB memory in use. Process 69865 has 2.02 GiB memory in use. Process 70462 has 416.00 MiB memory in use. Of the allocated memory 652.43 MiB is allocated by PyTorch, and 647.57 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
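Reading the error message itself: the processes listed hold nearly all of the 39.42 GiB (one alone holds 31.97 GiB), while the crashing process only has about 650 MiB allocated by PyTorch when it fails. A minimal sketch (plain PyTorch, not DROID-Splat code) for checking how much memory is actually free before a run and for opting into the allocator option the message mentions; the max_split_size_mb value of 128 is just an example:

import os

# Must be set before CUDA is first used in this process; max_split_size_mb is
# the fragmentation knob the OOM message points to (128 is an arbitrary value).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch

free_b, total_b = torch.cuda.mem_get_info(0)  # bytes free / total on GPU 0
print(f"GPU 0: {free_b / 2**30:.2f} GiB free of {total_b / 2**30:.2f} GiB")

# Memory held by *this* process only; other processes on a shared GPU do not
# show up here, only in the free/total numbers above.
print(f"allocated by this process: {torch.cuda.memory_allocated(0) / 2**30:.2f} GiB")
print(f"reserved  by this process: {torch.cuda.memory_reserved(0) / 2**30:.2f} GiB")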

ChenHoy (Owner) commented Dec 14, 2024

That is very odd. What resolution do the images have? Can you maybe share the log of what parameters you used?

LaFeuilleMorte (Author) commented Dec 17, 2024

> That is very odd. What resolution do the images have? Can you maybe share the log of what parameters you used?

Sorry for the late reply. My config was:

cam:
  # original camera parameters
  H: 960
  W: 960
  H_out: 480 # 360
  W_out: 480 # 640
  H_edge: 0
  W_edge: 0
  # We calibrated the camera once in prgbd mode without any scale optimization,
  # which roughly gives the right parameters
  fx: 275 # heuristic: 1296.0
  fy: 275 # heuristic: 1296.0
  cx: 480 # heuristic: 960
  cy: 480 # heuristic: 540
  calibration_txt: ''
  camera_model: "pinhole"

And my run command:

python run.py data=Custom/hd.yaml \
    data.input_folder={MY_DATA_FOLDER} \
    tracking=base \
    stride=1 \
    mode=rgb \
    mapping.mcmc.cap_max=150000

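For context on how far the calibrated focal length is from the heuristic one in the config comments, a quick sanity check (plain trigonometry, not DROID-Splat code), assuming the intrinsics are specified at the original 960x960 resolution:

import math

def horizontal_fov_deg(fx: float, width: int) -> float:
    """Horizontal field of view implied by a pinhole focal length."""
    return math.degrees(2.0 * math.atan(width / (2.0 * fx)))

W = 960
print(horizontal_fov_deg(275.0, W))   # ~120 deg with the calibrated fx
print(horizontal_fov_deg(1296.0, W))  # ~41 deg with the heuristic fx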

ChenHoy (Owner) commented Dec 17, 2024

Hey,
I don't understand the calibration part: so you only roughly get the right parameters after calibration? Does the OOM happen with intrinsics optimization or without?

Your resolution is not too big, you don't have a lot of images, and you don't seem to have a lot of Gaussians either, so I don't really understand why this OOM would happen. Can you give more info on where/when it is triggered? Could you try to run the SLAM system without the backend by setting run_backend=False? That way we can rule out that it is the global Bundle Adjustment during which we OOM.
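
For reference, a sketch of that run with the backend disabled, reusing the overrides from the earlier comment and appending the run_backend flag mentioned above (assuming it is passed as an override like the others):

python run.py data=Custom/hd.yaml \
    data.input_folder={MY_DATA_FOLDER} \
    tracking=base \
    stride=1 \
    mode=rgb \
    mapping.mcmc.cap_max=150000 \
    run_backend=False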
