Error in training #13

Closed
SaharGezer opened this issue Mar 14, 2021 · 3 comments

@SaharGezer

I tried to train BiSeNetV2 on my own data following your instructions, but I got the following error:

INFO:root:creating dataset and data loaders
INFO:root:loaded 6800 annotations from /home/nvidia/Documents/Projects/SemanticSegmentation/RailSem19_LabelMe/train
INFO:root:use augmentation: True
INFO:root:categories: ['rail']
INFO:root:loaded 1700 annotations from /home/nvidia/Documents/Projects/SemanticSegmentation/RailSem19_LabelMe/val
INFO:root:use augmentation: False
INFO:root:categories: ['rail']
INFO:root:creating dataloaders with 16 workers and a batch-size of 2
INFO:root:creating BiSeNetV2 and optimizer with initial lr of 0.0001
INFO:root:creating model with categories: ['rail']
INFO:root:creating trainer and evaluator engines
INFO:root:creating summary writer with tag seg_train
INFO:root:attaching lr scheduler
INFO:root:attaching event driven calls
INFO:root:training...
INFO:ignite.engine.engine.Engine:Engine run starting with max_epochs=30.
/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py:552: UserWarning: Setting attributes on ParameterDict is not supported.
warnings.warn("Setting attributes on ParameterDict is not supported.")
/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py:645: UserWarning: nn.ParameterDict is being used with DataParallel but this is not supported. This dict will appear empty for the models replicated on each GPU except the original one.
warnings.warn("nn.ParameterDict is being used with DataParallel but this is not "
ERROR:ignite.engine.engine.Engine:Current run is terminating due to exception: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 304, in forward
x_semantic = self.semantic(x)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 193, in forward
x = self.stage5(x)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 146, in forward
x_gap = self.conv_project(x_gap)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 23, in forward
return F.leaky_relu(self.bn(self.conv(x)))
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 136, in forward
self.weight, self.bias, bn_training, exponential_average_factor, self.eps)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 2054, in batch_norm
_verify_batch_size(input.size())
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 2037, in _verify_batch_size
raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 128, 1, 1])

ERROR:ignite.engine.engine.Engine:Engine run is terminating due to exception: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 304, in forward
x_semantic = self.semantic(x)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 193, in forward
x = self.stage5(x)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 146, in forward
x_gap = self.conv_project(x_gap)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 23, in forward
return F.leaky_relu(self.bn(self.conv(x)))
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 136, in forward
self.weight, self.bias, bn_training, exponential_average_factor, self.eps)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 2054, in batch_norm
_verify_batch_size(input.size())
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 2037, in _verify_batch_size
raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 128, 1, 1])

Traceback (most recent call last):
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/ignite/engine/engine.py", line 775, in _internal_run
self._handle_exception(e)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/ignite/engine/engine.py", line 469, in _handle_exception
raise e
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/ignite/engine/engine.py", line 745, in _internal_run
time_taken = self._run_once_on_dataset()
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/ignite/engine/engine.py", line 850, in _run_once_on_dataset
self._handle_exception(e)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/ignite/engine/engine.py", line 469, in _handle_exception
raise e
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/ignite/engine/engine.py", line 833, in _run_once_on_dataset
self.state.output = self._process_function(self, self.state.batch)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/ignite/engine/init.py", line 103, in _update
y_pred = model(x)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
ValueError: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 304, in forward
x_semantic = self.semantic(x)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 193, in forward
x = self.stage5(x)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 146, in forward
x_gap = self.conv_project(x_gap)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 23, in forward
return F.leaky_relu(self.bn(self.conv(x)))
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 136, in forward
self.weight, self.bias, bn_training, exponential_average_factor, self.eps)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 2054, in batch_norm
_verify_batch_size(input.size())
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 2037, in _verify_batch_size
raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 128, 1, 1])

How can I solve it?

@WillBrennan
Owner

Can you list the categories in your training set and in your validation set, and how many images are in each? Likewise, the parameters you launched training with?

This error often occurs if you decrease the batch-size to 1.
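
For reference, here is a minimal standalone sketch (illustrative only, not code from this repo) that reproduces the check failing in the traceback above: in training mode `nn.BatchNorm2d` needs more than one value per channel, and the `[1, 128, 1, 1]` tensor produced after global average pooling gives it exactly one.

```python
# Minimal reproduction sketch (illustrative only, not code from this repo).
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(128)
bn.train()  # the check only applies in training mode

x_ok = torch.randn(2, 128, 1, 1)   # two samples -> two values per channel
x_bad = torch.randn(1, 128, 1, 1)  # one sample  -> one value per channel

bn(x_ok)        # fine
try:
    bn(x_bad)   # raises the same ValueError as in the log
except ValueError as err:
    print(err)  # Expected more than 1 value per channel when training, got input size torch.Size([1, 128, 1, 1])
```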

@SaharGezer
Author

I have only one category:
INFO:root:categories: ['rail']

Training set: 6800 images
Validation set: 1700 images

Network: BiSeNetV2

@WillBrennan
Owner

My mistake; I forgot those are in the logs above! This error is caused by batch norm receiving a very small batch. Increasing the batch size will fix it. It's failing a check in batch norm that requires more than one value per channel in order to compute the per-channel sample standard deviation.

Calling train with --batch-size 8 should fix it.
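
With the batch size of 2 in the log and the DataParallel warnings above, each GPU replica likely receives a single sample, which is why a larger --batch-size avoids the check. As a hedged complement (the dataset and loader below are hypothetical, not this repository's training code), `drop_last=True` on the training DataLoader also guards against a trailing batch of one sample when the dataset size isn't divisible by the batch size:

```python
# Illustrative sketch only; the dataset and loader here are hypothetical,
# not this repository's training code.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(41, 3, 64, 64))  # 41 % 8 == 1, so the last batch would hold 1 sample

# drop_last=True discards that trailing single-sample batch, so BatchNorm in
# training mode always sees more than one value per channel.
loader = DataLoader(dataset, batch_size=8, shuffle=True, drop_last=True)

for (batch,) in loader:
    assert batch.size(0) > 1
```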
