Error in training #13

Closed
SaharGezer opened this issue Mar 14, 2021 · 3 comments

@SaharGezer

I tried to train BiSeNetV2 on my own data following your instructions, but I got the following error:

INFO:root:creating dataset and data loaders
INFO:root:loaded 6800 annotations from /home/nvidia/Documents/Projects/SemanticSegmentation/RailSem19_LabelMe/train
INFO:root:use augmentation: True
INFO:root:categories: ['rail']
INFO:root:loaded 1700 annotations from /home/nvidia/Documents/Projects/SemanticSegmentation/RailSem19_LabelMe/val
INFO:root:use augmentation: False
INFO:root:categories: ['rail']
INFO:root:creating dataloaders with 16 workers and a batch-size of 2
INFO:root:creating BiSeNetV2 and optimizer with initial lr of 0.0001
INFO:root:creating model with categories: ['rail']
INFO:root:creating trainer and evaluator engines
INFO:root:creating summary writer with tag seg_train
INFO:root:attaching lr scheduler
INFO:root:attaching event driven calls
INFO:root:training...
INFO:ignite.engine.engine.Engine:Engine run starting with max_epochs=30.
/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py:552: UserWarning: Setting attributes on ParameterDict is not supported.
warnings.warn("Setting attributes on ParameterDict is not supported.")
/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py:645: UserWarning: nn.ParameterDict is being used with DataParallel but this is not supported. This dict will appear empty for the models replicated on each GPU except the original one.
warnings.warn("nn.ParameterDict is being used with DataParallel but this is not "
ERROR:ignite.engine.engine.Engine:Current run is terminating due to exception: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 304, in forward
x_semantic = self.semantic(x)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 193, in forward
x = self.stage5(x)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 146, in forward
x_gap = self.conv_project(x_gap)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 23, in forward
return F.leaky_relu(self.bn(self.conv(x)))
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 136, in forward
self.weight, self.bias, bn_training, exponential_average_factor, self.eps)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 2054, in batch_norm
_verify_batch_size(input.size())
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 2037, in _verify_batch_size
raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 128, 1, 1])

ERROR:ignite.engine.engine.Engine:Engine run is terminating due to exception: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 304, in forward
x_semantic = self.semantic(x)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 193, in forward
x = self.stage5(x)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 146, in forward
x_gap = self.conv_project(x_gap)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 23, in forward
return F.leaky_relu(self.bn(self.conv(x)))
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 136, in forward
self.weight, self.bias, bn_training, exponential_average_factor, self.eps)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 2054, in batch_norm
_verify_batch_size(input.size())
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 2037, in _verify_batch_size
raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 128, 1, 1])

Traceback (most recent call last):
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/ignite/engine/engine.py", line 775, in _internal_run
self._handle_exception(e)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/ignite/engine/engine.py", line 469, in _handle_exception
raise e
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/ignite/engine/engine.py", line 745, in _internal_run
time_taken = self._run_once_on_dataset()
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/ignite/engine/engine.py", line 850, in _run_once_on_dataset
self._handle_exception(e)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/ignite/engine/engine.py", line 469, in _handle_exception
raise e
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/ignite/engine/engine.py", line 833, in _run_once_on_dataset
self.state.output = self._process_function(self, self.state.batch)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/ignite/engine/init.py", line 103, in _update
y_pred = model(x)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
ValueError: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 304, in forward
x_semantic = self.semantic(x)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 193, in forward
x = self.stage5(x)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 146, in forward
x_gap = self.conv_project(x_gap)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/Documents/Projects/SemanticSegmentation/semantic_segmentation/models/bisenetv2.py", line 23, in forward
return F.leaky_relu(self.bn(self.conv(x)))
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 136, in forward
self.weight, self.bias, bn_training, exponential_average_factor, self.eps)
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 2054, in batch_norm
_verify_batch_size(input.size())
File "/home/nvidia/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 2037, in _verify_batch_size
raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 128, 1, 1])

How can I solve it?

@WillBrennan
Owner

Can you list the categories in your training set and in your validation set, and how many images are in each? Likewise, the parameters you launched training with?

This error often occurs if you decrease the batch-size to 1.
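
For reference, here is a minimal standalone sketch (illustrative only, not code from this repo) that reproduces the check failing in the traceback above: in training mode `nn.BatchNorm2d` needs more than one value per channel, and the `[1, 128, 1, 1]` tensor produced after global average pooling gives it exactly one.

```python
# Minimal reproduction sketch (illustrative only, not code from this repo).
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(128)
bn.train()  # the check only applies in training mode

x_ok = torch.randn(2, 128, 1, 1)   # two samples -> two values per channel
x_bad = torch.randn(1, 128, 1, 1)  # one sample  -> one value per channel

bn(x_ok)        # fine
try:
    bn(x_bad)   # raises the same ValueError as in the log
except ValueError as err:
    print(err)  # Expected more than 1 value per channel when training, got input size torch.Size([1, 128, 1, 1])
```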

@SaharGezer
Author

I have only one category:
INFO:root:categories: ['rail']

Training set: 6800 images
Validation set: 1700 images

Network: BiSeNetV2

@WillBrennan
Owner

My mistake; I forgot those are in the logs above! This error is caused by batch norm receiving a very small batch. Increasing the batch size will fix it. It's failing a check in batch norm that requires more than one value per channel in order to compute the per-channel sample standard deviation.

Calling train with --batch-size 8 should fix it.
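
With the batch size of 2 in the log and the DataParallel warnings above, each GPU replica likely receives a single sample, which is why a larger --batch-size avoids the check. As a hedged complement (the dataset and loader below are hypothetical, not this repository's training code), `drop_last=True` on the training DataLoader also guards against a trailing batch of one sample when the dataset size isn't divisible by the batch size:

```python
# Illustrative sketch only; the dataset and loader here are hypothetical,
# not this repository's training code.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(41, 3, 64, 64))  # 41 % 8 == 1, so the last batch would hold 1 sample

# drop_last=True discards that trailing single-sample batch, so BatchNorm in
# training mode always sees more than one value per channel.
loader = DataLoader(dataset, batch_size=8, shuffle=True, drop_last=True)

for (batch,) in loader:
    assert batch.size(0) > 1
```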
