
cuda10 train issue #55

Open
mama110 opened this issue Jun 10, 2019 · 11 comments
@mama110

mama110 commented Jun 10, 2019

When I run python3 train.py CenterNet-52, I get this error (my video card is an RTX 2080 Ti, I'm using CUDA 10 and PyTorch 1.1, and I've modified batch_size and chunk_sizes to 2):

RuntimeError: Expected object of type Variable but found type CUDAType for argument #0 'result' (checked_cast_variable at /pytorch/torch/csrc/autograd/VariableTypeManual.cpp:173)

Is there a way I can run this code with PyTorch 1.1 (CUDA 10)? Thanks

@Duankaiwen
Owner

Duankaiwen commented Jun 10, 2019

Hi @mama110, if you use PyTorch 1.1, please refer to this: princeton-vl/CornerNet@3809432. You also need to delete all the compiled files before recompiling the corner pooling layers.
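
(Side note for readers hitting the same error: the failing calls are the out-variant ATen ops in the pooling extension, whose 'result' buffers are allocated with at::zeros and so arrive as plain tensors rather than autograd Variables under PyTorch 1.x. Below is a minimal sketch of the incompatible pattern and a 1.x-friendly alternative; this is an assumed illustration only, not the literal diff of princeton-vl/CornerNet@3809432.)

#include <torch/extension.h>

// Minimal illustration: out-variant call into a preallocated plain buffer
// vs. the functional variant that allocates its own result.
at::Tensor gt_sketch(at::Tensor a, at::Tensor b) {
    // Old style used by the pooling code: write into a buffer made with
    // at::zeros(...), which is not an autograd Variable on PyTorch 1.x,
    // hence the "Expected object of type Variable ..." RuntimeError:
    // at::gt_out(mask, a, b);

    // PyTorch 1.x friendly style: let the op allocate its own result,
    // which inherits Variable-ness from the inputs.
    at::Tensor mask = at::gt(a, b);  // equivalently: a > b
    return mask;
}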

@mama110
Author

mama110 commented Jun 10, 2019

@Duankaiwen I've deleted the compiled folders (build and cpools-xxxxxxxx.egg) and recompiled the corner pooling layers, but it doesn't work. Did I miss something?

@Duankaiwen
Owner

Delete all files except src, __init__.py, setup.py

@mama110
Author

mama110 commented Jun 11, 2019

I deleted all the compiled files and recompiled the corner pooling layers, but the same error comes up.

RuntimeError: Expected object of type Variable but found type CUDAType for argument #0 'result' (checked_cast_variable at /pytorch/torch/csrc/autograd/VariableTypeManual.cpp:173)

By the way, the fix in princeton-vl/CornerNet@3809432 solves the compilation problem for the corner pooling layers, but when I run train.py the aforementioned error comes up.

@Duankaiwen
Owner

Please show the full log


@mama110
Author

mama110 commented Jun 11, 2019

@Duankaiwen

kun@pupa:~/master/CenterNet-master$ python3 train.py CenterNet-104
loading all datasets...
using 4 threads
loading from cache file: cache/coco_trainval2014.pkl
No cache file found...
loading annotations into memory...
Done (t=14.11s)
creating index...
index created!
118287it [00:38, 3092.92it/s]
loading annotations into memory...
Done (t=10.89s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=9.46s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=11.28s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=9.95s)
creating index...
index created!
loading from cache file: cache/coco_minival2014.pkl
No cache file found...
loading annotations into memory...
Done (t=0.47s)
creating index...
index created!
5000it [00:01, 3069.90it/s]
loading annotations into memory...
Done (t=0.29s)
creating index...
index created!
system config...
{'batch_size': 2,
'cache_dir': 'cache',
'chunk_sizes': [2],
'config_dir': 'config',
'data_dir': './data',
'data_rng': <mtrand.RandomState object at 0x7fae23f95168>,
'dataset': 'MSCOCO',
'decay_rate': 10,
'display': 5,
'learning_rate': 0.00025,
'max_iter': 480000,
'nnet_rng': <mtrand.RandomState object at 0x7fae23f951b0>,
'opt_algo': 'adam',
'prefetch_size': 6,
'pretrain': None,
'result_dir': 'results',
'sampling_function': 'kp_detection',
'snapshot': 5000,
'snapshot_name': 'CenterNet-104',
'stepsize': 450000,
'test_split': 'testdev',
'train_split': 'trainval',
'val_iter': 500,
'val_split': 'minival',
'weight_decay': False,
'weight_decay_rate': 1e-05,
'weight_decay_type': 'l2'}
db config...
{'ae_threshold': 0.5,
'border': 128,
'categories': 80,
'data_aug': True,
'gaussian_bump': True,
'gaussian_iou': 0.7,
'gaussian_radius': -1,
'input_size': [511, 511],
'kp_categories': 1,
'lighting': True,
'max_per_image': 100,
'merge_bbox': False,
'nms_algorithm': 'exp_soft_nms',
'nms_kernel': 3,
'nms_threshold': 0.5,
'output_sizes': [[128, 128]],
'rand_color': True,
'rand_crop': True,
'rand_pushes': False,
'rand_samples': False,
'rand_scale_max': 1.4,
'rand_scale_min': 0.6,
'rand_scale_step': 0.1,
'rand_scales': array([0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3]),
'special_crop': False,
'test_scales': [1],
'top_k': 70,
'weight_exp': 8}
len of db: 118287
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
building model...
module_file: models.CenterNet-104
start prefetching data...
shuffling indices...
total parameters: 210062960
setting learning rate to: 0.00025
training start...
0%| | 0/480000 [00:00<?, ?it/s]/home/kun/.local/lib/python3.5/site-packages/torch/nn/_reduction.py:46: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))

Traceback (most recent call last):
File "train.py", line 203, in
train(training_dbs, validation_db, args.start_iter)
File "train.py", line 163, in train
nnet.set_lr(learning_rate)
File "/usr/lib/python3.5/contextlib.py", line 77, in exit
self.gen.throw(type, value, traceback)
File "/home/kun/master/CenterNet-master/utils/tqdm.py", line 23, in stdout_to_tqdm
raise exc
File "/home/kun/master/CenterNet-master/utils/tqdm.py", line 21, in stdout_to_tqdm
yield save_stdout
File "train.py", line 138, in train
training_loss, focal_loss, pull_loss, push_loss, regr_loss = nnet.train(training)
File "/home/kun/master/CenterNet-master/nnet/py_factory.py", line 93, in train
loss.backward()
File "/home/kun/.local/lib/python3.5/site-packages/torch/tensor.py", line 107, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/kun/.local/lib/python3.5/site-packages/torch/autograd/init.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
File "/home/kun/.local/lib/python3.5/site-packages/torch/autograd/function.py", line 77, in apply
return self._forward_cls.backward(self, *args)
File "/home/kun/master/CenterNet-master/models/py_utils/_cpools/init.py", line 59, in backward
output = right_pool.backward(input, grad_output)[0]
RuntimeError: Expected object of type Variable but found type CUDAType for argument #0 'result' (checked_cast_variable at /pytorch/torch/csrc/autograd/VariableTypeManual.cpp:173)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fadda674441 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7fadda673d7a in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libc10.so)
frame #2: torch::autograd::VariableType::checked_cast_variable(at::Tensor&, char const*, int) + 0x169 (0x7fadd9146419 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #3: torch::autograd::VariableType::unpack(at::Tensor&, char const*, int) + 0x9 (0x7fadd91464b9 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #4: torch::autograd::VariableType::s__th_gt_out(at::Tensor&, at::Tensor const&, at::Tensor const&) const + 0x24b (0x7fadd8f716fb in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #5: at::TypeDefault::_th_gt_out(at::Tensor&, at::Tensor const&, at::Tensor const&) const + 0x205 (0x7faddb5841a5 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #6: at::TypeDefault::gt_out(at::Tensor&, at::Tensor const&, at::Tensor const&) const + 0x62 (0x7faddb566b02 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #7: torch::autograd::VariableType::gt_out(at::Tensor&, at::Tensor const&, at::Tensor const&) const + 0x35c (0x7fadd901074c in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #8: pool_backward(at::Tensor, at::Tensor) + 0x826 (0x7fadcf17af56 in /home/kun/.local/lib/python3.5/site-packages/cpools-0.0.0-py3.5-linux-x86_64.egg/right_pool.cpython-35m-x86_64-linux-gnu.so)
frame #9: + 0x129e4 (0x7fadcf1869e4 in /home/kun/.local/lib/python3.5/site-packages/cpools-0.0.0-py3.5-linux-x86_64.egg/right_pool.cpython-35m-x86_64-linux-gnu.so)
frame #10: + 0x12afe (0x7fadcf186afe in /home/kun/.local/lib/python3.5/site-packages/cpools-0.0.0-py3.5-linux-x86_64.egg/right_pool.cpython-35m-x86_64-linux-gnu.so)
frame #11: + 0x10a16 (0x7fadcf184a16 in /home/kun/.local/lib/python3.5/site-packages/cpools-0.0.0-py3.5-linux-x86_64.egg/right_pool.cpython-35m-x86_64-linux-gnu.so)

frame #15: python3() [0x4ec2e3]
frame #19: python3() [0x4ec2e3]
frame #21: python3() [0x4fbfce]
frame #24: torch::autograd::PyFunction::apply(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> >&&) + 0x193 (0x7fae22e8d833 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)
frame #25: + 0x3108aa (0x7fadd8b058aa in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #26: torch::autograd::Engine::evaluate_function(torch::autograd::FunctionTask&) + 0x385 (0x7fadd8afe975 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #27: torch::autograd::Engine::thread_main(torch::autograd::GraphTask*) + 0xc0 (0x7fadd8b00970 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #28: torch::autograd::Engine::thread_init(int) + 0x136 (0x7fadd8afdd46 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #29: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fae22e882fa in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)
frame #30: + 0xb8c80 (0x7fadda179c80 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #31: + 0x76ba (0x7fae368036ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #32: clone + 0x6d (0x7fae3653941d in /lib/x86_64-linux-gnu/libc.so.6)

@Duankaiwen
Owner

Please show your
models/py_utils/_cpools/src/bottom_pool.cpp
models/py_utils/_cpools/src/left_pool.cpp
models/py_utils/_cpools/src/right_pool.cpp
models/py_utils/_cpools/src/top_pool.cpp

@mama110
Author

mama110 commented Jun 11, 2019

bottom_pool.cpp (the other 3 .cpp files are similar to this one)

#include <torch/extension.h>
#include <vector>

std::vector<at::Tensor> pool_forward(
    at::Tensor input
) {
    // Initialize output
    at::Tensor output = at::zeros_like(input);

    // Get height
    int64_t height = input.size(2);

    // Copy the last column
    at::Tensor input_temp  = input.select(2, 0);
    at::Tensor output_temp = output.select(2, 0);
    output_temp.copy_(input_temp);

    at::Tensor max_temp;
    for (int64_t ind = 0; ind < height - 1; ++ind) {
        input_temp  = input.select(2, ind + 1);
        output_temp = output.select(2, ind);
        max_temp    = output.select(2, ind + 1);

        at::max_out(max_temp, input_temp, output_temp);
    }

    return {
        output
    };
}

std::vector<at::Tensor> pool_backward(
    at::Tensor input,
    at::Tensor grad_output
) {
    auto output = at::zeros_like(input);

    int32_t batch   = input.size(0);
    int32_t channel = input.size(1);
    int32_t height  = input.size(2);
    int32_t width   = input.size(3);

    auto max_val = at::zeros({batch, channel, width}, torch::TensorOptions().dtype(torch::kFloat).device(torch::kCUDA));
    auto max_ind = at::zeros({batch, channel, width}, torch::TensorOptions().dtype(torch::kLong).device(torch::kCUDA));

    auto input_temp = input.select(2, 0);
    max_val.copy_(input_temp);

    max_ind.fill_(0);

    auto output_temp      = output.select(2, 0);
    auto grad_output_temp = grad_output.select(2, 0);
    output_temp.copy_(grad_output_temp);

    auto un_max_ind = max_ind.unsqueeze(2);
    auto gt_mask    = at::zeros({batch, channel, width}, torch::TensorOptions().dtype(torch::kByte).device(torch::kCUDA));
    auto max_temp   = at::zeros({batch, channel, width}, torch::TensorOptions().dtype(torch::kFloat).device(torch::kCUDA));
    for (int32_t ind = 0; ind < height - 1; ++ind) {
        input_temp = input.select(2, ind + 1);
        at::gt_out(gt_mask, input_temp, max_val);

        at::masked_select_out(max_temp, input_temp, gt_mask);
        max_val.masked_scatter_(gt_mask, max_temp);
        max_ind.masked_fill_(gt_mask, ind + 1);

        grad_output_temp = grad_output.select(2, ind + 1).unsqueeze(2);
        output.scatter_add_(2, un_max_ind, grad_output_temp);
    }

    return {
        output
    };
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def(
        "forward", &pool_forward, "Bottom Pool Forward",
        py::call_guard<py::gil_scoped_release>()
    );
    m.def(
        "backward", &pool_backward, "Bottom Pool Backward",
        py::call_guard<py::gil_scoped_release>()
    );
}
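
(Side note for anyone comparing against this file: the buffers gt_mask, max_temp, max_val, and max_ind above are created with at::zeros, so they are plain tensors rather than autograd Variables under PyTorch 1.x, and at::gt_out's first argument is exactly the 'result' named in the traceback. Below is a hedged sketch of a pool_backward written without the out-variant calls, deriving its buffers from input instead. It is one possible workaround, not necessarily identical to the upstream CornerNet fix.)

#include <torch/extension.h>
#include <vector>

// Sketch of a pool_backward that avoids the *_out calls; all buffers are
// derived from `input`, so they stay Variables under PyTorch 1.x.
std::vector<at::Tensor> pool_backward_sketch(
    at::Tensor input,
    at::Tensor grad_output
) {
    auto output = at::zeros_like(input);

    int64_t height = input.size(2);

    // Running maximum along the pooling direction and the index where it occurred.
    auto max_val = input.select(2, 0).clone();
    auto max_ind = at::zeros_like(max_val).to(at::kLong);

    // Gradient for the first slice flows straight through.
    output.select(2, 0).copy_(grad_output.select(2, 0));

    auto un_max_ind = max_ind.unsqueeze(2);
    for (int64_t ind = 0; ind < height - 1; ++ind) {
        auto input_temp = input.select(2, ind + 1);

        // gt()/masked_select() allocate their own results instead of writing
        // into preallocated plain-tensor buffers.
        auto gt_mask  = at::gt(input_temp, max_val);
        auto max_temp = at::masked_select(input_temp, gt_mask);
        max_val.masked_scatter_(gt_mask, max_temp);
        max_ind.masked_fill_(gt_mask, ind + 1);

        auto grad_output_temp = grad_output.select(2, ind + 1).unsqueeze(2);
        output.scatter_add_(2, un_max_ind, grad_output_temp);
    }

    return { output };
}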

@yulei1234

Is the problem solved? I have encountered a similar problem...

@lolongcovas

Hi all, I faced the same error. I found the solution in the CornerNet repo from princeton-vl.
