
cuda10 train issue #55

Open
mama110 opened this issue Jun 10, 2019 · 11 comments
@mama110

mama110 commented Jun 10, 2019

When I run python3 train.py CenterNet-52, I get this error (my video card is an RTX 2080 Ti, I'm using CUDA 10 and PyTorch 1.1, and I've modified batch_size and chunk_sizes to 2):

RuntimeError: Expected object of type Variable but found type CUDAType for argument #0 'result' (checked_cast_variable at /pytorch/torch/csrc/autograd/VariableTypeManual.cpp:173)

Is there a way I can run this code with PyTorch 1.1 (CUDA 10)? Thanks

@Duankaiwen
Owner

Duankaiwen commented Jun 10, 2019

Hi @mama110, if you use PyTorch 1.1, please refer to this: princeton-vl/CornerNet@3809432. You also need to delete all the compiled files before recompiling the corner pooling layers.
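
(Side note for readers hitting the same error: the failing calls are the out-variant ATen ops in the pooling extension, whose 'result' buffers are allocated with at::zeros and so arrive as plain tensors rather than autograd Variables under PyTorch 1.x. Below is a minimal sketch of the incompatible pattern and a 1.x-friendly alternative; this is an assumed illustration only, not the literal diff of princeton-vl/CornerNet@3809432.)

#include <torch/extension.h>

// Minimal illustration: out-variant call into a preallocated plain buffer
// vs. the functional variant that allocates its own result.
at::Tensor gt_sketch(at::Tensor a, at::Tensor b) {
    // Old style used by the pooling code: write into a buffer made with
    // at::zeros(...), which is not an autograd Variable on PyTorch 1.x,
    // hence the "Expected object of type Variable ..." RuntimeError:
    // at::gt_out(mask, a, b);

    // PyTorch 1.x friendly style: let the op allocate its own result,
    // which inherits Variable-ness from the inputs.
    at::Tensor mask = at::gt(a, b);  // equivalently: a > b
    return mask;
}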

@mama110
Author

mama110 commented Jun 10, 2019

@Duankaiwen I've deleted the compiled folders (build and cpools-xxxxxxxx.egg) and recompiled the corner pooling layers, but it doesn't work. Did I miss something?

@Duankaiwen
Owner

Delete all files except src, __init__.py, setup.py

@mama110
Author

mama110 commented Jun 11, 2019

I deleted all the compiled files and recompiled the corner pooling layers, but the same error comes up.

RuntimeError: Expected object of type Variable but found type CUDAType for argument #0 'result' (checked_cast_variable at /pytorch/torch/csrc/autograd/VariableTypeManual.cpp:173)

By the way, the fix in princeton-vl/CornerNet@3809432 solves the compilation problem for the corner pooling layers, but when I run train.py the aforementioned error comes up.

@Duankaiwen
Owner

Please show the full log


@mama110
Author

mama110 commented Jun 11, 2019

@Duankaiwen

kun@pupa:~/master/CenterNet-master$ python3 train.py CenterNet-104
loading all datasets...
using 4 threads
loading from cache file: cache/coco_trainval2014.pkl
No cache file found...
loading annotations into memory...
Done (t=14.11s)
creating index...
index created!
118287it [00:38, 3092.92it/s]
loading annotations into memory...
Done (t=10.89s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=9.46s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=11.28s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=9.95s)
creating index...
index created!
loading from cache file: cache/coco_minival2014.pkl
No cache file found...
loading annotations into memory...
Done (t=0.47s)
creating index...
index created!
5000it [00:01, 3069.90it/s]
loading annotations into memory...
Done (t=0.29s)
creating index...
index created!
system config...
{'batch_size': 2,
'cache_dir': 'cache',
'chunk_sizes': [2],
'config_dir': 'config',
'data_dir': './data',
'data_rng': <mtrand.RandomState object at 0x7fae23f95168>,
'dataset': 'MSCOCO',
'decay_rate': 10,
'display': 5,
'learning_rate': 0.00025,
'max_iter': 480000,
'nnet_rng': <mtrand.RandomState object at 0x7fae23f951b0>,
'opt_algo': 'adam',
'prefetch_size': 6,
'pretrain': None,
'result_dir': 'results',
'sampling_function': 'kp_detection',
'snapshot': 5000,
'snapshot_name': 'CenterNet-104',
'stepsize': 450000,
'test_split': 'testdev',
'train_split': 'trainval',
'val_iter': 500,
'val_split': 'minival',
'weight_decay': False,
'weight_decay_rate': 1e-05,
'weight_decay_type': 'l2'}
db config...
{'ae_threshold': 0.5,
'border': 128,
'categories': 80,
'data_aug': True,
'gaussian_bump': True,
'gaussian_iou': 0.7,
'gaussian_radius': -1,
'input_size': [511, 511],
'kp_categories': 1,
'lighting': True,
'max_per_image': 100,
'merge_bbox': False,
'nms_algorithm': 'exp_soft_nms',
'nms_kernel': 3,
'nms_threshold': 0.5,
'output_sizes': [[128, 128]],
'rand_color': True,
'rand_crop': True,
'rand_pushes': False,
'rand_samples': False,
'rand_scale_max': 1.4,
'rand_scale_min': 0.6,
'rand_scale_step': 0.1,
'rand_scales': array([0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3]),
'special_crop': False,
'test_scales': [1],
'top_k': 70,
'weight_exp': 8}
len of db: 118287
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
building model...
module_file: models.CenterNet-104
start prefetching data...
shuffling indices...
total parameters: 210062960
setting learning rate to: 0.00025
training start...
0%| | 0/480000 [00:00<?, ?it/s]/home/kun/.local/lib/python3.5/site-packages/torch/nn/_reduction.py:46: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))

Traceback (most recent call last):
File "train.py", line 203, in
train(training_dbs, validation_db, args.start_iter)
File "train.py", line 163, in train
nnet.set_lr(learning_rate)
File "/usr/lib/python3.5/contextlib.py", line 77, in exit
self.gen.throw(type, value, traceback)
File "/home/kun/master/CenterNet-master/utils/tqdm.py", line 23, in stdout_to_tqdm
raise exc
File "/home/kun/master/CenterNet-master/utils/tqdm.py", line 21, in stdout_to_tqdm
yield save_stdout
File "train.py", line 138, in train
training_loss, focal_loss, pull_loss, push_loss, regr_loss = nnet.train(training)
File "/home/kun/master/CenterNet-master/nnet/py_factory.py", line 93, in train
loss.backward()
File "/home/kun/.local/lib/python3.5/site-packages/torch/tensor.py", line 107, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/kun/.local/lib/python3.5/site-packages/torch/autograd/init.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
File "/home/kun/.local/lib/python3.5/site-packages/torch/autograd/function.py", line 77, in apply
return self._forward_cls.backward(self, *args)
File "/home/kun/master/CenterNet-master/models/py_utils/_cpools/init.py", line 59, in backward
output = right_pool.backward(input, grad_output)[0]
RuntimeError: Expected object of type Variable but found type CUDAType for argument #0 'result' (checked_cast_variable at /pytorch/torch/csrc/autograd/VariableTypeManual.cpp:173)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fadda674441 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7fadda673d7a in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libc10.so)
frame #2: torch::autograd::VariableType::checked_cast_variable(at::Tensor&, char const*, int) + 0x169 (0x7fadd9146419 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #3: torch::autograd::VariableType::unpack(at::Tensor&, char const*, int) + 0x9 (0x7fadd91464b9 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #4: torch::autograd::VariableType::s__th_gt_out(at::Tensor&, at::Tensor const&, at::Tensor const&) const + 0x24b (0x7fadd8f716fb in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #5: at::TypeDefault::_th_gt_out(at::Tensor&, at::Tensor const&, at::Tensor const&) const + 0x205 (0x7faddb5841a5 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #6: at::TypeDefault::gt_out(at::Tensor&, at::Tensor const&, at::Tensor const&) const + 0x62 (0x7faddb566b02 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #7: torch::autograd::VariableType::gt_out(at::Tensor&, at::Tensor const&, at::Tensor const&) const + 0x35c (0x7fadd901074c in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #8: pool_backward(at::Tensor, at::Tensor) + 0x826 (0x7fadcf17af56 in /home/kun/.local/lib/python3.5/site-packages/cpools-0.0.0-py3.5-linux-x86_64.egg/right_pool.cpython-35m-x86_64-linux-gnu.so)
frame #9: + 0x129e4 (0x7fadcf1869e4 in /home/kun/.local/lib/python3.5/site-packages/cpools-0.0.0-py3.5-linux-x86_64.egg/right_pool.cpython-35m-x86_64-linux-gnu.so)
frame #10: + 0x12afe (0x7fadcf186afe in /home/kun/.local/lib/python3.5/site-packages/cpools-0.0.0-py3.5-linux-x86_64.egg/right_pool.cpython-35m-x86_64-linux-gnu.so)
frame #11: + 0x10a16 (0x7fadcf184a16 in /home/kun/.local/lib/python3.5/site-packages/cpools-0.0.0-py3.5-linux-x86_64.egg/right_pool.cpython-35m-x86_64-linux-gnu.so)

frame #15: python3() [0x4ec2e3]
frame #19: python3() [0x4ec2e3]
frame #21: python3() [0x4fbfce]
frame #24: torch::autograd::PyFunction::apply(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> >&&) + 0x193 (0x7fae22e8d833 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)
frame #25: + 0x3108aa (0x7fadd8b058aa in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #26: torch::autograd::Engine::evaluate_function(torch::autograd::FunctionTask&) + 0x385 (0x7fadd8afe975 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #27: torch::autograd::Engine::thread_main(torch::autograd::GraphTask*) + 0xc0 (0x7fadd8b00970 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #28: torch::autograd::Engine::thread_init(int) + 0x136 (0x7fadd8afdd46 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #29: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fae22e882fa in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)
frame #30: + 0xb8c80 (0x7fadda179c80 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #31: + 0x76ba (0x7fae368036ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #32: clone + 0x6d (0x7fae3653941d in /lib/x86_64-linux-gnu/libc.so.6)

@Duankaiwen
Owner

Please show your
models/py_utils/_cpools/src/bottom_pool.cpp
models/py_utils/_cpools/src/left_pool.cpp
models/py_utils/_cpools/src/right_pool.cpp
models/py_utils/_cpools/src/top_pool.cpp

@mama110
Author

mama110 commented Jun 11, 2019

bottom_pool.cpp (the other 3 .cpp files are similar to this one)

#include <torch/extension.h>
#include <vector>

std::vector<at::Tensor> pool_forward(
    at::Tensor input
) {
    // Initialize output
    at::Tensor output = at::zeros_like(input);

    // Get height
    int64_t height = input.size(2);

    // Copy the last column
    at::Tensor input_temp  = input.select(2, 0);
    at::Tensor output_temp = output.select(2, 0);
    output_temp.copy_(input_temp);

    at::Tensor max_temp;
    for (int64_t ind = 0; ind < height - 1; ++ind) {
        input_temp  = input.select(2, ind + 1);
        output_temp = output.select(2, ind);
        max_temp    = output.select(2, ind + 1);

        at::max_out(max_temp, input_temp, output_temp);
    }

    return {
        output
    };
}

std::vector<at::Tensor> pool_backward(
    at::Tensor input,
    at::Tensor grad_output
) {
    auto output = at::zeros_like(input);

    int32_t batch   = input.size(0);
    int32_t channel = input.size(1);
    int32_t height  = input.size(2);
    int32_t width   = input.size(3);

    auto max_val = at::zeros({batch, channel, width}, torch::TensorOptions().dtype(torch::kFloat).device(torch::kCUDA));
    auto max_ind = at::zeros({batch, channel, width}, torch::TensorOptions().dtype(torch::kLong).device(torch::kCUDA));

    auto input_temp = input.select(2, 0);
    max_val.copy_(input_temp);

    max_ind.fill_(0);

    auto output_temp      = output.select(2, 0);
    auto grad_output_temp = grad_output.select(2, 0);
    output_temp.copy_(grad_output_temp);

    auto un_max_ind = max_ind.unsqueeze(2);
    auto gt_mask    = at::zeros({batch, channel, width}, torch::TensorOptions().dtype(torch::kByte).device(torch::kCUDA));
    auto max_temp   = at::zeros({batch, channel, width}, torch::TensorOptions().dtype(torch::kFloat).device(torch::kCUDA));
    for (int32_t ind = 0; ind < height - 1; ++ind) {
        input_temp = input.select(2, ind + 1);
        at::gt_out(gt_mask, input_temp, max_val);

        at::masked_select_out(max_temp, input_temp, gt_mask);
        max_val.masked_scatter_(gt_mask, max_temp);
        max_ind.masked_fill_(gt_mask, ind + 1);

        grad_output_temp = grad_output.select(2, ind + 1).unsqueeze(2);
        output.scatter_add_(2, un_max_ind, grad_output_temp);
    }

    return {
        output
    };
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def(
        "forward", &pool_forward, "Bottom Pool Forward",
        py::call_guard<py::gil_scoped_release>()
    );
    m.def(
        "backward", &pool_backward, "Bottom Pool Backward",
        py::call_guard<py::gil_scoped_release>()
    );
}
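
(Side note for anyone comparing against this file: the buffers gt_mask, max_temp, max_val, and max_ind above are created with at::zeros, so they are plain tensors rather than autograd Variables under PyTorch 1.x, and at::gt_out's first argument is exactly the 'result' named in the traceback. Below is a hedged sketch of a pool_backward written without the out-variant calls, deriving its buffers from input instead. It is one possible workaround, not necessarily identical to the upstream CornerNet fix.)

#include <torch/extension.h>
#include <vector>

// Sketch of a pool_backward that avoids the *_out calls; all buffers are
// derived from `input`, so they stay Variables under PyTorch 1.x.
std::vector<at::Tensor> pool_backward_sketch(
    at::Tensor input,
    at::Tensor grad_output
) {
    auto output = at::zeros_like(input);

    int64_t height = input.size(2);

    // Running maximum along the pooling direction and the index where it occurred.
    auto max_val = input.select(2, 0).clone();
    auto max_ind = at::zeros_like(max_val).to(at::kLong);

    // Gradient for the first slice flows straight through.
    output.select(2, 0).copy_(grad_output.select(2, 0));

    auto un_max_ind = max_ind.unsqueeze(2);
    for (int64_t ind = 0; ind < height - 1; ++ind) {
        auto input_temp = input.select(2, ind + 1);

        // gt()/masked_select() allocate their own results instead of writing
        // into preallocated plain-tensor buffers.
        auto gt_mask  = at::gt(input_temp, max_val);
        auto max_temp = at::masked_select(input_temp, gt_mask);
        max_val.masked_scatter_(gt_mask, max_temp);
        max_ind.masked_fill_(gt_mask, ind + 1);

        auto grad_output_temp = grad_output.select(2, ind + 1).unsqueeze(2);
        output.scatter_add_(2, un_max_ind, grad_output_temp);
    }

    return { output };
}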

@yulei1234

Is the problem solved? I have encountered a similar problem...

@lolongcovas

Hi all, I faced the same error. I found the solution in the CornerNet repo from princeton-vl.
