FSDv2 speedup #165
Comments
torch_scatter is not included in TorchEx. It is from https://github.com/rusty1s/pytorch_scatter.
Unfortunately, I am stuck on exporting the aten::all operation to ONNX. It seems that a PyTorch update is needed, but I can't launch FSDv2 even with PyTorch 1.9.0.
Collate and scatter have been removed, and it is unclear what to do... Perhaps I have to merge FSD into the latest mmdetection 1.3.0? In that case I would like to have at least a patch of FSD over mmdetection3d 0.5.0, so that I can apply this patch to mmdetection 1.3.0 (and then resolve the conflicts by hand). There is a 'first release' commit (fb8c92f); is it vanilla mmdetection3d 0.5.0?
Could you advise what the right way to deploy FSDv2 in TensorRT would be, if this is even possible?
I would not recommend deploying a sparse-conv-based algorithm if you are not an experienced engineer...
It took quite a long time for a couple of professional engineers at TuSimple to make it work, including the author of spconv. It is really an odyssey for beginners to deal with all these things... However, I truly appreciate your effort and attention to our work.
I am trying to deploy spconv, but there is one limitation that seems fundamental to me: the shape of spconv outputs is data-dependent, i.e.
Dense 1d matrix: Sparse 1d matrix (features, indices): Dense 1d kernel: Dense result: Sparse result:
Dense 1d matrix: Sparse 1d matrix: Dense 1d kernel: Dense result: Sparse result:
So the shape of the output matrices depends not only on the shape of the input matrices, but also on the data in them (specifically, on the data in the input indices matrix). TensorRT has limited dynamic shape support: an input tensor can have an arbitrary shape (within a min/opt/max range), but the output tensor shape must depend strictly on the input tensor shape and is calculated when the TRT engine is built. The solution is to pre-calculate an upper bound for the output tensor shape and fill only part of the output tensor at runtime, leaving the remaining elements as zeros. But at the next step I need to slice this padded tensor, and it is unclear how to do this in the general case.
Deploying spconv for FSD implies writing two plugins, a GetIndicePairs plugin and an ImplicitGemm plugin. ImplicitGemm uses the result of GetIndicePairs (a GetIndicePairs result can be reused by more than one ImplicitGemm layer). I tried to slice the GetIndicePairs result in this way: set an additional 1x1 output tensor in GetIndicePairs and populate it with real_indices_num.
where res[0] is the output indices tensor and res[8] is the additional tensor for real_indices_num. This model can be exported to ONNX with some Slice layers. The problem is that I can build a TRT engine from this ONNX, but I can't run the engine (engine->createExecutionContext() returns nullptr).
TRT engine can be run normally. So I suppose that ONNX supports dynamic slicing, but TRT does not... I can work around this, and I can imagine some schemes for passing real_indices_num from GetIndicePairs to ImplicitGemm:
Set an additional 1x1 output tensor in GetIndicePairs and a 1x1 input tensor in ImplicitGemm.
Enumerate the GetIndicePairs and ImplicitGemm layers so that each ImplicitGemm layer knows the number of its parent GetIndicePairs layer.
But the problem is: how do I perform slicing after the ImplicitGemm layer? What happens if I don't perform slicing and simply pass the zero-padded features and indices tensors further?
1 0
If slicing is not performed, the result is technically an inconsistent sparse tensor: the feature at index 0 can be both 0 and 1. It seems that a TRT plugin for Traveller59's spconv exists: But the discussion on https://github.com/traveller59/spconv/issues is almost dead...
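To make the data dependence concrete, here is a minimal pure-Python sketch of a 1D sparse convolution. It is an illustration only, not spconv's implementation: `sparse_conv1d` and `max_active_outputs` are hypothetical helpers, and the stride-1, unpadded, non-submanifold setting is an assumption chosen for simplicity.

```python
def sparse_conv1d(indices, features, kernel, length):
    """Stride-1, unpadded 1D cross-correlation over a sparse signal.

    indices/features list the active (non-empty) input positions of a
    signal of dense size `length`. An output position is active when at
    least one active input lies under the kernel window.
    """
    k = len(kernel)
    out = {}
    for idx, feat in zip(indices, features):
        # out[o] = sum_t kernel[t] * x[o + t], so input idx feeds o = idx - t.
        for t, weight in enumerate(kernel):
            o = idx - t
            if 0 <= o <= length - k:
                out[o] = out.get(o, 0.0) + weight * feat
    out_indices = sorted(out)
    return out_indices, [out[o] for o in out_indices]


def max_active_outputs(num_inputs, kernel_size, length):
    """Static upper bound on the number of active outputs."""
    return min(num_inputs * kernel_size, length - kernel_size + 1)


kernel = [1.0, 1.0, 1.0]
# Same input shape (two active points), different index data.
near = sparse_conv1d([2, 3], [1.0, 2.0], kernel, length=8)
far = sparse_conv1d([2, 6], [1.0, 2.0], kernel, length=8)
print(len(near[0]), len(far[0]), max_active_outputs(2, 3, 8))
```

Both calls receive two active points, yet the first produces 4 active outputs and the second 5. Only the bound (6 here) can be derived from shapes alone, which is why a padded output buffer plus a runtime count is needed.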
I think I must feed num_act_out_real (real_indices_num) to some FSD-specific layers so that they can perform slicing. Or perform padding like this: 1 0 And consumers of the sparse input must know that features with index -1 do not really exist.
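The two options above (pass the real count along, or pad with a sentinel index) can be sketched in plain Python. This is a hedged illustration under my own naming: `pad_sparse` and `to_dense` are hypothetical helpers and not part of spconv, FSD, or TensorRT.

```python
PAD_INDEX = -1  # sentinel: "this slot holds no real voxel"


def pad_sparse(indices, features, max_num):
    """Pad a sparse (indices, features) pair to a fixed length.

    Returns the padded lists plus num_out, the number of real entries,
    mirroring the idea of an extra 1x1 output tensor.
    """
    num_out = len(indices)
    if num_out > max_num:
        raise ValueError("max_num is smaller than the real number of entries")
    pad = max_num - num_out
    return indices + [PAD_INDEX] * pad, features + [0.0] * pad, num_out


def to_dense(indices, features, length):
    """Consumer that skips padded slots instead of slicing them away."""
    dense = [0.0] * length
    for idx, feat in zip(indices, features):
        if idx == PAD_INDEX:
            continue  # padded slot, not a real feature
        dense[idx] = feat
    return dense


idx, feat, num_out = pad_sparse([0, 3], [1.0, 2.0], max_num=4)
# Option 1: slice with the real count.
sliced = (idx[:num_out], feat[:num_out])
# Option 2: keep the padding and let the consumer ignore index -1.
dense = to_dense(idx, feat, length=5)
```

With zero padding instead of -1, the padded slots would read as "index 0, feature 0" and collide with the real entry at index 0, which is the inconsistency described earlier in the thread.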
As far as I understand, there are 2 backbones: the backbone of the segmentor, which is SimpleSparseUNet, and the backbone of SingleStageFSDv2, which is VirtualVoxelMixer. @Abyssaledge do you mean that the input size of both backbones needs to be fixed?
Yes, I believe all the inputs should be in fixed sizes.
It seems that TensorRT 8.6 supports data-dependent operations (for example NonZero). Data-dependent operations are still not supported in plugins, but there is a workaround: add an additional output tensor num_out and perform slicing after the plugin. So perhaps there is no need to fix the backbone input size. This way I translated FSDv2 to TensorRT up to Output tensors are: I verified that the outputs of the TensorRT and PyTorch versions are the same given the same input (with acceptable precision, in my opinion). I also tried an input that the network didn't see during export. Now I am trying to translate FSDv2 up to
But there is another problem: insufficient workspace. FSDv2 seems to be too large for TensorRT. I tried setting the max workspace size to 1 << 50 (which is effectively infinity), but it didn't help... I set GetIndicePairsImplicitGemmPlugin to require 1700000 bytes of workspace, and I believe this is not a problem. The other plugins require 0 bytes of workspace. The zipped ONNX is about 160 MB, so it's hard to attach it here.
If I set multiscale_features=None in
would it be a serious problem for FSDv2?
It seems that I can't deploy FSDv2 due to memory limitations (TensorRT requests more than 30 GB of GPU memory, which is impossible for me).
Why not try smaller channels or depth?
I am trying to convert FSDv2 to ONNX (and next to TensorRT), but there is an error:
RuntimeError: ONNX export failed on an operator with unrecognized namespace torch_scatter::scatter_max. If you are trying to export a custom operator, make sure you registered it with the right domain and version.
It seems that I must convert TorchEx operations to ONNX first. How difficult is it?
Do you have any plans to speed up FSDv2?
FSDv2 timings on our point clouds range from 120 to 180 ms. I want to speed it up to at least 50 ms; otherwise it seems impossible to integrate FSDv2 into a real autonomous driving system...
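For orientation, any ONNX mapping or runtime plugin for the failing operator has to reproduce what scatter_max computes. Below is a minimal pure-Python reference for the 1D case. It is a sketch for reasoning about the semantics, not the torch_scatter implementation. The handling of empty groups (value 0, argmax equal to the source length) is my assumption about the library's convention and should be checked against its documentation.

```python
def scatter_max(src, index, dim_size):
    """Reference 1D scatter-max: out[g] = max of src[i] where index[i] == g.

    Returns (out, argmax). Assumption: groups that receive no element get
    value 0 and argmax len(src), an out-of-range marker.
    """
    out = [None] * dim_size
    arg = [len(src)] * dim_size
    for i, (value, group) in enumerate(zip(src, index)):
        if out[group] is None or value > out[group]:
            out[group] = value
            arg[group] = i
    # Replace untouched groups with 0 (assumed empty-group convention).
    out = [0 if v is None else v for v in out]
    return out, arg


values, positions = scatter_max([1, 5, 3, 2], [0, 1, 0, 1], dim_size=3)
print(values, positions)
```

Here group 0 collects values 1 and 3, group 1 collects 5 and 2, and group 2 stays empty. A reference like this is also handy for checking an exported graph against the PyTorch model on small inputs.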