Slow inference speed of object detection models and a hack as solution #3270
Comments
Would something similar work with the Faster R-CNN meta architecture and its SecondStagePostprocessing? |
Nice work! @derekjchow is that something we could integrate in a general way? CC @benoitsteiner (for placement) @ebrevdo |
Thank you for the detailed analysis! Looking deeper into the code, it seems we clear devices when freezing the graph for portability reasons. See this file. |
In that case, would you explain what is needed to recover the original device placement config for the frozen models currently listed in the Object Detection Model Zoo? @wkelongws showed that a hack manually assigning the nodes to GPU and CPU regained the advertised performance, but that requires careful inspection of each model's network, so it is not really scalable. |
@nguyeho7 I haven't tried the trick on the Faster R-CNN architecture, but I believe it will perform in a similar way. The point is that the graph nodes have to be assigned to GPU or CPU appropriately to achieve the reported inference speed. The trick here is not the optimal way to assign the nodes, as @tokk-nv mentioned above. Apparently the device tags are initially there and are then removed in these frozen graphs, so the released frozen graphs cannot achieve the reported inference speed. @derekjchow Since the frozen graphs cannot achieve the reported inference speed due to the lack of optimal device assignment, I think the TensorFlow team should provide:
|
@wkelongws thanks for this work! I have been struggling to speed up SSD with the API for a while (#3136). This could be the reason. Can you tell me how to apply your hack? |
@gustavz The source code (a Jupyter notebook) to apply the hack is in the attached .zip file. The hack is basically to manually find a cut-off point, split the graph into two halves, assign the first half to the GPU, and the second half to the CPU. The cut-off point I used here was decided manually and will vary for other graphs. |
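For reference, a minimal sketch of this kind of device pinning (not @wkelongws's exact notebook, which is in the attached zip): it assumes the SSD post-processing nodes live under the `Postprocessor` name scope, which holds for the model zoo SSDs but should be verified for each graph.

```python
import tensorflow as tf

# Load the frozen graph (path is a placeholder).
graph_def = tf.GraphDef()
with tf.gfile.GFile('frozen_inference_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

# Pin the CNN to the GPU and the tf.where-heavy post-processing to the CPU.
for node in graph_def.node:
    if node.name.startswith('Postprocessor'):
        node.device = '/device:CPU:0'
    else:
        node.device = '/device:GPU:0'

graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(graph_def, name='')

# allow_soft_placement lets TensorFlow fall back when an op has no kernel
# for the assigned device.
sess = tf.Session(graph=graph,
                  config=tf.ConfigProto(allow_soft_placement=True))
```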
For the hack, it's a little more complicated to find a cut-off point with Faster R-CNN, as the tensors are very interconnected. Notably, non-maximum suppression is called twice: once after the first stage and once in the second-stage post-processing. I tried to use Squeeze_3 and Squeeze_2, which are the outputs of the second-stage box proposals, but there is always a missing tensor (i.e., a wrong cut) somewhere. |
@wkelongws, no, I did not have success with it; it seems that |
@nguyeho7 You are correct. The hack is just a demo here; it is not scalable. For complicated model structures such as faster_rcnn we need to find a way to restore the device placement. I guess re-exporting the frozen graph from the checkpoint might do the trick.
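Presumably something along the lines of the standard exporter invocation (paths are placeholders):

```
python object_detection/export_inference_graph.py \
    --input_type image_tensor \
    --pipeline_config_path /path/to/pipeline.config \
    --trained_checkpoint_prefix /path/to/model.ckpt \
    --output_directory /path/to/exported_graph
```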
Due to some other errors on my machine, I personally haven't tried this method, but @tokk-nv tried it and confirmed that the device placement is still missing after re-exporting from the provided checkpoint. |
@wkelongws |
@wkelongws @nguyeho7 I've tried removing the clear_devices bits from the exporter, but I don't see any change in speed. We'll need more time to look at this issue a bit more closely. |
Would loading the checkpoint directly instead of the frozen graph help? |
I encountered a similar issue with the SSD model. What's more confusing is that the profiler reports that almost all of the time is spent in
|
@nguyeho7 I'm seeing this on checkpoint as well as frozen graph. |
I am having similarly bad performance. TensorFlow GPU is installed and says it's connected, and I see the process in the nvidia-smi tool. Anyone have an idea why inference is so slow on the GPU? |
The hack works on the pretrained model from the model zoo but not on a fine-tuned model exported using |
@gariepyalex Is it slower? Or how does it not work? |
Why is that speed still faster than my Nvidia K80 on the AWS p2.xlarge? |
@nguyeho7 It looks like the model in the model zoo was generated from an older version of the code (November 17, judging by the name of the tar.gz file). After running |
@madhavajay No, you don't need to re-export the graph; it should pick the GPU automatically, and from your previous message it does look like the speed improved when you added the GPU. There are many other factors (resolution, image loading is very notable, batch size) that affect the speed. Also note that the reported times are on a Titan X instead of a K80. This thread is more about proper device placement and possible improvements from that angle. |
What is weird is that if you look in TensorBoard at the graph generated when running |
@nguyeho7 Okay, but surely the Titan X isn't as fast as the K80? That's what I'm trying to understand, in terms of real-time inference for, say, video vs. batch inference for static images. |
@wkelongws sharing a few facts:
@gariepyalex This meta architecture is not used only on one specific machine, so setting device info in the meta architecture might not be the right choice. |
@gariepyalex I tried your hack; running export_inference_graph.py was successful and a model was generated, but when I run inference I see a lot of false detections with big bounding boxes. I am using TensorFlow 1.5 and clear_devices=False. Do you have any idea what could be the problem? |
@wkelongws your hack worked perfectly fine for all my ssd_mobilenet models. But now I have a new ssd_mobilenet trained on 600x600 input images, and when I try to apply the split_model hack during inference it gives me the following error:
So I had a look at the Is this normal? If not, what could have caused this? |
I had noticed that |
Hi @naisy, I saw that you did good work accelerating the code by splitting the graph. I am just starting with TensorFlow and the Object Detection API. I used "object_detection_tutorial.ipynb" with some modifications to read frames from a video file, predict the bounding boxes, and write the predicted frames to a new video file. Even using a GPU (Tesla P100), predictions take 1 second per frame. From what I read here, I have the feeling that there might be something wrong in my code, because that is too slow for a single frame. Do you have any advice on what I should look at in order to make it faster? Thank you a lot. |
Hi @engineer1982, if you create the session inside the loop, it will be slow. In the tutorial the session is created after loading each frame, but if you want to predict consecutively, create the session once and reuse it for every frame.
I think that this will be improved considerably. |
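A minimal sketch of @naisy's suggestion, assuming the standard tensor names from Object Detection API frozen graphs ('image_tensor:0', 'detection_boxes:0', ...) and a placeholder input video path:

```python
import cv2
import tensorflow as tf

# Load the frozen detection graph once.
detection_graph = tf.Graph()
with detection_graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile('frozen_inference_graph.pb', 'rb') as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

# Create the session once, outside the frame loop.
with tf.Session(graph=detection_graph) as sess:
    image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
    boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
    scores = detection_graph.get_tensor_by_name('detection_scores:0')

    cap = cv2.VideoCapture('input.mp4')  # placeholder video path
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # model expects RGB
        out_boxes, out_scores = sess.run(
            [boxes, scores],
            feed_dict={image_tensor: rgb[None, ...]})  # add batch dimension
    cap.release()
```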
Thank you so much @naisy for the quick reply! I will try it today and get back to you. One more doubt: I am using train.py (legacy folder) to train a faster_rcnn_inception_resnet_v2. I have 93 images in my "train" folder, and the config file has batch_size = 1. When I run it using a GPU (Tesla P100), each step takes 0.9 s. As I am using a function provided by TensorFlow, I believe the code should be fine.
Thank you again for your help! |
Hi @engineer1982, sorry, I'm not familiar with the training code, and I have not trained with Faster R-CNN, so I cannot answer that. |
@naisy, ok :) And with what application and on what hardware did you get 22 FPS? |
Hi @engineer1982, |
Do you know why only the first run takes a longer time? |
@naisy @wkelongws How would this work with TensorRT? Would TensorRT have the same problem of improper CPU/GPU assignment, or would it solve the problem? |
Hi @atyshka, I think a similar approach works well in TensorRT. |
@naisy Pardon me if I'm wrong, but changing that one line with the NMS isn't what you meant by splitting the graph. You had a lot more code in there that used two separate graphs to run. |
Hi @atyshka, when splitting the graph, choose the dividing point by looking at the graph. |
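One way to "look at the graph" without TensorBoard is to dump candidate node names from the loaded GraphDef (a sketch; `graph_def` is loaded as in the earlier snippet, and 'Squeeze'/'Postprocessor' are the name patterns discussed in this thread):

```python
# Print ops around the post-processing boundary to pick a cut-off node.
for node in graph_def.node:
    if 'Squeeze' in node.name or 'Postprocessor' in node.name:
        print(node.op, node.name)
```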
@dennywangtenk In your code you are loading the graph once for all the images, which would mean a significant reduction in I/O operations, right? I'm working on my Faster R-CNN and trying the solutions mentioned in this issue. |
@naisy Were you able to run the Faster R-CNN model built on TensorFlow using TensorRT? Can you share some pointers on how to run it with TensorRT? |
For TF-TRT, please refer to the following URL. |
@naisy Hi, thanks for the pointers, but I am particularly looking for the faster_rcnn + inception_v2 architecture, which is currently not available in the links you provided. Can you please give references for this? |
When you read the code, you can see that it corresponds to inception_v2. |
Hi @naisy, thanks for the tips, but when I looked into the file I was not able to find faster_rcnn_inception_v2_coco; in https://github.com/NVIDIA-AI-IOT/tf_trt_models/blob/master/tf_trt_models/detection.py I could only see faster_rcnn_resnet50_coco. |
It appears that @naisy's code works for the pretrained Faster R-CNN 01_28 that is in Google's model zoo, but NOT for a retrained custom model based on that graph, because the retrained models lack the Squeeze_2 and Squeeze_3 nodes. @abhigoku10 Were you able to solve this problem with the squeeze nodes? I noticed you encountered it before. |
@iocuydi I was able to run the squeezenet, but there was some error due to which I was getting wrong results during prediction. |
Add faster_rcnn_inception_v2_coco settings into detection.py.
You'd better ask TF-TRT questions at the TensorRT forum. |
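A hedged sketch of what that addition could look like: detection.py keeps a registry of supported models, and a faster_rcnn_inception_v2_coco entry would mirror the existing faster_rcnn_resnet50_coco one. The field layout below is an assumption; copy the structure of the real entries in the file. The tar.gz name matches the 2018_01_28 model zoo release.

```python
# Hypothetical entry to add to the MODELS registry in detection.py;
# mirror the actual fields of the existing entries.
'faster_rcnn_inception_v2_coco': DetectionModel(
    'faster_rcnn_inception_v2_coco',
    'http://download.tensorflow.org/models/object_detection/'
    'faster_rcnn_inception_v2_coco_2018_01_28.tar.gz',
    'faster_rcnn_inception_v2_coco_2018_01_28',
),
```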
@naisy Thanks for the info. I have asked the questions in the community but am not getting any positive responses. |
Did anyone come up with a reliable reproducible solution? |
Hi There, |
System information
- Top-level directory of the model: models/research/object_detection/
- Custom code: no custom code for reproducing the bug; I have written custom code for diagnosing.
- OS platform and distribution: Linux Ubuntu 16.04
- TensorFlow installed from: Anaconda conda-forge channel
- TensorFlow version: b'unknown' 1.4.1 (output from python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)")
- CUDA/cuDNN version: CUDA 8.0 / cuDNN 6.0
- GPU model and memory: 1 TITAN X (Pascal), 12189 MiB
Run the provided object detection demo (ssd_mobilenet_v1_coco_2017_11_17 model) with a small modification in the last cell to record the inference speed:
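The modification was likely along these lines (the exact cell is in the attached zip; `sess` and the tensor handles are the ones already defined in the demo notebook):

```python
import time

# Time repeated sess.run calls, discarding warm-up iterations.
times = []
for i in range(110):
    start = time.time()
    (boxes, scores, classes, num) = sess.run(
        [detection_boxes, detection_scores, detection_classes, num_detections],
        feed_dict={image_tensor: image_np_expanded})
    if i >= 10:  # the first runs include one-time setup cost
        times.append(time.time() - start)

print('mean inference time: %.1f ms' % (1000.0 * sum(times) / len(times)))
```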
The results show that the inference speed is much slower than the reported inference speed, 30 ms, on the model zoo page:
Describe the problem
Summary:
By directly running the provided object detection demo, the observed inference speed of the object detection models in the model zoo is much slower than the reported inference speed. With some hacking, an inference speed higher than the reported speed can be achieved. After some diagnostics, it is highly likely that the slow inference speed is caused by:
Proof of the hypothesis: tf.where and other post-processing operations run anomalously slowly on GPU
By outputting a trace file, we can diagnose the running time of each node in detail.
To output the trace file, modify the last cell of the object detection demo as:
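A sketch of the tracing change (the actual cell is in the attached zip): request a full trace from sess.run and dump it in Chrome trace format via tf.RunMetadata and the timeline module.

```python
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

# Run one inference step with tracing enabled.
sess.run(
    [detection_boxes, detection_scores, detection_classes, num_detections],
    feed_dict={image_tensor: image_np_expanded},
    options=run_options, run_metadata=run_metadata)

# Write the step stats as a Chrome trace (open in chrome://tracing/).
tl = timeline.Timeline(run_metadata.step_stats)
with open('object_detection_trace.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())
```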
The output json file has been included in the .zip file in the source code section below.
Visualizing the json file in chrome://tracing/ gives:
The CNN-related operations end at ~13 ms and the remaining post-processing operations take about 133 ms. We have noticed that adding the trace function further slows down the inference, but it shows clearly that the post-processing operations (post-CNN) run very slowly on the GPU.
As a comparison, one can run the object detection demo with GPU disabled, and profile the running trace using the same method. To disable GPU, add
os.environ['CUDA_VISIBLE_DEVICES'] = ''
in the first line of the last cell. The output json file has been included in the .zip file in the source code section below.
Visualizing this json file in chrome://tracing/ gives:
With everything running on the CPU, the CNN operations end at roughly 63 ms and the remaining post-processing operations take only about 15 ms, which is significantly faster than the time they take on the GPU.
Proof of the hypothesis: the frozen inference graph lacks the ability to optimize the GPU/CPU assignment
We added a hack to see whether we could achieve a higher inference speed. The hack manually assigns the CNN-related nodes to the GPU and the remaining nodes to the CPU. The idea is to use the GPU to accelerate only the CNN operations and leave the post-processing operations on the CPU.
The source code has been included in the .zip file in the source code section below.
With this hack, we are able to observe a higher inference speed than the reported speed.
To verify the hypothesis, here are some questions we need the TensorFlow team to answer:
Were the inference speeds reported on the detection model zoo page measured on the frozen inference graphs or on the original graphs?
Are the slow tf.where and other post-processing operations supposed to run on GPU or CPU? Is the slow running speed on GPU normal?
Is there a device-assignment function that optimizes GPU/CPU use in the original TensorFlow graphs? Is that function missing from the frozen inference graphs?
Source code / logs
tensorflowissue.zip