
ngraph : fix for multithreading test_analyzer_image_classification #18265

Conversation

pawelpiotrowicz
Contributor

@pawelpiotrowicz pawelpiotrowicz commented Jun 21, 2019

ngraph : fix for multithreading test_analyzer_image_classification test=develop

@baojun-nervana
@kbinias
@LeoZhao-Intel

kbinias
kbinias previously approved these changes Jun 27, 2019
Contributor

@kbinias kbinias left a comment


LGTM

@luotao1 luotao1 closed this Jul 3, 2019
@luotao1 luotao1 reopened this Jul 3, 2019
@luotao1
Contributor

luotao1 commented Jul 3, 2019

Could you refine the title and description? I could not see the full command from your title.

@LeoZhao-Intel
Contributor

@pawelpiotrowicz Can you describe your idea or design in a comment? That would help us understand your general approach.

baojun-nervana
baojun-nervana previously approved these changes Jul 9, 2019
Contributor

@baojun-nervana baojun-nervana left a comment


LGTM, thanks!

@pawelpiotrowicz
Contributor Author

@LeoZhao-Intel The idea behind this solution is to allow each thread to have an internal cache. Only a pointer to the cache is stored in thread-local storage; the real cache lives on the heap.
When a thread dies, the cache is deleted and its heap memory is released.
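
The pattern described above can be sketched as follows. This is a minimal, hypothetical illustration (the `EngineCache` type and function names here are assumptions, not the actual PR code): only a `thread_local` smart pointer lives in thread-local storage, the cache itself is heap-allocated, and the pointer's destructor frees the cache when its owning thread exits.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the per-thread cache (not the real PR type).
struct EngineCache {
  std::unordered_map<std::string, std::vector<float>> entries;
};

// Only this pointer lives in thread-local storage; the cache itself is on
// the heap. The unique_ptr's destructor runs when the owning thread exits,
// releasing the heap allocation.
EngineCache& GetThreadLocalCache() {
  thread_local std::unique_ptr<EngineCache> cache(new EngineCache);
  return *cache;
}

// Spawns num_threads workers; each observes its own independent cache.
int RunWorkers(int num_threads) {
  std::mutex mu;
  int caches_seen = 0;
  std::vector<std::thread> workers;
  for (int i = 0; i < num_threads; ++i) {
    workers.emplace_back([&mu, &caches_seen, i] {
      EngineCache& cache = GetThreadLocalCache();
      cache.entries["input"] = {static_cast<float>(i)};
      std::lock_guard<std::mutex> lock(mu);
      caches_seen += static_cast<int>(cache.entries.size());
    });
  }
  for (auto& w : workers) w.join();  // each worker's cache is freed on exit
  return caches_seen;
}
```

Because no two threads share a cache object, concurrent runs cannot double-free a shared entry; cleanup is tied to thread exit via the `thread_local` destructor.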

@pawelpiotrowicz pawelpiotrowicz force-pushed the pawepiot/ngraph_multithread_tls branch from afc2408 to fdc1350 Compare July 25, 2019 10:20
@pawelpiotrowicz pawelpiotrowicz changed the title ngraph : fix for multithreading test_analyzer_image_classification --… ngraph : fix for multithreading test_analyzer_image_classification Jul 25, 2019
@LeoZhao-Intel
Contributor

@LeoZhao-Intel The idea behind this solution is to allow each thread to have an internal cache. Only a pointer to the cache is stored in thread-local storage; the real cache lives on the heap.
When a thread dies, the cache is deleted and its heap memory is released.

Got your idea: you want to use a thread_local variable to tie the cache to the thread's lifetime.

But be careful with that: we met some issues in mkldnn with a similar kind of cache reuse for Baidu's online service deployment. In their usage, there is a thread pool used for inference execution, but those threads never exit and are always reused across different iterations.
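
The caveat above can be sketched as follows (a hypothetical illustration, not Baidu's actual deployment code): a long-lived worker thread keeps reusing the same thread_local cache across iterations, and the cache is destroyed only if the thread ever exits, which pooled threads typically do not.

```cpp
#include <cassert>
#include <memory>
#include <thread>

// Hypothetical per-thread cache (illustration only).
struct Cache { int uses = 0; };

Cache* GetThreadLocalCache() {
  thread_local std::unique_ptr<Cache> cache(new Cache);
  return cache.get();
}

// Simulates a pooled worker that processes two iterations without exiting
// in between: both iterations observe the same thread_local cache object.
int UsesAfterTwoTasksOnOneWorker() {
  int observed = 0;
  std::thread worker([&observed] {
    for (int task = 0; task < 2; ++task) {
      Cache* c = GetThreadLocalCache();  // same object on both iterations
      c->uses += 1;
      observed = c->uses;
    }
  });  // the cache is destroyed only here, when the worker finally exits
  worker.join();
  return observed;
}
```

If the worker never exits, as in a long-running service thread pool, the `thread_local` destructor never fires, so any memory held by the cache persists for the life of the process.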

Contributor

@baojun-nervana baojun-nervana left a comment


LGTM

@baojun-nervana
Contributor

@pawelpiotrowicz Can you add a description as requested to move forward with this PR?

@pawelpiotrowicz
Contributor Author

@luotao1, I changed the description as you requested; could you carry on with this PR?

@bingyanghuang
Contributor

@pawelpiotrowicz Hi Pawel, could you refine your description in this way:

  1. What's the problem you are fixing? Could you describe it in more detail, rather than just "the multi-threading problem of nGraph"?
  2. Could you give the command line to reproduce this problem in this PR?
  3. If the issue you want to fix has already been created, could you refer to the issue number in your description? That will make it easier for Baidu to understand why we have this PR.
  4. After fixing this bug, what result will we get? E.g., since this is a multi-threading problem, will it affect performance? If so, please give the benchmark before and after the fix; if not, could you say a few words about what we gain after fixing this problem?

@luotao1
Contributor

luotao1 commented Aug 1, 2019

Thanks for @bingyanghuang's suggestions. @pawelpiotrowicz Could you tell us how to reproduce the bug and verify whether this PR fixes it? Like #18382 (comment)

@pawelpiotrowicz
Contributor Author

pawelpiotrowicz commented Aug 1, 2019

@bingyanghuang @luotao1

  1. What's the problem you are fixing? Could you describe it in more detail, rather than just "the multi-threading problem of nGraph"?

It refers to the test_analyzer_image_classification app with nGraph support.

  2. Could you give the command line to reproduce this problem in this PR?
cmake ..  -DCMAKE_BUILD_TYPE=Debug -DWITH_TESTING=ON -DWITH_INFERENCE_API_TEST=ON -DON_INFER=ON -DWITH_PYTHON=ON -DWITH_NGRAPH=ON -DWITH_GPU=OFF
paddle-public/build/paddle/fluid/inference/tests/api/test_analyzer_image_classification --infer_model=/home/pawepiot/workspace/multi_instance_public/paddle-public/build/third_party/inference_demo/googlenet/model --gtest_filter=Analyzer_resnet50.profile --use_ngraph --use_analysis=false --repeat=100 --paddle_num_threads=4 --num_threads=2
Note: Google Test filter = Analyzer_resnet50.profile
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Analyzer_resnet50
[ RUN      ] Analyzer_resnet50.profile
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0801 14:22:55.798418 23398 tester_helper.h:233] feed target 0: {-1, 3, 227, 227}
I0801 14:22:55.799906 23398 tester_helper.h:90] NativeConfig {
  PaddlePredictor::Config {
    model_dir: 
  }
  use_gpu: 0
  device: 0
  fraction_of_gpu_memory: 0
  specify_input_name: 1
}
I0801 14:22:55.838295 23399 tester_helper.h:332] Thread 0, number of threads 2, run 100 times...
I0801 14:22:55.838299 23400 tester_helper.h:332] Thread 1, number of threads 2, run 100 times...
*** Error in `/home/pawepiot/workspace/multi_instance_public/paddle-public/build/paddle/fluid/inference/tests/api/test_analyzer_image_classification': double free or corruption (out): 0x00007fe4e82fac30 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fe51fbe87e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7fe51fbf137a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fe51fbf553c]
/home/pawepiot/workspace/multi_instance_public/paddle-public/build/paddle/fluid/inference/tests/api/test_analyzer_image_classification(_ZSt8_DestroyINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEvPT_+0x18)[0x428eca4]
/home/pawepiot/workspace/multi_instance_public/paddle-public/build/paddle/fluid/inference/tests/api/test_analyzer_image_classification(_ZNSt12_Destroy_auxILb0EE9__destroyIPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvT_S9_+0x2e)[0x428c17b]
/home/pawepiot/workspace/multi_instance_public/paddle-public/build/paddle/fluid/inference/tests/api/test_analyzer_image_classification(_ZSt8_DestroyIPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEvT_S7_+0x23)[0x4287d18]
/home/pawepiot/workspace/multi_instance_public/paddle-public/build/paddle/fluid/inference/tests/api/test_analyzer_image_classification(_ZSt8_DestroyIPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES5_EvT_S7_RSaIT0_E+0x27)[0x4282acb]
/home/pawepiot/workspace/multi_instance_public/paddle-public/build/paddle/fluid/inference/tests/api/test_analyzer_image_classification(_ZNSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS5_EED1Ev+0x35)[0x427d2bd]
/home/pawepiot/workspace/multi_instance_public/paddle-public/build/paddle/fluid/inference/tests/api/test_analyzer_image_classification(_ZN6paddle9operators11EngineCacheD1Ev+0x2c)[0x5496872]
/home/pawepiot/workspace/multi_instance_public/paddle-public/build/paddle/fluid/inference/tests/api/test_analyzer_image_classification(_ZNSt4pairIKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN6paddle9operators11EngineCacheEED1Ev+0x1c)[0x54a658a]
./run.sh: line 35: 23398 Bus error               (core dumped) $cmd_fail

  3. If the issue you want to fix has already been created, could you refer to the issue number in your description?

No issue registered. It was one of the nGraph integration tasks.

After the fix:

Note: Google Test filter = Analyzer_resnet50.profile
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Analyzer_resnet50
[ RUN      ] Analyzer_resnet50.profile
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0801 14:53:10.581333 16227 tester_helper.h:233] feed target 0: {-1, 3, 227, 227}
I0801 14:53:10.584867 16227 tester_helper.h:90] NativeConfig {
  PaddlePredictor::Config {
    model_dir: 
  }
  use_gpu: 0
  device: 0
  fraction_of_gpu_memory: 0
  specify_input_name: 1
}
I0801 14:53:10.634505 16228 tester_helper.h:332] Thread 0, number of threads 2, run 100 times...
I0801 14:53:10.634531 16229 tester_helper.h:332] Thread 1, number of threads 2, run 100 times...
I0801 14:53:46.711762 16229 helper.h:322] ====== threads: 2, thread id: 1 ======
I0801 14:53:46.714756 16229 helper.h:324] ====== batch size: 1, iterations: 1, repetitions: 100 ======
I0801 14:53:46.714833 16229 helper.h:326] ====== batch latency: 360.769ms, number of samples: 1, sample latency: 360.769ms, fps: 2.77186, data type: float ======
I0801 14:53:47.085276 16228 helper.h:322] ====== threads: 2, thread id: 0 ======
I0801 14:53:47.085427 16228 helper.h:324] ====== batch size: 1, iterations: 1, repetitions: 100 ======
I0801 14:53:47.085489 16228 helper.h:326] ====== batch latency: 364.507ms, number of samples: 1, sample latency: 364.507ms, fps: 2.74343, data type: float ======
[       OK ] Analyzer_resnet50.profile (36573 ms)
[----------] 1 test from Analyzer_resnet50 (36573 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (36573 ms total)
[  PASSED  ] 1 test.

@luotao1
Contributor

luotao1 commented Aug 1, 2019

@zhupengyang Could you help reproduce the bug and verify whether this PR fixes the bug with #18265 (comment)?

@zhupengyang
Contributor

  • reproduce the bug: commit id: f745d6d
Note: Google Test filter = Analyzer_resnet50.profile
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Analyzer_resnet50
[ RUN      ] Analyzer_resnet50.profile
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0802 06:21:59.253562 31349 tester_helper.h:233] feed target 0: {-1, 3, 227, 227}
I0802 06:21:59.255908 31349 tester_helper.h:90] NativeConfig {
  PaddlePredictor::Config {
    model_dir: 
  }
  use_gpu: 0
  device: 0
  fraction_of_gpu_memory: 0
  specify_input_name: 1
}
I0802 06:21:59.314203 31350 tester_helper.h:332] Thread 0, number of threads 2, run 100 times...
I0802 06:21:59.314246 31351 tester_helper.h:332] Thread 1, number of threads 2, run 100 times...
Segmentation fault

  • verify the fix with this PR:
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Analyzer_resnet50
[ RUN      ] Analyzer_resnet50.profile
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0802 12:28:25.573624 14799 tester_helper.h:233] feed target 0: {-1, 3, 227, 227}
I0802 12:28:25.575999 14799 tester_helper.h:90] NativeConfig {
  PaddlePredictor::Config {
    model_dir: 
  }
  use_gpu: 0
  device: 0
  fraction_of_gpu_memory: 0
  specify_input_name: 1
}
I0802 12:28:25.650452 14800 tester_helper.h:332] Thread 0, number of threads 2, run 100 times...
I0802 12:28:25.650454 14801 tester_helper.h:332] Thread 1, number of threads 2, run 100 times...
I0802 12:29:10.631986 14801 helper.h:322] ====== threads: 2, thread id: 1 ======
I0802 12:29:10.632396 14801 helper.h:324] ====== batch size: 1, iterations: 1, repetitions: 100 ======
I0802 12:29:10.632429 14801 helper.h:326] ====== batch latency: 449.813ms, number of samples: 1, sample latency: 449.813ms, fps: 2.22315, data type: float ======
I0802 12:29:11.176981 14800 helper.h:322] ====== threads: 2, thread id: 0 ======
I0802 12:29:11.177028 14800 helper.h:324] ====== batch size: 1, iterations: 1, repetitions: 100 ======
I0802 12:29:11.177039 14800 helper.h:326] ====== batch latency: 455.263ms, number of samples: 1, sample latency: 455.263ms, fps: 2.19653, data type: float ======
[       OK ] Analyzer_resnet50.profile (45701 ms)
[----------] 1 test from Analyzer_resnet50 (45701 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (45702 ms total)
[  PASSED  ] 1 test.

@luotao1

@luotao1
Contributor

luotao1 commented Aug 2, 2019

Thanks for @zhupengyang's verification work! @tensor-tang Please take a look and review!

Contributor

@tensor-tang tensor-tang left a comment


LGTM

Thanks to @pawelpiotrowicz for the great work and @zhupengyang for the validation.

@luotao1 luotao1 merged commit e53f517 into PaddlePaddle:develop Aug 5, 2019
@luotao1
Contributor

luotao1 commented Aug 5, 2019

Please add a unit test in the next PR to ensure the multi-threading test_analyzer_image_classification works on nGraph.

suoych pushed a commit to suoych/Paddle that referenced this pull request Aug 5, 2019
@baojun-nervana baojun-nervana deleted the pawepiot/ngraph_multithread_tls branch August 5, 2019 16:51
@bingyanghuang
Contributor

bingyanghuang commented Aug 13, 2019

Please add a unit test in the next PR to ensure the multi-threading test_analyzer_image_classification works on nGraph.

@pawelpiotrowicz When do you plan to add this unit test?

@pawelpiotrowicz
Contributor Author

@bingyanghuang, the task is a bit complex and I need more time. First I have to look at the test-coverage output, and depending on the result I will decide how to proceed.

@luotao1
Contributor

luotao1 commented Aug 14, 2019

The task is a bit complex

You can simply add the following, like:

void compare(bool use_mkldnn = false, bool use_ngraph = false) {
  AnalysisConfig cfg;
  SetConfig(&cfg);
  if (use_mkldnn) {
    cfg.EnableMKLDNN();
    cfg.pass_builder()->AppendPass("fc_mkldnn_pass");
  }
  if (use_ngraph) {
    cfg.EnableNgraph();
  }
  std::vector<std::vector<PaddleTensor>> inputs;
  LoadInputData(&inputs);
  CompareNativeAndAnalysis(
      reinterpret_cast<const PaddlePredictor::Config *>(&cfg), inputs);
}

TEST(Analyzer_bert, compare) { compare(); }
#ifdef PADDLE_WITH_MKLDNN
TEST(Analyzer_bert, compare_mkldnn) {
  compare(true, false /* use_mkldnn, no use_ngraph */);
}
#endif
#ifdef PADDLE_WITH_NGRAPH
TEST(Analyzer_bert, compare_ngraph) {
  compare(false, true /* no use_mkldnn, use_ngraph */);
}
#endif

#ifdef PADDLE_WITH_NGRAPH
TEST(Analyzer_bert, profile_ngraph) { profile(false, true); }
#endif

TestPrediction(reinterpret_cast<const PaddlePredictor::Config *>(&config),
               inputs, &outputs, FLAGS_num_threads /*2 or 4*/);

see test-coverage output

Test coverage is independent of these multi-threading unit tests. Adding a unit test will not decrease the test coverage, so you can add the unit test first.
