cherry picked windows patches. #14046

Merged: 71 commits, Nov 8, 2018
Changes from 1 commit

Commits (71):
adec702
cudnn widndows
dzhwinter Aug 17, 2018
963a745
"add comment"
dzhwinter Aug 17, 2018
335398f
dlfnh
dzhwinter Aug 17, 2018
59160e8
"windows support"
dzhwinter Aug 17, 2018
36878d7
comment out backtarce
dzhwinter Aug 17, 2018
64ce121
"windows support"
dzhwinter Aug 17, 2018
5c88cd2
remove werror in windows
dzhwinter Aug 17, 2018
f1a7ae3
"fix cmake error"
dzhwinter Aug 20, 2018
17602ea
windows port of malloc
dzhwinter Aug 20, 2018
c5722eb
merge windows/cudnn
dzhwinter Aug 20, 2018
34f8c9b
windows port
dzhwinter Aug 24, 2018
89f95ea
merge develop branch
dzhwinter Aug 24, 2018
c1ad52f
pre-commit
dzhwinter Aug 24, 2018
cfbf1ba
add source
dzhwinter Aug 24, 2018
a94d4f5
fix math_function compile
dzhwinter Aug 24, 2018
65f144a
fix commit
dzhwinter Aug 24, 2018
c7e0ed8
inference lib
dzhwinter Aug 24, 2018
488a2dd
with ir node
dzhwinter Aug 25, 2018
efd0884
add op registry
dzhwinter Aug 25, 2018
d7f98f3
more platform is done
dzhwinter Aug 25, 2018
26dbe35
add msvc flags and copy lib done
dzhwinter Aug 26, 2018
7dceb8a
check some operators
dzhwinter Aug 26, 2018
2ec589a
float.h fixed
dzhwinter Aug 26, 2018
cd8f3e9
operator module is done
dzhwinter Aug 27, 2018
78aab05
fix more op errors
dzhwinter Aug 27, 2018
b74af56
cpu compile is done
dzhwinter Aug 28, 2018
b78394e
done
dzhwinter Aug 29, 2018
f5329d6
add some synatx
dzhwinter Aug 30, 2018
5e8e7fb
change data type
dzhwinter Aug 30, 2018
dbe90cc
merge develop branch
dzhwinter Sep 1, 2018
b5402d7
Merge remote-tracking branch 'origin/develop' into windows/support
dzhwinter Sep 2, 2018
52d60f8
merge conclit
dzhwinter Sep 2, 2018
5c2637e
tensor util
dzhwinter Sep 2, 2018
75681c0
switch to 9.2
dzhwinter Sep 2, 2018
a0aa2ec
build compile
dzhwinter Sep 2, 2018
379b471
squash commit
dzhwinter Sep 3, 2018
c3e1fb5
add demo
dzhwinter Sep 12, 2018
372caf4
windows staff
dzhwinter Sep 14, 2018
e199953
debug the device context
dzhwinter Sep 14, 2018
85f8dd1
debug version
dzhwinter Sep 14, 2018
3ae9645
compile in linux
wanghaoshuang Oct 14, 2018
b12f7c2
compile in linux.
wanghaoshuang Oct 15, 2018
962061f
windows fix
dzhwinter Oct 15, 2018
804dd7d
merge conflict. both linux and windows pass.
dzhwinter Oct 15, 2018
e41a3fc
fix update to develop hang problem.
dzhwinter Oct 16, 2018
607080e
windows static library
dzhwinter Oct 23, 2018
f9e7cfb
save binary file
wanghaoshuang Oct 23, 2018
5993155
Merge remote-tracking branch 'dzhwinter/windows/support' into windows…
wanghaoshuang Oct 23, 2018
78cf76a
fix linux compile
wanghaoshuang Oct 23, 2018
9e522a4
update cmake
wanghaoshuang Oct 23, 2018
c6dcffc
lb. add debug output
dzhwinter Oct 23, 2018
dbd0075
Merge branch 'windows/support' into lb
dzhwinter Oct 23, 2018
597d921
clean demo_ci
dzhwinter Oct 24, 2018
abe8e20
clean demo_ci
dzhwinter Oct 24, 2018
b154e0b
clean demo_ci
dzhwinter Oct 24, 2018
468467f
update real incnet tester
dzhwinter Oct 24, 2018
09409ba
staged. test speed=49ms in 1080.
dzhwinter Oct 26, 2018
7141deb
add cudnn back. staged.
dzhwinter Oct 26, 2018
c8adc2c
cudnn version. staged.
dzhwinter Oct 29, 2018
ebfe5a0
merge develop branch
dzhwinter Oct 30, 2018
bf2e4cb
cleard. staged
dzhwinter Oct 30, 2018
3167658
add back jit simd instructions. stage.
dzhwinter Oct 31, 2018
9da7b33
details
dzhwinter Nov 1, 2018
1ace55c
merge develop branch
dzhwinter Nov 1, 2018
0a18058
clean cmake. test=develop
dzhwinter Nov 1, 2018
eb2f7ed
refine tests. test=develop
dzhwinter Nov 2, 2018
cc02353
test=develop
dzhwinter Nov 2, 2018
60f70b1
test=develop
dzhwinter Nov 4, 2018
deb4af7
add test
dzhwinter Nov 7, 2018
2835e04
merge develop branch. test=develop
dzhwinter Nov 7, 2018
234a1d9
Merge remote-tracking branch 'origin/develop' into windows/debug
dzhwinter Nov 8, 2018
cleard. staged
dzhwinter committed Oct 30, 2018
commit bf2e4cb1882b077b9efa78626f30965e3f15a2ab
3 changes: 2 additions & 1 deletion cmake/cuda.cmake
@@ -173,6 +173,7 @@ set(CUDA_PROPAGATE_HOST_FLAGS OFF)
if (NOT WIN32) # windows msvc2015 support c++11 natively.
# -std=c++11 -fPIC not recoginize by msvc
list(APPEND CUDA_NVCC_FLAGS "-std=c++11")
# in cuda9, suppress cuda warning on eigen with "-w"
list(APPEND CUDA_NVCC_FLAGS "-w" "-Xcompiler -fPIC")
else(NOT WIN32)
list(APPEND CUDA_NVCC_FLAGS "-w" "-Xcompiler -fPIC" "-Xcompiler /w")
@@ -181,7 +182,7 @@ endif(NOT WIN32)
if(WITH_FAST_MATH)
# Make use of fast math library. https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html
list(APPEND CUDA_NVCC_FLAGS "--use_fast_math")
# in cuda9, suppress cuda warning on eigen
endif(WITH_FAST_MATH)

# Set :expt-relaxed-constexpr to suppress Eigen warnings
list(APPEND CUDA_NVCC_FLAGS "--expt-relaxed-constexpr")
1 change: 1 addition & 0 deletions cmake/external/threadpool.cmake
@@ -3,6 +3,7 @@ INCLUDE(ExternalProject)
SET(THREADPOOL_SOURCE_DIR ${THIRD_PARTY_PATH}/threadpool)
SET(THREADPOOL_INCLUDE_DIR ${THREADPOOL_SOURCE_DIR}/src/extern_threadpool)
INCLUDE_DIRECTORIES(${THREADPOOL_INCLUDE_DIR})
message("Debug" ${THREADPOOL_INCLUDE_DIR})

ExternalProject_Add(
extern_threadpool
20 changes: 4 additions & 16 deletions cmake/flags.cmake
@@ -143,26 +143,14 @@ set(GPU_COMMON_FLAGS
-Wno-error=unused-function # Warnings in Numpy Header.
-Wno-error=array-bounds # Warnings in Eigen::array
)
set(COMMON_FLAGS
-fPIC
-fno-omit-frame-pointer)
set(GPU_COMMON_FLAGS
-fPIC
-fno-omit-frame-pointer)

else(NOT WIN32)
set(COMMON_FLAGS
"/w") #disable all warnings.

set(GPU_COMMON_FLAGS
"/w") #disable all warnings

endif(NOT WIN32)

else(NOT WIN32)
set(COMMON_FLAGS
-fPIC
-fno-omit-frame-pointer
"/w") #disable all warnings.
set(GPU_COMMON_FLAGS
-fPIC
-fno-omit-frame-pointer
"/w") #disable all warnings
endif(NOT WIN32)

79 changes: 15 additions & 64 deletions paddle/fluid/framework/executor.cc
@@ -48,6 +48,7 @@ ExecutorPrepareContext::~ExecutorPrepareContext() {
VLOG(5) << "destroy ExecutorPrepareContext";
}

#ifndef _WIN32
template <typename RefCntMap>
static void DeleteUnusedTensors(const Scope& scope, const OperatorBase* op,
GarbageCollector<Tensor>* gc,
@@ -82,6 +83,7 @@ static void DeleteUnusedTensors(const Scope& scope, const OperatorBase* op,
gc->Add(erase_tensors);
}
}
#endif

Executor::Executor(const platform::Place& place) : place_(place) {}

@@ -331,97 +333,35 @@ void Executor::Run(const ProgramDesc& program, Scope* scope,

std::unique_ptr<ExecutorPrepareContext> Executor::Prepare(
const ProgramDesc& program, int block_id) {
VLOG(3) << "before create prepare" << block_id << " " << program.Size();
std::unique_ptr<ExecutorPrepareContext> ctx(
new ExecutorPrepareContext(program, block_id));
VLOG(3) << "after create prepare";
// PADDLE_ENFORCE_LT(static_cast<size_t>(block_id), program.Size());
VLOG(3) << "before create op_desc";
PADDLE_ENFORCE_LT(static_cast<size_t>(block_id), program.Size());
auto& block = program.Block(block_id);
VLOG(3) << "create before" << ctx->ops_.size() << " "
<< block.AllOps().size();
int counter = 0;
for (auto& op_desc : block.AllOps()) {
ctx->ops_.push_back(OpRegistry::CreateOp(*op_desc));
VLOG(3) << "create op "
<< "index " << ++counter << " type " << op_desc->Type();
}
VLOG(3) << "create finished" << ctx->ops_.size() << " "
<< block.AllOps().size();
return ctx;
}

std::vector<std::shared_ptr<ExecutorPrepareContext>> Executor::Prepare(
const ProgramDesc& program, const std::vector<int>& block_ids) {
VLOG(3) << "inside prepare";
std::vector<std::shared_ptr<ExecutorPrepareContext>> result;
VLOG(3) << "before go through block_ids";
for (auto& bid : block_ids) {
VLOG(3) << "block id" << bid;
auto* ctx = new ExecutorPrepareContext(program, bid);
// PADDLE_ENFORCE_LT(static_cast<size_t>(bid), program.Size());
PADDLE_ENFORCE_LT(static_cast<size_t>(bid), program.Size());
auto& block = program.Block(bid);
int counter = 0;
VLOG(3) << "create before" << ctx->ops_.size() << " "
<< block.AllOps().size();
for (auto& op_desc : block.AllOps()) {
ctx->ops_.push_back(OpRegistry::CreateOp(*op_desc));
VLOG(3) << "create op "
<< "index " << ++counter << " type " << op_desc->Type();
}
VLOG(3) << "create finished" << ctx->ops_.size() << " "
<< block.AllOps().size();
result.push_back(std::shared_ptr<ExecutorPrepareContext>(ctx));
}
return result;
}

// void CheckResult(const std::string op_type, ExecutorPrepareContext* ctx,
// Scope* local_scope) {
// VLOG(3) << "before checking result";
// auto& dev_ctx = *platform::DeviceContextPool::Instance().Get(place_);
// std::vector<std::string> outputs;
// auto& block = ctx->prog_.Block(0);
// bool found = false;
// framework::OpDesc* myop = nullptr;
// for(auto& op : block.AllOps()) {
// if(op->Type() == "load_combine" || op->Type() == "fetch" || op->Type() ==
// "feed") return;
// if (op->Type() == op_type) {
// found = true;
// myop = op;
// break;
// }
// }
// }
// if(!found) {
// VLOG(3) << "not found op!";
// return;
// }
// auto* op = myop;
// VLOG(3) << "start op output" << op->Type();
// for(auto var_name: op->OutputArgumentNames()) {
// auto* var = local_scope->Var(var_name);
// auto* var_desc = block.FindVar(var_name);
// if (var_desc->Persistable()) continue;
// auto* tensor = var->GetMutable<framework::LoDTensor>();
// framework::Tensor check;
// VLOG(3) << "before tensor copy";
// framework::TensorCopy(*tensor, platform::CPUPlace(), dev_ctx, &check);
// VLOG(3) << "after tensor copy";
// float sum = .0;
// for(size_t i=0; i < check.numel(); ++i) {
// sum += check.data<float>()[i];
// }
// VLOG(3) << "op " << op->Type() << " output var " << var_name << " sum "
// << sum;
// VLOG(3) << "after checking result";
// }

void Executor::RunPreparedContext(ExecutorPrepareContext* ctx, Scope* scope,
bool create_local_scope, bool create_vars,
bool keep_kids) {
VLOG(3) << "RunPreparedContext inside";
Scope* local_scope = scope;
if (create_vars) {
if (create_local_scope) {
@@ -430,6 +370,7 @@ void Executor::RunPreparedContext(ExecutorPrepareContext* ctx, Scope* scope,
CreateVariables(ctx->prog_, local_scope, ctx->block_id_);
}

#ifndef _WIN32
int64_t max_memory_size = GetEagerDeletionThreshold();
std::unique_ptr<GarbageCollector<Tensor>> gc;
// WhileOp would set keep_kids to false
@@ -471,6 +412,16 @@ void Executor::RunPreparedContext(ExecutorPrepareContext* ctx, Scope* scope,
} else {
platform::DeviceContextPool::Instance().Get(place_)->Wait();
}
#else // WIN32
for (auto& op : ctx->ops_) {
op->Run(*local_scope, place_);
if (FLAGS_benchmark) {
VLOG(2) << "Memory used after operator " + op->Type() + " running: "
<< memory::memory_usage(place_);
}
}
platform::DeviceContextPool::Instance().Get(place_)->Wait();
#endif // NOT WIN32

if (local_scope != scope) {
scope->DeleteScope(local_scope);
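Taken together, the executor changes compile the eager-deletion garbage collector out of the Windows build and fall back to a plain run loop over the prepared ops. A minimal sketch of that guard pattern follows; the types and names here are illustrative placeholders, not Paddle's actual API.

#include <memory>
#include <vector>

// Placeholder types standing in for Paddle's Scope and OperatorBase.
struct Scope {};
struct Op {
  void Run(Scope* scope) { (void)scope; /* kernel dispatch happens here */ }
};

void RunOps(std::vector<std::unique_ptr<Op>>* ops, Scope* scope) {
#ifndef _WIN32
  // Non-Windows builds can interleave eager tensor deletion with execution.
  for (auto& op : *ops) {
    op->Run(scope);
    // gc->Add(unused_tensors);  // garbage collector, compiled out on Windows
  }
#else
  // Windows build: plain run loop, no garbage collector.
  for (auto& op : *ops) {
    op->Run(scope);
  }
#endif
}

int main() {
  std::vector<std::unique_ptr<Op>> ops;
  ops.emplace_back(new Op());
  Scope scope;
  RunOps(&ops, &scope);
  return 0;
}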
4 changes: 3 additions & 1 deletion paddle/fluid/framework/executor.h
@@ -17,12 +17,14 @@ limitations under the License. */
#include <map>
#include <string>
#include <vector>
#include "paddle/fluid/framework/garbage_collector.h"
#include "paddle/fluid/framework/op_info.h"
#include "paddle/fluid/framework/program_desc.h"
#include "paddle/fluid/framework/scope.h"
#include "paddle/fluid/framework/tensor.h"
#include "paddle/fluid/platform/device_context.h"
#ifndef _WIN32
#include "paddle/fluid/framework/garbage_collector.h"
#endif

namespace paddle {
namespace framework {
3 changes: 2 additions & 1 deletion paddle/fluid/inference/CMakeLists.txt
@@ -35,9 +35,10 @@ endif()

# Create static library
if (WIN32)
cc_library(paddle_fluid DEPS ${fluid_modules} ${fluid_thirdpa} paddle_fluid_api paddle_inference_api)
cc_library(paddle_fluid DEPS ${fluid_modules} ${fluid_third_partys} paddle_fluid_api paddle_inference_api)
else(WIND32)
cc_library(paddle_fluid DEPS ${fluid_modules} ${STATIC_INFERENCE_APIS} zero_copy_tensor reset_tensor_array)
endif(WIN32)

if(NOT APPLE)
# TODO(liuyiqu: Temporarily disable the link flag because it is not support on Mac.
1 change: 1 addition & 0 deletions paddle/fluid/inference/api/CMakeLists.txt
@@ -51,6 +51,7 @@ function(inference_api_test TARGET_NAME)
endfunction(inference_api_test)

cc_library(reset_tensor_array SRCS details/reset_tensor_array.cc DEPS lod_tensor scope)
cc_library(helper SRCS helper.cc DEPS reset_tensor_array lod_tensor scope)
cc_library(paddle_inference_api SRCS api.cc api_impl.cc helper.cc DEPS reset_tensor_array lod_tensor scope)
cc_library(analysis_predictor SRCS analysis_predictor.cc DEPS paddle_inference_api analysis naive_executor zero_copy_tensor)
cc_library(zero_copy_tensor SRCS details/zero_copy_tensor.cc DEPS paddle_inference_api)
1 change: 0 additions & 1 deletion paddle/fluid/inference/api/api.cc
@@ -16,7 +16,6 @@
#include "paddle/fluid/framework/scope.h"
#include "paddle/fluid/inference/api/paddle_inference_api.h"
#include "paddle/fluid/platform/enforce.h"
#include "paddle_inference_api.h"

namespace paddle {

5 changes: 2 additions & 3 deletions paddle/fluid/inference/api/api_impl.cc
@@ -260,9 +260,8 @@ std::unique_ptr<PaddlePredictor> CreatePaddlePredictor<
if (config.use_gpu) {
// 1. GPU memeroy
PADDLE_ENFORCE_GT(
config.fraction_of_gpu_memory, 0.f,
"fraction_of_gpu_memory in the config should be set to range (0.,
1.]");
config.fraction_of_gpu_memory, 0.f,
"fraction_of_gpu_memory in the config should be set to range (0.,1.]");
PADDLE_ENFORCE_GE(config.device, 0, "Invalid device id %d", config.device);
std::vector<std::string> flags;
if (config.fraction_of_gpu_memory >= 0.0f ||
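The removed PADDLE_ENFORCE_GT call had its message literal broken across a physical line, which is not valid C++; the fix joins the message onto one line. When a long message does need to span source lines, adjacent string literals concatenate at compile time, as in this standalone sketch:

#include <cstdio>

int main() {
  // Adjacent string literals are merged by the compiler, so a long message
  // can span source lines without a raw newline inside any one literal.
  const char* msg =
      "fraction_of_gpu_memory in the config should be set "
      "to range (0., 1.]";
  std::puts(msg);
  return 0;
}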
2 changes: 1 addition & 1 deletion paddle/fluid/inference/api/api_impl.h
@@ -31,10 +31,10 @@ limitations under the License. */
#include "paddle/fluid/framework/lod_tensor_array.h"
#include "paddle/fluid/framework/naive_executor.h"
#include "paddle/fluid/inference/api/details/reset_tensor_array.h"
#include "paddle/fluid/inference/api/paddle_inference_api.h"
#include "paddle/fluid/inference/io.h"
#include "paddle/fluid/platform/init.h"
#include "paddle/fluid/platform/profiler.h"
#include "paddle_inference_api.h" // NOLINT

namespace paddle {

9 changes: 4 additions & 5 deletions paddle/fluid/inference/api/helper.h
@@ -14,18 +14,17 @@

#pragma once

#define GLOG_NO_ABBREVIATED_SEVERITIES
#define GOOGLE_GLOG_DLL_DECL
#include <glog/logging.h>

#include <algorithm>
#include <chrono> // NOLINT
#include <iterator>
#include <numeric>
#include <sstream>
#include <string>
#include <vector>
#include "paddle/fluid/string/printf.h"
#include "paddle_inference_api.h"
#include "timer.h"
#include "paddle_inference_api.h" //NOLINT

namespace paddle {
namespace inference {
@@ -97,7 +96,7 @@ static void TensorAssignData(PaddleTensor *tensor,
}

template <typename T>
static int ZeroCopyTensorAssignData(ZeroCopyTensor *tensor,
static int ZeroCopyTensorAssignData(paddle::ZeroCopyTensor *tensor,
const std::vector<std::vector<T>> &data) {
int size{0};
auto *ptr = tensor->mutable_data<T>(PaddlePlace::kCPU);
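helper.h now defines two macros ahead of the glog include. On Windows, windows.h defines an ERROR macro that clashes with glog's abbreviated severity names, and GOOGLE_GLOG_DLL_DECL is typically defined empty when linking glog statically; defining both before the include sidesteps each problem. A small sketch of the ordering, assuming glog is available:

// The defines must come before the include to take effect.
#define GLOG_NO_ABBREVIATED_SEVERITIES  // avoid the windows.h ERROR clash
#define GOOGLE_GLOG_DLL_DECL            // no dllimport decoration (static glog)
#include <glog/logging.h>

int main(int argc, char** argv) {
  (void)argc;
  google::InitGoogleLogging(argv[0]);
  LOG(INFO) << "glog initialized without the Windows ERROR macro conflict";
  return 0;
}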
10 changes: 4 additions & 6 deletions paddle/fluid/operators/CMakeLists.txt
@@ -284,12 +284,10 @@ op_library(array_to_lod_tensor_op DEPS lod_rank_table_op)
op_library(max_sequence_len_op DEPS lod_rank_table)
op_library(sequence_conv_op DEPS context_project)
op_library(sequence_pool_op DEPS sequence_pooling)
if (NOT WIN32)
op_library(lstm_op DEPS sequence2batch lstm_compute)
op_library(hierarchical_sigmoid_op DEPS matrix_bit_code)
op_library(lstmp_op DEPS sequence2batch lstm_compute)
op_library(gru_op DEPS sequence2batch gru_compute)
endif(NOT WIN32)
op_library(lstm_op DEPS sequence2batch lstm_compute)
op_library(hierarchical_sigmoid_op DEPS matrix_bit_code)
op_library(lstmp_op DEPS sequence2batch lstm_compute)
op_library(gru_op DEPS sequence2batch gru_compute)
op_library(recurrent_op DEPS executor)
op_library(warpctc_op DEPS dynload_warpctc sequence_padding sequence_scale)
op_library(cos_sim_op DEPS cos_sim_functor)
@@ -31,12 +31,12 @@ namespace operators {

template <typename T>
__device__ bool GT_E(T a, T b) {
return (a > b) || fabs(a - b) < 1e-4;
return (a > b) || fabsf(static_cast<float>(a - b)) < 1e-4;
}

template <typename T>
__device__ bool LT_E(T a, T b) {
return (a < b) || fabs(a - b) < 1e-4;
return (a < b) || fabsf(static_cast<float>(a - b)) < 1e-4;
}

template <typename T>
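The tolerance comparisons switch from fabs to fabsf with an explicit float cast, presumably so a single-precision overload is resolved unambiguously for any template T under MSVC and nvcc. A host-side analog of the same comparison (using std::fabs on a float, which is equivalent to fabsf):

#include <cassert>
#include <cmath>

// Host-side analog of the device-side GT_E above: "greater or approximately
// equal" within an absolute tolerance of 1e-4.
template <typename T>
bool ApproxGE(T a, T b) {
  return (a > b) || std::fabs(static_cast<float>(a - b)) < 1e-4f;
}

int main() {
  assert(ApproxGE(1.00005f, 1.0f));  // within tolerance counts as equal
  assert(ApproxGE(2.0f, 1.0f));      // strictly greater
  assert(!ApproxGE(1.0f, 2.0f));     // clearly smaller
  return 0;
}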
14 changes: 7 additions & 7 deletions paddle/fluid/operators/math/CMakeLists.txt
@@ -57,9 +57,6 @@ math_library(sequence_padding)
math_library(sequence_pooling DEPS math_function)
math_library(sequence_scale)
math_library(softmax DEPS math_function)
if (NOT WIN32)
math_library(matrix_bit_code)
endif (NOT WIN32)
math_library(unpooling)
math_library(vol2col)

@@ -75,7 +72,10 @@ if(WITH_GPU)
endif()
cc_test(concat_test SRCS concat_test.cc DEPS concat_and_split)
cc_test(cpu_vec_test SRCS cpu_vec_test.cc DEPS blas cpu_info)
cc_library(jit_kernel
SRCS jit_kernel.cc jit_kernel_blas.cc jit_kernel_exp.cc jit_kernel_rnn.cc jit_kernel_crf_decode.cc
DEPS cpu_info cblas)
cc_test(jit_kernel_test SRCS jit_kernel_test.cc DEPS jit_kernel)
if (NOT WIN32)
math_library(matrix_bit_code)
cc_library(jit_kernel
SRCS jit_kernel.cc jit_kernel_blas.cc jit_kernel_exp.cc jit_kernel_rnn.cc jit_kernel_crf_decode.cc
DEPS cpu_info cblas)
cc_test(jit_kernel_test SRCS jit_kernel_test.cc DEPS jit_kernel)
endif (NOT WIN32)
2 changes: 2 additions & 0 deletions paddle/fluid/platform/device_context.cc
@@ -235,7 +235,9 @@ CUDADeviceContext::CUDADeviceContext(CUDAPlace place)
<< ", Runtime Version: " << runtime_version_ / 1000 << "."
<< (runtime_version_ % 100) / 10;

#ifndef _WIN32
callback_manager_.reset(new StreamCallbackManager(stream_));
#endif // NOT WIN32
}

CUDADeviceContext::~CUDADeviceContext() {
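This is the same guard pattern as in the executor: StreamCallbackManager only exists in non-Windows builds, so both the member and its initialization sit behind one #ifndef. A sketch with a stand-in type rather than Paddle's real class:

#include <memory>

struct StreamCallbackManagerStub {};  // stand-in for the real class

class DeviceContextSketch {
 public:
  DeviceContextSketch() {
#ifndef _WIN32
    // Only non-Windows builds create the callback manager.
    callback_manager_.reset(new StreamCallbackManagerStub());
#endif
  }

 private:
#ifndef _WIN32
  std::unique_ptr<StreamCallbackManagerStub> callback_manager_;
#endif
};

int main() {
  DeviceContextSketch ctx;
  (void)ctx;
  return 0;
}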