Update BOLT #99

Closed
wants to merge 33 commits into from

Conversation

shintaro-iwasaki


shintaro-iwasaki and others added 30 commits June 3, 2021 21:56
Somehow the BOLT llvmomp branch diverged from the official LLVM OpenMP
(llvm/llvm-project@119a9ea).
This patch closes that gap.
[libomptarget][devicertl] Drop templated atomic functions

The five __kmpc_atomic templates are instantiated a total of seven times.
This change replaces the templates with explicitly typed functions, which
have the same prototype for amdgcn and nvptx, and implements them with
the same code presently in use.

Rolls in the accepted but not yet landed D95085.

The unsigned long long type can be replaced with uint64_t when replacing
the cuda function. Until then, clang warns on casting a pointer to one to
a pointer to the other.
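The replacement pattern is roughly the following. This is a hedged sketch: the function names (`atomic_add_u32`, `atomic_add_u64`) and the `__atomic` builtins are illustrative stand-ins, not the actual devicertl symbols or their implementation.

```cpp
#include <cstdint>

// Before (conceptually): template <typename T> T __kmpc_atomic_add(T *, T);
// instantiated once per type actually used.

// After: one explicitly typed function per instantiation, with the same
// prototype for amdgcn and nvptx.
extern "C" uint32_t atomic_add_u32(uint32_t *Addr, uint32_t Val) {
  return __atomic_fetch_add(Addr, Val, __ATOMIC_SEQ_CST);
}

extern "C" uint64_t atomic_add_u64(uint64_t *Addr, uint64_t Val) {
  return __atomic_fetch_add(Addr, Val, __ATOMIC_SEQ_CST);
}
```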

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D95093

cherry-pick: 9b19ecb8f1ec7acbcfd6f0e4f3cbd6902570105d
llvm/llvm-project@9b19ecb
The buckets are initialized in __kmp_dephash_create but when they are extended
the memory is allocated but not NULL'd, potentially leaving some buckets
uninitialized after all entries have been copied into the new allocation.
This commit makes sure the buckets are properly initialized with NULL before
copying the entries.
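A hedged sketch of the fix, with simplified stand-in types; the real change lives in libomp's dependence-hash extension code, not in this exact function.

```cpp
#include <cstdlib>
#include <cstring>

struct entry_t {
  entry_t *next_in_bucket;
  // ... payload ...
};

entry_t **extend_buckets(size_t new_size) {
  entry_t **new_buckets =
      static_cast<entry_t **>(std::malloc(new_size * sizeof(entry_t *)));
  // The fix: NULL every bucket up front so buckets that receive no entries
  // during the copy are not left holding uninitialized pointers.
  std::memset(new_buckets, 0, new_size * sizeof(entry_t *));
  // ... entries from the old allocation are then rehashed into new_buckets ...
  return new_buckets;
}
```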

Differential Revision: https://reviews.llvm.org/D95167

cherry-pick: edbcc17b7a0b5a4f20ec55983e172d0120ccbca9
llvm/llvm-project@edbcc17
[libomptarget] Build cuda plugin without cuda installed locally

Compiles a new file, `plugins/cuda/dynamic_cuda/cuda.cpp`, to an object file that exposes the same symbols that the plugin presently uses from libcuda. The object file contains dlopen of libcuda and cached dlsym calls. Also provides a cuda.h containing the subset that is used.

This lets the cmake file choose between the system cuda and a dlopen shim, with no changes to rtl.cpp.

The corresponding change to amdgpu is postponed until after a refactor of the plugin to reduce the size of the hsa.h stub required.
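A hedged sketch of the dlopen/cached-dlsym idea; the names here (`getLibcuda`, `dynFree`, the stand-in typedefs) are illustrative and not the actual contents of `plugins/cuda/dynamic_cuda/cuda.cpp`.

```cpp
#include <dlfcn.h>

using CUresult_t = int;                  // stand-in for CUresult from the local cuda.h subset
using CUdeviceptr_t = unsigned long long;

static void *getLibcuda() {
  // dlopen libcuda once; nullptr means the driver library is not available.
  static void *Handle = dlopen("libcuda.so", RTLD_NOW | RTLD_GLOBAL);
  return Handle;
}

CUresult_t dynFree(CUdeviceptr_t Ptr) {
  using FnTy = CUresult_t (*)(CUdeviceptr_t);
  // dlsym once and cache the result, so the plugin links only against dl,
  // not against libcuda itself.
  static FnTy Fn = getLibcuda()
                       ? reinterpret_cast<FnTy>(dlsym(getLibcuda(), "cuMemFree"))
                       : nullptr;
  if (!Fn)
    return 1; // an error code; the real shim reports this more carefully
  return Fn(Ptr);
}
```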

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D95155

cherry-pick: 47e95e87a3e4f738635ff965616d4e2d96bf838a
llvm/llvm-project@47e95e8
Also, return NULL from unsuccessful OMPT function lookup.

Differential Revision: https://reviews.llvm.org/D95277

cherry-pick: 480cbed31e74b0db3d31d78789b639af250ce9fe
llvm/llvm-project@480cbed
[libomptarget][cuda] Call v2 functions explicitly

rtl.cpp calls functions like cuMemFree that are replaced by a macro
in cuda.h with cuMemFree_v2. This patch changes the source to use
the v2 names consistently.

See also D95104, D95155 for the idea. Alternatives are to use a mixture,
e.g. call the macro names and explicitly dlopen the _v2 names, or to keep
the current status where the symbols are replaced by macros in both files.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D95274

cherry-pick: 78b0630b72a9742d62b07cef912b72f1743bfae9
llvm/llvm-project@78b0630
[libomptarget][amdgpu][nfc] Update comments

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D95295

cherry-pick: dc70c56be5922b874b1408edc1315fcda40680ba
llvm/llvm-project@dc70c56
[libomptarget][nvptx] Replace cuda atomic primitives with clang intrinsics

Tested by diff of IR generated for target_impl.cu before and after. NFC. Part
of removing deviceRTL build time dependency on cuda SDK.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D95294

cherry-pick: c3074d48d38cc1207da893b6f3545b5777db4c27
llvm/llvm-project@c3074d4
D95161 removed the option `--libomptarget-nvptx-path`, which is used in
the tests for `libomptarget-nvptx`.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D95293

cherry-pick: cfd978d5d3c8a06813e25f69ff1386428380a7cb
llvm/llvm-project@cfd978d
[libomptarget] Compile with older cuda, revert D95274

Fixes regression reported in comments of D95274.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D95367

cherry-pick: 95f0d1edafe3e52a4057768f8cde5d55faf39d16
llvm/llvm-project@95f0d1e
Fix the names to use Pascal case to comply with the LLVM coding guidelines. `ident_t` is required for compatibility with the rest of libomp.

cherry-pick: 93eef7d8e978d9efd0b28311d7be0d483f22e5d2
llvm/llvm-project@93eef7d
This patch prepares for dropping CUDA when compiling `deviceRTLs`.
CUDA intrinsics are replaced by NVVM intrinsics, which refer to code in
`__clang_cuda_intrinsics.h`. We don't want to include that header directly because in the
near future we're going to switch to OpenMP, and by then the header can no longer be
used.
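An illustrative example of the kind of substitution involved; the real change touches more intrinsics than this one. `__nvvm_read_ptx_sreg_tid_x` is a Clang NVPTX builtin, so no CUDA header is required.

```cpp
static inline unsigned getThreadIdInBlockX() {
#if defined(__NVPTX__)
  return __nvvm_read_ptx_sreg_tid_x(); // was: threadIdx.x via CUDA headers
#else
  return 0; // host stub so this sketch compiles anywhere
#endif
}
```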

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D95327

cherry-pick: 27cc4a8138d819f78bc4fc028e39772bbda84dbd
llvm/llvm-project@27cc4a8
`omp_is_initial_device` in device code was implemented as a builtin
function in D38968 for better performance. Therefore there is no chance that
calls to this function will reach `deviceRTLs`. As we're moving to build `deviceRTLs`
with an OpenMP compiler, this function can lead to a compilation error. This patch
simply removes it.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D95397

cherry-pick: 3333244d77c44e8bb5af57027646596f7714ff62
llvm/llvm-project@3333244
[libomptarget][cuda] Gracefully handle missing cuda library

If using dynamic cuda, and it failed to load, it is not safe to call
cuGetErrorString.
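A hedged sketch of the guard described above; `LibcudaLoaded` and the error-string helper are illustrative stand-ins for the real dynamic-cuda state.

```cpp
#include <cstdio>

static bool LibcudaLoaded = false; // true only if dlopen("libcuda.so") succeeded

static const char *errorStringOrUnknown(int /*Err*/) {
  // In the real plugin this would go through the shim to cuGetErrorString.
  return "unknown CUDA error";
}

void reportCudaError(int Err) {
  if (!LibcudaLoaded) {
    // cuGetErrorString itself lives in libcuda, so it must not be called
    // when the library failed to load.
    std::fprintf(stderr, "CUDA error %d (libcuda.so could not be loaded)\n", Err);
    return;
  }
  std::fprintf(stderr, "CUDA error: %s\n", errorStringOrUnknown(Err));
}
```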

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D95412

cherry-pick: fafd45c01f3a49a40b09a31e9ea82efc87f3ab35
llvm/llvm-project@fafd45c
This reverts commit fafd45c01f3a49a40b09a31e9ea82efc87f3ab35.

cherry-pick: 357eea6e8bf78a822b8d3a6fe3bc6f85fee66a3e
llvm/llvm-project@357eea6
The basic design is to create an outermost parallel team. It is not a regular team because it is only created when the first hidden helper task is encountered, and it is only responsible for the execution of hidden helper tasks. We first use `pthread_create` to create a new thread; let's call it the initial and also the main thread of the hidden helper team. This initial thread then initializes a new root, just like what the RTL does during initialization. After that, it directly calls `__kmpc_fork_call`, as if the initial thread had encountered a parallel region. In the wrapped function for this team, the main thread, which is the initial thread that we create via `pthread_create` on Linux, waits on a condition variable. The condition variable can only be signaled when the RTL is being destroyed. The other worker threads just do nothing. The reason the main thread needs to wait there is that, in the current implementation, once the main thread finishes the wrapped function of this team, it starts to free the team, which is not what we want.

Two environment variables, `LIBOMP_NUM_HIDDEN_HELPER_THREADS` and `LIBOMP_USE_HIDDEN_HELPER_TASK`, are also set to configure the number of threads and enable/disable this feature. By default, the number of hidden helper threads is 8.
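A very rough sketch of the lifetime scheme described above (the helper team's initial thread parks on a condition variable until runtime shutdown); all names are illustrative, not the actual libomp internals.

```cpp
#include <pthread.h>

static pthread_mutex_t ShutdownLock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ShutdownCond = PTHREAD_COND_INITIALIZER;
static bool ShuttingDown = false;

// Runs on the hidden helper team's initial thread: initialize a new root,
// "fork" the helper team, then block until the runtime is being destroyed so
// the team is not freed prematurely.
static void *hiddenHelperInitialThread(void *) {
  // ... initialize a new root, __kmpc_fork_call(...) for the helper team ...
  pthread_mutex_lock(&ShutdownLock);
  while (!ShuttingDown)
    pthread_cond_wait(&ShutdownCond, &ShutdownLock);
  pthread_mutex_unlock(&ShutdownLock);
  return nullptr;
}

void startHiddenHelperTeam() {
  pthread_t Tid;
  pthread_create(&Tid, nullptr, hiddenHelperInitialThread, nullptr);
  pthread_detach(Tid);
}

void signalRuntimeShutdown() {
  pthread_mutex_lock(&ShutdownLock);
  ShuttingDown = true;
  pthread_cond_broadcast(&ShutdownCond);
  pthread_mutex_unlock(&ShutdownLock);
}
```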

Here are some open issues to be discussed:
1. The main thread goes to sleep when the initialization is finished. As Andrey mentioned, we might need it to be awakened from time to time to do some work. What kind of update/check should be put here?

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D77609

cherry-pick: 9d64275ae08fbdeeca0ce9c2f3951a2de6f38a08
llvm/llvm-project@9d64275
In much of the libomptarget interface we now have an ident_t object; if
it is not null, we can use it to improve the profile output. For now, we
simply use the ident_t "source information string" as generated by the
FE.
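A hedged sketch of how an ident_t source information string could label a profile entry; the real code uses libomptarget's timing macros, and the struct below is a simplified stand-in for ident_t.

```cpp
#include <cstdio>
#include <string>

struct IdentLike {
  const char *psource; // e.g. ";file;function;line;column;;" as emitted by the FE
};

static std::string profileLabel(const char *ApiName, const IdentLike *Loc) {
  if (Loc && Loc->psource) // only use the ident_t if the FE passed a non-null one
    return std::string(ApiName) + " " + Loc->psource;
  return ApiName;
}

int main() {
  IdentLike Loc{";app.c;main;12;3;;"};
  std::printf("%s\n", profileLabel("__tgt_target", &Loc).c_str());
}
```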

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D95282

cherry-pick: 8c7fdc4c61bff94a3ac1bb4877d1c00e01ee53be
llvm/llvm-project@8c7fdc4
Fixed the declaration of a define by adding a missing comma. Required to fix the build without profiling.
cherry-pick: 4a63e53373f92adb2261ff5554ec633001ed0eee
llvm/llvm-project@4a63e53
…et dependent language

From this patch (plus some already-landed patches), `deviceRTLs` is treated as a regular OpenMP program with just `declare target` regions. In this way, ideally, `deviceRTLs` can be written directly in OpenMP: no CUDA, no HIP anymore. (AMD is still working on getting this to work; for now AMDGCN still compiles the original way.) Some target-specific functions are still required, but they're no longer written in a target-specific language. For example, the CUDA parts have all been refined by replacing CUDA intrinsics and builtins with LLVM/Clang/NVVM intrinsics (a small sketch of this style follows the list below).
Here is a list of the changes in this patch:
1. For NVPTX, `DEVICE` is defined empty in order to make the common parts still work with AMDGCN. Later, once AMDGCN is also available, we will completely remove `DEVICE` and probably some other macros.
2. Shared variables are implemented with the OpenMP allocator, which is defined in `allocator.h`. Again, this feature is not available on AMDGCN, so two macros are redefined appropriately.
3. The CUDA header `cuda.h` is dropped from the source code. In order to deal with code differences across CUDA versions, we build one bitcode library for each supported CUDA version. For each CUDA version, the highest PTX version it supports is used, just as we currently do for CUDA compilation.
4. Correspondingly, the compiler driver is also updated to support the CUDA version encoded in the name of the bitcode library. The bitcode library for NVPTX is now named `libomptarget-nvptx-cuda_[cuda_version]-sm_[sm_number].bc`, for example `libomptarget-nvptx-cuda_80-sm_20.bc`.
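A minimal sketch of the "plain OpenMP with `declare target` regions" style that the patch moves `deviceRTLs` toward; this is an illustrative fragment, not actual `deviceRTLs` source.

```cpp
#include <cstdint>

#pragma omp declare target

// Device-side helper written in plain C++, no CUDA/HIP; anything truly
// target-specific would use compiler/NVVM intrinsics instead of cuda.h.
static uint32_t add_and_fetch(uint32_t *Addr, uint32_t V) {
  return __atomic_add_fetch(Addr, V, __ATOMIC_SEQ_CST);
}

uint32_t bump_counter(uint32_t *Counter) { return add_and_fetch(Counter, 1); }

#pragma omp end declare target
```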

With this change, there are also multiple features to expect in the near future:
1. CUDA will be completely dropped when compiling OpenMP. By then, we will also build bitcode libraries for all supported SM versions, multiplied by all supported CUDA versions.
2. Atomic operations used in `deviceRTLs` can be replaced by `omp atomic` once the OpenMP 5.1 feature is fully supported. For now, the generated IR is totally wrong.
3. Target-specific parts will be wrapped in `declare variant` with the `isa` selector once that works properly. No target-specific macros will be needed anymore.
4. (Maybe more...)

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D94745

cherry-pick: 7c03f7d7d04c0f017cc8e9522209c98036042f17
llvm/llvm-project@7c03f7d
In order to support remote execution, we need to be able to send the
target binary description to the remote host for registration (and
subsequent deregistration). To support this, I added these two new
optional functions to the plugin API:
- `__tgt_rtl_register_lib`
- `__tgt_rtl_unregister_lib`

These functions will be called to properly manage the instance of
libomptarget running on the remote host.
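A hedged sketch of what the two optional entry points might look like; the precise signatures are set by libomptarget's plugin interface, `__tgt_bin_desc` is only forward-declared here, and the bodies are no-op stubs for illustration.

```cpp
#include <cstdint>

struct __tgt_bin_desc; // the target binary description passed down by libomptarget

extern "C" int32_t __tgt_rtl_register_lib(__tgt_bin_desc *Desc) {
  // A remote plugin would forward Desc to the server-side libomptarget here.
  (void)Desc;
  return 0;
}

extern "C" int32_t __tgt_rtl_unregister_lib(__tgt_bin_desc *Desc) {
  // Counterpart called during shutdown to undo the registration.
  (void)Desc;
  return 0;
}
```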

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D93293

cherry-pick: 683719bc0cc8e12a5f9c06135fc97a13ef414f69
llvm/llvm-project@683719b
This introduces a remote offloading plugin for libomptarget. This
implementation relies on gRPC and protobuf, so this library will only
build if both libraries are available on the system. The corresponding
server is compiled to `openmp-offloading-server`.

This is a large change, but the only way to split it up would be into RTL and server
parts, and I fear that could introduce an inconsistency between them.

Ideally, tests for this should be added to the current ones, but that is
problematic for at least one reason. Given that libomptarget registers plugins
on a first-come-first-served basis, if we wanted to offload onto a local x86
through a different process, then we'd have to either re-order the plugin list
in `rtl.cpp` (which is what I did locally for testing) or find a better
solution for runtime plugin registration in libomptarget.

Differential Revision: https://reviews.llvm.org/D95314

cherry-pick: ec8f4a38c83eefc51be4f8cc39f93cc79116aab5
llvm/llvm-project@ec8f4a3
[libomptarget][cuda] Only run tests when sure there is cuda available

Prior to D95155, building the cuda plugin implied cuda was installed locally.
With that change, every machine can build a cuda plugin, but they won't all have
cuda and/or an nvptx card installed locally.

This change enables the nvptx tests when either:
- libcuda is present
- the user has forced use of the dlopen stub

The default case when there is no cuda detected will no longer attempt to
run the tests on nvptx hardware, as was the case before D95155.

Reviewed By: jdoerfert, ronlieb

Differential Revision: https://reviews.llvm.org/D95467

cherry-pick: fdeffd6fb0c1a824137c502e2b1c182aded17325
llvm/llvm-project@fdeffd6
[libomptarget][cuda] Gracefully handle missing cuda library

If using dynamic cuda, and it failed to load, it is not safe to call
cuGetErrorString.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D95412

cherry-pick: 7baff00eeedb0d9bac5e7cd0c5e4189a6cc6101d
llvm/llvm-project@7baff00
Requiring 3.15 causes a build breakage; I'm sure none of the contents actually require
3.15 or above.

Differential Revision: https://reviews.llvm.org/D95474

cherry-pick: 810572cc96e99b5d6fb695f1f047c2ae6507831b
llvm/llvm-project@810572c
Differential Revision: https://reviews.llvm.org/D95476

cherry-pick: 5f1d4d477902a9c058bd5506f17eeab6b7c5b7f5
llvm/llvm-project@5f1d4d4
Differential Revision: https://reviews.llvm.org/D95486

cherry-pick: 3caa2d3354e31827ba7a5e258f0025bac5336cbe
llvm/llvm-project@3caa2d3
[libomptarget][cuda] Handle missing _v2 symbols gracefully

Follow-on from D95367. Dlsym the _v2 symbols if present; otherwise use the
unsuffixed version. Builds a hashtable for the check; this can be revised for zero
heap allocations later if necessary.
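A hedged sketch of the "_v2 if present, else unsuffixed" lookup with a cached table; the real shim's data structure and error handling differ, and the function name here is illustrative.

```cpp
#include <dlfcn.h>
#include <string>
#include <unordered_map>

void *resolveCudaSymbol(void *LibcudaHandle, const std::string &Name) {
  static std::unordered_map<std::string, void *> Cache;
  auto It = Cache.find(Name);
  if (It != Cache.end())
    return It->second;
  // Prefer the _v2 symbol when libcuda exports it (newer drivers); otherwise
  // fall back to the unsuffixed name.
  void *Sym = dlsym(LibcudaHandle, (Name + "_v2").c_str());
  if (!Sym)
    Sym = dlsym(LibcudaHandle, Name.c_str());
  Cache.emplace(Name, Sym);
  return Sym;
}
```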

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D95415

cherry-pick: 653655040f3e89f7725ce6961d797d4ac918708b
llvm/llvm-project@6536550
nawrinsu and others added 3 commits June 3, 2021 22:39
This patch sets the def-allocator-var ICV based on the value provided in the
OMP_ALLOCATOR environment variable. Previously, the only allowed value for OMP_ALLOCATOR
was a predefined memory allocator. The OpenMP 5.1 specification allows a predefined
memory allocator, a predefined memory space, or a predefined memory space with traits in
OMP_ALLOCATOR. If an allocator cannot be created using the provided environment
variable, the def-allocator-var is set to omp_default_mem_alloc.
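A hedged illustration of possible OMP_ALLOCATOR values under the OpenMP 5.1 syntax, and a small check of the resulting default allocator; the exact trait spellings accepted by the runtime should be taken from the spec and the libomp documentation.

```cpp
// Example values (illustrative):
//   OMP_ALLOCATOR=omp_high_bw_mem_alloc                           (predefined allocator)
//   OMP_ALLOCATOR=omp_high_bw_mem_space                           (predefined memory space)
//   OMP_ALLOCATOR=omp_high_bw_mem_space:alignment=64,pinned=true  (memory space + traits)
#include <omp.h>
#include <cstdio>

int main() {
  omp_allocator_handle_t A = omp_get_default_allocator();
  // If the value could not be turned into an allocator, the runtime falls
  // back to omp_default_mem_alloc, as described above.
  std::printf("using omp_default_mem_alloc fallback: %s\n",
              A == omp_default_mem_alloc ? "yes" : "no");
  return 0;
}
```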

Differential Revision: https://reviews.llvm.org/D94985

cherry-pick: 927af4b3c57681e623b8449fb717a447559358d0
llvm/llvm-project@927af4b
With D94745, we no longer use the CUDA SDK to compile `deviceRTLs`. Therefore,
much of the CMake code in the project is useless. This patch cleans up unnecessary code
and also drops the requirement of CUDA to build the NVPTX `deviceRTLs`. CUDA detection is
still used, however, to determine whether to enable the tests. Auto-detection of compute
capability is enabled by default and can be disabled by setting the CMake variable
`LIBOMPTARGET_NVPTX_AUTODETECT_COMPUTE_CAPABILITY=OFF`.
If auto-detection is enabled and CUDA is also valid, only the bitcode library for the
detected version is built; otherwise, all supported variants are generated. One drawback
of this patch is that we now generate 96 variants of the bitcode library, 1485 files in
total for a clean build on a non-CUDA system. `LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=""`
can be used to disable building the NVPTX `deviceRTLs`.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D95466

cherry-pick: e7535f8fedb5f355c332df9f2a87ebd61c82d983
llvm/llvm-project@e7535f8