Update BOLT #99

Closed
wants to merge 33 commits into from

Conversation

shintaro-iwasaki


shintaro-iwasaki and others added 30 commits June 3, 2021 21:56
Somehow the BOLT llvmomp branch diverged from the official LLVM OpenMP
(llvm/llvm-project@119a9ea).
This patch closes that gap.
[libomptarget][devicertl] Drop templated atomic functions

The five __kmpc_atomic templates are instantiated a total of seven times.
This change replaces the templates with explicitly typed functions, which
have the same prototype for amdgcn and nvptx, and implements them with
the same code presently in use.

Rolls in the accepted but not yet landed D95085.

The unsigned long long type can be replaced with uint64_t when replacing
the cuda function. Until then, clang warns on casting a pointer to one to
a pointer to the other.
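The replacement pattern is roughly the following. This is a hedged sketch: the function names (`atomic_add_u32`, `atomic_add_u64`) and the `__atomic` builtins are illustrative stand-ins, not the actual devicertl symbols or their implementation.

```cpp
#include <cstdint>

// Before (conceptually): template <typename T> T __kmpc_atomic_add(T *, T);
// instantiated once per type actually used.

// After: one explicitly typed function per instantiation, with the same
// prototype for amdgcn and nvptx.
extern "C" uint32_t atomic_add_u32(uint32_t *Addr, uint32_t Val) {
  return __atomic_fetch_add(Addr, Val, __ATOMIC_SEQ_CST);
}

extern "C" uint64_t atomic_add_u64(uint64_t *Addr, uint64_t Val) {
  return __atomic_fetch_add(Addr, Val, __ATOMIC_SEQ_CST);
}
```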

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D95093

cherry-pick: 9b19ecb8f1ec7acbcfd6f0e4f3cbd6902570105d
llvm/llvm-project@9b19ecb
The buckets are initialized in __kmp_dephash_create but when they are extended
the memory is allocated but not NULL'd, potentially leaving some buckets
uninitialized after all entries have been copied into the new allocation.
This commit makes sure the buckets are properly initialized with NULL before
copying the entries.
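A hedged sketch of the fix, with simplified stand-in types; the real change lives in libomp's dependence-hash extension code, not in this exact function.

```cpp
#include <cstdlib>
#include <cstring>

struct entry_t {
  entry_t *next_in_bucket;
  // ... payload ...
};

entry_t **extend_buckets(size_t new_size) {
  entry_t **new_buckets =
      static_cast<entry_t **>(std::malloc(new_size * sizeof(entry_t *)));
  // The fix: NULL every bucket up front so buckets that receive no entries
  // during the copy are not left holding uninitialized pointers.
  std::memset(new_buckets, 0, new_size * sizeof(entry_t *));
  // ... entries from the old allocation are then rehashed into new_buckets ...
  return new_buckets;
}
```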

Differential Revision: https://reviews.llvm.org/D95167

cherry-pick: edbcc17b7a0b5a4f20ec55983e172d0120ccbca9
llvm/llvm-project@edbcc17
[libomptarget] Build cuda plugin without cuda installed locally

Compiles a new file, `plugins/cuda/dynamic_cuda/cuda.cpp`, to an object file that exposes the same symbols that the plugin presently uses from libcuda. The object file contains dlopen of libcuda and cached dlsym calls. Also provides a cuda.h containing the subset that is used.

This lets the cmake file choose between the system cuda and a dlopen shim, with no changes to rtl.cpp.

The corresponding change to amdgpu is postponed until after a refactor of the plugin to reduce the size of the hsa.h stub required.
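A hedged sketch of the dlopen/cached-dlsym idea; the names here (`getLibcuda`, `dynFree`, the stand-in typedefs) are illustrative and not the actual contents of `plugins/cuda/dynamic_cuda/cuda.cpp`.

```cpp
#include <dlfcn.h>

using CUresult_t = int;                  // stand-in for CUresult from the local cuda.h subset
using CUdeviceptr_t = unsigned long long;

static void *getLibcuda() {
  // dlopen libcuda once; nullptr means the driver library is not available.
  static void *Handle = dlopen("libcuda.so", RTLD_NOW | RTLD_GLOBAL);
  return Handle;
}

CUresult_t dynFree(CUdeviceptr_t Ptr) {
  using FnTy = CUresult_t (*)(CUdeviceptr_t);
  // dlsym once and cache the result, so the plugin links only against dl,
  // not against libcuda itself.
  static FnTy Fn = getLibcuda()
                       ? reinterpret_cast<FnTy>(dlsym(getLibcuda(), "cuMemFree"))
                       : nullptr;
  if (!Fn)
    return 1; // an error code; the real shim reports this more carefully
  return Fn(Ptr);
}
```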

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D95155

cherry-pick: 47e95e87a3e4f738635ff965616d4e2d96bf838a
llvm/llvm-project@47e95e8
Also, return NULL from unsuccessful OMPT function lookup.

Differential Revision: https://reviews.llvm.org/D95277

cherry-pick: 480cbed31e74b0db3d31d78789b639af250ce9fe
llvm/llvm-project@480cbed
[libomptarget][cuda] Call v2 functions explicitly

rtl.cpp calls functions like cuMemFree that are replaced by a macro
in cuda.h with cuMemFree_v2. This patch changes the source to use
the v2 names consistently.

See also D95104, D95155 for the idea. Alternatives are to use a mixture,
e.g. call the macro names and explicitly dlopen the _v2 names, or to keep
the current status where the symbols are replaced by macros in both files.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D95274

cherry-pick: 78b0630b72a9742d62b07cef912b72f1743bfae9
llvm/llvm-project@78b0630
[libomptarget][amdgpu][nfc] Update comments

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D95295

cherry-pick: dc70c56be5922b874b1408edc1315fcda40680ba
llvm/llvm-project@dc70c56
[libomptarget][nvptx] Replace cuda atomic primitives with clang intrinsics

Tested by diff of IR generated for target_impl.cu before and after. NFC. Part
of removing deviceRTL build time dependency on cuda SDK.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D95294

cherry-pick: c3074d48d38cc1207da893b6f3545b5777db4c27
llvm/llvm-project@c3074d4
D95161 removed the option `--libomptarget-nvptx-path`, which is used in
the tests for `libomptarget-nvptx`.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D95293

cherry-pick: cfd978d5d3c8a06813e25f69ff1386428380a7cb
llvm/llvm-project@cfd978d
[libomptarget] Compile with older cuda, revert D95274

Fixes regression reported in comments of D95274.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D95367

cherry-pick: 95f0d1edafe3e52a4057768f8cde5d55faf39d16
llvm/llvm-project@95f0d1e
Fix the names to use Pascal case to comply with the LLVM coding guidelines. `ident_t` is required for compatibility with the rest of libomp.

cherry-pick: 93eef7d8e978d9efd0b28311d7be0d483f22e5d2
llvm/llvm-project@93eef7d
This patch prepares for dropping CUDA when compiling `deviceRTLs`.
CUDA intrinsics are replaced by NVVM intrinsics, which refer to code in
`__clang_cuda_intrinsics.h`. We don't want to include that header directly because in the
near future we're going to switch to OpenMP, and by then the header can no longer be
used.
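An illustrative example of the kind of substitution involved; the real change touches more intrinsics than this one. `__nvvm_read_ptx_sreg_tid_x` is a Clang NVPTX builtin, so no CUDA header is required.

```cpp
static inline unsigned getThreadIdInBlockX() {
#if defined(__NVPTX__)
  return __nvvm_read_ptx_sreg_tid_x(); // was: threadIdx.x via CUDA headers
#else
  return 0; // host stub so this sketch compiles anywhere
#endif
}
```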

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D95327

cherry-pick: 27cc4a8138d819f78bc4fc028e39772bbda84dbd
llvm/llvm-project@27cc4a8
`omp_is_initial_device` in device code was implemented as a builtin
function in D38968 for better performance. Therefore there is no chance that
calls to this function will reach `deviceRTLs`. As we're moving to build `deviceRTLs`
with an OpenMP compiler, this function can lead to a compilation error. This patch
simply removes it.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D95397

cherry-pick: 3333244d77c44e8bb5af57027646596f7714ff62
llvm/llvm-project@3333244
[libomptarget][cuda] Gracefully handle missing cuda library

If using dynamic cuda, and it failed to load, it is not safe to call
cuGetErrorString.
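A hedged sketch of the guard described above; `LibcudaLoaded` and the error-string helper are illustrative stand-ins for the real dynamic-cuda state.

```cpp
#include <cstdio>

static bool LibcudaLoaded = false; // true only if dlopen("libcuda.so") succeeded

static const char *errorStringOrUnknown(int /*Err*/) {
  // In the real plugin this would go through the shim to cuGetErrorString.
  return "unknown CUDA error";
}

void reportCudaError(int Err) {
  if (!LibcudaLoaded) {
    // cuGetErrorString itself lives in libcuda, so it must not be called
    // when the library failed to load.
    std::fprintf(stderr, "CUDA error %d (libcuda.so could not be loaded)\n", Err);
    return;
  }
  std::fprintf(stderr, "CUDA error: %s\n", errorStringOrUnknown(Err));
}
```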

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D95412

cherry-pick: fafd45c01f3a49a40b09a31e9ea82efc87f3ab35
llvm/llvm-project@fafd45c
This reverts commit fafd45c01f3a49a40b09a31e9ea82efc87f3ab35.

cherry-pick: 357eea6e8bf78a822b8d3a6fe3bc6f85fee66a3e
llvm/llvm-project@357eea6
The basic design is to create an outermost parallel team. It is not a regular team because it is only created when the first hidden helper task is encountered, and it is only responsible for the execution of hidden helper tasks. We first use `pthread_create` to create a new thread; let's call it the initial and also the main thread of the hidden helper team. This initial thread then initializes a new root, just like what the RTL does during initialization. After that, it directly calls `__kmpc_fork_call`, as if the initial thread had encountered a parallel region. In the wrapped function for this team, the main thread, which is the initial thread that we create via `pthread_create` on Linux, waits on a condition variable. The condition variable can only be signaled when the RTL is being destroyed. The other worker threads just do nothing. The reason the main thread needs to wait there is that, in the current implementation, once the main thread finishes the wrapped function of this team, it starts to free the team, which is not what we want.

Two environment variables, `LIBOMP_NUM_HIDDEN_HELPER_THREADS` and `LIBOMP_USE_HIDDEN_HELPER_TASK`, are also set to configure the number of threads and enable/disable this feature. By default, the number of hidden helper threads is 8.
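A very rough sketch of the lifetime scheme described above (the helper team's initial thread parks on a condition variable until runtime shutdown); all names are illustrative, not the actual libomp internals.

```cpp
#include <pthread.h>

static pthread_mutex_t ShutdownLock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ShutdownCond = PTHREAD_COND_INITIALIZER;
static bool ShuttingDown = false;

// Runs on the hidden helper team's initial thread: initialize a new root,
// "fork" the helper team, then block until the runtime is being destroyed so
// the team is not freed prematurely.
static void *hiddenHelperInitialThread(void *) {
  // ... initialize a new root, __kmpc_fork_call(...) for the helper team ...
  pthread_mutex_lock(&ShutdownLock);
  while (!ShuttingDown)
    pthread_cond_wait(&ShutdownCond, &ShutdownLock);
  pthread_mutex_unlock(&ShutdownLock);
  return nullptr;
}

void startHiddenHelperTeam() {
  pthread_t Tid;
  pthread_create(&Tid, nullptr, hiddenHelperInitialThread, nullptr);
  pthread_detach(Tid);
}

void signalRuntimeShutdown() {
  pthread_mutex_lock(&ShutdownLock);
  ShuttingDown = true;
  pthread_cond_broadcast(&ShutdownCond);
  pthread_mutex_unlock(&ShutdownLock);
}
```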

Here are some open issues to be discussed:
1. The main thread goes to sleep when the initialization is finished. As Andrey mentioned, we might need it to be awakened from time to time to do some work. What kind of update/check should be put here?

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D77609

cherry-pick: 9d64275ae08fbdeeca0ce9c2f3951a2de6f38a08
llvm/llvm-project@9d64275
In much of the libomptarget interface we now have an ident_t object; if
it is not null, we can use it to improve the profile output. For now, we
simply use the ident_t "source information string" as generated by the
FE.
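A hedged sketch of how an ident_t source information string could label a profile entry; the real code uses libomptarget's timing macros, and the struct below is a simplified stand-in for ident_t.

```cpp
#include <cstdio>
#include <string>

struct IdentLike {
  const char *psource; // e.g. ";file;function;line;column;;" as emitted by the FE
};

static std::string profileLabel(const char *ApiName, const IdentLike *Loc) {
  if (Loc && Loc->psource) // only use the ident_t if the FE passed a non-null one
    return std::string(ApiName) + " " + Loc->psource;
  return ApiName;
}

int main() {
  IdentLike Loc{";app.c;main;12;3;;"};
  std::printf("%s\n", profileLabel("__tgt_target", &Loc).c_str());
}
```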

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D95282

cherry-pick: 8c7fdc4c61bff94a3ac1bb4877d1c00e01ee53be
llvm/llvm-project@8c7fdc4
Fixed the declaration of a define by adding a missing comma. Required to fix the build without profiling.
cherry-pick: 4a63e53373f92adb2261ff5554ec633001ed0eee
llvm/llvm-project@4a63e53
…et dependent language

From this patch (plus some already-landed patches), `deviceRTLs` is treated as a regular OpenMP program with just `declare target` regions. In this way, ideally, `deviceRTLs` can be written directly in OpenMP: no CUDA, no HIP anymore. (AMD is still working on getting this to work; for now AMDGCN still compiles the original way.) Some target-specific functions are still required, but they're no longer written in a target-specific language. For example, the CUDA parts have all been refined by replacing CUDA intrinsics and builtins with LLVM/Clang/NVVM intrinsics (a small sketch of this style follows the list below).
Here is a list of the changes in this patch:
1. For NVPTX, `DEVICE` is defined empty in order to make the common parts still work with AMDGCN. Later, once AMDGCN is also available, we will completely remove `DEVICE` and probably some other macros.
2. Shared variables are implemented with the OpenMP allocator, which is defined in `allocator.h`. Again, this feature is not available on AMDGCN, so two macros are redefined appropriately.
3. The CUDA header `cuda.h` is dropped from the source code. In order to deal with code differences across CUDA versions, we build one bitcode library for each supported CUDA version. For each CUDA version, the highest PTX version it supports is used, just as we currently do for CUDA compilation.
4. Correspondingly, the compiler driver is also updated to support the CUDA version encoded in the name of the bitcode library. The bitcode library for NVPTX is now named `libomptarget-nvptx-cuda_[cuda_version]-sm_[sm_number].bc`, for example `libomptarget-nvptx-cuda_80-sm_20.bc`.
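A minimal sketch of the "plain OpenMP with `declare target` regions" style that the patch moves `deviceRTLs` toward; this is an illustrative fragment, not actual `deviceRTLs` source.

```cpp
#include <cstdint>

#pragma omp declare target

// Device-side helper written in plain C++, no CUDA/HIP; anything truly
// target-specific would use compiler/NVVM intrinsics instead of cuda.h.
static uint32_t add_and_fetch(uint32_t *Addr, uint32_t V) {
  return __atomic_add_fetch(Addr, V, __ATOMIC_SEQ_CST);
}

uint32_t bump_counter(uint32_t *Counter) { return add_and_fetch(Counter, 1); }

#pragma omp end declare target
```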

With this change, there are also multiple features to expect in the near future:
1. CUDA will be completely dropped when compiling OpenMP. By then, we will also build bitcode libraries for all supported SM versions, multiplied by all supported CUDA versions.
2. Atomic operations used in `deviceRTLs` can be replaced by `omp atomic` once the OpenMP 5.1 feature is fully supported. For now, the generated IR is totally wrong.
3. Target-specific parts will be wrapped in `declare variant` with the `isa` selector once that works properly. No target-specific macros will be needed anymore.
4. (Maybe more...)

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D94745

cherry-pick: 7c03f7d7d04c0f017cc8e9522209c98036042f17
llvm/llvm-project@7c03f7d
In order to support remote execution, we need to be able to send the
target binary description to the remote host for registration (and
subsequent deregistration). To support this, I added these two new
optional functions to the plugin API:
- `__tgt_rtl_register_lib`
- `__tgt_rtl_unregister_lib`

These functions will be called to properly manage the instance of
libomptarget running on the remote host.
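A hedged sketch of what the two optional entry points might look like; the precise signatures are set by libomptarget's plugin interface, `__tgt_bin_desc` is only forward-declared here, and the bodies are no-op stubs for illustration.

```cpp
#include <cstdint>

struct __tgt_bin_desc; // the target binary description passed down by libomptarget

extern "C" int32_t __tgt_rtl_register_lib(__tgt_bin_desc *Desc) {
  // A remote plugin would forward Desc to the server-side libomptarget here.
  (void)Desc;
  return 0;
}

extern "C" int32_t __tgt_rtl_unregister_lib(__tgt_bin_desc *Desc) {
  // Counterpart called during shutdown to undo the registration.
  (void)Desc;
  return 0;
}
```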

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D93293

cherry-pick: 683719bc0cc8e12a5f9c06135fc97a13ef414f69
llvm/llvm-project@683719b
This introduces a remote offloading plugin for libomptarget. This
implementation relies on gRPC and protobuf, so this library will only
build if both libraries are available on the system. The corresponding
server is compiled to `openmp-offloading-server`.

This is a large change, but the only way to split it up would be into RTL and server
parts, and I fear that could introduce an inconsistency between them.

Ideally, tests for this should be added to the current ones, but that is
problematic for at least one reason. Given that libomptarget registers plugins
on a first-come-first-served basis, if we wanted to offload onto a local x86
through a different process, then we'd have to either re-order the plugin list
in `rtl.cpp` (which is what I did locally for testing) or find a better
solution for runtime plugin registration in libomptarget.

Differential Revision: https://reviews.llvm.org/D95314

cherry-pick: ec8f4a38c83eefc51be4f8cc39f93cc79116aab5
llvm/llvm-project@ec8f4a3
[libomptarget][cuda] Only run tests when sure there is cuda available

Prior to D95155, building the cuda plugin implied cuda was installed locally.
With that change, every machine can build a cuda plugin, but they won't all have
cuda and/or an nvptx card installed locally.

This change enables the nvptx tests when either:
- libcuda is present
- the user has forced use of the dlopen stub

The default case when there is no cuda detected will no longer attempt to
run the tests on nvptx hardware, as was the case before D95155.

Reviewed By: jdoerfert, ronlieb

Differential Revision: https://reviews.llvm.org/D95467

cherry-pick: fdeffd6fb0c1a824137c502e2b1c182aded17325
llvm/llvm-project@fdeffd6
[libomptarget][cuda] Gracefully handle missing cuda library

If using dynamic cuda, and it failed to load, it is not safe to call
cuGetErrorString.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D95412

cherry-pick: 7baff00eeedb0d9bac5e7cd0c5e4189a6cc6101d
llvm/llvm-project@7baff00
Requiring 3.15 causes a build breakage; I'm sure none of the contents actually require
3.15 or above.

Differential Revision: https://reviews.llvm.org/D95474

cherry-pick: 810572cc96e99b5d6fb695f1f047c2ae6507831b
llvm/llvm-project@810572c
Differential Revision: https://reviews.llvm.org/D95476

cherry-pick: 5f1d4d477902a9c058bd5506f17eeab6b7c5b7f5
llvm/llvm-project@5f1d4d4
Differential Revision: https://reviews.llvm.org/D95486

cherry-pick: 3caa2d3354e31827ba7a5e258f0025bac5336cbe
llvm/llvm-project@3caa2d3
[libomptarget][cuda] Handle missing _v2 symbols gracefully

Follow-on from D95367. Dlsym the _v2 symbols if present; otherwise use the
unsuffixed version. Builds a hashtable for the check; this can be revised for zero
heap allocations later if necessary.
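A hedged sketch of the "_v2 if present, else unsuffixed" lookup with a cached table; the real shim's data structure and error handling differ, and the function name here is illustrative.

```cpp
#include <dlfcn.h>
#include <string>
#include <unordered_map>

void *resolveCudaSymbol(void *LibcudaHandle, const std::string &Name) {
  static std::unordered_map<std::string, void *> Cache;
  auto It = Cache.find(Name);
  if (It != Cache.end())
    return It->second;
  // Prefer the _v2 symbol when libcuda exports it (newer drivers); otherwise
  // fall back to the unsuffixed name.
  void *Sym = dlsym(LibcudaHandle, (Name + "_v2").c_str());
  if (!Sym)
    Sym = dlsym(LibcudaHandle, Name.c_str());
  Cache.emplace(Name, Sym);
  return Sym;
}
```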

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D95415

cherry-pick: 653655040f3e89f7725ce6961d797d4ac918708b
llvm/llvm-project@6536550
nawrinsu and others added 3 commits June 3, 2021 22:39
This patch sets the def-allocator-var ICV based on the value provided in the
OMP_ALLOCATOR environment variable. Previously, the only allowed value for OMP_ALLOCATOR
was a predefined memory allocator. The OpenMP 5.1 specification allows a predefined
memory allocator, a predefined memory space, or a predefined memory space with traits in
OMP_ALLOCATOR. If an allocator cannot be created using the provided environment
variable, the def-allocator-var is set to omp_default_mem_alloc.
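A hedged illustration of possible OMP_ALLOCATOR values under the OpenMP 5.1 syntax, and a small check of the resulting default allocator; the exact trait spellings accepted by the runtime should be taken from the spec and the libomp documentation.

```cpp
// Example values (illustrative):
//   OMP_ALLOCATOR=omp_high_bw_mem_alloc                           (predefined allocator)
//   OMP_ALLOCATOR=omp_high_bw_mem_space                           (predefined memory space)
//   OMP_ALLOCATOR=omp_high_bw_mem_space:alignment=64,pinned=true  (memory space + traits)
#include <omp.h>
#include <cstdio>

int main() {
  omp_allocator_handle_t A = omp_get_default_allocator();
  // If the value could not be turned into an allocator, the runtime falls
  // back to omp_default_mem_alloc, as described above.
  std::printf("using omp_default_mem_alloc fallback: %s\n",
              A == omp_default_mem_alloc ? "yes" : "no");
  return 0;
}
```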

Differential Revision: https://reviews.llvm.org/D94985

cherry-pick: 927af4b3c57681e623b8449fb717a447559358d0
llvm/llvm-project@927af4b
With D94745, we no longer use the CUDA SDK to compile `deviceRTLs`. Therefore,
much of the CMake code in the project is useless. This patch cleans up unnecessary code
and also drops the requirement of CUDA to build the NVPTX `deviceRTLs`. CUDA detection is
still used, however, to determine whether to enable the tests. Auto-detection of compute
capability is enabled by default and can be disabled by setting the CMake variable
`LIBOMPTARGET_NVPTX_AUTODETECT_COMPUTE_CAPABILITY=OFF`.
If auto-detection is enabled and CUDA is also valid, only the bitcode library for the
detected version is built; otherwise, all supported variants are generated. One drawback
of this patch is that we now generate 96 variants of the bitcode library, 1485 files in
total for a clean build on a non-CUDA system. `LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=""`
can be used to disable building the NVPTX `deviceRTLs`.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D95466

cherry-pick: e7535f8fedb5f355c332df9f2a87ebd61c82d983
llvm/llvm-project@e7535f8