Update BOLT #85

shintaro-iwasaki · 2020-10-28T03:07:05Z

To 6aa7228a629d81af78d4f701b7defb701f4b9283

cherry-pick: 0c938a8dd80ad707ec1f20d936cc5c9d73df8de5 llvm/llvm-project@0c938a8

Patch by mati865@gmail.com Differential Revision: https://reviews.llvm.org/D85210 cherry-pick: 4a04bc8995639e1d333790518e4d42e0961f740e llvm/llvm-project@4a04bc8

RTM Adaptive Locks are supported on msys2/mingw for clang and gcc. Differential Revision: https://reviews.llvm.org/D81776 cherry-pick: bf2aa74e51997ee190f3b34dd26a1b564e59e267 llvm/llvm-project@bf2aa74

Without this patch, the following example fails but shouldn't according to OpenMP TR8: ``` #pragma omp target enter data map(alloc:i) #pragma omp target data map(present, alloc: i) { #pragma omp target exit data map(delete:i) } // fails presence check here ``` OpenMP TR8 sec. 2.22.7.1 "map Clause", p. 321, L23-26 states: > If the map clause appears on a target, target data, target enter > data or target exit data construct with a present map-type-modifier > then on entry to the region if the corresponding list item does not > appear in the device data environment an error occurs and the > program terminates. There is no corresponding statement about the exit from a region. Thus, the `present` modifier should: 1. Check for presence upon entry into any region, including a `target exit data` region. This behavior is already implemented correctly. 2. Should not check for presence upon exit from any region, including a `target` or `target data` region. Without this patch, this behavior is not implemented correctly, breaking the above example. In the case of `target data`, this patch fixes the latter behavior by removing the `present` modifier from the map types Clang generates for the runtime call at the end of the region. In the case of `target`, we have not found a valid OpenMP program for which such a fix would matter. It appears that, if a program can guarantee that data is present at the beginning of a `target` region so that there's no error there, that data is also guaranteed to be present at the end. This patch adds a comment to the runtime to document this case. Reviewed By: grokos, RaviNarayanaswamy, ABataev Differential Revision: https://reviews.llvm.org/D84422 cherry-pick: 002d61db2b7790dc884953bf9271878bf0af3a8e llvm/llvm-project@002d61d

OpenMP TR8 sec. 2.15.6 "target update Construct", p. 183, L3-4 states: > If the corresponding list item is not present in the device data > environment and there is no present modifier in the clause, then no > assignment occurs to or from the original list item. L10-11 states: > If a present modifier appears in the clause and the corresponding > list item is not present in the device data environment then an > error occurs and the program termintates. (OpenMP 5.0 also has the first passage but without mention of the present modifier of course.) In both passages, I assume "is not present" includes the case of partially but not entirely present. However, without this patch, the target update directive misbehaves in this case both with and without the present modifier. For example: ``` #pragma omp target enter data map(to:arr[0:3]) #pragma omp target update to(arr[0:5]) // might fail on data transfer #pragma omp target update to(present:arr[0:5]) // might fail on data transfer ``` The problem is that `DeviceTy::getTgtPtrBegin` does not return a null pointer in that case, so `target_data_update` sees the data as fully present, and the data transfer then might fail depending on the target device. However, without the present modifier, there should never be a failure. Moreover, with the present modifier, there should always be a failure, and the diagnostic should mention the present modifier. This patch fixes `DeviceTy::getTgtPtrBegin` to return null when `target_data_update` is the caller. I'm wondering if it should do the same for more callers. Reviewed By: grokos, jdoerfert Differential Revision: https://reviews.llvm.org/D85246 cherry-pick: 5ab43989c353a2378910d20c7b88e44ea92b3aee llvm/llvm-project@5ab4398

targetDataMapper function fills arrays with the mapping data in the direct order. When this function is called by targetDataBegin or tgt_target_update functions, it works as expected. But targetDataEnd function processes mapped data in reverse order. In this case, the base pointer might be deleted before the associated data is deleted. Need to reverse data, mapped by mapper, too, since it always adds data that must be deleted at the end of the buffer. Fixes the test declare_mapper_target_update.cpp. Also, reduces the memry fragmentation by preallocation the memory buffers. Differential Revision: https://reviews.llvm.org/D85216 cherry-pick: 6780d5675b7cd75279d8fc13ee1a1cc272087613 llvm/llvm-project@6780d56

The standard way of printing `int64_t` data is via the PRId64 macro, `ld` is for `long int` and int64_t is not guaranteed to be typedef'ed as `long int` on all platforms. E.g. on Windows we get mismatch warnings. Differential Revision: https://reviews.llvm.org/D85353 cherry-pick: 40470eb27a5c97b01e89d8825626487b0682abec llvm/llvm-project@40470eb

For example, without this patch, the following fails as expected with or without the `present` modifier, but the `present` modifier doesn't produce its usual diagnostic: ``` #pragma omp target data map(alloc: arr[0:2]) { #pragma omp target map(present, tofrom: arr[0:100]) // not fully present ; } ``` Reviewed By: grokos, vzakhari Differential Revision: https://reviews.llvm.org/D85320 cherry-pick: 41b1aefecb9447620dd182b0352abed0df05665c llvm/llvm-project@41b1aef

For example: ``` #pragma omp target data map(tofrom:arr[0:100]) { #pragma omp target exit data map(delete:arr[0:100]) #pragma omp target enter data map(alloc:arr[98:2]) } ``` Without this patch, the transfer at the end of the target data region is broken and fails depending on the target device. According to my read of the spec, the transfer shouldn't even be attempted because `arr[0:100]` isn't (fully) present there. To fix that, this patch makes `DeviceTy::getTgtPtrBegin` return null for this case. Reviewed By: grokos Differential Revision: https://reviews.llvm.org/D85342 cherry-pick: 8c8bb128dfd09f84b27b9e732cf1355582b1d019 llvm/llvm-project@8c8bb12

D85342 broke this case. The new test case presents an example. Reviewed By: grokos Differential Revision: https://reviews.llvm.org/D85369 cherry-pick: 518a27e5591c211ceeef3091edc59012e6ace2b2 llvm/llvm-project@518a27e

Starting with 787eb0c637b I got spurious segmentation faults for some testcases. I could nail it down to `brel` trying to release the "memory" of the node allocated on the stack of __kmpc_omp_wait_deps. With this patch, you will see the assertion triggering for some of the tests in the test suite. My proposed solution for the issue is to just patch __kmpc_omp_wait_deps: ``` __kmp_init_node(&node); - node.dn.on_stack = 1; + // the stack owns the node + __kmp_node_ref(&node); ``` What do you think? Reviewed By: AndreyChurbanov Differential Revision: https://reviews.llvm.org/D84472 cherry-pick: 66a3575c2895f3b06056908bb40699f16e4b92d7 llvm/llvm-project@66a3575

[libomptarget] Implement host plugin for amdgpu Replacement for D71384. Primary difference is inlining the dependency on atmi followed by extensive simplification and bugfixes. This is the latest version from https://github.com/ROCm-Developer-Tools/amd-llvm-project/tree/aomp12 with minor patches and a rename from hsa to amdgpu, on the basis that this can't be used by other implementations of hsa without additional work. This will not build unless the ROCM_DIR variable is passed so won't break other builds. That variable is used to locate two amdgpu specific libraries that ship as part of rocm: libhsakmt at https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface libhsa-runtime64 at https://github.com/RadeonOpenCompute/ROCR-Runtime These libraries build from source. The build scripts in those repos are for shared libraries, but can be adapted to statically link both into this plugin. There are caveats. - This works well enough to run various tests and benchmarks, and will be used to support the current clang bring up - It is adequately thread safe for the above but there will be races remaining - It is not stylistically correct for llvm, though has had clang-format run - It has suboptimal memory management and locking strategies - The debug printing / error handling is inconsistent I would like to contribute this pretty much as-is and then improve it in-tree. This would be advantagous because the aomp12 branch that was in use for fixing this codebase has just been joined with the amd internal rocm dev process. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D85742 cherry-pick: d0b312955f12beba5b03c8a524a8456cb4174bd7 llvm/llvm-project@d0b3129

…nel) Instead of calling `cuFuncGetAttribute` with `CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK` for every kernel invocation, we can do it for the first one and cache the result as part of the `KernelInfo` struct. The only functional change is that we now expect `cuFuncGetAttribute` to succeed and otherwise propagate the error. Ignoring any error seems like a slippery slope... Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D86038 cherry-pick: aa27cfc1e7d7456325e951a4ba3ced405027f7d0 llvm/llvm-project@aa27cfc

Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D86039 cherry-pick: 5272d29e2cb7c967c3016fa285f14edc7515d9bf llvm/llvm-project@5272d29

Differential Revision: https://reviews.llvm.org/D86082 cherry-pick: 32ebdc70f3af478f4f6a9c75b4bc47a453b1b933 llvm/llvm-project@32ebdc7

cherry-pick: 6e1b11087f080b1cb9a023f9f920d29d5465633e llvm/llvm-project@6e1b110

This patch contains the following changes: 1. Renamed the function `DeviceTy::data_exchange` to `DeviceTy::dataExchange`; 2. Changed the second argument `DeviceTy DstDev` to `DeviceTy &DstDev`; 3. Renamed the last argument. Reviewed By: ye-luo Differential Revision: https://reviews.llvm.org/D86238 cherry-pick: 83c3d07994c4cd24b9548362d03290af2a9483b0 llvm/llvm-project@83c3d07

Target memory manager is introduced in this patch which aims to manage target memory such that they will not be freed immediately when they are not used because the overhead of memory allocation and free is very large. For CUDA device, cuMemFree even blocks the context switch on device which affects concurrent kernel execution. The memory manager can be taken as a memory pool. It divides the pool into multiple buckets according to the size such that memory allocation/free distributed to different buckets will not affect each other. In this version, we use the exact-equality policy to find a free buffer. This is an open question: will best-fit work better here? IMO, best-fit is not good for target memory management because computation on GPU usually requires GBs of data. Best-fit might lead to a serious waste. For example, there is a free buffer of size 1960MB, and now we need a buffer of size 1200MB. If best-fit, the free buffer will be returned, leading to a 760MB waste. The allocation will happen when there is no free memory left, and the memory free on device will take place in the following two cases: 1. The program ends. Obviously. However, there is a little problem that plugin library is destroyed before the memory manager is destroyed, leading to a fact that the call to target plugin will not succeed. 2. Device is out of memory when we request a new memory. The manager will walk through all free buffers from the bucket with largest base size, pick up one buffer, free it, and try to allocate immediately. If it succeeds, it will return right away rather than freeing all buffers in free list. Update: A threshold (8KB by default) is set such that users could control what size of memory will be managed by the manager. It can also be configured by an environment variable `LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD`. Reviewed By: jdoerfert, ye-luo, JonChesterfield Differential Revision: https://reviews.llvm.org/D81054 cherry-pick: 0289696751e9a959b2413ca26624fc6c91be1eea llvm/llvm-project@0289696

The issue mentioned has been fixed in D84996 cherry-pick: f93b42a629ae5865bf71f8062553125b8684deaf llvm/llvm-project@f93b42a

Patch by mati865@gmail.com Differential Revision: https://reviews.llvm.org/D86448 cherry-pick: d0f4f5a182d7ea91150ae090563bc0095e8ca1b3 llvm/llvm-project@d0f4f5a

Instead of copying and pasting the same `#ifdef` expressions in multiple places, define a type and a pair of macros in `kmp_os.h`, to handle whether `va_list` is pointer-like or not: * `kmp_va_list` is the type to use for `__kmp_fork_call()` * `kmp_va_deref()` dereferences a `va_list`, if necessary * `kmp_va_addr_of()` takes the address of a `va_list`, if necessary Also add FreeBSD to the list of OSes that has a non pointer-like va_list. This can now be easily extended to other OSes too. Reviewed By: AndreyChurbanov Differential Revision: https://reviews.llvm.org/D86397 cherry-pick: cde8f4c164a27670ebe60a1969d486393336d778 llvm/llvm-project@cde8f4c

after cde8f4c164a2. Sort system includes, while here. cherry-pick: 47b0262d3f82a5574c7539afeb76cc1687417ca5 llvm/llvm-project@47b0262

…ransfer In this patch, we pack all small first-private arguments, allocate and transfer them all at once to reduce the number of data transfer which is very expensive. Let's take the test case as example. ``` int main() { int data1[3] = {1}, data2[3] = {2}, data3[3] = {3}; int sum[16] = {0}; #pragma omp target teams distribute parallel for map(tofrom: sum) firstprivate(data1, data2, data3) for (int i = 0; i < 16; ++i) { for (int j = 0; j < 3; ++j) { sum[i] += data1[j]; sum[i] += data2[j]; sum[i] += data3[j]; } } } ``` Here `data1`, `data2`, and `data3` are three first-private arguments of the target region. In the previous `libomptarget`, it called data allocation and data transfer three times, each of which allocated and transferred 12 bytes. With this patch, it only calls allocation and transfer once. The size is `(12+4)*3=48` where 12 is the size of each array and 4 is the padding to keep the address aligned with 8. It is implemented in this way: 1. First collect all information for those *first*-private arguments. _private_ arguments are not the case because private arguments don't need to be mapped to target device. It just needs a data allocation. With the patch for memory manager, the data allocation could be very cheap, especially for the small size. For each qualified argument, push a place holder pointer `nullptr` to the `vector` for kernel arguments, and we will update them later. 2. After we have all information, create a buffer that can accommodate all arguments plus their paddings. Copy the arguments to the buffer at the right place, i.e. aligned address. 3. Allocate a target memory with the same size as the host buffer, transfer the host buffer to target device, and finally update all place holder pointers in the arguments `vector`. The reason we only consider small arguments is, the data transfer is asynchronous. Therefore, for the large argument, we could continue to do things on the host side meanwhile, hopefully, the data is also being transferred. The "small" is defined by that the argument size is less than a predefined value. Currently it is 1024. I'm not sure whether it is a good one, and that is an open question. Another question is, do we need to make it configurable via an environment variable? Reviewed By: ye-luo Differential Revision: https://reviews.llvm.org/D86307 cherry-pick: 0775c1dfbce69d1d13414995de2e77acc942b7eb llvm/llvm-project@0775c1d

Patch by nihui (Ni Hui) Differential Revision: https://reviews.llvm.org/D76755 cherry-pick: 09af378f49dca98bc931ba0ff2c1cde307fe7c2c llvm/llvm-project@09af378

Patch by mati865@gmail.com Differential Revision: https://reviews.llvm.org/D86552 cherry-pick: 1596ea80fdf3410f94ef9a2548701d26cc81c2f5 llvm/llvm-project@1596ea8

…ibrary cherry-pick: 28fbf422f248fc74681a53208aa2f543a67515ac llvm/llvm-project@28fbf42

cherry-pick: 5d989fb37d7cfb4f7766a45d4efc82b5add3811f llvm/llvm-project@5d989fb

Summary: This patch changes the libomptarget runtime to always emit debug messages that occur before offloading failure. The goal is to provide users with information about why their application failed in the target region rather than a single failure message. This is only done in regions that precede offloading failure so this should not impact runtime performance. if the debug environment variable is set then the message is forwarded to the debug output as usual. A new environment variable was added for future use but does nothing in this current patch. LIBOMPTARGET_INFO will be used to report runtime information to the user if requrested, such as grid size, SPMD usage, or data mapping. It will take an integer indicating the level of information verbosity and a value of 0 will disable it. Reviewers: jdoerfort Subscribers: guansong sstefan1 yaxunl ye-luo Tags: #OpenMP Differential Revision: https://reviews.llvm.org/D86483 cherry-pick: 7a5a74ea9675008589593e0f811c9b60fc962d0b llvm/llvm-project@7a5a74e

The test command in `private_mapping.c` was set to expect failure by mistake. It is fixed in this patch. Reviewed By: ABataev Differential Revision: https://reviews.llvm.org/D86758 cherry-pick: 46e0ced762ce2c32bc846b5c0129c3b5020ca5d9 llvm/llvm-project@46e0ced

PrivateArgumentManager shall immediately allocate firstprivates if they are bases for the next parameters and the next paramaters rely on the fact that the base musst be allocated already. Differential Revision: https://reviews.llvm.org/D86781 cherry-pick: 6aa7228a629d81af78d4f701b7defb701f4b9283 llvm/llvm-project@6aa7228

shintaro-iwasaki · 2020-10-28T03:07:14Z

test:bolt/all

The latest Clang causes a compilation error if a program contains a certain complicated untied OpenMP task. This patch marks such tests as unsupported.

shintaro-iwasaki · 2020-10-28T05:58:23Z

The test results show that the coverage of BOLT is the same as that of the official LLVM OpenMP (at the point of f2400f024d323bc9000a4c126f2008a8b58fb4a0): https://jenkins-pmrs.cels.anl.gov/job/bolt-llvmproj-review-centos/5/.

I will merge this PR.

dwblaikie and others added 30 commits October 7, 2020 13:30

OpenMP: Fix typo variabls -> variables

069b231

cherry-pick: 0c938a8dd80ad707ec1f20d936cc5c9d73df8de5 llvm/llvm-project@0c938a8

[OpenMP] Don't use MSVC workaround with MinGW

6af197b

Patch by mati865@gmail.com Differential Revision: https://reviews.llvm.org/D85210 cherry-pick: 4a04bc8995639e1d333790518e4d42e0961f740e llvm/llvm-project@4a04bc8

[OpenMP] support build on msys2/mingw with clang or gcc

3359e30

RTM Adaptive Locks are supported on msys2/mingw for clang and gcc. Differential Revision: https://reviews.llvm.org/D81776 cherry-pick: bf2aa74e51997ee190f3b34dd26a1b564e59e267 llvm/llvm-project@bf2aa74

[OpenMP] Fix ref count dec for implicit map of partial data

b4ee476

D85342 broke this case. The new test case presents an example. Reviewed By: grokos Differential Revision: https://reviews.llvm.org/D85369 cherry-pick: 518a27e5591c211ceeef3091edc59012e6ace2b2 llvm/llvm-project@518a27e

[OpenMP][CUDA] Keep one kernel list per device, not globally.

d5a97c0

Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D86039 cherry-pick: 5272d29e2cb7c967c3016fa285f14edc7515d9bf llvm/llvm-project@5272d29

[libomptarget][NFC] Sort list of plugins in chronological order

ac3b35e

Differential Revision: https://reviews.llvm.org/D86082 cherry-pick: 32ebdc70f3af478f4f6a9c75b4bc47a453b1b933 llvm/llvm-project@32ebdc7

[libomptarget][amdgpu] Support building with static rocm libraries

6bca18a

cherry-pick: 6e1b11087f080b1cb9a023f9f920d29d5465633e llvm/llvm-project@6e1b110

[NFC][OpenMP] Remove outdated comments about potential issues

70c632c

The issue mentioned has been fixed in D84996 cherry-pick: f93b42a629ae5865bf71f8062553125b8684deaf llvm/llvm-project@f93b42a

[OpenMP] Check if _MSC_VER is defined before using it

de15e89

Patch by mati865@gmail.com Differential Revision: https://reviews.llvm.org/D86448 cherry-pick: d0f4f5a182d7ea91150ae090563bc0095e8ca1b3 llvm/llvm-project@d0f4f5a

Add <stdarg.h> include to kmp_os.h, to get the va_list type, required

80a079c

after cde8f4c164a2. Sort system includes, while here. cherry-pick: 47b0262d3f82a5574c7539afeb76cc1687417ca5 llvm/llvm-project@47b0262

[OpenMP] Fix build on macOS sdk 10.12 and newer

c3c8d16

Patch by nihui (Ni Hui) Differential Revision: https://reviews.llvm.org/D76755 cherry-pick: 09af378f49dca98bc931ba0ff2c1cde307fe7c2c llvm/llvm-project@09af378

[OpenMP] Fix import library installation with MinGW

7104dc4

Patch by mati865@gmail.com Differential Revision: https://reviews.llvm.org/D86552 cherry-pick: 1596ea80fdf3410f94ef9a2548701d26cc81c2f5 llvm/llvm-project@1596ea8

[libomptarget][amdgpu] Update plugin CMake to work with latest rocr l…

37e6751

…ibrary cherry-pick: 28fbf422f248fc74681a53208aa2f543a67515ac llvm/llvm-project@28fbf42

[libomptarget][amdgpu] Improve thread safety, remove dead code

f7460f7

cherry-pick: 5d989fb37d7cfb4f7766a45d4efc82b5add3811f llvm/llvm-project@5d989fb

shintaro-iwasaki added the upstream-merge label Oct 28, 2020

shintaro-iwasaki added 2 commits October 28, 2020 00:05

Merge commit 'c3d5df7a7e33ed45011489f1c6aaf47fe9216274' into bolt-c3d5df

2c8c091

[BOLT/test] mark some "untied" tests as unsupported with clang

e44bb85

The latest Clang causes a compilation error if a program contains a certain complicated untied OpenMP task. This patch marks such tests as unsupported.

shintaro-iwasaki force-pushed the bolt-c3d5df branch from 36a34ea to e44bb85 Compare October 28, 2020 05:46

shintaro-iwasaki merged commit c365888 into pmodels:main Oct 28, 2020

shintaro-iwasaki deleted the bolt-c3d5df branch October 30, 2020 15:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update BOLT #85

Update BOLT #85

shintaro-iwasaki commented Oct 28, 2020

shintaro-iwasaki commented Oct 28, 2020

shintaro-iwasaki commented Oct 28, 2020

Update BOLT #85

Update BOLT #85

Conversation

shintaro-iwasaki commented Oct 28, 2020

shintaro-iwasaki commented Oct 28, 2020

shintaro-iwasaki commented Oct 28, 2020