Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update BOLT #85

Merged
merged 32 commits into from
Oct 28, 2020
Merged

Update BOLT #85

merged 32 commits into from
Oct 28, 2020

Conversation

shintaro-iwasaki
Copy link
Collaborator

To 6aa7228a629d81af78d4f701b7defb701f4b9283

dwblaikie and others added 30 commits October 7, 2020 13:30
cherry-pick: 0c938a8dd80ad707ec1f20d936cc5c9d73df8de5
llvm/llvm-project@0c938a8
Patch by mati865@gmail.com

Differential Revision: https://reviews.llvm.org/D85210

cherry-pick: 4a04bc8995639e1d333790518e4d42e0961f740e
llvm/llvm-project@4a04bc8
RTM Adaptive Locks are supported on msys2/mingw for clang and gcc.

Differential Revision: https://reviews.llvm.org/D81776

cherry-pick: bf2aa74e51997ee190f3b34dd26a1b564e59e267
llvm/llvm-project@bf2aa74
Without this patch, the following example fails but shouldn't
according to OpenMP TR8:

```
 #pragma omp target enter data map(alloc:i)
 #pragma omp target data map(present, alloc: i)
 {
   #pragma omp target exit data map(delete:i)
 } // fails presence check here
```

OpenMP TR8 sec. 2.22.7.1 "map Clause", p. 321, L23-26 states:

> If the map clause appears on a target, target data, target enter
> data or target exit data construct with a present map-type-modifier
> then on entry to the region if the corresponding list item does not
> appear in the device data environment an error occurs and the
> program terminates.

There is no corresponding statement about the exit from a region.
Thus, the `present` modifier should:

1. Check for presence upon entry into any region, including a `target
   exit data` region.  This behavior is already implemented correctly.

2. Should not check for presence upon exit from any region, including
   a `target` or `target data` region.  Without this patch, this
   behavior is not implemented correctly, breaking the above example.

In the case of `target data`, this patch fixes the latter behavior by
removing the `present` modifier from the map types Clang generates for
the runtime call at the end of the region.

In the case of `target`, we have not found a valid OpenMP program for
which such a fix would matter.  It appears that, if a program can
guarantee that data is present at the beginning of a `target` region
so that there's no error there, that data is also guaranteed to be
present at the end.  This patch adds a comment to the runtime to
document this case.

Reviewed By: grokos, RaviNarayanaswamy, ABataev

Differential Revision: https://reviews.llvm.org/D84422

cherry-pick: 002d61db2b7790dc884953bf9271878bf0af3a8e
llvm/llvm-project@002d61d
OpenMP TR8 sec. 2.15.6 "target update Construct", p. 183, L3-4 states:

> If the corresponding list item is not present in the device data
> environment and there is no present modifier in the clause, then no
> assignment occurs to or from the original list item.

L10-11 states:

> If a present modifier appears in the clause and the corresponding
> list item is not present in the device data environment then an
> error occurs and the program termintates.

(OpenMP 5.0 also has the first passage but without mention of the
present modifier of course.)

In both passages, I assume "is not present" includes the case of
partially but not entirely present.  However, without this patch, the
target update directive misbehaves in this case both with and without
the present modifier.  For example:

```
 #pragma omp target enter data map(to:arr[0:3])
 #pragma omp target update to(arr[0:5]) // might fail on data transfer
 #pragma omp target update to(present:arr[0:5]) // might fail on data transfer
```

The problem is that `DeviceTy::getTgtPtrBegin` does not return a null
pointer in that case, so `target_data_update` sees the data as fully
present, and the data transfer then might fail depending on the target
device.  However, without the present modifier, there should never be
a failure.  Moreover, with the present modifier, there should always
be a failure, and the diagnostic should mention the present modifier.

This patch fixes `DeviceTy::getTgtPtrBegin` to return null when
`target_data_update` is the caller.  I'm wondering if it should do the
same for more callers.

Reviewed By: grokos, jdoerfert

Differential Revision: https://reviews.llvm.org/D85246

cherry-pick: 5ab43989c353a2378910d20c7b88e44ea92b3aee
llvm/llvm-project@5ab4398
targetDataMapper function fills arrays with the mapping data in the
direct order. When this function is called by targetDataBegin or
tgt_target_update functions, it works as expected. But targetDataEnd
function processes mapped data in reverse order. In this case, the base
pointer might be deleted before the associated data is deleted. Need to
reverse data, mapped by mapper, too, since it always adds data that must
be deleted at the end of the buffer.
Fixes the test declare_mapper_target_update.cpp.
Also, reduces the memry fragmentation by preallocation the memory
buffers.

Differential Revision: https://reviews.llvm.org/D85216

cherry-pick: 6780d5675b7cd75279d8fc13ee1a1cc272087613
llvm/llvm-project@6780d56
The standard way of printing `int64_t` data is via the PRId64 macro, `ld`
is for `long int` and int64_t is not guaranteed to be typedef'ed as `long int`
on all platforms. E.g. on Windows we get mismatch warnings.

Differential Revision: https://reviews.llvm.org/D85353

cherry-pick: 40470eb27a5c97b01e89d8825626487b0682abec
llvm/llvm-project@40470eb
For example, without this patch, the following fails as expected with
or without the `present` modifier, but the `present` modifier doesn't
produce its usual diagnostic:

```
 #pragma omp target data map(alloc: arr[0:2])
 {
   #pragma omp target map(present, tofrom: arr[0:100]) // not fully present
   ;
 }
```

Reviewed By: grokos, vzakhari

Differential Revision: https://reviews.llvm.org/D85320

cherry-pick: 41b1aefecb9447620dd182b0352abed0df05665c
llvm/llvm-project@41b1aef
For example:

```
 #pragma omp target data map(tofrom:arr[0:100])
 {
   #pragma omp target exit data map(delete:arr[0:100])
   #pragma omp target enter data map(alloc:arr[98:2])
 }
```

Without this patch, the transfer at the end of the target data region
is broken and fails depending on the target device.  According to my
read of the spec, the transfer shouldn't even be attempted because
`arr[0:100]` isn't (fully) present there.  To fix that, this patch
makes `DeviceTy::getTgtPtrBegin` return null for this case.

Reviewed By: grokos

Differential Revision: https://reviews.llvm.org/D85342

cherry-pick: 8c8bb128dfd09f84b27b9e732cf1355582b1d019
llvm/llvm-project@8c8bb12
D85342 broke this case.  The new test case presents an example.

Reviewed By: grokos

Differential Revision: https://reviews.llvm.org/D85369

cherry-pick: 518a27e5591c211ceeef3091edc59012e6ace2b2
llvm/llvm-project@518a27e
Starting with 787eb0c637b I got spurious segmentation faults for some testcases. I could nail it down to `brel` trying to release the "memory" of the node allocated on the stack of __kmpc_omp_wait_deps. With this patch, you will see the assertion triggering for some of the tests in the test suite.

My proposed solution for the issue is to just patch __kmpc_omp_wait_deps:
```
  __kmp_init_node(&node);
-  node.dn.on_stack = 1;
+  // the stack owns the node
+  __kmp_node_ref(&node);
```

What do you think?

Reviewed By: AndreyChurbanov

Differential Revision: https://reviews.llvm.org/D84472

cherry-pick: 66a3575c2895f3b06056908bb40699f16e4b92d7
llvm/llvm-project@66a3575
[libomptarget] Implement host plugin for amdgpu

Replacement for D71384. Primary difference is inlining the dependency on atmi
followed by extensive simplification and bugfixes. This is the latest version
from https://github.com/ROCm-Developer-Tools/amd-llvm-project/tree/aomp12 with
minor patches and a rename from hsa to amdgpu, on the basis that this can't be
used by other implementations of hsa without additional work.

This will not build unless the ROCM_DIR variable is passed so won't break other
builds. That variable is used to locate two amdgpu specific libraries that ship
as part of rocm:
libhsakmt at https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface
libhsa-runtime64 at https://github.com/RadeonOpenCompute/ROCR-Runtime
These libraries build from source. The build scripts in those repos are for
shared libraries, but can be adapted to statically link both into this plugin.

There are caveats.
- This works well enough to run various tests and benchmarks, and will be used
  to support the current clang bring up
- It is adequately thread safe for the above but there will be races remaining
- It is not stylistically correct for llvm, though has had clang-format run
- It has suboptimal memory management and locking strategies
- The debug printing / error handling is inconsistent

I would like to contribute this pretty much as-is and then improve it in-tree.
This would be advantagous because the aomp12 branch that was in use for fixing
this codebase has just been joined with the amd internal rocm dev process.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D85742

cherry-pick: d0b312955f12beba5b03c8a524a8456cb4174bd7
llvm/llvm-project@d0b3129
…nel)

Instead of calling `cuFuncGetAttribute` with
`CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK` for every kernel invocation,
we can do it for the first one and cache the result as part of the
`KernelInfo` struct. The only functional change is that we now expect
`cuFuncGetAttribute` to succeed and otherwise propagate the error.
Ignoring any error seems like a slippery slope...

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D86038

cherry-pick: aa27cfc1e7d7456325e951a4ba3ced405027f7d0
llvm/llvm-project@aa27cfc
Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D86039

cherry-pick: 5272d29e2cb7c967c3016fa285f14edc7515d9bf
llvm/llvm-project@5272d29
This patch contains the following changes:
1. Renamed the function `DeviceTy::data_exchange` to `DeviceTy::dataExchange`;
2. Changed the second argument `DeviceTy DstDev` to `DeviceTy &DstDev`;
3. Renamed the last argument.

Reviewed By: ye-luo

Differential Revision: https://reviews.llvm.org/D86238

cherry-pick: 83c3d07994c4cd24b9548362d03290af2a9483b0
llvm/llvm-project@83c3d07
Target memory manager is introduced in this patch which aims to manage target
memory such that they will not be freed immediately when they are not used
because the overhead of memory allocation and free is very large. For CUDA
device, cuMemFree even blocks the context switch on device which affects
concurrent kernel execution.

The memory manager can be taken as a memory pool. It divides the pool into
multiple buckets according to the size such that memory allocation/free
distributed to different buckets will not affect each other.

In this version, we use the exact-equality policy to find a free buffer. This
is an open question: will best-fit work better here? IMO, best-fit is not good
for target memory management because computation on GPU usually requires GBs of
data. Best-fit might lead to a serious waste. For example, there is a free
buffer of size 1960MB, and now we need a buffer of size 1200MB. If best-fit,
the free buffer will be returned, leading to a 760MB waste.

The allocation will happen when there is no free memory left, and the memory
free on device will take place in the following two cases:
1. The program ends. Obviously. However, there is a little problem that plugin
library is destroyed before the memory manager is destroyed, leading to a fact
that the call to target plugin will not succeed.
2. Device is out of memory when we request a new memory. The manager will walk
through all free buffers from the bucket with largest base size, pick up one
buffer, free it, and try to allocate immediately. If it succeeds, it will
return right away rather than freeing all buffers in free list.

Update:
A threshold (8KB by default) is set such that users could control what size of memory
will be managed by the manager. It can also be configured by an environment variable
`LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD`.

Reviewed By: jdoerfert, ye-luo, JonChesterfield

Differential Revision: https://reviews.llvm.org/D81054

cherry-pick: 0289696751e9a959b2413ca26624fc6c91be1eea
llvm/llvm-project@0289696
The issue mentioned has been fixed in D84996

cherry-pick: f93b42a629ae5865bf71f8062553125b8684deaf
llvm/llvm-project@f93b42a
Patch by mati865@gmail.com

Differential Revision: https://reviews.llvm.org/D86448

cherry-pick: d0f4f5a182d7ea91150ae090563bc0095e8ca1b3
llvm/llvm-project@d0f4f5a
Instead of copying and pasting the same `#ifdef` expressions in multiple
places, define a type and a pair of macros in `kmp_os.h`, to handle
whether `va_list` is pointer-like or not:

* `kmp_va_list` is the type to use for `__kmp_fork_call()`
* `kmp_va_deref()` dereferences a `va_list`, if necessary
* `kmp_va_addr_of()` takes the address of a `va_list`, if necessary

Also add FreeBSD to the list of OSes that has a non pointer-like
va_list. This can now be easily extended to other OSes too.

Reviewed By: AndreyChurbanov

Differential Revision: https://reviews.llvm.org/D86397

cherry-pick: cde8f4c164a27670ebe60a1969d486393336d778
llvm/llvm-project@cde8f4c
after cde8f4c164a2. Sort system includes, while here.

cherry-pick: 47b0262d3f82a5574c7539afeb76cc1687417ca5
llvm/llvm-project@47b0262
…ransfer

In this patch, we pack all small first-private arguments, allocate and transfer them all at once to reduce the number of data transfer which is very expensive.

Let's take the test case as example.
```
int main() {
  int data1[3] = {1}, data2[3] = {2}, data3[3] = {3};
  int sum[16] = {0};
#pragma omp target teams distribute parallel for map(tofrom: sum) firstprivate(data1, data2, data3)
  for (int i = 0; i < 16; ++i) {
    for (int j = 0; j < 3; ++j) {
      sum[i] += data1[j];
      sum[i] += data2[j];
      sum[i] += data3[j];
    }
  }
}
```
Here `data1`, `data2`, and `data3` are three first-private arguments of the target region. In the previous `libomptarget`, it called data allocation and data transfer three times, each of which allocated and transferred 12 bytes. With this patch, it only calls allocation and transfer once. The size is `(12+4)*3=48` where 12 is the size of each array and 4 is the padding to keep the address aligned with 8. It is implemented in this way:
1. First collect all information for those *first*-private arguments. _private_ arguments are not the case because private arguments don't need to be mapped to target device. It just needs a data allocation. With the patch for memory manager, the data allocation could be very cheap, especially for the small size. For each qualified argument, push a place holder pointer `nullptr` to the `vector` for kernel arguments, and we will update them later.
2. After we have all information, create a buffer that can accommodate all arguments plus their paddings. Copy the arguments to the buffer at the right place, i.e. aligned address.
3. Allocate a target memory with the same size as the host buffer, transfer the host buffer to target device, and finally update all place holder pointers in the arguments `vector`.

The reason we only consider small arguments is, the data transfer is asynchronous. Therefore, for the large argument, we could continue to do things on the host side meanwhile, hopefully, the data is also being transferred. The "small" is defined by that the argument size is less than a predefined value. Currently it is 1024. I'm not sure whether it is a good one, and that is an open question. Another question is, do we need to make it configurable via an environment variable?

Reviewed By: ye-luo

Differential Revision: https://reviews.llvm.org/D86307

cherry-pick: 0775c1dfbce69d1d13414995de2e77acc942b7eb
llvm/llvm-project@0775c1d
Patch by nihui (Ni Hui)

Differential Revision: https://reviews.llvm.org/D76755

cherry-pick: 09af378f49dca98bc931ba0ff2c1cde307fe7c2c
llvm/llvm-project@09af378
Patch by mati865@gmail.com

Differential Revision: https://reviews.llvm.org/D86552

cherry-pick: 1596ea80fdf3410f94ef9a2548701d26cc81c2f5
llvm/llvm-project@1596ea8
Summary:

This patch changes the libomptarget runtime to always emit debug messages that
occur before offloading failure. The goal is to provide users with information
about why their application failed in the target region rather than a single
failure message. This is only done in regions that precede offloading failure
so this should not impact runtime performance. if the debug environment
variable is set then the message is forwarded to the debug output as usual.

A new environment variable was added for future use but does nothing in this
current patch. LIBOMPTARGET_INFO will be used to report runtime information to
the user if requrested, such as grid size, SPMD usage, or data mapping. It will
take an integer indicating the level of information verbosity and a value of 0
will disable it.

Reviewers: jdoerfort

Subscribers: guansong sstefan1 yaxunl ye-luo

Tags: #OpenMP

Differential Revision: https://reviews.llvm.org/D86483

cherry-pick: 7a5a74ea9675008589593e0f811c9b60fc962d0b
llvm/llvm-project@7a5a74e
The test command in `private_mapping.c` was set to expect failure by mistake. It is fixed in this patch.

Reviewed By: ABataev

Differential Revision: https://reviews.llvm.org/D86758

cherry-pick: 46e0ced762ce2c32bc846b5c0129c3b5020ca5d9
llvm/llvm-project@46e0ced
PrivateArgumentManager shall immediately allocate firstprivates if they
are bases for the next parameters and the next paramaters rely on the
fact that the base musst be allocated already.

Differential Revision: https://reviews.llvm.org/D86781

cherry-pick: 6aa7228a629d81af78d4f701b7defb701f4b9283
llvm/llvm-project@6aa7228
@shintaro-iwasaki
Copy link
Collaborator Author

test:bolt/all

The latest Clang causes a compilation error if a program contains a certain
complicated untied OpenMP task.  This patch marks such tests as unsupported.
@shintaro-iwasaki
Copy link
Collaborator Author

The test results show that the coverage of BOLT is the same as that of the official LLVM OpenMP (at the point of f2400f024d323bc9000a4c126f2008a8b58fb4a0): https://jenkins-pmrs.cels.anl.gov/job/bolt-llvmproj-review-centos/5/.

I will merge this PR.

@shintaro-iwasaki shintaro-iwasaki merged commit c365888 into pmodels:main Oct 28, 2020
@shintaro-iwasaki shintaro-iwasaki deleted the bolt-c3d5df branch October 30, 2020 15:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.