Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GPU architectures to the build-info file #1840

Merged
merged 4 commits into from
Mar 7, 2024

Conversation

parthosa
Copy link
Contributor

@parthosa parthosa commented Mar 6, 2024

Related Issue: NVIDIA/spark-rapids#10430

Based on the discussion thread here, a new property named gpu_architectures has been added in the version-info.properties file. This property stores a semicolon-separated list of GPU architectures supported by the cuDF and JNI plugin.

The plugin will parse this property and verify if the user is running Spark RAPIDS job on a supported architecture.

Output

File: spark-rapids-jni-version-info.properties

version=24.04.0-SNAPSHOT
user=
revision=4daf3f8c8bcd5ebaeff12f77ee99ba45e8e0ed46
branch=spark-rapids-jni-10430
date=2024-03-06T22:14:26Z
url=https://github.com/NVIDIA/spark-rapids-jni.git
gpu_architectures=70;75;80;86;90;

File: cudf-java-version-info.properties

version=24.04.0-SNAPSHOT
user=
revision=8d073e4ca0a6cb9d9a4d9fe5e4e0147f01d7eb36
branch=HEAD
date=2024-03-06T22:14:23Z
url=https://github.com/rapidsai/cudf.git
gpu_architectures=70;75;80;86;90;

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
build/build-info Outdated
@@ -32,6 +32,7 @@ echo_build_properties() {
echo branch=$(cd "$git_path" && git rev-parse --abbrev-ref HEAD)
echo date=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo url=$(cd "$git_path" && git config --get remote.origin.url)
echo gpu_architectures=$(cd "$git_path" && find . -name libcudfjni.a -o -name libcudf.a | xargs -n 1 bash -c 'cuobjdump $1 || exit 0' _ | grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u)
Copy link
Contributor Author

@parthosa parthosa Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explanation:
This command searches for files libcudfjni.a or libcudf.a in the current directory, then runs cuobjdump on each file to extract architecture information and sorts the unique results numerically.

Result Format:
70 75 80 86 90

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this looking at both libcudfjni.a and libcudf.a? It should only be looking only at the libcudf artifact for the cudf properties, IMO, as that's where almost all the kernels are. If libcudfjni.a has an arch libcudf.a does not, we do not want to advertise it in the properties.

I don't see where this is handling the spark-rapids-jni kernels. Those are not in either libcudfjni.a or libcudf.a but rather in libcudf.so. The way this is written, cudf and spark-rapids-jni properties will always have the same property values. I'm OK with that, but it begs the question why the plugin would be checking both of them if they're known to be identical.

This should not be using find, it should be examining the known location where the artifact is placed before it is put into the jar artifact. find invites accidental errors where someone has a rogue binary somewhere in their repo (e.g.: temporary directory, whatever) and it accidentally adds an arch that is not in the final jar artifact.

Copy link
Collaborator

@gerashegalov gerashegalov Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jlowe find is from my POC NVIDIA/spark-rapids#10540 (comment)

We probably don't need to go this far, but I was thinking of an idea to record the arch sets for each lib going into a build (including the nvcomp ones), and then compute the set intersection for the final device support check, can be done at build or run time.

Copy link
Contributor Author

@parthosa parthosa Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be using find, it should be examining the known location where the artifact is placed before it is

For cudf supported architectures, we should check ./thirdparty/cudf/cpp/build/libcudf.a.
For spark-rapids-jni, what is the difference between ./target/cmake-build/libcudf.so and ./target/classes/amd64/Linux/libcudf.so

We probably don't need to go this far, but I was thinking of an idea to record the arch sets for each lib going

We should probably use set intersection of each lib. Can we ever have a case when spark-rapids-jni has different architectures supported from cudf?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For spark-rapids-jni, what is the difference between ./target/cmake-build/libcudf.so and ./target/classes/amd64/Linux/libcudf.so

They are the same, you can pick either path. The former is where the library is orginally built, the latter is where it's copied to pull it into the resulting jar.

Can we ever have a case when spark-rapids-jni has different architectures supported from cudf?

Theoretically yes but not easily the way it builds today. IMO we could easily overengineer this thing to solve a problem that doesn't need to be solved in practice. We could simply probe for the architectures on libcudf.so and call it a day. Is it 100% bulletproof? No, theoretically someone could do a very crazy thing to get that to not be correct because they went out of their way to somehow build libcudf.a, libcudfjni.a, and libcudf.so differently. But I've never seen this in practice and arguably YAGNI applies. That avoids the "where are all the places to look" problem and the need to do set intersections.

Copy link
Collaborator

@gerashegalov gerashegalov Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we ever have a case when spark-rapids-jni has different architectures supported from cudf?

inconsistent build environment after upmerges, branch switches, other mistakes during custom dev build, bugs in a prod build plus things like nvcomp kernels built externally

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed script delimiter to semi-colon and explicitly passed 'libcudf.a' and 'libcudf.so' as library files for cudf and spark-rapids-jni for processing by the build info script

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My perception from internal chats is that devs routinely runs into build situations that would be YAGNI, but we do not need to optimize for this 😄

build/build-info Show resolved Hide resolved
build/build-info Outdated
@@ -32,6 +32,7 @@ echo_build_properties() {
echo branch=$(cd "$git_path" && git rev-parse --abbrev-ref HEAD)
echo date=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo url=$(cd "$git_path" && git config --get remote.origin.url)
echo gpu_architectures=$(cd "$git_path" && find . -name libcudfjni.a -o -name libcudf.a | xargs -n 1 bash -c 'cuobjdump $1 || exit 0' _ | grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this looking at both libcudfjni.a and libcudf.a? It should only be looking only at the libcudf artifact for the cudf properties, IMO, as that's where almost all the kernels are. If libcudfjni.a has an arch libcudf.a does not, we do not want to advertise it in the properties.

I don't see where this is handling the spark-rapids-jni kernels. Those are not in either libcudfjni.a or libcudf.a but rather in libcudf.so. The way this is written, cudf and spark-rapids-jni properties will always have the same property values. I'm OK with that, but it begs the question why the plugin would be checking both of them if they're known to be identical.

This should not be using find, it should be examining the known location where the artifact is placed before it is put into the jar artifact. find invites accidental errors where someone has a rogue binary somewhere in their repo (e.g.: temporary directory, whatever) and it accidentally adds an arch that is not in the final jar artifact.

build/build-info Outdated
@@ -32,6 +32,7 @@ echo_build_properties() {
echo branch=$(cd "$git_path" && git rev-parse --abbrev-ref HEAD)
echo date=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo url=$(cd "$git_path" && git config --get remote.origin.url)
echo gpu_architectures=$(cd "$git_path" && find . -name libcudfjni.a -o -name libcudf.a | xargs -n 1 bash -c 'cuobjdump $1 || exit 0' _ | grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not in favor of the space-separated arch list. Java properties file format allows for the space being the separator between the key and the value portion of the line

jshell> var props = new java.util.Properties()
props ==> {}

jshell> props.load(new java.io.StringReader("gpu_architectures 70 75 80 86 90"))

jshell> props
props ==> {gpu_architectures=70 75 80 86 90}

I prefer a semicolon-separated value format without any white space in the value 70;75;80;86;90 similar to the first example https://cmake.org/cmake/help/latest/prop_tgt/CUDA_ARCHITECTURES.html#prop_tgt:CUDA_ARCHITECTURES and this is what we typically use for -DGPU_ARCHS passed to Maven

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. This would keep it consistent with current usage.

build/build-info Outdated
@@ -32,6 +32,7 @@ echo_build_properties() {
echo branch=$(cd "$git_path" && git rev-parse --abbrev-ref HEAD)
echo date=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo url=$(cd "$git_path" && git config --get remote.origin.url)
echo gpu_architectures=$(cd "$git_path" && find . -name libcudfjni.a -o -name libcudf.a | xargs -n 1 bash -c 'cuobjdump $1 || exit 0' _ | grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u)
Copy link
Collaborator

@gerashegalov gerashegalov Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jlowe find is from my POC NVIDIA/spark-rapids#10540 (comment)

We probably don't need to go this far, but I was thinking of an idea to record the arch sets for each lib going into a build (including the nvcomp ones), and then compute the set intersection for the final device support check, can be done at build or run time.

… build info script

Signed-off-by: Partho Sarthi <psarth@nvidia.com>
gerashegalov
gerashegalov previously approved these changes Mar 6, 2024
Copy link
Collaborator

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

build/build-info Outdated
echo version=$version
echo user=$USER
echo revision=$(cd "$git_path" && git rev-parse HEAD)
echo branch=$(cd "$git_path" && git rev-parse --abbrev-ref HEAD)
echo date=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo url=$(cd "$git_path" && git config --get remote.origin.url)
echo gpu_architectures=$(cd "$git_path" && cuobjdump "$library_path" 2>/dev/null | grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u | tr '\n' ';')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: not a git op and no need to cd?

Suggested change
echo gpu_architectures=$(cd "$git_path" && cuobjdump "$library_path" 2>/dev/null | grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u | tr '\n' ';')
echo gpu_architectures=$(cuobjdump "$library_path" 2>/dev/null | grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u | tr '\n' ';')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It kept it as safety but the build paths are absolute. Refactored as suggested.

Signed-off-by: Partho Sarthi <psarth@nvidia.com>
gerashegalov
gerashegalov previously approved these changes Mar 7, 2024
Copy link
Collaborator

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gerashegalov
Copy link
Collaborator

build

mythrocks
mythrocks previously approved these changes Mar 7, 2024
jlowe
jlowe previously approved these changes Mar 7, 2024
Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit, not mustfix.

build/build-info Outdated
echo version=$version
echo user=$USER
echo revision=$(cd "$git_path" && git rev-parse HEAD)
echo branch=$(cd "$git_path" && git rev-parse --abbrev-ref HEAD)
echo date=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo url=$(cd "$git_path" && git config --get remote.origin.url)
echo gpu_architectures=$(cuobjdump "$libcudf_path" 2>/dev/null | grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u | tr '\n' ';')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: awk can be leveraged to do the grep directly.

Suggested change
echo gpu_architectures=$(cuobjdump "$libcudf_path" 2>/dev/null | grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u | tr '\n' ';')
echo gpu_architectures=$(cuobjdump "$libcudf_path" 2>/dev/null | awk -F_ '/arch =/ {print $2}' | sort -n -u | tr '\n' ';')

@parthosa parthosa dismissed stale reviews from jlowe, mythrocks, and gerashegalov via 48eb19a March 7, 2024 16:03
@parthosa parthosa requested a review from jlowe March 7, 2024 16:20
@jlowe
Copy link
Member

jlowe commented Mar 7, 2024

build

Copy link
Collaborator

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gerashegalov gerashegalov merged commit 8f8aeed into NVIDIA:branch-24.04 Mar 7, 2024
3 checks passed
@parthosa parthosa deleted the spark-rapids-jni-10430 branch March 7, 2024 18:59
gerashegalov pushed a commit to NVIDIA/spark-rapids that referenced this pull request Mar 15, 2024
Fixes #10430. This PR ensures that Spark RAPIDS jobs are executed on supported GPU architectures without relying on manual configuration. 

### Changes:
1. Processes `gpu_architectures` property from the `*version-info.properties` file generated by the native builds. 
2. Verifies if the user is running the job on an architecture supported by the cuDF and JNI libraries and throws an exception if the architecture is unsupported.

### Testing
Tested on a Dataproc VM running on Nvidia P4 (GPU Architecture 6.1)

```
24/03/06 17:44:58 WARN RapidsPluginUtils: spark.rapids.sql.explain is set to `NOT_ON_GPU`. Set it to 'NONE' to suppress the diagnostics logging about the query placement on the GPU.
24/03/06 17:45:10 ERROR RapidsExecutorPlugin: Exception in the executor plugin, shutting down!
java.lang.RuntimeException: Device architecture 61 is unsupported. Minimum supported architecture: 75.
        at com.nvidia.spark.rapids.RapidsPluginUtils$.checkGpuArchitectureInternal(Plugin.scala:366)
        at com.nvidia.spark.rapids.RapidsPluginUtils$.checkGpuArchitecture(Plugin.scala:375)
        at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(Plugin.scala:461)
```


### Related PR
* NVIDIA/spark-rapids-jni#1840
* Add conf for minimum supported CUDA and error handling

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

* Revert "Add conf for minimum supported CUDA and error handling"

This reverts commit 7b8eaea.

* Verify the GPU architecture is supported by the plugin libraries

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

* Use semi-colon as delimiter and use intersection of supported gpu architectures

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

* Allow for compatibility with major architectures

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

* Check for version as integers

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

* Modify compatibility check for same major version and same or higher minor version

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

* Add a config to skip verification and refactor checking

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

* Update RapidsConf.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update verification logic

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

* Update warning message

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

* Add unit tests and update warning message.

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

* Update exception class

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

* Address review comments

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

---------

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
Co-authored-by: Jason Lowe <jlowe@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants