Add GPU architectures to the build-info file #1840

parthosa · 2024-03-06T18:16:51Z

Related Issue: NVIDIA/spark-rapids#10430

Based on the discussion thread here, a new property named gpu_architectures has been added to the version-info.properties file. This property stores a semicolon-separated list of the GPU architectures supported by the cuDF and JNI plugin.

The plugin will parse this property and verify that the user is running the Spark RAPIDS job on a supported architecture.

Output

File: spark-rapids-jni-version-info.properties

version=24.04.0-SNAPSHOT
user=
revision=4daf3f8c8bcd5ebaeff12f77ee99ba45e8e0ed46
branch=spark-rapids-jni-10430
date=2024-03-06T22:14:26Z
url=https://github.com/NVIDIA/spark-rapids-jni.git
gpu_architectures=70;75;80;86;90;

File: cudf-java-version-info.properties

version=24.04.0-SNAPSHOT
user=
revision=8d073e4ca0a6cb9d9a4d9fe5e4e0147f01d7eb36
branch=HEAD
date=2024-03-06T22:14:23Z
url=https://github.com/rapidsai/cudf.git
gpu_architectures=70;75;80;86;90;
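
As a rough illustration of how a consumer might read the new property, here is a minimal POSIX shell sketch. It is not the plugin's code: `read_gpu_archs` is a hypothetical helper, and the temporary file only imitates the layout of the properties files shown above.

```shell
# Hypothetical helper: print one architecture per line from a properties file.
# The trailing ';' in the value yields an empty last field, which is dropped.
read_gpu_archs() {
  sed -n 's/^gpu_architectures=//p' "$1" | tr ';' '\n' | sed '/^$/d'
}

# Stand-in file that imitates the generated properties layout.
props=$(mktemp)
printf '%s\n' 'version=24.04.0-SNAPSHOT' 'gpu_architectures=70;75;80;86;90;' > "$props"

read_gpu_archs "$props"
rm -f "$props"
```

Splitting on the semicolon and discarding empty fields keeps the reader tolerant of the trailing delimiter that the build script emits.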

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

parthosa · 2024-03-06T18:49:01Z

build/build-info

@@ -32,6 +32,7 @@ echo_build_properties() {
  echo branch=$(cd "$git_path" && git rev-parse --abbrev-ref HEAD)
  echo date=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  echo url=$(cd "$git_path" && git config --get remote.origin.url)
+  echo gpu_architectures=$(cd "$git_path" && find . -name libcudfjni.a -o -name libcudf.a | xargs -n 1 bash -c 'cuobjdump $1 || exit 0' _ | grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u)


Explanation:
This command searches recursively under $git_path for files named libcudfjni.a or libcudf.a, runs cuobjdump on each one to extract its architecture information, and prints the unique architectures sorted numerically.

Result Format:
70 75 80 86 90
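
To make the result format concrete, the filter chain can be tried on canned text instead of a real library. This is a sketch under assumptions: `sample_dump`, `extract_archs`, and `build_line` are hypothetical names, and the sample text only imitates the `arch = sm_NN` lines that the pipeline greps for.

```shell
# Illustrative stand-in for `cuobjdump <library>` output. Only the
# "arch = sm_NN" lines matter here; real output carries more fields.
sample_dump() {
  printf '%s\n' \
    'Fatbin elf code:' \
    'arch = sm_80' \
    'Fatbin elf code:' \
    'arch = sm_70' \
    'Fatbin ptx code:' \
    'arch = sm_80'
}

# Same filter chain as the diff line: keep the arch lines, take the text
# after the underscore, then sort numerically and drop duplicates.
extract_archs() {
  grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u
}

# Hypothetical wrapper mirroring `echo gpu_architectures=$(...)`.
# The substitution is unquoted, so the shell splits it on newlines and
# echo joins the pieces with single spaces.
build_line() {
  echo gpu_architectures=$(sample_dump | extract_archs)
}

build_line
```

The space-separated value comes from the unquoted command substitution, not from any of the tools in the pipeline, which each emit one architecture per line.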

Why is this looking at both libcudfjni.a and libcudf.a? It should be looking only at the libcudf artifact for the cudf properties, IMO, as that's where almost all the kernels are. If libcudfjni.a has an arch libcudf.a does not, we do not want to advertise it in the properties.

I don't see where this is handling the spark-rapids-jni kernels. Those are not in either libcudfjni.a or libcudf.a but rather in libcudf.so. The way this is written, cudf and spark-rapids-jni properties will always have the same property values. I'm OK with that, but it raises the question of why the plugin would be checking both of them if they're known to be identical.

This should not be using find, it should be examining the known location where the artifact is placed before it is put into the jar artifact. find invites accidental errors where someone has a rogue binary somewhere in their repo (e.g.: temporary directory, whatever) and it accidentally adds an arch that is not in the final jar artifact.

@jlowe find is from my POC NVIDIA/spark-rapids#10540 (comment)

We probably don't need to go this far, but I was thinking of an idea to record the arch sets for each lib going into a build (including the nvcomp ones), and then compute the set intersection for the final device support check. This can be done at build time or at run time.
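
The set-intersection idea can be sketched in a few lines of POSIX shell. This is only an illustration under assumptions: `intersect_archs` is a hypothetical helper that is not part of build/build-info, and the two lists are example values rather than real build output.

```shell
# Hypothetical helper (not in build/build-info): print the architectures
# that appear in both semicolon-separated lists, semicolon-terminated.
intersect_archs() {
  for arch in $(printf '%s' "$1" | tr ';' ' '); do
    # Wrap the second list in semicolons so ";80;" only matches whole entries.
    case ";$2;" in
      *";$arch;"*) printf '%s\n' "$arch" ;;
    esac
  done | sort -n -u | tr '\n' ';'
}

# Example values only: one list per library going into the build.
cudf_archs='70;75;80;86;90;'
jni_archs='75;80;86;'
echo "gpu_architectures=$(intersect_archs "$cudf_archs" "$jni_archs")"
```

Intersecting more than two libraries would amount to feeding the result back in as the first argument for each additional list.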

> This should not be using find, it should be examining the known location where the artifact is placed before it is

For cudf supported architectures, we should check ./thirdparty/cudf/cpp/build/libcudf.a.
For spark-rapids-jni, what is the difference between ./target/cmake-build/libcudf.so and ./target/classes/amd64/Linux/libcudf.so?

> We probably don't need to go this far, but I was thinking of an idea to record the arch sets for each lib going

We should probably use the set intersection across the libs. Can we ever have a case when spark-rapids-jni has different architectures supported from cudf?

> For spark-rapids-jni, what is the difference between ./target/cmake-build/libcudf.so and ./target/classes/amd64/Linux/libcudf.so?

They are the same, you can pick either path. The former is where the library is originally built, the latter is where it's copied to pull it into the resulting jar.

> Can we ever have a case when spark-rapids-jni has different architectures supported from cudf?

Theoretically yes but not easily the way it builds today. IMO we could easily overengineer this thing to solve a problem that doesn't need to be solved in practice. We could simply probe for the architectures on libcudf.so and call it a day. Is it 100% bulletproof? No, theoretically someone could do a very crazy thing to get that to not be correct because they went out of their way to somehow build libcudf.a, libcudfjni.a, and libcudf.so differently. But I've never seen this in practice and arguably YAGNI applies. That avoids the "where are all the places to look" problem and the need to do set intersections.

> Can we ever have a case when spark-rapids-jni has different architectures supported from cudf?

inconsistent build environment after upmerges, branch switches, other mistakes during custom dev build, bugs in a prod build plus things like nvcomp kernels built externally

Changed script delimiter to semi-colon and explicitly passed 'libcudf.a' and 'libcudf.so' as library files for cudf and spark-rapids-jni for processing by the build info script

My perception from internal chats is that devs routinely run into build situations that would be YAGNI, but we do not need to optimize for this 😄

build/build-info

jlowe · 2024-03-06T19:09:15Z

build/build-info

gerashegalov · 2024-03-06T19:27:32Z

build/build-info

I am not in favor of the space-separated arch list. The Java properties file format allows a space to be the separator between the key and the value portion of the line.

jshell> var props = new java.util.Properties()
props ==> {}

jshell> props.load(new java.io.StringReader("gpu_architectures 70 75 80 86 90"))

jshell> props
props ==> {gpu_architectures=70 75 80 86 90}

I prefer a semicolon-separated value format without any whitespace in the value, 70;75;80;86;90, similar to the first example in https://cmake.org/cmake/help/latest/prop_tgt/CUDA_ARCHITECTURES.html#prop_tgt:CUDA_ARCHITECTURES. This is also what we typically use for -DGPU_ARCHS passed to Maven.

That makes sense. This would keep it consistent with current usage.

gerashegalov · 2024-03-06T19:35:24Z

build/build-info

gerashegalov

LGTM

gerashegalov · 2024-03-06T23:03:29Z

build/build-info

  echo version=$version
  echo user=$USER
  echo revision=$(cd "$git_path" && git rev-parse HEAD)
  echo branch=$(cd "$git_path" && git rev-parse --abbrev-ref HEAD)
  echo date=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  echo url=$(cd "$git_path" && git config --get remote.origin.url)
+  echo gpu_architectures=$(cd "$git_path" && cuobjdump "$library_path" 2>/dev/null | grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u | tr '\n' ';')


nit: not a git op and no need to cd?

Suggested change

echo gpu_architectures=$(cd "$git_path" && cuobjdump "$library_path" 2>/dev/null | grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u | tr '\n' ';')

echo gpu_architectures=$(cuobjdump "$library_path" 2>/dev/null | grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u | tr '\n' ';')

I kept it as a safety measure, but the build paths are absolute. Refactored as suggested.

Signed-off-by: Partho Sarthi <psarth@nvidia.com>

gerashegalov

LGTM

gerashegalov · 2024-03-07T01:25:45Z

build

jlowe

Minor nit, not mustfix.

jlowe · 2024-03-07T14:40:09Z

build/build-info

  echo version=$version
  echo user=$USER
  echo revision=$(cd "$git_path" && git rev-parse HEAD)
  echo branch=$(cd "$git_path" && git rev-parse --abbrev-ref HEAD)
  echo date=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  echo url=$(cd "$git_path" && git config --get remote.origin.url)
+  echo gpu_architectures=$(cuobjdump "$libcudf_path" 2>/dev/null | grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u | tr '\n' ';')


Nit: awk can be leveraged to do the grep directly.

Suggested change

echo gpu_architectures=$(cuobjdump "$libcudf_path" 2>/dev/null | grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u | tr '\n' ';')

echo gpu_architectures=$(cuobjdump "$libcudf_path" 2>/dev/null | awk -F_ '/arch =/ {print $2}' | sort -n -u | tr '\n' ';')

jlowe · 2024-03-07T17:12:11Z

build

gerashegalov

LGTM

Fixes #10430. This PR ensures that Spark RAPIDS jobs are executed on supported GPU architectures without relying on manual configuration.

### Changes:
1. Processes `gpu_architectures` property from the `*version-info.properties` file generated by the native builds.
2. Verifies if the user is running the job on an architecture supported by the cuDF and JNI libraries and throws an exception if the architecture is unsupported.

### Testing
Tested on a Dataproc VM running on Nvidia P4 (GPU Architecture 6.1)

```
24/03/06 17:44:58 WARN RapidsPluginUtils: spark.rapids.sql.explain is set to `NOT_ON_GPU`. Set it to 'NONE' to suppress the diagnostics logging about the query placement on the GPU.
24/03/06 17:45:10 ERROR RapidsExecutorPlugin: Exception in the executor plugin, shutting down!
java.lang.RuntimeException: Device architecture 61 is unsupported. Minimum supported architecture: 75.
    at com.nvidia.spark.rapids.RapidsPluginUtils$.checkGpuArchitectureInternal(Plugin.scala:366)
    at com.nvidia.spark.rapids.RapidsPluginUtils$.checkGpuArchitecture(Plugin.scala:375)
    at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(Plugin.scala:461)
```

### Related PR
* NVIDIA/spark-rapids-jni#1840

* Add conf for minimum supported CUDA and error handling Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
* Revert "Add conf for minimum supported CUDA and error handling" This reverts commit 7b8eaea.
* Verify the GPU architecture is supported by the plugin libraries Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
* Use semi-colon as delimiter and use intersection of supported gpu architectures Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
* Allow for compatibility with major architectures Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
* Check for version as integers Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
* Modify compatibility check for same major version and same or higher minor version Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
* Add a config to skip verification and refactor checking Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
* Update RapidsConf.scala Co-authored-by: Jason Lowe <jlowe@nvidia.com>
* Update verification logic Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
* Update warning message Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
* Add unit tests and update warning message. Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
* Update exception class Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
* Address review comments Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

--------- Signed-off-by: Partho Sarthi <psarthi@nvidia.com> Co-authored-by: Jason Lowe <jlowe@nvidia.com>
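
The error in the test log above reports a minimum supported architecture, so one simple reading of the check is "reject devices below the smallest listed architecture". The sketch below shows only that reading, in POSIX shell. It is an assumption, not the plugin's Scala implementation: the commit list shows the verification logic was revised several times, `check_gpu_arch` is a hypothetical name, and the architecture list is an example chosen to match the log.

```shell
# Hypothetical helper: succeed when the device architecture is at least the
# smallest entry in a semicolon-separated list, otherwise report and fail.
# Assumption: a minimum-only rule; the real plugin logic may differ.
check_gpu_arch() {
  device_arch=$1
  min_arch=$(printf '%s' "$2" | tr ';' '\n' | sed '/^$/d' | sort -n | head -n 1)
  if [ "$device_arch" -lt "$min_arch" ]; then
    echo "Device architecture $device_arch is unsupported. Minimum supported architecture: $min_arch." >&2
    return 1
  fi
}

# Example values only: a device at 61 against a list whose minimum is 75.
check_gpu_arch 61 '75;80;86;90;' || echo "rejected"
check_gpu_arch 86 '75;80;86;90;' && echo "accepted"
```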

Add GPU architectures to the build-info file

4daf3f8

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

parthosa mentioned this pull request Mar 6, 2024

Handle minimum GPU architecture supported [databricks] NVIDIA/spark-rapids#10540

Merged

parthosa commented Mar 6, 2024

jlowe reviewed Mar 6, 2024

gerashegalov reviewed Mar 6, 2024

Update pom to include libcudf path and use semi-colon as delimiter in…

82cfc38

… build info script Signed-off-by: Partho Sarthi <psarth@nvidia.com>

parthosa requested review from jlowe and gerashegalov March 6, 2024 22:24

gerashegalov previously approved these changes Mar 6, 2024

Remove git path and rename libcudf_path

176c5d6

Signed-off-by: Partho Sarthi <psarth@nvidia.com>

parthosa dismissed gerashegalov’s stale review via 176c5d6 March 6, 2024 23:59

parthosa requested a review from gerashegalov March 7, 2024 00:00

gerashegalov previously approved these changes Mar 7, 2024

gerashegalov mentioned this pull request Mar 7, 2024

Use whoami instead of $USER #1843

Merged

mythrocks previously approved these changes Mar 7, 2024

jlowe previously approved these changes Mar 7, 2024

jlowe reviewed Mar 7, 2024

Remove grep and use awk instead

48eb19a

parthosa dismissed stale reviews from jlowe, mythrocks, and gerashegalov via 48eb19a March 7, 2024 16:03

parthosa requested a review from jlowe March 7, 2024 16:20

jlowe approved these changes Mar 7, 2024

gerashegalov approved these changes Mar 7, 2024

gerashegalov merged commit 8f8aeed into NVIDIA:branch-24.04 Mar 7, 2024
3 checks passed

parthosa deleted the spark-rapids-jni-10430 branch March 7, 2024 18:59

parthosa mentioned this pull request Mar 7, 2024

[FEA] Error out when running on an unsupported GPU architecture NVIDIA/spark-rapids#10430

Closed
