[BUG] non-deterministic compiled SQLExecPlugin.class with scala 2.13 deployment #9571

Closed
pxLi opened this issue Oct 30, 2023 · 6 comments · Fixed by #9576
Labels: bug (Something isn't working), build (Related to CI / CD or cleanly building), P0 (Must have for release), Scala 2.13 (Issues related to Scala 2.13 binaries)

pxLi (Collaborator) commented Oct 30, 2023

Describe the bug
We constantly see failures in our CI where the compiled class is not bitwise-identical across shims.

[2023-10-29T11:34:31.265Z] [ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:3.1.0:run (create-parallel-world) on project rapids-4-spark_2.13: An Ant BuildException has occured: The following error occurred while executing this line:
[2023-10-29T11:34:31.265Z] [ERROR] /home/jenkins/agent/workspace/jenkins-rapids_scala213_nightly-dev-github-4/dist/maven-antrun/build-parallel-worlds.xml:124: exec binary-dedupe.sh failed, exit code is 255, error msg is com/nvidia/spark/rapids/SQLExecPlugin.class is not bitwise-identical across shims

After comparing the generated SQLExecPlugin.class across the different Spark shims, I found the only diff is the following. After mvn install:

    public <A$> Function1<A$, BoxedUnit> compose(final Function1<A$, SparkSessionExtensions> g) {
        return Function1.compose$(this, g);
    }

    public <A$> Function1<SparkSessionExtensions, A$> andThen(final Function1<BoxedUnit, A$> g) {
        return Function1.andThen$(this, g);
    }

After a subsequent mvn deploy:

    public <A> Function1<A, BoxedUnit> compose(final Function1<A, SparkSessionExtensions> g) {
        return Function1.compose$(this, g);
    }

    public <A> Function1<SparkSessionExtensions, A> andThen(final Function1<BoxedUnit, A> g) {
        return Function1.andThen$(this, g);
    }

The only difference is the type-parameter name: A$ vs A.
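
For context, compose and andThen are not written in our source at all: SQLExecPlugin extends Function1[SparkSessionExtensions, Unit] (as the decompiled signatures above imply), so scalac emits them as mixin forwarders. A minimal sketch of that shape (illustrative only, not the real plugin body):

import org.apache.spark.sql.SparkSessionExtensions

// Illustrative sketch: any class extending Function1[SparkSessionExtensions, Unit]
// gets compose/andThen emitted as compiler-generated mixin forwarders. The generic
// type-parameter name in those forwarders (A vs A$) is chosen by scalac and carries
// no semantic meaning, but it does change the bytes of the emitted .class file.
class PluginLike extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    // the real plugin registers its rules here
  }
}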

Steps/Code to reproduce bug
Run the internal rapids_scala213_nightly-dev-github pipeline (duplicate the pipeline for testing).
UPDATE: local repro steps #9571 (comment)

Expected behavior
After the build, SQLExecPlugin.class should always have the identical SHA value across shims,
or SQLExecPlugin.class should be moved out of the check.

@pxLi pxLi added bug Something isn't working build Related to CI / CD or cleanly building P0 Must have for release labels Oct 30, 2023
pxLi (Collaborator, Author) commented Oct 30, 2023

After a couple of tests, I confirmed the issue shows up after the deploy command: https://github.com/NVIDIA/spark-rapids/blob/branch-23.12/jenkins/spark-nightly-build.sh#L120-L124

This can be reproduced locally.
A. Build and install the shim packages:

export URM_URL="https://urm.nvidia.com/artifactory/sw-spark-maven"
export WORKSPACE=`pwd`

cd scala2.13
ln -sf ../jenkins jenkins

mvn -Dmaven.wagon.http.retryHandler.count=3 -DretryFailedDeploymentCount=3 \
-Drat.skip=true -U -B clean install \
-s jenkins/settings.xml -P mirror-apache-to-urm \
-Dmaven.repo.local=$WORKSPACE/.m2 -Dcuda.version=cuda11 \
-DskipTests=true -Dbuildver=331

After the install command, check scala2.13/{sql-plugin|aggregator}/target/spark331 (or $WORKSPACE/.m2):
SQLExecPlugin.class in the sql/aggregator jar has the A$ form.
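
As a quick spot-check (a hedged sketch; the jar path is taken from this repro and is relative to the scala2.13 directory), you can hash the class straight out of the jar, which is the same kind of per-class checksum binary-dedupe.sh checks across shims:

import java.security.{DigestInputStream, MessageDigest}
import java.util.jar.JarFile

// Hedged sketch: prints the SHA-1 of SQLExecPlugin.class inside the built jar.
// Two shims are bitwise-identical for this class only if these digests match.
object ClassSha1 {
  def sha1OfEntry(jarPath: String, entryName: String): String = {
    val jar = new JarFile(jarPath)
    try {
      val md = MessageDigest.getInstance("SHA-1")
      val in = new DigestInputStream(jar.getInputStream(jar.getEntry(entryName)), md)
      try {
        val buf = new Array[Byte](8192)
        while (in.read(buf) != -1) ()   // drain the stream to update the digest
      } finally in.close()
      md.digest().map("%02x".format(_)).mkString
    } finally jar.close()
  }

  def main(args: Array[String]): Unit =
    println(sha1OfEntry(
      "sql-plugin/target/spark331/rapids-4-spark-sql_2.13-23.12.0-SNAPSHOT-spark331.jar",
      "com/nvidia/spark/rapids/SQLExecPlugin.class"))
}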

B. Execute the deploy command for the intermediate packages:

export URM_CREDS_USR=<internal_nv_user>
export URM_CREDS_PSW=<internal_nv_pwd>

mvn -Dmaven.wagon.http.retryHandler.count=3 -DretryFailedDeploymentCount=3 \
-Drat.skip=true -B deploy -pl '!dist' \
-s jenkins/settings.xml -P mirror-apache-to-urm -Dmaven.repo.local=$WORKSPACE/.m2 -Dcuda.version=cuda11 \
-DskipTests -Dbuildver=331

After the deploy command, check scala2.13/{sql-plugin|aggregator}/target/spark331 (or $WORKSPACE/.m2) again:
SQLExecPlugin.class in the newly built sql/aggregator jar now uses A instead,
even though the commit info in both artifacts is the same.

It looks like rapids-4-spark-sql_2.13 gets recompiled with a different result during the deploy process, causing the issue.

In CI:

# sql compiling in mvn install

[2023-10-30T07:00:31.138Z] [INFO] --- scala-maven-plugin:4.3.0:compile (scala-compile-first) @ rapids-4-spark-sql_2.13 ---
[2023-10-30T07:00:31.700Z] [INFO] Using incremental compilation using Mixed compile order
[2023-10-30T07:00:31.700Z] [INFO] Compiler bridge file: <workspace>/scala2.13/target/spark331/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.13-1.3.1-bin_2.13.10__52.0-1.3.1_20191012T045515.jar
[2023-10-30T07:00:32.262Z] [INFO] Compiling 421 Scala sources and 60 Java sources to <workspace>/scala2.13/sql-plugin/target/spark331/classes ...
[2023-10-30T07:00:34.791Z] [WARNING] [Warn] : [deprecation @  | origin= | version=] -target is deprecated: Use -release instead to compile against the correct platform API.
[2023-10-30T07:01:21.403Z] [WARNING] one warning found
[2023-10-30T07:01:23.917Z] [INFO] Done compiling.
[2023-10-30T07:01:25.804Z] [INFO] compile in 54.1 s

# then, sql compiling in mvn deploy

[2023-10-30T07:09:20.729Z] [INFO] --- scala-maven-plugin:4.3.0:compile (scala-compile-first) @ rapids-4-spark-sql_2.13 ---
[2023-10-30T07:09:20.984Z] [INFO] Using incremental compilation using Mixed compile order
[2023-10-30T07:09:20.984Z] [INFO] Compiler bridge file: <workspace>/scala2.13/target/spark331/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.13-1.3.1-bin_2.13.10__52.0-1.3.1_20191012T045515.jar
[2023-10-30T07:09:23.493Z] [INFO] Compiling 23 Scala sources to <workspace>/scala2.13/sql-plugin/target/spark331/classes ...
[2023-10-30T07:09:25.379Z] [WARNING] [Warn] : [deprecation @  | origin= | version=] -target is deprecated: Use -release instead to compile against the correct platform API.
[2023-10-30T07:09:37.544Z] [WARNING] one warning found
[2023-10-30T07:09:37.544Z] [INFO] Done compiling.
[2023-10-30T07:09:37.544Z] [INFO] compile in 16.0 s


# if we only mvn install the package, the hash is always the 'A$' variant
c45c081c347546be188c12cbd64620f9974a1a04 *com/nvidia/spark/rapids/SQLExecPlugin.class
# but in nightly we need to mvn deploy the aggregator artifacts after each shim's build,
# and then it gets overwritten with the 'A' variant
361897d6318eadacacf6f46e7e827b19558e035e *com/nvidia/spark/rapids/SQLExecPlugin.class

@pxLi pxLi changed the title [BUG] non-deterministic compiled SQLExecPlugin.class with scala 2.13 [BUG] non-deterministic compiled SQLExecPlugin.class with scala 2.13 deployment Oct 30, 2023
jlowe (Member) commented Oct 30, 2023

Some simpler repro steps that do not require access to the internal repository or any CI Jenkins settings, and don't spend time building unrelated projects in the repo:

$ mvn clean install -pl sql-plugin -am -Dcuda.version=cuda11 -DskipTests -Dskip -Dbuildver=331
[... mvn output elided ...]
$ javap -cp sql-plugin/target/spark331/rapids-4-spark-sql_2.13-23.12.0-SNAPSHOT-spark331.jar com.nvidia.spark.rapids.SQLExecPlugin | grep compose
  public <A$> scala.Function1<A$, scala.runtime.BoxedUnit> compose(scala.Function1<A$, org.apache.spark.sql.SparkSessionExtensions>);
$ mvn deploy -pl sql-plugin -am -Dcuda.version=cuda11 -DskipTests -Dskip -Dbuildver=331 -DaltDeploymentRepository=tmplocalrepo::default::file:/tmp/tmplocalrepo
[... mvn output elided ...]
$ javap -cp sql-plugin/target/spark331/rapids-4-spark-sql_2.13-23.12.0-SNAPSHOT-spark331.jar com.nvidia.spark.rapids.SQLExecPlugin | grep compose
  public <A> scala.Function1<A, scala.runtime.BoxedUnit> compose(scala.Function1<A, org.apache.spark.sql.SparkSessionExtensions>);

jlowe (Member) commented Oct 30, 2023

Verified it is not really the deploy goal that is directly the problem, but rather the fact that running deploy implicitly re-runs the install phase (and all earlier phases like compile). Here are even simpler repro steps:

$ mvn clean install -pl sql-plugin -am -Dcuda.version=cuda11 -DskipTests -Dskip -Dbuildver=331
[... mvn output elided ...]
$ javap -cp sql-plugin/target/spark331/rapids-4-spark-sql_2.13-23.12.0-SNAPSHOT-spark331.jar com.nvidia.spark.rapids.SQLExecPlugin | grep compose
  public <A$> scala.Function1<A$, scala.runtime.BoxedUnit> compose(scala.Function1<A$, org.apache.spark.sql.SparkSessionExtensions>);
$ rm sql-plugin/target/spark331/classes/com/nvidia/spark/rapids/SQLExecPlugin.class
$ mvn install -pl sql-plugin -am -Dcuda.version=cuda11 -DskipTests -Dskip -Dbuildver=331
[... mvn output elided ...]
$ javap -cp sql-plugin/target/spark331/rapids-4-spark-sql_2.13-23.12.0-SNAPSHOT-spark331.jar com.nvidia.spark.rapids.SQLExecPlugin | grep compose
  public <A> scala.Function1<A, scala.runtime.BoxedUnit> compose(scala.Function1<A, org.apache.spark.sql.SparkSessionExtensions>);

For some reason SQLExecPlugin and 22 other files are getting recompiled in the sql-plugin project as part of the prerequisite phases executed before the deploy phase, and it's during that recompile that the change occurs. I verified from Maven -X output that we're passing the same scala compiler options in both cases, so it's not clear why it comes out different during a fresh compile but not during a recompile.

jlowe (Member) commented Oct 30, 2023

I verified that if we avoid doing an install run first followed by a deploy, and instead build everything in a single deploy run, we end up with A$ instead of A.

jlowe (Member) commented Oct 30, 2023

The affected functions are part of scala.Function1, so on a hunch I looked for other classes in sql-plugin that use that trait. Other classes are affected when recompiled in isolation (e.g., GpuLog): A$ becomes A in the compose signature. Oddly, also removing ScalaStack.class, forcing it to recompile at the same time as the scala.Function1 descendants, seems to fix it. For example, this slight modification of the above repro steps "fixes" the inconsistency:

$ mvn clean install -pl sql-plugin -am -Dcuda.version=cuda11 -DskipTests -Dskip -Dbuildver=331
[... mvn output elided ...]
$ javap -cp sql-plugin/target/spark331/rapids-4-spark-sql_2.13-23.12.0-SNAPSHOT-spark331.jar com.nvidia.spark.rapids.SQLExecPlugin | grep compose
  public <A$> scala.Function1<A$, scala.runtime.BoxedUnit> compose(scala.Function1<A$, org.apache.spark.sql.SparkSessionExtensions>);
$ rm sql-plugin/target/spark331/classes/com/nvidia/spark/rapids/SQLExecPlugin.class sql-plugin/target/spark331/classes/com/nvidia/spark/rapids/ScalaStack.class
$ mvn install -pl sql-plugin -am -Dcuda.version=cuda11 -DskipTests -Dskip -Dbuildver=331
[... mvn output elided ...]
$ javap -cp sql-plugin/target/spark331/rapids-4-spark-sql_2.13-23.12.0-SNAPSHOT-spark331.jar com.nvidia.spark.rapids.SQLExecPlugin | grep compose
  public <A$> scala.Function1<A$, scala.runtime.BoxedUnit> compose(scala.Function1<A$, org.apache.spark.sql.SparkSessionExtensions>);

jlowe (Member) commented Oct 30, 2023

If I hack out the use of ScalaStack and have the code use scala.collection.mutable.Stack directly for the 2.13 build, the A$ goes away in all the compose uses I checked and becomes A. A recompile leaves it at A as expected, so the problem seems related to ScalaStack somehow.
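
A hedged sketch of what that hack looks like, assuming ScalaStack is a thin cross-version shim over the standard stack type (the shim's real definition may differ): on the 2.13 build the code just names scala.collection.mutable.Stack directly.

import scala.collection.mutable

// Hedged sketch of the workaround: for the Scala 2.13 build, use
// scala.collection.mutable.Stack directly instead of a cross-version
// ScalaStack shim (the shim's actual definition is an assumption here).
object StackDirect {
  def reversed[T](items: Seq[T]): Seq[T] = {
    val stack = mutable.Stack.empty[T]  // 2.13's ArrayDeque-based Stack
    items.foreach(stack.push)           // push every element
    Seq.fill(stack.size)(stack.pop())   // pop in LIFO order, i.e. reversed
  }
}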
