Enable implicit JDK profile activation [databricks] #9591

Merged · 25 commits · Nov 2, 2023

Commits
093e31d
Consolidate deps switching in an intermediate pom
gerashegalov Oct 22, 2023
824cc80
Merge remote-tracking branch 'origin/branch-23.12' into shimDepsSwitc…
gerashegalov Oct 24, 2023
e0b5384
Merge remote-tracking branch 'origin/branch-23.12' into shimDepsSwitc…
gerashegalov Oct 26, 2023
225e517
Revert inclusion of shim-deps module
gerashegalov Oct 26, 2023
956d952
Merge remote-tracking branch 'origin/branch-23.12' into shimDepsSwitc…
gerashegalov Oct 26, 2023
1410c12
Merge remote-tracking branch 'origin/branch-23.12' into shimDepsSwitc…
gerashegalov Oct 27, 2023
addaccb
merge
gerashegalov Oct 27, 2023
53258d8
Merge remote-tracking branch 'origin/branch-23.12' into shimDepsSwitc…
gerashegalov Oct 27, 2023
b881e30
shim-deps module as a build order function for a decendant
gerashegalov Oct 27, 2023
7ae26ba
regenerate 2.13 poms
gerashegalov Oct 27, 2023
30646ec
make shim-deps 2.12/2.13 sensitive
gerashegalov Oct 27, 2023
5034895
Make JDK profiles implicit
gerashegalov Oct 27, 2023
d990638
Merge remote-tracking branch 'origin/branch-23.12' into implicitJDKPr…
gerashegalov Oct 27, 2023
8456c52
Remove explicit JDK profiles from github workflow
gerashegalov Oct 28, 2023
183aa14
Enforce Java and Scala 2.13 buildvers
gerashegalov Oct 28, 2023
50254a2
Merge remote-tracking branch 'origin/branch-23.12' into shimDepsSwitc…
gerashegalov Oct 28, 2023
2cba130
Merge branch 'shimDepsSwitcherParent' into implicitJDKProfiles2
gerashegalov Oct 28, 2023
0ea0357
Document skipping JDK enforcement
gerashegalov Oct 28, 2023
b355ae9
Update README
gerashegalov Oct 29, 2023
07fdfe1
Merge remote-tracking branch 'origin/branch-23.12' into implicitJDKPr…
gerashegalov Oct 31, 2023
b44c921
Merge remote-tracking branch 'origin/branch-23.12' into implicitJDKPr…
gerashegalov Oct 31, 2023
985f60a
separate fake modules
gerashegalov Oct 31, 2023
6dc4088
Remove hardcoding of -target/release
gerashegalov Oct 31, 2023
a7a4a40
Remove scala2.13 Java target hardcoding
gerashegalov Oct 31, 2023
87fbd80
Undo unnecessary enforce plugin changes
gerashegalov Oct 31, 2023
6 changes: 3 additions & 3 deletions .github/workflows/mvn-verify-check.yml
@@ -179,7 +179,7 @@ jobs:
-Drat.skip=true \
${{ env.COMMON_MVN_FLAGS }}


verify-all-modules:
needs: get-shim-versions-from-dist
runs-on: ubuntu-latest
@@ -208,7 +208,7 @@
java -version && mvn --version && echo "ENV JAVA_HOME: $JAVA_HOME, PATH: $PATH"
# test command
mvn -Dmaven.wagon.http.retryHandler.count=3 -B verify \
-P "individual,pre-merge,jdk${{ matrix.java-version }}" \
-P "individual,pre-merge" \
-Dbuildver=${{ matrix.spark-version }} \
${{ env.COMMON_MVN_FLAGS }}

@@ -244,6 +244,6 @@ jobs:
java -version && mvn --version && echo "ENV JAVA_HOME: $JAVA_HOME, PATH: $PATH"
# test command
./mvnw -Dmaven.wagon.http.retryHandler.count=3 -B install \
-P "individual,pre-merge,jdk11" \
-P "individual,pre-merge" \
-Dbuildver=${{ needs.get-shim-versions-from-dist.outputs.defaultSparkVersion }} \
${{ env.COMMON_MVN_FLAGS }}
20 changes: 7 additions & 13 deletions CONTRIBUTING.md
@@ -60,12 +60,12 @@ You can find all available build versions in the top level pom.xml file. If you
for Databricks then you should use the `jenkins/databricks/build.sh` script and modify it for
the version you want.

Note that we build against both Scala 2.12 and 2.13. Any contribution you make to the
Note that we build against both Scala 2.12 and 2.13. Any contribution you make to the
codebase should compile with both Scala 2.12 and 2.13 for Apache Spark versions 3.3.0 and
higher.
higher.

Also, if you make changes in the parent `pom.xml` or any other of the module `pom.xml`
files, you must run the following command to sync the changes between the Scala 2.12 and
Also, if you make changes in the parent `pom.xml` or any other of the module `pom.xml`
files, you must run the following command to sync the changes between the Scala 2.12 and
2.13 pom files:

```shell script
@@ -74,7 +74,7 @@ files, you must run the following command to sync the changes between the Scala

That way any new dependencies or other changes will also be picked up in the Scala 2.13 build.

See the [scala2.13](scala2.13) directory for more information on how to build against
See the [scala2.13](scala2.13) directory for more information on how to build against
Scala 2.13.

To get an uber jar with more than 1 version you have to `mvn package` each version
@@ -191,14 +191,8 @@ public final class com.nvidia.spark.rapids.shims.SparkShimImpl {
We support JDK8 as our main JDK version, and test JDK8, JDK11 and JDK17. It is possible to build and run
with more modern JDK versions, but these are untested. The first step is to set `JAVA_HOME` in
the environment to your JDK root directory. NOTE: for JDK17, we only support building against Spark 3.3.0+.

Also make sure to use scala-maven-plugin version `scala.plugin.version` 4.6.0 or later to correctly process
[maven.compiler.release](https://github.com/davidB/scala-maven-plugin/blob/4.6.1/src/main/java/scala_maven/ScalaMojoSupport.java#L161)
flag if cross-compilation is required.

```bash
mvn clean verify -Dbuildver=330 -P<jdk11|jdk17>
```
If you need to build with a JDK version that we do not test internally, add
`-Denforcer.skipRules=requireJavaVersion` to the Maven invocation.
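
For example, a minimal sketch (the `JAVA_HOME` path and `buildver` value here are
illustrative only, not requirements):

```bash
# Hypothetical JDK location; point JAVA_HOME at whichever JDK you are actually using.
export JAVA_HOME=/usr/lib/jvm/your-jdk
# Skip the Java version enforcement rule only when building with an untested JDK.
mvn clean verify -Dbuildver=330 -Denforcer.skipRules=requireJavaVersion
```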

### Building and Testing with ARM

3 changes: 2 additions & 1 deletion aggregator/pom.xml
@@ -21,8 +21,9 @@

<parent>
<groupId>com.nvidia</groupId>
<artifactId>rapids-4-spark-parent_2.12</artifactId>
<artifactId>rapids-4-spark-jdk-profiles_2.12</artifactId>
<version>23.12.0-SNAPSHOT</version>
<relativePath>../jdk-profiles/pom.xml</relativePath>
</parent>
<artifactId>rapids-4-spark-aggregator_2.12</artifactId>
<name>RAPIDS Accelerator for Apache Spark Aggregator</name>
57 changes: 27 additions & 30 deletions datagen/README.md
@@ -2,14 +2,14 @@

In order to do scale testing we need a way to generate lots of data in a
deterministic way that gives us control over the number of unique values
in a column, the skew of the values in a column, and the correlation of
in a column, the skew of the values in a column, and the correlation of
data between tables for joins. To accomplish this we wrote
`org.apache.spark.sql.tests.datagen`.

## Setup Environment

To get started with big data generation the first thing you need to do is
to include the appropriate jar on the classpath for your version of Apache Spark.
To get started with big data generation the first thing you need to do is
to include the appropriate jar on the classpath for your version of Apache Spark.
Note that this does not run on the GPU, but it does use parts of the same shim framework
as the RAPIDS Accelerator. The jar is specific to the version of Spark you
are using and is not pushed to Maven Central. Because of this you will have to
@@ -22,15 +22,12 @@ mvn clean package -Dbuildver=$SPARK_VERSION

Where `$SPARK_VERSION` is a compressed version number, like 330 for Spark 3.3.0.

If you are building with a jdk version that is not 8, you will need to add in the
corresponding profile flag `-P<jdk11|jdk17>`

After this the jar should be at
After this the jar should be at
`target/datagen_2.12-$PLUGIN_VERSION-spark$SPARK_VERSION.jar`
for example a Spark 3.3.0 jar for the 23.12.0 release would be
`target/datagen_2.12-23.12.0-spark330.jar`

To get a spark shell with this you can run
To get a spark shell with this you can run
```shell
spark-shell --jars target/datagen_2.12-23.12.0-spark330.jar
```
@@ -97,7 +94,7 @@ the [advanced control section](#advanced-control).
Generating nearly random data that fits a schema is great, but we want
to process this data in interesting ways, like doing a hash aggregate to see
how well it scales. To do that we really need to have a way to configure
the number of unique values that appear in a column. The
the number of unique values that appear in a column. The
[Internal Details](#internal-details) section describes how this works
but here we will control this by setting a seed range. Let's start off by creating
10 million rows of data to be processed.
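
A minimal hypothetical sketch of the seed-range control described here (the
`dataTable` handle and column name are assumed, mirroring the API calls shown later
in this README):

```scala
// Assume dataTable was created with 10 million rows, as described above.
// Only 100 distinct seeds, and hence at most 100 distinct values, will be
// generated for column "a".
dataTable("a").setSeedRange(1, 100)
dataTable.toDF(spark).selectExpr("approx_count_distinct(a)").show()
```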
@@ -211,7 +208,7 @@

### NormalDistribution

Often data is distributed in a normal or Gaussian like distribution.
Often data is distributed in a normal or Gaussian like distribution.
`NormalDistribution` takes a mean and a standard deviation to provide a way to
insert some basic skew into your data. Please note that this will clamp
the produced values to the configured seed range, so if the seed range is not large
@@ -268,15 +265,15 @@ dataTable.toDF(spark).groupBy("a").count().orderBy(desc("count")).show()

### MultiDistribution

There are times when you might want to combine more than one distribution. Like
There are times when you might want to combine more than one distribution. Like
having a `NormalDistribution` along with a `FlatDistribution` so that the data
is skewed, but there is still nearly full coverage of the seed range. Or you could
combine two `NormalDistribution` instances to have two different sized bumps at
different key ranges. `MultiDistribution` allows you to do this. It takes a
`Seq` of weight/`LocationToSeedMapping` pairs. The weights are relative to
`Seq` of weight/`LocationToSeedMapping` pairs. The weights are relative to
each other and determine how often one mapping will be used vs another. If
you wanted a `NormalDistribution` to be used 10 times as often as a
`FlatDistribution` you would give the normal a weight of 10 and the flat a
you wanted a `NormalDistribution` to be used 10 times as often as a
`FlatDistribution` you would give the normal a weight of 10 and the flat a
weight of 1.
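
A hypothetical sketch of that 10-to-1 combination (the `dataTable` handle, column
name, and the exact constructor shapes of `NormalDistribution`, `FlatDistribution`,
and `MultiDistribution` are assumed here rather than taken from the API):

```scala
// Weight 10: a normal "bump" centered on seed 500; weight 1: flat coverage of the range.
val mapping = MultiDistribution(Seq(
  (10, NormalDistribution(500.0, 100.0)),
  (1, FlatDistribution())))
dataTable("a").setSeedMapping(mapping)
```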


@@ -316,7 +313,7 @@ only showing top 20 rows
## Multi-Column Keys

With the basic tools provided we can now replicate a lot of processing. We can do
complicated things like a join with a fact table followed by an aggregation.
complicated things like a join with a fact table followed by an aggregation.

```scala
val dbgen = DBGen()
@@ -327,7 +324,7 @@ dataTable("join_key").setSeedRange(0, 999)
dataTable("agg_key").setSeedRange(0, 9)
val fdf = factTable.toDF(spark)
val ddf = dataTable.toDF(spark)
spark.time(fdf.join(ddf).groupBy("agg_key").agg(min("value"),
spark.time(fdf.join(ddf).groupBy("agg_key").agg(min("value"),
max("value"), sum("value")).orderBy("agg_key").show())
+--------------------+-----------+----------+-------------------+
| agg_key| min(value)|max(value)| sum(value)|
@@ -363,12 +360,12 @@ generate the seed is normalized so that for each row the same seed is passed into
all the generator functions. (This is not 100% correct for arrays and maps, but it
is close enough). This results in the generated data being correlated with each
other so that if you set a seed range of 1 to 200, you will get 200 unique values
in each column, and 200 unique values for any combination of the keys in that
in each column, and 200 unique values for any combination of the keys in that
key group.

This should work with any distribution and any type you want. The key to making
this work is that you need to configure the value ranges the same for both sets
of corresponding keys. In most cases you want the types to be the same as well,
of corresponding keys. In most cases you want the types to be the same as well,
but Spark supports equi-joins where the left and right keys are different types.
The default generators for integral types should produce the same values for the
same input keys if the value range is the same for both. This is not true for
@@ -433,15 +430,15 @@ command to the final data in a column.

### LocationToSeedMapping

The first level maps the current location of a data item
(table, column, row + sub-row) to a single 64-bit seed. The
The first level maps the current location of a data item
(table, column, row + sub-row) to a single 64-bit seed. The
`LocationToSeedMapping` class handles this. That mapping should produce a
seed that corresponds to the user-provided seed range. But it has full
control over how it wants to do that. It could favor some seeds more than
others, or simply go off of the row itself.

You can manually set this for columns or sub-columns through the
`configureKeyGroup` API in a `TableGen`. Or you can call
You can manually set this for columns or sub-columns through the
`configureKeyGroup` API in a `TableGen`. Or you can call
`setSeedMapping` on a column or sub-column. Be careful not to mix the two
because they can conflict with each other and there are no guard rails.

@@ -452,8 +449,8 @@ level. If the user does not configure nulls, or if the type is not nullable
this never runs.

This can be set on any column or sub-column by calling either `setNullProbability`
which will install a `NullProbabilityGenerationFunction` or by calling the
`setNullGen` API on that item.
which will install a `NullProbabilityGenerationFunction` or by calling the
`setNullGen` API on that item.
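
For instance, a one-line sketch (the column name is hypothetical and the argument is
assumed to be a null fraction between 0 and 1):

```scala
// Roughly 10% of the values generated for column "a" become nulls.
dataTable("a").setNullProbability(0.1)
```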

### LengthGeneratorFunction

@@ -463,9 +460,9 @@ to avoid data skew in the resulting column. This is because the naive way to gen
where all possible lengths have an equal probability produces skew in the
resulting values. A length of 0 has one and only one possible value in it.
So if we restrict the length to 0 or 1, then half of all values generated will be
zero length strings, which is not ideal.
zero length strings, which is not ideal.

If you want to set the length of a String or Array you can navigate to the
If you want to set the length of a String or Array you can navigate to the
column or sub-column you want and call `setLength(fixedLen)` on it. This will install
an updated `FixedLengthGeneratorFunction`. You may set a range of lengths using
`setLength(minLen, maxLen)`, but this may introduce skew in the resulting data.
@@ -486,15 +483,15 @@ dataTable.toDF(spark).show(false)
+---+----------+----+
```

You can also set a `LengthGeneratorFunction` instance for any column or sub-column
You can also set a `LengthGeneratorFunction` instance for any column or sub-column
using the `setLengthGen` API.
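
A quick hypothetical sketch using the `setLength` convenience described above (the
column name and length are illustrative):

```scala
// Every value generated for column "b" will have a length of exactly 10.
dataTable("b").setLength(10)
```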

### GeneratorFunction

The thing that actually produces data is a `GeneratorFunction`. It maps the key to
a value in the desired value range if that range is supported. For nested
types like structs or arrays, parts of this can be delegated to child
GeneratorFunctions.
GeneratorFunctions.

You can set the `GeneratorFunction` for a column or sub-column with the
`setValueGen` API.
@@ -508,12 +505,12 @@ control to decide how the location information is mapped to the values. By
convention, it should honor things like the `LocationToSeedMapping`,
but it is under no requirement to do so.

This is similar for the `LocationToSeedMapping` and the `NullGeneratorFunction`.
This is similar for the `LocationToSeedMapping` and the `NullGeneratorFunction`.
If you have a requirement to generate null values from row 1024 to row 9999999,
you can write a `NullGeneratorFunction` to do that and install it on a column:

```scala
case class MyNullGen(minRow: Long, maxRow: Long,
case class MyNullGen(minRow: Long, maxRow: Long,
gen: GeneratorFunction = null) extends NullGeneratorFunction {

override def withWrapped(gen: GeneratorFunction): MyNullGen =
4 changes: 2 additions & 2 deletions delta-lake/delta-20x/pom.xml
@@ -21,9 +21,9 @@

<parent>
<groupId>com.nvidia</groupId>
<artifactId>rapids-4-spark-parent_2.12</artifactId>
<artifactId>rapids-4-spark-jdk-profiles_2.12</artifactId>
<version>23.12.0-SNAPSHOT</version>
<relativePath>../../pom.xml</relativePath>
<relativePath>../../jdk-profiles/pom.xml</relativePath>
</parent>

<artifactId>rapids-4-spark-delta-20x_2.12</artifactId>
4 changes: 2 additions & 2 deletions delta-lake/delta-21x/pom.xml
@@ -21,9 +21,9 @@

<parent>
<groupId>com.nvidia</groupId>
<artifactId>rapids-4-spark-parent_2.12</artifactId>
<artifactId>rapids-4-spark-jdk-profiles_2.12</artifactId>
<version>23.12.0-SNAPSHOT</version>
<relativePath>../../pom.xml</relativePath>
<relativePath>../../jdk-profiles/pom.xml</relativePath>
</parent>

<artifactId>rapids-4-spark-delta-21x_2.12</artifactId>
4 changes: 2 additions & 2 deletions delta-lake/delta-22x/pom.xml
@@ -21,9 +21,9 @@

<parent>
<groupId>com.nvidia</groupId>
<artifactId>rapids-4-spark-parent_2.12</artifactId>
<artifactId>rapids-4-spark-jdk-profiles_2.12</artifactId>
<version>23.12.0-SNAPSHOT</version>
<relativePath>../../pom.xml</relativePath>
<relativePath>../../jdk-profiles/pom.xml</relativePath>
</parent>

<artifactId>rapids-4-spark-delta-22x_2.12</artifactId>
4 changes: 2 additions & 2 deletions delta-lake/delta-24x/pom.xml
@@ -21,9 +21,9 @@

<parent>
<groupId>com.nvidia</groupId>
<artifactId>rapids-4-spark-parent_2.12</artifactId>
<artifactId>rapids-4-spark-jdk-profiles_2.12</artifactId>
<version>23.12.0-SNAPSHOT</version>
<relativePath>../../pom.xml</relativePath>
<relativePath>../../jdk-profiles/pom.xml</relativePath>
</parent>

<artifactId>rapids-4-spark-delta-24x_2.12</artifactId>
4 changes: 2 additions & 2 deletions delta-lake/delta-spark321db/pom.xml
@@ -21,9 +21,9 @@

<parent>
<groupId>com.nvidia</groupId>
<artifactId>rapids-4-spark-parent_2.12</artifactId>
<artifactId>rapids-4-spark-jdk-profiles_2.12</artifactId>
<version>23.12.0-SNAPSHOT</version>
<relativePath>../../pom.xml</relativePath>
<relativePath>../../jdk-profiles/pom.xml</relativePath>
</parent>

<artifactId>rapids-4-spark-delta-spark321db_2.12</artifactId>
4 changes: 2 additions & 2 deletions delta-lake/delta-spark330db/pom.xml
@@ -21,9 +21,9 @@

<parent>
<groupId>com.nvidia</groupId>
<artifactId>rapids-4-spark-parent_2.12</artifactId>
<artifactId>rapids-4-spark-jdk-profiles_2.12</artifactId>
<version>23.12.0-SNAPSHOT</version>
<relativePath>../../pom.xml</relativePath>
<relativePath>../../jdk-profiles/pom.xml</relativePath>
</parent>

<artifactId>rapids-4-spark-delta-spark330db_2.12</artifactId>
4 changes: 2 additions & 2 deletions delta-lake/delta-spark332db/pom.xml
@@ -21,9 +21,9 @@

<parent>
<groupId>com.nvidia</groupId>
<artifactId>rapids-4-spark-parent_2.12</artifactId>
<artifactId>rapids-4-spark-jdk-profiles_2.12</artifactId>
<version>23.12.0-SNAPSHOT</version>
<relativePath>../../pom.xml</relativePath>
<relativePath>../../jdk-profiles/pom.xml</relativePath>
</parent>

<artifactId>rapids-4-spark-delta-spark332db_2.12</artifactId>
4 changes: 2 additions & 2 deletions delta-lake/delta-stub/pom.xml
@@ -21,9 +21,9 @@

<parent>
<groupId>com.nvidia</groupId>
<artifactId>rapids-4-spark-parent_2.12</artifactId>
<artifactId>rapids-4-spark-jdk-profiles_2.12</artifactId>
<version>23.12.0-SNAPSHOT</version>
<relativePath>../../pom.xml</relativePath>
<relativePath>../../jdk-profiles/pom.xml</relativePath>
</parent>

<artifactId>rapids-4-spark-delta-stub_2.12</artifactId>
3 changes: 2 additions & 1 deletion dist/pom.xml
@@ -21,8 +21,9 @@

<parent>
<groupId>com.nvidia</groupId>
<artifactId>rapids-4-spark-parent_2.12</artifactId>
<artifactId>rapids-4-spark-jdk-profiles_2.12</artifactId>
<version>23.12.0-SNAPSHOT</version>
<relativePath>../jdk-profiles/pom.xml</relativePath>
</parent>
<artifactId>rapids-4-spark_2.12</artifactId>
<name>RAPIDS Accelerator for Apache Spark Distribution</name>