
spillable cache for GpuCartesianRDD #1784

Closed · wants to merge 39 commits
Commits (39)
86fc104
spillable cache for GpuCartesianRDD
sperlingxx Feb 22, 2021
07b2d15
lazy cache
sperlingxx Feb 23, 2021
72b2e12
Update cudf dependency to 0.18 (#1828)
NvTimLiu Mar 1, 2021
653c33a
Merge branch 'branch-0.4' into fix-merge
jlowe Mar 1, 2021
bb03535
Update mortgage tests to support reading multiple dataset formats (#1…
NvTimLiu Mar 1, 2021
17657fe
Remove benchmarks (#1826)
jlowe Mar 1, 2021
c40ec37
Merge branch 'branch-0.4' into fix-merge
jlowe Mar 1, 2021
50fd165
Merge pull request #1835 from jlowe/fix-merge
jlowe Mar 1, 2021
6483543
Spark 3.0.2 shim no longer a snapshot shim (#1831)
jlowe Mar 1, 2021
c52e9a5
Merge pull request #1837 from NVIDIA/branch-0.4
nvauto Mar 1, 2021
7e210c2
Make databricks build.sh more convenient for dev (#1838)
tgravescs Mar 1, 2021
51049a6
Add a shim provider for Spark 3.2.0 development branch (#1704)
gerashegalov Mar 2, 2021
28b00a7
Cleanup unused Jenkins files and scripts (#1829)
NvTimLiu Mar 2, 2021
e614ef4
Spark 3.1.1 shim no longer a snapshot shim (#1832)
jlowe Mar 2, 2021
fc9cecf
Update to note support for 3.0.2 (#1842)
sameerz Mar 2, 2021
85bfacb
Fix fails on the mortgage ETL test (#1845)
NvTimLiu Mar 2, 2021
63a2e3d
Have most of range partitioning run on the GPU (#1796)
revans2 Mar 2, 2021
c776be9
Merge branch 'branch-0.4' into fix-merge
jlowe Mar 2, 2021
e06c226
Fix NullPointerException on null partition insert (#1744)
gerashegalov Mar 2, 2021
5b93033
Merge branch 'branch-0.4' into fix-merge
jlowe Mar 2, 2021
923fa4e
Merge pull request #1848 from jlowe/fix-merge
jlowe Mar 2, 2021
dea867a
Update changelog for 0.4 (#1849)
sameerz Mar 2, 2021
95c3e75
Merge pull request #1850 from NVIDIA/branch-0.4
nvauto Mar 2, 2021
40c0eda
Refactor join code to reduce duplicated code (#1839)
jlowe Mar 2, 2021
19d1f05
Add shim for Spark 3.0.3 (#1834)
jlowe Mar 3, 2021
32213fa
Cost-based optimizer (#1616)
andygrove Mar 3, 2021
24ab0ae
Fix Part Suite Tests (#1852)
revans2 Mar 3, 2021
eab507e
Add shim for Spark 3.1.2 (#1836)
jlowe Mar 3, 2021
ad0b6d9
fix shuffle manager doc on ucx library path (#1858)
rongou Mar 4, 2021
dc2847c
Disable coalesce batch spilling to avoid cudf contiguous_split bug (#…
jlowe Mar 4, 2021
3c243c7
Merge branch 'branch-0.4' into fix-merge
jlowe Mar 4, 2021
2439b4b
Fix tests for Spark 3.2.0 shim (#1869)
revans2 Mar 4, 2021
6e57e27
Add in support for DateAddInterval (#1841)
nartal1 Mar 5, 2021
60fb754
Merge pull request #1875 from jlowe/fix-merge
pxLi Mar 5, 2021
1a32484
spillable cache for GpuCartesianRDD
sperlingxx Feb 22, 2021
b908c73
lazy cache
sperlingxx Feb 23, 2021
4718d00
adapt new interface of SpillableColumnarBatch
sperlingxx Mar 5, 2021
6fd391b
fix merge conflicts
sperlingxx Mar 5, 2021
8a46305
fix merge conflicts
sperlingxx Mar 5, 2021
283 changes: 282 additions & 1 deletion CHANGELOG.md

Large diffs are not rendered by default.

10 changes: 0 additions & 10 deletions README.md
@@ -5,18 +5,8 @@ The RAPIDS Accelerator for Apache Spark provides a set of plugins for
[Apache Spark](https://spark.apache.org) that leverage GPUs to accelerate processing
via the [RAPIDS](https://rapids.ai) libraries and [UCX](https://www.openucx.org/).

![TPCxBB Like query results](./docs/img/tpcxbb-like-results.png "TPCxBB Like Query Results")

The chart above shows results from running ETL queries based off of the
[TPCxBB benchmark](http://www.tpc.org/tpcx-bb/default.asp). These are **not** official results in
any way. It uses a 10TB Dataset (scale factor 10,000), stored in parquet. The processing happened on
a two node DGX-2 cluster. Each node has 96 CPU cores, 1.5TB host memory, 16 V100 GPUs, and 512 GB
GPU memory.

To get started and try the plugin out use the [getting started guide](./docs/get-started/getting-started.md).

For more information about these benchmarks, see the [benchmark guide](./docs/benchmarks.md).

## Compatibility

The SQL plugin tries to produce results that are bit for bit identical with Apache Spark.
6 changes: 6 additions & 0 deletions api_validation/pom.xml
@@ -46,6 +46,12 @@
<spark.version>${spark311.version}</spark.version>
</properties>
</profile>
<profile>
<id>spark320</id>
<properties>
<spark.version>${spark320.version}</spark.version>
</properties>
</profile>
</profiles>

<dependencies>
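Given the new `spark320` profile above, a build against the Spark 3.2.0 development shim could hypothetically be invoked as follows. Only the profile id comes from the pom snippet; the module path and Maven goal are illustrative assumptions, not taken from this PR:

```shell
# Activate the spark320 profile so spark.version resolves to ${spark320.version};
# module path and goal are placeholders for illustration.
mvn -P spark320 -pl api_validation verify
```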
8 changes: 4 additions & 4 deletions docs/FAQ.md
@@ -10,10 +10,10 @@ nav_order: 11

### What versions of Apache Spark does the RAPIDS Accelerator for Apache Spark support?

-The RAPIDS Accelerator for Apache Spark requires version 3.0.0 or 3.0.1 of Apache Spark. Because the
-plugin replaces parts of the physical plan that Apache Spark considers to be internal the code for
-those plans can change even between bug fix releases. As a part of our process, we try to stay on
-top of these changes and release updates as quickly as possible.
+The RAPIDS Accelerator for Apache Spark requires version 3.0.0, 3.0.1, 3.0.2 or 3.1.1 of Apache
+Spark. Because the plugin replaces parts of the physical plan that Apache Spark considers to be
+internal the code for those plans can change even between bug fix releases. As a part of our
+process, we try to stay on top of these changes and release updates as quickly as possible.

### Which distributions are supported?

4 changes: 3 additions & 1 deletion docs/additional-functionality/rapids-shuffle.md
@@ -257,7 +257,10 @@ In this section, we are using a docker container built using the sample dockerfile
| 3.0.1 | com.nvidia.spark.rapids.spark301.RapidsShuffleManager |
| 3.0.1 EMR | com.nvidia.spark.rapids.spark301emr.RapidsShuffleManager |
| 3.0.2 | com.nvidia.spark.rapids.spark302.RapidsShuffleManager |
| 3.0.3 | com.nvidia.spark.rapids.spark303.RapidsShuffleManager |
| 3.1.1 | com.nvidia.spark.rapids.spark311.RapidsShuffleManager |
| 3.1.2 | com.nvidia.spark.rapids.spark312.RapidsShuffleManager |
| 3.2.0 | com.nvidia.spark.rapids.spark320.RapidsShuffleManager |

2. Recommended settings for UCX 1.9.0+
```shell
@@ -270,7 +273,6 @@ In this section, we are using a docker container built using the sample dockerfile
--conf spark.executorEnv.UCX_MAX_RNDV_RAILS=1 \
--conf spark.executorEnv.UCX_MEMTYPE_CACHE=n \
--conf spark.executorEnv.UCX_IB_RX_QUEUE_LEN=1024 \
--conf spark.executorEnv.LD_LIBRARY_PATH=/usr/lib:/usr/lib/ucx \
--conf spark.executor.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR}
```

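Putting the table and the settings together, a launch against Spark 3.2.0 might look roughly like the sketch below. The shuffle manager class name is taken from the table above and the jar variables mirror the document's existing snippet; the rest is an assumption, not a tested configuration:

```shell
# Select the shim-specific shuffle manager for Spark 3.2.0 (class name from
# the version table above); jar environment variables follow the doc's snippet.
spark-shell \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark320.RapidsShuffleManager \
  --conf spark.executorEnv.UCX_MEMTYPE_CACHE=n \
  --conf spark.executor.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR}
```

Note that each Spark version requires its matching shim class; mixing, say, the `spark311` shuffle manager with a Spark 3.2.0 runtime would fail at startup.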
212 changes: 0 additions & 212 deletions docs/benchmarks.md

This file was deleted.

1 change: 1 addition & 0 deletions docs/configs.md
@@ -139,6 +139,7 @@ Name | SQL Function(s) | Description | Default Value | Notes
<a name="sql.expression.CreateNamedStruct"></a>spark.rapids.sql.expression.CreateNamedStruct|`named_struct`, `struct`|Creates a struct with the given field names and values|true|None|
<a name="sql.expression.CurrentRow$"></a>spark.rapids.sql.expression.CurrentRow$| |Special boundary for a window frame, indicating stopping at the current row|true|None|
<a name="sql.expression.DateAdd"></a>spark.rapids.sql.expression.DateAdd|`date_add`|Returns the date that is num_days after start_date|true|None|
<a name="sql.expression.DateAddInterval"></a>spark.rapids.sql.expression.DateAddInterval| |Adds interval to date|true|None|
<a name="sql.expression.DateDiff"></a>spark.rapids.sql.expression.DateDiff|`datediff`|Returns the number of days from startDate to endDate|true|None|
<a name="sql.expression.DateSub"></a>spark.rapids.sql.expression.DateSub|`date_sub`|Returns the date that is num_days before start_date|true|None|
<a name="sql.expression.DayOfMonth"></a>spark.rapids.sql.expression.DayOfMonth|`dayofmonth`, `day`|Returns the day of the month from a date or timestamp|true|None|
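The new `DateAddInterval` row has an empty SQL Function(s) column because the expression maps to interval arithmetic syntax rather than a named function. As a hedged sketch of the kind of query it covers (exact result formatting may vary by Spark version):

```shell
# DATE + INTERVAL arithmetic is handled by the DateAddInterval expression,
# as opposed to the named date_add(start_date, num_days) function.
spark-sql -e "SELECT DATE '2021-03-05' + INTERVAL 10 DAYS AS d"
```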
4 changes: 2 additions & 2 deletions docs/download.md
@@ -21,8 +21,8 @@ This release includes additional performance improvements, including
* Instructions on how to use [Alluxio caching](get-started/getting-started-alluxio.md) with Spark to
leverage caching.

-The release is supported on Apache Spark 3.0.0, 3.0.1, 3.1.1, Databricks 7.3 ML LTS and Google Cloud
-Platform Dataproc 2.0.
+The release is supported on Apache Spark 3.0.0, 3.0.1, 3.0.2, 3.1.1, Databricks 7.3 ML LTS and
+Google Cloud Platform Dataproc 2.0.

The list of all supported operations is provided [here](supported_ops.md).

1 change: 0 additions & 1 deletion docs/get-started/Dockerfile.cuda
@@ -35,7 +35,6 @@ RUN set -ex && \
ln -s /lib /lib64 && \
mkdir -p /opt/spark && \
mkdir -p /opt/spark/jars && \
mkdir -p /opt/tpch && \
mkdir -p /opt/spark/examples && \
mkdir -p /opt/spark/work-dir && \
mkdir -p /opt/sparkRapidsPlugin && \
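As a usage sketch, the modified Dockerfile could presumably be built as shown below; the image tag and build context are assumptions for illustration, not part of this PR:

```shell
# Build the getting-started CUDA image; the Dockerfile path comes from the
# diff above, the tag is a placeholder.
docker build -t spark-rapids-getting-started -f docs/get-started/Dockerfile.cuda .
```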