[NSE-350]Update the documents for 1.1.1 (#351)
Hong authored Jun 4, 2021
1 parent 559b317 commit 9de8562
Showing 9 changed files with 229 additions and 53 deletions.
154 changes: 146 additions & 8 deletions CHANGELOG.md
@@ -1,13 +1,151 @@
# Change log
-Generated on 2021-04-29
+Generated on 2021-06-02

## Release 1.1.1

### Native SQL Engine

#### Features
|Issue|Title|
|:---|:---|
|[#304](https://github.com/oap-project/native-sql-engine/issues/304)|Upgrade to Arrow 4.0.0|
|[#285](https://github.com/oap-project/native-sql-engine/issues/285)|ColumnarWindow: Support Date/Timestamp input in MAX/MIN|
|[#297](https://github.com/oap-project/native-sql-engine/issues/297)|Disable incremental compiler in CI|
|[#245](https://github.com/oap-project/native-sql-engine/issues/245)|Support columnar rdd cache|
|[#276](https://github.com/oap-project/native-sql-engine/issues/276)|Add option to switch Hadoop version|
|[#274](https://github.com/oap-project/native-sql-engine/issues/274)|Comment to trigger tpc-h RAM test|
|[#256](https://github.com/oap-project/native-sql-engine/issues/256)|CI: do not run ram report for each PR|

#### Bugs Fixed
|Issue|Title|
|:---|:---|
|[#325](https://github.com/oap-project/native-sql-engine/issues/325)|java.util.ConcurrentModificationException: mutation occurred during iteration|
|[#329](https://github.com/oap-project/native-sql-engine/issues/329)|numPartitions are not the same|
|[#318](https://github.com/oap-project/native-sql-engine/issues/318)|fix Spark 311 on data source v2|
|[#311](https://github.com/oap-project/native-sql-engine/issues/311)|Build reports errors|
|[#302](https://github.com/oap-project/native-sql-engine/issues/302)|test on v2 failed due to an exception|
|[#257](https://github.com/oap-project/native-sql-engine/issues/257)|different version of slf4j-log4j|
|[#293](https://github.com/oap-project/native-sql-engine/issues/293)|Fix BHJ loss if key = 0|
|[#248](https://github.com/oap-project/native-sql-engine/issues/248)|arrow dependency must put after arrow installation|

#### PRs
|PR|Title|
|:---|:---|
|[#332](https://github.com/oap-project/native-sql-engine/pull/332)|[NSE-325] fix incremental compile issue with 4.5.x scala-maven-plugin|
|[#335](https://github.com/oap-project/native-sql-engine/pull/335)|[NSE-329] fix out partitioning in BHJ and SHJ|
|[#328](https://github.com/oap-project/native-sql-engine/pull/328)|[NSE-318]check schema before reuse exchange|
|[#307](https://github.com/oap-project/native-sql-engine/pull/307)|[NSE-304] Upgrade to Arrow 4.0.0|
|[#312](https://github.com/oap-project/native-sql-engine/pull/312)|[NSE-311] Build reports errors|
|[#272](https://github.com/oap-project/native-sql-engine/pull/272)|[NSE-273] support spark311|
|[#303](https://github.com/oap-project/native-sql-engine/pull/303)|[NSE-302] fix v2 test|
|[#306](https://github.com/oap-project/native-sql-engine/pull/306)|[NSE-304] Upgrade to Arrow 4.0.0: Change basic GHA TPC-H test target …|
|[#286](https://github.com/oap-project/native-sql-engine/pull/286)|[NSE-285] ColumnarWindow: Support Date input in MAX/MIN|
|[#298](https://github.com/oap-project/native-sql-engine/pull/298)|[NSE-297] Disable incremental compiler in GHA CI|
|[#291](https://github.com/oap-project/native-sql-engine/pull/291)|[NSE-257] fix multiple slf4j bindings|
|[#294](https://github.com/oap-project/native-sql-engine/pull/294)|[NSE-293] fix unsafemap with key = '0'|
|[#233](https://github.com/oap-project/native-sql-engine/pull/233)|[NSE-207] fix issues found from aggregate unit tests|
|[#246](https://github.com/oap-project/native-sql-engine/pull/246)|[NSE-245]Adding columnar RDD cache support|
|[#289](https://github.com/oap-project/native-sql-engine/pull/289)|[NSE-206]Update installation guide and configuration guide.|
|[#277](https://github.com/oap-project/native-sql-engine/pull/277)|[NSE-276] Add option to switch Hadoop version|
|[#275](https://github.com/oap-project/native-sql-engine/pull/275)|[NSE-274] Comment to trigger tpc-h RAM test|
|[#271](https://github.com/oap-project/native-sql-engine/pull/271)|[NSE-196] clean up configs in unit tests|
|[#258](https://github.com/oap-project/native-sql-engine/pull/258)|[NSE-257] fix different versions of slf4j-log4j12|
|[#259](https://github.com/oap-project/native-sql-engine/pull/259)|[NSE-248] fix arrow dependency order|
|[#249](https://github.com/oap-project/native-sql-engine/pull/249)|[NSE-241] fix hashagg result length|
|[#255](https://github.com/oap-project/native-sql-engine/pull/255)|[NSE-256] do not run ram report test on each PR|


### SQL DS Cache

#### Features
|Issue|Title|
|:---|:---|
|[#118](https://github.com/oap-project/sql-ds-cache/issues/118)|port to Spark 3.1.1|

#### Bugs Fixed
|Issue|Title|
|:---|:---|
|[#121](https://github.com/oap-project/sql-ds-cache/issues/121)|OAP Index creation stuck issue|

#### PRs
|PR|Title|
|:---|:---|
|[#132](https://github.com/oap-project/sql-ds-cache/pull/132)|Fix SampleBasedStatisticsSuite UnitTest case|
|[#122](https://github.com/oap-project/sql-ds-cache/pull/122)|[ sql-ds-cache-121] Fix Index stuck issues|
|[#119](https://github.com/oap-project/sql-ds-cache/pull/119)|[SQL-DS-CACHE-118][POAE7-1130] port sql-ds-cache to Spark3.1.1|


### OAP MLlib

#### Features
|Issue|Title|
|:---|:---|
|[#26](https://github.com/oap-project/oap-mllib/issues/26)|[PIP] Support Spark 3.0.1 / 3.0.2 and upcoming 3.1.1|

#### PRs
|PR|Title|
|:---|:---|
|[#39](https://github.com/oap-project/oap-mllib/pull/39)|[ML-26] Build for different spark version by -Pprofile|


### PMEM Spill

#### Features
|Issue|Title|
|:---|:---|
|[#34](https://github.com/oap-project/pmem-spill/issues/34)|Support vanilla spark 3.1.1|

#### PRs
|PR|Title|
|:---|:---|
|[#41](https://github.com/oap-project/pmem-spill/pull/41)|[PMEM-SPILL-34][POAE7-1119]Port RDD cache to Spark 3.1.1 as separate module|


### PMEM Common

#### Features
|Issue|Title|
|:---|:---|
|[#10](https://github.com/oap-project/pmem-common/issues/10)|add -mclflushopt flag to enable clflushopt for gcc|
|[#8](https://github.com/oap-project/pmem-common/issues/8)|use clflushopt instead of clflush |

#### PRs
|PR|Title|
|:---|:---|
|[#11](https://github.com/oap-project/pmem-common/pull/11)|[PMEM-COMMON-10][POAE7-1010]Add -mclflushopt flag to enable clflushop…|
|[#9](https://github.com/oap-project/pmem-common/pull/9)|[PMEM-COMMON-8][POAE7-896]use clflush optimize version for clflush|


### PMEM Shuffle

#### Features
|Issue|Title|
|:---|:---|
|[#15](https://github.com/oap-project/pmem-shuffle/issues/15)|Doesn't work with Spark3.1.1|

#### PRs
|PR|Title|
|:---|:---|
|[#16](https://github.com/oap-project/pmem-shuffle/pull/16)|[pmem-shuffle-15] Make pmem-shuffle support Spark3.1.1|


### Remote Shuffle

#### Features
|Issue|Title|
|:---|:---|
|[#18](https://github.com/oap-project/remote-shuffle/issues/18)|upgrade to Spark-3.1.1|
|[#11](https://github.com/oap-project/remote-shuffle/issues/11)|Support DAOS Object Async API|

#### PRs
|PR|Title|
|:---|:---|
|[#19](https://github.com/oap-project/remote-shuffle/pull/19)|[REMOTE-SHUFFLE-18] upgrade to Spark-3.1.1|
|[#14](https://github.com/oap-project/remote-shuffle/pull/14)|[REMOTE-SHUFFLE-11] Support DAOS Object Async API|



## Release 1.1.0
* [Native SQL Engine](#native-sql-engine)
* [SQL DS Cache](#sql-ds-cache)
* [OAP MLlib](#oap-mllib)
* [PMEM Spill](#pmem-spill)
* [PMEM Shuffle](#pmem-shuffle)
* [Remote Shuffle](#remote-shuffle)

### Native SQL Engine

@@ -264,7 +402,7 @@ Generated on 2021-04-29
|[#6](https://github.com/oap-project/pmem-shuffle/pull/6)|[PMEM-SHUFFLE-7] enable fsdax mode in pmem-shuffle|


-### Remote-Shuffle
+### Remote Shuffle

#### Features
|Issue|Title|
14 changes: 7 additions & 7 deletions arrow-data-source/README.md
@@ -18,8 +18,8 @@ Please make sure you have already installed the software in your system.
3. cmake 3.16 or higher version
4. maven 3.6 or higher version
5. Hadoop 2.7.5 or higher version
-6. Spark 3.0.0 or higher version
-7. Intel Optimized Arrow 3.0.0
+6. Spark 3.1.1 or higher version
+7. Intel Optimized Arrow 4.0.0

### Building by Conda

@@ -145,14 +145,14 @@ mvn clean -DskipTests package
readlink -f standard/target/spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar
```

-### Download Spark 3.0.0
+### Download Spark 3.1.1

-Currently ArrowDataSource works on the Spark 3.0.0 version.
+Currently ArrowDataSource works with Spark 3.1.1.

```
-wget http://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
-tar -xf ./spark-3.0.0-bin-hadoop2.7.tgz
-export SPARK_HOME=`pwd`/spark-3.0.0-bin-hadoop2.7
+wget http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
+tar -xf ./spark-3.1.1-bin-hadoop2.7.tgz
+export SPARK_HOME=`pwd`/spark-3.1.1-bin-hadoop2.7
```

If you are new to Apache Spark, please go through [Spark's official deploying guide](https://spark.apache.org/docs/latest/cluster-overview.html) before getting started with ArrowDataSource.
4 changes: 2 additions & 2 deletions docs/Installation.md
@@ -26,8 +26,8 @@ Based on the different environment, there are some parameters can be set via -D
| arrow_root | When build_arrow set to False, arrow_root will be enabled to find the location of your existing arrow library. | /usr/local |
| build_protobuf | Build Protobuf from Source. If set to False, default library path will be used to find protobuf library. | True |

-When build_arrow set to True, the build_arrow.sh will be launched and compile a custom arrow library from [OAP Arrow](https://github.com/oap-project/arrow)
-If you wish to change any parameters from Arrow, you can change it from the build_arrow.sh script under native-sql-enge/arrow-data-source/script/.
+When build_arrow is set to True, build_arrow.sh will be launched to compile a custom Arrow library from [OAP Arrow](https://github.com/oap-project/arrow/tree/arrow-4.0.0-oap-1.1.1).
+If you wish to change any Arrow parameters, you can do so in the `build_arrow.sh` script under `native-sql-engine/arrow-data-source/script/`.
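For illustration, a hypothetical Maven invocation wiring these parameters together; only `build_arrow`, `arrow_root` and `build_protobuf` come from the table above, while the goal and the chosen values are assumptions:

```
# Hypothetical sketch: reuse a pre-installed Arrow instead of building it.
# -Dbuild_arrow / -Darrow_root / -Dbuild_protobuf are the parameters documented above.
mvn clean package -DskipTests \
  -Dbuild_arrow=False \
  -Darrow_root=/usr/local \
  -Dbuild_protobuf=True
```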

### Additional Notes
[Notes for Installation Issues](./InstallationNotes.md)
16 changes: 8 additions & 8 deletions docs/OAP-Developer-Guide.md
@@ -3,13 +3,13 @@
This document contains the instructions & scripts on installing necessary dependencies and building OAP modules.
You can get more detailed information from each OAP module below.

-* [SQL Index and Data Source Cache](https://github.com/oap-project/sql-ds-cache/blob/v1.1.0-spark-3.0.0/docs/Developer-Guide.md)
-* [PMem Common](https://github.com/oap-project/pmem-common/tree/v1.1.0-spark-3.0.0)
-* [PMem Spill](https://github.com/oap-project/pmem-spill/tree/v1.1.0-spark-3.0.0)
-* [PMem Shuffle](https://github.com/oap-project/pmem-shuffle/tree/v1.1.0-spark-3.0.0#5-install-dependencies-for-pmem-shuffle)
-* [Remote Shuffle](https://github.com/oap-project/remote-shuffle/tree/v1.1.0-spark-3.0.0)
-* [OAP MLlib](https://github.com/oap-project/oap-mllib/tree/v1.1.0-spark-3.0.0)
-* [Native SQL Engine](https://github.com/oap-project/native-sql-engine/tree/v1.1.0-spark-3.0.0)
+* [SQL Index and Data Source Cache](https://github.com/oap-project/sql-ds-cache/blob/v1.1.1-spark-3.1.1/docs/Developer-Guide.md)
+* [PMem Common](https://github.com/oap-project/pmem-common/tree/v1.1.1-spark-3.1.1)
+* [PMem Spill](https://github.com/oap-project/pmem-spill/tree/v1.1.1-spark-3.1.1)
+* [PMem Shuffle](https://github.com/oap-project/pmem-shuffle/tree/v1.1.1-spark-3.1.1#5-install-dependencies-for-pmem-shuffle)
+* [Remote Shuffle](https://github.com/oap-project/remote-shuffle/tree/v1.1.1-spark-3.1.1)
+* [OAP MLlib](https://github.com/oap-project/oap-mllib/tree/v1.1.1-spark-3.1.1)
+* [Native SQL Engine](https://github.com/oap-project/native-sql-engine/tree/v1.1.1-spark-3.1.1)

## Building OAP

@@ -22,7 +22,7 @@ We provide scripts to help automatically install dependencies required, please c
# cd oap-tools
# sh dev/install-compile-time-dependencies.sh
```
-*Note*: oap-tools tag version `v1.1.0-spark-3.0.0` corresponds to all OAP modules' tag version `v1.1.0-spark-3.0.0`.
+*Note*: oap-tools tag version `v1.1.1-spark-3.1.1` corresponds to all OAP modules' tag version `v1.1.1-spark-3.1.1`.

Then the dependencies below will be installed:

8 changes: 4 additions & 4 deletions docs/OAP-Installation-Guide.md
@@ -20,7 +20,7 @@ $ wget -c https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
$ chmod +x Miniconda2-latest-Linux-x86_64.sh
$ bash Miniconda2-latest-Linux-x86_64.sh
```
-For changes to take effect, ***reload*** your current shell.
+For changes to take effect, ***close and re-open*** your current shell.
To test your installation, run the command `conda list` in your terminal window. A list of installed packages appears if it has been installed correctly.

### Installing OAP
@@ -29,7 +29,7 @@ Create a Conda environment and install OAP Conda package.
```bash
$ conda create -n oapenv -y python=3.7
$ conda activate oapenv
-$ conda install -c conda-forge -c intel -y oap=1.1.0
+$ conda install -c conda-forge -c intel -y oap=1.1.1
```

Once you have finished the steps above, OAP's dependencies are installed and OAP is built; you will find the built OAP jars under `$HOME/miniconda2/envs/oapenv/oap_jars`.
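As a quick sanity check (assuming the default Miniconda prefix used in this guide):

```
# List the built OAP jars shipped by the Conda package
ls $HOME/miniconda2/envs/oapenv/oap_jars
```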
@@ -38,8 +38,8 @@ Dependencies below are required by OAP and all of them are included in OAP Conda

- [Arrow](https://github.com/oap-project/arrow/tree/arrow-4.0.0-oap-1.1.1)
- [Plasma](http://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/)
-- [Memkind](https://anaconda.org/intel/memkind)
-- [Vmemcache](https://anaconda.org/intel/vmemcache)
+- [Memkind](https://github.com/memkind/memkind/tree/v1.10.1)
+- [Vmemcache](https://github.com/pmem/vmemcache.git)
- [HPNL](https://anaconda.org/intel/hpnl)
- [PMDK](https://github.com/pmem/pmdk)
- [OneAPI](https://software.intel.com/content/www/us/en/develop/tools/oneapi.html)
4 changes: 2 additions & 2 deletions docs/Prerequisite.md
@@ -9,8 +9,8 @@ Please make sure you have already installed the software in your system.
4. cmake 3.16 or higher version
5. Maven 3.6.3 or higher version
6. Hadoop 2.7.5 or higher version
-7. Spark 3.0.0 or higher version
-8. Intel Optimized Arrow 3.0.0
+7. Spark 3.1.1 or higher version
+8. Intel Optimized Arrow 4.0.0

## gcc installation

12 changes: 6 additions & 6 deletions docs/SparkInstallation.md
@@ -1,12 +1,12 @@
-### Download Spark 3.0.1
+### Download Spark 3.1.1

-Currently Native SQL Engine works on the Spark 3.0.1 version.
+Currently Native SQL Engine works with Spark 3.1.1.

```
-wget http://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
-sudo mkdir -p /opt/spark && sudo mv spark-3.0.1-bin-hadoop3.2.tgz /opt/spark
-sudo cd /opt/spark && sudo tar -xf spark-3.0.1-bin-hadoop3.2.tgz
-export SPARK_HOME=/opt/spark/spark-3.0.1-bin-hadoop3.2/
+wget http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
+sudo mkdir -p /opt/spark && sudo mv spark-3.1.1-bin-hadoop3.2.tgz /opt/spark
+cd /opt/spark && sudo tar -xf spark-3.1.1-bin-hadoop3.2.tgz
+export SPARK_HOME=/opt/spark/spark-3.1.1-bin-hadoop3.2/
```

### [Or building Spark from source](https://spark.apache.org/docs/latest/building-spark.html)
35 changes: 27 additions & 8 deletions docs/User-Guide.md
@@ -6,7 +6,11 @@ A Native Engine for Spark SQL with vectorized SIMD optimizations

![Overview](./image/nativesql_arch.png)

-Spark SQL works very well with structured row-based data. It used WholeStageCodeGen to improve the performance by Java JIT code. However Java JIT is usually not working very well on utilizing latest SIMD instructions, especially under complicated queries. [Apache Arrow](https://arrow.apache.org/) provided CPU-cache friendly columnar in-memory layout, its SIMD optimized kernels and LLVM based SQL engine Gandiva are also very efficient. Native SQL Engine used these technologies and brought better performance to Spark SQL.
+Spark SQL works very well with structured row-based data. It uses WholeStageCodeGen to improve performance via Java JIT code. However, Java JIT is usually not very effective at exploiting the latest SIMD instructions, especially under complicated queries. [Apache Arrow](https://arrow.apache.org/) provides a CPU-cache-friendly columnar in-memory layout, and its SIMD-optimized kernels and LLVM-based SQL engine Gandiva are also very efficient.
+
+Native SQL Engine reimplements the Spark SQL execution layer with SIMD-friendly columnar data processing based on Apache Arrow, and leverages Arrow's CPU-cache-friendly columnar in-memory layout, SIMD-optimized kernels and LLVM-based expression engine to bring better performance to Spark SQL.


## Key Features

@@ -36,7 +40,20 @@ We implemented columnar shuffle to improve the shuffle performance. With the col

Please check the operator support details [here](./operators.md).

-## Build the Plugin
+## How to use OAP: Native SQL Engine

There are three ways to use OAP: Native SQL Engine:
1. Use precompiled jars
2. Building by Conda Environment
3. Building by Yourself

### Use precompiled jars

Please go to [OAP's Maven Central Repository](https://repo1.maven.org/maven2/com/intel/oap/) to find the Native SQL Engine jars.
For usage, you will need the two jar files below:
1. `spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar`, located in `com/intel/oap/spark-arrow-datasource-standard/<version>/`
2. `spark-columnar-core-<version>-jar-with-dependencies.jar`, located in `com/intel/oap/spark-columnar-core/<version>/`
Please note that these files are fat jars shipped with our custom Arrow library and pre-compiled on our server (using GCC 9.3.0 and LLVM 7.0.1), which means you need GCC 9.3.0 and LLVM 7.0.1 pre-installed on your system for normal usage.
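As a sketch, downloading both jars could look like the following; the `1.1.1` version and the exact file names are assumptions based on the layout described above:

```
# Hypothetical download of the two fat jars from Maven Central;
# adjust VERSION to the release you need.
VERSION=1.1.1
BASE=https://repo1.maven.org/maven2/com/intel/oap
wget ${BASE}/spark-arrow-datasource-standard/${VERSION}/spark-arrow-datasource-standard-${VERSION}-jar-with-dependencies.jar
wget ${BASE}/spark-columnar-core/${VERSION}/spark-columnar-core-${VERSION}-jar-with-dependencies.jar
```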

### Building by Conda

@@ -47,18 +64,18 @@ Then you can just skip below steps and jump to [Get Started](#get-started).

If you prefer to build from the source code yourself, please follow the steps below to set up your environment.

-### Prerequisite
+#### Prerequisite

There are some requirements to meet before you build the project.
Please check the document [Prerequisite](./Prerequisite.md) and make sure you have already installed the software in your system.
If you are running a Spark cluster, please make sure all the software is installed on every node.

-### Installation
-Please check the document [Installation Guide](./Installation.md)
+#### Installation

-### Configuration & Testing
-Please check the document [Configuration Guide](./Configuration.md)
+Please check the document [Installation Guide](./Installation.md)

## Get started

To enable OAP Native SQL Engine, the previously built jar `spark-columnar-core-<version>-jar-with-dependencies.jar` should be added to the Spark configuration. We also recommend using `spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar`. We will demonstrate an example using both jar files.
Spark-related options are:

@@ -71,6 +88,8 @@ SPARK related options are:
For Spark Standalone mode, please set the above value as a relative path to the jar file.
For Spark YARN Cluster mode, please set the above value as an absolute path to the jar file.

For more configuration options, please check the document [Configuration Guide](./Configuration.md).

Example to run Spark Shell with the ArrowDataSource jar file:
```
${SPARK_HOME}/bin/spark-shell \
  ...
```

@@ -99,7 +118,7 @@ orders.createOrReplaceTempView("orders")
```
spark.sql("select * from orders where o_orderdate > date '1998-07-26'").show(20000, false)
```
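A minimal sketch of such an invocation; the jar paths and the `1.1.1` version are assumptions, while `spark.driver.extraClassPath` and `spark.executor.extraClassPath` are standard Spark options:

```
# Minimal sketch: put both fat jars on the driver and executor classpaths.
# Paths and version are assumptions; see the Configuration Guide for the full option set.
JARS=/path/to/spark-columnar-core-1.1.1-jar-with-dependencies.jar:/path/to/spark-arrow-datasource-standard-1.1.1-jar-with-dependencies.jar
${SPARK_HOME}/bin/spark-shell \
  --conf spark.driver.extraClassPath=${JARS} \
  --conf spark.executor.extraClassPath=${JARS}
```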

The result should show up on the Spark console and you can check the DAG diagram with some Columnar Processing stages. Native SQL Engine still lacks some features; please check out the [limitations](./limitations.md).


## Performance data