
Commit

Add clarifications and details to integration-tests README [skip ci] (#4694)

* add clarifications and details to integration-tests md

Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>

* address PR comments

Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>

* Fix typos

Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>
amahussein authored Feb 11, 2022
1 parent 2053dc7 commit c2ba7b3
157 changes: 138 additions & 19 deletions integration_tests/README.md
are intended to be able to be run against any Spark-compatible cluster/release to
verify that the plugin is doing the right thing in as many cases as possible.

There are two sets of tests. The PySpark tests are described on this page. The scala tests are
described [here](../tests/README.md).

## Setting Up the Environment

The tests are based on `pyspark` and `pytest` running on Python 3. Only a small number of
Python dependencies need to be installed for the tests, and they are only required on the
driver. You can install them on all nodes in the cluster, but it is not required.

### Prerequisites

The build requires `OpenJDK 8`, `maven`, and `python`.
Skip to the next section if you have already installed them.

#### Java Environment

It is recommended to use `alternatives` to manage multiple Java versions.
Then you can simply set `JAVA_HOME` to the JDK directory:

```shell script
JAVA_HOME=$(readlink -nf $(which java) | xargs dirname | xargs dirname | xargs dirname)
```
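
For example, on RHEL-family distributions a quick sanity check might look like the following
sketch (it assumes `java` is already registered with `alternatives`; Debian-based systems use
`update-alternatives` instead):

```shell script
## Pick the desired JDK among the registered alternatives
sudo alternatives --config java
## Re-derive JAVA_HOME from the selected java binary and confirm the version in use
JAVA_HOME=$(readlink -nf $(which java) | xargs dirname | xargs dirname | xargs dirname)
echo $JAVA_HOME
java -version
```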

#### Installing python using pyenv

It is recommended that you use `pyenv` to manage Python installations.

- First, make sure to install all the required dependencies listed
[here](https://github.com/pyenv/pyenv/wiki#suggested-build-environment).
- Follow the instructions for the right method of installation described
[here](https://github.com/pyenv/pyenv#installation).
- Verify that `pyenv` is set up correctly:

```shell script
which pyenv
```

- Using `pyenv` to set the Python installation
  - To list the versions available to install (will return a long list):

    ```shell script
    pyenv install --list
    ```

  - To install a specific version from the available list:

    ```shell script
    pyenv install 3.X.Y
    ```

  - To check the versions installed locally:

    ```shell script
    ls ~/.pyenv/versions/
    ```

  - To set the Python environment to one of the installed versions:

    ```shell script
    pyenv global 3.X.Y
    ```

For full details and instructions on `pyenv`, see the [pyenv GitHub page](https://github.com/pyenv/pyenv).
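
Putting the steps together, a minimal end-to-end sketch might look like this (the version number
is only an example; pick one from the list above):

```shell script
## Install an example Python version, make it the default, and confirm it is active
pyenv install 3.8.12
pyenv global 3.8.12
python --version
```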

#### Installing a specific version of Maven

Package managers such as `brew` and `apt` offer Maven, but the packaged version may lag behind the
latest release. In that case, you can install the latest binary from the [Maven download page](https://maven.apache.org/download.cgi).
For a manual installation, you need to set up your environment:

```shell script
export M2_HOME=PATH_TO_MAVEN_ROOT_DIRECTORY
export M2=${M2_HOME}/bin
export PATH=$M2:$PATH
```
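
Afterwards you can confirm that the expected Maven and Java versions are picked up (assuming Maven
was unpacked to the directory exported above):

```shell script
## Should report the manually installed Maven version and the JDK 8 home
mvn -version
```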

### Dependencies

- pytest
: A framework that makes it easy to write small, readable tests, and can scale to support complex
functional testing for applications and libraries (requires Python 3.6+).
- sre_yield
: Provides a set of APIs to generate string data from a regular expression.
- pandas
: A fast, powerful, flexible, and easy-to-use open source data analysis and manipulation
tool; it is only needed when testing integration with pandas.
- pyarrow
: Provides a Python API for functionality provided by the Arrow C++ libraries, along with
tools for Arrow integration and interoperability with pandas, NumPy, and other software in
the Python ecosystem. This is used to test improved transfer performance to pandas-based
user-defined functions.
- pytest-xdist
: A plugin that extends pytest with new test execution modes, the most used being distributing
tests across multiple CPUs to speed up test execution.
- findspark
: Adds `pyspark` to `sys.path` at runtime.

You can install all the dependencies using `pip` by running the following command:

```shell script
pip install pytest \
sre_yield \
pandas \
pyarrow \
pytest-xdist \
findspark
```
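
As a quick sanity check, you can verify that the packages import cleanly (a sketch; `xdist` is the
module name installed by `pytest-xdist`):

```shell script
python -c "import pytest, sre_yield, pandas, pyarrow, xdist, findspark; print('dependencies OK')"
```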

### Installing Spark

You need to install Spark 3.x and add `$SPARK_HOME/bin` to your `$PATH`, where
`SPARK_HOME` points to the directory of a runnable Spark distribution.
This can be done in the following three steps:

1. Choose the appropriate way to create a Spark distribution:

   - To run the plugin against a non-snapshot version of Spark, download a distribution from the
     Apache Spark [download page](https://spark.apache.org/downloads.html).
   - To run the plugin against a snapshot version of Spark, you will need to build
     the distribution from source:

     ```shell script
     ## clone locally
     git clone https://github.com/apache/spark.git spark-src-latest
     cd spark-src-latest
     ## build a distribution with hive support
     ## generate a single tgz file $MY_SPARK_BUILD.tgz
     ./dev/make-distribution.sh --name $MY_SPARK_BUILD --tgz -Pkubernetes -Phive
     ```

   For more details about the configurations and arguments, visit [Apache Spark Docs::Building Spark](https://spark.apache.org/docs/latest/building-spark.html#building-a-runnable-distribution).

2. Extract the `.tgz` file to a suitable work directory `$SPARK_INSTALLS_DIR/$MY_SPARK_BUILD`.

3. Set the variables to appropriate values:

   ```shell script
   export SPARK_HOME=$SPARK_INSTALLS_DIR/$MY_SPARK_BUILD
   export PATH=${SPARK_HOME}/bin:$PATH
   ```
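
You can then verify that the Spark binaries on your `PATH` come from the distribution you just
installed:

```shell script
## Both commands should point at and report the version of $SPARK_HOME
which spark-submit
spark-submit --version
```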

### Building The Plugin

Next, visit [CONTRIBUTING::Building From Source](../CONTRIBUTING.md#building-from-source) to learn
about building the plugin for different versions of Spark.
Make sure that you compile the plugin against the same version of Spark that it is going to run with.
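
As a rough sketch only (the exact property names and profiles are documented in CONTRIBUTING.md; the
`-Dbuildver` value below is an illustrative assumption, not a guaranteed flag), a build pinned to a
particular Spark version might look like:

```shell script
## From the plugin repository root: package the plugin, skipping unit tests for speed.
## Replace the buildver value with the Spark version you plan to run against,
## as described in CONTRIBUTING.md.
mvn clean package -DskipTests -Dbuildver=321
```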

## Running

The python tests run with pytest and the script honors pytest parameters. Some helpful options are:
- `-r fExXs` Show extra test summary info as specified by chars: (f)ailed, (E)rror, (x)failed, (X)passed, (s)kipped
- For other options and more details please visit [pytest-usage](https://docs.pytest.org/en/stable/usage.html) or type `pytest --help`
Examples:
```shell script
## running all integration tests for Map
./integration_tests/run_pyspark_from_build.sh -k map_test.py
## Running a single integration test in map_test
./integration_tests/run_pyspark_from_build.sh -k test_map_integration_1
```
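
Because the script forwards pytest parameters, the flags above can be combined with these examples,
for instance:

```shell script
## Run the Map tests and print a summary of failed, errored, and skipped tests
./integration_tests/run_pyspark_from_build.sh -r fExXs -k map_test.py
```
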
### Spark execution mode
Spark Applications (pytest in this case) can be run against different cluster backends
