Name	Name	Last commit message	Last commit date
parent directory ..
build	build
project	project
scripts	scripts
src/main/scala	src/main/scala
README.md	README.md
build.sbt	build.sbt
run-benchmark.py	run-benchmark.py

Benchmarks

Overview

This is a basic framework for writing benchmarks to measure Delta's performance. It is currently designed to run benchmark on Spark running in an EMR cluster. However, it can be easily extended for other Spark-based benchmarks. To get started, first download/clone this repository in your local machine. Then you have to set up an EMR cluster and run the benchmark scripts in this directory. See the next section for more details.

Running TPC-DS benchmark

This TPC-DS benchmark is constructed such that you have to run the following two steps.

Load data: You have to create the TPC-DS database with all the Delta tables. To do that, the raw TPC-DS data has been provided as Apache Parquet files. In this step you will have to use your EMR cluster to read the parquet files and rewrite them as Delta tables.
Query data: Then, using the tables definitions in the Hive Metatore, you can run the 99 benchmark queries.

The next section will provide the detailed steps of how to setup the necessary Hive Metastore and EMR cluster, how to test the setup with small-scale data, and then finally run the full scale benchmark.

Prerequisites

An AWS account with necessary permissions to do the following:
- Manage RDS instances for creating an external Hive Metastore
- Manage EMR clusters for running the benchmark
- Read and write to an S3 bucket from the EMR cluster
A S3 bucket which will be used to generate the TPC-DS data.
A machine which has access to the AWS setup and where this repository has been downloaded or cloned.

Create external Hive Metastore using Amazon RDS

Create an external Hive Metastore in a MySQL database using Amazon RDS with the following specifications:

MySQL 8.x on a db.m5.large.
General purpose SSDs, and no Autoscaling storage.
Non-empty password for admin
Same region, VPC, subnet as those you will run the EMR cluster. See AWS docs for more guidance.
- Note: Region us-west-2 since that is what this benchmark has been most tested with.

After the database is ready, note the JDBC connection details, the username and password. We will need them for the next step. Note that this step needs to be done just once. All EMR clusters can connect and reused this Hive Metastsore.

Create EMR cluster

Create an EMR cluster that connects to the external Hive Metastore. Here are the specifications of the EMR cluster required for running benchmarks.

EMR version 6.5.0 having Apache Spark 3.1
Master - i3.2xlarge
Workers - 16 x i3.2xlarge (or just 1 worker if you are just testing by running the 1GB benchmark).
Hive-site configuration to connect to the Hive Metastore. See Using an external MySQL database or Amazon Aurora for more details.
Same region, VPC, subnet as those of the Hive Metastore.
- Note: Region us-west-2 since that is what this benchmark has been most tested with.
No autoscaling, and default EBS storage.

Once the EMR cluster is ready, note the following:

Hostname of the EMR cluster master node.
PEM file for SSH into the master node. These will be needed to run the workloads in this framework.

Prepare S3 bucket

Create a new S3 bucket (or use an existing one) which is in the same region your EMR cluster.

Test the cluster setup

Navigate to your local copy of this repository and this benchmark directory. Then run the following steps.

Run simple test workload

Verify that you have the following information

<HOST_NAME>: EMR cluster master node host name
<PEM_FILE>: Local path to your PEM file for SSH into the master node.
<BENCHMARK_PATH>: S3 path where tables will be created. Make sure your AWS credentials have read/write permission to that path.

Then run a simple table write-read test: Run the following in your shell.

```sh
./run-benchmark.py --cluster-hostname <HOSTNAME> -i <PEM_FILE> --benchmark-path <BENCHMARK_PATH> --benchmark test 
```

If this works correctly, then you should see an output that look like this.

```text
>>> Benchmark script generated and uploaded

...
There is a screen on:
12001..ip-172-31-21-247	(Detached)

Files for this benchmark:
20220126-191336-test-benchmarks.jar
20220126-191336-test-cmd.sh
20220126-191336-test-out.txt
>>> Benchmark script started in a screen. Stdout piped into 20220126-191336-test-out.txt.Final report will be generated on completion in 20220126-191336-test-report.json.
```

The test workload launched in a `screen` is going to run the following: 
- Spark jobs to run a simple SQL query
- Create a Delta table in the given location 
- Read it back

To see whether they worked correctly, SSH into the node and check the output of 20220126-191336-test-out.txt. Once the workload terminates, the last few lines should be something like the following:
```text
RESULT:
{
  "benchmarkSpecs" : {
    "benchmarkPath" : ...,
    "benchmarkId" : "20220126-191336-test"
  },
  "queryResults" : [ {
    "name" : "sql-test",
    "durationMs" : 11075
  }, {
    "name" : "db-list-test",
    "durationMs" : 208
  }, {
    "name" : "db-create-test",
    "durationMs" : 4070
  }, {
    "name" : "db-use-test",
    "durationMs" : 41
  }, {
    "name" : "table-drop-test",
    "durationMs" : 74
  }, {
    "name" : "table-create-test",
    "durationMs" : 33812
  }, {
    "name" : "table-query-test",
    "durationMs" : 4795
  } ]
}
FILE UPLOAD: Uploaded /home/hadoop/20220126-191336-test-report.json to s3:// ...
SUCCESS
```

The above metrics are also written to a json file and uploaded to the given path. Please verify that both the table and report are generated in that path.

Run 1GB TPC-DS

Now that you are familiar with how the framework runs the workload, you can try running the small scale TPC-DS benchmark.

Load data as Delta tables: ./run-benchmark.py --cluster-hostname <HOSTNAME> -i <PEM_FILE> --benchmark-path <BENCHMARK_PATH> --benchmark tpcds-1gb-delta-load
Run queries on Delta tables: ./run-benchmark.py --cluster-hostname <HOSTNAME> -i <PEM_FILE> --benchmark-path <BENCHMARK_PATH> --benchmark tpcds-1gb-delta

Run 3TB TPC-DS

Finally, you are all set up to run the full scale benchmark. Similar to the 1GB benchmark, run the following

Load data as Delta tables: ./run-benchmark.py --cluster-hostname <HOSTNAME> -i <PEM_FILE> --benchmark-path <BENCHMARK_PATH> --benchmark tpcds-3tb-delta-load
Run queries on Delta tables: ./run-benchmark.py --cluster-hostname <HOSTNAME> -i <PEM_FILE> --benchmark-path <BENCHMARK_PATH> --benchmark tpcds-3tb-delta

Compare the results using the generated JSON files.

Internals of the framework

Structure of this framework's code

build.sbt, project/, src/ form the SBT project which contains the Scala code that define the benchmark workload.
- Benchmark.scala is the basic interface, and TestBenchmark.scala is a sample implementation.
run-benchmark.py contains the specification of the benchmarks defined by name (e.g. tpcds-3tb-delta). Each benchmark specification is defined by the following:
- Fully qualified name of the main Scala class to be started.
- Command line argument for the main function.
- Additional Maven artifact to load (example io.delta:delta-core_2.12:1.0.0).
- Spark configurations to use.
scripts has the core python scripts that are called by run-benchmark.py

The script run-benchmark.py does the following:

Compile the Scala code into a uber jar.
Upload it to the given hostname.
Using ssh to the hostname, it will launch a screen and start the main class with spark-submit.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

benchmarks

benchmarks

README.md

Benchmarks

Overview

Running TPC-DS benchmark

Prerequisites

Create external Hive Metastore using Amazon RDS

Create EMR cluster

Prepare S3 bucket

Test the cluster setup

Run simple test workload

Run 1GB TPC-DS

Run 3TB TPC-DS

Internals of the framework

Files

benchmarks

Directory actions

More options

Directory actions

More options

Latest commit

History

benchmarks

Folders and files

parent directory

README.md

Benchmarks

Overview

Running TPC-DS benchmark

Prerequisites

Create external Hive Metastore using Amazon RDS

Create EMR cluster

Prepare S3 bucket

Test the cluster setup

Run simple test workload

Run 1GB TPC-DS

Run 3TB TPC-DS

Internals of the framework