DataFusion and Ballista Benchmarks

This crate contains benchmarks based on popular public data sets and open source benchmark suites, making it easy to run real-world workloads for performance and scalability testing and for comparing results with other Arrow implementations as well as other query engines.

Benchmark derived from TPC-H

These benchmarks are derived from the TPC-H benchmark.

Generating Test Data

TPC-H data can be generated using the tpch-gen.sh script, which creates a Docker image containing the TPC-H data generator.

./tpch-gen.sh

Data will be generated into the data subdirectory and will not be checked in because this directory has been added to the .gitignore file.
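A quick way to confirm that generation succeeded (assuming the default data output directory) is to list the generated .tbl files, one per TPC-H table:

```shell
# Each TPC-H table (lineitem, orders, customer, etc.) should have a .tbl file
ls -lh ./data/*.tbl
```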

Running the DataFusion Benchmarks in Python

Build the Python bindings and then run:

$ python tpch.py --query q1 --path /mnt/bigdata/tpch/sf1-parquet/
Registering table part at path /mnt/bigdata/tpch/sf1-parquet//part
Registering table supplier at path /mnt/bigdata/tpch/sf1-parquet//supplier
Registering table partsupp at path /mnt/bigdata/tpch/sf1-parquet//partsupp
Registering table customer at path /mnt/bigdata/tpch/sf1-parquet//customer
Registering table orders at path /mnt/bigdata/tpch/sf1-parquet//orders
Registering table lineitem at path /mnt/bigdata/tpch/sf1-parquet//lineitem
Registering table nation at path /mnt/bigdata/tpch/sf1-parquet//nation
Registering table region at path /mnt/bigdata/tpch/sf1-parquet//region
Query q1 took 9.668351173400879 second(s)

Note that this Python script currently only supports running against file formats that contain a schema definition (such as Parquet).
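To collect timings for several queries in one run, the script can be invoked in a loop; the query names and path below mirror the example above and are illustrative:

```shell
# Run a subset of TPC-H queries through the Python bindings, one at a time
for q in q1 q3 q6; do
  python tpch.py --query "$q" --path /mnt/bigdata/tpch/sf1-parquet/
done
```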

Running the DataFusion Benchmarks in Rust

The benchmark can then be run (assuming the data created from dbgen is in ./data) with a command such as:

cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 1 --batch-size 4096

You can enable the simd feature (to use SIMD instructions) and/or the mimalloc or snmalloc feature (to use an alternative memory allocator) by passing them with --features:

cargo run --release --features "simd mimalloc" --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 1 --batch-size 4096

The benchmark program also supports CSV and Parquet input file formats and a utility is provided to convert from tbl (generated by the dbgen utility) to CSV and Parquet.

cargo run --release --bin tpch -- convert --input ./data --output /mnt/tpch-parquet --format parquet
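The same converter can emit CSV instead of Parquet by changing the --format flag (the output path here is illustrative):

```shell
# Convert the dbgen-produced tbl files to CSV rather than Parquet
cargo run --release --bin tpch -- convert --input ./data --output /mnt/tpch-csv --format csv
```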

This utility does not yet provide support for changing the number of partitions when performing the conversion. Another option is to use the following Docker image to perform the conversion from tbl files to CSV or Parquet.

docker run -it ballistacompute/spark-benchmarks:0.4.0-SNAPSHOT
  -h, --help   Show help message

Subcommand: convert-tpch
  -i, --input  <arg>
      --input-format  <arg>
  -o, --output  <arg>
      --output-format  <arg>
  -p, --partitions  <arg>
  -h, --help                   Show help message

Note that it is necessary to mount volumes into the Docker container as appropriate so that the file conversion process can access files on the host system.

Here is a full example that assumes that data is stored in the /mnt path on the host system.

docker run -v /mnt:/mnt -it ballistacompute/spark-benchmarks:0.4.0-SNAPSHOT \
  convert-tpch \
  --input /mnt/tpch/csv \
  --input-format tbl \
  --output /mnt/tpch/parquet \
  --output-format parquet \
  --partitions 64

Running the Ballista Benchmarks

To run the benchmarks it is necessary to have at least one Ballista scheduler and one Ballista executor running.

To run the scheduler from source:

cd $ARROW_HOME/ballista/scheduler
RUST_LOG=info cargo run --release

By default the scheduler will bind to 0.0.0.0 and listen on port 50050.

To run the executor from source:

cd $ARROW_HOME/ballista/executor
RUST_LOG=info cargo run --release

By default the executor will bind to 0.0.0.0 and listen on port 50051.

You can add SIMD/snmalloc/LTO flags to improve speed (with longer build times):

RUST_LOG=info RUSTFLAGS='-C target-cpu=native -C lto -C codegen-units=1 -C embed-bitcode' cargo run --release --bin executor --features "simd snmalloc" --target x86_64-unknown-linux-gnu

To run the benchmarks:

cd $ARROW_HOME/benchmarks
cargo run --release --bin tpch -- benchmark ballista --host localhost --port 50050 --query 1 --path $(pwd)/data --format tbl

Running the Ballista Benchmarks on docker-compose

To start a Rust scheduler and executor using Docker Compose:

cargo build --release
docker-compose up --build

Then you can run the benchmark with:

docker-compose run ballista-client bash -c '/root/tpch benchmark ballista --host ballista-scheduler --port 50050 --query 1 --path /data --format tbl'

Expected output

Query 1 should produce the following output when executed against the SF=1 dataset.

+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
| l_returnflag | l_linestatus | sum_qty  | sum_base_price     | sum_disc_price     | sum_charge         | avg_qty            | avg_price          | avg_disc             | count_order |
+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
| A            | F            | 37734107 | 56586554400.73001  | 53758257134.870026 | 55909065222.82768  | 25.522005853257337 | 38273.12973462168  | 0.049985295838396455 | 1478493     |
| N            | F            | 991417   | 1487504710.3799996 | 1413082168.0541    | 1469649223.1943746 | 25.516471920522985 | 38284.467760848296 | 0.05009342667421622  | 38854       |
| N            | O            | 74476023 | 111701708529.50996 | 106118209986.10472 | 110367023144.56622 | 25.502229680934594 | 38249.1238377803   | 0.049996589476752576 | 2920373     |
| R            | F            | 37719753 | 56568041380.90001  | 53741292684.60399  | 55889619119.83194  | 25.50579361269077  | 38250.854626099666 | 0.05000940583012587  | 1478870     |
+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
Query 1 iteration 0 took 1956.1 ms
Query 1 avg time: 1956.11 ms

Comparing Performance with Apache Spark

We run benchmarks to compare performance with Spark to identify future areas of optimization. We publish the latest results in the top-level README.

Ballista

Build with the release-lto profile.

cargo build --profile release-lto

Run the cluster.

./target/release-lto/ballista-scheduler
./target/release-lto/ballista-executor -c 24

Run the benchmark.

./target/release-lto/tpch benchmark ballista \
    --host localhost \
    --port 50050 \
    --path /mnt/bigdata/tpch/sf10-parquet-float/ \
    --format parquet \
    --iterations 1 \
    --partitions 24 \
    --query 1
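To benchmark several queries back to back, the single-query command above can be wrapped in a loop (the query numbers below are an arbitrary subset):

```shell
# Run a subset of TPC-H queries against the Ballista cluster,
# reusing the same flags as the single-query example
for q in 1 3 5 6 10; do
  ./target/release-lto/tpch benchmark ballista \
      --host localhost --port 50050 \
      --path /mnt/bigdata/tpch/sf10-parquet-float/ \
      --format parquet --iterations 1 --partitions 24 \
      --query "$q"
done
```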

Spark

Start the cluster.

./sbin/start-master.sh
./sbin/start-worker.sh spark://ripper:7077

Run the benchmark.

$SPARK_HOME/bin/spark-submit \
    --master spark://ripper:7077 \
    --class org.apache.arrow.ballista.SparkTpch \
    --conf spark.driver.memory=8G \
    --num-executors=1 \
    --conf spark.executor.memory=32G \
    --conf spark.executor.cores=24 \
    --conf spark.cores.max=24 \
    target/spark-tpch-0.5.0-SNAPSHOT-jar-with-dependencies.jar \
    tpch \
    --input-path /mnt/bigdata/tpch/sf10-parquet-float/ \
    --input-format parquet \
    --query-path /home/andy/git/apache/arrow-ballista/benchmarks/queries \
    --query 1

NYC Taxi Benchmark

These benchmarks are based on the New York Taxi and Limousine Commission data set.

cargo run --release --bin nyctaxi -- --iterations 3 --path /mnt/nyctaxi/csv --format csv --batch-size 4096

Example output:

Running benchmarks with the following options: Opt { debug: false, iterations: 3, batch_size: 4096, path: "/mnt/nyctaxi/csv", file_format: "csv" }
Executing 'fare_amt_by_passenger'
Query 'fare_amt_by_passenger' iteration 0 took 7138 ms
Query 'fare_amt_by_passenger' iteration 1 took 7599 ms
Query 'fare_amt_by_passenger' iteration 2 took 7969 ms

Running the Ballista Loadtest

cargo run --bin tpch -- loadtest ballista-load \
  --query-list 1,3,5,6,7,10,12,13 \
  --requests 200 \
  --concurrency 10 \
  --data-path /**** \
  --format parquet \
  --host localhost \
  --port 50050 \
  --sql-path /*** \
  --debug