Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Add bytes_per_second to all libcudf benchmarks #13735

Open
GregoryKimball opened this issue Jul 23, 2023 · 7 comments
Open

[FEA] Add bytes_per_second to all libcudf benchmarks #13735

GregoryKimball opened this issue Jul 23, 2023 · 7 comments
Assignees
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request good first issue Good for newcomers libcudf Affects libcudf (C++/CUDA) code.
Milestone

Comments

@GregoryKimball
Copy link
Contributor

GregoryKimball commented Jul 23, 2023

Many libcudf benchmarks report the bytes_per_second processed as part of the output data. This value is useful for comparing benchmarks because it normalizes the increasing execution time from processing more data. Also, bytes_per_second and real_time_s together let us compute bytes_processed values which serve as a useful Y-axis.

As of the end of 23.12 development, many benchmarks still do not report bytes_per_second in the output data. Here is a figure summarizing the metrics reported by the benchmark suite.

image

benchmark status notes
APPLY_BOOLEAN_MASK see #13937
BINARYOP see #13938
COPY_IF_ELSE see #13960
GROUPBY see #13984
GROUPBY_NV
HASHING see #13967, #13965
JOIN
JOIN_NV
QUANTILES
REDUCTION see #16126
REPLACE
SEARCH
SEARCH_NV
SET_OPS_NV
SHIFT see #13950
SORT
SORT_NV
STREAM_COMPACTION_NV see #14172
TRANSPOSE see #14170

Note: For this tracking list, cuIO benchmarks are omitted because "encoded file size" serves a similar purpose.

@GregoryKimball GregoryKimball added feature request New feature or request 0 - Backlog In queue waiting for assignment good first issue Good for newcomers libcudf Affects libcudf (C++/CUDA) code. labels Jul 23, 2023
@GregoryKimball GregoryKimball added this to the Benchmarking milestone Jul 23, 2023
@Blonck
Copy link
Contributor

Blonck commented Aug 22, 2023

Hi, I want to start contributing to cudf and would like to work on this issue.

Should I create one PR per benchmark or put all in one PR?

@davidwendt
Copy link
Contributor

Should I create one PR per benchmark or put all in one PR?

Thanks for looking into this. Smaller PRs are easier to review so I would prefer one PR per benchmark.

Blonck pushed a commit to Blonck/cudf that referenced this issue Aug 23, 2023
Due to missing parentheses in APPLY_BOOLEAN_MASK benchmark the number
of written bytes were not multiplied with the number of iterations
of this benchmark.

This patches relates to rapidsai#13735.
Blonck pushed a commit to Blonck/cudf that referenced this issue Aug 23, 2023
Due to missing parentheses in APPLY_BOOLEAN_MASK benchmark the number
of written bytes were not multiplied by the number of iterations
of this benchmark.

This patch relates to rapidsai#13735.
Blonck pushed a commit to Blonck/cudf that referenced this issue Aug 23, 2023
To add `bytes_per_second` a call to `SetBytesProcessed()` with number
of written and read bytes is added to the benchmark.

This patch relates to rapidsai#13735.
Blonck pushed a commit to Blonck/cudf that referenced this issue Aug 23, 2023
To add `bytes_per_second`, a call to `SetBytesProcessed()` with the
number of written and read bytes is added to the benchmark.

This patch relates to rapidsai#13735.
rapids-bot bot pushed a commit that referenced this issue Aug 23, 2023
…13937)

Due to missing parentheses in APPLY_BOOLEAN_MASK benchmark, the number of written bytes were not multiplied by the number of iterations of this benchmark.

This patch relates to #13735.

Authors:
  - Martin Marenz (https://github.com/Blonck)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - David Wendt (https://github.com/davidwendt)

URL: #13937
Blonck pushed a commit to Blonck/cudf that referenced this issue Aug 24, 2023
Blonck pushed a commit to Blonck/cudf that referenced this issue Aug 24, 2023
To add `bytes_per_second`, a call to `SetBytesProcessed()` with the
number of written and read bytes is added to the benchmark.

This patch relates to rapidsai#13735.
rapids-bot bot pushed a commit that referenced this issue Aug 24, 2023
To add `bytes_per_second`, a call to `.SetBytesProcessed()` with the number of written and read bytes is added to the benchmark.

This patch relates to #13735.

Authors:
  - Martin Marenz (https://github.com/Blonck)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #13938
Blonck pushed a commit to Blonck/cudf that referenced this issue Aug 25, 2023
Adds `bytes_per_second to `COPY_IF_ELSE_BENCH`.

This patch relates to rapidsai#13735.
Blonck pushed a commit to Blonck/cudf that referenced this issue Aug 25, 2023
Adds `bytes_per_second` to the `PARTITION_BENCH` benchmark.

This patch relates to rapidsai#13735.
Blonck pushed a commit to Blonck/cudf that referenced this issue Aug 25, 2023
In the past, the HASING_NVBENCH benchmark treated the nulls parameter as
a boolean. Any value other than 0.0 resulted in a null probability of
1.0.

Now, the nulls parameter directly determines the null probability. For
instance, a value of 0.1 will generate 10% of the data as null.
Moreover, setting nulls to 0.0 produces data without a null bitmask.

Additionally, `bytes_per_second` are added to the benchmark.

This patch relates to rapidsai#13735.
Blonck pushed a commit to Blonck/cudf that referenced this issue Aug 25, 2023
In the past, the HASING_NVBENCH benchmark treated the nulls parameter as
a boolean. Any value other than 0.0 resulted in a null probability of
100% for the generated data.

Now, the nulls parameter directly determines the null probability. For
instance, a value of 0.1 will generate 10% of the data as null.
Moreover, setting nulls to 0.0 produces data without a null bitmask.

Additionally, `bytes_per_second` are added to the benchmark.

This patch relates to rapidsai#13735.
Blonck pushed a commit to Blonck/cudf that referenced this issue Aug 25, 2023
In the past, the `HASING_NVBENCH` benchmark treated the `nulls` parameter as
a boolean. Any value other than 0.0 resulted in a null probability of
100% for the generated data.

Now, the `nulls` parameter directly determines the null probability. For
instance, a value of 0.1 will generate 10% of the data as null.
Moreover, setting nulls to 0.0 produces data without a null bitmask.

Additionally, `bytes_per_second` are added to the benchmark.

This patch relates to rapidsai#13735.
Blonck pushed a commit to Blonck/cudf that referenced this issue Aug 28, 2023
To calculate the number of bytes written and read by the benchmark a few
helper function are introduced which calculate the payload size of a
column, a table, and the results of a groupby.

This patch relates to rapidsai#13735.
Blonck pushed a commit to Blonck/cudf that referenced this issue Aug 28, 2023
To calculate the number of bytes written and read by the benchmark a few
helper function are introduced which calculate the payload size of a
column, a table, and the results of a groupby.

This patch relates to rapidsai#13735.
Blonck pushed a commit to Blonck/cudf that referenced this issue Aug 28, 2023
Introduce helper functions to compute payload size for:
  - Column
  - Table
  - Groupby execution results

This patch relates to rapidsai#13735.
Blonck pushed a commit to Blonck/cudf that referenced this issue Aug 28, 2023
This patch adds memory statistics for the GROUPBY_NVBENCH benchmark
using the `max` aggregation.

For this purpose elper functions are introduced to compute the payload
size for:
  - Column
  - Table
  - Groupby execution results

This patch relates to rapidsai#13735.
Blonck pushed a commit to Blonck/cudf that referenced this issue Aug 28, 2023
This patch adds memory statistics for the GROUPBY_NVBENCH benchmark
using the `max` aggregation.

For this purpose helper functions are introduced to compute the payload
size for:
  - Column
  - Table
  - Groupby execution results

This patch relates to rapidsai#13735.
rapids-bot bot pushed a commit that referenced this issue Aug 30, 2023
Adds `bytes_per_second` to the `COPY_IF_ELSE_BENCH` benchmark.

This patch relates to #13735.

Authors:
  - Martin Marenz (https://github.com/Blonck)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - David Wendt (https://github.com/davidwendt)

URL: #13960
rapids-bot bot pushed a commit that referenced this issue Aug 30, 2023
Adds `bytes_per_second` to the `PARTITION_BENCH` benchmark.

This patch relates to #13735.

Authors:
  - Martin Marenz (https://github.com/Blonck)
  - Mark Harris (https://github.com/harrism)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Mark Harris (https://github.com/harrism)

URL: #13965
Blonck pushed a commit to Blonck/cudf that referenced this issue Aug 30, 2023
This patch adds memory statistics for the GROUPBY_NVBENCH benchmarks.

For this purpose helper functions are introduced to compute the payload
size for:
  - Column
  - Table
  - Groupby execution results

This patch relates to rapidsai#13735.
rapids-bot bot pushed a commit that referenced this issue Aug 30, 2023
In the past, the HASING_NVBENCH benchmark treated the `nulls` parameter as a boolean. Any value other than 0.0 resulted in a null probability of 1.0.

Now, the `nulls` parameter directly determines the null probability. For instance, a value of 0.1 will generate 10% of the data as null. Moreover, setting nulls to 0.0 produces data without a null bitmask.

Additionally, `bytes_per_second` are added to the benchmark.

This patch relates to #13735.

Authors:
  - Martin Marenz (https://github.com/Blonck)
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - David Wendt (https://github.com/davidwendt)

URL: #13967
Blonck pushed a commit to Blonck/cudf that referenced this issue Aug 31, 2023
This patch adds memory statistics for the GROUPBY_NVBENCH benchmarks.

For this purpose helper functions are introduced to compute the payload
size for:
  - Column
  - Table
  - Groupby execution results

This patch relates to rapidsai#13735.
Blonck pushed a commit to Blonck/cudf that referenced this issue Aug 31, 2023
This patch adds memory statistics for the GROUPBY_NVBENCH benchmarks.

For this purpose helper functions are introduced to compute the payload
size for:
  - Column
  - Table
  - Groupby execution results

This patch relates to rapidsai#13735.
Blonck added a commit to Blonck/cudf that referenced this issue Sep 22, 2023
Blonck added a commit to Blonck/cudf that referenced this issue Sep 22, 2023
Blonck added a commit to Blonck/cudf that referenced this issue Sep 22, 2023
Blonck added a commit to Blonck/cudf that referenced this issue Sep 22, 2023
Blonck added a commit to Blonck/cudf that referenced this issue Sep 22, 2023
Blonck added a commit to Blonck/cudf that referenced this issue Sep 26, 2023
Blonck added a commit to Blonck/cudf that referenced this issue Sep 26, 2023
Blonck pushed a commit to Blonck/cudf that referenced this issue Sep 26, 2023
rapids-bot bot pushed a commit that referenced this issue Sep 28, 2023
rapids-bot bot pushed a commit that referenced this issue Oct 10, 2023
rapids-bot bot pushed a commit that referenced this issue Oct 10, 2023
Adds `bytes_per_second` to `SHIFT_BENCH`. 

This patch relates to #13735.

Authors:
  - Martin Marenz (https://github.com/Blonck)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Nghia Truong (https://github.com/ttnghia)
  - Mark Harris (https://github.com/harrism)

URL: #13950
@jihoonson
Copy link
Contributor

It is a good idea to have some metrics for data size. I wonder whether it would be even better if we have separate metrics for input and output data. For example, we could add custom metrics like input_items_per_second, output_items_per_second, and input_bytes_per_second and output_bytes_per_second. The first two are the number of items (usually rows), and the last two are the byte sizes of input and output.

rapids-bot bot pushed a commit that referenced this issue Jul 2, 2024
…ks (#16126)

This PR addresses #13735 for reduction benchmarks. There are 3 new utils added.

- `int64_t estimate_size(cudf::table_view)` returns a size estimate for the given table. #13984 was a previous attempt to add a similar utility, but this implementation uses `cudf::row_bit_count()` as suggested in #13984 (comment) instead of manually estimating the size.
- `void set_items_processed(State& state, int64_t items_processed_per_iteration)` is a thin wrapper of `State.SetItemsProcessed()`. This wrapper takes `items_processed_per_iteration` as a parameter instead of `total_items_processed`. This could be useful to avoid repeating `State.iterations() * items_processed_per_iteration` in each benchmark class.
- `void set_throughputs(nvbench::state& state)` is added as a workaround for NVIDIA/nvbench#175. We sometimes want to set throughput statistics after `state.exec()` calls especially when it is hard to estimate the result size upfront.

Here are snippets of reduction benchmarks after this change.

```
$ cpp/build/benchmarks/REDUCTION_BENCH
...
-----------------------------------------------------------------------------------------------------------------
Benchmark                                                       Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------
Reduction/bool_all/10000/manual_time                        10257 ns        26845 ns        68185 bytes_per_second=929.907M/s items_per_second=975.078M/s
Reduction/bool_all/100000/manual_time                       11000 ns        27454 ns        63634 bytes_per_second=8.46642G/s items_per_second=9.09075G/s
Reduction/bool_all/1000000/manual_time                      12671 ns        28658 ns        55261 bytes_per_second=73.5018G/s items_per_second=78.922G/s
...

$ cpp/build/benchmarks/REDUCTION_NVBENCH
...
## rank_scan

### [0] NVIDIA RTX A5500

|        T        | null_probability | data_size | Samples |  CPU Time  | Noise  |  GPU Time  | Noise |  Elem/s  | GlobalMem BW |  BWUtil   |
|-----------------|------------------|-----------|---------|------------|--------|------------|-------|----------|--------------|-----------|
|             I32 |                0 |     10000 |  16992x |  33.544 us | 14.95% |  29.446 us | 5.58% |  82.321M |   5.596 TB/s |   728.54% |
|             I32 |              0.1 |     10000 |  16512x |  34.358 us | 13.66% |  30.292 us | 2.87% |  80.020M |   5.286 TB/s |   688.17% |
|             I32 |              0.5 |     10000 |  16736x |  34.058 us | 14.31% |  29.890 us | 3.40% |  81.097M |   5.430 TB/s |   706.89% |
...
```

Note that, when the data type is a 1-byte-width type in the google benchmark result summary, `bytes_per_second` appears to be smaller than `items_per_second`. This is because the former is a multiple of 1000 whereas the latter is a multiple of 1024. They are in fact the same number.

Implementation-wise, these are what I'm not sure if I made a best decision.
- Each of new utils above is declared and defined in different files. I did this because I could not find a good place to have them all, and they seem to belong to different utilities. Please let me know if there is a better place for them.
- All the new utils are defined in the global namespace since other util functions seem to have been defined in the same way. Please let me know if this is not the convention.

Authors:
  - Jihoon Son (https://github.com/jihoonson)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - David Wendt (https://github.com/davidwendt)

URL: #16126
@GregoryKimball
Copy link
Contributor Author

Thank you @jihoonson for your suggestion. I agree that we need better and more granular throughput metrics. I would like to start with an input+output bytes per second metric that is consistently implemented in all benchmarks. From there we can add the separate input/output and bytes/items metrics as you suggest.

@jihoonson
Copy link
Contributor

@GregoryKimball sounds good 🙂 As my first PR has been merged, I will continue working on the remaining benchmarks.

@jihoonson
Copy link
Contributor

I would like to start with an input+output bytes per second metric that is consistently implemented in all benchmarks.

I started thinking that perhaps we should enforce all benchmarks to add such metrics. One way to do it is having new benchmark fixtures, which would explode at the end of the benchmark if those metrics are not set. Then we could update every benchmark to use the new fixtures. @GregoryKimball do you think this is a good idea?

@GregoryKimball
Copy link
Contributor Author

GregoryKimball commented Aug 1, 2024

we should enforce all benchmarks to add such metrics

Thank you @jihoonson for this suggestion. We might go this route at some point, but for now I think converting the rest of the benchmarks from gbench to nvbench and adding the missing metrics is higher priority.

@kingcrimsontianyu kingcrimsontianyu self-assigned this Sep 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request good first issue Good for newcomers libcudf Affects libcudf (C++/CUDA) code.
Projects
Status: In Progress
Status: No status
Development

No branches or pull requests

5 participants