Use radix sort for sort phase and sprite sorting #4291

Open
superdump opened this issue Mar 22, 2022 · 23 comments
Labels
A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times

Comments

@superdump
Contributor

superdump commented Mar 22, 2022

Background

We currently use sort_by_key to sort PhaseItems in the sort phase, and sort_unstable_by to sort extracted sprites in queue_sprites.

  • sort_by_key - This sort is stable (i.e., does not reorder equal elements) and O(m * n * log(n)) worst-case, where the key function is O(m).
  • sort_unstable_by - This sort is unstable (i.e., may reorder equal elements), in-place (i.e., does not allocate), and O(n * log(n)) worst-case.
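To make the stable/unstable distinction concrete, here is a minimal sketch using only the standard library. The `Item` struct and its fields are hypothetical stand-ins for a phase item, not Bevy types, and `f32::total_cmp` is used with `sort_by` here only because a bare `f32` does not implement `Ord`.

```rust
/// Hypothetical stand-in for a phase item: a depth to sort by, plus an id
/// that records the original insertion order.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Item {
    depth: f32,
    id: u32,
}

/// Stable sort: items with equal depth keep their original relative order.
/// May allocate a temporary buffer.
fn sort_stable(items: &mut [Item]) {
    items.sort_by(|a, b| a.depth.total_cmp(&b.depth));
}

/// Unstable sort: in-place, but items with equal depth may end up in any
/// relative order.
fn sort_unstable(items: &mut [Item]) {
    items.sort_unstable_by(|a, b| a.depth.total_cmp(&b.depth));
}
```

After `sort_stable`, two items with the same `depth` are guaranteed to appear in the order they were pushed; after `sort_unstable`, only the `depth` ordering is guaranteed.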

For large numbers of items to sort, such as in bevymark or many_sprites where n > 100_000, and given sort keys that are a small and fixed number of bits such as the f32 that we are commonly using, a radix sort could/should be much faster.

crates

I investigated sorting 10M random f32s in the range 0.1f32..1000.0f32 (the default perspective camera near and far planes) and observed the following with the available radix sort crates:

1053.388ms      stable
391.499ms       unstable
60.359ms        rdst_single
16.067ms        rdst
79.550ms        voracious_sort
79.945ms        voracious_stable_sort
13.895ms        voracious_mt_sort
65.653ms        radsort
  • multi-threading
    • rdst and voracious_mt_sort results above are multi-threaded using rayon, all the rest of the results are single-threaded
    • rdst has a hard dependency on rayon, though it can be configured to do single-threaded sorting. I created an issue about making multi-threading optional as we already have a threadpool in bevy and maybe we don't want to add another: Make rayon optional nessex/rdst#2
    • voracious_radix_sort has a non-default multi-threading feature
    • radsort is single-threaded only
  • According to crates.io, the crates have the following footprints:
    • radsort - 17.1kB
    • rdst - 34.3kB
    • voracious_radix_sort - 178kB
  • Flexibility
    • radsort was the easiest to integrate using its sort_by_key function
    • rdst and voracious_radix_sort both require implementation of some traits on the type being sorted, and require that the type implement Copy. This cannot be done for batched sprites currently because BatchedPhaseItem contains a Range<u32>, and Range does not implement Copy

Proof of Concept

I made a branch here that uses radsort: https://github.com/superdump/bevy/tree/render-radix-sort .

The sorting of ExtractedSprites in queue_sprites is significantly improved by using radsort::sort_by_key like this:

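    // Sort by z first, then by a 128-bit value derived from the image handle,
    // so that sprites sharing an image end up adjacent.
    radsort::sort_by_key(extracted_sprites, |extracted_sprite| {
        (
            extracted_sprite.transform.translation.z,
            match extracted_sprite.image_handle_id {
                // Low 64 bits of the UUID in the high half, the id in the low half.
                HandleId::Id(uuid, id) => {
                    ((uuid.as_u128() & ((1 << 64) - 1)) << 64) | (id as u128)
                }
                // Source path id in the high half, label id in the low half.
                HandleId::AssetPathId(id) => {
                    ((id.source_path_id().value() as u128) << 64)
                        | (id.label_id().value() as u128)
                }
            },
        )
    });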

With that, on an M1 Max, the median execution time of queue_sprites in bevymark -- 20000 8 over 1500 frames increased from 9.82ms to 17.09ms, which makes no sense to me. In many_sprites it decreased from 11.34ms to 9.47ms.

The median execution time of sort_phase_system for the relevant phase (Transparent2d for sprites, Opaque3d for 3D meshes) decreased from 0.728ms to 0.094ms in many_cubes -- sphere. In bevymark -- 20000 8 it increased from 0.184ms to 1.95ms, which makes no sense. And in many_sprites it increased from 0.106ms to 1.19ms.

This is quite confusing. I haven't been able to figure out why the sort performance gets worse. radsort claims both best and worst execution time of O(n), space complexity of O(n), and that it is a stable sort so it does not reorder equal items. On the branch, all the sort_key implementations are inlined.

Next steps

  • Figure out why radsort is much slower in many cases
  • Try voracious_radix_sort or rdst to see if they perform consistently better, though rdst would likely not be approved unless its multi-threading were made optional.
    • Make multi-threading optional in rdst if it turns out to be a preferable solution due to the smaller crate footprint
@superdump superdump added A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times labels Mar 22, 2022
@superdump
Contributor Author

I added a BatchRange containing a start and end u32, deriving Copy and implementing a fn as_range(&self) -> Range<u32>.
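A minimal sketch of what such a type might look like. Only the name `BatchRange`, the two `u32` fields, and `as_range` come from the comment above; the field names and the extra derives are assumptions. The point is that `Range<T>` never implements `Copy`, so a plain struct with the same two fields is needed for crates that require `Copy`.

```rust
use std::ops::Range;

/// `Copy`-able replacement for `Range<u32>` (sketch; field names assumed).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct BatchRange {
    pub start: u32,
    pub end: u32,
}

impl BatchRange {
    /// Convert back to a standard half-open range when one is needed,
    /// for example to pass to a draw call.
    pub fn as_range(&self) -> Range<u32> {
        self.start..self.end
    }
}
```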

I tried rdst, which is multi-threaded by default. On bevymark -- 20000 8, queue_sprites increased from 9.82ms on main to 25.59ms, and the sort phase increased from 0.184ms on main to 2.13ms. On many_sprites, queue_sprites increased to 14.72ms (worse than radsort) and the sort phase increased to 1.29ms. The many_cubes -- sphere sort phase did decrease to 0.104ms, but rdst is still worse than radsort in every single case.

@bjorn3
Contributor

bjorn3 commented Mar 22, 2022

> radsort claims both best and worst execution time of O(n)

According to wikipedia radixsort is O(nw) where n is the amount of elements and w is the length of the element. It is provably impossible to have a sort with a worst case better than O(n log n).

@superdump
Contributor Author

> > radsort claims both best and worst execution time of O(n)
>
> According to wikipedia radixsort is O(nw) where n is the amount of elements and w is the length of the element. It is provably impossible to have a sort with a worst case better than O(n log n).

I think that is only true for comparison sorts. Radix sort is not a comparison sort.
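As an illustration of why the comparison-sort lower bound does not apply, here is a minimal sketch of a least-significant-digit radix sort over `u32` keys. It is a textbook version written for this example, not the implementation used by radsort, rdst, or voracious. No two keys are ever compared with each other; each of the 4 passes does O(n) work, which matches the O(n * w) figure with w fixed at 4 bytes.

```rust
/// LSD radix sort over `u32` keys, one byte per pass (sketch).
fn radix_sort_u32(keys: &mut Vec<u32>) {
    let mut scratch = vec![0u32; keys.len()];
    for pass in 0..4u32 {
        let shift = pass * 8;

        // Count how many keys fall into each of the 256 buckets for this byte.
        let mut counts = [0usize; 256];
        for &k in keys.iter() {
            counts[((k >> shift) & 0xFF) as usize] += 1;
        }

        // Exclusive prefix sum: the starting output offset of each bucket.
        let mut offsets = [0usize; 256];
        let mut total = 0usize;
        for (offset, &count) in offsets.iter_mut().zip(counts.iter()) {
            *offset = total;
            total += count;
        }

        // Scatter keys into their buckets in input order, which keeps each
        // pass stable.
        for &k in keys.iter() {
            let bucket = ((k >> shift) & 0xFF) as usize;
            scratch[offsets[bucket]] = k;
            offsets[bucket] += 1;
        }

        // The scratch buffer now holds the result of this pass.
        std::mem::swap(keys, &mut scratch);
    }
}
```

The cost is the extra O(n) scratch buffer and a fixed number of full passes over the data, which is why the constant factors matter so much for small inputs.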

@superdump
Contributor Author

superdump commented Mar 22, 2022

I tried voracious_radix_sort using voracious_sort() which is an unstable radix sort. It does have a stable variant so I could try that too. It also has a multi-threaded variant. But I thought I'd start here.

bevymark -- 20000 8 queue_sprites increased from 9.82ms on main to 10.02ms, and the sort phase decreased from 0.184ms on main to 0.124ms. many_sprites queue_sprites increased from 11.34ms to 14.6ms and sort phase decreased from 0.106ms to 0.078ms. many_cubes -- sphere sort phase decreased from 0.728ms to 0.118ms.

So it is at least consistently better than Vec::sort_by_key for the sort phase, by 0.06-0.61ms in these tests, with the main significant benefit for many_cubes -- sphere. It is not good for sorting in queue_sprites which needs to consider not only the transform z (32 bits f32) but also the handle id (either 2 x u64 or u64 + u128), and I even cheated here and packed the f32 and handle id into a u128, dropping some of the bits and hoping to avoid collisions. In these tests the handles are all the same but I wanted to do something close to representative. Although, the sort phase should use a stable sort and that may impact performance so I'll try that.

@superdump
Contributor Author

| sort method | many_sprites queue_sprites | many_sprites sort phase | bevymark queue_sprites | bevymark sort phase | many_cubes sort phase |
| --- | --- | --- | --- | --- | --- |
| main | 11.34ms | 0.106ms | 9.82ms | 0.184ms | 0.728ms |
| radsort | 9.47ms | 1.19ms | 17.09ms | 1.95ms | 0.094ms |
| rdst (multi-threaded) | 14.72ms | 1.29ms | 25.59ms | 2.13ms | 0.104ms |
| voracious unstable | 14.6ms | 0.078ms | 10.02ms | 0.124ms | 0.118ms |
| voracious stable | 14.74ms | 0.078ms | 9.99ms | 0.125ms | 0.117ms |
| voracious multi-threaded unstable | 14.76ms | 0.098ms | 10.04ms | 0.155ms | 0.237ms |

@superdump
Contributor Author

I realised that for the sort in queue_sprites, the sprites don't need to be ordered by the sprite image HandleId, just grouped. Perhaps there is a faster way to sort only by translation z and then group by image handle id.
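One possible reading of "sort by z, then group by handle" is sketched below using only the standard library. The `Sprite` struct, its `image: u64` field, and the `(image, count)` batch representation are all hypothetical simplifications, not Bevy types. Within each run of equal z, a small sort by image id is used purely as a cheap way to make equal images adjacent.

```rust
/// Hypothetical, simplified sprite: a depth and an opaque image id.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Sprite {
    z: f32,
    image: u64,
}

/// Orders sprites by z, groups equal images within each z level, and returns
/// `(image, sprite_count)` batches in draw order.
fn batch_by_z_then_image(sprites: &mut [Sprite]) -> Vec<(u64, usize)> {
    // 1. Order by z only (this is the part a radix sort could replace).
    sprites.sort_by(|a, b| a.z.total_cmp(&b.z));

    // 2. Within each run of equal z, make sprites with the same image adjacent.
    let mut start = 0;
    while start < sprites.len() {
        let z = sprites[start].z;
        let mut end = start + 1;
        while end < sprites.len() && sprites[end].z == z {
            end += 1;
        }
        sprites[start..end].sort_unstable_by_key(|s| s.image);
        start = end;
    }

    // 3. Start a new batch whenever the image changes.
    let mut batches: Vec<(u64, usize)> = Vec::new();
    for s in sprites.iter() {
        if let Some(last) = batches.last_mut() {
            if last.0 == s.image {
                last.1 += 1;
                continue;
            }
        }
        batches.push((s.image, 1));
    }
    batches
}
```

This only pays off when many sprites share a z value; if every sprite has a distinct z, step 2 does nothing useful and batches are determined by the z order alone.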

@superdump
Contributor Author

I tried to rework queue_sprites with the following ideas:

  • Sort the Vec<ExtractedSprite> only by translation z which should be faster for radix sort
  • When iterating the extracted sprites in translation z order, push each ExtractedSprite onto a Vec in a HashMap<SpriteBatch, Vec>, and then, when z changes, process the batches (which change based on whether the image handle id or colored state changed). This was based on an assumption that in practice there would be multiple sprites at each z level, which I have since realised is wrong for many types of 2D games.

My assumption that z are relatively few compared to sprites was wrong and that wasn't a well-considered optimisation attempt. I was mostly following the path of the radix sort hammer working well for small sort keys.

Ultimately, I dropped that path. I don't know of a good way to improve queue_sprites consistently using radix sort. The best way to improve it was rather to pull in rayon and use its par_sort_unstable_by. This made a big improvement to queue_sprites in many_sprites but no improvement to bevymark. sort phase is a bit hit and miss being a bit slower for the sprites and faster for many_cubes.

| sort method | many_sprites queue_sprites | many_sprites sort phase | bevymark queue_sprites | bevymark sort phase | many_cubes sort phase |
| --- | --- | --- | --- | --- | --- |
| main | 11.34ms | 0.106ms | 9.82ms | 0.184ms | 0.728ms |
| rayon par_sort_ | 7.2ms | 0.219ms | 9.83ms | 0.305ms | 0.476ms |

I guess bevymark doesn't have enough sprites for the sort to make any difference - I am running it with -- 20000 8 for a total of 160k sprites - where many_sprites has 409,600 sprites.

Summary

  • voracious_stable_sort is consistently as good or better for the sort phase and is single-threaded
  • rayon::par_sort_unstable_by is as good or potentially significantly better for queue_sprites

@mockersf
Member

would it be possible to adapt the sort based on counts or previous execution? do one frame with each, then do the next 500 frames with the best one?
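A rough sketch of how such an adaptive scheme could be structured: time each candidate for one frame, lock in the fastest for a fixed number of frames, then re-measure. Everything here (`Strategy`, `AdaptiveSort`, `pick`, `report`) is hypothetical naming for illustration; the caller is assumed to measure the sort itself and pass the elapsed time in.

```rust
use std::time::Duration;

/// Hypothetical candidate sort strategies.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Strategy {
    Comparison,
    Radix,
}

const STRATEGIES: [Strategy; 2] = [Strategy::Comparison, Strategy::Radix];

#[derive(Clone, Copy, Debug)]
enum Phase {
    /// Measuring candidate number `next` this frame.
    Trial { next: usize },
    /// Using the best candidate for `frames_left` more frames.
    Locked { frames_left: u32 },
}

struct AdaptiveSort {
    timings: [Duration; 2],
    best: Strategy,
    phase: Phase,
    /// How many frames to stay locked before re-measuring (assumed >= 1).
    period: u32,
}

impl AdaptiveSort {
    fn new(period: u32) -> Self {
        Self {
            timings: [Duration::ZERO; 2],
            best: Strategy::Comparison,
            phase: Phase::Trial { next: 0 },
            period,
        }
    }

    /// Which strategy to run this frame.
    fn pick(&self) -> Strategy {
        match self.phase {
            Phase::Trial { next } => STRATEGIES[next],
            Phase::Locked { .. } => self.best,
        }
    }

    /// Report how long this frame's sort took and advance the state machine.
    fn report(&mut self, elapsed: Duration) {
        self.phase = match self.phase {
            Phase::Trial { next } => {
                self.timings[next] = elapsed;
                if next + 1 < STRATEGIES.len() {
                    Phase::Trial { next: next + 1 }
                } else {
                    // All candidates measured: lock in the fastest one.
                    let fastest = (0..STRATEGIES.len())
                        .min_by_key(|&i| self.timings[i])
                        .unwrap_or(0);
                    self.best = STRATEGIES[fastest];
                    Phase::Locked { frames_left: self.period }
                }
            }
            Phase::Locked { frames_left } if frames_left > 1 => Phase::Locked {
                frames_left: frames_left - 1,
            },
            Phase::Locked { .. } => Phase::Trial { next: 0 },
        };
    }
}
```

A single-frame measurement is noisy, so a real version would likely average several trial frames per candidate before locking in a choice.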

@superdump
Contributor Author

> would it be possible to adapt the sort based on counts or previous execution? do one frame with each, then do the next 500 frames with the best one?

I suppose you could, but looking at the results, we could implement the use of voracious_stable_sort for the sort phase, and rayon::par_sort_unstable_by for queue_sprites and probably obtain the fastest for all cases.

However, I didn’t make a PR for this. The sort phase gains are small and scene-dependent relative to the cost of an additional dependency on voracious_radix_sort. For queue_sprites, distaste was expressed for including rayon due to competing threadpools, and in the supposedly more realistic case of bevymark there was no benefit.

Did I miss something in what you were asking?

@superdump
Contributor Author

Cart linked to #3460 (review) on Discord as relevant investigation of and discussion about how batching works in general and for sprites.

@nessex

nessex commented Mar 25, 2022

For an additional option, I've just released rdst = { version = "0.20.0", default-features = false }. If you disable the default feature multi-threading it can be used in entirely single-threaded mode without rayon and other large dependencies being pulled in.

This had a huge impact on bloat:
nessex/rdst#3 (comment)

@superdump
Contributor Author

Nice, thanks!

@nessex

nessex commented Mar 26, 2022

As discussed in discord, apart from the first few frames this is mostly just sorting already sorted Vecs. So it's really checking how fast you can detect that the Vec is already sorted. I just released rdst 0.20.1 which has a better check for this case, which should bring it a bit more in-line with the others.
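The already-sorted fast path can be sketched with the standard library as below. This is an illustration of the idea only, not the check that rdst 0.20.1 actually performs; the function names are made up for this example. `f32::total_cmp` is used so that the check and the sort agree on where NaN goes.

```rust
use std::cmp::Ordering;

/// Returns true if `keys` is already in non-decreasing total order.
/// Runs in a single O(n) pass and exits early at the first inversion.
fn is_sorted_by_total_order(keys: &[f32]) -> bool {
    keys.windows(2)
        .all(|pair| pair[0].total_cmp(&pair[1]) != Ordering::Greater)
}

/// Sorts only when needed. Returns true if a sort was actually performed.
fn sort_if_needed(keys: &mut [f32]) -> bool {
    if is_sorted_by_total_order(keys) {
        return false;
    }
    keys.sort_unstable_by(|a, b| a.total_cmp(b));
    true
}
```

On frames where nothing moved, the cost is one linear scan; on frames where the data did change, the scan usually bails out at the first out-of-order pair.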

In terms of bevy making use of radix sort, I think it would be good to work out what a good representative scene is. Trying to think up a worst-case scenario, some sort of sprite based particle system would probably be a pathological case for the sort phase.

Bevymark and many_sprites both don't stress this system very much as after the initial spawning, they don't alter the sort order of the relevant items.

@superdump
Contributor Author

One test would be to have a combination of sprites, plain 2d meshes, and 2d meshes with custom materials, at various z depths. This would result in items queued (pushed to the phase vec) from separate systems that then need sorting.

I did like with rdst’s design that it operates byte-wise over the sort key and I think it somehow ignores bytes that have no impact on the order because they’re all the same?

@nessex

nessex commented Mar 28, 2022

> One test would be to have a combination of sprites, plain 2d meshes, and 2d meshes with custom materials, at various z depths. This would result in items queued (pushed to the phase vec) from separate systems that then need sorting.

That would be ideal, it would also tell us the impact on batching etc. I realised that bevymark can actually test changing data at least, by simply using more waves! So maybe a better test would be to set waves to like 2000, so the new data never stops.

> I did like with rdst’s design that it operates byte-wise over the sort key and I think it somehow ignores bytes that have no impact on the order because they’re all the same?

I've tried to keep it entirely agnostic to the underlying types by sticking religiously to sorting byte-by-byte. On the downside, rdst can't take shortcuts by truly comparison-sorting small arrays in the same way as voracious etc. because I don't require orderable types in rdst. I have a kludgy byte-by-byte comparison sort as a diversion for small arrays... But it will never be as fast as a single comparison of f32s like in voracious without requiring Ord or whatever. This only applies if you have sub-1k or so items to sort, where a comparison sort can be faster for simple types.

And yes, both rdst and voracious will generally skip sorting a "level" (nth byte) if all bytes in that level are the same. With this trick, you can't skip the O(n) counting phase of course, but you do skip moving things around which is much slower!
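The level-skipping trick described above can be sketched as follows. This is a simplified illustration over `u32` keys with hypothetical function names, not code from rdst or voracious. The counting pass still touches every key, but a level where one bucket holds all the keys needs no scatter pass.

```rust
/// Histogram of the `level`-th byte (0 = least significant) across all keys.
/// This O(n) counting pass cannot be skipped.
fn count_level(keys: &[u32], level: u32) -> [usize; 256] {
    let mut counts = [0usize; 256];
    for &k in keys {
        counts[((k >> (level * 8)) & 0xFF) as usize] += 1;
    }
    counts
}

/// A level is uniform when a single bucket contains every key, meaning that
/// byte cannot affect the order and the scatter pass can be skipped.
fn level_is_uniform(counts: &[usize; 256], len: usize) -> bool {
    len == 0 || counts.iter().any(|&c| c == len)
}

/// Returns the byte levels that actually need a scatter pass.
fn levels_to_sort(keys: &[u32]) -> Vec<u32> {
    (0..4)
        .filter(|&level| !level_is_uniform(&count_level(keys, level), keys.len()))
        .collect()
}
```

For keys that all fit in 16 bits, for example, the two high levels are uniform (all zero), so only two of the four scatter passes run.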

If you want to skip counting too, you can just specify that there are fewer bytes to sort. This will usually require newtyping as the default sort key implementations naturally assume you're using the whole type :) I can't see this being useful for floats, but for i/usize data where only a few of the bytes can change, it's a huge speed boost.

@superdump
Contributor Author

@nessex I don't know how the bit position radix sorting algorithm works, but would it be better to use u32 or u64 as the agnostic sort key data type given that CPU registers are 32/64 bits? Do you leverage SIMD already for packing the u8s into registers and operating on them? I imagine you could still do so for 32-/64-bit shuffles and masks and such too?

@nessex

nessex commented Mar 28, 2022

rdst and voracious don't explicitly pack things into SIMD registers as it depends on the get_level / byte extraction functions' implementation for a given type. But both are structured to automatically vectorize fairly well if your byte extraction impl supports it.

For sort_key, floats are fine I think. The operations are quite primitive and vectorize well:
https://github.com/Nessex/rdst/blob/d4377cea6dd9eb1fa49091c0592f9a58feac255d/src/radix_key_impl.rs#L167

Actually, I've just pushed 0.20.2 which makes the float type ordering conform to the same deterministic ordering (including NaN) as the upcoming total_cmp nightly function. I'm not sure exactly what the appropriate ordering for NaN's needs to be for bevy, but they do seem to exist in the data... So this could be a consideration if deciding between f32 and anything else. You could newtype / use FloatOrd if you need something a bit different.
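For reference, a common bit trick for turning an `f32` into a `u32` whose unsigned integer order matches `f32::total_cmp` is sketched below. It is shown as an illustration of how a byte-wise radix sort can handle floats, including NaN and signed zero; it is not claimed to be the exact code in rdst. The function name is made up for this example.

```rust
/// Maps an `f32` to a `u32` such that comparing the results as unsigned
/// integers gives the same order as `f32::total_cmp` (sketch).
fn f32_to_ordered_u32(value: f32) -> u32 {
    let bits = value.to_bits();
    // Negative floats (sign bit set): flip every bit, so that larger
    // magnitudes sort earlier. Non-negative floats: flip only the sign bit,
    // so that they sort after all negatives.
    let mask = ((bits as i32 >> 31) as u32) | 0x8000_0000;
    bits ^ mask
}
```

With this mapping, negative NaNs sort first, then negative infinity through -0.0, then +0.0 through positive infinity, and positive NaNs sort last.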

@james7132
Member

james7132 commented Jun 21, 2022

After #5049, we're in a position to slot any of these radix sorts, and change which algorithm we use depending on sort mode. We can handle the stable sorts separately from the unstable ones. For 3D use cases, where unstable sorts can be used for every phase, we can easily use radsort or voracious and get a 5-7x speedup, and decide on an appropriate approach for the stable/batched 2D cases.

@superdump
Contributor Author

> After #5049, we're in a position to slot any of these radix sorts, and change which algorithm we use depending on sort mode. We can handle the stable sorts separately from the unstable ones. For 3D use cases, where unstable sorts can be used for every phase, we can easily use radsort or voracious and get a 5-7x speedup, and decide on an appropriate approach for the stable/batched 2D cases.

Probably worth retesting with latest versions of the three radix sort crates I tried.

bors bot pushed a commit that referenced this issue Jun 23, 2022
# Objective

Partially addresses #4291.

Speed up the sort phase for unbatched render phases.

## Solution
Split out one of the optimizations in #4899 and allow implementors of `PhaseItem` to change what kind of sort is used when sorting the items in the phase. This currently includes Stable, Unstable, and Unsorted. Each of these corresponds to `Vec::sort_by_key`, `Vec::sort_unstable_by_key`, and no sorting at all. The default is `Unstable`. The last one can be used as a default if users introduce a preliminary depth prepass.

## Performance
This will not impact the performance of any batched phases, as it is still using a stable sort. 2D's only phase is unchanged. All 3D phases are unbatched currently, and will benefit from this change.

On `many_cubes`, where the primary phase is opaque, this change sees a speed up from 907.02us -> 477.62us, a 47.35% reduction.

![image](https://user-images.githubusercontent.com/3137680/174471253-22424874-30d5-4db5-b5b4-65fb2c612a9c.png)

## Future Work
There were prior discussions to add support for faster radix sorts in #4291, which in theory should be `O(n)` instead of `O(nlog(n))` time. [`voracious`](https://crates.io/crates/voracious_radix_sort) has been proposed, but it seems to be optimized for use cases with more than 30,000 items, which may be atypical for most systems.

Another optimization included in #4899 is to reduce the size of a few of the IDs commonly used in `PhaseItem` implementations to shrink the types to make swapping/sorting faster. Both `CachedPipelineId` and `DrawFunctionId` could be reduced to `u32` instead of `usize`.

Ideally, this should automatically change to use stable sorts when `BatchedPhaseItem` is implemented on the same phase item type, but this requires specialization, which may not land in stable Rust for a short while.

---

## Changelog
Added: `PhaseItem::sort`

## Migration Guide
RenderPhases now default to an unstable sort (via `slice::sort_unstable_by_key`). This can typically improve sort phase performance, but may produce incorrect batching results when implementing `BatchedPhaseItem`. To revert to the older stable sort, manually implement `PhaseItem::sort` to implement a stable sort (i.e. via `slice::sort_by_key`).
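
The override pattern described in the migration guide can be sketched roughly as follows. `SortablePhaseItem`, `Opaque`, and `Batched2d` are hypothetical, simplified stand-ins written for this example; the real `PhaseItem` trait has more members and its exact signatures are not reproduced here.

```rust
/// Hypothetical, simplified stand-in for a phase item trait.
trait SortablePhaseItem: Sized {
    type SortKey: Ord;

    fn sort_key(&self) -> Self::SortKey;

    /// Default: unstable sort, which is fine for unbatched phases.
    fn sort(items: &mut [Self]) {
        items.sort_unstable_by_key(|item| item.sort_key());
    }
}

/// An unbatched item that keeps the default unstable sort.
#[derive(Debug, PartialEq)]
struct Opaque {
    distance_key: u32,
    entity: u32,
}

impl SortablePhaseItem for Opaque {
    type SortKey = u32;

    fn sort_key(&self) -> u32 {
        self.distance_key
    }
}

/// A batched item that opts back into a stable sort so that items with equal
/// keys keep their queue order.
#[derive(Debug, PartialEq)]
struct Batched2d {
    layer: u32,
    entity: u32,
}

impl SortablePhaseItem for Batched2d {
    type SortKey = u32;

    fn sort_key(&self) -> u32 {
        self.layer
    }

    fn sort(items: &mut [Self]) {
        items.sort_by_key(|item| item.sort_key());
    }
}
```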

Co-authored-by: Federico Rinaldi <gisquerin@gmail.com>
Co-authored-by: Robert Swain <robert.swain@gmail.com>
Co-authored-by: colepoirier <colepoirier@gmail.com>
inodentry pushed a commit to IyesGames/bevy that referenced this issue Aug 8, 2022
james7132 added a commit to james7132/bevy that referenced this issue Oct 28, 2022
ItsDoot pushed a commit to ItsDoot/bevy that referenced this issue Feb 1, 2023
@rparrett
Contributor

rparrett commented Mar 17, 2023

Looking at queue_sprites again with Bevy 0.10.0 and key functions from Rob's old branch. Motivated again by #8100.

noisy M1 Mac, Chrome
These are just frame times plucked from LogDiagnosticsPlugin, no tracing.

bevy-vs-pixi is a very similar benchmark to bevymark, but its z values are distributed differently.
It spawns pairs of sprites with z = rng.gen::<f32>() and z + f32::EPSILON.

bevymark was run with a hardcoded BirdScheduled { per_wave: 1000, wave: 100 }

bevy-vs-pixi was run with 32k rects (64k sprites)

| sort method | bevymark native | bevymark wasm | bevy-vs-pixi native | bevy-vs-pixi wasm |
| --- | --- | --- | --- | --- |
| rdst 0.20.10 | 21.22ms | 50.70ms | 13.87ms | 83.97ms |
| rdst no-default | 20.28ms | 49.65ms | 13.36ms | 81.71ms |
| sort_unstable_by | 18.47ms | 32.03ms | 10.36ms | 113.27ms |
| radsort 0.1.0 | 17.83ms | 60.04ms | 8.83ms | 76.06ms |
| radsort key cached in ExtractedSprite | 23.63ms | 62.84ms | 12.18ms | 75.72ms |
| radsort quadruple key | 17.87ms | 62.28ms | 9.20ms | 80.08ms |
| glidesort 0.1.2 | 17.63ms | 31.75ms | 14.73ms | 128.84ms |

Really fascinating. Not sure what's going on with rdst multi-threading. I did at least verify that on those runs, rayon was either in or out of the dep tree. Also re-ran the sort_unstable_by / wasm just to be sure.

Didn't include voracious because it doesn't seem possible to write a comparable key function.

@rparrett
Contributor

I threw glidesort in there for fun.

@superdump
Contributor Author

@rparrett I was revisiting this to see if it could be closed now. What I take from the above table is that glidesort is faster on ordered arrays and radsort is faster on random ones. In which case, I think sticking with radsort is the way to go. Did I understand correctly?

@rparrett
Contributor

rparrett commented Nov 5, 2023

Sounds about right / matches what I was told about radsort / sounds good to me. There may be reason to rebenchmark later. The author (of radsort) is apparently cooking up a new release.
