Use radix sort for sort phase and sprite sorting #4291

superdump · 2022-03-22T11:44:18Z

Background

We currently use sort_by_key to sort PhaseItems in the sort phase, and sort_unstable_by to sort extracted sprites in queue_sprites.

sort_by_key - This sort is stable (i.e., does not reorder equal elements) and O(m * n * log(n)) worst-case, where the key function is O(m).
sort_unstable_by - This sort is unstable (i.e., may reorder equal elements), in-place (i.e., does not allocate), and O(n * log(n)) worst-case.
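
The difference between the two standard-library sorts can be seen with a small sketch. The `Item` type and its fields are hypothetical stand-ins for a phase item and are not Bevy types; `f32::total_cmp` is used only to get a total order on floats.

```rust
/// A hypothetical stand-in for a phase item: a depth plus an insertion index.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Item {
    z: f32,
    index: usize,
}

/// Stable sort: items with equal `z` keep their original relative order.
fn sort_stable(items: &mut [Item]) {
    items.sort_by(|a, b| a.z.total_cmp(&b.z));
}

/// Unstable sort: in-place, but items with equal `z` may be reordered.
fn sort_unstable(items: &mut [Item]) {
    items.sort_unstable_by(|a, b| a.z.total_cmp(&b.z));
}

/// Build a small input with repeated `z` values.
fn make_items() -> Vec<Item> {
    [2.0, 1.0, 2.0, 1.0, 2.0]
        .iter()
        .enumerate()
        .map(|(index, &z)| Item { z, index })
        .collect()
}
```

After `sort_stable`, the indices of items with equal `z` are still ascending. After `sort_unstable`, only the order of the `z` values is guaranteed.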

For large numbers of items to sort, such as in bevymark or many_sprites where n > 100_000, and given sort keys that are a small and fixed number of bits such as the f32 that we are commonly using, a radix sort could/should be much faster.
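
Radix sort needs keys that order correctly as fixed-width unsigned integers. A minimal sketch of one common way to get such a key from an f32 follows; the function name `f32_sort_key` is hypothetical, and this is not necessarily how any of the crates below derive their keys.

```rust
/// Map an `f32` to a `u32` whose unsigned order matches the float's total order.
fn f32_sort_key(x: f32) -> u32 {
    let bits = x.to_bits();
    if bits & 0x8000_0000 != 0 {
        // Negative: flip all bits so that more-negative values sort first.
        !bits
    } else {
        // Non-negative: set the sign bit so these sort after all negatives.
        bits | 0x8000_0000
    }
}
```

Sorting by this key gives the same order as sorting with `f32::total_cmp`, and the key is always exactly 32 bits, so a byte-wise radix sort needs at most four passes.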

crates

I investigated sorting 10M random f32s in the range 0.1f32..1000.0f32 (the default perspective camera near and far planes) and observed the following with the available radix sort crates:

1053.388ms      stable
391.499ms       unstable
60.359ms        rdst_single
16.067ms        rdst
79.550ms        voracious_sort
79.945ms        voracious_stable_sort
13.895ms        voracious_mt_sort
65.653ms        radsort

multi-threading
- rdst and voracious_mt_sort results above are multi-threaded using rayon; all the other results are single-threaded
- rdst has a hard dependency on rayon, though it can be configured to do single-threaded sorting. I created an issue about making multi-threading optional, as we already have a threadpool in bevy and maybe we don't want to add another: Make rayon optional nessex/rdst#2
- voracious_radix_sort has a non-default multi-threading feature
- radsort is single-threaded only

According to crates.io, the crates have the following footprints:
- radsort - 17.1kB
- rdst - 34.3kB
- voracious_radix_sort - 178kB

Flexibility
- radsort was the easiest to integrate using its sort_by_key function
- rdst and voracious_radix_sort both require implementation of some traits on the type being sorted, and require that the type implement Copy. This cannot be done for batched sprites currently because BatchedPhaseItem contains a Range<u32>, and Range does not implement Copy

Proof of Concept

I made a branch here that uses radsort: https://github.com/superdump/bevy/tree/render-radix-sort .

The sorting of ExtractedSprites in queue_sprites is significantly improved by using radsort::sort_by_key like this:

            radsort::sort_by_key(extracted_sprites, |extracted_sprite| {
                (
                    extracted_sprite.transform.translation.z,
                    match extracted_sprite.image_handle_id {
                        HandleId::Id(uuid, id) => {
                            ((uuid.as_u128() & ((1 << 64) - 1)) << 64) | (id as u128)
                        }
                        HandleId::AssetPathId(id) => {
                            ((id.source_path_id().value() as u128) << 64)
                                | (id.label_id().value() as u128)
                        }
                    },
                )
            });

With that, on an M1 Max, the median execution time of queue_sprites in bevymark -- 20000 8 over 1500 frames increased from 9.82ms to 17.09ms, which makes no sense to me. In many_sprites it decreased from 11.34ms to 9.47ms.

The median execution time of sort_phase_system for the relevant phase (Transparent2d for sprites, Opaque3d for 3D meshes) decreased from 0.728ms to 0.094ms in many_cubes -- sphere. In bevymark -- 20000 8 it increased from 0.184ms to 1.95ms, which makes no sense. And in many_sprites it increased from 0.106ms to 1.19ms.

This is quite confusing. I haven't been able to figure out why the sort performance gets worse. radsort claims both best and worst execution time of O(n), space complexity of O(n), and that it is a stable sort so it does not reorder equal items. On the branch, all the sort_key implementations are inlined.

Next steps

- Figure out why radsort is much slower in many cases
- Try voracious_radix_sort or rdst to see if they perform consistently better, though rdst would likely not be approved unless its multi-threading were made optional.
  - Make multi-threading optional in rdst if it turns out to be a preferable solution due to the smaller crate footprint


superdump · 2022-03-22T13:06:25Z

I added a BatchRange containing a start and end u32, deriving Copy and implementing a fn as_range(&self) -> Range<u32>.
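
A minimal sketch of what such a type could look like, based only on the description above; the field names and derives are assumptions, not the code from the branch.

```rust
use std::ops::Range;

/// A `Copy` replacement for `Range<u32>`, which is not `Copy` itself.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BatchRange {
    start: u32,
    end: u32,
}

impl BatchRange {
    /// Convert back to a standard half-open range when one is needed.
    fn as_range(&self) -> Range<u32> {
        self.start..self.end
    }
}
```

Because the struct only holds two u32 values, deriving Copy is allowed, which satisfies the Copy bound that rdst and voracious_radix_sort place on the sorted type.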

I tried rdst, which is multi-threaded by default and on bevymark -- 20000 8 queue_sprites increased from 9.82ms on main to 25.59ms, and the sort phase increased from 0.184ms on main to 2.13ms. many_sprites queue_sprites increased to 14.72ms (worse than radsort) and sort phase increased to 1.29ms. many_cubes -- sphere sort phase did decrease to 0.104ms, but that's still worse than radsort for every single case.

bjorn3 · 2022-03-22T13:38:25Z

radsort claims both best and worst execution time of O(n)

According to wikipedia radixsort is O(nw) where n is the amount of elements and w is the length of the element. It is provably impossible to have a sort with a worst case better than O(n log n).

superdump · 2022-03-22T13:46:21Z

radsort claims both best and worst execution time of O(n)

According to wikipedia radixsort is O(nw) where n is the amount of elements and w is the length of the element. It is provably impossible to have a sort with a worst case better than O(n log n).

I think that is only true for comparison sorts. Radix sort is not a comparison sort.

superdump · 2022-03-22T13:57:31Z

I tried voracious_radix_sort using voracious_sort() which is an unstable radix sort. It does have a stable variant so I could try that too. It also has a multi-threaded variant. But I thought I'd start here.

bevymark -- 20000 8 queue_sprites increased from 9.82ms on main to 10.02ms, and the sort phase decreased from 0.184ms on main to 0.124ms. many_sprites queue_sprites increased from 11.34ms to 14.6ms and sort phase decreased from 0.106ms to 0.078ms. many_cubes -- sphere sort phase decreased from 0.728ms to 0.118ms.

So it is at least consistently better than Vec::sort_by_key for the sort phase, by 0.06-0.61ms in these tests, with the main significant benefit for many_cubes -- sphere. It is not good for sorting in queue_sprites which needs to consider not only the transform z (32 bits f32) but also the handle id (either 2 x u64 or u64 + u128), and I even cheated here and packed the f32 and handle id into a u128, dropping some of the bits and hoping to avoid collisions. In these tests the handles are all the same but I wanted to do something close to representative. Although, the sort phase should use a stable sort and that may impact performance so I'll try that.
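
The kind of packing described above might look like the following sketch. It is not the code that was benchmarked: the function name, the split of the handle id into two hypothetical u64 halves, and the choice of which bits to drop are all assumptions made for illustration.

```rust
/// Pack an order-preserving depth key and a truncated handle id into one `u128`.
/// `handle_hi` and `handle_lo` are hypothetical 64-bit halves of a handle id.
fn packed_sort_key(z: f32, handle_hi: u64, handle_lo: u64) -> u128 {
    // Order-preserving mapping from f32 bits to u32 (same order as total_cmp).
    let bits = z.to_bits();
    let z_key = if bits & 0x8000_0000 != 0 {
        !bits
    } else {
        bits | 0x8000_0000
    };
    // Only 96 bits remain for the handle: keep the low 32 bits of `handle_hi`
    // and all of `handle_lo`. Distinct handles can therefore collide.
    let handle = (((handle_hi & 0xFFFF_FFFF) as u128) << 64) | handle_lo as u128;
    // Depth occupies the most significant bits so it dominates the ordering.
    ((z_key as u128) << 96) | handle
}
```

The depth always decides the order first, and the handle bits only break ties. The cost is that two handles differing only in the dropped bits produce the same key.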

superdump · 2022-03-22T14:47:14Z

sort method	many_sprites queue_sprites	many_sprites sort phase	bevymark queue_sprites	bevymark sort phase	many_cubes sort phase
main	11.34ms	0.106ms	9.82ms	0.184ms	0.728ms
radsort	9.47ms	1.19ms	17.09ms	1.95ms	0.094ms
rdst (multi-threaded)	14.72ms	1.29ms	25.59ms	2.13ms	0.104ms
voracious unstable	14.6ms	0.078ms	10.02ms	0.124ms	0.118ms
voracious stable	14.74ms	0.078ms	9.99ms	0.125ms	0.117ms
voracious multi-threaded unstable	14.76ms	0.098ms	10.04ms	0.155ms	0.237ms

superdump · 2022-03-22T14:51:08Z

I realised that for the sort in queue_sprites, the sprites don't need to be ordered by the sprite image HandleId, just grouped. Perhaps there is a faster way to sort only by translation z and then group by image handle id.
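
A rough sketch of that idea, assuming a simplified `Sprite` type with a u64 in place of the real image HandleId (both are hypothetical): sort by z only, then make sprites with the same image adjacent within each run of equal z, without ordering the images themselves.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Sprite {
    z: f32,
    image: u64, // hypothetical stand-in for the image `HandleId`
}

/// Sort by `z` only, then make sprites that share an image adjacent within
/// each run of equal `z`, in first-seen order rather than handle order.
fn sort_and_group(sprites: &mut Vec<Sprite>) {
    sprites.sort_by(|a, b| a.z.total_cmp(&b.z));
    let mut out = Vec::with_capacity(sprites.len());
    let mut start = 0;
    while start < sprites.len() {
        // Find the end of the run of sprites with bit-identical `z`.
        let z_bits = sprites[start].z.to_bits();
        let mut end = start;
        while end < sprites.len() && sprites[end].z.to_bits() == z_bits {
            end += 1;
        }
        // Bucket the run by image handle, preserving first-seen order.
        let mut buckets: Vec<(u64, Vec<Sprite>)> = Vec::new();
        for s in &sprites[start..end] {
            if let Some(i) = buckets.iter().position(|(image, _)| *image == s.image) {
                buckets[i].1.push(*s);
            } else {
                buckets.push((s.image, vec![*s]));
            }
        }
        out.extend(buckets.into_iter().flat_map(|(_, group)| group));
        start = end;
    }
    *sprites = out;
}
```

The linear bucket lookup is only reasonable when few distinct images share a z level; a hash map would be the obvious alternative for larger runs.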

superdump · 2022-03-23T12:04:19Z

I tried to rework queue_sprites with the following ideas:

- Sort the Vec<ExtractedSprite> only by translation z, which should be faster for radix sort
- Based on an assumption that in practice there would be multiple sprites at a z level (which I have since realised is wrong in many types of 2D games): when iterating the extracted sprites in translation z order, push each ExtractedSprite onto a Vec in a HashMap<SpriteBatch, Vec>, and then when z changes, process the batches (which change based on whether the image handle id or colored state changed).

My assumption that z are relatively few compared to sprites was wrong and that wasn't a well-considered optimisation attempt. I was mostly following the path of the radix sort hammer working well for small sort keys.

Ultimately, I dropped that path. I don't know of a good way to improve queue_sprites consistently using radix sort. The best way to improve it was rather to pull in rayon and use its par_sort_unstable_by. This made a big improvement to queue_sprites in many_sprites but no improvement to bevymark. sort phase is a bit hit and miss being a bit slower for the sprites and faster for many_cubes.

sort method	many_sprites queue_sprites	many_sprites sort phase	bevymark queue_sprites	bevymark sort phase	many_cubes sort phase
main	11.34ms	0.106ms	9.82ms	0.184ms	0.728ms
rayon par_sort_	7.2ms	0.219ms	9.83ms	0.305ms	0.476ms

I guess bevymark doesn't have enough sprites for the sort to make any difference - I am running it with -- 20000 8 for a total of 160k sprites - where many_sprites has 409,600 sprites.

Summary

voracious_stable_sort is consistently as good or better for the sort phase and is single-threaded
rayon::par_sort_unstable_by is as good or potentially significantly better for queue_sprites

mockersf · 2022-03-23T15:20:46Z

would it be possible to adapt the sort based on counts or previous execution? do one frame with each, then do the next 500 frames with the best one?
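
One way this suggestion could be sketched, assuming plain function pointers as the sort strategies and wall-clock timing with `std::time::Instant`. The `AdaptiveSorter` type and all its fields are hypothetical; nothing like it exists in Bevy.

```rust
use std::time::{Duration, Instant};

type SortFn = fn(&mut [f32]);

/// Tries each strategy for one frame, then reuses the fastest for `reuse_frames`.
struct AdaptiveSorter {
    strategies: Vec<SortFn>,
    timings: Vec<Duration>,
    reuse_frames: usize,
    frames_left: usize,
    best: usize,
}

impl AdaptiveSorter {
    fn new(strategies: Vec<SortFn>, reuse_frames: usize) -> Self {
        assert!(!strategies.is_empty());
        Self { strategies, timings: Vec::new(), reuse_frames, frames_left: 0, best: 0 }
    }

    /// Sort one frame's worth of data and return the index of the strategy used.
    fn sort(&mut self, data: &mut [f32]) -> usize {
        if self.frames_left > 0 {
            // Exploit phase: keep using the winner of the last trial round.
            self.frames_left -= 1;
            (self.strategies[self.best])(data);
            return self.best;
        }
        // Trial phase: time the next untested strategy on this frame.
        let index = self.timings.len();
        let start = Instant::now();
        (self.strategies[index])(data);
        self.timings.push(start.elapsed());
        if self.timings.len() == self.strategies.len() {
            // All strategies measured: pick the fastest, then reset the trial.
            self.best = (0..self.timings.len())
                .min_by_key(|&i| self.timings[i])
                .unwrap();
            self.timings.clear();
            self.frames_left = self.reuse_frames;
        }
        index
    }
}

fn stable(data: &mut [f32]) {
    data.sort_by(|a, b| a.total_cmp(b));
}

fn unstable(data: &mut [f32]) {
    data.sort_unstable_by(|a, b| a.total_cmp(b));
}
```

With two strategies and `reuse_frames` set to 500, frames 0 and 1 are trial frames and the next 500 frames use whichever was faster. A single timed frame is a noisy measurement, so a real version would probably need to average over several.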

superdump · 2022-03-23T22:17:06Z

would it be possible to adapt the sort based on counts or previous execution? do one frame with each, then do the next 500 frames with the best one?

I suppose you could, but looking at the results, we could implement the use of voracious_stable_sort for the sort phase, and rayon::par_sort_unstable_by for queue_sprites and probably obtain the fastest for all cases.

However, I didn’t make a PR for this because the sort phase gains are small and scene-dependent for the additional dependency on voracious_radix_sort, and distaste was expressed for including rayon due to competing threadpools and in the supposedly more realistic case of bevymark, there was no benefit.

Did I miss something in what you were asking?

superdump · 2022-03-24T12:04:11Z

Cart linked to #3460 (review) on Discord as relevant investigation of and discussion about how batching works in general and for sprites.

nessex · 2022-03-25T13:28:24Z

For an additional option, I've just released rdst = { version = "0.20.0", default-features = false }. If you disable the default feature multi-threading it can be used in entirely single-threaded mode without rayon and other large dependencies being pulled in.

This had a huge impact on bloat:
nessex/rdst#3 (comment)

superdump · 2022-03-26T13:36:19Z

Nice, thanks!

nessex · 2022-03-26T14:26:24Z

As discussed in discord, apart from the first few frames this is mostly just sorting already sorted Vecs. So it's really checking how fast you can detect that the Vec is already sorted. I just released rdst 0.20.1 which has a better check for this case, which should bring it a bit more in-line with the others.
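
The already-sorted case can be illustrated with a minimal sketch. This is not rdst's actual check, only the general idea of a linear scan before doing any sorting work; the function names are hypothetical.

```rust
/// Return `true` if `keys` is already in non-decreasing order.
fn is_sorted(keys: &[u32]) -> bool {
    keys.windows(2).all(|pair| pair[0] <= pair[1])
}

/// Sort only when needed; returns whether any sorting work was done.
fn sort_if_needed(keys: &mut [u32]) -> bool {
    if is_sorted(keys) {
        return false;
    }
    keys.sort_unstable();
    true
}
```

The scan is O(n) and stops at the first out-of-order pair, so it costs little on unsorted input and avoids the whole sort on frames where nothing moved.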

In terms of bevy making use of radix sort, I think it would be good to work out what a good representative scene is. Trying to think up a worst-case scenario, some sort of sprite based particle system would probably be a pathological case for the sort phase.

Bevymark and many_sprites both don't stress this system very much as after the initial spawning, they don't alter the sort order of the relevant items.

superdump · 2022-03-26T18:49:21Z

One test would be to have a combination of sprites, plain 2d meshes, and 2d meshes with custom materials, at various z depths. This would result in items queued (pushed to the phase vec) from separate systems that then need sorting.

I did like with rdst’s design that it operates byte-wise over the sort key and I think it somehow ignores bytes that have no impact on the order because they’re all the same?

nessex · 2022-03-28T05:35:18Z

One test would be to have a combination of sprites, plain 2d meshes, and 2d meshes with custom materials, at various z depths. This would result in items queued (pushed to the phase vec) from separate systems that then need sorting.

That would be ideal, it would also tell us the impact on batching etc. I realised that bevymark can actually test changing data at least, by simply using more waves! So maybe a better test would be to set waves to like 2000, so the new data never stops.

I did like with rdst’s design that it operates byte-wise over the sort key and I think it somehow ignores bytes that have no impact on the order because they’re all the same?

I've tried to keep it entirely agnostic to the underlying types by sticking religiously to sorting byte-by-byte. On the downside, rdst can't take shortcuts by truly comparison-sorting small arrays in the same way as voracious etc. because I don't require orderable types in rdst. I have a kludgy byte-by-byte comparison sort as a diversion for small arrays... But it will never be as fast as a single comparison of f32's like in voracious without requiring Ord or whatever. This only applies if you have sub-1k or so items to sort, where a comparison sort can be faster for simple types.

And yes, both rdst and voracious will generally skip sorting a "level" (nth byte) if all bytes in that level are the same. With this trick, you can't skip the O(n) counting phase of course, but you do skip moving things around which is much slower!

If you want to skip counting too, you can just specify that there are fewer bytes to sort. This will usually require newtyping as the default sort key implementations naturally assume you're using the whole type :) I can't see this being useful for floats, but for i/usize data where only a few of the bytes can change, it's a huge speed boost.
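
Both tricks can be sketched with a small LSD radix sort over u32 keys. This is a simplified illustration, not the implementation in rdst or voracious; the function name and the `levels` parameter are assumptions.

```rust
/// LSD radix sort over the lowest `levels` bytes of each key.
/// Returns how many levels were skipped because every key shared that byte.
fn radix_sort_u32(keys: &mut Vec<u32>, levels: usize) -> usize {
    assert!(levels <= 4);
    let mut skipped = 0;
    let mut scratch = vec![0u32; keys.len()];
    for level in 0..levels {
        let shift = 8 * level as u32;
        // Counting pass: always O(n), even if the level ends up skipped.
        let mut counts = [0usize; 256];
        for &k in keys.iter() {
            counts[((k >> shift) & 0xFF) as usize] += 1;
        }
        // If one bucket holds every key, this byte cannot change the order.
        if counts.iter().any(|&c| c == keys.len()) {
            skipped += 1;
            continue;
        }
        // Prefix sums turn counts into starting offsets per bucket.
        let mut offsets = [0usize; 256];
        for b in 1..256 {
            offsets[b] = offsets[b - 1] + counts[b - 1];
        }
        // Stable scatter into the scratch buffer, then swap buffers.
        for &k in keys.iter() {
            let b = ((k >> shift) & 0xFF) as usize;
            scratch[offsets[b]] = k;
            offsets[b] += 1;
        }
        std::mem::swap(keys, &mut scratch);
    }
    skipped
}
```

Passing `levels = 4` counts every byte and skips the scatter for uniform ones. Passing a smaller `levels` avoids even the counting pass for the upper bytes, which is only correct if the caller knows those bytes are identical across all keys.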

superdump · 2022-03-28T06:35:57Z

@nessex I don't know how the bit position radix sorting algorithm works, but would it be better to use u32 or u64 as the agnostic sort key data type given that CPU registers are 32/64 bits? Do you leverage SIMD already for packing the u8s into registers and operating on them? I imagine you could still do so for 32-/64-bit shuffles and masks and such too?

nessex · 2022-03-28T07:01:26Z

rdst and voracious don't explicitly pack things into SIMD registers as it depends on the get_level / byte extraction functions' implementation for a given type. But both are structured to automatically vectorize fairly well if your byte extraction impl supports it.

For sort_key, floats are fine I think. The operations are quite primitive and vectorize well:
https://github.com/Nessex/rdst/blob/d4377cea6dd9eb1fa49091c0592f9a58feac255d/src/radix_key_impl.rs#L167

Actually, I've just pushed 0.20.2 which makes the float type ordering conform to the same deterministic ordering (including NaN) as the upcoming total_cmp nightly function. I'm not sure exactly what the appropriate ordering for NaN's needs to be for bevy, but they do seem to exist in the data... So this could be a consideration if deciding between f32 and anything else. You could newtype / use FloatOrd if you need something a bit different.
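
For reference, a small sketch of what that total ordering does with NaN and signed zeros, using `f32::total_cmp` from the standard library (stable since Rust 1.62). The NaN bit patterns are built explicitly because the sign of `f32::NAN` is not guaranteed; the helper names are hypothetical.

```rust
/// Sort floats with the IEEE 754 total order, which also places NaNs.
fn sort_total(values: &mut [f32]) {
    values.sort_by(|a, b| a.total_cmp(b));
}

/// Example input containing both NaN signs, both zeros, and both infinities.
fn example_values() -> Vec<f32> {
    let pos_nan = f32::from_bits(0x7fc0_0000);
    let neg_nan = f32::from_bits(0xffc0_0000);
    vec![pos_nan, 1.0, 0.0, -0.0, f32::NEG_INFINITY, neg_nan, f32::INFINITY]
}
```

Under this order a NaN with the sign bit set sorts before negative infinity, -0.0 sorts before +0.0, and a NaN with the sign bit clear sorts after positive infinity.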

james7132 · 2022-06-21T23:52:06Z

After #5049, we're in a position to slot any of these radix sorts, and change which algorithm we use depending on sort mode. We can handle the stable sorts separately from the unstable ones. For 3D use cases, where unstable sorts can be used for every phase, we can easily use radsort or voracious and get a 5-7x speedup, and decide on an appropriate approach for the stable/batched 2D cases.

superdump · 2022-06-23T09:06:24Z

After #5049, we're in a position to slot any of these radix sorts, and change which algorithm we use depending on sort mode. We can handle the stable sorts separately from the unstable ones. For 3D use cases, where unstable sorts can be used for every phase, we can easily use radsort or voracious and get a 5-7x speedup, and decide on an appropriate approach for the stable/batched 2D cases.

Probably worth retesting with latest versions of the three radix sort crates I tried.

# Objective

Partially addresses #4291.

Speed up the sort phase for unbatched render phases.

## Solution

Split out one of the optimizations in #4899 and allow implementors of `PhaseItem` to change what kind of sort is used when sorting the items in the phase. This currently includes Stable, Unstable, and Unsorted. These correspond to `Vec::sort_by_key`, `Vec::sort_unstable_by_key`, and no sorting at all, respectively. The default is `Unstable`. The last one can be used as a default if users introduce a preliminary depth prepass.

## Performance

This will not impact the performance of any batched phases, as it is still using a stable sort. 2D's only phase is unchanged. All 3D phases are unbatched currently, and will benefit from this change.

On `many_cubes`, where the primary phase is opaque, this change sees a speed up from 907.02us -> 477.62us, a 47.35% reduction.

![image](https://user-images.githubusercontent.com/3137680/174471253-22424874-30d5-4db5-b5b4-65fb2c612a9c.png)

## Future Work

There were prior discussions to add support for faster radix sorts in #4291, which in theory should be `O(n)` instead of `O(nlog(n))` time. [`voracious`](https://crates.io/crates/voracious_radix_sort) has been proposed, but it seems to be optimized for use cases with more than 30,000 items, which may be atypical for most systems.

Another optimization included in #4899 is to reduce the size of a few of the IDs commonly used in `PhaseItem` implementations to shrink the types to make swapping/sorting faster. Both `CachedPipelineId` and `DrawFunctionId` could be reduced to `u32` instead of `usize`.

Ideally, this should automatically change to use stable sorts when `BatchedPhaseItem` is implemented on the same phase item type, but this requires specialization, which may not land in stable Rust for a short while.

---

## Changelog

Added: `PhaseItem::sort`

## Migration Guide

RenderPhases now default to an unstable sort (via `slice::sort_unstable_by_key`). This can typically improve sort phase performance, but may produce incorrect batching results when implementing `BatchedPhaseItem`. To revert to the older stable sort, manually implement `PhaseItem::sort` to implement a stable sort (i.e. via `slice::sort_by_key`).

Co-authored-by: Federico Rinaldi <gisquerin@gmail.com>
Co-authored-by: Robert Swain <robert.swain@gmail.com>
Co-authored-by: colepoirier <colepoirier@gmail.com>
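
The shape of the change described in that PR can be sketched as a trait with an overridable default sort. This is a simplified, hypothetical trait for illustration; the real `PhaseItem` trait in Bevy has additional required items, and the `Opaque` and `Sprite2d` types here are made up.

```rust
/// Simplified, hypothetical version of a phase item trait with a
/// customizable sort; the real `PhaseItem` has more required items.
trait PhaseItem: Sized {
    type SortKey: Ord;

    fn sort_key(&self) -> Self::SortKey;

    /// Default: unstable sort, suitable for unbatched phases.
    fn sort(items: &mut [Self]) {
        items.sort_unstable_by_key(|item| item.sort_key());
    }
}

/// An unbatched item that accepts the default unstable sort.
struct Opaque {
    distance_key: u32,
}

impl PhaseItem for Opaque {
    type SortKey = u32;

    fn sort_key(&self) -> u32 {
        self.distance_key
    }
}

/// A batched item that opts back into a stable sort.
struct Sprite2d {
    z_key: u32,
    batch: u32,
}

impl PhaseItem for Sprite2d {
    type SortKey = u32;

    fn sort_key(&self) -> u32 {
        self.z_key
    }

    /// Stable sort keeps items with equal keys in queue order for batching.
    fn sort(items: &mut [Self]) {
        items.sort_by_key(|item| item.sort_key());
    }
}
```

Calling `Sprite2d::sort` keeps items with equal `z_key` in the order they were queued, which is what batching relies on, while `Opaque::sort` is free to reorder equal keys.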


rparrett · 2023-03-17T04:51:44Z

Looking at queue_sprites again with Bevy 0.10.0 and key functions from Rob's old branch. Motivated again by #8100.

noisy M1 Mac, Chrome
These are just frame times plucked from LogDiagnosticsPlugin, no tracing.

bevy-vs-pixi is a very similar benchmark to bevymark, but its z values are distributed differently.
It spawns pairs of sprites with z = rng.gen::<f32>() and z + f32::EPSILON.

bevymark was run with a hardcoded BirdScheduled { per_wave: 1000, wave: 100 }

bevy-vs-pixi was run with 32k rects (64k sprites)

	bevymark native	bevymark wasm	bevy-vs-pixi native	bevy-vs-pixi wasm
rdst 0.20.10	21.22ms	50.70ms	13.87ms	83.97ms
rdst no-default	20.28ms	49.65ms	13.36ms	81.71ms
sort_unstable_by	18.47ms	32.03ms	10.36ms	113.27ms
radsort 0.1.0	17.83ms	60.04ms	8.83ms	76.06ms
radsort key cached in ExtractedSprite	23.63ms	62.84ms	12.18ms	75.72ms
radsort quadruple key	17.87ms	62.28ms	9.20ms	80.08ms
glidesort 0.1.2	17.63ms	31.75ms	14.73ms	128.84ms

Really fascinating. Not sure what's going on with rdst multi-threading. I did at least verify that on those runs, rayon was either in or out of the dep tree. Also re-ran the sort_unstable_by / wasm just to be sure.

Didn't include voracious because it doesn't seem possible to write a comparable key function.

rparrett · 2023-03-17T16:28:29Z

I threw glidesort in there for fun.

superdump · 2023-11-05T22:07:16Z

@rparrett I was revisiting this to see if it could be closed now. What I take from the above table is that glidesort is faster on ordered arrays and radsort is faster on random ones. In which case, I think sticking with radsort is the way to go. Did I understand correctly?

rparrett · 2023-11-05T23:25:48Z

Sounds about right / matches what I was told about radsort / sounds good to me. There may be reason to rebenchmark later. The author (of radsort) is apparently cooking up a new release.

superdump added the A-Rendering (Drawing game state to the screen) and C-Performance (A change motivated by improving speed, memory usage or compile times) labels Mar 22, 2022

james7132 mentioned this issue Jun 19, 2022

[Merged by Bors] - Allow unbatched render phases to use unstable sorts #5049

Closed
