
Generate MeshUniforms on the GPU via compute shader where available. #12773

Merged
merged 54 commits into bevyengine:main on Apr 10, 2024

Conversation

pcwalton
Contributor

@pcwalton pcwalton commented Mar 29, 2024

Currently, `MeshUniform`s are rather large: 160 bytes. They're also somewhat expensive to compute, because they involve taking the inverse of a 3x4 matrix. Finally, if a mesh is present in multiple views, that mesh will have a separate `MeshUniform` for each and every view, which is wasteful.

This commit fixes these issues by introducing the concept of a *mesh input uniform* and adding a *mesh uniform building* compute shader pass. The `MeshInputUniform` is simply the minimum amount of data needed for the GPU to compute the full `MeshUniform`. Most of this data is just the transform and is therefore only 64 bytes. `MeshInputUniform`s are computed during the *extraction* phase, much like skins are today, in order to avoid needlessly copying transforms around on the CPU. (In fact, the render app has been changed to only store the translation of each mesh; it no longer cares about any other part of the transform, which is stored only on the GPU and in the main world.) Before rendering, the `build_mesh_uniforms` pass runs to expand the `MeshInputUniform`s to the full `MeshUniform`.
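
To make the layout concrete, here is a rough sketch of the *shape* of such an input uniform. This is an illustration under the constraints described above (the transform plus a previous-frame index, packed into 64 bytes), not the exact struct this PR adds:

```rust
// Hedged sketch only: field names and packing are illustrative, not Bevy's
// actual `MeshInputUniform` definition.
#[repr(C)]
#[derive(Clone, Copy, bytemuck::Pod, bytemuck::Zeroable)]
struct MeshInputUniformSketch {
    /// Affine model transform packed as a 3x4 matrix (12 floats, 48 bytes).
    transform: [f32; 12],
    /// Index of this mesh's input uniform from the previous frame, for TAA.
    previous_input_index: u32,
    /// Padding to round the struct up to 64 bytes (16 32-bit words).
    _pad: [u32; 3],
}
```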

The mesh uniform building pass does the following, all on GPU:

  1. Copy the appropriate fields of the `MeshInputUniform` to the `MeshUniform` slot. If a single mesh is present in multiple views, this effectively duplicates it into each view.

  2. Compute the inverse transpose of the model transform, used for transforming normals (see the sketch after this list).

  3. If applicable, copy the mesh's transform from the previous frame for TAA. To support this, we double-buffer the `MeshInputUniform`s over two frames and swap the buffers each frame. The `MeshInputUniform`s for the current frame contain the index of that mesh's `MeshInputUniform` for the previous frame.
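
For step 2, the math is the standard normal-matrix computation. A minimal CPU-side sketch using `glam`, purely for illustration (the real work happens in the compute shader):

```rust
use glam::{Affine3A, Mat3};

/// Hedged illustration of step 2: normals must be transformed by the inverse
/// transpose of the upper-left 3x3 of the model matrix, otherwise non-uniform
/// scale skews them.
fn inverse_transpose_for_normals(model: Affine3A) -> Mat3 {
    Mat3::from(model.matrix3).inverse().transpose()
}
```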

This commit produces wins in virtually every CPU part of the pipeline: `extract_meshes`, `queue_material_meshes`, `batch_and_prepare_render_phase`, and especially `write_batched_instance_buffer` are all faster. Shrinking the amount of CPU data that has to be shuffled around speeds up the entire rendering process.

| Benchmark              | This branch | `main`  | Speedup |
|------------------------|-------------|---------|---------|
| `many_cubes -nfc`      |      17.259 |  24.529 |  42.12% |
| `many_cubes -nfc -vpi` |     302.116 | 312.123 |   3.31% |
| `many_foxes`           |       3.227 |   3.515 |   8.92% |

Because mesh uniform building requires a compute shader, and WebGL 2 has no compute shaders, the existing CPU mesh uniform building code has been left as-is. Many types now have both CPU and GPU mesh uniform building modes. Developers can opt back into CPU mesh uniform building by setting the `use_gpu_uniform_builder` option on `PbrPlugin` to `false`.
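
A minimal sketch of that opt-out, assuming the `use_gpu_uniform_builder` field name given above (the exact field name in the released API may differ):

```rust
use bevy::pbr::PbrPlugin;
use bevy::prelude::*;

fn main() {
    App::new()
        .add_plugins(DefaultPlugins.set(PbrPlugin {
            // Fall back to the CPU mesh uniform building path (e.g. for WebGL 2).
            use_gpu_uniform_builder: false,
            ..default()
        }))
        .run();
}
```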

Below are graphs of the CPU portions of `many_cubes --no-frustum-culling`. Yellow is this branch, red is `main`.

`extract_meshes`:
[screenshot: extract_meshes CPU profile, this branch vs. main]
It's notable that we get a small win even though we're now writing to a GPU buffer.

`queue_material_meshes`:
[screenshot: queue_material_meshes CPU profile, this branch vs. main]
There's a slight regression here; I'm not sure what's causing it. In any case, it's far outweighed by the other gains.

`batch_and_prepare_render_phase`:
[screenshot: batch_and_prepare_render_phase CPU profile, this branch vs. main]
There's a huge win here, enough to make batching basically drop off the profile.

`write_batched_instance_buffer`:
[screenshot: write_batched_instance_buffer CPU profile, this branch vs. main]
There's a massive improvement here, as expected. Note that a lot of it simply comes from the fact that `MeshInputUniform` is `Pod`. (This isn't a maintainability problem in my view, because `MeshInputUniform` is so simple: just 16 tightly packed words.)
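
As an aside, this is why `Pod` matters for the write path: the whole CPU-side array can be uploaded as one raw byte copy, with no per-element encoding. A hedged, self-contained sketch using `wgpu` and `bytemuck` directly (Bevy's own buffer abstractions differ):

```rust
use bytemuck::Pod;
use wgpu::{Buffer, Queue};

/// Upload a slice of Pod structs (e.g. mesh input uniforms) in a single copy.
fn upload_pod_slice<T: Pod>(queue: &Queue, buffer: &Buffer, data: &[T]) {
    queue.write_buffer(buffer, 0, bytemuck::cast_slice(data));
}
```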

Changelog

Added

  • Per-mesh instance data is now generated on the GPU with a compute shader instead of on the CPU, resulting in rendering performance improvements on platforms where compute shaders are supported.

Migration guide

  • Custom render phases now need multiple systems beyond just `batch_and_prepare_render_phase`. Code that previously created custom render phases should now add a `BinnedRenderPhasePlugin` or `SortedRenderPhasePlugin`, as appropriate, instead of directly adding `batch_and_prepare_render_phase` (see the sketch below).
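
A rough sketch of what the new registration might look like. The import paths, generic parameters, placeholder phase item types (`MyBinnedPhase`, `MySortedPhase`), and the `::default()` constructors below are assumptions for illustration, not the verified API; consult the actual plugin signatures when migrating:

```rust
use bevy::app::App;
// Import paths below are assumptions for illustration.
use bevy::pbr::MeshPipeline;
use bevy::render::render_phase::{BinnedRenderPhasePlugin, SortedRenderPhasePlugin};

// `MyBinnedPhase` and `MySortedPhase` stand in for your custom phase item types
// and are not defined here.
fn register_custom_phases(app: &mut App) {
    // Previously this function would have added `batch_and_prepare_render_phase`
    // directly; the phase plugins now own batching and preparation.
    app.add_plugins((
        BinnedRenderPhasePlugin::<MyBinnedPhase, MeshPipeline>::default(),
        SortedRenderPhasePlugin::<MySortedPhase, MeshPipeline>::default(),
    ));
}
```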

@mnmaita mnmaita added A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times labels Mar 30, 2024
@pcwalton pcwalton marked this pull request as ready for review March 30, 2024 23:36
crates/bevy_pbr/src/lib.rs (review comment, outdated, resolved)
@james7132 james7132 added the M-Needs-Migration-Guide A breaking change to Bevy's public API that needs to be noted in a migration guide label Mar 31, 2024
Contributor

It looks like your PR is a breaking change, but you didn't provide a migration guide.

Could you add some context on what users should update when this change gets released in a new version of Bevy?
It will be used to help write the migration guide for the version. Putting it after a `## Migration Guide` header will help it get automatically picked up by our tooling.

Member

@james7132 james7132 left a comment

Only a partial review right now. Will do a full scan through soon.

crates/bevy_pbr/src/lib.rs (review comment, outdated, resolved)
crates/bevy_pbr/src/lightmap/mod.rs (review comment, outdated, resolved)
@pcwalton
Contributor Author

pcwalton commented Apr 2, 2024

Unfortunately, I'm marking this as a draft. I found a bug and fixing it requires rewriting most of the patch.

@pcwalton pcwalton marked this pull request as draft April 2, 2024 03:45
@Elabajaba
Contributor

Elabajaba commented Apr 2, 2024

edit: Whoops, didn't notice that it had been changed to a draft, and didn't read the message that's 2 above this one.

I'm getting crashes with this on many_foxes (immediately on launch), and scene_viewer (when I toggle shadows off after toggling them on). scene_viewer also seems to have flickering issues after shadows have been toggled on (tested with bistro and the synty castle scene).

many_foxes backtrace

thread 'Compute Task Pool (10)' panicked at crates\bevy_ecs\src\entity\mod.rs:223:9:
assertion failed: generation.get() <= HIGH_MASK
stack backtrace:
   0: std::panicking::begin_panic_handler
             at /rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library\std\src\panicking.rs:647
   1: core::panicking::panic_fmt
             at /rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library\core\src\panicking.rs:72
   2: core::panicking::panic
             at /rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library\core\src\panicking.rs:144
   3: bevy_ecs::entity::Entity::from_raw_and_generation
             at .\crates\bevy_ecs\src\entity\mod.rs:223
   4: bevy_ecs::entity::Entities::resolve_from_id
             at .\crates\bevy_ecs\src\entity\mod.rs:794
   5: bevy_ecs::entity::Entities::contains
             at .\crates\bevy_ecs\src\entity\mod.rs:724
   6: bevy_ecs::system::commands::Commands::get_entity
             at .\crates\bevy_ecs\src\system\commands\mod.rs:334
   7: bevy_ecs::system::commands::Commands::entity
             at .\crates\bevy_ecs\src\system\commands\mod.rs:294
   8: bevy_pbr::render::gpu_preprocess::prepare_preprocess_bind_groups
             at .\crates\bevy_pbr\src\render\gpu_preprocess.rs:273
   9: core::ops::function::FnMut::call_mut<void (*)(bevy_ecs::system::commands::Commands,bevy_ecs::change_detection::Res<bevy_render::renderer::render_device::RenderDevice>,bevy_ecs::change_detection::Res<enum2$<bevy_render::batching::BatchedInstanceBuffers<bev
             at /rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97\library\core\src\ops\function.rs:166
  10: core::ops::function::impls::impl$3::call_mut<tuple$<bevy_ecs::system::commands::Commands,bevy_ecs::change_detection::Res<bevy_render::renderer::render_device::RenderDevice>,bevy_ecs::change_detection::Res<enum2$<bevy_render::batching::BatchedInstanceBuffe
             at /rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97\library\core\src\ops\function.rs:294
  11: bevy_ecs::system::function_system::impl$17::run::call_inner<tuple$<>,bevy_ecs::system::commands::Commands,bevy_ecs::change_detection::Res<bevy_render::renderer::render_device::RenderDevice>,bevy_ecs::change_detection::Res<enum2$<bevy_render::batching::Bat
             at .\crates\bevy_ecs\src\system\function_system.rs:661
  12: bevy_ecs::system::function_system::impl$17::run<tuple$<>,void (*)(bevy_ecs::system::commands::Commands,bevy_ecs::change_detection::Res<bevy_render::renderer::render_device::RenderDevice>,bevy_ecs::change_detection::Res<enum2$<bevy_render::batching::Batche
             at .\crates\bevy_ecs\src\system\function_system.rs:664
  13: bevy_ecs::system::function_system::impl$6::run_unsafe<void (*)(bevy_ecs::system::commands::Commands,bevy_ecs::change_detection::Res<bevy_render::renderer::render_device::RenderDevice>,bevy_ecs::change_detection::Res<enum2$<bevy_render::batching::BatchedIn
             at .\crates\bevy_ecs\src\system\function_system.rs:504
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Encountered a panic in system `bevy_pbr::render::gpu_preprocess::prepare_preprocess_bind_groups`!

@IceSentry IceSentry self-requested a review April 9, 2024 04:51
@pcwalton
Contributor Author

pcwalton commented Apr 9, 2024

Flickering problems were reported in the lighting example. Marking as draft until I figure it out.

@pcwalton pcwalton marked this pull request as draft April 9, 2024 18:23
Contributor

@IceSentry IceSentry left a comment

Just a few nitpicks, nothing major.

I'd like to see a migration guide entry, but I'm not entirely sure what it should be yet; we can figure something out closer to release.

This generally LGTM, but as mentioned on Discord there's a flickering issue in the lighting example, so I won't approve it yet. I confirmed that it works on WebGL and WebGPU, though.

crates/bevy_pbr/src/prepass/mod.rs (review comment, outdated, resolved)
crates/bevy_pbr/src/render/gpu_preprocess.rs (review comment, outdated, resolved)
crates/bevy_pbr/src/render/mesh.rs (review comment, outdated, resolved)
),
);
} else {
let render_device = render_app.world().resource::<RenderDevice>();
Contributor

render_device should already be in scope.

Contributor Author

It won't borrow check if I use the existing render_device variable.

);
};

let render_device = render_app.world().resource::<RenderDevice>();
Contributor

same here, render_device is already in scope.

Contributor Author

It's also needed to satisfy the borrow check.

crates/bevy_render/src/batching/mod.rs (review comment, outdated, resolved)
crates/bevy_render/src/render_phase/mod.rs (review comment, resolved)
@pcwalton pcwalton marked this pull request as ready for review April 9, 2024 20:06
@pcwalton pcwalton requested a review from IceSentry April 9, 2024 20:06
@superdump superdump added this pull request to the merge queue Apr 10, 2024
Merged via the queue into bevyengine:main with commit 11817f4 Apr 10, 2024
27 checks passed
pcwalton added a commit to pcwalton/bevy that referenced this pull request Apr 10, 2024
This commit implements opt-in GPU frustum culling, built on top of the
infrastructure in bevyengine#12773. To enable it on a camera, add the `GpuCulling`
component to it. To additionally disable CPU frustum culling, add the
`NoCpuCulling` component. Note that adding `GpuCulling` without
`NoCpuCulling` *currently* does nothing useful. The reason why
`GpuCulling` doesn't automatically imply `NoCpuCulling` is that I intend
to follow this patch up with GPU two-phase occlusion culling, and CPU
frustum culling plus GPU occlusion culling seems like a very
commonly-desired mode.

Adding the `GpuCulling` component to a view puts that view into
*indirect mode*. This mode makes all drawcalls indirect, relying on the
mesh preprocessing shader to allocate instances dynamically. In indirect
mode, the `PreprocessWorkItem` `output_index` points not to a
`MeshUniform` instance slot but instead to a set of `wgpu`
`IndirectParameters`, from which it allocates an instance slot
dynamically if frustum culling succeeds. Batch building has been updated
to allocate and track indirect parameter slots, and the AABBs are now
supplied to the GPU as `MeshCullingData`.

A small amount of code relating to the frustum culling has been borrowed
from meshlets and moved into `maths.wgsl`. Note that standard Bevy
frustum culling uses AABBs, while meshlets use bounding spheres; this
means that not as much code can be shared as one might think.

This patch doesn't provide any way to perform GPU culling on shadow
maps, to avoid making this patch bigger than it already is. That can be
a followup.
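
For reference, a hedged sketch of opting a camera into this mode, based on the commit message above; the import path for the two marker components is an assumption:

```rust
use bevy::prelude::*;
// Assumed import path for the marker components described above.
use bevy::render::view::{GpuCulling, NoCpuCulling};

/// Spawn a camera that is frustum-culled on the GPU and skips CPU culling.
fn spawn_gpu_culled_camera(mut commands: Commands) {
    commands.spawn((Camera3dBundle::default(), GpuCulling, NoCpuCulling));
}
```
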
Cyannide pushed a commit to Cyannide/bevy that referenced this pull request Apr 14, 2024
Generating MeshUniforms on the GPU crashes on Android.
Introduced by bevyengine#12773
github-merge-queue bot pushed a commit that referenced this pull request Apr 28, 2024
This commit implements opt-in GPU frustum culling, built on top of the
infrastructure in #12773. To
enable it on a camera, add the `GpuCulling` component to it. To
additionally disable CPU frustum culling, add the `NoCpuCulling`
component. Note that adding `GpuCulling` without `NoCpuCulling`
*currently* does nothing useful. The reason why `GpuCulling` doesn't
automatically imply `NoCpuCulling` is that I intend to follow this patch
up with GPU two-phase occlusion culling, and CPU frustum culling plus
GPU occlusion culling seems like a very commonly-desired mode.

Adding the `GpuCulling` component to a view puts that view into
*indirect mode*. This mode makes all drawcalls indirect, relying on the
mesh preprocessing shader to allocate instances dynamically. In indirect
mode, the `PreprocessWorkItem` `output_index` points not to a
`MeshUniform` instance slot but instead to a set of `wgpu`
`IndirectParameters`, from which it allocates an instance slot
dynamically if frustum culling succeeds. Batch building has been updated
to allocate and track indirect parameter slots, and the AABBs are now
supplied to the GPU as `MeshCullingData`.

A small amount of code relating to the frustum culling has been borrowed
from meshlets and moved into `maths.wgsl`. Note that standard Bevy
frustum culling uses AABBs, while meshlets use bounding spheres; this
means that not as much code can be shared as one might think.

This patch doesn't provide any way to perform GPU culling on shadow
maps, to avoid making this patch bigger than it already is. That can be
a followup.

## Changelog

### Added

* Frustum culling can now optionally be done on the GPU. To enable it,
add the `GpuCulling` component to a camera.
* To disable CPU frustum culling, add `NoCpuCulling` to a camera. Note
that `GpuCulling` doesn't automatically imply `NoCpuCulling`.