Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix snark pool long async cycles #13409

Merged
merged 4 commits into from
Jul 12, 2023

Conversation

nholland94
Copy link
Member

As part of investigating #13324, it was discovered that the primary underlying issue was that the snark pool was leaking. The issue causing the leak was that the snark pool refcount frontier extension was not properly decrementing entries when the frontier root transitions. Additionally, when the extension would compute the removed entries, it would only count the number of scan state that caused entries to be removed, which negatively impacted the garbage collection logic in the snark pool as the implementation relied on this count.

Fixing the snark pool refcount leak would have caused another issue in that the snark pool garbage collection logic, which is now triggered properly with the fix, was wildly inefficient. The worst case number of works in the snark pool is about 2*720*128*0.75=138240 (fork_factor*k*works_per_block*f), which was large enough to warrant addressing this.

The following changes are included in this PR:

  • snark pool persistence was removed
  • fixed bug in snark pool refcount table where old root scan state works were not decremented
  • optimized snark pool garbage collection by changing snark pool refcount to send a negative diff when works are invalidated
  • optimize snark pool refcount view broadcast logic not send irrelevant diffs
  • removed unnecessary references to snark pool refcount internal tables within the snark pool

Closes #13324.

@nholland94 nholland94 requested a review from a team as a code owner June 15, 2023 00:12
@nholland94
Copy link
Member Author

!ci-build-me

@@ -328,6 +328,7 @@ let calculate_root_transition_diff t heir =
(Root_transitioned
{ new_root= new_root_data
; garbage= Full garbage_nodes
; old_root_scan_state= Full (Breadcrumb.staged_ledger root |> Staged_ledger.scan_state)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we not include the old root in garbage?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would require more changes in all the other extensions and places we consume these diffs. There's a comment that mentions we explicitly leave it out. I don't recall why, but there was a reason, and we would need to update logic that was not expecting the old root to be in garbage to now handle that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a comment when defining this? (also, where's the comment you mentioned? couldn't find it in this file)
It's only in Ledger_table extension and in catchup and frontier's apply-diff. And it is not clear what expected in these places. Should we look into this further? If so, could you please open an issue for this?

Also, where does the old root get removed if it is not part of garbage?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the comment I was referring to. It's where we define the diff type itself, and describes its semantics. https://github.com/MinaProtocol/mina/blob/develop/src/lib/transition_frontier/frontier_base/diff.mli#L107

The old root is removed when we process the root transition diff in the full frontier as a special case, and doesn't get treated the same way as garbage does. https://github.com/MinaProtocol/mina/blob/develop/src/lib/transition_frontier/full_frontier/full_frontier.ml#L350

~metadata:[ ("num_removed", `Int (List.length removed_work)) ] ;
List.iter removed_work ~f:(fun work ->
Hashtbl.remove t.snark_tables.rebroadcastable work ;
Hashtbl.remove t.snark_tables.all work ) ;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't these be kept until the work is not referenced in the frontier?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The work we receive here is work that was removed from the refcount table. Work is only removed from the refcount table if there are no remaining references in the frontier.

@georgeee georgeee mentioned this pull request Jul 10, 2023
7 tasks
@nholland94 nholland94 changed the base branch from release/1.4.0 to compatible July 10, 2023 17:14
@nholland94 nholland94 force-pushed the fix/snark-pool-long-async-cycles branch from fa7893b to 3080c5e Compare July 10, 2023 17:15
@nholland94
Copy link
Member Author

!ci-build-me

@nholland94
Copy link
Member Author

!ci-build-me

@nholland94 nholland94 merged commit cec59e3 into compatible Jul 12, 2023
1 check passed
@nholland94 nholland94 deleted the fix/snark-pool-long-async-cycles branch July 12, 2023 17:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants