-
Notifications
You must be signed in to change notification settings - Fork 528
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix snark pool long async cycles #13409
Conversation
!ci-build-me |
@@ -328,6 +328,7 @@ let calculate_root_transition_diff t heir = | |||
(Root_transitioned | |||
{ new_root= new_root_data | |||
; garbage= Full garbage_nodes | |||
; old_root_scan_state= Full (Breadcrumb.staged_ledger root |> Staged_ledger.scan_state) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we not include the old root in garbage
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would require more changes in all the other extensions and places we consume these diffs. There's a comment that mentions we explicitly leave it out. I don't recall why, but there was a reason, and we would need to update logic that was not expecting the old root to be in garbage to now handle that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a comment when defining this? (also, where's the comment you mentioned? couldn't find it in this file)
It's only in Ledger_table
extension and in catchup and frontier's apply-diff. And it is not clear what expected in these places. Should we look into this further? If so, could you please open an issue for this?
Also, where does the old root get removed if it is not part of garbage
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the comment I was referring to. It's where we define the diff type itself, and describes its semantics. https://github.com/MinaProtocol/mina/blob/develop/src/lib/transition_frontier/frontier_base/diff.mli#L107
The old root is removed when we process the root transition diff in the full frontier as a special case, and doesn't get treated the same way as garbage
does. https://github.com/MinaProtocol/mina/blob/develop/src/lib/transition_frontier/full_frontier/full_frontier.ml#L350
src/lib/network_pool/snark_pool.ml
Outdated
~metadata:[ ("num_removed", `Int (List.length removed_work)) ] ; | ||
List.iter removed_work ~f:(fun work -> | ||
Hashtbl.remove t.snark_tables.rebroadcastable work ; | ||
Hashtbl.remove t.snark_tables.all work ) ; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't these be kept until the work is not referenced in the frontier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The work we receive here is work that was removed from the refcount table. Work is only removed from the refcount table if there are no remaining references in the frontier.
fa7893b
to
3080c5e
Compare
!ci-build-me |
!ci-build-me |
As part of investigating #13324, it was discovered that the primary underlying issue was that the snark pool was leaking. The issue causing the leak was that the snark pool refcount frontier extension was not properly decrementing entries when the frontier root transitions. Additionally, when the extension would compute the removed entries, it would only count the number of scan state that caused entries to be removed, which negatively impacted the garbage collection logic in the snark pool as the implementation relied on this count.
Fixing the snark pool refcount leak would have caused another issue in that the snark pool garbage collection logic, which is now triggered properly with the fix, was wildly inefficient. The worst case number of works in the snark pool is about
2*720*128*0.75=138240
(fork_factor*k*works_per_block*f
), which was large enough to warrant addressing this.The following changes are included in this PR:
Closes #13324.