Flaky Tests CI #207

Closed
jpraynaud opened this issue May 20, 2022 · 12 comments · Fixed by #254 or #374
Labels: bug ⚠️ Something isn't working

Comments

@jpraynaud
Member

jpraynaud commented May 20, 2022

Since the Mithril Core library was refactored, we have seen numerous tests randomly turning red in the CI.
Usually, the fix is to re-run the tests. Annoyingly, this flakiness is not reproducible locally 🤔

We need to find a solution, and it seems that:

  • the problem occurs frequently in tests requiring serialization/deserialization of the core library
  • the problem may be due to the GitHub CI itself, but this is hard to investigate/test (it could depend on the underlying machine running the job)

Examples:

The tests seem to stop with a SIGILL that could be linked to either:

@iquerejeta
Contributor

This seems hard to investigate indeed. Has it ever failed in the mithril-core serialisation tests? One thing we could do, to start ruling out possibilities, is change the key_decode_hex and key_encode_hex functions to not use serde. What do you think? I think not all types expose to/from_bytes functions, but I'm happy to add them. This way we can also make serde optional.
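
For illustration, a byte-based round trip might look roughly like the sketch below (the trait and function signatures are assumptions for the sketch, not the actual mithril-core API, and it relies on the hex crate):

```rust
/// Hypothetical byte-oriented trait standing in for what the mithril-core
/// types would need to expose (names assumed for the sketch).
pub trait ToFromBytes: Sized {
    fn to_bytes(&self) -> Vec<u8>;
    fn from_bytes(bytes: &[u8]) -> Result<Self, String>;
}

/// Hex-encode a key through its raw bytes instead of serde.
pub fn key_encode_hex<T: ToFromBytes>(key: &T) -> String {
    hex::encode(key.to_bytes())
}

/// Hex-decode a key through its raw bytes instead of serde.
pub fn key_decode_hex<T: ToFromBytes>(hex_str: &str) -> Result<T, String> {
    let bytes = hex::decode(hex_str).map_err(|e| e.to_string())?;
    T::from_bytes(&bytes)
}
```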

@abailly-iohk
Collaborator

Do we have systematic (e.g. property-based) tests for serialisation/deserialisation? That could be a good start if we suspect this part is the culprit.
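
A property-based round-trip test of that kind might look roughly like the sketch below, using proptest; the Dummy struct is only a placeholder, a real test would generate the library's own key and signature types:

```rust
use proptest::prelude::*;

// Placeholder type for the sketch; a real test would generate the library's
// own types (e.g. keys derived from random seeds) instead.
#[derive(Debug, Clone, PartialEq, serde::Serialize, serde::Deserialize)]
struct Dummy {
    index: u64,
    payload: Vec<u8>,
}

proptest! {
    // Property: serialising then deserialising any value yields it back unchanged.
    #[test]
    fn serde_roundtrip(
        index in any::<u64>(),
        payload in proptest::collection::vec(any::<u8>(), 0..128),
    ) {
        let value = Dummy { index, payload };
        let encoded = serde_json::to_vec(&value).unwrap();
        let decoded: Dummy = serde_json::from_slice(&encoded).unwrap();
        prop_assert_eq!(decoded, value);
    }
}
```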

@iquerejeta
Contributor

We do have tests for serialisation/deserialisation of VKs, SKs, StmSigs and StmAggrSigs, so those are the ones I would have suspected to fail. But I'm not aware of that happening. That is why I suggested changing key_decode_hex, to see whether it really is the serialisation of these types.

@iquerejeta
Contributor

iquerejeta commented May 31, 2022

Actually, we do have at least one failure in mithril-core, which was shared by @ghubertpalo. I believe this was before the refactor, and in the test for serialisation of Merkle tree paths, so that might be a hint.

https://github.com/input-output-hk/mithril/runs/6384395584?check_suite_focus=true

Code at that time was this

@iquerejeta
Contributor

I'm trying to avoid using the serde derivations. I don't have solid evidence that this is the cause (TBH, I have no clue what it could be), but I found that there were some problems with serde derivation in the past, so it is worth a try. I've set up PR #230 changing key_decode_hex (and key_encode_hex) to not use serde, and created a temporary trait for this. However, some of the errors now report `thread '_' has overflowed its stack`. Does anyone have an idea why that might be? (Those errors are reproducible locally.)
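
One guess about the stack overflow: the byte-oriented path may keep larger values (or deeper recursion) on the stack than the default test-thread stack allows. A quick way to check is to re-run the failing body on a thread with an explicitly larger stack, roughly like this sketch (not project code):

```rust
use std::thread;

// Run a test body on a thread with a larger stack than the default,
// to check whether the failure is a plain stack-size issue.
fn run_with_big_stack<F>(body: F)
where
    F: FnOnce() + Send + 'static,
{
    thread::Builder::new()
        .stack_size(16 * 1024 * 1024) // 16 MiB
        .spawn(body)
        .expect("failed to spawn thread")
        .join()
        .expect("test body panicked");
}
```

Setting the RUST_MIN_STACK environment variable before running the tests is a quicker way to probe the same hypothesis, since it raises the default stack size of spawned threads (which is what the test harness uses).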

@ghubertpalo
Collaborator

I don't know if it is related to this bug, but I have a recurring test error in mithril-common: I made a PR to highlight a u64 overflow.

@iquerejeta
Contributor

Yes, I had the same issue and did something similar. To avoid the tests failing, I did this. Maybe if you apply that change to the test in your PR, we can merge it, but this does not seem to solve the underlying problem.
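
As a generic illustration only (not the actual change in either PR), checked arithmetic makes such a u64 overflow explicit instead of panicking in debug builds and silently wrapping in release builds:

```rust
// Sum stakes with an explicit overflow check; returns None on overflow
// instead of relying on debug-build panics or release-build wrapping.
fn total_stake(stakes: &[u64]) -> Option<u64> {
    stakes.iter().try_fold(0u64, |acc, &stake| acc.checked_add(stake))
}
```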

@jpraynaud
Member Author

jpraynaud commented Jun 2, 2022

Here is a summary of the tests we ran yesterday with @iquerejeta and @Alenar, and our findings:

  • We have not succeeded in obtaining a stack trace: we are completely blind and have no clue of what causes the crash
  • The issue started at job run #801, just after the release of the new blst crypto backend
  • We tried to reproduce the flaky behavior before this release by re-running job run #766 multiple times, but it never failed
  • The SIGILL error happens multiple times in the same job run with the same test executable, as in #1022
  • We downloaded the test binary produced as a CI artifact in #1024:
    • It causes the SIGILL in the CI
    • It does not cause the SIGILL on 2 computers outside the CI
  • It appears that the problem could lie in either:
    • The machines allocated by GitHub Actions, which are not consistent
    • The blst backend, or the unsafe code wrapping it in mithril-core

@iquerejeta
Contributor

Made a pass over the unsafe code; none of the following solved the issue (the pointer change is sketched below):

  • Removing transmute
  • Using as_ptr() or as_mut_ptr() instead of a reference to the first element
  • Removing sub (it was actually not needed)
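
For reference, the pointer change amounts to something like the sketch below; the extern function is hypothetical and only stands in for a blst-style FFI call:

```rust
extern "C" {
    // Hypothetical C function standing in for a blst-style FFI call.
    fn ffi_process(out: *mut u8, input: *const u8, input_len: usize);
}

fn process(input: &[u8]) -> [u8; 96] {
    let mut out = [0u8; 96];
    unsafe {
        // Prefer as_ptr()/as_mut_ptr() over casting a reference to the first
        // element: indexing panics on an empty slice, and the dedicated
        // accessors keep the pointer handling explicit.
        ffi_process(out.as_mut_ptr(), input.as_ptr(), input.len());
    }
    out
}
```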

@jpraynaud
Member Author

Maybe we could use the signal_hook crate, which can register an action when a SIGILL is triggered? 🤔
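
One caveat, if I remember right: SIGILL is among the signals that signal_hook's safe registration refuses (a handler for it cannot safely return), so we might need a lower-level handler that only logs and aborts. A minimal libc sketch (hypothetical, not project code):

```rust
// SIGILL handler that writes a marker to stderr and aborts. Only
// async-signal-safe calls (write, abort) are used inside the handler.
extern "C" fn on_sigill(_signum: libc::c_int) {
    const MSG: &[u8] = b"caught SIGILL in test binary\n";
    unsafe {
        libc::write(
            libc::STDERR_FILENO,
            MSG.as_ptr() as *const libc::c_void,
            MSG.len(),
        );
        libc::abort();
    }
}

fn install_sigill_handler() {
    let handler = on_sigill as extern "C" fn(libc::c_int);
    unsafe {
        libc::signal(libc::SIGILL, handler as libc::sighandler_t);
    }
}
```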

@iquerejeta
Contributor

Other things tried:

  • Using blstrs for the operations over BLS12-381, thereby avoiding any unsafe Rust in our codebase. However, we still got the problem after a few runs (8th attempt this time)
  • Using zkcrypto's implementation of bls12_381. This seems to resolve the problem, at least over 10 runs

Therefore, the current decision is to create a compilation flag to choose between the blst and zkcrypto implementations (the latter being the default). We will open an issue with supranational to notify them about this UB, as it seems it could be something on their end.
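
Such a compilation flag might look roughly like the sketch below; the feature and module names are assumptions, not the actual mithril-core layout:

```rust
// In Cargo.toml (sketch):
//   [features]
//   default = []            # zkcrypto's bls12_381 backend by default
//   blst = ["dep:blst"]     # opt in to the supranational blst backend
//
// In the library, the backend module is then selected at compile time:
#[cfg(feature = "blst")]
pub mod multi_sig_blst { /* blst-backed implementation lives here */ }
#[cfg(not(feature = "blst"))]
pub mod multi_sig_zkcrypto { /* zkcrypto-backed implementation lives here */ }

#[cfg(feature = "blst")]
pub use self::multi_sig_blst as multi_sig;
#[cfg(not(feature = "blst"))]
pub use self::multi_sig_zkcrypto as multi_sig;
```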

@iquerejeta
Contributor

I'm reopening this because, while we now use zkcrypto's implementation as the default, we still offer the blast feature as a backend compilation option. If these flaky tests are not resolved, we should completely remove the blast option. Once we go open source, we can share this issue with the blst developers and see how we can collaborate to identify this problem. Before doing that, we should try to:

  • Update to the latest version of blst
