Flaky Tests CI #207

Closed
jpraynaud opened this issue May 20, 2022 · 12 comments · Fixed by #254 or #374
Labels: bug ⚠️ Something isn't working

Comments

@jpraynaud
Member

jpraynaud commented May 20, 2022

Since the Mithril Core library was refactored, we have seen numerous tests randomly turning red in the CI.
Usually, the fix is to re-run the tests. Annoyingly, this flakiness is not reproducible locally 🤔

We need to find a solution, and it seems that:

  • the problem occurs frequently in tests requiring serialization/deserialization of the core library
  • the problem may be due to the GitHub CI itself, but this is hard to investigate/test (it could depend on the underlying machine running the job)

Examples:

The tests seem to stop with a SIGILL that could be linked to either:

@iquerejeta
Contributor

This seems hard to investigate indeed. Has it ever failed in the mithril-core serialisation tests? One thing we could do, to start ruling out possibilities, is change the key_decode_hex and key_encode_hex functions to not use serde. What do you think? I think not all types expose to/from_bytes functions, but I'm happy to add them. This way we can also make serde optional.
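
For illustration, a byte-based round trip might look roughly like the sketch below (the trait and function signatures are assumptions for the sketch, not the actual mithril-core API, and it relies on the hex crate):

```rust
/// Hypothetical byte-oriented trait standing in for what the mithril-core
/// types would need to expose (names assumed for the sketch).
pub trait ToFromBytes: Sized {
    fn to_bytes(&self) -> Vec<u8>;
    fn from_bytes(bytes: &[u8]) -> Result<Self, String>;
}

/// Hex-encode a key through its raw bytes instead of serde.
pub fn key_encode_hex<T: ToFromBytes>(key: &T) -> String {
    hex::encode(key.to_bytes())
}

/// Hex-decode a key through its raw bytes instead of serde.
pub fn key_decode_hex<T: ToFromBytes>(hex_str: &str) -> Result<T, String> {
    let bytes = hex::decode(hex_str).map_err(|e| e.to_string())?;
    T::from_bytes(&bytes)
}
```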

@abailly-iohk
Collaborator

Do we have systematic (e.g. property-based) tests for serialisation/deserialisation? That could be a good start if we suspect this part is the culprit.
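
A property-based round-trip test of that kind might look roughly like the sketch below, using proptest; the Dummy struct is only a placeholder, a real test would generate the library's own key and signature types:

```rust
use proptest::prelude::*;

// Placeholder type for the sketch; a real test would generate the library's
// own types (e.g. keys derived from random seeds) instead.
#[derive(Debug, Clone, PartialEq, serde::Serialize, serde::Deserialize)]
struct Dummy {
    index: u64,
    payload: Vec<u8>,
}

proptest! {
    // Property: serialising then deserialising any value yields it back unchanged.
    #[test]
    fn serde_roundtrip(
        index in any::<u64>(),
        payload in proptest::collection::vec(any::<u8>(), 0..128),
    ) {
        let value = Dummy { index, payload };
        let encoded = serde_json::to_vec(&value).unwrap();
        let decoded: Dummy = serde_json::from_slice(&encoded).unwrap();
        prop_assert_eq!(decoded, value);
    }
}
```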

@iquerejeta
Contributor

We do have tests for serialisation/deserialisation of VKs, SKs, StmSigs and StmAggrSigs, so those are the ones I would have suspected to fail. But I'm not aware of that happening. That is why I suggested changing key_decode_hex, to see whether it really is the serialisation of these types.

@iquerejeta
Contributor

iquerejeta commented May 31, 2022

Actually, we do have at least one failure in mithril-core, which was shared by @ghubertpalo. I believe this was before the refactor, and in the test for serialisation of Merkle tree paths, so that might be a hint.

https://github.com/input-output-hk/mithril/runs/6384395584?check_suite_focus=true

Code at that time was this

@iquerejeta
Contributor

I'm trying to avoid using the serde derivations. I don't have solid evidence that this is the cause (TBH, I have no clue what it could be), but I found that there were some problems with serde derivation in the past, so it is worth a try. I've set up PR #230 changing key_decode_hex (and key_encode_hex) to not use serde, and created a temporary trait for this. However, some of the errors now report `thread '_' has overflowed its stack`. Does anyone have an idea why that might be? (Those errors are reproducible locally.)
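
One guess about the stack overflow: the byte-oriented path may keep larger values (or deeper recursion) on the stack than the default test-thread stack allows. A quick way to check is to re-run the failing body on a thread with an explicitly larger stack, roughly like this sketch (not project code):

```rust
use std::thread;

// Run a test body on a thread with a larger stack than the default,
// to check whether the failure is a plain stack-size issue.
fn run_with_big_stack<F>(body: F)
where
    F: FnOnce() + Send + 'static,
{
    thread::Builder::new()
        .stack_size(16 * 1024 * 1024) // 16 MiB
        .spawn(body)
        .expect("failed to spawn thread")
        .join()
        .expect("test body panicked");
}
```

Setting the RUST_MIN_STACK environment variable before running the tests is a quicker way to probe the same hypothesis, since it raises the default stack size of spawned threads (which is what the test harness uses).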

@ghubertpalo
Collaborator

I don't know if it is related to this bug, but I have a recurring test error in mithril-common: I made a PR to highlight a u64 overflow.

@iquerejeta
Contributor

Yes, I had the same issue and did something similar. To avoid the tests failing, I did this. Maybe if you apply that change to the test in your PR, we can merge it, but this does not seem to solve the underlying problem.
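
As a generic illustration only (not the actual change in either PR), checked arithmetic makes such a u64 overflow explicit instead of panicking in debug builds and silently wrapping in release builds:

```rust
// Sum stakes with an explicit overflow check; returns None on overflow
// instead of relying on debug-build panics or release-build wrapping.
fn total_stake(stakes: &[u64]) -> Option<u64> {
    stakes.iter().try_fold(0u64, |acc, &stake| acc.checked_add(stake))
}
```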

@jpraynaud
Member Author

jpraynaud commented Jun 2, 2022

Here is a summary of the tests we ran yesterday with @iquerejeta and @Alenar, and our findings:

  • We have not succeeded in obtaining a stack trace: we are completely blind and have no clue of what causes the crash
  • The issue started at job run #801, just after the release of the new blst crypto backend
  • We tried to reproduce the flaky behavior before this release by re-running job run #766 multiple times, but it never failed
  • The SIGILL error happens multiple times in the same job run with the same test executable, as in #1022
  • We downloaded the test binary produced as a CI artifact in #1024:
    • It causes the SIGILL in the CI
    • It does not cause the SIGILL on 2 computers outside the CI
  • It appears that the problem could lie in either:
    • The machines allocated by GitHub Actions, which are not consistent
    • The blst backend, or the unsafe code wrapping it in mithril-core

@iquerejeta
Contributor

Made a pass over the unsafe code; none of the following solved the issue (the pointer change is sketched below):

  • Removing transmute
  • Using as_ptr() or as_mut_ptr() instead of a reference to the first element
  • Removing sub (it was actually not needed)
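
For reference, the pointer change amounts to something like the sketch below; the extern function is hypothetical and only stands in for a blst-style FFI call:

```rust
extern "C" {
    // Hypothetical C function standing in for a blst-style FFI call.
    fn ffi_process(out: *mut u8, input: *const u8, input_len: usize);
}

fn process(input: &[u8]) -> [u8; 96] {
    let mut out = [0u8; 96];
    unsafe {
        // Prefer as_ptr()/as_mut_ptr() over casting a reference to the first
        // element: indexing panics on an empty slice, and the dedicated
        // accessors keep the pointer handling explicit.
        ffi_process(out.as_mut_ptr(), input.as_ptr(), input.len());
    }
    out
}
```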

@jpraynaud
Member Author

Maybe we could use the signal_hook crate, which can register an action when a SIGILL is triggered? 🤔
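
One caveat, if I remember right: SIGILL is among the signals that signal_hook's safe registration refuses (a handler for it cannot safely return), so we might need a lower-level handler that only logs and aborts. A minimal libc sketch (hypothetical, not project code):

```rust
// SIGILL handler that writes a marker to stderr and aborts. Only
// async-signal-safe calls (write, abort) are used inside the handler.
extern "C" fn on_sigill(_signum: libc::c_int) {
    const MSG: &[u8] = b"caught SIGILL in test binary\n";
    unsafe {
        libc::write(
            libc::STDERR_FILENO,
            MSG.as_ptr() as *const libc::c_void,
            MSG.len(),
        );
        libc::abort();
    }
}

fn install_sigill_handler() {
    let handler = on_sigill as extern "C" fn(libc::c_int);
    unsafe {
        libc::signal(libc::SIGILL, handler as libc::sighandler_t);
    }
}
```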

@iquerejeta
Contributor

Other things tried:

  • Using blstrs for the operations over BLS12-381, thereby avoiding any unsafe Rust in our codebase. However, we still got the problem after a few runs (8th attempt this time)
  • Using zkcrypto's implementation of bls12_381. This seems to resolve the problem, at least over 10 runs

Therefore, the current decision is to create a compilation flag to choose between the blst and zkcrypto implementations (the latter being the default). We will open an issue with supranational to notify them about this UB, as it seems it could be something on their end.
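
Such a compilation flag might look roughly like the sketch below; the feature and module names are assumptions, not the actual mithril-core layout:

```rust
// In Cargo.toml (sketch):
//   [features]
//   default = []            # zkcrypto's bls12_381 backend by default
//   blst = ["dep:blst"]     # opt in to the supranational blst backend
//
// In the library, the backend module is then selected at compile time:
#[cfg(feature = "blst")]
pub mod multi_sig_blst { /* blst-backed implementation lives here */ }
#[cfg(not(feature = "blst"))]
pub mod multi_sig_zkcrypto { /* zkcrypto-backed implementation lives here */ }

#[cfg(feature = "blst")]
pub use self::multi_sig_blst as multi_sig;
#[cfg(not(feature = "blst"))]
pub use self::multi_sig_zkcrypto as multi_sig;
```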

@iquerejeta
Contributor

I'm reopening this because, while we now use zkcrypto's implementation as the default, we still offer the blast feature as a backend compilation option. If these flaky tests are not resolved, we should completely remove the blast option. Once we go open source, we can share this issue with the blst developers and see how we can collaborate to identify this problem. Before doing that, we should try to:

  • Update to the latest version of blst
