-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ospf6d: Fix nexthop computation for inter-area multi-ABR ECMP #15899
ospf6d: Fix nexthop computation for inter-area multi-ABR ECMP #15899
Conversation
a58d9a6
to
cd14f1e
Compare
Pre-b74e965, we always merged nexthops of old (existing) and new (newly generated, based on a LSA update) routes, making it impossible to remove individual nexthops from a route. b74e965 replaced this by copying nexthops from the new route to the old route. This works as long as the old and new route are derived from the same LSA (e.g. multiple ECMP paths to the same ABR). However, in case of multiple parallel ABRs, each of them originates a LSA and the nexthops derived from them need to be combined to get the proper route nexthops. So instead of trying to incrementally update the route nexthops based on the old and new route nexthops, always recompute the route nexthops by merging all of its path nexthops (which we're already updating based on the LSA being processed). Fixes FRRouting#15777. Signed-off-by: Martin Buck <mb-tmp-tvguho.pbz@gromit.dyndns.org>
So far, this test only convered redundant paths to one ABR, now it checks redundant paths to redundant ABRs, covering both cases. Useful as a regression test for FRRouting#15777. Signed-off-by: Martin Buck <mb-tmp-tvguho.pbz@gromit.dyndns.org>
cd14f1e
to
217e505
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good
ci:rerun |
I can see the updated test passed on the CI before this was merged, but it fails consistently on my system (at commit 217e505) and sometimes on the CI, eg.: https://github.com/FRRouting/frr/pull/16055/checks?check_run_id=25314867923 Seems that it only finds 2 paths to R6, not 3. Help, please? |
Hmm, gave your PR a try and it didn't even compile on my test VM (Debian 11, kernel 5.10.149, gcc 10.2.1). I had to change your new function definitions from And then the ospf6_ecmp_inter_area topotest worked most of the time, both with your PR applied and without. I had one failure with missing nexthops with your PR and optimization turned on, followed by at least 5 successful attempts where I just reran the test. Plus maybe 20 successful attempts without optimization or without your PR. Can you say a bit more about your environment where it constantly fails? If I can reproduce that, I can have a closer look, but with my current failure rate, that's pretty difficult. My gut feeling is that the failures are neither caused by your PR nor by mine, I guess they would have occurred before as well if that case would have been tested. See the discussion in #15777 for another instance of this issue that vanished after updating to latest master (without any commits in between that could explain this, at least to me). I could imagine that this is timing-related and depending on the exact sequence of events, we take different code paths and one of them is buggy. |
@acooks-at-bda ping? In the meantime, I spent some time investigating the (in my case) occasional topotest failures at bit more. Failure rates in my case in different test runs were between 1% and maybe 20%, so nowhere near the 100% you saw. But anything larger that 0% is bad, so let's continue the investigation. When the topotest failed, R1 always lacked an "inter-router" summary LSA for R7 that should have been originated by R5. The one from R6 was always there. Since the LSA is missing, the nexthop computation can't possibly do the right thing, so we get missing nexthops. But it doesn't look like the nexthop computation is wrong, it rather looks like R5 didn't originate all LSAs an ABR should originate. I don't have the slightest idea yet why that could be the case, but I'll continue investigation in that direction. Another observation: When the topotest was successful, the LSDB on R1 looked different in different test runs as well, even though the resulting routes were OK. My gut feeling is that the differences are caused by LSAs that were originated while the network was converging and weren't properly flushed afterwards after better routes existed (i.e. the procedure described in the last paragraph of 12.4.3 in RFC 2328 doesn't seem to work in all cases). But investigation of that is 2nd priority. I'll probably open a new issue so this gets tracked properly. |
Ah, just found 772688d. Will give that a try before continuing. |
It doesn't fix the missing inter-router LSA issue But it seems to fix at least the "another observation" issue above: If the number of nexthops matches expectations, the LSDB now seems stable across test runs. |
Hi Martin, Thanks for investigating and apologies for the delay. I had meant to do some more investigation on my end... I'm running on a Fedora 40 development environment:
I asked a colleague for a sanity check and @jlodge-at-bda got the same failure once (1/11) on Debian 12. |
Thanks for the update. I've set up a Fedora 40 VM now but guess what, failure rates are around 10%, comparable to what I get on Debian. Nowhere near the 100% you're seeing. Guess I'll have to accept that and try to debug it this way. But I'll finally open a new issue now, discussion will continue there. |
Pre- b74e965, we always merged nexthops of old (existing) and new (newly generated, based on a LSA update) routes, making it impossible to remove individual nexthops from a route. b74e965 replaced this by copying nexthops from the new route to the old route. This works as long as the old and new route are derived from the same LSA (e.g. multiple ECMP paths to the same ABR). However, in case of multiple parallel ABRs, each of them originates a LSA and the nexthops derived from them need to be combined to get the proper route nexthops. So instead of trying to incrementally update the route nexthops based on the old and new route nexthops, always recompute the route nexthops by merging all of its path nexthops (which we're already updating based on the LSA being processed).
Fixes #15777.