ospf6d: Fix nexthop computation for inter-area multi-ABR ECMP #15899

gromit1811 · 2024-05-01T19:15:12Z

Pre- b74e965, we always merged nexthops of old (existing) and new (newly generated, based on a LSA update) routes, making it impossible to remove individual nexthops from a route. b74e965 replaced this by copying nexthops from the new route to the old route. This works as long as the old and new route are derived from the same LSA (e.g. multiple ECMP paths to the same ABR). However, in case of multiple parallel ABRs, each of them originates a LSA and the nexthops derived from them need to be combined to get the proper route nexthops. So instead of trying to incrementally update the route nexthops based on the old and new route nexthops, always recompute the route nexthops by merging all of its path nexthops (which we're already updating based on the LSA being processed).

Fixes #15777.

Pre-b74e965, we always merged nexthops of old (existing) and new (newly generated, based on a LSA update) routes, making it impossible to remove individual nexthops from a route. b74e965 replaced this by copying nexthops from the new route to the old route. This works as long as the old and new route are derived from the same LSA (e.g. multiple ECMP paths to the same ABR). However, in case of multiple parallel ABRs, each of them originates a LSA and the nexthops derived from them need to be combined to get the proper route nexthops. So instead of trying to incrementally update the route nexthops based on the old and new route nexthops, always recompute the route nexthops by merging all of its path nexthops (which we're already updating based on the LSA being processed). Fixes FRRouting#15777. Signed-off-by: Martin Buck <mb-tmp-tvguho.pbz@gromit.dyndns.org>

So far, this test only convered redundant paths to one ABR, now it checks redundant paths to redundant ABRs, covering both cases. Useful as a regression test for FRRouting#15777. Signed-off-by: Martin Buck <mb-tmp-tvguho.pbz@gromit.dyndns.org>

riw777

looks good

acooks-at-bda · 2024-05-24T02:05:06Z

ci:rerun

acooks-at-bda · 2024-05-24T02:24:13Z

I can see the updated test passed on the CI before this was merged, but it fails consistently on my system (at commit 217e505) and sometimes on the CI, eg.: https://github.com/FRRouting/frr/pull/16055/checks?check_run_id=25314867923

Seems that it only finds 2 paths to R6, not 3.

Help, please?

gromit1811 · 2024-05-27T07:35:07Z

Hmm, gave your PR a try and it didn't even compile on my test VM (Debian 11, kernel 5.10.149, gcc 10.2.1). I had to change your new function definitions from inline to static inline to make it compile (my initial tests were with -O0, guess with optimization turned on your original version would have compiled as well, but you might want to change it nevertheless).

And then the ospf6_ecmp_inter_area topotest worked most of the time, both with your PR applied and without. I had one failure with missing nexthops with your PR and optimization turned on, followed by at least 5 successful attempts where I just reran the test. Plus maybe 20 successful attempts without optimization or without your PR.

Can you say a bit more about your environment where it constantly fails? If I can reproduce that, I can have a closer look, but with my current failure rate, that's pretty difficult. My gut feeling is that the failures are neither caused by your PR nor by mine, I guess they would have occurred before as well if that case would have been tested. See the discussion in #15777 for another instance of this issue that vanished after updating to latest master (without any commits in between that could explain this, at least to me). I could imagine that this is timing-related and depending on the exact sequence of events, we take different code paths and one of them is buggy.

gromit1811 · 2024-06-04T12:57:32Z

@acooks-at-bda ping?

In the meantime, I spent some time investigating the (in my case) occasional topotest failures at bit more. Failure rates in my case in different test runs were between 1% and maybe 20%, so nowhere near the 100% you saw. But anything larger that 0% is bad, so let's continue the investigation.

When the topotest failed, R1 always lacked an "inter-router" summary LSA for R7 that should have been originated by R5. The one from R6 was always there. Since the LSA is missing, the nexthop computation can't possibly do the right thing, so we get missing nexthops. But it doesn't look like the nexthop computation is wrong, it rather looks like R5 didn't originate all LSAs an ABR should originate. I don't have the slightest idea yet why that could be the case, but I'll continue investigation in that direction.

Another observation: When the topotest was successful, the LSDB on R1 looked different in different test runs as well, even though the resulting routes were OK. My gut feeling is that the differences are caused by LSAs that were originated while the network was converging and weren't properly flushed afterwards after better routes existed (i.e. the procedure described in the last paragraph of 12.4.3 in RFC 2328 doesn't seem to work in all cases). But investigation of that is 2nd priority.

I'll probably open a new issue so this gets tracked properly.

gromit1811 · 2024-06-04T13:02:32Z

Ah, just found 772688d. Will give that a try before continuing.

gromit1811 · 2024-06-05T08:38:53Z

It doesn't fix the missing inter-router LSA issue ☹️

But it seems to fix at least the "another observation" issue above: If the number of nexthops matches expectations, the LSDB now seems stable across test runs.

acooks-at-bda · 2024-06-06T03:14:51Z

Hi Martin,

Thanks for investigating and apologies for the delay. I had meant to do some more investigation on my end...

I'm running on a Fedora 40 development environment:

[acooks@frr-dev frr]$ cat /etc/redhat-release 
Fedora release 40 (Forty)
[acooks@frr-dev frr]$ uname -a
Linux frr-dev 6.8.7-300.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Apr 17 19:21:08 UTC 2024 x86_64 GNU/Linux
[acooks@frr-dev frr]$ git log -1 --oneline
217e505a6 (HEAD -> test-217e505) tests: Modify inter-area ECMP topotest to also test redundant ABRs

I asked a colleague for a sanity check and @jlodge-at-bda got the same failure once (1/11) on Debian 12.
Linux debian12 6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux

gromit1811 · 2024-06-11T12:54:55Z

Thanks for the update. I've set up a Fedora 40 VM now but guess what, failure rates are around 10%, comparable to what I get on Debian. Nowhere near the 100% you're seeing. Guess I'll have to accept that and try to debug it this way.

But I'll finally open a new issue now, discussion will continue there.

frrbot bot added the ospfv3 label May 1, 2024

github-actions bot added master size/S labels May 1, 2024

gromit1811 mentioned this pull request May 1, 2024

tests: Modify inter-area ECMP topotest to also test redundant ABRs #15898

Closed

gromit1811 force-pushed the ospf6_ecmp_inter_area_bugfix_15777_pr branch from a58d9a6 to cd14f1e Compare May 1, 2024 19:21

gromit1811 added 2 commits May 3, 2024 09:49

gromit1811 force-pushed the ospf6_ecmp_inter_area_bugfix_15777_pr branch from cd14f1e to 217e505 Compare May 3, 2024 07:49

frrbot bot added the tests Topotests, make check, etc label May 3, 2024

github-actions bot added size/L and removed size/S labels May 3, 2024

riw777 approved these changes May 7, 2024

View reviewed changes

riw777 merged commit ae01d44 into FRRouting:master May 7, 2024
11 checks passed

gromit1811 mentioned this pull request Jun 11, 2024

Missing ECMP nexthops for OSPFv3 inter-area routes #16197

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ospf6d: Fix nexthop computation for inter-area multi-ABR ECMP #15899

ospf6d: Fix nexthop computation for inter-area multi-ABR ECMP #15899

gromit1811 commented May 1, 2024

riw777 left a comment

acooks-at-bda commented May 24, 2024

acooks-at-bda commented May 24, 2024 •

edited

Loading

gromit1811 commented May 27, 2024

gromit1811 commented Jun 4, 2024

gromit1811 commented Jun 4, 2024

gromit1811 commented Jun 5, 2024 •

edited

Loading

acooks-at-bda commented Jun 6, 2024

gromit1811 commented Jun 11, 2024

ospf6d: Fix nexthop computation for inter-area multi-ABR ECMP #15899

ospf6d: Fix nexthop computation for inter-area multi-ABR ECMP #15899

Conversation

gromit1811 commented May 1, 2024

riw777 left a comment

Choose a reason for hiding this comment

acooks-at-bda commented May 24, 2024

acooks-at-bda commented May 24, 2024 • edited Loading

gromit1811 commented May 27, 2024

gromit1811 commented Jun 4, 2024

gromit1811 commented Jun 4, 2024

gromit1811 commented Jun 5, 2024 • edited Loading

acooks-at-bda commented Jun 6, 2024

gromit1811 commented Jun 11, 2024

acooks-at-bda commented May 24, 2024 •

edited

Loading

gromit1811 commented Jun 5, 2024 •

edited

Loading