Missing ECMP nexthops for OSPFv3 inter-area routes #16197
I analyzed one failure (with Git commit 82dcb1d, slightly newer than the one mentioned above) and I think I know what's happening, but not yet why... In this case, R5 originated the expected Inter-Router LSA for R7, but R6 (ultimately) didn't. Full logs attached: ospf6d-startup-r5.log. Events related to the R7 Inter-Router LSA, from both R5 (left) and R6 (right), commented:
Next step is to determine what triggers the false "This is the secondary path to the ASBR, ignore" at 12:37:51.346954.
For reference, this is the topo diagram from the test case:
Humble apologies, @gromit1811. I know why the test was consistently failing on my machine: it was operator error. I had multiple versions of FRR installed due to poor config management. I changed the configure flags at some point and neglected to specify … Now that I've cleaned up my environment and know what I'm running, I ran the ospf6 tests a number of times and noted that this test mostly passes. I did get an unexpected failure on the third run of the full ospf6d set, and then decided to run this test a few more times. It passed 20 consecutive times, so 1 failure in 23 runs overall.
Thanks for the update & no problem - this just means that there isn't an easier way to reproduce this. And since I wasn't able to reproduce your failure rates, I went back to my original environment in the meantime anyway. Update for everybody else watching: no progress, unfortunately.
Back from holidays and other higher-priority tasks 😉 I can now answer my question from above:
If the problem occurs, we seem to have 2 entries for R7 in the brouter_table on the ABR (R6 above, but I've also seen cases where it happened on R5). One is the best path, so the ABR originates an Inter-Router LSA. But the 2nd one isn't the best, so it expires the LSA again, leaving us with no Inter-Router LSA for R7 on the ABR.

Now, should we have 2 entries in the brouter_table? If this is OK, ospf6_intra_brouter_calculation should call ospf6_abr_originate_summary only for the best one. If this is not OK, we should make sure that only the best route is in the brouter_table. I'll continue investigating, unless somebody can already answer the question above.
Are you seeing two router LSAs (type 1s) for R7, or two externals (type 5s)? These are standard areas - there's no NSSA, etc., here, right? We shouldn't see two router LSAs for a single router in our database. Is there any way to grab these two when they exist, to see if they have different serial/sequence numbers, etc.? I wonder if we're receiving a new LSA and just waiting too long to kill the old one?
One Router-LSA for R7 and IIRC one AS-External-LSA. Standard areas, yes. I've got pretty verbose logs from R5 and R6 from when the issue occurs (and for reference also from when it doesn't), so I could look up additional things. But I'm one step further in the meantime and I think I know what's happening, I'm just not 100% sure yet which part of the code is wrong. Sorry for not updating the issue in time, I had the impression I'm mostly talking to myself anyway 😉

My current understanding: The issue occurs while establishing neighborships. There's a time window of a few milliseconds where not all neighborships are fully established yet, and if an ABR receives the wrong LSA exactly at that time, it gets into a weird situation that persists even after all neighborships are OK. If the LSA reception doesn't hit this time window, everything works as expected. That explains why we see this only rarely. And the "wrong LSA" is actually correct at the given time with partial neighborships, but it prevents a "correct LSA" from being originated once neighborships are fully established. "LSA" in this context means Inter-Router LSAs (type 4); all other types look OK.

The gory details: Let's assume the problem occurs on ABR R5 (it could happen on ABR R6 as well). In this case, on R1 we only see 2 ECMP paths to R7's loopback address (i.e. AS-external routes) instead of the expected 3. That happens because R5 doesn't originate its Inter-Router LSA for R7 towards area 0. R6 does that correctly, so R1 sees only 2 paths: R1-R3-R6-R7 and R1-R4-R6-R7. R1-R2-R5-R7 is missing.

Checking the logs a bit closer, we actually see R5 originating its Inter-Router LSA from time to time, but expiring it again immediately afterwards. This is after neighborships are stable. What triggers the problem is the following sequence of events:
IMO there are 2 problems: In step 8, the worse LSA from R6 shouldn't "shadow" the one from R5 and shouldn't cause R5 to expire its own one. And in step 11, R5 should re-examine its LSDB and border router table and decide to originate its own LSA again.

The first problem seems to be caused by R5 having 2 routes towards R7 (prefix 10.254.254.7) in its brouter table: its own one with path type Intra-Area and the one from R6 with path type Inter-Area. Because they have different types, they use different entries in the brouter table. Now when ospf6_intra_brouter_calculation on R5 iterates the brouter table, it first finds R5's own entry (marked "best") and originates an LSA, and in the next iteration it finds the R6 one (not marked "best") and immediately expires its LSA again. I don't know which one of the following variants would be correct:
I don't know yet what's causing the 2nd problem in step 11. Strictly speaking, we have hooks notifying us about brouter table and LSDB updates, but for some reason they don't seem to be sufficient. OSPFv2 seems to handle at least the first problem correctly, so I could probably check how it's implemented there. But the implementation seems to be sufficiently different that they're probably not directly comparable 😦
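The per-entry originate/expire interaction described above can be sketched as a toy model. This is a hypothetical, simplified Python illustration (FRR's actual code is C; `BrouterRoute`, `buggy_origination` and `fixed_origination` are invented names for this sketch, not FRR identifiers). It shows why deciding per brouter-table entry expires the LSA, while deciding once per router ID keeps it:

```python
from dataclasses import dataclass

@dataclass
class BrouterRoute:
    router_id: str
    path_type: str  # "intra-area" or "inter-area" -- distinct table entries
    best: bool      # is this entry marked as the best path?

def buggy_origination(routes, actions):
    # Per-entry decision, as described in the issue: the best entry
    # originates the LSA, then the non-best entry for the *same* router
    # immediately expires it again.
    for r in routes:
        actions.append(("originate" if r.best else "expire", r.router_id))

def fixed_origination(routes, actions):
    # Decide once per router ID: originate iff ANY entry for that router
    # is marked best (one possible fix variant hinted at in the thread).
    by_id = {}
    for r in routes:
        by_id.setdefault(r.router_id, []).append(r)
    for rid, entries in by_id.items():
        if any(e.best for e in entries):
            actions.append(("originate", rid))
        else:
            actions.append(("expire", rid))

# Two entries for R7 on the ABR, as observed in the logs:
table = [
    BrouterRoute("R7", "intra-area", best=True),
    BrouterRoute("R7", "inter-area", best=False),
]

buggy, fixed = [], []
buggy_origination(table, buggy)
fixed_origination(table, fixed)
print(buggy)  # [('originate', 'R7'), ('expire', 'R7')] -- LSA ends up expired
print(fixed)  # [('originate', 'R7')] -- LSA stays originated
```

The same table, iterated two ways, yields opposite end states; the real fix would of course have to live in `ospf6_intra_brouter_calculation` or in how the brouter table is populated.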
Based on tests/topotests/test_ospf6_ecmp_inter_area_bug16197
New test case that reproduces this reliably in #16797. Topology:
Link R6-R7 is down initially, so R1 sees a route to R8 only via R6 (with 2 nexthops, because there are 2 ECMP paths to R6). After link up, we expect additional nexthops to R7, but we don't expect the nexthops to R8 to change, because even though R8 is now reachable via R5 as well, the best path is still only via R6.

What actually happens is that after link up, R1 routes to R8 only via R5 and ignores R6, reducing the number of nexthops to 1. This is because R6 expires its Inter-Router LSA towards area 0 after link up. It does that for the same reason that caused the intermittent failure of the original test case: Before link up, R5 creates an Inter-Router LSA for area 1 based on the one received from R6 via area 0. After link up, R6 receives this via area 1 and creates a border_router table entry based on it, and that shadows the brouter table entry R6 used initially to originate its Inter-Router LSA towards area 0. So R6 expires that LSA, and thus R1 can no longer use this path.

Note that after R6 has expired its LSA, R5 receives the LS update and expires its LSA for area 1 as well. But even that doesn't resolve the situation on R6. BTW, we're only looking at AS-external routes here, which every router originates (based on loopback addresses redistributed into OSPFv3).
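The expected behaviour described above can be illustrated with a toy ECMP computation. This is a hypothetical sketch; the graph below is an invented approximation with unit link costs, not the exact #16797 topology. The point it demonstrates: bringing up an additional, more expensive path must not change the ECMP nexthop set, because the best-path cost is unchanged.

```python
import heapq

def dijkstra(graph, src):
    """Plain Dijkstra returning shortest-path costs from src."""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def ecmp_nexthops(graph, src, dst):
    """Neighbours of src lying on a minimum-cost path to dst."""
    best = dijkstra(graph, src).get(dst, float("inf"))
    return {nbr for nbr, w in graph[src].items()
            if w + dijkstra(graph, nbr).get(dst, float("inf")) == best}

# Before link up: R1 reaches R8 only via R6, over two ECMP branches.
before = {
    "R1": {"R3": 1, "R4": 1},
    "R3": {"R1": 1, "R6": 1},
    "R4": {"R1": 1, "R6": 1},
    "R6": {"R3": 1, "R4": 1, "R8": 1},
    "R8": {"R6": 1},
}
print(ecmp_nexthops(before, "R1", "R8"))  # {'R3', 'R4'}

# After link up: an extra, more expensive path via R2-R5-R7 appears.
# The nexthop set towards R8 must stay the same.
after = {k: dict(v) for k, v in before.items()}
after["R1"]["R2"] = 1
after["R2"] = {"R1": 1, "R5": 1}
after["R5"] = {"R2": 1, "R7": 1}
after["R7"] = {"R5": 1, "R8": 1}
after["R8"]["R7"] = 1
print(ecmp_nexthops(after, "R1", "R8"))  # still {'R3', 'R4'}
```

In the bug, the second call effectively loses a nexthop instead, because the ABR stops originating the Inter-Router LSA that advertised the original best path.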
Logs from a test run with #16797 (note that they contain additional log messages, mostly marked with "XXX" - feel free to ignore). Commented LSA events extracted from these logs:
The expire/originate/expire cycle is typical for this case: expire first on LSA reception, then iterate the brouter table and find the intra-area path first (-> originate), followed by the inter-area path from the received LSA (-> expire).
I now also analyzed the 2nd problem mentioned in #16197 (comment) (the failure to re-originate the Inter-Router LSA on R6 towards area 0 after the temporary Inter-Router LSA from R5 towards area 1 has been expired):
My current conclusion, only mentioned on Slack so far: I'm slowly getting the impression that the actual bug is … Unfortunately, with my current understanding of the code, doing that isn't exactly trivial...
Description
Since 217e505, topotest ospf6_ecmp_inter_area intermittently fails due to an incorrect number of nexthops for certain routes. See the comments on #16055, where this was mentioned initially, and the further discussion in the comments on #15899.
Failure rates are ~10% in my tests but vary wildly (I also saw 200 successful runs in a row). @acooks-at-bda reported a 100% failure rate in his tests, but I've been unable to get anywhere near that rate when trying to reproduce his environment.
When the error occurs, the initial pre-condition nexthop check in frr/tests/topotests/ospf6_ecmp_inter_area/test_ospf6_ecmp_inter_area.py (line 195 in d5b0c76) fails.
Note: This report is mostly a placeholder to record the fact that I'm investigating this. I can't dedicate too much time to it, so if somebody wants to help or is faster than me, be my guest 😉
Version
How to reproduce
pytest tests/topotests/ospf6_ecmp_inter_area
Expected behavior
Test succeeds
Actual behavior
Test sometimes fails with errors like this (the nexthop pattern is not always exactly the same):
Additional context
The actual issue seems to be that sometimes one of the 2 ABRs (R5 and R6) doesn't originate an Inter-Router LSA (type 4) when it should, causing a path to the destination (R7) to be lost. I don't know yet why that happens.
Note: The problem is most likely caused neither by 217e505 nor by b925570 (the bugfix which the topotest update is trying to verify) but existed before. It was only noticed now because there was no test case for inter-area ECMP routes before.
Checklist