Commit
Fixes a symbol group lookup table issue (rapidsai#14561)
This PR fixes an issue in the finite-state transducer's (FST) lookup table that is used to map an input character to a symbol group. A symbol group is an integer that is subsequently used to select a row from the transition table. The FST uses an `OTHER` symbol group, to which all symbols are mapped that are not explicitly mapped to a symbol group.

For example, say we have two symbol groups, one that contains braces (`{`, `}`) and one that contains brackets (`[`, `]`):

```
const std::vector<std::string> symbol_groups = {"{}", "[]"};

// symbol (ASCII value) -> symbol group
// {  (123)             -> 0
// }  (125)             -> 0
// [  (91)              -> 1
// ]  (93)              -> 1
// <anything else>      -> 2 ('OTHER')
```

So the lookup table will look something like this:

```
// lut[0]   -> 2
// lut[1]   -> 2
// lut[2]   -> 2
// ...
// lut[91]  -> 1
// lut[92]  -> 2
// lut[93]  -> 1
// ...
// lut[123] -> 0
// lut[124] -> 2
// lut[125] -> 0
// lut[126] -> 2
```

When running the FST, we want to limit the range of lookups that we have to perform, so we bound the character to look up to the one-past-the-last index that was explicitly provided, because anything that comes after that index maps to the `OTHER` symbol group anyway. In the above example, the highest provided index is `125` (`}`) and one past it is index `126`. We clamp any character value above `126` to `126`. The _number_ of valid items is therefore `126+1`, i.e. `127`. So the lookup at runtime becomes:

```
return sym_to_sgid[min(static_cast<SymbolGroupIdT>(symbol), num_valid_entries - 1U)];
```

Previously, we were computing the number of valid items incorrectly. The issue didn't surface because most of our FST usage included `}`, which is only followed by `~` and `DEL`. Those two characters are only valid as part of string values anyway, so the incorrect count would not have changed semantics there.

Authors:
- Elias Stehle (https://github.com/elstehle)
- Ray Douglass (https://github.com/raydouglass)

Approvers:
- Nghia Truong (https://github.com/ttnghia)

URL: rapidsai#14561
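For readers who want to check the `126+1` arithmetic, the clamped lookup described in the commit message can be sketched on the host as follows. This is a minimal illustration under assumptions, not the cuDF implementation: `SymbolGroupLut` and `make_lut` are hypothetical names, the table is a plain `std::vector`, and only `sym_to_sgid` and `num_valid_entries` are taken from the snippet in the message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical host-side sketch; the type and function names are illustrative only.
struct SymbolGroupLut {
  std::vector<std::uint32_t> sym_to_sgid;  // indexed by character value
  std::uint32_t num_valid_entries;         // highest explicit symbol + 2

  // Clamp the character to the last table entry, which always holds OTHER.
  std::uint32_t lookup(char symbol) const {
    assert(num_valid_entries > 0U);
    auto const idx = static_cast<std::uint32_t>(static_cast<unsigned char>(symbol));
    return sym_to_sgid[std::min<std::uint32_t>(idx, num_valid_entries - 1U)];
  }
};

SymbolGroupLut make_lut(std::vector<std::string> const& symbol_groups) {
  // OTHER gets the id that follows the last explicit symbol group.
  auto const other_id = static_cast<std::uint32_t>(symbol_groups.size());

  // Find the highest character value that was explicitly provided.
  std::uint32_t max_symbol = 0;
  for (auto const& group : symbol_groups) {
    for (unsigned char c : group) {
      max_symbol = std::max<std::uint32_t>(max_symbol, c);
    }
  }

  // Entries 0..max_symbol, plus the one-past-the-last entry that maps to OTHER.
  std::uint32_t const num_valid_entries = max_symbol + 2U;
  std::vector<std::uint32_t> table(num_valid_entries, other_id);

  // Overwrite the explicitly mapped symbols with their group id.
  for (std::uint32_t sgid = 0; sgid < symbol_groups.size(); ++sgid) {
    for (unsigned char c : symbol_groups[sgid]) {
      table[c] = sgid;
    }
  }
  return {table, num_valid_entries};
}
```

With the symbol groups `{"{}", "[]"}` from the example, this sketch yields `num_valid_entries == 127`, and every character at or above `126` (including `~` and `DEL`) resolves through the single clamped entry `lut[126]`.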