Skip to content

Commit

Permalink
Fixes a symbol group lookup table issue (#14561)
Browse files Browse the repository at this point in the history
This PR fixes an issue in the finite-state transducer's (FST) lookup table that is used to map an input character to a symbol group. A symbol group is a an integer that's subsequently used to select a row from the transition table.


The FST uses a `OTHER` symbol group, to which all symbols are mapped that are not explicitly mapped to a symbol group.

E.g., say, we have two symbol groups, one that contains braces (`{`,`}`) and one that contains brackets (`[`,`]`).
```
const std::vector<std::string> symbol_groups = {"{}", "[]"};

// symbol (ASCII value) -> symbol group
// { (123)              -> 0
// } (125)              -> 0
// [ (91)               -> 1
// ] (93)               -> 1
// <anything else>      -> 2 ('OTHER')

So the lookup table will look something like this:

// lut[0] -> 2
// lut[1] -> 2
// lut[2] -> 2
// ...
// lut[91] -> 1
// lut[92] -> 2
// lut[93] -> 1
// ...
// lut[123] -> 0
// lut[124] -> 2
// lut[125] -> 0
// lut[126] -> 2
```

Now, when running the FST, we want to limit the range of lookups that we have to perform, so we bound the character to lookup to one-past-the-last index that was explicitly provided, because anything that comes after that index maps to the `OTHER` symbol group anyways. In the above example, the highest provided index is `125` (`}`) and one past it is index `126`. We clamp any character value above `126` to `126`. The _number_ of valid items is `126+1`. 

So the lookup at runtime becomes:
```
return sym_to_sgid[min(static_cast<SymbolGroupIdT>(symbol), num_valid_entries - 1U)];
```

Previously, we were computing number of valid items wrongly. And the issue didn't surface because most of our FST usage included `}`, which is only succeeded by `~` and `DEL`, which are actually anyways only valid as part of string values, and hence wouldn't have changed semantics there.

Authors:
  - Elias Stehle (https://github.com/elstehle)
  - Ray Douglass (https://github.com/raydouglass)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)

URL: #14561
  • Loading branch information
elstehle authored Dec 7, 2023
1 parent d8f4975 commit f5dca59
Show file tree
Hide file tree
Showing 2 changed files with 9 additions and 7 deletions.
14 changes: 8 additions & 6 deletions cpp/src/io/fst/lookup_tables.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -104,7 +104,7 @@ class SingleSymbolSmemLUT {
SymbolGroupIdT no_match_id = symbol_strings.size();

// The symbol with the largest value that is mapped to a symbol group id
SymbolGroupIdT max_base_match_val = 0;
SymbolGroupIdT max_lookup_index = 0;

// Initialize all entries: by default we return the no-match-id
std::fill(&init_data.sym_to_sgid[0], &init_data.sym_to_sgid[NUM_ENTRIES_PER_LUT], no_match_id);
Expand All @@ -115,17 +115,19 @@ class SingleSymbolSmemLUT {
for (auto const& sg_symbols : symbol_strings) {
// Iterate over all symbols that belong to the current symbol group
for (auto const& sg_symbol : sg_symbols) {
max_base_match_val = std::max(max_base_match_val, static_cast<SymbolGroupIdT>(sg_symbol));
max_lookup_index = std::max(max_lookup_index, static_cast<SymbolGroupIdT>(sg_symbol));
init_data.sym_to_sgid[static_cast<int32_t>(sg_symbol)] = sg_id;
}
sg_id++;
}

// Initialize the out-of-bounds lookup: sym_to_sgid[max_base_match_val+1] -> no_match_id
init_data.sym_to_sgid[max_base_match_val + 1] = no_match_id;
// Initialize the out-of-bounds lookup: sym_to_sgid[max_lookup_index+1] -> no_match_id
auto const oob_match_index = max_lookup_index + 1;
init_data.sym_to_sgid[oob_match_index] = no_match_id;

// Alias memory / return memory requirements
init_data.num_valid_entries = max_base_match_val + 1;
// The number of valid entries in the table (including the entry for the out-of-bounds symbol
// group id)
init_data.num_valid_entries = oob_match_index + 1;
init_data.pre_map_op = pre_map_op;

return init_data;
Expand Down
2 changes: 1 addition & 1 deletion cpp/tests/io/fst/fst_test.cu
Original file line number Diff line number Diff line change
Expand Up @@ -140,7 +140,7 @@ TEST_F(FstTest, GroundTruth)
R"("author": "Nigel Rees",)"
R"("title": "Sayings of the Century",)"
R"("price": 8.95)"
R"(} )"
R"(~ )"
R"({)"
R"("category": "reference",)"
R"("index:" [4,{},null,{"a":[]}],)"
Expand Down

0 comments on commit f5dca59

Please sign in to comment.