sql: panic in BitArray.Format #31115
Comments
assigning to @knz for triage
I dug a little bit and managed to capture one of these bit arrays that can't be formatted:
Ok, here's an instant repro:
The spans that partition creates involve invoking Next() on the primary key for the listed bit array, and that key can't be pretty-printed without panicking.
@benesch found out that Next() was called on the PK to increase it to a value that in turn makes it invalid. However, I deduce that some code somewhere is trying to manually compute a Next() value on the raw bytes of the PK without using the Next code appropriate for the column type. What code would that be?
I believe it’s in partitionccl. The example I posted creates a partition with just one bit array. This results in a range with bounds [key, key.Next()), and the Next() operation is on a roachpb.Key, not a bit array datum. I don’t think the solution here is to avoid this call to Next(). We call Next() on keys all over the place, so I’m sure there’s another code path that could Next() a bit array key and then try to print it. I think we have to teach the bit array printer to handle nonsense lastBitArray values without panicking.
test
Apologies. Everywhere I've said Next() above I've been meaning to say PrefixEnd().
So having a next() function that simply bumps the last byte is not only bad because of this pretty-printing bug; it might also be semantically incorrect. If, say, I have a bit array, the symbolic "next" value is:

The canonical encoding for this is 3 varints: the XXXXX bits, then a 0 varint, then a lastBitsUsed field set to 1. The bad encoding produced by roachpb.Key.Next under scrutiny produces 2 varints: the XXXXX bits, then a lastBitsUsed field set to 65. So if we just look at the first 2 varints, the bad encoding sorts after the good. In fact, it sorts "weirdly":
There may be some accidental magic occurring because of the tag bytes in between the varints. But maybe the lesson here is that there should be an additional tag byte after the lastBitsUsed varint, so that the roachpb.Key.Next() method would bump the tag byte and not the actual varints that participate in the bit array value?
please ignore my previous comment -- if the concern is in PrefixEnd() then my comment does not apply.
My question here is: why is the pretty-printing code for roachpb.Keys, originally meant for debugging only, involved in a user-visible bug?
Looks like it's being used to create the string description of the new replica after a split (see the stack trace in the original issue text).
It is invalid for a bit array's lastBitsUsed field to be greater than 64. The FromEncodingParts function, however, would happily construct an invalid bitarray if given a too-large lastBitsUsed value. Teach FromEncodingParts to return an error instead.

This presented as a panic when attempting to pretty-print a key with a bitarray whose lastBitsUsed encoded value was 65. Such a key can be created by calling PrefixEnd on a key with a bitarray whose lastBitsUsed value is 64. By returning an error instead, the pretty-printing code will try again after calling UndoPrefixEnd and be able to print the key.

Fix cockroachdb#31115.

Release note: None
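A sketch of the validation the commit describes. `BitArray`, `fromEncodingParts`, and `numWordBits` here are pared-down hypothetical versions of the real types in CockroachDB's bitarray package, shown only to illustrate the shape of the fix:

```go
package main

import "fmt"

// numWordBits matches the 64-bit word size used by the bit-array encoding.
const numWordBits = 64

// BitArray is a pared-down stand-in for the real bitarray.BitArray.
type BitArray struct {
	words        []uint64
	lastBitsUsed uint64
}

// fromEncodingParts mirrors the fix: instead of silently constructing an
// invalid value, it rejects any lastBitsUsed greater than 64 so callers
// (like the key pretty-printer) can recover, e.g. via UndoPrefixEnd.
func fromEncodingParts(words []uint64, lastBitsUsed uint64) (BitArray, error) {
	if lastBitsUsed > numWordBits {
		return BitArray{}, fmt.Errorf(
			"FromEncodingParts: lastBitsUsed must not exceed %d, got %d",
			numWordBits, lastBitsUsed)
	}
	return BitArray{words: words, lastBitsUsed: lastBitsUsed}, nil
}

func main() {
	// lastBitsUsed=65 is exactly what a PrefixEnd'd key decodes to.
	if _, err := fromEncodingParts([]uint64{0xAB}, 65); err != nil {
		fmt.Println("error:", err)
	}
}
```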
31556: importccl: re-enable job control tests r=mjibson a=mjibson

I tracked down the problem. It was that after the CANCEL JOB was issued, sometimes the 2nd stage of the IMPORT (the shuffle) would have started, and sometimes it wouldn't have. If it did not start, then RunJob would block forever trying to send on the allowResponse chan. Fix this by making a draining goroutine after the first block.

Closes #24623
Closes #24658

Release note: None

31627: storage: remove spurious call to maybeInlineSideloadedRaftCommand r=nvanbenschoten,benesch a=tschottdorf

Entries are "thinned" only when passed to `r.append()` (i.e. written to disk) and they are always returned "fat" from `Entries()` (i.e. Raft's way to get entries from disk). Consequently Raft never sees thin entries and won't ask us to commit them.

Touches #31618.

Release note: None

31695: bitarray: don't allow FromEncodingParts to return invalid bit array r=knz a=benesch

It is invalid for a bit array's lastBitsUsed field to be greater than 64. The FromEncodingParts function, however, would happily construct an invalid bitarray if given a too-large lastBitsUsed value. Teach FromEncodingParts to return an error instead. This presented as a panic when attempting to pretty-print a key with a bitarray whose lastBitsUsed encoded value was 65. Such a key can be created by calling PrefixEnd on a key with a bitarray whose lastBitsUsed value is 64. By returning an error instead, the pretty-printing code will try again after calling UndoPrefixEnd and be able to print the key.

Fix #31115.

Release note: None

31697: partitionccl: deflake TestDropIndexWithZoneConfigCCL r=danhhz,eriktrinh a=benesch

A particularly adversarial goroutine schedule can cause this test to observe the moment in time where the data is dropped but the zone config is not. Deflake by retrying the check for the dropped zone config until it succeeds (or times out).

Fix #31678.

Release note: None

Co-authored-by: Matt Jibson <matt.jibson@gmail.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
I'm looking to re-enable the partitioning tests, which were skipped as flaky a while ago. They generate random datums of the various column types we support, partition by those datums, and run some checks. The TestInitialPartitioning/BIT one is panicking.
Seems to repro pretty quickly under stress, but here's one of the random values I saw fail in case that's easier: