Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: internal executor iterator pattern makes it easy to create invalid uses of txn #99209

Open
knz opened this issue Mar 22, 2023 · 5 comments
Labels
A-sql-executor SQL txn logic C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-queries SQL Queries Team

Comments

@knz
Copy link
Contributor

knz commented Mar 22, 2023

This issue describes the more general problem around #99171.

The following code pattern is both very easy to introduce/use (because it's convenient) and also extremely invalid because it violates the Txn API contract:

it, _ := internalExecutor.QueryIteratorEx(... txn1 ...)
defer it.Close()

for hasNext, err = it.Next(ctx); hasNext; hasNext, err = it.Next(ctx) {
    // Here the iterator is active, the plan above is running
    ...
    /* EITHER: */
    result := internalExecutor.Exec( ... txn1 /* same txn as above */ ... )
    /* OR:  */
    /* do something else with the txn object that issues KV requests */
    /* THIS INCLUDES INDIRECT CALLS e.g. via APIs for user authorization / privilege checks */
    ...
}

The following is also problematic because the query plan may have started to execute before the first Next call:

it, _ := internalExecutor.QueryIteratorEx(... txn1 ...)
defer it.Close()

/* EITHER: */
result := internalExecutor.Exec( ... txn1 /* same txn as above */ ... )
/* OR:  */
/* do something else with the txn object that issues KV requests */
/* THIS INCLUDES INDIRECT CALLS e.g. via APIs for user authorization / privilege checks */

for hasNext, err = it.Next(ctx); hasNext; hasNext, err = it.Next(ctx) {
   // ... 
}

The issue is that while the iterator is active, the kv.Txn may have some LeafTxn objects active and issuing KV requests. During that time, it is not valid to concurrently use the RootTxn object associated with the kv.Txn -- in particular not via another concurrent use of the internal executor, but also not via other KV interfaces.

This problem already existed in the past but through the v23.1 cycle we have introduced more uses of the iterator pattern, so we have become more sensitive to it.

Jira issue: CRDB-25758

@knz knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-sql-executor SQL txn logic T-sql-queries SQL Queries Team labels Mar 22, 2023
@knz
Copy link
Contributor Author

knz commented Mar 22, 2023

@stevendanna has an insight:

The error is easy to miss in tests. Some auth calls go to the database. Others read from caches that are often [already] populated.

@knz
Copy link
Contributor Author

knz commented Mar 22, 2023

xref #42766

craig bot pushed a commit that referenced this issue Mar 23, 2023
99325: sql: disable the streamer for queries which might use internal executor r=yuzefovich a=yuzefovich

**row: allow for overlapping spans for range lookups**

This commit fixes a bug with range lookup joins when the streamer is disabled. Previously, we could hit an error about "unordered spans" when initializing the fetch for range lookups if batch limits are used. Overlapping and unordered spans are actually expected for range lookups, so this commit simply disables the error check. It also adds an extensive comment for how exactly the fetcher will behave in such scenario.

There is no regression test because it'll be introduced in the following commit (implicitly) and no release note since for the bug to occur the streamer must be disabled (and on 22.2 it would mean non-default config).

Fixes: #99330.

Release note: None

**sql: disable the streamer for queries which might use internal executor**

This commit fixes a possible violation of `kv.Txn` API that was introduced when we enabled the usage of the streamer by default in 22.2.0. Namely, the problem is as follows: the streamer requires the LeafTxn to be used since it can perform reads concurrently with other parts of the execution flow; however, if the flow contains a wrapped `planNode` which is using the internal executor, the query issued by the IE might use the RootTxn. As a result, we might have concurrency between the LeafTxn of the "outer" query and the RootTxn of the "inner" query which is not allowed.

The fix in this commit is "quick" and is disallowing the usage of the streamer in more cases than strictly necessary. In particular: 1) it makes it so that the streamer is not used by the flow that has any `planNode`s (even if they don't use the IE at all and don't interact with the `kv.Txn` otherwise either). Auditing each `planNode` implementation is error-prone, and this "quick" fix should be more reliable. 2) it makes it so that the streamer is disabled for all queries issued by the IE.

The thinking behind the second change is as follows: if the query issued by the IE uses the streamer, then it'll use the LeafTxn. The IE exposes the iterator API, so it might be possible for the user of the IE to keep the iterator "open" while returning the control flow back to the "outer" flow. If that "outer" flow is using the RootTxn, we can have the same concurrency violation with the "paused" IE iterator performing some reads in the streamer.

Overall, this is "not terrible, not great" - we effectively fallback to the pre-22.2 behavior for some types of queries. For the queries that do process a lot of data, the streamer is likely to still be enabled.

Fixes: #99093.
Informs: #99209.

Release note (bug fix): Since 22.2.0 CockroachDB could crash with "attempting to append refresh spans after the tracked timestamp has moved forward" error in some rare cases (most likely when querying `pg_catalog` and `crdb_internal` virtual tables), and this has now been fixed. The workaround before upgrading would be to run `SET CLUSTER SETTING sql.distsql.use_streamer.enabled = false;`.

99375: cli: don't fail drain cmd if cluster settings aren't available r=rafiss a=rafiss

This makes the command more robust, since it should still work even if the settings cannot be fetched. If the cluster is not fully available, then this step may fail, but it should not prevent a drain command on a specific node.

informs #98742
Release note: None

99392: sql: fix internal error when calling ts_rank with array longer than 4 r=rytaft a=rytaft

Prior to this commit, an internal error could occur when an array longer than length 4 was passed to `ts_rank`. This commit fixes the error by truncating the array to length 4.

Fixes #99334

Release note: None

99413: sql/tests: give more memory to TestSchemaChangesInParallel r=rafiss a=rafiss

The test has previously failed due to running out of memory. Since it's an expensive test, this is somewhat expected.

fixes #98850
Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
@michae2
Copy link
Collaborator

michae2 commented Mar 28, 2023

See also #41992

@michae2
Copy link
Collaborator

michae2 commented Mar 29, 2023

The following code pattern is both very easy to introduce/use (because it's convenient) and also extremely invalid because it violates the Txn API contract:

Is this pattern still invalid if only the RootTxn is used?

@knz
Copy link
Contributor Author

knz commented Mar 30, 2023

No then it's not invalid, but beware that a single RootTxn cannot be used from concurrent goroutines either.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-executor SQL txn logic C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-queries SQL Queries Team
Projects
Status: Backlog
Development

No branches or pull requests

2 participants