Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unsatisfiable WHERE clause leads to exception instead of empty result #13152

Open
hosswald opened this issue Sep 28, 2022 · 11 comments
Open

Unsatisfiable WHERE clause leads to exception instead of empty result #13152

hosswald opened this issue Sep 28, 2022 · 11 comments

Comments

@hosswald
Copy link
Contributor

Affected Version

0.23.0

Description

The following query

SELECT *
FROM "druid"."mydata"
WHERE false
ORDER BY "__time" DESC

leads to the following error:
druid error: Unsupported operation (org.apache.druid.java.util.common.UOE): Time-ordering on scan queries is only supported for queries with segment specs of type MultipleSpecificSegmentSpec or SpecificSegmentSpec...a [MultipleIntervalSegmentSpec] was received instead.

false can be an arbitrary unsatisfiable boolean expression. I came across it when trying to filter for a specific exact timestamp in Superset, which is translated into the unsatisfiable expression

WHERE MILLIS_TO_TIMESTAMP(recordTimeParsed) >= TIME_PARSE('2022-09-25T00:00:00')
  AND MILLIS_TO_TIMESTAMP(recordTimeParsed) <  TIME_PARSE('2022-09-25T00:00:00')

Since it seems to be valid vor any unsatisfiable WHERE clause, I'm omitting details about my specific dataset.

@gianm
Copy link
Contributor

gianm commented Sep 28, 2022

I tried this locally and what happens here is:

  1. The unsatisfiable filter leads to the table being swapped out with an inline empty recordset. This is good, it means the planner realizes no data can possibly match the filter.
  2. The query runs as normal on the inline empty recordset. Here it fails because scan query does not support ordering on inline data.

We can fix it by allowing the scan query to order inline data.

@abhishekagarwal87
Copy link
Contributor

Yeah. I was thinking if we can just make a special case of empty inline data. Since we would be returning an empty sequence anyway. @599166320 was also making some changes in this area recently. that could be relevant.

@hosswald hosswald changed the title Unsatisfyable WHERE clause leads to exception instead of empty result Unsatisfiable WHERE clause leads to exception instead of empty result Sep 29, 2022
@599166320
Copy link
Contributor

@abhishekagarwal87 I recently improved the sorting function of common fields, which should solve the sorting problem of inline data together. However, I have recently closed this PR for some reasons. If you are interested, I will reopen this pr and let you review it.

@abhishekagarwal87
Copy link
Contributor

@paul-rogers - since you were reviewing that PR, can you take that PR to completion?

@paul-rogers
Copy link
Contributor

paul-rogers commented Oct 3, 2022

In the DB community, there is often a need to get the schema of a query without running the query. Typically a UI wants to work out the table or chart structure it will use to process data. Or, some bit of code generation wants to know how to create code to handle this particular query structure.

The fancy DBs provides a PREPARE statement to get the schema. The crude-but-effective approach is to use WHERE 0=1 or WHERE FALSE as in this example.

Each of these is a message to the planner that the client only needs the result set schema (signature), not the data. Given this, the proper approach is to plan the query, gather the signature, then go down a "no data" path. In Druid, that just means returning an empty Sequence with no execution behind it.

Given this, the need to handle sorting is moot on this code path. Because the "no data" path is a well-known idiom, I'd guess that Calcite has a way to tell us that the result set will always be empty. We can use that to trigger our no-data path.

@599166320
Copy link
Contributor

Here, I want to talk about an issue related to inline data sources.

At present, Druid can use the orderBy query of scanquery for inlineDataSources with data, but the query result set is not sorted, which I think is a problem.

For example, the following query results are not sorted:

{
  "queryType": "scan",
  "dataSource": {
    "type": "inline",
    "columnNames": ["__time","country", "city"],
    "rows": [
      [1664555457209,"United States", "San Francisco"],
      [1664555457219,"Canada", "Calgary"],
      [1664555457229,"Canada", "Calgary"],
      [1664555457239,"Canada", "Calgary"],
      [1664555457249,"Canada", "Calgary"]
    ]
  },
  "columns": ["__time","country", "city"],
  "intervals": ["0000/3000"],
    "orderBy": [
    {
      "columnName": "__time",
      "order": "descending"
    },
    {
      "columnName": "city",
      "order": "descending"
    }
  ],
    "limit": 1001
}

image
I think that for the result set of an inlineDataSource, whether there is data or no data, as long as it is sorted, it can run the same path to simplify processing.

@gianm
Copy link
Contributor

gianm commented Oct 4, 2022

I sketched out an idea for this issue. Here it is: master...gianm:druid:scan-inline-sort. I haven't made a PR yet, since I'd need to add some tests.

The idea in the patch goes beyond the empty-data case. It also handles the case where we really do have inline data, and want to actually sort it. By skipping getIntervalsFromSpecificQuerySpec, and using Intervals.ONLY_ETERNITY instead, we are basically saying we're going to sort all data instead of doing it interval-by-interval. That's fine when we have only one query runner. (Which is the case for inline data.)

@abhishekagarwal87
Copy link
Contributor

@gianm - but what piece of code does the sorting of inline data itself by the time column? I was thinking of limiting the size of inline data and then feeding it to stableLimitingSort (that uses a priority queue for ordering).

@gianm
Copy link
Contributor

gianm commented Oct 6, 2022

Good question. That's another reason that little sketch isn't ready to be a PR 🙂

I think the answer is, if there's a limit then it's done by stableLimitingSort. If there's no limit then I don't see that it actually does get sorted. That'd need to change.

Your idea sounds good. What would we do if there's no limit? Ideally, since it's inline data and it's not likely to be very big, we sort it anyway. It could be done either in the mergeRunners call, or in the runner created by createRunner (since there's only one runner for inline data). Wondering what you think.

@gianm
Copy link
Contributor

gianm commented Oct 6, 2022

Thinking about it a bit more. I think it makes the most sense to do the sorting in the runner from createRunner. I actually think that is already happening: check out the makeCursors call in ScanQueryEngine. It is either ascending or descending based on the time order of the query.

So mergeRunners just needs to merge runners that are already sorted. In the inline case, there is only one runner, so there's nothing to do. We can return it as-is.

What do you think?

@abhishekagarwal87
Copy link
Contributor

I can't spot the sorting operation in the runner. The data from physical segments are already sorted by time. That ordering is used to decide if the data needs to be reversed. That's why we are only passing the type of ordering (Asc/desc) down the makeCursors call and not the columns on which to order by. Are you suggesting passing the columns to makeCursors and then doing the sorting in RowBasedStorageAdapter?
Or you are suggesting doing the sorting in the createRunner method itself? The former seems a simpler approach to implement. We will just materialize the rows into a list and sort the list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants