Consistent empty multi-value dimension behavior for groupBy / topN #5897
Comments
I could see a few ways to solve this:
None of these is perfect. I think 3 or 4 sounds best? Probably 4? Number 3 strikes me as the more 'pure' approach but I could see it being pretty unintuitive for people. And potentially not as useful as number 4.
I would vote for 3. We should avoid an exception like the one in 4 as much as we can. Also, 3 can be easily expressed in SQL like below. SELECT explode(multi_value_col) FROM table GROUP BY explode(multi_value_col); Regarding missing rows, it would be fine if we write good documentation so that users notice this behavior before writing queries.
I would advocate for 3 also. I think 4 would bring a lot of annoying baggage in the future.
This issue has been marked as stale due to 280 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.
Thinking about this again, I am actually more excited about option 2 than I was in the past, especially now that we have proper arrays. It seems reasonable to me to say that empty MVDs should be treated by all grouping engines as if they were a null value (as |
Currently, groupBy and topN treat empty multi-value dimensions differently. groupBy treats them like nulls (i.e. empty strings) and topN ignores them (they don't contribute to the results at all).
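The difference between the two behaviors can be sketched in a few lines of Python. This is only an illustrative model, not Druid code: the rows, the `tags` dimension name, and both helper functions are hypothetical, and `None` stands in for the null group.

```python
from collections import Counter

# Hypothetical rows: "tags" is a multi-value dimension and may be empty.
ROWS = [
    {"tags": ["a", "b"]},
    {"tags": ["a"]},
    {"tags": []},
]

def count_empty_as_null(rows, dim):
    """groupBy-like model: an empty value list is grouped as a single null."""
    counts = Counter()
    for row in rows:
        # An empty list is replaced by [None], so the row lands in the null group.
        for value in row[dim] or [None]:
            counts[value] += 1
    return dict(counts)

def count_empty_ignored(rows, dim):
    """topN-like model: a row with an empty value list contributes to no group."""
    counts = Counter()
    for row in rows:
        for value in row[dim]:
            counts[value] += 1
    return dict(counts)
```

Under this model, the third row adds a null group in the first function and is simply absent from the result of the second.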
Consider a dataset called tweets that has tweets, with a multi-value dimension hashtags listing the hashtags found in a tweet. It could be empty (a tweet with no hashtags) or it could have potentially multiple values, for a tweet like this one: https://twitter.com/sullcrom/status/1006208351095676929.
The groupBy engine returns:
And topN returns:
The Druid docs don't seem to specify which behavior is correct for grouping: http://druid.io/docs/latest/querying/multi-value-dimensions.html. We should define one of these behaviors as correct and make the two engines consistent. I think aesthetically I prefer how topN works (why should empty lists be treated like a list containing a null?), but I am not sure how to reconcile that with the possibility that a groupBy could group by a multi-value dimension and a single-value dimension. What if you group by hashtags and username, and some user never uses any hashtags? Should the fact that hashtags is empty make the rows implode, and that user would never show up in the results?
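The hashtags and username question can be made concrete with a small Python model. It is a sketch under stated assumptions, not how either Druid engine is implemented: the sample tweets, the `group_counts` helper, and its `empty_as_null` flag are all hypothetical, and grouping is modeled as a cross product of each row's dimension values.

```python
from collections import Counter
from itertools import product

# Hypothetical sample: bob never uses any hashtags.
TWEETS = [
    {"username": "alice", "hashtags": ["druid", "olap"]},
    {"username": "bob", "hashtags": []},
]

def group_counts(rows, dims, empty_as_null):
    """Count rows per group key, expanding multi-value dimensions.

    empty_as_null=True  models treating an empty list as a single null.
    empty_as_null=False models ignoring empty lists entirely.
    """
    counts = Counter()
    for row in rows:
        value_lists = []
        for dim in dims:
            raw = row[dim]
            # Single-value dimensions are wrapped so every dimension is a list.
            values = list(raw) if isinstance(raw, list) else [raw]
            if not values and empty_as_null:
                values = [None]
            value_lists.append(values)
        # product() yields nothing if any list is empty, so the row "implodes".
        for key in product(*value_lists):
            counts[key] += 1
    return dict(counts)
```

In this model, calling `group_counts(TWEETS, ["hashtags", "username"], empty_as_null=False)` produces no key containing bob, while `empty_as_null=True` keeps him under a null hashtag.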