Consistent empty multi-value dimension behavior for groupBy / topN #5897

gianm · 2018-06-22T16:42:05Z

Currently, groupBy and topN treat empty multi-value dimensions differently. groupBy treats them like nulls (i.e. empty strings) and topN ignores them (they don't contribute to the results at all).

Consider a dataset called tweets that has tweets, with a multi-value dimension hashtags listing the hashtags found in a tweet. It could be empty (a tweet with no hashtags) or it could have potentially multiple values, for a tweet like this one: https://twitter.com/sullcrom/status/1006208351095676929.

select hashtags, count(*)
from tweets
where floor(__time to hour) = timestamp '2018-06-22 00:00:00'
group by 1
order by 2 desc
limit 5

The groupBy engine returns:

hashtags,EXPR$1
,118719
KCAMexico,646
NBADraft,518
เป๊กผลิตโชค,249
BrasilGanha,199

And topN returns:

hashtags,EXPR$1
KCAMexico,646
NBADraft,518
เป๊กผลิตโชค,249
BrasilGanha,199
GOT7,159

The Druid docs don't seem to specify which behavior is correct for grouping: http://druid.io/docs/latest/querying/multi-value-dimensions.html. We should define one of these behaviors as correct and make the two engines consistent. I think aesthetically I prefer how topN works — why should empty lists be treated like a list containing a null? — but I am not sure how to reconcile that with the possibility that a groupBy could group by a multi-value dimension and a single-value dimension. What if you group by hashtags and username, and some user never uses any hashtags? Should the fact that hashtags is empty make the rows implode, and that user would never show up in the results?

The text was updated successfully, but these errors were encountered:

gianm · 2018-06-22T16:56:02Z

I could see a few ways to solve this:

Formally define that topN and groupBy behave differently. But this is unappealing, since I think you'd want topN and groupBys that "look equivalent" to behave similarly.
Modify topN to behave like groupBy. However, this is unappealing too, since I like how topN works now.
Modify groupBy to behave like topN when grouping on one dimension. Extend this to grouping on more than one dimensions by making groupBy "implode" rows when any multi-value dimension in the grouping set is empty. So if any of them is empty, the result row disappears. This is pretty consistent with how grouping would behave if you think of the multi-value dimensions as like an inner join on some kind of virtual table. It may be surprising to users, though, that their rows go "missing".
Modify groupBy to behave like topN when grouping on one dimension. Extend this to grouping on more than one dimensions by making groupBy treat empty multi-value dimensions as nulls in this case. This has an advantage over (3) in that rows never go "missing", but it's a little less internally consistent. It would mean groupBy behaves differently in the 1-dimension case and the more-than-one-dimension case.

None of these is perfect. I think 3 or 4 sounds best? Probably 4? Number 3 strikes me as the more 'pure' approach but I could see it being pretty unintuitive for people. And potentially not as useful as number 4.

jihoonson · 2018-06-22T19:08:34Z

I would vote for 3. We should avoid such an exception like in 4 as possible as we can. Also, 3 can be easily expressed by SQL like below. explode() is an UTDF similar to Hive's explode function. This would be better for Druid's SQL as well.

SELECT explode(multi_value_col) FROM table GROUP BY explode(multi_value_col);

Regarding missing rows, it would be fine if we write a good documentation so that users would notice that before using queries.

vogievetsky · 2018-06-22T22:45:46Z

I would advocate for 3 also. I think 4 would bring the a lot annoying baggage later on in the future.
FWIW 3 is how we do it in Plywood, but Plywood also supports both [] and [null] and those are both valid and two different things.

stale · 2019-06-21T00:30:23Z

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

github-actions · 2023-07-02T00:21:28Z

This issue has been marked as stale due to 280 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If this issue is still
relevant, please simply write any comment. Even if closed, you can still revive the
issue at any time or discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

gianm · 2023-11-16T04:57:04Z

Thinking about this again, I am actually more excited about option 2 than I was in the past, especially now that we have proper arrays. It seems reasonable to me to say that empty MVDs should be treated by all grouping engines as if they were a null value (as groupBy currently does but topN currently does not). This avoids the implicit implosion which I do think will be confusing for people, and achieves equal behavior across the grouping engines. And proper array types provide ways for people to get other behaviors if they desire them.

gianm added the Area - Querying label Jun 22, 2018

clintropolis mentioned this issue Apr 22, 2019

Druid Multi-Value String Columns, Expressions, and SQL #7525

Open

stale bot added the stale label Jun 21, 2019

gianm removed stale labels Jul 4, 2019

clintropolis mentioned this issue Jul 10, 2019

optimize single input column multi-value expressions #8047

Merged

3 tasks

github-actions bot added the stale label Jul 2, 2023

clintropolis removed the stale label Jul 2, 2023

clintropolis mentioned this issue Jul 4, 2024

fix unnest bugs #16690

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consistent empty multi-value dimension behavior for groupBy / topN #5897

Consistent empty multi-value dimension behavior for groupBy / topN #5897

gianm commented Jun 22, 2018

gianm commented Jun 22, 2018 •

edited

Loading

jihoonson commented Jun 22, 2018 •

edited

Loading

vogievetsky commented Jun 22, 2018

stale bot commented Jun 21, 2019

github-actions bot commented Jul 2, 2023

gianm commented Nov 16, 2023

Consistent empty multi-value dimension behavior for groupBy / topN #5897

Consistent empty multi-value dimension behavior for groupBy / topN #5897

Comments

gianm commented Jun 22, 2018

gianm commented Jun 22, 2018 • edited Loading

jihoonson commented Jun 22, 2018 • edited Loading

vogievetsky commented Jun 22, 2018

stale bot commented Jun 21, 2019

github-actions bot commented Jul 2, 2023

gianm commented Nov 16, 2023

gianm commented Jun 22, 2018 •

edited

Loading

jihoonson commented Jun 22, 2018 •

edited

Loading