
Improve topN algorithm #7187

Open
leventov opened this issue Mar 4, 2019 · 37 comments

Comments

leventov (Member) commented Mar 4, 2019

Description

@leerho proposed how the topN algorithm can be improved here: #7161 (comment)

Motivation

Avoid potentially significant errors, see #1134 and #1135.

leerho (Contributor) commented Mar 4, 2019

Objective

  • Improve the current internal heuristic algorithm for "TopN" with streaming algorithms, including the DataSketches Frequent Items Sketch (FIS).

Motivation

  • Resources: The current implementation of "TopN" is likely using more compute resources than is necessary. This can be addressed by the use of several algorithms, of which the FIS is one.

Where the FIS can be applied:

  • Error vs. Size: The FIS has a user-configurable parameter that allows a trade-off between the sketch size and its resulting accuracy (or what we call its error threshold, or epsilon).
  • Error Clarity: The FIS has well-defined error bounds that can be applied to each item in the result set.
  • Query Flexibility: At query time the user can choose between two query modes, NO_FALSE_POSITIVES (more strict) or NO_FALSE_NEGATIVES (less strict). The FIS will not return items below the noise floor, so you know that the items it does return are, in fact, significant.
  • Streaming: The FIS is a streaming algorithm, just pass the input data into the sketch once. No need to presort the input stream. This will save processing cycles and time.
  • Mergeable: The FIS is mergeable. Each node would sketch its own data independently, then the sketches from each node are passed to a central "broker" node where they are merged into a final sketch, which can then be queried. This could reduce the amount of data passing between the nodes and the broker, but it will depend on how the sketch is configured. There is no loss of accuracy due to the merging process.
  • Data Independent: The FIS error properties are independent of how the input data is partitioned into separate streams (the nodes) and how the values and their weights are distributed within the streams themselves. The FIS can be somewhat input data order sensitive, in that the weights and order of the items in the result set may vary slightly from run to run, but they will always be within the error specification of the chosen sketch configuration.

Making it Happen!

We on the sketches team are not knowledgeable about Druid internals. We can certainly advise and consult on the proper use of the sketches, but someone on the Druid team must integrate the sketch library into Druid.

gianm (Contributor) commented Mar 4, 2019

It sounds like the behavior of FIS is different enough from topN that 'replace' is not in the cards. But it does sound like something that would be really nice to expose in Druid somehow. Maybe as a new query type? Can the 'weight' of an item be any Comparable object? (This would allow it to be compatible with any Druid aggregator.) Or is it limited to ints?

leerho (Contributor) commented Mar 5, 2019

@gianm
I think we may have some confusion here. FIS uses identity (equality) for the item, and the weight parameter must be a positive numeric long; it cannot be an arbitrary Comparable. (It is the Quantiles sketch that depends on Comparable.)

Imagine you are Apple and have a stream of song title objects, each with a price as purchased. Different instances of the same title may have a different price due to store discounts or whatever.

The task is to identify the largest revenue song titles, by title and total revenue.

The input stream is a stream of {Object songTitle, long purchasePrice} (Price in cents)
You create the sketch:

ItemsSketch<SongTitle> fisSketch = new ItemsSketch<SongTitle>(1024);

Note: the sketch size is configurable, depending on how accurate you want it to be.

You update the sketch with

while (remainingItems > 0) {
  SongTitle songTitle = next();
  fisSketch.update(songTitle, (long) songTitle.getPrice()); // accumulate the price (in cents) as the weight
}

Note that the sketch does the item aggregation for you, and there is no need to sort the input.
The purchasePrice is the "weight": when duplicate songTitles appear in the stream, their purchasePrices are accumulated in the sketch.

When all the sketches at each of the nodes are complete, you send the sketches to the Broker where they are all merged together.
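
For concreteness, a minimal sketch of that broker-side merge, assuming the frequencies ItemsSketch API used above; nodeSketches is a hypothetical collection of the per-node sketches after they have been deserialized on the Broker:

ItemsSketch<SongTitle> mergedSketch = new ItemsSketch<SongTitle>(1024);
for (ItemsSketch<SongTitle> nodeSketch : nodeSketches) {
  mergedSketch.merge(nodeSketch); // merging preserves the sketch's error guarantees
}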

The final merged sketch is then queried:

ItemsSketch.Row<SongTitle>[] rows = mergedSketch.getFrequentItems(ErrorType.NO_FALSE_POSITIVES);

Each row of the resulting array of rows contains:

  • A reference to the frequent item
  • An estimate of the total weight (total revenue) for that item
  • The upper bound on the true total weight
  • The lower bound on the true total weight
  • plus a few other convenience methods.

The array can be sorted by the accumulated weights, so they are in approximate rank order. The sketch will return all items that have a weight above the threshold.
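
As a minimal sketch of reading those rows (assuming the accessor names getItem, getEstimate, getLowerBound, and getUpperBound on ItemsSketch.Row):

for (ItemsSketch.Row<SongTitle> row : rows) {
  SongTitle item  = row.getItem();       // reference to the frequent item
  long estimate   = row.getEstimate();   // estimated total weight (total revenue)
  long lowerBound = row.getLowerBound(); // the true total weight is at least this
  long upperBound = row.getUpperBound(); // the true total weight is at most this
}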

This is effectively a superset of the typical "TopN" heuristic: it does a lot more than the current heuristic, and it provides accuracy guarantees that the existing heuristic cannot provide.

Fundamentally the Druid "TopN" heuristic is data sensitive and it can fail without warning where this will not.

So I am not clear on why you think this would not replace the current functionality. What functionality do you think is missing?

It should certainly be worth a try :)

Lee.

gianm (Contributor) commented Mar 5, 2019

The current topN algorithm can be used for any query shaped like SELECT x, AGG(y) FROM tbl ORDER BY AGG(y) DESC LIMIT N when useApproximateTopN is enabled (when disabled, groupBy is used instead, which is exact).

It sounds like FIS is limited in a couple of ways that prevent it from being used for all queries of that form:

  • "AGG" must be "SUM" and y must all be positive. (topN will accept any aggregation function and any y)
  • "LIMIT" will not necessarily be adhered to. (topN will always return LIMIT results, or the number of distinct values of x, whichever is lower)

Now, you might argue that this level of flexibility is bad, because it involves throwing caution to the wind, and sometimes falls flat on its face. And you would have a good point. But, nevertheless, the flexibility is there, and is even useful sometimes.

Maybe it makes sense for the FIS to be exposed via its own native query type ("frequentItems" instead of "topN"). If we do that, I'm wondering how it makes sense to expose it through SQL. Maybe a context flag like useFrequentItemsForTopN that would plan a SQL query like SELECT x, AGG(y) FROM tbl ORDER BY AGG(y) DESC LIMIT N as a "frequentItems" query if "AGG" is "SUM", and would be documented to relax "LIMIT" such that fewer than N items might be returned. I'm not sure how to satisfy the "y must all be positive" requirement; maybe we need to track metadata on the brokers about min/max values for columns, and only use the frequentItems query if y is known to all be positive.

dylwylie (Contributor) commented Mar 5, 2019

I'm not sure how widely it's used, but topN also supports ascending order. Curious whether the sketch supports returning the inverse, "most infrequent", items too?

dylwylie (Contributor) commented Mar 5, 2019

Also curious about people's thoughts on users being explicit in their queries that can return approximate results, versus optimising queries based on context/configuration.

SELECT AGG(y), x FROM tbl WHERE x IN (SELECT FREQ_I_SKETCH(x, y, 100) FROM tbl)
versus trying to optimise an ORDER BY ... LIMIT.

gianm (Contributor) commented Mar 5, 2019

Also curious about people's thoughts on users being explicit in their queries that can return approximate results, versus optimising queries based on context/configuration.

That's probably better, given that FIS is doing a very particular thing, and SQL users might appreciate being able to trigger its activation explicitly.

leerho (Contributor) commented Mar 5, 2019

This is very helpful and I think I see where we are talking past each other. There appear to be several different use cases for the concept of "TopN". The starting point for these use cases is after the selection and other possible SQL manipulation, where you end up with a set of distributed lists of items and the task is to efficiently merge these lists and obtain the "TopN" of those items. It is the operations that follow that distinguish the several use cases.

  • Case 1: The items are Comparable with respect to some property y, which does not have to be numeric. It could be a string. If y is numeric, it could be either positive or negative, it doesn't matter. The other significant property is that these items would not be aggregated together, e.g. it would not be meaningful to do an operation like ItemC = ItemA + ItemB. To finish, the following operations would be performed: (Assume N = 100). Note that the "order" here could be either direction so that one could also obtain the "BottomN".
    -- Order the items in each node.
    -- Select top 100 from the list in each node (Note: we don't need more than N from each node)
    -- Merge the lists together in the Broker
    -- Order the items in the Broker, choose the top 100.

  • Case 2: The items have the Identity property such that one can check whether ItemA == ItemB, and they have a weight (or importance) property that is independent of the identity property. In other words, ItemA == ItemB and possibly ItemA.w1 != ItemB.w2. This weight is numeric, positive, and obviously comparable. These objects can be "merged" together when the identity is satisfied, as in ItemC.w3 = ItemA.w1 + ItemB.w2 where w3 = w1 + w2. (Assume finalN = 100). The current algorithm would be the following:
    -- Order the items in each node.
    -- Select the top nodeN = 1000 from the list in each node (Note: we must select nodeN >> finalN)
    -- Merge the lists together in the Broker
    -- Order the items in the Broker, merging identical items, and choose the top 100.
    The sketch equivalent of this process would be
    -- Feed the items in each node into a FIS.
    -- Each node sends its sketch to the Broker.
    -- The Broker merges the sketches together into a single sketch
    -- The result "Most frequent Items" or "Heaviest Items" are returned from the sketch.

  • Case 3: This is the same as Case 2, except that negative weight values would be allowed. This, unfortunately, cannot be solved with our current sketch. We will have to think about this.

I think this addresses all the use cases (this was kinda slapped together). But for both Cases 1 and 2, improvements can be made in how the final steps are performed. For Case 3 I would like to ask: "Is this a real case, and how often does it occur?"

leerho (Contributor) commented Mar 5, 2019

I should point out that Case 1 above is also streamable and solvable with significantly fewer resources than you are using now. And I think we have a solution for Case 3 also, but that may require a bit more effort.

peferron (Contributor) commented Mar 6, 2019

@leerho Going back to your song example, could FIS support getting the top song titles by number of unique listeners over the past week? The current topN can do that using HLL sketches as metric (weight). That's one example of the usefulness of accepting any Druid aggregation or post-aggregation as a metric. (Interestingly, this specific case could benefit from using a HLL Map sketch.)

Also, I don't think that the current topN algorithm orders all distinct items on each node (or segment) prior to selecting the top 1000, since that would be unnecessarily wasteful when the number of distinct items is >> 1000. Would be great if someone more familiar with the topN implementation could confirm.

leerho (Contributor) commented Mar 7, 2019

getting the top song titles by number of unique listeners over the past week

@peferron Now that is very interesting! The current FIS doesn't handle this case, but I think it could be extended to do that, or to have a configurable aggregate function. I'll have to think about it.

I agree with you that having the ability to specify any Druid aggregation or post-aggregation as a metric is pretty powerful and I'm not trying to replace that functionality.

My principal concern is the way the final steps are performed to obtain the Top 100 (for example) items:

  • For each node, sort all items in the node // cost: n log n, where n = total items in the node
  • Pass the top 1000 items to the Broker // network cost
  • The Broker sorts numNodes (M) * 1000 items // cost: M * 1000 log (M * 1000)
  • The Broker chooses the top 100.

This heuristic approach is error-prone, data-sensitive, and computationally costly. This, I think, can be improved, but not by substituting FIS!

I am not familiar with how the pre- and post-aggregation steps intertwine themselves with this process. But perhaps this heuristic could be accomplished with a min-heap? It would be faster, smaller, not data-sensitive, and generally more robust. It would change the compute cost from n log n to n log k, which could be significant.
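
As an illustration only (not Druid's actual code), here is a minimal bounded min-heap selection in Java over an aggregated map of item -> total weight; the topK helper name and the java.util types are illustrative:

// Illustrative only: select the k heaviest items at cost n log k instead of
// sorting all n items (n log n).
static List<Map.Entry<String, Long>> topK(Map<String, Long> aggregated, int k) {
  Comparator<Map.Entry<String, Long>> byWeight =
      Comparator.comparingLong(Map.Entry::getValue);
  PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(byWeight);
  for (Map.Entry<String, Long> e : aggregated.entrySet()) {
    heap.offer(e);
    if (heap.size() > k) {
      heap.poll(); // evict the current smallest so only the k heaviest survive
    }
  }
  List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
  result.sort(byWeight.reversed()); // descending by weight for presentation
  return result;
}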

This (I realize now) has nothing to do with FIS. As I tried to state above, Druid's concept of "TopN" consists of several very different types of operations, and the current FIS would only help with a subset of them.

But consider where FIS could apply: many duplicates (with weights) in the stream, where you wish to return the aggregated heaviest-weighted items (or the most frequent items). There the compute cost reduces to just n, which makes it suitable for real-time queries. This type of query is fairly common and might deserve a special query type of its own.

The idea of a compound FrequentItem<Item, Aggregator> sketch sounds very interesting, but we don't have it now, so it would have to be developed.

leerho (Contributor) commented Mar 7, 2019

@peferron
By the way, we do have an HLL Map sketch, but it was designed with a different use case in mind, so it doesn't quite fit what we need here.

peferron (Contributor) commented Mar 7, 2019

@leerho As mentioned earlier, from my understanding of the topN code, there is no n log n sorting of all items in the segment (a node can contain multiple segments, which usually contain up to 5M rows each and can be processed in parallel; per-segment results are merged locally on the node, then the result is passed to the broker for another round of merging with other node results). The process is:

  1. Aggregate values of all n items in the segment, grouping by item key. Time O(n) and space O(m) where m is the number of distinct item keys. Worst case is m = n so this can actually take a lot of space compared to FIS if cardinality is high.
  2. Take top k items by aggregated value. This is currently done with a priority queue, so time O(m log k) and space O(k). Time could be improved to O(m) with quickselect but it might not make a difference in practice.

Regarding the HLL Map sketch, I was indeed referring to the one in the Yahoo Datasketches lib (should have made this clearer, sorry). If a topN includes a HLL aggregation, the current implementation keeps a dedicated HLL sketch for each distinct item key, so you can end up with millions of low-cardinality HLL sketches. It looks like the HLL Map sketch is designed to improve this scenario. But it probably would be quite a lot of work, as a naïve implementation that duplicates the set of item keys would waste most of the space gained. In any case, that's a bit off-topic since it's not related to FIS. I just mentioned it because it sounded neat. :)

leerho (Contributor) commented Mar 7, 2019

Thanks for your clarifications. I was led to believe that the TopN process was the one I described. A PriorityQueue is a Min-heap so it sounds like it is already doing close to what I was suggesting.

As for the HLL Map sketch, one of its requirements was that it keep every key and its latest count estimate for every unique key it has seen, to allow arbitrary queries of any key. This consumes a huge amount of space, and there is no such requirement here. What I imagine is a FIS-HLL combination that would be considerably smaller. Its size would be O(k).

leerho (Contributor) commented Mar 8, 2019

@peferron

  1. How is the group by (your step 1) performed under the covers? Hash Table? What do you do if n or m don't fit into memory? (or is that out of scope by definition?)
  2. "per-segment results are merged on the node". How? Merging of Priority Queues from each segment? How big are the priority queues? Size k? or bigger than k?
  3. What exactly is passed from the node to the broker, effectively a Priority Queue? And how big is it WRT k?

gianm (Contributor) commented Mar 8, 2019

How is the group by (your step 1) performed under the covers? Hash Table? What do you do if n or m don't fit into memory? (or is that out of scope by definition?)

The TopN engine's intra-segment group-by will use either an array (if it can) or hashtable (if it can't use an array). It assumes the m keys will fit in memory, which is considered to be a fair assumption since the highest it could reasonably be is a few million. (Creating segments of more than a few million rows is not considered a good idea.)

"per-segment results are merged on the node". How? Merging of Priority Queues from each segment? How big are the priority queues? Size k? or bigger than k?

All intra-segment resultsets (which are exact) are then sorted (using a priority queue) and truncated to max(k, 1000). Call that K. Then the historical node merges all those per-segment resultsets pairwise by first combining the 2K elements based on the aggregation key, then sorting those 2K elements (again, using a priority queue) into a new priority queue of size K.

What exactly is passed from the node to the broker, effectively a Priority Queue? And how big is it WRT k?

It's a sorted resultset of size max(k, 1000).
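
To make that combine-and-truncate step concrete, here is a toy, non-Druid illustration in Java with invented numbers; it reuses the topK helper sketched earlier in this thread:

// Toy illustration: combine two already-truncated per-segment result sets by
// key, then truncate again to K.
static Map<String, Long> combineAndTruncate(Map<String, Long> a, Map<String, Long> b, int K) {
  Map<String, Long> combined = new HashMap<>(a);
  b.forEach((key, weight) -> combined.merge(key, weight, Long::sum)); // sum weights per key
  Map<String, Long> truncated = new LinkedHashMap<>();
  for (Map.Entry<String, Long> e : topK(combined, K)) {
    truncated.put(e.getKey(), e.getValue());
  }
  return truncated;
}

// Invented example of the data sensitivity discussed below: with exact
// per-segment results A = {x=5, y=4, z=3} and B = {w=5, v=4, z=3} and K = 2,
// each segment passes only its own top 2, so z (true total 6, the global top
// item) never reaches the merge at all.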

leerho (Contributor) commented Mar 9, 2019

@gianm
Where in this process is the "pre-aggregation" performed?
Where in this process is the "post-aggregation" performed?

And by "aggregation", I presume that means, for example, two identical items (each of weight 1) will be replaced by one of the items with a weight of 2. Is that correct?

I think you may still have a problem, but I can't prove it yet :)

gianm (Contributor) commented Mar 9, 2019

What do you mean by pre-aggregation and post-aggregation in this context?

leerho (Contributor) commented Mar 11, 2019

From @peferron's comment above: "That's one example of the usefulness of accepting any Druid aggregation or post-aggregation as a metric." (#7187 (comment))

Just wondering whether "Druid aggregation" occurs on the node and "post-aggregation" occurs on the broker?

gianm (Contributor) commented Mar 11, 2019

From @peferron's comment above: "That's one example of the usefulness of accepting any Druid aggregation or post-aggregation as a metric." (#7187 (comment))

Just wondering whether "Druid aggregation" occurs on the node and "post-aggregation" occurs on the broker?

Ah, usually, aggregation happens on data nodes and post-aggregation happens on the broker. But if you're sorting on a post-aggregation (like an average, which in Druid is a post-aggregation division) then that particular post-aggregation is done on data nodes right before the early-limiting step. Sometimes this works great and sometimes it's wildly inaccurate. You do need to be a little careful with this particular feature.

leerho (Contributor) commented Mar 12, 2019

Ok, this is what I've been suspecting.

If I understand what has been said here (and I still may be missing something), then WRT the TopN operations:

  • In general, anytime you do limiting prior to an aggregation step you can produce errors, and the size of the resulting error will be unknown since it is data sensitive, and there will be no warning. This will be true whether it is performed on the node or on the broker.

  • Also, the max(k, 1000) step may be wasteful in the absence of aggregation, and may not provide enough data in the presence of aggregation. And, again, there will be no warning.

gianm (Contributor) commented Mar 12, 2019

I don't think you're missing anything. It is definitely 'well known' that the topN algorithm has data-sensitive behaviors and only works if the top things are somewhat consistent from segment to segment. (This is discussed in the documentation.) People have still found it quite useful, though. I think they would also find something FIS-based to be useful; it sounds like it's a tool that is applicable to fewer problems, but better suited for those problems it is applicable to.

leerho (Contributor) commented Mar 12, 2019

@gianm @leventov

Perhaps it is "well-known" amongst the Druid developers, but I think the Druid team is taking a serious risk with Druid's reputation. Druid's customers may not fully understand that the current TopN functionality may return garbage, even if it is discussed in the documentation (who reads the manual, anyway :) ).

Quoting @peferron above:

Going back to your song example, could FIS support getting the top song titles by number of unique listeners over the past week? The current topN can do that using HLL sketches as metric (weight). That's one example of the usefulness of accepting any Druid aggregation or post-aggregation as a metric.

It is not clear from his statement whether he realizes that "accepting any Druid aggregation or post-aggregation" may not work as intended and, in fact may produce meaningless or misleading results.

Data sensitivity is not a good thing especially given that there is no easy way for a customer to determine whether the data he/she has processed is producing trustworthy results or not.

I would consider the current TopN functionality to be broken and unreliable, period.

Here are two alternatives (there may be others):

  • In any "TopN" operation do not allow any aggregation steps after the data set has been "limited". This has to be true both on the nodes and on the broker.
  • Adopt the Frequent Items sketch for "TopN" operations. Accept that the ordering weight metric is positive integers (longs) and that the aggregation operation is add(). (If you feel you really need real-valued weights, then we can develop for you an FIS based on doubles. That is not a big deal.)

My point is that a better, more efficient, and free solution exists for the most common TopN type operations. Why not use it?

peferron (Contributor) commented Mar 13, 2019

@leerho It's well understood (thanks to this discussion) that FIS would be better than the current algorithm for a subset of topN queries, and I'm sure that most Druid users would love having that.

Regarding the subset of topN queries that FIS cannot replace, though, removing that functionality from Druid without an alternative is not realistic, even though the data sensitivity and lack of error bounds are big issues.

leerho (Contributor) commented Mar 16, 2019

@leventov @gianm @peferron @Dylan1312

I removed two comments I had here, because after thinking about this some more, I am confident that I can make even stronger statements to make my point clear.

In the "good old days" of SQL, if you issued a query shaped like SELECT x, AGG(y) FROM table ORDER BY AGG(y) DESC LIMIT N, the database would oblige by performing the operations brute-force on all the data, and would eventually return a result set of N items that you could call a "TopN", but as we will discover that is a misnomer. The bottom of that list could contain items of equal weight that represent noise, in the sense that their weight may not be greater than the weight of all remaining items not on the list. Nonetheless, this would be considered the "exact" solution.

However, in the attempts to speed up the horribly slow brute-force SQL query above, and based on what was described above in this thread, the Druid "TopN" algorithm was created to do this (I think this is right):

M Historical Nodes:

  1. "First Truncation": All intra-segment resultsets (which are exact) are sorted (using a priority queue) and truncated to max(k, 1000). Call that K.
  2. "Pre-Aggregation": Then the historical node merges all those per-segment resultsets pairwise by first combining the 2K elements based on the aggregation key, then sorting those 2K elements (again, using a priority queue) into a new priority queue of size K.
  3. Each H-Node passes a sorted resultset of size max(k, 1000) to the broker.

Broker Node:

  1. 2nd Truncation and "Post-Aggregation": The data from all M H-Nodes is gathered, then sorted and truncated again using a priority queue of size max(k, 1000).
  2. K items are reported as the result.

These truncations and aggregations change everything!

The first belief:

The current topN algorithm can be used for any query shaped like SELECT x, AGG(y) FROM table ORDER BY AGG(y) DESC LIMIT N when useApproximateTopN is enabled.

This belief is a total fiction and this claim will fail miserably.

The second belief:

Going back to your song example, could FIS support getting the top song titles by number of unique listeners over the past week? The current topN can do that using HLL sketches as metric (weight). That's one example of the usefulness of accepting any Druid aggregation or post-aggregation as a metric.

This claim of the ability to plug-in HLL sketches as the aggregator will also fail miserably.

The third belief:

It sounds like FIS is limited in a couple ways that prevents it from being used for all queries of that form: (because)
"AGG" must be "SUM" and y must all be positive. (topN will accept any aggregation function and any y)

The view here that FIS is somehow "limited", and is a "subset case" because it can only sum positive values, is seriously misguided.

In fact the only "aggregation" operation that can possibly produce any meaningful results with the Druid TopN algorithm is a sum of positive weights. And this case that kinda-sorta works is when the data in the H-nodes is pre-aggregated (only sum of positive weights allowed) prior to the 1st truncation (Priority Queue). Then the PQs are merged, summed and PQed again, before passing to the Broker. The process in the Broker would be similar. But this still could have significant errors.

The fourth belief:

"LIMIT" (using FIS) will not necessarily be adhered to. (topN will always return LIMIT results, or the number of distinct values of x, whichever is lower).

This is also misguided. For a set of items where all have the same weight, there is no TopN! Any random selection of N values from the set could be a "valid" result, which makes the name "TopN" meaningless. In order for a set of values to have a "Top", there must be some skew in their weights.

Suppose the set has 1000 items with weight 1 and 2 items of weight 10. The top 100 will contain the two heavy items; the other 98 could be randomly chosen from the remainder of the original set. Those 98 add no insight, as they are essentially noise. So to require that N items be returned from a TopN query is severely misguided. The Druid TopN also doesn't tell you which items at the small end of the TopN are noise, making it even less useful. The better term to use is either "Frequent Items" or "Heavy Hitters", both of which are used in the scientific literature.


In conclusion:

  1. The belief that a query of the shape SELECT x, AGG(y) FROM table ORDER BY AGG(y) DESC LIMIT N can actually produce meaningful results using the Druid TopN algorithm is misguided. It is especially misguided if the item weights are negative or those aggregations are non-linear functions such as an HLL sketch.

  2. The only query that might kinda-sorta-sometimes produce meaningful results is when the weights are positive and the aggregations are sum. But if this is the case, why not use the FIS, which operates with positive weights and the "aggregation" is a sum. And it will produce far superior results.

In other words the FIS is not a subset of what can be usefully queried, it is the ONLY query set that can produce meaningful results.

peferron (Contributor) commented Mar 16, 2019

@leerho

The first belief that:

The current topN algorithm can be used for any query shaped like SELECT x, AGG(y) FROM table ORDER BY AGG(y) DESC LIMIT N when useApproximateTopN is enabled.

This belief is a total fiction and this claim will fail miserably.

You can use topN with any aggregation and get back results. This is a fact, not a belief. The misguided belief would be that these results are always meaningful, but I don't recall anybody here saying that; it has been repeated again and again that topN is data-sensitive.

The second belief:

This claim of the ability to plug-in HLL sketches as the aggregator will also fail miserably.

Same answer as above. You can absolutely plug a HLL sketch as a topN aggregation. Whether the results are meaningful or not depends on the data.

The third belief:

[...]

In fact the only "aggregation" operation that can possibly produce any meaningful results with the Druid TopN algorithm is a sum of positive weights.

Since this assertion is worded very strongly, it can be disproven by finding a counter-example where meaningful results are produced using topN in conjunction with, say, a HLL sketch.

That's easy to do for almost anyone with access to a Druid cluster. That happens to be my case, so I just ran queries against about a billion rows of real-world data. The queries are SELECT x, COUNT(DISTINCT y) GROUP BY 1 ORDER BY 2 DESC LIMIT 5.

In the first test, x is a dimension with a cardinality of 30k, and y is another dimension with high cardinality. topN returns the same results as groupBy (which is the exact counterpart of topN), but 30% faster.

In the second test, x is a dimension with a cardinality of 2M. groupBy cannot return results because it exceeds resource limits. After reducing the query interval enough to bring the cardinality of x down to levels that groupBy can handle, topN again returns the same results as groupBy, but 75% faster.

So, here we are—meaningful results produced using topN with HLL, and with significant performance gains.

These are the first queries I tried, but I'm sure I could find some where topN fails miserably. That's not the point, though—the point is that topN works well in some cases, as demonstrated. You seem to think that Druid users, including participants in this discussion, are incapable of identifying these cases, and are using topN only because they don't understand how badly it can fail. That's certainly your prerogative, and you may be right for the possibly large number of Druid users who don't dig into how the system works or read the documentation. As a somewhat power user though, I'm happy to have a tool like topN at my disposal, even if it's up to me to use it responsibly.

That being said, I certainly hope that Druid goes towards a direction where error bounds are systematically provided, as discussed in #7160. So thank you and @AlexanderSaydakov for starting this movement by integrating the DS library.

leerho (Contributor) commented Mar 21, 2019

@peferron

Thank you for your response and for performing a test with an HLL aggregator.

The intent of my statements in the preceding entries is to protect the end users of Druid, and ultimately to protect the reputation of Druid as a trustworthy platform.

In my statements above, my use of the word "meaningful" is with respect to an end user and is meant to imply that the results of any query of the above shape could be interpreted usefully (i.e., correct interpretation and similar to what a brute-force exact computation would produce).

My use of the word "belief" is again with respect to the end-user: That he/she might believe (because the Druid specifications say that it can be done) that the results of any query of the shape stated above would be meaningful (useful and correct). Of course the query can be executed, that is beside the point, and doesn't require much trust or belief.

Unfortunately, running one test does not prove that an algorithm or process works correctly. Nor does running a hundred or thousand such tests. However, all one has to do to prove that an algorithm or process does not work is to provide one counter example test that does not work.

Relying on the fact that the documentation states that the TopN process is data sensitive is a very weak argument, as most end-users probably don't read the documentation, or don't think about what that phrase might mean.

The fact that a data sensitive query might run 30% or 75% faster is a misleading "sales pitch" to the end users, because in order for an end user to actually trust that the results from such a query are usefully correct, the end user would have to run each query in brute-force, exact mode and check its results against the results from the "faster" query. The total time to do both queries is clearly longer than just running the brute-force exact query by itself. If you think this is ridiculous, it derives directly from the "data sensitive" statement: you cannot prove that a "data sensitive" process works on all data, because you can't possibly have access to all data. More importantly, you don't have access to the end-user's data! The statement "data sensitive" is a huge admission that the process may not work, and without running brute-force tests on all runs, the user would never know whether he/she can trust the results or not.

It appears to be your opinion that just documenting the process with the words "data sensitive" is enough, and that if the user wishes to shoot themselves in the foot, that is OK. I don't share that view, and perhaps nothing I say will change your mind; nonetheless, there are a couple of thoughts that perhaps you might consider.

  1. Allowing positive and negative values as weights in the aggregations. This actually can work if there is no limiting or truncation of the data followed by aggregations. However, with truncations followed by aggregations, negative and positive weights can easily cancel each other out across segments or nodes within the limited region, nullifying any earlier TopN results from individual segments or nodes. This, I hope you can see, would easily lead to nonsensical results.

  2. Allowing non-linear processes like unique counting (HLL), as an aggregation option, can also lead to very strange results both within the truncated regions and outside the truncated regions. This is not as serious a problem as allowing negative weights. It turns out that the use of a unique-counting algorithm like HLL as an aggregator inside a Heavy-Hitter / Frequent-Item sketch has been written about and studied quite a bit in the scientific literature. However, the algorithms that have been proposed so far are pretty nasty and it is not at all clear that they are practical (e.g., not mergeable, complex, the results are probabilistic, and other issues).

For a mergeable process that incorporates truncation / limiting followed by aggregation, to date we know of no algorithm that would produce trustworthy results if the aggregations allow negative values or non-linear calculations (e.g. unique counting).

To lead the user to believe (or hope) that an expression like
SELECT x, AGG(y) FROM table ORDER BY AGG(y) DESC LIMIT N when useApproximateTopN is enabled would produce meaningful, trustworthy results with arbitrary aggregation plug-ins is fundamentally misleading. And hiding behind the "data-sensitive" caveat doesn't change that.

The only algorithms that do incorporate truncation / limiting followed by aggregation and do provide trustworthy results are the sketching algorithms described in the scientific literature: the main ones are the Misra-Gries, Space-Saving, Heavy-Hitters, and Frequent-Items algorithms. And all of these algorithms require that the weight aggregations be linear and additive with positive numbers. These are not a subset of a larger set of working algorithms. They are the only set that, to date, works.
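
For readers unfamiliar with that family of algorithms, here is a minimal, unweighted Misra-Gries counter in Java (illustrative only; the production sketches are considerably more sophisticated). It shows why the aggregation has to be additive over positive increments:

// Minimal, unweighted Misra-Gries frequent-items counter (illustrative only).
// It keeps at most k-1 counters; any item whose true frequency exceeds n/k is
// guaranteed to survive, and each reported count undercounts by at most n/k.
static Map<String, Long> misraGries(Iterable<String> stream, int k) {
  Map<String, Long> counters = new HashMap<>();
  for (String item : stream) {
    if (counters.containsKey(item)) {
      counters.merge(item, 1L, Long::sum);             // additive positive increment
    } else if (counters.size() < k - 1) {
      counters.put(item, 1L);                          // room for a new counter
    } else {
      counters.replaceAll((key, count) -> count - 1);  // decrement every counter
      counters.values().removeIf(count -> count == 0); // drop exhausted counters
    }
  }
  return counters;
}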

peferron (Contributor) commented Mar 28, 2019

@leerho I think we're on the same page regarding the behavior of the current topN. Our disagreement comes from what we consider "meaningful", which is more subjective.

For you, "meaningful" means returning useful and correct results regardless of the data being queried.

For me, it's OK if correctness depends on the data; for example, I consider binary search to be useful and capable of returning meaningful results, even though it returns garbage on unsorted data.

You probably also consider binary search to be useful. So the main difference is, I think, that it's usually simple to ensure that data fed to binary search is sorted, but much harder to ensure that data fed to topN is suitable along all combinations of aggregations/filters/intervals. That's the main practical problem with topN IMO. But the difficulty here depends on the size of the query space, which is application-specific, so I don't think there can be a single yes/no answer regarding whether one should use topN or not.

It's interesting that you mention running brute force queries to verify topN correctness. It's actually quite viable, because the topN query quickly returns results to the user while the brute force (groupBy) query runs in the background using spare cluster resources. So the user experience is improved, even though the total computation time and resources are indeed larger than running only the brute force query.

leerho (Contributor) commented Apr 5, 2019

@peferron Thank you for your thoughtful comments. Clearly we have to leave it up to an informed user to decide. And all we can do is do our best to make sure that he/she is informed.


Bouncing back to the top of this thread, we are developing a new sketch that we are tentatively calling "Frequent Unique Nodes" (FUN).

Suppose you have a stream that contains pairs {IP address, UserID}, and you wish to identify the IP addresses that have the largest number of unique users. In this context think of a large graph where the IP addresses and users are nodes in the graph. Consider Node1 = IP and Node2 = ID; then we want to identify the Node1s that have the largest number of unique Node2s. Conversely, it might also be interesting to identify the Node2s (IDs) that have the largest number of unique Node1s (IPs). Conceptually, this can also be extended to more than just 2 nodes (although don't go nuts with this!).
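
To pin down the computation this new sketch is meant to approximate, here is a small exact (non-sketch) baseline in Java; the names are made up for illustration:

// Exact baseline of the "frequent unique nodes" question: which Node1 values
// (IP addresses) have the most unique Node2 values (user IDs)? The sketch
// approximates this with bounded memory instead of keeping full sets.
static Map<String, Integer> uniqueUsersPerIp(Iterable<String[]> pairs) { // each pair: {ip, userId}
  Map<String, Set<String>> usersByIp = new HashMap<>();
  for (String[] pair : pairs) {
    usersByIp.computeIfAbsent(pair[0], ip -> new HashSet<>()).add(pair[1]);
  }
  Map<String, Integer> uniqueCounts = new HashMap<>();
  usersByIp.forEach((ip, users) -> uniqueCounts.put(ip, users.size()));
  return uniqueCounts; // the heaviest IPs can then be selected from these counts
}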

With this new sketch you will be able to perform these types of queries and have some guarantees of accuracy as well.

If this is of interest, please let me know, as we could use your help in characterizing and performance testing of this, if possible.

Lee.

peferron (Contributor) commented Apr 8, 2019

From your description, it sounds like it could handle the "top songs by unique viewers" query we discussed earlier, by processing pairs of {SongID, UserID}. Essentially like a topN with HLL but with accuracy guarantees, and still better performance than an exact query. If that's the case then it sounds extremely useful. I'd love to see how you plan to implement that.

It would be especially useful if additional aggregations could be computed at the same time, such as computing the total (non-distinct) count for each key: SELECT IPAddress, COUNT(DISTINCT UserID), COUNT(*) GROUP BY 1 ORDER BY 2 DESC LIMIT 10. In my experience it's common to compute multiple metrics in the same query in Druid. This looks like something that would be implemented in the Druid extension rather than the sketch itself though, although the sketch may need to expose some APIs to support that.

Once you have a draft extension ready I'll be happy to run it against some of our real-world datasets if that helps. I'll even pick some dimensions where topN fails badly for comparison 😄. I won't have time to help developing the extension itself, unfortunately.

leerho (Contributor) commented Apr 8, 2019

@peferron

top songs by unique viewers

Yes.

additional aggregations could be computed at the same time

Yes, with a caveat. For the example you gave, SELECT IPAddress, COUNT(DISTINCT UserID), COUNT(*) GROUP BY 1 ORDER BY 2 DESC LIMIT 10, a sketch could be constructed to handle that specific query. However, handling arbitrary additional aggregations at query time is more challenging, as we would have to assume that the additional fields are simple counters and restrict the type of aggregation to a simple addition.

The more generic we try to make this, the more challenging it will be to configure and the performance will be impacted.

  1. Can we start with just the "top songs by unique users" and characterize that first?

  2. Will you need an actual published artifact Jar to test this? Or would a jar generated from a branch be OK for your testing?

peferron (Contributor) commented Apr 9, 2019

The more generic we try to make this, the more challenging it will be to configure and the performance will be impacted.

Understood. Extra aggregations can still be computed in a follow-up query anyway, such as SELECT COUNT(*), AVG(SomeColumn) WHERE IPAddress IN ('a', 'b', 'c') where a, b and c are top IP addresses previously returned by the FUN query. This can work decently well with Druid indexes. One advantage of this approach is that the FUN query returns faster since it's not loaded with extra aggregations, so you can immediately show up the top items & unique counts to the user, with extra metrics following up later. In some cases that's better UX than a longer initial wait followed up by showing up everything at once. That's a bit off-topic but hopefully there's some value in listing the different ways in which this sketch could be used within Druid.

Can we start with just the "top songs by unique users" and characterize that first?

Sure. Are you looking for any specific patterns in the test data? I assume that Yahoo already has large datasets & Druid clusters that could be used, so I'm trying to see what I could bring to the table here.

Will you need an actual published artifact Jar to test this. Or would a jar generated from a branch be OK for your testing?

The best would be a branch that I can check out to build the extension from source.

leerho (Contributor) commented Apr 15, 2019

Just to let you know, I am almost done with this new sketch. I decided that the FUN acronym was a bit too whimsical, so I have renamed it Frequent Distinct Tuples (FDT), which I think better reflects what it is doing. Not as easy to say, I realize, but we will get used to it.

This new Sketch is experimental still, as I don’t know how it will actually perform yet. But it is better than nothing, at least until a better algorithm comes along :)

leerho (Contributor) commented Apr 19, 2019

@peferron
The new FDT sketch is in master and should be stable enough for you to start looking at it and playing with it. I'm sure you will have questions so ask away :) I need to do some more comprehensive testing and characterization before it is ready for release. You can find some simple examples in the test hierarchy as well as in the characterization repository (well, those are a bit more complex :) :) ).

I am actually quite pleased with its accuracy.

You can find the configuration file for running the characterization tests here.

Have fun!

leerho (Contributor) commented Apr 25, 2019

@peferron @gianm
check out the updated docs

leerho (Contributor) commented Apr 26, 2019

@peferron @gianm

This sketch has now been released as part of sketches-core 0.13.2 and is fully documented on the website with example and error characterization.

peferron (Contributor) commented:

Thanks @leerho. I'm on vacation at the moment but should be able to characterize it against a few of our datasets after I'm back in early May.
