Improve topN algorithm #7187
Objective
Motivation
Where the FIS can be applied:
Making it Happen!
We on the sketches team are not knowledgeable about Druid internals. We certainly can advise and consult on the proper use of the sketches, but it must be someone on the Druid team who integrates the sketch library into Druid.
It sounds like the behavior of FIS is different enough from topN that 'replace' is not in the cards. But it does sound like something that would be really nice to expose in Druid somehow. Maybe as a new query type? Can the 'weight' of an item be any Comparable object? (This would allow it to be compatible with any Druid aggregator.) Or is it limited to ints?
@gianm Imagine you are Apple and have a stream of song title objects, each with a price as purchased. Different instances of the same title may have a different price due to store discounts or whatever. The task is to identify the largest revenue song titles, by title and total revenue. The input stream is a stream of {Object songTitle, long purchasePrice} (Price in cents)
Note: the sketch size is configurable, depending on how accurate you want it to be. You update the sketch with
Note that the sketch does the item aggregation for you; there is no need to sort the input. When all the sketches at each of the nodes are complete, you send them to the Broker, where they are all merged together. The final merged sketch is then queried:
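As a rough illustration of the flow just described (per-node sketches, merge at the Broker, then query), here is a toy weighted frequent-items sketch in the Misra-Gries style. This is only a sketch of the idea with made-up names and sizes; the real DataSketches implementation is more refined and provides formal error bounds.

```python
from collections import Counter

class FrequentItemsSketch:
    """Toy weighted frequent-items sketch (Misra-Gries style).
    Illustration only; not the DataSketches implementation."""

    def __init__(self, max_counters=64):
        self.max_counters = max_counters
        self.counts = Counter()   # stored count is a LOWER bound on true weight
        self.offset = 0           # total weight subtracted so far (error term)

    def _trim(self):
        # When over capacity, subtract the minimum stored count from all
        # counters and drop the ones that reach zero.
        while len(self.counts) > self.max_counters:
            m = min(self.counts.values())
            self.offset += m
            self.counts = Counter(
                {k: v - m for k, v in self.counts.items() if v > m})

    def update(self, item, weight=1):
        self.counts[item] += weight
        self._trim()

    def merge(self, other):
        """Sketches are mergeable: add counts and error terms, then trim."""
        self.counts.update(other.counts)
        self.offset += other.offset
        self._trim()
        return self

    def frequent_items(self, threshold):
        """No false positives: only items whose stored lower bound beats the
        threshold are returned, heaviest first, with an upper-bound estimate."""
        return sorted(((k, v + self.offset) for k, v in self.counts.items()
                       if v > threshold), key=lambda kv: -kv[1])
```

In the song example, each node would call `update(songTitle, purchasePrice)` for every purchase; the Broker would `merge()` the node sketches and call `frequent_items()` with a revenue threshold.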
Each row of the resulting array of rows contains:
The array can be sorted by the accumulated weights, so they are in approximate rank order. The sketch will return all items that have a weight above the threshold. This is effectively a superset of the typical "TopN" heuristic, as it does a lot more than the current heuristic and provides accuracy guarantees that the existing heuristic cannot. Fundamentally, the Druid "TopN" heuristic is data sensitive and can fail without warning, where this will not. So I am not clear on why you think this would not replace the current functionality. What is the functionality that you think is missing? It should certainly be worth a try :) Lee.
The current topN algorithm can be used for any query shaped like

It sounds like FIS is limited in a couple of ways that prevent it from being used for all queries of that form:
Now, you might argue that this level of flexibility is bad, because it involves throwing caution to the wind, and sometimes falls flat on its face. And you would have a good point. But, nevertheless, the flexibility is there, and is even useful sometimes.

Maybe it makes sense for the FIS to be exposed via its own native query type ("frequentItems" instead of "topN"). If we do that, I'm wondering how it makes sense to expose it through SQL. Maybe a context flag like
I'm not sure how widely it's used, but topN also supports ascending order. Curious whether the sketch supports returning the inverse "most infrequent" items too?
Also curious about people's thoughts on users being explicit in their queries that can return approximate results, versus optimising queries based on context/configuration.
That's probably better, given that FIS is doing a very particular thing, and SQL users might appreciate being able to trigger its activation explicitly.
This is very helpful and I think I see where we are talking past each other. There appear to be several different use cases for the concept of "TopN". The starting point for these use cases is after the selection and other possible SQL manipulation, where you end up with a set of distributed lists of items where the task is to efficiently merge together these lists and obtain the "TopN" of these items. It is the next operations that follow that distinguish the several use cases.
I think this addresses all the use cases (this was kinda slapped together). But for both cases 1 and 2, improvements can be made in how the final steps are performed. For case 3 I would like to ask: "Is this a real case, and how often does it occur?"
I should point out that Case 1 above is also streamable and solvable with significantly fewer resources than you are using now. And I think we have a solution for Case 3 also, but that may require a bit more effort.
@leerho Going back to your song example, could FIS support getting the top song titles by number of unique listeners over the past week? The current topN can do that using HLL sketches as metric (weight). That's one example of the usefulness of accepting any Druid aggregation or post-aggregation as a metric. (Interestingly, this specific case could benefit from using a HLL Map sketch.) Also, I don't think that the current topN algorithm orders all distinct items on each node (or segment) prior to selecting the top 1000, since that would be unnecessarily wasteful when the number of distinct items is >> 1000. Would be great if someone more familiar with the topN implementation could confirm. |
@peferron Now that is very interesting! The current FIS doesn't handle this case, but I think it could be extended to do that, or to have a configurable aggregate function. I'll have to think about it. I agree with you that having the ability to specify any Druid aggregation or post-aggregation as a metric is pretty powerful, and I'm not trying to replace that functionality. My principal concern is the way the final steps are performed to obtain the Top 100 (for example) items:
This heuristic approach is error prone, data sensitive, and compute-wise costly. This, I think, can be improved, but not by substituting FIS! I am not familiar with how the pre- and post-aggregation steps intertwine themselves with this process. But perhaps this heuristic could be accomplished with a Min-Heap? It would be faster, smaller, not data sensitive, and generally more robust. It would change the compute cost from a

This (I realize now) has nothing to do with FIS. As I tried to state above, Druid's concept of "TopN" consists of several very different types of operations, and the current FIS would only help in a subset of them. But where FIS could apply is where there are many duplicates (with weights) in the stream and you wish to return the aggregated heaviest weighted items (or the most frequent items). Then the compute cost reduces to just

The idea of a compound FrequentItem<Item, Aggregator> sketch sounds very interesting, but we don't have it now, so it would have to be developed.
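The Min-Heap idea above can be sketched generically as follows (an illustration, not Druid code): a bounded heap keeps only the n heaviest entries seen, so selecting the top n of a stream costs O(total × log n) instead of the O(total × log total) of a full sort, and holds only n entries in memory.

```python
import heapq

def top_n(pairs, n):
    """Return the n heaviest (weight, item) pairs using a size-n min-heap.
    The lightest kept entry sits at the heap root, so each new entry either
    displaces it or is discarded in O(log n)."""
    heap = []
    for item, weight in pairs:
        if len(heap) < n:
            heapq.heappush(heap, (weight, item))
        elif weight > heap[0][0]:
            heapq.heapreplace(heap, (weight, item))
    return sorted(heap, reverse=True)  # heaviest first
```

For example, `top_n([("a", 5), ("b", 1), ("c", 9), ("d", 3)], 2)` keeps only "c" and "a".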
@peferron
@leerho As mentioned earlier, from my understanding of the topN code, there is no n log n sorting of all items in the segment (a node can contain multiple segments, which contain usually up to 5M rows each and can be processed in parallel; per-segment results are merged locally on the node, then the result is passed to the broker for another round of merging with other node results). The process is:
Regarding the HLL Map sketch, I was indeed referring to the one in the Yahoo Datasketches lib (should have made this clearer, sorry). If a topN includes a HLL aggregation, the current implementation keeps a dedicated HLL sketch for each distinct item key, so you can end up with millions of low-cardinality HLL sketches. It looks like the HLL Map sketch is designed to improve this scenario. But it probably would be quite a lot of work, as a naïve implementation that duplicates the set of item keys would waste most of the space gained. In any case, that's a bit off-topic since it's not related to FIS. I just mentioned it because it sounded neat. :) |
Thanks for your clarifications. I was led to believe that the TopN process was the one I described. A PriorityQueue is a Min-Heap, so it sounds like it is already doing close to what I was suggesting. As for the HLL Map sketch, one of its requirements was that it keep every unique key it has seen, along with the latest count estimate for that key, to allow arbitrary queries of any key. This consumes a huge amount of space, and there is no requirement for that here. What I imagine is a FIS-HLL combination that would be considerably smaller. Its size would be O(k).
The TopN engine's intra-segment group-by will use either an array (if it can) or hashtable (if it can't use an array). It assumes the
All intra-segment resultsets (which are exact) are then sorted (using a priority queue) and truncated to
It's a sorted resultset of size
@gianm And by "aggregation", I presume that means, for example, two identical items (each of weight 1) will be replaced by one of the items with a weight of 2. Is that correct? I think you may still have a problem, but I can't prove it yet :) |
What do you mean by pre-aggregation and post-aggregation in this context? |
From @peferron's comment above: "That's one example of the usefulness of accepting any Druid aggregation or post-aggregation as a metric" (#7187 (comment)). Just wondering: does "Druid aggregation" occur on the node and "post-aggregation" on the broker?
Ah, usually, aggregation happens on data nodes and post-aggregation happens on the broker. But if you're sorting on a post-aggregation (like an average, which in Druid is a post-aggregation division) then that particular post-aggregation is done on data nodes right before the early-limiting step. Sometimes this works great and sometimes it's wildly inaccurate. You do need to be a little careful with this particular feature. |
Ok, this is what I've been suspecting. If I understand what has been said here (and I still may be missing something), then WRT the TopN operations:
I don't think you're missing anything. It is definitely 'well known' that the topN algorithm has data-sensitive behaviors and only works if the top things are somewhat consistent from segment to segment. (This is discussed in the documentation.) People have still found it quite useful, though. I think they would also find something FIS-based to be useful; it sounds like it's a tool that is applicable to fewer problems, but better suited for those problems it is applicable to.
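That data sensitivity is easy to reproduce with a toy model of the early-limiting step (an illustration of the failure mode with made-up data, not Druid's actual code): an item that sits just below the cutoff in every segment can be the global heaviest item, yet never survive per-segment truncation.

```python
from collections import Counter

def topn_with_truncation(segments, limit, n):
    """Toy model: truncate each segment to its local top `limit`,
    then merge the survivors and take the overall top n."""
    merged = Counter()
    for seg in segments:
        for item, weight in Counter(dict(seg)).most_common(limit):
            merged[item] += weight
    return [item for item, _ in merged.most_common(n)]

# "x" is the global heaviest item (total weight 24), but it is never in
# any single segment's local top 2, so truncation drops it everywhere.
segments = [
    [("a", 10), ("b", 9), ("x", 8)],
    [("c", 10), ("d", 9), ("x", 8)],
    [("e", 10), ("f", 9), ("x", 8)],
]
```

With `limit` large enough to cover every segment's distinct items, the toy returns the exact answer; the error appears only when truncation actually bites, which is exactly the "consistent from segment to segment" caveat.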
Perhaps it is "well-known" amongst the Druid developers, but I think the Druid team is taking a serious risk with Druid's reputation. Druid's customers may not fully understand that the current TopN functionality may return garbage, even if it is discussed in the documentation (who reads the manual, anyway :) ). Quoting @peferron above:
It is not clear from his statement whether he realizes that "accepting any Druid aggregation or post-aggregation" may not work as intended and, in fact, may produce meaningless or misleading results. Data sensitivity is not a good thing, especially given that there is no easy way for a customer to determine whether the data he/she has processed is producing trustworthy results or not. I would consider the current TopN functionality to be broken and unreliable, period. Here are two alternatives (there may be others):
My point is that a better, more efficient, and free solution exists for the most common TopN type operations. Why not use it? |
@leerho It's well understood (thanks to this discussion) that FIS would be better than the current algorithm for a subset of topN queries, and I'm sure that most Druid users would love having that. Regarding the subset of topN queries that FIS cannot replace, though, removing that functionality from Druid without an alternative is not realistic, even though the data sensitivity and lack of error bounds are big issues.
@leventov @gianm @peferron @Dylan1312 I removed two comments I had here because, after thinking about this some more, I am confident that I can make even stronger statements to make my point clear. In the "good old days" of SQL, if you issued a query shaped like

However, in the attempts to speed up the horribly slow brute-force SQL query above, and based on what was described above in this thread, the Druid "TopN" algorithm was created to do this (I think this is right):

M Historical Nodes:
Broker Node:
These truncations and aggregations change everything! The first belief that:
This belief is a total fiction and this claim will fail miserably. The second belief:
This claim of the ability to plug-in HLL sketches as the aggregator will also fail miserably. The third belief:
The view here is that somehow FIS is "limited", and is a "subset case", because it can only

In fact, the only "aggregation" operation that can possibly produce any meaningful results with the Druid TopN algorithm is a sum of positive weights. And the case that kinda-sorta works is when the data in the H-nodes is pre-aggregated (only sums of positive weights allowed) prior to the 1st truncation (Priority Queue). Then the PQs are merged, summed, and PQed again before passing to the Broker. The process in the Broker would be similar. But this still could have significant errors. The fourth belief:
This is also misguided. For a set of items where all have the same weight, there is no TopN! Any random selection of N values from the set could be a "valid" result, which makes the name "TopN" meaningless. In order for a set of values to have a "Top", there must be some skew in their weights. Suppose the set has 1000 items with weight 1, and 2 items of weight 10. The top-100 will contain the two heavy items; the other 98 could be randomly chosen from the remainder of the original set. Those 98 add no insight, as they are essentially noise. So to require that N items be returned from a TopN query is severely misguided. The Druid TopN also doesn't tell you which items at the small end of the TopN are noise, making it even less useful. The better term to use is either "Frequent Items" or "Heavy Hitters", both of which are used in the scientific literature. In conclusion:
In other words, the FIS is not a subset of what can be usefully queried; it is the ONLY query set that can produce meaningful results.
You can use
Same answer as above. You can absolutely plug a HLL sketch as a
Since this assertion is worded very strongly, it can be disproven by finding a counter-example where meaningful results are produced using

That's easy to do for almost anyone with access to a Druid cluster. That happens to be my case, so I just ran queries against about a billion rows of real-world data. The queries are

In the first test,

In the second test,

So, here we are: meaningful results produced using

These are the first queries I tried, but I'm sure I could find some where

That being said, I certainly hope that Druid goes toward a direction where error bounds are systematically provided, as discussed in #7160. So thank you and @AlexanderSaydakov for starting this movement by integrating the DS library.
Thank you for your response and for performing a test with an HLL aggregator. The intent of my statements in the preceding entries is to protect the end users of Druid, and ultimately to protect the reputation of Druid as a trustworthy platform.

In my statements above, my use of the word "meaningful" is with respect to an end user and is meant to imply that the results of any query of the above shape could be interpreted usefully (i.e., a correct interpretation, similar to what a brute-force exact computation would produce). My use of the word "belief" is again with respect to the end user: that he/she might believe (because the Druid specifications say that it can be done) that the results of any query of the shape stated above would be meaningful (useful and correct). Of course the query can be executed; that is beside the point, and doesn't require much trust or belief.

Unfortunately, running one test does not prove that an algorithm or process works correctly. Nor does running a hundred or a thousand such tests. However, all one has to do to prove that an algorithm or process does not work is to provide one counter-example test that does not work.

Relying on the fact that the documentation states that the TopN process is data sensitive is a very weak argument, as most end users probably don't read the documentation, or don't think about what that phrase might mean. The fact that a data-sensitive query might run 30% or 75% faster is a misleading "sales pitch" to the end users, because in order for an end user to actually trust that the results from such a query are usefully correct, the end user would have to run each query in brute-force, exact mode and check its results against the results from the "faster" query. The total time to do both queries is clearly longer than just running the brute-force exact query by itself. If you think this is ridiculous, it derives directly from the "data sensitive" statement.
Because you cannot prove that a "data sensitive" process works on all data, since you can't possibly have access to all data. More importantly, you don't have access to the end user's data! The statement "data sensitive" is a huge admission that the process may not work, and without running brute-force tests on all runs, the user would never know whether he/she can trust the results or not. It appears to be your opinion that just documenting the process with the words "data sensitive" is enough, and that if the user wishes to shoot themselves in the foot, that is OK. I don't share that view, and perhaps nothing I say will change your mind; nonetheless, there are a couple of thoughts that perhaps you might consider.
For a mergeable process that incorporates truncation/limiting followed by aggregation, to date we know of no algorithm that would produce trustworthy results if the aggregations allow negative values or non-linear calculations (e.g., unique counting). To lead the user to believe (or hope) that an expression like

The only algorithms that do incorporate truncation/limiting followed by aggregation and do provide trustworthy results are the sketching algorithms described in the scientific literature: the main ones are the Misra-Gries, Space-Saving, Heavy-Hitters, and Frequent-Items algorithms. And all of these algorithms require that the weight aggregations be linear and additive with positive numbers. These are not a subset of a larger set of working algorithms. They are the only set that, to date, works.
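The non-linearity problem with unique counting can be shown with a tiny exact model (an illustration of the failure mode with invented data, not Druid code): if each node truncates to its locally best key by distinct count before merging, a key whose users are spread across nodes is undercounted, and the final ranking is wrong.

```python
def top_keys_by_uniques(node, limit):
    """Keep only the `limit` keys with the most locally-distinct users."""
    ranked = sorted(node.items(), key=lambda kv: -len(kv[1]))
    return dict(ranked[:limit])

# Two nodes; each maps key -> set of user IDs seen on that node.
node1 = {"a": {"u1", "u2", "u3"}, "b": {"u4", "u5"}}
node2 = {"b": {"u6", "u7", "u8"}, "a": {"u1"}}

# Exact answer: b has 5 distinct users, a has 3 -- b should win.
# Truncated merge: each node forwards only its local top-1, so b's
# users on node1 (and a's on node2) are silently lost.
survivors = [top_keys_by_uniques(node1, 1), top_keys_by_uniques(node2, 1)]
merged = {}
for part in survivors:
    for key, users in part.items():
        merged.setdefault(key, set()).update(users)
```

Note that even additive sums can be distorted by this kind of truncation, as discussed earlier in the thread; non-linear aggregations like distinct counts add a second source of error on top of that, which is why the mergeable frequent-items sketches restrict weights to positive, additive values.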
@leerho I think we're on the same page regarding the behavior of the current topN. Our disagreement comes from what we consider "meaningful", which is more subjective. For you, "meaningful" means returning useful and correct results regardless of the data being queried. For me, it's OK if correctness depends on the data; for example, I consider binary search to be useful and capable of returning meaningful results, even though it returns garbage on unsorted data. You probably also consider binary search to be useful. So the main difference is, I think, that it's usually simple to ensure that data fed to binary search is sorted, but much harder to ensure that data fed to topN is suitable along all combinations of aggregations/filters/intervals. That's the main practical problem with topN IMO. But the difficulty here depends on the size of the query space, which is application-specific, so I don't think there can be a single yes/no answer regarding whether one should use topN or not. It's interesting that you mention running brute force queries to verify topN correctness. It's actually quite viable, because the topN query quickly returns results to the user while the brute force (groupBy) query runs in the background using spare cluster resources. So the user experience is improved, even though the total computation time and resources are indeed larger than running only the brute force query. |
@peferron Thank you for your thoughtful comments. Clearly we have to leave it up to an informed user to decide, and all we can do is our best to make sure that he/she is informed.

Bouncing back to the top of this thread: we are developing a new sketch that we are tentatively calling "Frequent Unique Nodes" (FUN). Suppose you have a stream that contains pairs {IP address, UserID}, and you wish to identify the IP addresses that have the largest number of unique users. In this context, think of a large graph where the IP addresses and users are nodes in the graph. Consider Node1 = IP and Node2 = ID; then we want to identify the Node1s that have the largest number of unique Node2s. Conversely, it might also be interesting to identify the Node2s (IDs) that have the largest number of unique Node1s (IPs). Conceptually, this can also be extended to more than just 2 nodes (although don't go nuts with this!). With this new sketch you will be able to perform these types of queries and have some guarantees of accuracy as well. If this is of interest, please let me know, as we could use your help in characterizing and performance testing this, if possible. Lee.
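As a reference point for what such a sketch approximates, here is an exact (unbounded-memory) version of the computation; the function name is invented for this illustration, and the real sketch would replace the per-key sets with compact HLL-style counters to bound memory:

```python
from collections import defaultdict

def frequent_distinct(pairs, n):
    """Exact 'frequent distinct' query: for each primary key, count the
    distinct secondary values, and return the n keys with the most.
    Memory grows with the number of distinct pairs, which is exactly
    the cost a sketch version would bound."""
    seen = defaultdict(set)
    for key, value in pairs:
        seen[key].add(value)
    return sorted(((k, len(v)) for k, v in seen.items()),
                  key=lambda kv: -kv[1])[:n]
```

Feeding it {IP address, UserID} pairs answers "which IPs have the most unique users"; swapping the tuple order answers the converse question (IDs seen from the most unique IPs).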
From your description, it sounds like it could handle the "top songs by unique listeners" query we discussed earlier, by processing pairs of {SongID, UserID}. Essentially like a topN with HLL, but with accuracy guarantees and still better performance than an exact query. If that's the case, then it sounds extremely useful. I'd love to see how you plan to implement it. It would be especially useful if additional aggregations could be computed at the same time, such as computing the total (non-distinct) count for each key:

Once you have a draft extension ready, I'll be happy to run it against some of our real-world datasets if that helps. I'll even pick some dimensions where topN fails badly, for comparison 😄. I won't have time to help develop the extension itself, unfortunately.
Yes.
Yes, with a caveat. For the example you gave:

The more generic we try to make this, the more challenging it will be to configure, and the more performance will be impacted.
Understood. Extra aggregations can still be computed in a follow-up query anyway, such as
Sure. Are you looking for any specific patterns in the test data? I assume that Yahoo already has large datasets & Druid clusters that could be used, so I'm trying to see what I could bring to the table here.
The best would be a branch that I can check out to build the extension from source. |
Just to let you know, I am almost done with this new Sketch. I decided that the FUN acronym was a bit too whimsical, so I have renamed it Frequent Distinct Tuples, which I think better reflects what it is doing. Not as easy to say, I realize, but we will get used to it. This new Sketch is still experimental, as I don't know how it will actually perform yet. But it is better than nothing, at least until a better algorithm comes along :)
@peferron I am actually quite pleased with its accuracy. You can find the configuration file for running the characterization tests here. Have fun!
@peferron @gianm
Thanks @leerho. I'm on vacation atm, but should be able to characterize it against a few of our datasets after I'm back in early May.
Description
@leerho proposed how the topN algorithm can be improved here: #7161 (comment)
Motivation
Avoid potentially significant errors, see #1134 and #1135.