[Proposal] More flexible dimension types #2292

Closed
jon-wei opened this issue Jan 19, 2016 · 33 comments

jon-wei (Contributor) commented Jan 19, 2016

Currently, Druid assumes that all dimensions have string values and are associated with a bitmap index.

There has been interest in loosening these constraints to support use cases that blur the existing separation between dimensions and metrics, e.g., filtering on numeric columns, aggregating dimensions at query time.

A recent discussion on these topics can be found here:
https://groups.google.com/d/msg/druid-user/Mk6omlC6Vbk/jtIFGFrACwAJ

This proposal was initially sent out on the druid-dev list, and initial comments can be found there:
https://groups.google.com/d/topic/druid-development/obtfNJnXPDg/discussion

This proposal calls for two major changes/features:


1.) Remove the assumption that dimensions always have string values.

This change is a path towards reducing the distinction between dimensions and metrics.

This would involve changes to:

  • IncrementalIndex, IndexMerger, etc. (ingestion)
  • StorageAdapters, query engines, etc. (querying)
  • Ingestion specs, allowing the user to specify dimension types (e.g., String, Long, Float); see the sketch after this list
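
For illustration, a minimal sketch of what a typed dimension schema on the ingestion side might look like (hypothetical names, not the actual DimensionSchema class added later in druid-io/druid-api#75):

```java
// Hypothetical sketch of a typed dimension schema for ingestion specs.
// The names (TypedDimensionSchema, DimensionType) are illustrative only.
import java.util.Arrays;
import java.util.List;

public class TypedDimensionSchema
{
  public enum DimensionType { STRING, LONG, FLOAT }

  private final String name;
  private final DimensionType type;

  public TypedDimensionSchema(String name, DimensionType type)
  {
    this.name = name;
    this.type = type;
  }

  public String getName() { return name; }
  public DimensionType getType() { return type; }

  // Example: a dimensionsSpec-like list mixing string and numeric dims.
  public static List<TypedDimensionSchema> exampleDimensions()
  {
    return Arrays.asList(
        new TypedDimensionSchema("page", DimensionType.STRING),
        new TypedDimensionSchema("userId", DimensionType.LONG),
        new TypedDimensionSchema("latencyMs", DimensionType.FLOAT)
    );
  }
}
```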

2.) Allow user to choose per-column index strategies

Druid could support a wider range of index types beyond bitmaps. Giving users control over what indexes are used on a per-column basis could make Druid more powerful and efficient.

For example, if a dimension is expected to have high cardinality and range filters applied to it, the user may want to choose a tree-based index instead of bitmaps.

As another example, trie indexes could be used to better support text search on dimension values.

The existing ColumnCapabilities class could be used to describe which indexes are supported for a column (see the sketch after the list below).

This would involve changes to:

  • query-related components, allow them to handle columns that do not use bitmap indexes
  • on-disk storage format, to store new index types with the columns
  • ingestion specs
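
As a rough sketch of the per-column index idea (hypothetical names, not the actual ColumnCapabilities API), a column could declare which index types it supports and the user could pick one per dimension:

```java
// Hypothetical sketch of describing per-column index support, in the spirit
// of ColumnCapabilities. IndexType and these factory methods are illustrative,
// not existing Druid interfaces.
import java.util.EnumSet;
import java.util.Set;

public class IndexCapabilities
{
  public enum IndexType { BITMAP, TREE, TRIE, NONE }

  private final Set<IndexType> supported;

  public IndexCapabilities(Set<IndexType> supported)
  {
    this.supported = supported;
  }

  public boolean supports(IndexType type)
  {
    return supported.contains(type);
  }

  public static IndexCapabilities forHighCardinalityRangeFiltering()
  {
    // e.g., a high-cardinality numeric dim that is mostly range-filtered
    // might opt for a tree-based index instead of bitmaps.
    return new IndexCapabilities(EnumSet.of(IndexType.TREE));
  }

  public static IndexCapabilities forTextSearch()
  {
    // e.g., a dim that is searched by prefix might opt for a trie index.
    return new IndexCapabilities(EnumSet.of(IndexType.TRIE, IndexType.BITMAP));
  }
}
```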

EDIT, Mar. 15 2016:

The following patch has been merged; it prepares IncrementalIndex for later typing-related changes:
#2263

The following PRs are currently open, and should be reviewed/merged in order:
druid-io/druid-api#75 - Adds DimensionSchema class for specifying dimension type, properties
#2607 - Updates druid main to use DimensionSchema
#2621 - larger PR that adds support for Long/Float typed dims

jon-wei (Contributor, Author) commented Jan 19, 2016

I have opened a PR with some initial work on Part 1 (corresponding to the ingestion subitem):

#2263

The PR changes the internal structures used by IncrementalIndex such that they can now store numeric types for dimensions.

Interfaces between IncrementalIndex and other modules are left unchanged (i.e., dims will still appear to be Strings to other modules)

jon-wei (Contributor, Author) commented Jan 19, 2016

As a next step, I am currently working on adding basic support for numeric dimensions to the query engines and filters (part 1, subitem 2).

At a high level, the approach I am considering is:

  • Reuse the existing Long/Float column formats to represent numeric dimensions on disk, similar to how timestamp is stored now (not adding new index types at this point)
  • No dictionary encoding on numeric dims, just use the numeric values directly
  • Change the QueryableIndex and Adapter classes to support numeric dims
  • Adapt the query-related components to support dim columns without bitmap indexes and dictionary encoding

I plan to have a PR out for this by the end of next week (end of January).
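
To make the "no dictionary encoding" point concrete, here is a simplified sketch (hypothetical selector interfaces, not Druid's actual ones) of the difference between reading a dictionary-encoded string dim and reading a raw long dim:

```java
// Simplified sketch: values of a numeric dim are read directly per row
// instead of being looked up by dictionary id. Interfaces are illustrative.
interface LongColumnSelector
{
  long get(); // value at the cursor's current row
}

interface DictionaryEncodedSelector
{
  int getRowId();            // dictionary id at the current row
  String lookupName(int id); // id -> value lookup
}

final class ExampleReads
{
  // String dim today: id, then dictionary lookup.
  static String readStringDim(DictionaryEncodedSelector selector)
  {
    return selector.lookupName(selector.getRowId());
  }

  // Proposed numeric dim: no dictionary, just the raw value.
  static long readLongDim(LongColumnSelector selector)
  {
    return selector.get();
  }
}
```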

drcrallen (Contributor) commented:

Are these two changes required to go in at the same time? As in: is it a lot of extra effort to add string/float/long values backed by bitmaps separately from arbitrary index types?

drcrallen (Contributor) commented:

If I use LONG dimension values to store fake ISO dates like 20150119, does that still fit in with the "no dictionary" approach?

gianm (Contributor) commented Jan 20, 2016

@drcrallen That kind of column should compress really well with just block compression on long columns, since there will be a lot of runs of repeated data.

jon-wei (Contributor, Author) commented Jan 20, 2016

@drcrallen

For your first point, I don't think they would be required to go in at the same time.

In terms of effort, I think adding float/long backed by bitmaps/dictionary encoding would probably be the "minimal" path, in that it involves less change to existing logic/interfaces, compared to adding support for arbitrary index types.

I discussed dictionary encoding for numeric dims briefly with @gianm today; from that, we were leaning towards a "no index/dictionary" approach for numeric dims as an initial implementation because of concerns with wasted storage and added cost of dictionary lookups since the numeric types are fixed-width already.

jon-wei (Contributor, Author) commented Jan 20, 2016

@drcrallen For your second point, what is the impact of having dictionary encoding/not having it for the fake ISO date use case?

drcrallen (Contributor) commented:

Imagine having a dataset where your event stream includes "signup-date" and you want to filter events based on who signed up between 30 days and 6 months ago. One potential implementation using the dictionary approach would filter dimension values on whether they are in the range and OR the matching bitmaps together.

If we're sticking with the cursor-based approach and NO dictionary/index, would you want to read every row value and evaluate some predicate to determine inclusion?
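
For illustration, a rough sketch contrasting the two approaches for a range filter on a long-typed dim (hypothetical types, not Druid's actual bitmap or column classes):

```java
// Hypothetical sketch contrasting the two approaches for a range filter on a
// long-typed "signup-date" dim (e.g., 20150119-style values).
import java.util.BitSet;

final class RangeFilterSketch
{
  // Dictionary/bitmap approach: collect bitmaps for in-range dictionary
  // values and OR them together.
  static BitSet filterWithBitmaps(long lower, long upper,
                                  long[] sortedDictionary, BitSet[] bitmaps)
  {
    BitSet matches = new BitSet();
    for (int id = 0; id < sortedDictionary.length; id++) {
      long value = sortedDictionary[id];
      if (value >= lower && value <= upper) {
        matches.or(bitmaps[id]);
      }
    }
    return matches;
  }

  // No-index approach: scan every row and evaluate the predicate.
  static BitSet filterWithFullScan(long lower, long upper, long[] column)
  {
    BitSet matches = new BitSet();
    for (int row = 0; row < column.length; row++) {
      if (column[row] >= lower && column[row] <= upper) {
        matches.set(row);
      }
    }
    return matches;
  }
}
```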

drcrallen (Contributor) commented:

Intuitively it seems like it would make a difference for low vs high cardinality numeric dimensions, where I can see the dictionary/index approach being useful for low cardinality ones, but potentially counter-productive (due to disk space and bitmap-merge time) for high cardinality ones.

drcrallen (Contributor) commented:

But some numbers would be required to back that up.

KurtYoung (Contributor) commented:

I think we should add another option per dimension too: whether the dimension is multi-value and what its delimiter is. Currently we configure a single global "listDelimiter" for all dimensions, which can cause a dimension that should be single-value, but happens to contain the "listDelimiter", to be treated by Druid as a multi-value dimension. It is also sometimes hard to get users to set one listDelimiter for all multi-value dimensions, because these dimensions may come from different sources and be owned by different people.
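
As a simplified illustration of the problem (not Druid's actual parser code), a single global delimiter can silently turn an intended single-value dim into a multi-value one:

```java
// Simplified illustration: any dimension value containing the global
// listDelimiter is treated as multi-value, whether intended or not.
import java.util.Arrays;
import java.util.List;

final class ListDelimiterExample
{
  static List<String> parseDimension(String rawValue, String listDelimiter)
  {
    return Arrays.asList(rawValue.split(listDelimiter));
  }

  public static void main(String[] args)
  {
    String globalDelimiter = ",";

    // Intended multi-value dim: works as expected.
    System.out.println(parseDimension("tag1,tag2", globalDelimiter));
    // => [tag1, tag2]

    // Intended single-value dim that happens to contain a comma:
    // incorrectly becomes a multi-value dim.
    System.out.println(parseDimension("Acme, Inc.", globalDelimiter));
    // => [Acme,  Inc.]
  }
}
```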

jon-wei (Contributor, Author) commented Jan 20, 2016

@drcrallen I see, makes sense.

I think it would be a sensible end goal to support bitmaps+dictionaries, no index, and alternate indexes (e.g., the sorted array + binary search structure for range queries that Conrad Lee suggested in the original "numeric dimensions" thread) as a user-configurable option.

In terms of plan/development steps, I am leaning towards getting one indexing strategy working for numeric vals for initial "it works" functionality, and then adding support for configurable indexes as a later PR.
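
For reference, a minimal sketch of the sorted array + binary search idea mentioned above (illustrative only; the long-typed values and the rowIds layout are assumptions, not part of any existing Druid format):

```java
// Sorted array + binary search range index sketch: store the dim values
// sorted alongside their row ids, answer a range filter with two searches.
import java.util.Arrays;

final class SortedArrayRangeIndex
{
  private final long[] sortedValues; // dim values, ascending
  private final int[] rowIds;        // rowIds[i] is the row holding sortedValues[i]

  SortedArrayRangeIndex(long[] sortedValues, int[] rowIds)
  {
    this.sortedValues = sortedValues;
    this.rowIds = rowIds;
  }

  // Row ids whose value falls in [lower, upper]; assumes upper < Long.MAX_VALUE.
  int[] rangeFilter(long lower, long upper)
  {
    int from = lowerBound(lower);
    int to = lowerBound(upper + 1); // exclusive
    return Arrays.copyOfRange(rowIds, from, to);
  }

  // First position whose value is >= target.
  private int lowerBound(long target)
  {
    int lo = 0;
    int hi = sortedValues.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sortedValues[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }
}
```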

drcrallen (Contributor) commented:

I'm wondering now if 2 from your list makes more sense to hammer out first, so that 1 will retain backwards compat during iterations of "what dimension indices are supported".

jon-wei (Contributor, Author) commented Jan 20, 2016

@KurtYoung That sounds like a pretty useful enhancement, can you file an issue for that? It sounds somewhat orthogonal to the dim typing changes, so I think a separate issue would be easier to track

drcrallen (Contributor) commented:

@KurtYoung / @jon-wei it is orthogonal but a good feature. It would be implemented in the Parser rather than in the index itself.

KurtYoung (Contributor) commented:

@drcrallen @jon-wei Yes, I once wanted to add this feature to the parser, but it seems the parser's code is in Metamarkets' repository and I don't know how to modify that.

drcrallen (Contributor) commented:

@KurtYoung java-util should be Apache v2 licensed; it just happens to be in a Metamarkets repository. xvrl, nishantmonu51, and I have the capacity to release from the Metamarkets repositories.

jon-wei (Contributor, Author) commented Jan 20, 2016

@drcrallen Re: your comment about the ordering of changes

I think part 1 (loosen typing assumptions) and part 2 (more index types) could be done mostly independently of each other;

e.g., a lower impact way of implementing #1 could be to completely retain all of the existing bitmap/dictionary logic for numeric dims (setting ColumnDescriptor and ObjectStrategy based on the dim type should suffice for building the disk representation of the numeric dim, so no storage format change should be needed). Storage format optimization (for dictionary/bitmap with fixed width values) could be done later.
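
Purely as an illustration of that idea (the names below are hypothetical, not the actual ColumnDescriptor/ObjectStrategy APIs), the dim type declared in the ingestion spec would drive the choice of on-disk value type while leaving the rest of the dictionary/bitmap pipeline alone:

```java
// Hypothetical sketch: map the declared dim type to an on-disk value type;
// everything not listed falls back to the existing String behavior.
final class DimTypeSerdeChooser
{
  enum ValueType { STRING, LONG, FLOAT }

  static ValueType chooseValueType(String declaredDimType)
  {
    switch (declaredDimType.toLowerCase()) {
      case "long":
        return ValueType.LONG;
      case "float":
        return ValueType.FLOAT;
      default:
        return ValueType.STRING; // existing behavior stays the default
    }
  }
}
```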

In parallel, I think it would be possible to implement #2 separately from #1, i.e., change the querying code to support non-bitmap indexes (or no index at all) for Strings only.

A later merge of those two parts would loosen typing for the non-bitmap index logic.

The plan I described in my third comment in the thread (indexless, no dictionary numeric dims) straddles part 1 and part 2 (needs both typing changes and support for indexless querying).

After more thinking, I'm now personally leaning towards retaining dictionary/bitmaps for numerics as an initial step. It may not be optimal in terms of storage usage/performance at first (based on cardinality probably, as you said earlier), but it may ease the overall process by limiting the scope of individual PRs.

What are some backward compatibility issues that you foresee if part 1 is done before part 2?

I expect part 2 changes to be larger/broader impact than part 1 changes.

@gianm any thoughts?

drcrallen (Contributor) commented:

@jon-wei

My main concern regarding backwards compatibility is that if you guess a way to do a "new" index type (non-dictionary/bitmap) and due to conversations / code review later we discover that way doesn't really work in the long run... then we're stuck with supporting the code that reads from the on-disk segments forever.

So essentially if you're going to do 1 first, the safest way is to do it in a manner which carries the least risk of having to change as soon as 2 is tackled.

gianm (Contributor) commented Jan 20, 2016

@drcrallen the idea with (1) is that the on-disk format does not need to change or be extended at all. Hence not supporting any indexes on numeric columns at first. So at first filters on numeric columns would be done with full scans, but as part of (2) they could hit an index. If you're not filtering (maybe you're just grouping) then even (1) will give you some nice performance benefit from not having to convert things back and forth from strings.

Does that seem reasonable?

drcrallen (Contributor) commented:

I think that's reasonable for Long. Essentially that would make the __time column one of these new Long dimensions, right? If that's the case then I think there is good precedent that such a case has been used and will be used in the future.

I am a bit concerned about filtering on Float due to floating point rounding weirdness. But that is probably the problem of the filter rather than the on-disk data.

In the simple case multi-value dimensions are not supported, right?

gianm (Contributor) commented Jan 20, 2016

@drcrallen yep, (1) should make treating __time as a dimension less of a special case than it is today, or ideally not a special case at all.

Filtering on floats is weird… it probably doesn't make sense unless you have range filters. At least I can't think of a great use case for equality filters (except maybe == 0 but that doesn't have rounding weirdness)
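
A tiny example of the kind of rounding issue being referred to (plain Java double arithmetic, illustrative only):

```java
// Why exact-equality filters on floating point values are unreliable,
// while range filters still behave sensibly.
public class FloatEqualityExample
{
  public static void main(String[] args)
  {
    double stored = 0.1 + 0.2;                           // value produced upstream
    System.out.println(stored);                          // 0.30000000000000004
    System.out.println(stored == 0.3);                   // false: an equality filter on 0.3 misses the row
    System.out.println(stored >= 0.25 && stored < 0.35); // true: a range filter still matches
  }
}
```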

And yeah, in (1) we'd only support things that have an existing on disk format. So no multi-valued numeric columns.

drcrallen (Contributor) commented:

In that case might I suggest limiting the scope of the first step to something like:

Allowing single-value Long and Float dimensions. The first implementation is expected to require full table scans for filtering and will take up approximately Long.BYTES or Float.BYTES of space per row prior to any compression.

gianm (Contributor) commented Jan 21, 2016

They should still be lz4-compressed on disk, but yeah.

jon-wei (Contributor, Author) commented Jan 21, 2016

@drcrallen Good point about backwards compatibility in the segment formats.

The scope you described in the last comment seems pretty reasonable to me; I will target that for the next PR.

binlijin (Contributor) commented:

Should we support storing a dim with no index?

jon-wei (Contributor, Author) commented Jan 21, 2016

@binlijin That's the plan for now; it would be similar to the time column, which is a Long "dimension" that has no bitmap index.

binlijin (Contributor) commented:

@jon-wei I mean dims with no index, not limited to numeric dimensions but including string dimensions as well.

jon-wei (Contributor, Author) commented Jan 21, 2016

@binlijin I'm not planning on implementing indexless String dimensions in the PR I am currently working on, but I think it makes sense to support that later on.

jon-wei (Contributor, Author) commented Feb 11, 2016

I've opened a PR that adds initial support for numeric dimensions (no indexing):
#2442

jon-wei (Contributor, Author) commented Mar 15, 2016

Status, Mar. 15 2016:

The following patch has been merged:
#2263 - it prepares IncrementalIndex for later typing-related changes

The following PRs are currently open, and should be reviewed/merged in order:
druid-io/druid-api#75 - Adds DimensionSchema class for specifying dimension type, properties
#2607 - Updates druid main to use DimensionSchema
#2621 - larger PR that adds support for Long/Float typed dims

stale bot commented Jun 21, 2019

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

stale bot added the stale label Jun 21, 2019

stale bot commented Jul 5, 2019

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

stale bot closed this as completed Jul 5, 2019