[Proposal] More flexible dimension types #2292

Closed
jon-wei opened this issue Jan 19, 2016 · 33 comments

jon-wei (Contributor) commented Jan 19, 2016

Currently, Druid assumes that all dimensions have string values and are associated with a bitmap index.

There has been interest in loosening these constraints to support use cases that blur the existing separation between dimensions and metrics, e.g., filtering on numeric columns, aggregating dimensions at query time.

A recent discussion on these topics can be found here:
https://groups.google.com/d/msg/druid-user/Mk6omlC6Vbk/jtIFGFrACwAJ

This proposal was initially sent out on the druid-dev list, and initial comments can be found there:
https://groups.google.com/d/topic/druid-development/obtfNJnXPDg/discussion

This proposal calls for two major changes/features:


1.) Remove the assumption that dimensions always have string values.

This change is a path towards reducing the distinction between dimensions and metrics.

This would involve changes to:

  • IncrementalIndex, IndexMerger, etc. (ingestion)
  • StorageAdapters, query engines, etc. (querying)
  • Ingestion specs, allowing the user to specify dimension types (e.g., String, Long, Float); see the sketch after this list
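
For illustration, a minimal sketch of what a typed dimension schema on the ingestion side might look like (hypothetical names, not the actual DimensionSchema class added later in druid-io/druid-api#75):

```java
// Hypothetical sketch of a typed dimension schema for ingestion specs.
// The names (TypedDimensionSchema, DimensionType) are illustrative only.
import java.util.Arrays;
import java.util.List;

public class TypedDimensionSchema
{
  public enum DimensionType { STRING, LONG, FLOAT }

  private final String name;
  private final DimensionType type;

  public TypedDimensionSchema(String name, DimensionType type)
  {
    this.name = name;
    this.type = type;
  }

  public String getName() { return name; }
  public DimensionType getType() { return type; }

  // Example: a dimensionsSpec-like list mixing string and numeric dims.
  public static List<TypedDimensionSchema> exampleDimensions()
  {
    return Arrays.asList(
        new TypedDimensionSchema("page", DimensionType.STRING),
        new TypedDimensionSchema("userId", DimensionType.LONG),
        new TypedDimensionSchema("latencyMs", DimensionType.FLOAT)
    );
  }
}
```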

2.) Allow user to choose per-column index strategies

Druid could support a wider range of index types beyond bitmaps. Giving users control over what indexes are used on a per-column basis could make Druid more powerful and efficient.

For example, if a dimension is expected to have high cardinality and range filters applied to it, the user may want to choose a tree-based index instead of bitmaps.

As another example, trie indexes could be used to better support text search on dimension values.

The existing ColumnCapabilities class could be used to describe which indexes are supported for a column (see the sketch after the list below).

This would involve changes to:

  • query-related components, allow them to handle columns that do not use bitmap indexes
  • on-disk storage format, to store new index types with the columns
  • ingestion specs
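
As a rough sketch of the per-column index idea (hypothetical names, not the actual ColumnCapabilities API), a column could declare which index types it supports and the user could pick one per dimension:

```java
// Hypothetical sketch of describing per-column index support, in the spirit
// of ColumnCapabilities. IndexType and these factory methods are illustrative,
// not existing Druid interfaces.
import java.util.EnumSet;
import java.util.Set;

public class IndexCapabilities
{
  public enum IndexType { BITMAP, TREE, TRIE, NONE }

  private final Set<IndexType> supported;

  public IndexCapabilities(Set<IndexType> supported)
  {
    this.supported = supported;
  }

  public boolean supports(IndexType type)
  {
    return supported.contains(type);
  }

  public static IndexCapabilities forHighCardinalityRangeFiltering()
  {
    // e.g., a high-cardinality numeric dim that is mostly range-filtered
    // might opt for a tree-based index instead of bitmaps.
    return new IndexCapabilities(EnumSet.of(IndexType.TREE));
  }

  public static IndexCapabilities forTextSearch()
  {
    // e.g., a dim that is searched by prefix might opt for a trie index.
    return new IndexCapabilities(EnumSet.of(IndexType.TRIE, IndexType.BITMAP));
  }
}
```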

EDIT, Mar. 15 2016:

The following patch has been merged; it prepares IncrementalIndex for later typing-related changes:
#2263

The following PRs are currently open, and should be reviewed/merged in order:
druid-io/druid-api#75 - Adds DimensionSchema class for specifying dimension type, properties
#2607 - Updates druid main to use DimensionSchema
#2621 - larger PR that adds support for Long/Float typed dims

jon-wei (Contributor, Author) commented Jan 19, 2016

I have opened a PR with some initial work on Part 1 (corresponding to the ingestion subitem):

#2263

The PR changes the internal structures used by IncrementalIndex such that they can now store numeric types for dimensions.

Interfaces between IncrementalIndex and other modules are left unchanged (i.e., dims will still appear to be Strings to other modules)

jon-wei (Contributor, Author) commented Jan 19, 2016

As a next step, I am currently working on adding basic support for numeric dimensions to the query engines and filters (part 1, subitem 2).

At a high level, the approach I am considering is:

  • Reuse the existing Long/Float column formats to represent numeric dimensions on disk, similar to how timestamp is stored now (not adding new index types at this point)
  • No dictionary encoding on numeric dims, just use the numeric values directly
  • Change the QueryableIndex and Adapter classes to support numeric dims
  • Adapt the query-related components to support dim columns without bitmap indexes and dictionary encoding

I plan to have a PR out for this by the end of next week (end of January).
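
To make the "no dictionary encoding" point concrete, here is a simplified sketch (hypothetical selector interfaces, not Druid's actual ones) of the difference between reading a dictionary-encoded string dim and reading a raw long dim:

```java
// Simplified sketch: values of a numeric dim are read directly per row
// instead of being looked up by dictionary id. Interfaces are illustrative.
interface LongColumnSelector
{
  long get(); // value at the cursor's current row
}

interface DictionaryEncodedSelector
{
  int getRowId();            // dictionary id at the current row
  String lookupName(int id); // id -> value lookup
}

final class ExampleReads
{
  // String dim today: id, then dictionary lookup.
  static String readStringDim(DictionaryEncodedSelector selector)
  {
    return selector.lookupName(selector.getRowId());
  }

  // Proposed numeric dim: no dictionary, just the raw value.
  static long readLongDim(LongColumnSelector selector)
  {
    return selector.get();
  }
}
```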

drcrallen (Contributor) commented:

Are these two changes required to go in at the same time? As in: is it a lot of extra effort to add string/float/long values backed by bitmaps separately from arbitrary index types?

drcrallen (Contributor) commented:

If I use LONG dimension values to store fake ISO dates like 20150119, does that still fit in with the "no dictionary" approach?

gianm (Contributor) commented Jan 20, 2016

@drcrallen That kind of column should compress really well with just block compression on long columns, since there will be a lot of runs of repeated data.

jon-wei (Contributor, Author) commented Jan 20, 2016

@drcrallen

For your first point, I don't think they would be required to go in at the same time.

In terms of effort, I think adding float/long backed by bitmaps/dictionary encoding would probably be the "minimal" path, in that it involves less change to existing logic/interfaces, compared to adding support for arbitrary index types.

I discussed dictionary encoding for numeric dims briefly with @gianm today; from that, we were leaning towards a "no index/dictionary" approach for numeric dims as an initial implementation because of concerns with wasted storage and added cost of dictionary lookups since the numeric types are fixed-width already.

jon-wei (Contributor, Author) commented Jan 20, 2016

@drcrallen For your second point, what is the impact of having dictionary encoding/not having it for the fake ISO date use case?

drcrallen (Contributor) commented:

Imagine having a dataset where your event stream includes "signup-date" and you want to filter events based on who signed up between 30 days and 6 months ago. One potential implementation using the dictionary approach would filter dimension values on whether they are in the range and OR the matching bitmaps together.

If we're sticking with the cursor-based approach and NO dictionary/index, would you want to read every row value and evaluate some predicate to determine inclusion?
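
For illustration, a rough sketch contrasting the two approaches for a range filter on a long-typed dim (hypothetical types, not Druid's actual bitmap or column classes):

```java
// Hypothetical sketch contrasting the two approaches for a range filter on a
// long-typed "signup-date" dim (e.g., 20150119-style values).
import java.util.BitSet;

final class RangeFilterSketch
{
  // Dictionary/bitmap approach: collect bitmaps for in-range dictionary
  // values and OR them together.
  static BitSet filterWithBitmaps(long lower, long upper,
                                  long[] sortedDictionary, BitSet[] bitmaps)
  {
    BitSet matches = new BitSet();
    for (int id = 0; id < sortedDictionary.length; id++) {
      long value = sortedDictionary[id];
      if (value >= lower && value <= upper) {
        matches.or(bitmaps[id]);
      }
    }
    return matches;
  }

  // No-index approach: scan every row and evaluate the predicate.
  static BitSet filterWithFullScan(long lower, long upper, long[] column)
  {
    BitSet matches = new BitSet();
    for (int row = 0; row < column.length; row++) {
      if (column[row] >= lower && column[row] <= upper) {
        matches.set(row);
      }
    }
    return matches;
  }
}
```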

drcrallen (Contributor) commented:

Intuitively it seems like it would make a difference for low vs high cardinality numeric dimensions, where I can see the dictionary/index approach being useful for low cardinality ones, but potentially counter-productive (due to disk space and bitmap-merge time) for high cardinality ones.

drcrallen (Contributor) commented:

But some numbers would be required to back that up.

KurtYoung (Contributor) commented:

I think we should add another option per dimension too: whether the dimension is multi-value and what its delimiter is. Currently we configure a single global "listDelimiter" for all dimensions, which can cause a dimension that should be single-value, but happens to contain the "listDelimiter", to be treated by Druid as a multi-value dimension. It is also sometimes hard to get users to set one listDelimiter for all multi-value dimensions, because these dimensions may come from different sources and be owned by different people.
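
As a simplified illustration of the problem (not Druid's actual parser code), a single global delimiter can silently turn an intended single-value dim into a multi-value one:

```java
// Simplified illustration: any dimension value containing the global
// listDelimiter is treated as multi-value, whether intended or not.
import java.util.Arrays;
import java.util.List;

final class ListDelimiterExample
{
  static List<String> parseDimension(String rawValue, String listDelimiter)
  {
    return Arrays.asList(rawValue.split(listDelimiter));
  }

  public static void main(String[] args)
  {
    String globalDelimiter = ",";

    // Intended multi-value dim: works as expected.
    System.out.println(parseDimension("tag1,tag2", globalDelimiter));
    // => [tag1, tag2]

    // Intended single-value dim that happens to contain a comma:
    // incorrectly becomes a multi-value dim.
    System.out.println(parseDimension("Acme, Inc.", globalDelimiter));
    // => [Acme,  Inc.]
  }
}
```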

jon-wei (Contributor, Author) commented Jan 20, 2016

@drcrallen I see, makes sense.

I think it would be a sensible end goal to support bitmaps+dictionaries, no index, and alternate indexes (e.g., the sorted array + binary search structure for range queries that Conrad Lee suggested in the original "numeric dimensions" thread) as a user-configurable option.

In terms of plan/development steps, I am leaning towards getting one indexing strategy working for numeric vals for initial "it works" functionality, and then adding support for configurable indexes as a later PR.
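
For reference, a minimal sketch of the sorted array + binary search idea mentioned above (illustrative only; the long-typed values and the rowIds layout are assumptions, not part of any existing Druid format):

```java
// Sorted array + binary search range index sketch: store the dim values
// sorted alongside their row ids, answer a range filter with two searches.
import java.util.Arrays;

final class SortedArrayRangeIndex
{
  private final long[] sortedValues; // dim values, ascending
  private final int[] rowIds;        // rowIds[i] is the row holding sortedValues[i]

  SortedArrayRangeIndex(long[] sortedValues, int[] rowIds)
  {
    this.sortedValues = sortedValues;
    this.rowIds = rowIds;
  }

  // Row ids whose value falls in [lower, upper]; assumes upper < Long.MAX_VALUE.
  int[] rangeFilter(long lower, long upper)
  {
    int from = lowerBound(lower);
    int to = lowerBound(upper + 1); // exclusive
    return Arrays.copyOfRange(rowIds, from, to);
  }

  // First position whose value is >= target.
  private int lowerBound(long target)
  {
    int lo = 0;
    int hi = sortedValues.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sortedValues[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }
}
```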

drcrallen (Contributor) commented:

I'm wondering now if 2 from your list makes more sense to hammer out first, so that 1 will retain backwards compat during iterations of "what dimension indices are supported".

jon-wei (Contributor, Author) commented Jan 20, 2016

@KurtYoung That sounds like a pretty useful enhancement, can you file an issue for that? It sounds somewhat orthogonal to the dim typing changes, so I think a separate issue would be easier to track

drcrallen (Contributor) commented:

@KurtYoung / @jon-wei it is orthogonal but a good feature. It would be implemented in the Parser rather than in the index itself.

KurtYoung (Contributor) commented:

@drcrallen @jon-wei Yes, I once wanted to add this feature to the parser, but it seems the parser's code is in Metamarkets' repository and I don't know how to modify that.

drcrallen (Contributor) commented:

@KurtYoung java-util should be Apache v2 licensed; it just happens to be in a Metamarkets repository. xvrl, nishantmonu51, and I have the capacity to release from the Metamarkets repositories.

jon-wei (Contributor, Author) commented Jan 20, 2016

@drcrallen Re: your comment about the ordering of changes

I think part 1 (loosen typing assumptions) and part 2 (more index types) could be done mostly independently of each other;

e.g., a lower impact way of implementing #1 could be to completely retain all of the existing bitmap/dictionary logic for numeric dims (setting ColumnDescriptor and ObjectStrategy based on the dim type should suffice for building the disk representation of the numeric dim, so no storage format change should be needed). Storage format optimization (for dictionary/bitmap with fixed width values) could be done later.
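
Purely as an illustration of that idea (the names below are hypothetical, not the actual ColumnDescriptor/ObjectStrategy APIs), the dim type declared in the ingestion spec would drive the choice of on-disk value type while leaving the rest of the dictionary/bitmap pipeline alone:

```java
// Hypothetical sketch: map the declared dim type to an on-disk value type;
// everything not listed falls back to the existing String behavior.
final class DimTypeSerdeChooser
{
  enum ValueType { STRING, LONG, FLOAT }

  static ValueType chooseValueType(String declaredDimType)
  {
    switch (declaredDimType.toLowerCase()) {
      case "long":
        return ValueType.LONG;
      case "float":
        return ValueType.FLOAT;
      default:
        return ValueType.STRING; // existing behavior stays the default
    }
  }
}
```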

In parallel, I think it would be possible to implement #2 separately from #1, i.e., change the querying code to support non-bitmap indexes (or no index at all) for Strings only.

A later merge of those two parts would loosen typing for the non-bitmap index logic.

The plan I described in my third comment in the thread (indexless, no dictionary numeric dims) straddles part 1 and part 2 (needs both typing changes and support for indexless querying).

After more thinking, I'm now personally leaning towards retaining dictionary/bitmaps for numerics as an initial step. It may not be optimal in terms of storage usage/performance at first (based on cardinality probably, as you said earlier), but it may ease the overall process by limiting the scope of individual PRs.

What are some backward compatibility issues that you foresee if part 1 is done before part 2?

I expect part 2 changes to be larger/broader impact than part 1 changes.

@gianm any thoughts?

drcrallen (Contributor) commented:

@jon-wei

My main concern regarding backwards compatibility is that if you guess a way to do a "new" index type (non-dictionary/bitmap) and due to conversations / code review later we discover that way doesn't really work in the long run... then we're stuck with supporting the code that reads from the on-disk segments forever.

So essentially if you're going to do 1 first, the safest way is to do it in a manner which carries the least risk of having to change as soon as 2 is tackled.

gianm (Contributor) commented Jan 20, 2016

@drcrallen the idea with (1) is that the on-disk format does not need to change or be extended at all. Hence not supporting any indexes on numeric columns at first. So at first filters on numeric columns would be done with full scans, but as part of (2) they could hit an index. If you're not filtering (maybe you're just grouping) then even (1) will give you some nice performance benefit from not having to convert things back and forth from strings.

Does that seem reasonable?

drcrallen (Contributor) commented:

I think that's reasonable for Long. Essentially that would make the __time column one of these new Long dimensions, right? If that's the case then I think there is good precedent that such a case has been used and will be used in the future.

I am a bit concerned about filtering on Float due to floating point rounding weirdness. But that is probably the problem of the filter rather than the on-disk data.

In the simple case multi-value dimensions are not supported, right?

gianm (Contributor) commented Jan 20, 2016

@drcrallen yep, (1) should make treating __time as a dimension less of a special case than it is today, or ideally not a special case at all.

Filtering on floats is weird… it probably doesn't make sense unless you have range filters. At least I can't think of a great use case for equality filters (except maybe == 0 but that doesn't have rounding weirdness)
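
A tiny example of the kind of rounding issue being referred to (plain Java double arithmetic, illustrative only):

```java
// Why exact-equality filters on floating point values are unreliable,
// while range filters still behave sensibly.
public class FloatEqualityExample
{
  public static void main(String[] args)
  {
    double stored = 0.1 + 0.2;                           // value produced upstream
    System.out.println(stored);                          // 0.30000000000000004
    System.out.println(stored == 0.3);                   // false: an equality filter on 0.3 misses the row
    System.out.println(stored >= 0.25 && stored < 0.35); // true: a range filter still matches
  }
}
```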

And yeah, in (1) we'd only support things that have an existing on disk format. So no multi-valued numeric columns.

drcrallen (Contributor) commented:

In that case might I suggest limiting the scope of the first step to something like:

Allowing single-value Long and Float dimensions. The first implementation is expected to require full table scans for filtering and will take up approximately Long.BYTES or Float.BYTES of space per row prior to any compression.

gianm (Contributor) commented Jan 21, 2016

They should still be lz4-compressed on disk, but yeah.

jon-wei (Contributor, Author) commented Jan 21, 2016

@drcrallen Good point about backwards compatibility in the segment formats.

The scope you described in the last comment seems pretty reasonable to me; I will target that for the next PR.

binlijin (Contributor) commented:

Should we support storing a dim with no index?

jon-wei (Contributor, Author) commented Jan 21, 2016

@binlijin That's the plan for now; it would be similar to the time column, which is a Long "dimension" that has no bitmap index.

binlijin (Contributor) commented:

@jon-wei I mean dims with no index, not limited to numeric dimensions but including string dimensions as well.

jon-wei (Contributor, Author) commented Jan 21, 2016

@binlijin I'm not planning on implementing indexless String dimensions in the PR I am currently working on, but I think it makes sense to support that later on.

jon-wei (Contributor, Author) commented Feb 11, 2016

I've opened a PR that adds initial support for numeric dimensions (no indexing):
#2442

jon-wei (Contributor, Author) commented Mar 15, 2016

Status, Mar. 15 2016:

The following patch has been merged:
#2263 - it prepares IncrementalIndex for later typing-related changes

The following PRs are currently open, and should be reviewed/merged in order:
druid-io/druid-api#75 - Adds DimensionSchema class for specifying dimension type, properties
#2607 - Updates druid main to use DimensionSchema
#2621 - larger PR that adds support for Long/Float typed dims

stale bot commented Jun 21, 2019

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

stale bot added the stale label Jun 21, 2019

stale bot commented Jul 5, 2019

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

stale bot closed this as completed Jul 5, 2019