Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow IncrementalIndex to store Long/Float dimensions #2263

Merged
merged 1 commit into from
Feb 26, 2016

Conversation

jon-wei
Copy link
Contributor

@jon-wei jon-wei commented Jan 14, 2016

This PR updates IncrementalIndex such that it can ingest and store Long/Float dimensions in addition to Strings. It is intended as a step towards the larger goals being discussed in the following topic:

https://groups.google.com/forum/#!topic/druid-development/obtfNJnXPDg/discussion

  • DimDim is now a generic class instead of assuming String values. The TimeAndDims grouping key class now stores dimension values as Comparable[] instead of String[].
  • The PR does not change the DimensionSelector or *Adapter interfaces to support non-String types, so query/index persist behavior is unchanged, nor does this PR provide a way for the user to specify non-String dimension types in the ingestion schema.
  • Schema-less dimension type discovery is not enabled in the PR (newly discovered dims are assumed to be Strings).
  • The PR includes a JMH benchmark that adds rows with Long/Float/String columns to an IncrementalIndex.

I ran the row-adding benchmark with type discovery enabled for testing purposes and the numbers are shown below:

java -jar target/benchmarks.jar IncrementalIndexAddRowsBenchmark -f 1 -i 20 -wi 10

With patch, non-String dims:

Run complete. Total time: 00:08:05

Benchmark                                       Mode  Cnt   Score   Error  Units
IncrementalIndexAddRowsBenchmark.normalFloats   avgt   20  18.918 ± 3.063  us/op
IncrementalIndexAddRowsBenchmark.normalLongs    avgt   20  10.783 ± 0.394  us/op
IncrementalIndexAddRowsBenchmark.normalStrings  avgt   20  28.218 ± 2.513  us/op

Without patch, String dims only:

Run complete. Total time: 00:09:34

Benchmark                                       Mode  Cnt   Score   Error  Units
IncrementalIndexAddRowsBenchmark.normalFloats   avgt   20  19.977 ± 2.803  us/op
IncrementalIndexAddRowsBenchmark.normalLongs    avgt   20  20.520 ± 3.298  us/op
IncrementalIndexAddRowsBenchmark.normalStrings  avgt   20  29.656 ± 3.261  us/op

@@ -74,6 +76,80 @@
{
private volatile DateTime maxIngestedEventTime;

private static final Map<String, ValueType> typeMap = ImmutableMap.<String, ValueType>builder()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static variable name in UPPERCASE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nishantmonu51 addressed

@jon-wei jon-wei force-pushed the flex_dims3 branch 4 times, most recently from 94c8b13 to 7de6bdc Compare January 18, 2016 21:29
@@ -74,6 +77,76 @@
{
private volatile DateTime maxIngestedEventTime;

private static final Map<String, ValueType> TYPE_MAP = ImmutableMap.<String, ValueType>builder()
.put("Long[]", ValueType.LONG)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, do we have any parser in Druid currently that returns [] type for multi-valued dimension ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pjain1 Do you mean if the multi-value array can have subitems with an array[] type? If so, I don't think that will occur, the values would be single numerics or Strings for the JSON parsers and all single strings for the delimited parsers, I believe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then why there is a need for [] types here as TYPE_MAP is used like TYPE_MAP.get(singleVal.getClass().getSimpleName()) where singleValue as you said "would be single numerics or Strings for the JSON parsers and all single strings for the delimited parsers, I believe."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pjain1 It's not strictly necessary now, I think I had it in there from an older implementation where I used the TYPE_MAP to get a ValueType from a Comparable[].

I kept it there in case, but I can remove it if needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jon-wei I was just curious to know why it was there. No need to remove if it does not hurt anything.

@jon-wei
Copy link
Contributor Author

jon-wei commented Jan 21, 2016

closed for now to rebase to master, will reopen

@drcrallen
Copy link
Contributor

dont' force-push your branch until the PR is open or it will render this unable to re-open

@jon-wei
Copy link
Contributor Author

jon-wei commented Jan 22, 2016

@drcrallen yeah, forgot about that yesterday =\

I was able to re-open it though by doing a "git reset --hard" to the SHA at the time the PR was closed

@fjy
Copy link
Contributor

fjy commented Feb 5, 2016

@jon-wei we need to rebase this

@jon-wei jon-wei force-pushed the flex_dims3 branch 2 times, most recently from 6850331 to 7a92273 Compare February 6, 2016 00:14
@fjy fjy added this to the 0.9.1 milestone Feb 6, 2016
@jon-wei
Copy link
Contributor Author

jon-wei commented Feb 6, 2016

Updated this to rebase on top of changes introduced by:

#2085

The PR above changed TimeAndDims to use a dictionary encoded int[] array instead of String[]; this patch uses dictionary encoding for Long/Float values as well (unnecessary).

I plan to change TimeAndDims to use Comparable[] instead of int[] in a later PR, still WIP in my fork:
https://github.com/jon-wei/druid/tree/flex_dims_pt2

Strings will store a dictionary encoding in TimeAndDims as in PR #2085, numerics will store their values directly in TimeAndDims.

@jon-wei jon-wei force-pushed the flex_dims3 branch 3 times, most recently from f2aa86d to 8e51849 Compare February 6, 2016 01:55
@@ -0,0 +1,178 @@
package io.druid.benchmark;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

header

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fjy added license header

@fjy
Copy link
Contributor

fjy commented Feb 9, 2016

👍

dims[dimIndex][i] = dimLookups[dimIndex].idToIndex(dimValues[dimIndex][i]);
dims[dimIndex][i] = sortedDimLookups[dimIndex].getSortedIdFromUnsortedId(dimValues[dimIndex][i]);
//TODO: in later PR, Rowboat will use Comparable[][] instead of int[][]
// Can remove dictionary encoding for numeric dims then.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please file an issue for this todo.

@nishantmonu51
Copy link
Member

What is the behavior of null for Long/Float dimensions ?
Will it throw parseException or successfully ingest null values and treat them as null ?

can we also add tests for null values in Long/Float dimension to make sure it behaves as expected.

@nishantmonu51
Copy link
Member

Also need to add docs for the added functionality.

@jon-wei
Copy link
Contributor Author

jon-wei commented Feb 9, 2016

@nishantmonu51

RE: null handling, the nulls are being converted to 0. It's not in this PR, but a follow on PR that I am finishing up:

https://github.com/jon-wei/druid/blob/5bffacec5772718bcb948b81e6ace4192d940396/processing/src/main/java/io/druid/segment/IndexMerger.java#L1432

That PR reuses the existing single-value long/float column formats for the numeric dims, representing null without changes to the storage format (need some mechanism to distinguish null from normal values) or using dictionary encoding (have a dictionary entry for null) seems problematic. Throwing an exception sounds okay to me too (maybe let user configure? choice of "throw exception" or "convert null to 0")

RE: docs and tests, this open PR doesn't actually enable the use of long/float dimensions, it's just preparing IncrementalIndex for later changes:

  • It doesn't provide a way to specify dimension typing in the schema
  • No query/filter/extraction fn, etc. support for long/floats yet

I address those points and have tests in the follow-on patch (not ready yet), I think it would make more sense to include docs in the later PR when the numeric dims are actually usable

@jon-wei
Copy link
Contributor Author

jon-wei commented Feb 9, 2016

@fjy @nishantmonu51 will file issues for TODOs

@nishantmonu51
Copy link
Member

+1, I think this is ready to be merged. docs can be added in the follow up PR.

@fjy
Copy link
Contributor

fjy commented Feb 19, 2016

@jon-wei do you mind resolving merge conflicts?

@jon-wei
Copy link
Contributor Author

jon-wei commented Feb 24, 2016

@fjy rebased

fjy added a commit that referenced this pull request Feb 26, 2016
Allow IncrementalIndex to store Long/Float dimensions
@fjy fjy merged commit 29d29ba into apache:master Feb 26, 2016
@jon-wei jon-wei deleted the flex_dims3 branch October 6, 2017 22:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants