Correlation Metric Aggregator #16817

nknize · 2016-02-26T05:33:23Z

This first Multi-Field metric aggregation will compute the Pearson product-moment correlation coefficient for a given list of numeric fields.

Example usage (correlating stock value against S&P 500):

"pearson_correlation" : {
    "correlation" : {
        "field" : [ "FOO_Corp_Stock_Value", "AAPL", "INDEXSP" ]
    }
}

Example result:

The result contains only the upper triangle of the correlation matrix (excluding the diagonal, where each field's correlation with itself would just be 1.0).

"pearson_correlation" : {
    "correlation" : {
        "FOO_Corp_Stock_Value" : {
            "AAPL" : 0.015411381889249296,
            "INDEXSP" : 0.028740529828528586
        },
        "AAPL" : {
            "INDEXSP" : 0.429228885527728843
        }
    }
}

This hypothetical example shows a small positive correlation (0.0154) between Apple and FOO Corp. stock value, a small positive correlation (0.0287) between FOO Corp. stock and the S&P 500, and a strong (fake) positive correlation (0.4292) between Apple stock and the S&P 500.

Although this is a made-up example with contrived data, the correlation is a useful statistical tool for finding relationships in all kinds of data (medical, finance, log trends) and provides a foundation for much more advanced statistics (e.g., spatial statistics and econometrics).
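
To make the output shape concrete, here is a minimal Python sketch of how an upper-triangle result like the one above could be produced. It is not the aggregation's implementation: the helper names `pearson` and `upper_triangle` and the sample documents are made up for illustration, and documents missing any requested field are simply dropped.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def upper_triangle(docs, fields):
    """Nested dict with one correlation per field pair; the diagonal is omitted."""
    # Assumption: keep only documents that carry a value for every field.
    complete = [d for d in docs if all(d.get(f) is not None for f in fields)]
    columns = {f: [d[f] for d in complete] for f in fields}
    result = {}
    # combinations() yields each unordered pair once, in field order.
    for a, b in combinations(fields, 2):
        result.setdefault(a, {})[b] = pearson(columns[a], columns[b])
    return result

# Made-up documents; the values are illustrative only.
docs = [
    {"FOO_Corp_Stock_Value": 1.0, "AAPL": 2.0, "INDEXSP": 1.0},
    {"FOO_Corp_Stock_Value": 2.0, "AAPL": 4.0, "INDEXSP": 3.0},
    {"FOO_Corp_Stock_Value": 3.0, "AAPL": 6.0, "INDEXSP": 2.0},
    {"FOO_Corp_Stock_Value": 4.0, "AAPL": None, "INDEXSP": 9.0},  # dropped
]
fields = ["FOO_Corp_Stock_Value", "AAPL", "INDEXSP"]
triangle = upper_triangle(docs, fields)
```

The last field in the list never appears as an outer key, which matches the example result where `INDEXSP` only shows up as an inner key.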

wamuir · 2016-02-26T14:31:54Z

Measures of covariation would be useful. However, a more general solution would be to estimate a variance/covariance matrix for two or more fields.
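
As a rough illustration of why the covariance matrix is the more general result: each correlation is the covariance divided by the product of the two standard deviations, so a correlation matrix can always be derived from a variance/covariance matrix. The Python sketch below uses made-up data and hypothetical function names, and assumes the sample (n - 1) denominator.

```python
import math

def covariance_matrix(columns):
    """Sample covariance matrix (n - 1 denominator) for equal-length columns."""
    n = len(columns[0])
    k = len(columns)
    means = [sum(c) / n for c in columns]
    cov = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i, k):
            s = sum(
                (columns[i][t] - means[i]) * (columns[j][t] - means[j])
                for t in range(n)
            )
            # The matrix is symmetric, so fill both halves at once.
            cov[i][j] = cov[j][i] = s / (n - 1)
    return cov

def correlation_from_covariance(cov):
    """r_ij = cov_ij / (std_i * std_j); the diagonal comes out as 1.0."""
    k = len(cov)
    std = [math.sqrt(cov[i][i]) for i in range(k)]
    return [[cov[i][j] / (std[i] * std[j]) for j in range(k)] for i in range(k)]

# Three made-up fields with three observations each.
columns = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 3.0, 2.0]]
cov = covariance_matrix(columns)
corr = correlation_from_covariance(cov)
```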

wamuir · 2016-02-26T15:14:20Z

Also consider, when the number of fields is >= 3, offering an option to select pairwise or listwise deletion of missing values. Pairwise deletion would complicate the output, since the query should return a separate count for each element of the matrix (if the data contain missing values).
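
The difference between the two deletion modes can be sketched in Python as follows. The `correlate` function, its `deletion` parameter, and the output layout (a `count` next to each `correlation`) are all hypothetical; they only show why pairwise deletion needs a per-pair count.

```python
import math
from itertools import combinations

def pearson(pairs):
    """Pearson correlation of a list of (x, y) tuples."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

def correlate(docs, fields, deletion="listwise"):
    """Upper-triangle correlations with the number of documents used per pair."""
    if deletion == "listwise":
        # Drop a document if ANY requested field is missing.
        docs = [d for d in docs if all(d.get(f) is not None for f in fields)]
    out = {}
    for a, b in combinations(fields, 2):
        # Pairwise: a document counts if both fields of this pair are present.
        pairs = [
            (d[a], d[b])
            for d in docs
            if d.get(a) is not None and d.get(b) is not None
        ]
        out.setdefault(a, {})[b] = {
            "count": len(pairs),
            "correlation": pearson(pairs),
        }
    return out

# Made-up documents with two missing values.
docs = [
    {"a": 1.0, "b": 2.0, "c": 1.0},
    {"a": 2.0, "b": 4.0, "c": 3.0},
    {"a": 3.0, "b": 6.0, "c": 2.0},
    {"a": 4.0, "b": 7.0, "c": None},
    {"a": 5.0, "b": None, "c": 4.0},
]
listwise = correlate(docs, ["a", "b", "c"], deletion="listwise")
pairwise = correlate(docs, ["a", "b", "c"], deletion="pairwise")
```

With listwise deletion every pair is computed from the same three complete documents, so a single count would do. With pairwise deletion the counts differ per pair (4, 4, and 3 here), which is what makes the output more complicated.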

nknize · 2016-02-26T16:09:49Z

@wamuir awesome getting this feedback!

> a more general solution would be to estimate a variance/covariance matrix

This got me thinking. For a more general-purpose agg, perhaps this should be renamed multi_stats (in the same manner that we have stats and extended_stats)? multi_stats would give us the ability to compute and output a handful of multi-field statistics; to start, it would provide covariance and Pearson correlation. What do you think? /cc @polyfractal @colings86

> offering an option to select pairwise or listwise deletion of missing values.

To keep things simple, the first implementation (PR shortly) uses listwise omission of documents containing missing values. We need to think a bit more about how to handle missing values in future enhancements. EM (expectation-maximization) would be nice, but I don't think it will scale unless there is a single-pass approach. Pairwise is easy in the implementation but (as you pointed out) the output is complicated. I would love thoughts/examples for an efficient output using pairwise deletion; I could get that in with a follow-up PR.

nknize · 2016-02-26T16:10:51Z

Also would love @brwe's thoughts

markharwood · 2016-02-26T16:37:26Z

This feels like it should be a pipeline agg? The examples given are stocks but presumably the variables that go into this are:

1) Set keys (in this case FOO_Corp_Stock_Value:["INDEXSP", "AAPL"])
2) Value source field (presumably avg/max/last value for field STOCK_PRICE)
3) Sequence grouping (presumably a time-based group, e.g. YYYYMMDD or YYYYMMDDHH)

Each of these variables is a config option. However, given the distributed nature of the data and our single-pass approach to aggs, we couldn't leave 1) as an open-ended set (e.g. compare ALL stock prices for correlations): we can't trim candidates locally on each shard to only the most-correlated stocks, because no single shard has all the data needed to answer that question. For this reason we would have to insist on a small number of set keys.

markharwood · 2016-02-26T17:06:57Z

For reference, I used Pearson's correlation on www.hivemindmap.com to time-correlate Twitter hashtags.
This can be used to auto-detect events which are shown on a calendar:

This calendar actually combines elements of anomaly detection, graph and correlation to provide this automated "event detection":

Anomaly detection identifies the hashtags that "spiked" in time
Graph analytics identifies strongly co-occurring hashtags in tweets and groups them using community detection algos (this gives the choice of colour in the diagram).
Pearson correlation identifies those tags in the above that shared a similar popularity timeline.

Together these spot an "event" and the related concepts.

polyfractal · 2016-02-26T17:51:34Z

If your data is "well distributed" and i.i.d., I think a single pass over the data would give a good estimate of correlation. If we assume that each shard has a reasonable sampling of your data, we can pretend each shard can be correlated in isolation and merge the results into a pseudo-approximate answer.

It's effectively taking the same set of tradeoffs that search makes. We can't search everything together, so just pretend the data is distributed the same across all shards and run the calculations in isolation. If your data is distributed wildly differently between shards, all bets are off (but they are with search ranking too).
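
A side note on the merge step: the quantities Pearson's r is built from (count, means, sums of squared deviations, and the co-moment) can be combined across partitions with the standard pairwise update formulas, so merging per-shard summaries does not have to be an approximation. Whether the aggregation does this is not stated in this thread. The Python sketch below only illustrates the idea; the `PairStats` class and the two-"shard" split are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class PairStats:
    """Running summary for one (x, y) field pair on one shard."""
    n: int = 0
    mean_x: float = 0.0
    mean_y: float = 0.0
    m2_x: float = 0.0   # sum of squared deviations of x
    m2_y: float = 0.0   # sum of squared deviations of y
    c_xy: float = 0.0   # sum of co-deviations of x and y

    def add(self, x, y):
        """Single-pass (Welford-style) update with one observation."""
        self.n += 1
        dx = x - self.mean_x          # deviation from the OLD mean of x
        dy = y - self.mean_y          # deviation from the OLD mean of y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def merge(self, other):
        """Combine two shard summaries into a new one."""
        n = self.n + other.n
        if n == 0:
            return PairStats()
        dx = other.mean_x - self.mean_x
        dy = other.mean_y - self.mean_y
        w = self.n * other.n / n
        return PairStats(
            n=n,
            mean_x=self.mean_x + dx * other.n / n,
            mean_y=self.mean_y + dy * other.n / n,
            m2_x=self.m2_x + other.m2_x + dx * dx * w,
            m2_y=self.m2_y + other.m2_y + dy * dy * w,
            c_xy=self.c_xy + other.c_xy + dx * dy * w,
        )

    def correlation(self):
        return self.c_xy / math.sqrt(self.m2_x * self.m2_y)

# Made-up data, split unevenly across two pretend shards.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

shard_a, shard_b, single = PairStats(), PairStats(), PairStats()
for i, (x, y) in enumerate(zip(xs, ys)):
    (shard_a if i < 2 else shard_b).add(x, y)
    single.add(x, y)

merged = shard_a.merge(shard_b)
```

Here `merged.correlation()` agrees with `single.correlation()` up to floating-point rounding, even though neither shard on its own sees a representative sample.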

That said, I was thinking about this yesterday and we'd probably need a pipeline version too. Time-series data isn't i.i.d. (current value is usually related to last value in some fashion), so you'd likely need other pipeline aggs to first remove inter-series dependence before testing for correlation. Didn't occur to me until I started playing around with TS correlations yesterday =/
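
To illustrate the time-series point, the Python sketch below compares the correlation of two made-up trending series with the correlation of their first differences. First differencing is just one common way to remove serial dependence and is an assumption here, not something proposed in this thread; the data and function names are invented for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def first_difference(series):
    """Replace each value with its change from the previous value."""
    return [b - a for a, b in zip(series, series[1:])]

# Two made-up series that share an upward trend but wiggle independently.
x = [0.0, 2.0, 2.0, 2.0, 4.0, 6.0, 6.0, 6.0]
y = [1.0, 1.0, 1.0, 3.0, 5.0, 5.0, 5.0, 7.0]

r_levels = pearson(x, y)
r_diffs = pearson(first_difference(x), first_difference(y))
```

On the raw levels the shared trend dominates and the correlation is high; on the differenced series it is close to zero. A pipeline version could apply a step like this to each bucketed series before correlating.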

nknize · 2016-02-26T18:35:20Z

> That said, I was thinking about this yesterday and we'd probably need a pipeline version too.

I agree, a pipeline version would be wonderful. This is a "progress not perfection" single-pass approach that can serve as a utility for initial multi-valued stats aggregations.

For review, comments, suggestions, the PR is posted at #16826. I also had the version mislabeled. The aggregation refactor in 5.0 is needed for the multi ValueSource.

clintongormley · 2016-06-23T12:58:40Z

Closed by #18300

nknize added >feature :Analytics/Aggregations Aggregations v2.3.0 labels Feb 26, 2016

nknize self-assigned this Feb 26, 2016

nknize added v5.0.0-alpha1 and removed v2.3.0 labels Feb 26, 2016

nknize mentioned this issue Feb 26, 2016

[WIP] MultiField_Stats Aggregation #16826

Closed

spalger mentioned this issue Feb 26, 2016

Should vistypes be able to implement custom query logic? elastic/kibana#6345

Closed

rmuir mentioned this issue Feb 26, 2016

Enforce node level limits if node is started in production env #16733

Merged

nknize mentioned this issue Mar 16, 2016

Regression Metric Aggregator #17154

Closed

clintongormley added v5.0.0-alpha2 and removed v5.0.0-alpha1 labels Apr 4, 2016

clintongormley added v5.0.0-alpha3 and removed v5.0.0-alpha2 labels Apr 26, 2016

nknize mentioned this issue May 11, 2016

Add MultiValuesSource support to Aggregation Framework #18285

Closed

clintongormley added v5.0.0-alpha4 and removed v5.0.0-alpha3 labels May 24, 2016

clintongormley added v5.0.0-alpha5 and removed v5.0.0-alpha4 labels Jun 22, 2016

clintongormley removed the v5.0.0-alpha5 label Jun 23, 2016

polyfractal closed this as completed Mar 22, 2018
