Correlation Metric Aggregator #16817

Closed
nknize opened this issue Feb 26, 2016 · 9 comments

@nknize
Contributor

nknize commented Feb 26, 2016

This first Multi-Field metric aggregation will compute the Pearson product-moment correlation coefficient for a given list of numeric fields.

Example usage (correlating stock value against S&P 500):

"pearson_correlation" : {
    "correlation" : {
        "field" : [ "FOO_Corp_Stock_Value", "AAPL", "INDEXSP" ]
    }
}

Example result:

The result contains only the upper triangle of the correlation matrix, excluding the diagonal (each field's correlation with itself, which is always 1.0).

"pearson_correlation" : {
    "correlation" : {
        "FOO_Corp_Stock_Value" : {
            "AAPL" : 0.015411381889249296,
            "INDEXSP" : 0.028740529828528586
        },
        "AAPL" : {
            "INDEXSP" : 0.429228885527728843
        }
    }
}

This hypothetical example shows a small positive correlation (0.0154) between Apple and FOO Corp. stock value, a small positive correlation (0.0287) between FOO Corp. stock and the S&P 500, and a strong (fake) positive correlation (0.4292) between Apple stock and the S&P 500.

Although this is a made-up example with contrived data, the correlation coefficient is a useful statistical tool for finding relationships in all kinds of data (medical, finance, log trends) and provides a foundation for much more advanced statistics (e.g., spatial statistics and econometrics).
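To make the proposed output shape concrete, here is a minimal Python sketch that computes the Pearson coefficient for every pair of fields and returns only the upper triangle. It is an illustration, not the aggregation's implementation: the helper names `pearson` and `upper_triangle` and the sample values are assumptions made up for this example.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_moment = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return co_moment / math.sqrt(var_x * var_y)


def upper_triangle(columns):
    """Map each field to its correlations with the fields listed after it.

    The diagonal (a field against itself) is skipped, as in the example result.
    """
    result = {}
    for a, b in combinations(list(columns), 2):
        result.setdefault(a, {})[b] = pearson(columns[a], columns[b])
    return result


# Made-up sample values, for illustration only.
sample = {
    "FOO_Corp_Stock_Value": [10.0, 11.0, 9.0, 12.0],
    "AAPL": [100.0, 102.0, 101.0, 105.0],
    "INDEXSP": [2000.0, 2010.0, 2005.0, 2030.0],
}
triangle = upper_triangle(sample)
```

With three fields, `triangle` has two top-level keys, mirroring the nesting of the example result: the first field maps to two coefficients and the second to one.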

@wamuir

wamuir commented Feb 26, 2016

Measures of covariation would be useful. However, a more general solution would be to estimate a variance/covariance matrix for two or more fields.

@wamuir

wamuir commented Feb 26, 2016

Also consider, when the number of fields is >= 3, offering an option to select pairwise or listwise deletion of missing values. Pairwise deletion would complicate the output, because the query should return a separate count for each element of the matrix (if the data contain missing values).
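The difference between the two deletion strategies, and why pairwise deletion needs a count per matrix element, can be sketched in Python as follows. The function names, the `count`/`correlation` output keys, and the sample rows are all hypothetical, chosen only to illustrate the point.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_moment = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return co_moment / math.sqrt(var_x * var_y)


def correlate(rows, a, b):
    """Correlate fields a and b over the given rows and report the row count."""
    return {
        "count": len(rows),
        "correlation": pearson([r[a] for r in rows], [r[b] for r in rows]),
    }


def listwise(rows, fields):
    """Drop every row that is missing ANY field, then correlate each pair."""
    complete = [r for r in rows if all(r.get(f) is not None for f in fields)]
    return {(a, b): correlate(complete, a, b) for a, b in combinations(fields, 2)}


def pairwise(rows, fields):
    """For each pair, drop only the rows missing one of those two fields."""
    out = {}
    for a, b in combinations(fields, 2):
        usable = [r for r in rows if r.get(a) is not None and r.get(b) is not None]
        out[(a, b)] = correlate(usable, a, b)
    return out


# Made-up rows; None marks a missing value.
rows = [
    {"x": 1.0, "y": 2.0, "z": 1.0},
    {"x": 2.0, "y": 4.0, "z": None},
    {"x": 3.0, "y": 5.0, "z": 2.0},
    {"x": 4.0, "y": 9.0, "z": 4.0},
    {"x": 5.0, "y": None, "z": 3.0},
]
fields = ["x", "y", "z"]
by_list = listwise(rows, fields)
by_pair = pairwise(rows, fields)
```

Under listwise deletion every pair is computed from the same three complete rows, so a single count suffices. Under pairwise deletion the `("x", "y")` and `("x", "z")` pairs each use four rows while `("y", "z")` uses three, which is why each matrix element needs its own count.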

@nknize
Contributor Author

nknize commented Feb 26, 2016

@wamuir awesome getting this feedback!

a more general solution would be to estimate a variance/covariance matrix

This got me thinking. For a more general-purpose agg, perhaps this should be renamed multi_stats (in the same manner that we have stats and extended_stats)? multi_stats would give us the ability to compute and output a handful of multi-field statistics; to start, it would provide covariance and Pearson correlation. What do you think? /cc @polyfractal @colings86

offering an option to select pairwise or listwise deletion of missing values.

To keep things simple, the first implementation (PR shortly) uses listwise omission of documents containing missing values. We need to think a bit more about how to handle missing values in future enhancements. EM (expectation-maximization) would be nice, but I don't think it will scale unless there is a single-pass approach. Pairwise is easy to implement but (as you pointed out) complicates the output. I would love thoughts/examples for an efficient output format using pairwise deletion; I could get that in a follow-up PR.

@nknize
Contributor Author

nknize commented Feb 26, 2016

Also would love @brwe thoughts

@markharwood
Contributor

This feels like it should be a pipeline agg? The examples given are stocks but presumably the variables that go into this are:

  1. Set keys (in this case FOO_Corp_Stock_Value:["INDEXSP", "AAPL"] )
  2. Value source field (Presumably avg/max/last value for field STOCK_PRICE)
  3. Sequence grouping (presumably a time-based group, e.g. YYYYMMDD or YYYYMMDDHH)

Each of these variables is a config option. However, given the distributed nature of the data and our single-pass approach to aggs, we couldn't leave 1) as an open-ended set (e.g. compare ALL stock prices for correlations): we can't trim candidates locally on each shard to only the most-correlated stocks, because no single shard has all the data needed to answer that question. For this reason we would have to insist on a small number of set keys.

@markharwood
Contributor

For reference, I used Pearson's correlation on www.hivemindmap.com to time-correlate Twitter hashtags.
This can be used to auto-detect events which are shown on a calendar:

[Image: hivemindmap]

This calendar actually combines elements of anomaly detection, graph and correlation to provide this automated "event detection":

  1. Anomaly detection identifies the hashtags that "spiked" in time
  2. Graph analytics identifies strongly co-occurring hashtags in tweets and groups them using community detection algorithms (this gives the choice of colour in the diagram).
  3. Pearson correlation identifies those tags in the above that shared a similar popularity timeline.

Together these spot an "event" and the related concepts.

@polyfractal
Contributor

If your data is "well distributed" and i.i.d., I think a single pass over the data would give a good estimate of correlation. If we assume that each shard holds a reasonable sample of your data, we can treat each shard as if it can be correlated in isolation and merge the results into an approximate answer.

It's effectively taking the same set of tradeoffs that search makes. We can't search everything together, so just pretend the data is distributed the same across all shards and run the calculations in isolation. If your data is distributed wildly differently between shards, all bets are off (but they are with search ranking too).
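As a point of comparison for the shard-merging idea, here is a minimal Python sketch of a different technique: instead of merging per-shard coefficients, each shard returns running counts, means, and co-moments, and those partial results are combined with the standard pairwise update formulas for variance and covariance. This is an assumption offered for illustration, not a description of what the PR does, and the `PairStats` class and `collect` helper are hypothetical names. It only covers a fixed pair of fields, so it does not address the open-ended set problem raised above.

```python
import math
from dataclasses import dataclass


@dataclass
class PairStats:
    """Running statistics for one pair of fields (hypothetical helper)."""
    n: int = 0
    mean_x: float = 0.0
    mean_y: float = 0.0
    m2_x: float = 0.0  # sum of squared deviations of x from its mean
    m2_y: float = 0.0  # sum of squared deviations of y from its mean
    c_xy: float = 0.0  # sum of products of the x and y deviations

    def add(self, x, y):
        """Fold one (x, y) observation into the running statistics."""
        self.n += 1
        dx = x - self.mean_x  # deviation from the mean before this update
        dy = y - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def merge(self, other):
        """Combine two partial results, e.g. from two shards."""
        if self.n == 0:
            return other
        if other.n == 0:
            return self
        n = self.n + other.n
        dx = other.mean_x - self.mean_x
        dy = other.mean_y - self.mean_y
        weight = self.n * other.n / n
        return PairStats(
            n=n,
            mean_x=self.mean_x + dx * other.n / n,
            mean_y=self.mean_y + dy * other.n / n,
            m2_x=self.m2_x + other.m2_x + dx * dx * weight,
            m2_y=self.m2_y + other.m2_y + dy * dy * weight,
            c_xy=self.c_xy + other.c_xy + dx * dy * weight,
        )

    def correlation(self):
        """Pearson coefficient from the accumulated moments."""
        return self.c_xy / math.sqrt(self.m2_x * self.m2_y)


def collect(pairs):
    """Single pass over one shard's (x, y) pairs."""
    stats = PairStats()
    for x, y in pairs:
        stats.add(x, y)
    return stats


# Made-up data split across two "shards".
shard_a = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0)]
shard_b = [(4.0, 3.0), (5.0, 7.0), (6.0, 5.0)]
merged = collect(shard_a).merge(collect(shard_b))
single_pass = collect(shard_a + shard_b)
```

In this sketch `merged.correlation()` agrees with `single_pass.correlation()` up to floating-point rounding, regardless of how the rows are split between shards.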

That said, I was thinking about this yesterday and we'd probably need a pipeline version too. Time-series data isn't i.i.d. (current value is usually related to last value in some fashion), so you'd likely need other pipeline aggs to first remove inter-series dependence before testing for correlation. Didn't occur to me until I started playing around with TS correlations yesterday =/
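The time-series concern can be illustrated with a small Python sketch. First differencing is used here as one simple way to remove a shared trend before correlating; that choice is an assumption for illustration, not something proposed in this thread, and the series are made up.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_moment = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return co_moment / math.sqrt(var_x * var_y)


def first_differences(series):
    """Replace each value with its change from the previous value."""
    return [b - a for a, b in zip(series, series[1:])]


# Made-up series: both share the same upward trend, plus unrelated wiggles.
trend = list(range(20))
wiggle_x = [0.5 if t % 2 == 0 else -0.5 for t in trend]
wiggle_y = [0.5 if t % 4 < 2 else -0.5 for t in trend]
series_x = [t + w for t, w in zip(trend, wiggle_x)]
series_y = [t + w for t, w in zip(trend, wiggle_y)]

# Correlation of the raw levels is dominated by the shared trend.
level_corr = pearson(series_x, series_y)
# Correlation of the step-to-step changes reflects only the wiggles.
diff_corr = pearson(first_differences(series_x), first_differences(series_y))
```

Here `level_corr` comes out close to 1 because both series climb together, while `diff_corr` is close to 0 because the underlying wiggles are unrelated.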

@nknize
Contributor Author

nknize commented Feb 26, 2016

That said, I was thinking about this yesterday and we'd probably need a pipeline version too.

I agree, a pipeline version would be wonderful. This is a "progress not perfection" single-pass approach that can serve as a utility for initial multi-valued stats aggregations.

For review, comments, suggestions, the PR is posted at #16826. I also had the version mislabeled. The aggregation refactor in 5.0 is needed for the multi ValueSource.

@clintongormley

Closed by #18300
