Correlation Metric Aggregator #16817
Comments
Measures of covariation would be useful. However, a more general solution would be to estimate a variance/covariance matrix for two or more fields.
Also consider, when the number of fields is >= 3, offering an option to select pairwise or listwise deletion of missing values. Pairwise deletion would complicate the output, since the query should return a separate count for each element of the matrix (if the data contain missing values).
@wamuir awesome getting this feedback!
This got me thinking. For a more general purpose agg, perhaps this should be renamed
To keep things simple, the first implementation (PR shortly) uses listwise omission of documents containing missing values. We need to think a bit more about how to handle missing values in future enhancements. EM would be nice, but I don't think it will scale unless there is a single-pass approach. Pairwise is easy to implement but (as you pointed out) the output is complicated. I would love thoughts/examples for an efficient output format using pairwise deletion; I could get that into a follow-up PR.
Also would love @brwe's thoughts.
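To make the listwise/pairwise distinction concrete, here is a minimal sketch in plain Python. It is not taken from the PR: the function names, the `deletion` parameter and the `(r, doc_count)` output shape are all hypothetical, chosen only to show why pairwise deletion needs a count per matrix element.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson r for two equal-length lists; None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant field has no defined correlation
    return sxy / sqrt(sxx * syy)


def correlate(docs, fields, deletion="listwise"):
    """Return {(field_a, field_b): (r, doc_count)} for every field pair.

    `deletion` is a hypothetical option name used only in this sketch.
    """
    if deletion not in ("listwise", "pairwise"):
        raise ValueError("deletion must be 'listwise' or 'pairwise'")
    result = {}
    for a, b in combinations(fields, 2):
        # listwise: a doc must carry every field; pairwise: only this pair
        required = fields if deletion == "listwise" else (a, b)
        usable = [d for d in docs
                  if all(d.get(f) is not None for f in required)]
        r = pearson([d[a] for d in usable], [d[b] for d in usable])
        result[(a, b)] = (r, len(usable))
    return result
```

With listwise deletion every pair reports the same document count, so a single count would do; with pairwise deletion the counts can differ per pair, which is what complicates the output.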
This feels like it should be a pipeline agg? The examples given are stocks but presumably the variables that go into this are:
Each of these variables is a config option, but given the distributed nature of the data and our single-pass approach to aggs, we couldn't leave 1) as an open-ended set (e.g. compare ALL stock prices for correlations): we can't trim candidates locally on each shard to only the most-correlated stocks, because no single shard has all the data needed to answer this question. For this reason we would have to insist on a small number of set keys.
For reference, I used Pearson's correlation on www.hivemindmap.com to time-correlate Twitter hashtags. This calendar actually combines elements of anomaly detection, graph and correlation to provide this automated "event detection":
Together these spot an "event" and the related concepts.
If your data is "well distributed" and i.i.d., I think a single pass over the data would give a good estimate of correlation (I think?). If we assume that each shard holds a reasonable sample of your data, we can pretend each shard can be correlated in isolation and merge the results into an approximate answer. It's effectively the same set of tradeoffs that search makes: we can't search everything together, so we pretend the data is distributed the same way across all shards and run the calculations in isolation. If your data is distributed wildly differently between shards, all bets are off (but they are with search ranking too). That said, I was thinking about this yesterday and we'd probably need a pipeline version too. Time-series data isn't i.i.d. (the current value is usually related to the last value in some fashion), so you'd likely need other pipeline aggs to first remove inter-series dependence before testing for correlation. Didn't occur to me until I started playing around with TS correlations yesterday =/
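One way to sketch the "compute per shard, then merge" idea is to have each shard keep running counts, means and co-moments for a field pair and combine them on the coordinating node. This is an illustrative assumption, not the approach in the PR; `PairStats` and `shard_stats` are hypothetical names.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class PairStats:
    """Running summary for one (x, y) field pair on one shard."""
    n: int = 0
    mean_x: float = 0.0
    mean_y: float = 0.0
    m2_x: float = 0.0   # sum of squared deviations of x
    m2_y: float = 0.0   # sum of squared deviations of y
    c_xy: float = 0.0   # sum of products of deviations

    def add(self, x, y):
        """Fold in one document (Welford-style single-pass update)."""
        self.n += 1
        dx = x - self.mean_x          # deviation from the old mean
        dy = y - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def merge(self, other):
        """Combine two shard summaries into a new one."""
        n = self.n + other.n
        if n == 0:
            return PairStats()
        dx = other.mean_x - self.mean_x
        dy = other.mean_y - self.mean_y
        w = self.n * other.n / n
        return PairStats(
            n=n,
            mean_x=self.mean_x + dx * other.n / n,
            mean_y=self.mean_y + dy * other.n / n,
            m2_x=self.m2_x + other.m2_x + dx * dx * w,
            m2_y=self.m2_y + other.m2_y + dy * dy * w,
            c_xy=self.c_xy + other.c_xy + dx * dy * w,
        )

    def correlation(self):
        denom = sqrt(self.m2_x * self.m2_y)
        return self.c_xy / denom if denom > 0 else None


def shard_stats(pairs):
    """Build the summary for one shard from an iterable of (x, y)."""
    stats = PairStats()
    for x, y in pairs:
        stats.add(x, y)
    return stats
```

For plain Pearson correlation over a fixed pair of fields, merging these summaries gives the same value as one pass over all the documents, up to floating-point rounding. The i.i.d. concern for time series is about how to interpret the coefficient, which this sketch does not address.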
I agree, a pipeline version would be wonderful. This is a "progress not perfection" single-pass approach that can serve as a utility for initial multi-valued stats aggregations. For review, comments, and suggestions, the PR is posted at #16826. I also had the version mislabeled. The aggregation refactor in 5.0 is needed for the multi ValueSource.
Closed by #18300
This first Multi-Field metric aggregation will compute the Pearson product-moment correlation coefficient for a given list of numeric fields.
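For reference, the standard definition of the Pearson coefficient for two fields $x$ and $y$ observed over $n$ documents, with $\bar{x}$ and $\bar{y}$ denoting the field means (the symbols are notation chosen here, not taken from the PR):

$$
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

The value lies in $[-1, 1]$ and is undefined when either field is constant.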
Example usage (correlating stock value against S&P 500):
Example result:
The result provides just the upper triangle (excluding the diagonal, where each field is correlated with itself and the value would just be 1.0).
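As a rough illustration of what "upper triangle only" means, the sketch below trims a full symmetric correlation matrix down to the entries above the diagonal. The field names and the nested-dict layout are hypothetical and are not the aggregation's actual response format; the numbers are the ones quoted in this example.

```python
def upper_triangle(fields, matrix):
    """Keep only the entries above the diagonal of a symmetric matrix."""
    result = {}
    for i, row_field in enumerate(fields[:-1]):
        # columns j > i: skips the diagonal and the mirrored lower half
        result[row_field] = {
            fields[j]: matrix[i][j] for j in range(i + 1, len(fields))
        }
    return result


# Hypothetical field names for the three series in this example.
fields = ["apple", "foo_corp", "sp500"]
matrix = [
    [1.0,    0.0154, 0.4292],
    [0.0154, 1.0,    0.0287],
    [0.4292, 0.0287, 1.0],
]
triangle = upper_triangle(fields, matrix)
```

For n fields this keeps n(n-1)/2 values instead of n², which here is three values instead of nine.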
This hypothetical example shows a small positive correlation (0.0154) between Apple and FOO Corp. stock value, a small positive correlation (0.0287) between FOO Corp. stock and the S&P 500, and a strong (fake) positive correlation (0.4292) between Apple stock and the S&P 500.
Although this is a made-up example with contrived data, correlation is a useful statistical tool for finding relationships in all kinds of data (medical, finance, log trends) and provides a foundation for much more advanced statistics (e.g., spatial statistics and econometrics).