[WIP] MultiField_Stats Aggregation #16826
Conversation
Define correlation matrix for x1...x4: Generate some data (10M obs): numpy.matrix.transpose(data) -> index as doubles -> es query: "pearson_correlation": Very good

 * Interface for Correlation Metric Aggregation
 */
public interface Correlation extends Aggregation {
    long count();
Could you add Javadocs to these methods?
@nknize I left a couple of comments but I like it so far
Consider implementing pairwise deletion and making this the default. For one potential gotcha w/ listwise deletion, using the example provided by @nknize in #16817, consider replacing:

"pearson_correlation" : {
  "correlation" : {
    "field" : [ "FOO_Corp_Stock_Value", "AAPL", "INDEXSP" ]
  }
}

Also consider if the ticker symbol changed for "FOO_Corp_Stock" during this period, and the tickers are indexed separately. This throws an exception, but the best we could get back with listwise deletion is null output. More importantly, consider if data aren't missing at random and missingness on A moderates the relationship between B and C. Also related, @polyfractal mentions tests for correlation in #16817:

If this agg cannot accept sub-aggregations (e.g., extended stats), then consider returning the appropriate stats with the agg. I'd argue that the appropriate stats are extended stats (incl. counts, means, and variance) for variables, by pair. Combined with an option for listwise deletion, one should be able to produce a variance-covariance matrix and perform many statistical tests.

@polyfractal also mentions serial correlation. Down the road, there probably should be a way to examine residuals if this agg is extended to test hypotheses about a population beyond the bivariate case. But the output now seems sound: unless I am mistaken, the sample correlation coefficient alone (as output now) should be an unbiased estimate of the population parameter to the extent that the items are normally distributed and were measured without error; it is the standard error of this estimate (and therefore test statistics) that is impacted. Now, this isn't to say that the coefficient won't change after partialling out one or more lags, but that might be best viewed as a specification issue on the part of the user. For example, in the case of OLS (which would be a nice implementation, down the road), the test statistic makes distributional assumptions about the residuals such that all elements along the diagonal of the residual var/cov matrix equal a constant (homogeneity) and elements off the diagonal are zero (independence). In the case that this i.i.d. assumption is violated, beta estimates are still unbiased and consistent, but variance estimates of OLS estimators are both biased and inconsistent: standard errors are not trustworthy.
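The listwise/pairwise distinction above can be sketched in a few lines of Python. This is purely illustrative (the field names and values are made up, and this is not the aggregation's actual code): listwise deletion drops a document if *any* requested field is missing, while pairwise deletion drops it only for the pairs that are actually affected.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical docs as field -> value maps; None marks a missing field.
docs = [
    {"A": 1.0, "B": 2.0, "C": None},
    {"A": 2.0, "B": 4.1, "C": 1.0},
    {"A": 3.0, "B": 5.9, "C": 2.0},
    {"A": 4.0, "B": 8.2, "C": None},
]

def corr_listwise(docs, f1, f2, all_fields):
    # Drop a doc if ANY of the requested fields is missing.
    kept = [d for d in docs if all(d[f] is not None for f in all_fields)]
    return pearson([d[f1] for d in kept], [d[f2] for d in kept])

def corr_pairwise(docs, f1, f2):
    # Drop a doc only if one of THIS pair of fields is missing.
    kept = [d for d in docs if d[f1] is not None and d[f2] is not None]
    return pearson([d[f1] for d in kept], [d[f2] for d in kept])

# Listwise keeps only the 2 complete docs; pairwise uses all 4 for (A, B),
# so the two estimates of corr(A, B) differ.
print(corr_listwise(docs, "A", "B", ["A", "B", "C"]))
print(corr_pairwise(docs, "A", "B"))
```

With only two complete documents, the listwise estimate is degenerate (any two points correlate perfectly), which is exactly the kind of distortion the comment above is warning about.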
This is some wonderful feedback! I've had a chance to revisit this a bit and should have an updated commit in a day or so.
Will be addressing this shortly. Though I think there's plenty of time to get this as an enhancement in another PR?
Here we're assuming the stock ticker is the fieldname. So if the ticker symbol changes the application will need to map that to the field name.
I think this is a biggie. I'm looking at addressing this by way of count based weights. But I'm just getting started on this part. Again, might hold off to another PR? Based on the feedback I'm thinking of making the following changes:
I've completed counts, mean, covariance, and correlation. Before I split into 2 separate aggs and add kurtosis and skewness I wanted to solicit feedback for the usefulness of the additional measurements and for having 2 separate aggregations. Update: Might be nice to just merge this with
Updated the PR to do the following:
Still todo in this PR:
Interested in feedback @wamuir, @colings86, @polyfractal
@colings86 If you get a chance can you have a look at commit 61db3e8? Make sure I didn't miss anything while trying to parse the Scripts as a Map.
So the PR has gotten pretty big, but everything should be ready. This does not yet handle pairwise missing values. As discussed above that can be handled in a separate PR. Here is an example of the output:

"aggregations" : {
"multifieldstats" : {
"counts" : {
"randVal1" : 100000,
"value" : 100000,
"randVal2" : 100000,
"values" : 100000
},
"means" : {
"randVal1" : 0.5007814114859065,
"value" : 50000.499999999134,
"randVal2" : 0.5004831083181183,
"values" : 50001.999999999134
},
"variances" : {
"randVal1" : 0.08361899306430848,
"value" : 8.333416666666644E8,
"randVal2" : 0.0832774520285396,
"values" : 8.333416666666644E8
},
"skewness" : {
"randVal1" : -0.0030206367840570187,
"value" : 7.154127608560924E-17,
"randVal2" : -7.692826885603065E-4,
"values" : 7.154127608560924E-17
},
"kurtosis" : {
"randVal1" : 1.8005677156780868,
"value" : 1.799999999760016,
"randVal2" : 1.8008100300780934,
"values" : 1.799999999760016
},
"covariance" : {
"randVal1" : {
"value" : 58.078219752650746,
"randVal2" : -5.485860518247862E-4,
"values" : 58.078219752650746
},
"value" : {
"randVal2" : -7.158019911939834,
"values" : 8.333416666666662E8
},
"randVal2" : {
"values" : -7.158019911939815
}
},
"correlation" : {
"randVal1" : {
"value" : 0.006957436968003296,
"randVal2" : -0.006573983049459838,
"values" : 0.006957436968003296
},
"value" : {
"randVal2" : -8.59246237772021E-4,
"values" : 1.000000000000002
},
"randVal2" : {
"values" : -8.592462377720186E-4
}
}
}
}

It might be nice (in another PR) to specify which stats the user would like returned. It may not change what is computed, but simplifies the output if the application only cares about certain fields.
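The covariance and correlation objects in the response above are upper-triangular (each pair appears once). A small client-side sketch of how one might expand the sample correlation object into a full symmetric matrix — the field names and values are copied from the example response, and the expansion logic is an assumption about the response shape, not part of the agg:

```python
# Upper-triangular "correlation" object from the sample response above.
corr = {
    "randVal1": {"value": 0.006957436968003296,
                 "randVal2": -0.006573983049459838,
                 "values": 0.006957436968003296},
    "value": {"randVal2": -8.59246237772021e-4,
              "values": 1.000000000000002},
    "randVal2": {"values": -8.592462377720186e-4},
}

fields = ["randVal1", "value", "randVal2", "values"]

def full_matrix(tri, fields):
    """Mirror the upper triangle and put 1.0 on the diagonal."""
    m = [[1.0] * len(fields) for _ in fields]
    for i, fi in enumerate(fields):
        for j, fj in enumerate(fields):
            if fj in tri.get(fi, {}):
                m[i][j] = m[j][i] = tri[fi][fj]
    return m

matrix = full_matrix(corr, fields)
print(matrix[0][1], matrix[1][0])  # mirrored off-diagonal entries
```

This is the object-to-array transform that would feed, e.g., a heatmap visualization.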
@Override
public double getMean(String field) {
    if (multiFieldStatsResults == null || multiFieldStatsResults.means.containsKey(field) == false) {
        return Double.NaN;
General thought for these various Double.NaN's: should we be returning null instead (and subsequently returning a Double instead of double)? E.g. because some fields may actually have NaN's as real values, and lacking a field isn't quite the same as not-a-number?
Not sure, probably needs several more opinions :)
@rashidkpc I spoke with @spalger a little bit about this at the all hands. The tabular heatmap visualization is typical for a correlation matrix. Strong positive correlation (closer to +1.0) is usually depicted as a red cell, with weaker correlation (closer to 0.0) colored white and strong negative correlation (closer to -1.0) colored blue. Here's a good example:
Yeah, I'm fine with that structure. We'll have to transform it from object to array, but as long as the fields aren't in a meaningful order this is fine.
@nknize Can you add a simplified example of the current request and response format?
@rashidkpc Sure thing. Here's a simple 2 variable example using the same data as above:
Request:

"aggs": {
"multifieldstats": {
"multifield_stats": {
"field": ["poverty", "income"]
}
}
}

Response:

"aggregations": {
"multifieldstats": {
"income": {
"count": 50,
"mean": 51985.1,
"variance": 7.383377037755103E7,
"skewness": 0.5595114003506483,
"kurtosis": 2.5692365287787124,
"covariance_correlation": {
"poverty": {
"covariance": -21093.65836734694,
"correlation": -0.8352655256272504
}
}
},
"poverty": {
"count": 50,
"mean": 12.732000000000001,
"variance": 8.637730612244896,
"skewness": 0.4516049811903419,
"kurtosis": 2.8615929677997767,
"covariance_correlation": {
"income": {
"covariance": -21093.65836734694,
"correlation": -0.8352655256272504
}
}
}
}
}
@clintongormley how about making this feature experimental for 5.0.0?
delta.put(fieldName, fieldValue * docCount - fieldSum.get(fieldName));

// update running mean, variance, skewness, kurtosis
if (means.containsKey(fieldName) == true) {
Could we add a reference here? Some paper, book chapter, wikipedia article or something.
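For context on the snippet being reviewed here: incremental updates of mean and higher moments are usually done with a Welford-style online recurrence (Welford 1962; see also Knuth, TAOCP Vol. 2). A minimal single-field Python sketch of the idea — illustrative only, not the PR's Java code:

```python
# Online (Welford-style) update of count, mean, and M2, where M2 is the
# running sum of squared deviations from the current mean.
# Population variance is then M2 / n.
def update(state, x):
    n, mean, m2 = state
    n += 1
    delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)  # uses the updated mean
    return (n, mean, m2)

state = (0, 0.0, 0.0)
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    state = update(state, x)

n, mean, m2 = state
print(n, mean, m2 / n)  # count, mean, population variance
```

The same pattern extends to third and fourth central moments (for skewness and kurtosis) and to co-moments (for covariance), which is what the running-stats code above appears to maintain per field.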
Adds the first Multi-Field metric agg that computes the Pearson product-moment correlation coefficient for a given list of numeric fields.
This commit refactors the correlation specific aggregator to a more generic MultiFieldStats aggregator. It also adds the following statistics:
* Field Count (to capture missing values - not yet implemented)
* Mean
* Variance
* Skewness
* Kurtosis
* Variance-Covariance matrix
* Correlation
… script has a unique name
This commit fixes the following:
* Splits results from running statistics by way of mutable RunningStats and immutable MultiFieldStatsResults classes
* Updates output to print stats per field
* Updates based on PR feedback
Superseded by #18300
re: issue #16817 this is a WIP PR (for initial feedback and review) to add the first Multi-Field metric agg that computes the Pearson product-moment correlation coefficient for a given list of numeric fields.

It's a pretty simple single pass calculation that I simulated in Matlab, validating against Matlab's corr method. This PR includes a validation of the actual aggregation results (computed across shards) against the expected value computed using a large single set (see CorrelationIT). Initial rough calculations over a set of 1M distributed observations give at least 1e-7 accuracy.

UPDATE: This aggregation is now called multifield_stats. It takes a list of multiple fields and computes the following statistics:
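The cross-shard computation mentioned above (each shard produces a partial, and partials are reduced into one result) can be sketched with the standard parallel merge of count/mean/co-moment partials (Chan et al.). This is an illustration of the technique, not the PR's Java implementation; function names are made up:

```python
import math

def shard_partial(xs, ys):
    """Per-shard partial for one field pair: count, means, co-moments."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return (n, mx, my, sxx, syy, sxy)

def merge(a, b):
    """Merge two partials without revisiting the raw observations."""
    n1, mx1, my1, sxx1, syy1, sxy1 = a
    n2, mx2, my2, sxx2, syy2, sxy2 = b
    n = n1 + n2
    dx, dy = mx2 - mx1, my2 - my1
    mx = mx1 + dx * n2 / n
    my = my1 + dy * n2 / n
    sxx = sxx1 + sxx2 + dx * dx * n1 * n2 / n
    syy = syy1 + syy2 + dy * dy * n1 * n2 / n
    sxy = sxy1 + sxy2 + dx * dy * n1 * n2 / n
    return (n, mx, my, sxx, syy, sxy)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.9, 12.1]

# Split the data across two "shards", merge, and compare the correlation
# against the single-set computation (the kind of check CorrelationIT does).
merged = merge(shard_partial(xs[:3], ys[:3]), shard_partial(xs[3:], ys[3:]))
whole = shard_partial(xs, ys)
r_merged = merged[5] / math.sqrt(merged[3] * merged[4])
r_whole = whole[5] / math.sqrt(whole[3] * whole[4])
print(r_merged, r_whole)
```

The merged correlation matches the single-set value to floating-point precision, which is consistent with the ~1e-7 accuracy reported above for 1M distributed observations.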