[WIP] MultiField_Stats Aggregation #16826

Closed
wants to merge 11 commits

Conversation

@nknize (Contributor) commented Feb 26, 2016

re: issue #16817 this is a WIP PR (for initial feedback and review) to add the first Multi-Field metric agg that computes the Pearson product-moment correlation coefficient for a given list of numeric fields.

It's a pretty simple single-pass calculation that I simulated in Matlab and validated against Matlab's corr method. This PR includes a validation of the actual aggregation results (computed across shards) against the expected value computed over a single large set (see CorrelationIT). Initial rough calculations over a set of 1M distributed observations give at least 1e-7 accuracy.

UPDATE: This aggregation is now called multifield_stats. It takes a list of multiple fields and computes the following statistics:

  1. counts
  2. means
  3. variances
  4. skewness
  5. kurtosis
  6. covariance
  7. correlation
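For context on how these statistics can be computed in a single pass and merged across shards, here is a minimal Python sketch (illustrative only, not the PR's Java implementation; the RunningCoStats class and the random test data are made up, and only count/mean/covariance/correlation are shown):

import numpy

class RunningCoStats:
    """One-pass (Welford-style) accumulation of count, means, and co-moments."""

    def __init__(self, num_fields):
        self.count = 0
        self.means = numpy.zeros(num_fields)
        self.comoments = numpy.zeros((num_fields, num_fields))

    def add(self, row):
        # incorporate a single observation (one value per field)
        row = numpy.asarray(row, dtype=float)
        self.count += 1
        delta_old = row - self.means
        self.means += delta_old / self.count
        delta_new = row - self.means
        self.comoments += numpy.outer(delta_old, delta_new)

    def merge(self, other):
        # combine two partial results (e.g. per-shard states) without revisiting the data
        total = self.count + other.count
        delta = other.means - self.means
        self.comoments += other.comoments + numpy.outer(delta, delta) * self.count * other.count / total
        self.means += delta * other.count / total
        self.count = total

    def correlation(self):
        # Pearson correlation matrix; the sample-size normalization cancels out
        scale = numpy.sqrt(numpy.diag(self.comoments))
        return self.comoments / numpy.outer(scale, scale)

# quick check against numpy, splitting the data into two "shards"
data = numpy.random.randn(10000, 3)
left, right = RunningCoStats(3), RunningCoStats(3)
for row in data[:6000]:
    left.add(row)
for row in data[6000:]:
    right.add(row)
left.merge(right)
print(abs(left.correlation() - numpy.corrcoef(data, rowvar=False)).max())

The merge step is what lets each shard reduce its documents independently while still reproducing the single-set result up to floating-point error.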

@wamuir commented Feb 28, 2016

Define a correlation matrix for x1...x4:

import numpy

corr = numpy.array(
    [[1.00, .397, .214, .569],
     [.397, 1.00, .448, .439],
     [.214, .448, 1.00, .404],
     [.569, .439, .404, 1.00]])

Generate some data (10M obs):

numpy.random.seed(seed=1)
mtx = numpy.random.randn(4, 10000000)
chol = numpy.linalg.cholesky(corr)
data = numpy.dot(chol, mtx)

numpy.set_printoptions(precision=17)
print(numpy.corrcoef(data))
[[ 1. 0.39650687829250175 0.21417973223669709 0.56896721641827785]
[ 0.39650687829250175 1. 0.44797611026090212 0.43881323671321049]
[ 0.21417973223669709 0.44797611026090212 1. 0.40411226138655432]
[ 0.56896721641827785 0.43881323671321049 0.40411226138655432 1. ]]

Then transpose the data (numpy.matrix.transpose(data)), index the values as doubles, and run the "pearson_correlation" ES query, which returns:
[[ 1. ]
[ 0.3965068782924764 1. ]
[ 0.21417973223672318 0.44797611026089856 1. ]
[ 0.5689672164183015 0.43881323671322076 0.40411226138653933 1. ]]

Very good agreement.

/**
 * Interface for Correlation Metric Aggregation
 */
public interface Correlation extends Aggregation {
    long count();

Review comment from a contributor:

Could you add Javadocs to these methods?

@colings86 (Contributor) commented:

@nknize I left a couple of comments but I like it so far

@wamuir commented Feb 29, 2016

Consider implementing pairwise deletion and making this the default. For one potential gotcha w/ listwise deletion, using the example provided by @nknize in #16817, consider replacing FOO_Corp_Stock_Value with FB or TWTR:

"pearson_correlation" : {
    "correlation" : {
        "field" : [ "FOO_Corp_Stock_Value", "AAPL", "INDEXSP" ],
    }
}

Also consider if the ticker symbol changed for "FOO_Corp_Stock" during this period, and the tickers are indexed separately. This throws an exception, but the best we could get back with listwise deletion is null output. More importantly, consider if data aren't missing at random and missingness on A moderates the relationship between B and C.
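
To make the listwise-versus-pairwise distinction concrete, here is a small numpy illustration (not part of the PR; the toy data and the pairwise_corr helper are made up, with the NaN standing in for a document that lacks a value such as FOO_Corp_Stock_Value):

import numpy

# three fields; one row is missing the first field
data = numpy.array([
    [1.0,       2.0, 0.5],
    [2.0,       1.5, 1.0],
    [numpy.nan, 3.0, 1.5],
    [4.0,       2.5, 2.0],
    [5.0,       4.0, 2.5],
])

def pairwise_corr(x):
    # correlation per field pair, using every row where *both* fields are present
    d = x.shape[1]
    out = numpy.full((d, d), numpy.nan)
    for i in range(d):
        for j in range(d):
            ok = ~numpy.isnan(x[:, i]) & ~numpy.isnan(x[:, j])
            if ok.sum() > 1:
                out[i, j] = numpy.corrcoef(x[ok, i], x[ok, j])[0, 1]
    return out

# listwise deletion: drop every row that is missing *any* field
complete = data[~numpy.isnan(data).any(axis=1)]
print(numpy.corrcoef(complete, rowvar=False))  # fields 2 and 3 lose an observation they both have
print(pairwise_corr(data))                     # fields 2 and 3 keep all five rows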

Also related, @polyfractal mentions tests for correlation in #16817:

Time-series data isn't i.i.d. (current value is usually related to last value in some fashion), so you'd likely need other pipeline aggs to first remove inter-series dependence before testing for correlation

If this agg cannot accept sub-aggregations (e.g., extended stats) then consider returning the appropriate stats with the agg. I'd argue that the appropriate stats are extended stats (incl. counts, means and variance) for variables, by pair. Combined with an option for listwise deletion, one should be able to produce a variance-covariance matrix and perform many statistical tests.

@polyfractal also mentions serial correlation. Down the road, there probably should be a way to examine residuals if this agg is extended to test hypotheses about a population beyond the bivariate case. But the output now seems sound: unless I am mistaken, the sample correlation coefficient alone (as output now) should be an unbiased estimate of the population parameter to the extent that the items are normally distributed and were measured without error; it is the standard error of this estimate (and therefore the test statistics) that is impacted. Now, this isn't to say that the coefficient won't change after partialling out one or more lags, but that might be best viewed as a specification issue on the part of the user.

For example, in the case of OLS (which would be a nice implementation, down the road), the test statistic makes distributional assumptions about the residuals: all elements along the diagonal of the residual var/cov matrix equal a constant (homogeneity) and elements off the diagonal are zero (independence). When this i.i.d. assumption is violated, beta estimates are still unbiased and consistent, but variance estimates of the OLS estimators are both biased and inconsistent, so the standard errors are not trustworthy.

@nknize added the discuss label Mar 3, 2016
@nknize (author) commented Mar 3, 2016

This is some wonderful feedback! I've had a chance to revisit this a bit and should have an updated commit in a day or so.

Consider implementing pairwise deletion and making this the default.

Will be addressing this shortly. Though I think there's plenty of time to get this as an enhancement in another PR?

consider if the ticker symbol changed for "FOO_Corp_Stock" during this period

Here we're assuming the stock ticker is the fieldname. So if the ticker symbol changes the application will need to map that to the field name.

consider if data aren't missing at random and missingness on A moderates the relationship between B and C.

I think this is a biggie. I'm looking at addressing this by way of count-based weights, but I'm just getting started on this part. Again, this might hold off to another PR?

Based on the feedback I'm thinking of making the following changes:

  1. The agg will be changed to be more than just correlation, so I'll be refactoring it to multifield_stats. Any objection to the name? The multifield_stats agg will provide:
    • count - per-field count to (eventually) communicate missing values
    • mean - per-field means
    • covariance - upper triangle matrix (to include diagonals for variance) of the variance-covariance
  2. In addition to the above, a sister extended_multifield_stats (not really liking the name, so I'm open to suggestions) will provide the following:
    • correlation - upper triangle matrix (to include diagonals for autocorrelation) of the Pearson product-moment correlation coefficients
    • kurtosis - vector describing the shape of the per-field distributions
    • skewness - vector describing the symmetry of the per-field distributions about the means

I've completed counts, mean, covariance, and correlation. Before I split this into two separate aggs and add kurtosis and skewness, I wanted to solicit feedback on the usefulness of the additional measurements and on having two separate aggregations.

Update: Might be nice to just merge this with stats and extended_stats where they can accept either single or multiple fields.

@nknize (author) commented Mar 5, 2016

Updated the PR to do the following:

  • rename aggregation to multifield_stats
  • adds mean, variance, skewness, kurtosis, covariance
  • updates singleFieldTest to include testing all generated statistics

Still todo in this PR:

  • add testing for missing and multi values
  • incorporate @colings86 feedback re: script names
  • update javadocs

Interested in feedback @wamuir, @colings86, @polyfractal

@nknize force-pushed the feature/correlation_agg_WIP branch from 71cc104 to 61db3e8 on March 10, 2016
@nknize (author) commented Mar 10, 2016

@colings86 If you get a chance, can you have a look at commit 61db3e8? I want to make sure I didn't miss anything while trying to parse the Scripts as a Map.

@nknize force-pushed the feature/correlation_agg_WIP branch from 4ff1401 to aa6a1a5 on March 11, 2016
@nknize (author) commented Mar 11, 2016

So the PR has gotten pretty big, but everything should be ready. This does not yet handle pairwise missing values. As discussed above that can be handled in a separate PR.

Here is an example of the output:

  "aggregations" : {
    "multifieldstats" : {
      "counts" : {
        "randVal1" : 100000,
        "value" : 100000,
        "randVal2" : 100000,
        "values" : 100000
      },
      "means" : {
        "randVal1" : 0.5007814114859065,
        "value" : 50000.499999999134,
        "randVal2" : 0.5004831083181183,
        "values" : 50001.999999999134
      },
      "variances" : {
        "randVal1" : 0.08361899306430848,
        "value" : 8.333416666666644E8,
        "randVal2" : 0.0832774520285396,
        "values" : 8.333416666666644E8
      },
      "skewness" : {
        "randVal1" : -0.0030206367840570187,
        "value" : 7.154127608560924E-17,
        "randVal2" : -7.692826885603065E-4,
        "values" : 7.154127608560924E-17
      },
      "kurtosis" : {
        "randVal1" : 1.8005677156780868,
        "value" : 1.799999999760016,
        "randVal2" : 1.8008100300780934,
        "values" : 1.799999999760016
      },
      "covariance" : {
        "randVal1" : {
          "value" : 58.078219752650746,
          "randVal2" : -5.485860518247862E-4,
          "values" : 58.078219752650746
        },
        "value" : {
          "randVal2" : -7.158019911939834,
          "values" : 8.333416666666662E8
        },
        "randVal2" : {
          "values" : -7.158019911939815
        }
      },
      "correlation" : {
        "randVal1" : {
          "value" : 0.006957436968003296,
          "randVal2" : -0.006573983049459838,
          "values" : 0.006957436968003296
        },
        "value" : {
          "randVal2" : -8.59246237772021E-4,
          "values" : 1.000000000000002
        },
        "randVal2" : {
          "values" : -8.592462377720186E-4
        }
      }
    }
  }

It might be nice (in another PR) to specify which stats the user would like returned. It may not change what is computed, but it would simplify the output if the application only cares about certain fields.

/cc @polyfractal @colings86 @wamuir @markharwood

@nknize changed the title from [WIP] Correlation Aggregation to [WIP] MultiField_Stats Aggregation on Mar 11, 2016
@Override
public double getMean(String field) {
    if (multiFieldStatsResults == null || multiFieldStatsResults.means.containsKey(field) == false) {
        return Double.NaN;

Review comment from a contributor:

General thought for these various Double.NaN's .. should we be returning null instead (and subsequently returning a Double instead of double)? E.g. because some fields may actually have NaN's as real values, and lacking a field isn't quite the same as not-a-number?

Not sure, probably needs several more opinions :)

@nknize (author) commented Mar 16, 2016

@rashidkpc I spoke with @spalger a little bit about this at the all hands. The tabular heatmap visualization is typical for a correlation matrix. Strong positive correlation (closer to +1.0) is usually depicted as a red cell with weaker correlation (closer to 0.0) colored white and strong negative correlation (closer to -1.0) colored blue.

Here's a good example:

[image: network_correlations (example correlation matrix heatmap)]

@spalger (Contributor) commented Mar 16, 2016

Yeah, I'm fine with that structure. We'll have to transform it from object to array, but as long as the fields aren't in a meaningful order this is fine.
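
A rough sketch of that object-to-array transform (assuming the upper-triangle correlation object from the example response above has already been parsed into a Python dict; the correlation_matrix helper and the field ordering are made up, and the diagonal is filled with 1.0):

def correlation_matrix(corr_obj, fields):
    # expand the upper-triangle correlation object into a full symmetric matrix
    size = len(fields)
    matrix = [[1.0 if i == j else None for j in range(size)] for i in range(size)]
    for i, row_field in enumerate(fields):
        for col_field, value in corr_obj.get(row_field, {}).items():
            j = fields.index(col_field)
            matrix[i][j] = value
            matrix[j][i] = value  # mirror into the lower triangle
    return matrix

# e.g. correlation_matrix(resp["aggregations"]["multifieldstats"]["correlation"],
#                         ["randVal1", "value", "randVal2", "values"])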

@rashidkpc commented:

@nknize Can you add a simplified example of the current request and response format?

@nknize (author) commented Apr 11, 2016

@rashidkpc Sure thing. Here's a simple two-variable example using the same data as above:

  • poverty
  • income

Request:

"aggs": {
    "multifieldstats": {
        "multifield_stats": {
            "field": ["poverty", "income"]
        }
     }
}

Response:

"aggregations": {
    "multifieldstats": {
        "income": {
            "count": 50,
            "mean": 51985.1,
            "variance": 7.383377037755103E7,
            "skewness": 0.5595114003506483,
            "kurtosis": 2.5692365287787124,
            "covariance_correlation": {
                "poverty": {
                    "covariance": -21093.65836734694,
                    "correlation": -0.8352655256272504
                }
            }
        },
        "poverty": {
            "count": 50,
            "mean": 12.732000000000001,
            "variance": 8.637730612244896,
            "skewness": 0.4516049811903419,
            "kurtosis": 2.8615929677997767,
            "covariance_correlation": {
                "income": {
                    "covariance": -21093.65836734694,
                    "correlation": -0.8352655256272504
                }
            }
        }
    }
}
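
As a quick sanity check on those numbers, the reported correlation can be recomputed from the reported covariance and variances (correlation = covariance / sqrt(variance_income * variance_poverty)); the values below are copied from the response:

import math

covariance = -21093.65836734694
var_income = 7.383377037755103e7
var_poverty = 8.637730612244896

print(covariance / math.sqrt(var_income * var_poverty))  # ~ -0.8352655, matching "correlation" above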

@nknize force-pushed the feature/correlation_agg_WIP branch from 8d1d12e to ea5348e on April 12, 2016
@nknize removed the WIP label on Apr 12, 2016
@nknize (author) commented Apr 12, 2016

@clintongormley how about making this feature experimental for 5.0.0?

delta.put(fieldName, fieldValue * docCount - fieldSum.get(fieldName));

// update running mean, variance, skewness, kurtosis
if (means.containsKey(fieldName) == true) {

Review comment from a contributor:

Could we add a reference here? Some paper, book chapter, wikipedia article or something.
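
Incremental updates of this shape are commonly traced to Welford's online algorithm for the mean and variance, to Chan, Golub, and LeVeque for the pairwise/parallel merge, and to Pébay's one-pass formulas for arbitrary-order moments and comoments; the Wikipedia article "Algorithms for calculating variance" collects them. A scalar Python sketch of the per-value update in that style (illustrative only, not the PR's code; the update helper and the uniform-data check are made up):

import numpy

def update(count, mean, m2, m3, m4, value):
    # one-pass update of the first four central moment sums for a single field
    n1 = count
    count += 1
    delta = value - mean
    delta_n = delta / count
    delta_n2 = delta_n * delta_n
    term1 = delta * delta_n * n1
    mean += delta_n
    m4 += term1 * delta_n2 * (count * count - 3 * count + 3) + 6 * delta_n2 * m2 - 4 * delta_n * m3
    m3 += term1 * delta_n * (count - 2) - 3 * delta_n * m2
    m2 += term1
    return count, mean, m2, m3, m4

# check on uniform data: the (biased, non-excess) kurtosis of a uniform distribution is 9/5 = 1.8,
# which lines up with the ~1.8 kurtosis reported for the uniformly distributed fields above
values = numpy.random.rand(100000)
state = (0, 0.0, 0.0, 0.0, 0.0)
for v in values:
    state = update(*state, v)
count, mean, m2, m3, m4 = state
print("skewness", count ** 0.5 * m3 / m2 ** 1.5)  # ~ 0
print("kurtosis", count * m4 / m2 ** 2)           # ~ 1.8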

nknize added 11 commits May 3, 2016 10:27
Adds the first Multi-Field metric agg that computes the Pearson product-moment correlation coefficient for a given list of numeric fields.
This commit refactors the correlation specific aggregator to a more generic MultiFieldStats aggregator. It also adds the following statistics:

* Field Count (to capture missing values - not yet implemented)
* Mean
* Variance
* Skewness
* Kurtosis
* Variance-Covariance matrix
* Correlation
This commit fixes the following:

* Splits results from running statistics by way of mutable RunningStats and Immutable MultiFieldStatsResults classes
* Updates output to print stats per field
* Updates based on PR feedback
@clintongormley commented:

Superseded by #18300
