[WIP] MultiField_Stats Aggregation #16826

Closed
wants to merge 11 commits

Conversation

@nknize (Contributor) commented Feb 26, 2016

re: issue #16817 this is a WIP PR (for initial feedback and review) to add the first Multi-Field metric agg that computes the Pearson product-moment correlation coefficient for a given list of numeric fields.

It's a pretty simple single-pass calculation that I simulated in Matlab and validated against Matlab's corr method. This PR includes a validation of the actual aggregation results (computed across shards) against the expected value computed over a single large set (see CorrelationIT). Initial rough calculations over a set of 1M distributed observations give at least 1e-7 accuracy.

UPDATE: This aggregation is now called multifield_stats. It takes a list of multiple fields and computes the following statistics:

  1. counts
  2. means
  3. variances
  4. skewness
  5. kurtosis
  6. covariance
  7. correlation
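For context on how these statistics can be computed in a single pass and merged across shards, here is a minimal Python sketch (illustrative only, not the PR's Java implementation; the RunningCoStats class and the random test data are made up, and only count/mean/covariance/correlation are shown):

import numpy

class RunningCoStats:
    """One-pass (Welford-style) accumulation of count, means, and co-moments."""

    def __init__(self, num_fields):
        self.count = 0
        self.means = numpy.zeros(num_fields)
        self.comoments = numpy.zeros((num_fields, num_fields))

    def add(self, row):
        # incorporate a single observation (one value per field)
        row = numpy.asarray(row, dtype=float)
        self.count += 1
        delta_old = row - self.means
        self.means += delta_old / self.count
        delta_new = row - self.means
        self.comoments += numpy.outer(delta_old, delta_new)

    def merge(self, other):
        # combine two partial results (e.g. per-shard states) without revisiting the data
        total = self.count + other.count
        delta = other.means - self.means
        self.comoments += other.comoments + numpy.outer(delta, delta) * self.count * other.count / total
        self.means += delta * other.count / total
        self.count = total

    def correlation(self):
        # Pearson correlation matrix; the sample-size normalization cancels out
        scale = numpy.sqrt(numpy.diag(self.comoments))
        return self.comoments / numpy.outer(scale, scale)

# quick check against numpy, splitting the data into two "shards"
data = numpy.random.randn(10000, 3)
left, right = RunningCoStats(3), RunningCoStats(3)
for row in data[:6000]:
    left.add(row)
for row in data[6000:]:
    right.add(row)
left.merge(right)
print(abs(left.correlation() - numpy.corrcoef(data, rowvar=False)).max())

The merge step is what lets each shard reduce its documents independently while still reproducing the single-set result up to floating-point error.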

@wamuir commented Feb 28, 2016

Define a correlation matrix for x1...x4:

import numpy

corr = numpy.array(
    [[1.00, .397, .214, .569],
     [.397, 1.00, .448, .439],
     [.214, .448, 1.00, .404],
     [.569, .439, .404, 1.00]])

Generate some data (10M obs):

numpy.random.seed(seed=1)
mtx = numpy.random.randn(4, 10000000)
chol = numpy.linalg.cholesky(corr)
data = numpy.dot(chol, mtx)

numpy.set_printoptions(precision=17)
print(numpy.corrcoef(data))
[[ 1. 0.39650687829250175 0.21417973223669709 0.56896721641827785]
[ 0.39650687829250175 1. 0.44797611026090212 0.43881323671321049]
[ 0.21417973223669709 0.44797611026090212 1. 0.40411226138655432]
[ 0.56896721641827785 0.43881323671321049 0.40411226138655432 1. ]]

Then transpose the data (numpy.matrix.transpose(data)), index the values as doubles, and run the "pearson_correlation" ES query, which returns:
[[ 1. ]
[ 0.3965068782924764 1. ]
[ 0.21417973223672318 0.44797611026089856 1. ]
[ 0.5689672164183015 0.43881323671322076 0.40411226138653933 1. ]]

Very good agreement.

/**
 * Interface for Correlation Metric Aggregation
 */
public interface Correlation extends Aggregation {
    long count();

Review comment from a contributor:

Could you add Javadocs to these methods?

@colings86 (Contributor) commented:

@nknize I left a couple of comments but I like it so far

@wamuir commented Feb 29, 2016

Consider implementing pairwise deletion and making this the default. For one potential gotcha w/ listwise deletion, using the example provided by @nknize in #16817, consider replacing FOO_Corp_Stock_Value with FB or TWTR:

"pearson_correlation" : {
    "correlation" : {
        "field" : [ "FOO_Corp_Stock_Value", "AAPL", "INDEXSP" ],
    }
}

Also consider if the ticker symbol changed for "FOO_Corp_Stock" during this period, and the tickers are indexed separately. This throws an exception, but the best we could get back with listwise deletion is null output. More importantly, consider if data aren't missing at random and missingness on A moderates the relationship between B and C.
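
To make the listwise-versus-pairwise distinction concrete, here is a small numpy illustration (not part of the PR; the toy data and the pairwise_corr helper are made up, with the NaN standing in for a document that lacks a value such as FOO_Corp_Stock_Value):

import numpy

# three fields; one row is missing the first field
data = numpy.array([
    [1.0,       2.0, 0.5],
    [2.0,       1.5, 1.0],
    [numpy.nan, 3.0, 1.5],
    [4.0,       2.5, 2.0],
    [5.0,       4.0, 2.5],
])

def pairwise_corr(x):
    # correlation per field pair, using every row where *both* fields are present
    d = x.shape[1]
    out = numpy.full((d, d), numpy.nan)
    for i in range(d):
        for j in range(d):
            ok = ~numpy.isnan(x[:, i]) & ~numpy.isnan(x[:, j])
            if ok.sum() > 1:
                out[i, j] = numpy.corrcoef(x[ok, i], x[ok, j])[0, 1]
    return out

# listwise deletion: drop every row that is missing *any* field
complete = data[~numpy.isnan(data).any(axis=1)]
print(numpy.corrcoef(complete, rowvar=False))  # fields 2 and 3 lose an observation they both have
print(pairwise_corr(data))                     # fields 2 and 3 keep all five rows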

Also related, @polyfractal mentions tests for correlation in #16817:

Time-series data isn't i.i.d. (current value is usually related to last value in some fashion), so you'd likely need other pipeline aggs to first remove inter-series dependence before testing for correlation

If this agg cannot accept sub-aggregations (e.g., extended stats) then consider returning the appropriate stats with the agg. I'd argue that the appropriate stats are extended stats (incl. counts, means and variance) for variables, by pair. Combined with an option for listwise deletion, one should be able to produce a variance-covariance matrix and perform many statistical tests.

@polyfractal also mentions serial correlation. Down the road, there probably should be a way to examine residuals if this agg is extended to test hypotheses about a population beyond the bivariate case. But the output now seems sound: unless I am mistaken, the sample correlation coefficient alone (as output now) should be an unbiased estimate of the population parameter to the extent that the items are normally distributed and were measured without error; it is the standard error of this estimate (and therefore the test statistics) that is impacted. Now, this isn't to say that the coefficient won't change after partialling out one or more lags, but that might be best viewed as a specification issue on the part of the user.

For example, in the case of OLS (which would be a nice implementation, down the road), the test statistic makes distributional assumptions about the residuals: all elements along the diagonal of the residual var/cov matrix equal a constant (homogeneity) and elements off the diagonal are zero (independence). When this i.i.d. assumption is violated, beta estimates are still unbiased and consistent, but variance estimates of the OLS estimators are both biased and inconsistent, so the standard errors are not trustworthy.

@nknize added the discuss label Mar 3, 2016
@nknize (author) commented Mar 3, 2016

This is some wonderful feedback! I've had a chance to revisit this a bit and should have an updated commit in a day or so.

Consider implementing pairwise deletion and making this the default.

Will be addressing this shortly. Though I think there's plenty of time to get this as an enhancement in another PR?

consider if the ticker symbol changed for "FOO_Corp_Stock" during this period

Here we're assuming the stock ticker is the fieldname. So if the ticker symbol changes the application will need to map that to the field name.

consider if data aren't missing at random and missingness on A moderates the relationship between B and C.

I think this is a biggie. I'm looking at addressing this by way of count-based weights, but I'm just getting started on this part. Again, this might hold off to another PR?

Based on the feedback I'm thinking of making the following changes:

  1. The agg will be changed to be more than just correlation, so I'll be refactoring it to multifield_stats. Any objection to the name? The multifield_stats agg will provide:
    • count - per-field count to (eventually) communicate missing values
    • mean - per-field means
    • covariance - upper triangle matrix (to include diagonals for variance) of the variance-covariance
  2. In addition to the above, a sister extended_multifield_stats (not really liking the name, so I'm open to suggestions) will provide the following:
    • correlation - upper triangle matrix (to include diagonals for autocorrelation) of the Pearson product-moment correlation coefficients
    • kurtosis - vector describing the shape of the per-field distributions
    • skewness - vector describing the symmetry of the per-field distributions about the means

I've completed counts, mean, covariance, and correlation. Before I split this into two separate aggs and add kurtosis and skewness, I wanted to solicit feedback on the usefulness of the additional measurements and on having two separate aggregations.

Update: Might be nice to just merge this with stats and extended_stats where they can accept either single or multiple fields.

@nknize (author) commented Mar 5, 2016

Updated the PR to do the following:

  • rename aggregation to multifield_stats
  • adds mean, variance, skewness, kurtosis, covariance
  • updates singleFieldTest to include testing all generated statistics

Still todo in this PR:

  • add testing for missing and multi values
  • incorporate @colings86 feedback re: script names
  • update javadocs

Interested in feedback @wamuir, @colings86, @polyfractal

@nknize force-pushed the feature/correlation_agg_WIP branch from 71cc104 to 61db3e8 on March 10, 2016
@nknize (author) commented Mar 10, 2016

@colings86 If you get a chance, can you have a look at commit 61db3e8? I want to make sure I didn't miss anything while trying to parse the Scripts as a Map.

@nknize force-pushed the feature/correlation_agg_WIP branch from 4ff1401 to aa6a1a5 on March 11, 2016
@nknize (author) commented Mar 11, 2016

So the PR has gotten pretty big, but everything should be ready. This does not yet handle pairwise missing values. As discussed above that can be handled in a separate PR.

Here is an example of the output:

  "aggregations" : {
    "multifieldstats" : {
      "counts" : {
        "randVal1" : 100000,
        "value" : 100000,
        "randVal2" : 100000,
        "values" : 100000
      },
      "means" : {
        "randVal1" : 0.5007814114859065,
        "value" : 50000.499999999134,
        "randVal2" : 0.5004831083181183,
        "values" : 50001.999999999134
      },
      "variances" : {
        "randVal1" : 0.08361899306430848,
        "value" : 8.333416666666644E8,
        "randVal2" : 0.0832774520285396,
        "values" : 8.333416666666644E8
      },
      "skewness" : {
        "randVal1" : -0.0030206367840570187,
        "value" : 7.154127608560924E-17,
        "randVal2" : -7.692826885603065E-4,
        "values" : 7.154127608560924E-17
      },
      "kurtosis" : {
        "randVal1" : 1.8005677156780868,
        "value" : 1.799999999760016,
        "randVal2" : 1.8008100300780934,
        "values" : 1.799999999760016
      },
      "covariance" : {
        "randVal1" : {
          "value" : 58.078219752650746,
          "randVal2" : -5.485860518247862E-4,
          "values" : 58.078219752650746
        },
        "value" : {
          "randVal2" : -7.158019911939834,
          "values" : 8.333416666666662E8
        },
        "randVal2" : {
          "values" : -7.158019911939815
        }
      },
      "correlation" : {
        "randVal1" : {
          "value" : 0.006957436968003296,
          "randVal2" : -0.006573983049459838,
          "values" : 0.006957436968003296
        },
        "value" : {
          "randVal2" : -8.59246237772021E-4,
          "values" : 1.000000000000002
        },
        "randVal2" : {
          "values" : -8.592462377720186E-4
        }
      }
    }
  }

It might be nice (in another PR) to specify which stats the user would like returned. It may not change what is computed, but it would simplify the output if the application only cares about certain fields.

/cc @polyfractal @colings86 @wamuir @markharwood

@nknize changed the title from [WIP] Correlation Aggregation to [WIP] MultiField_Stats Aggregation on Mar 11, 2016
@Override
public double getMean(String field) {
    if (multiFieldStatsResults == null || multiFieldStatsResults.means.containsKey(field) == false) {
        return Double.NaN;

Review comment from a contributor:

General thought for these various Double.NaN's .. should we be returning null instead (and subsequently returning a Double instead of double)? E.g. because some fields may actually have NaN's as real values, and lacking a field isn't quite the same as not-a-number?

Not sure, probably needs several more opinions :)

@nknize (author) commented Mar 16, 2016

@rashidkpc I spoke with @spalger a little bit about this at the all hands. The tabular heatmap visualization is typical for a correlation matrix. Strong positive correlation (closer to +1.0) is usually depicted as a red cell with weaker correlation (closer to 0.0) colored white and strong negative correlation (closer to -1.0) colored blue.

Here's a good example:

[image: network_correlations (example correlation matrix heatmap)]

@spalger (Contributor) commented Mar 16, 2016

Yeah, I'm fine with that structure. We'll have to transform it from object to array, but as long as the fields aren't in a meaningful order this is fine.
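
A rough sketch of that object-to-array transform (assuming the upper-triangle correlation object from the example response above has already been parsed into a Python dict; the correlation_matrix helper and the field ordering are made up, and the diagonal is filled with 1.0):

def correlation_matrix(corr_obj, fields):
    # expand the upper-triangle correlation object into a full symmetric matrix
    size = len(fields)
    matrix = [[1.0 if i == j else None for j in range(size)] for i in range(size)]
    for i, row_field in enumerate(fields):
        for col_field, value in corr_obj.get(row_field, {}).items():
            j = fields.index(col_field)
            matrix[i][j] = value
            matrix[j][i] = value  # mirror into the lower triangle
    return matrix

# e.g. correlation_matrix(resp["aggregations"]["multifieldstats"]["correlation"],
#                         ["randVal1", "value", "randVal2", "values"])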

@rashidkpc commented:

@nknize Can you add a simplified example of the current request and response format?

@nknize (author) commented Apr 11, 2016

@rashidkpc Sure thing. Here's a simple two-variable example using the same data as above:

  • poverty
  • income

Request:

"aggs": {
    "multifieldstats": {
        "multifield_stats": {
            "field": ["poverty", "income"]
        }
     }
}

Response:

"aggregations": {
    "multifieldstats": {
        "income": {
            "count": 50,
            "mean": 51985.1,
            "variance": 7.383377037755103E7,
            "skewness": 0.5595114003506483,
            "kurtosis": 2.5692365287787124,
            "covariance_correlation": {
                "poverty": {
                    "covariance": -21093.65836734694,
                    "correlation": -0.8352655256272504
                }
            }
        },
        "poverty": {
            "count": 50,
            "mean": 12.732000000000001,
            "variance": 8.637730612244896,
            "skewness": 0.4516049811903419,
            "kurtosis": 2.8615929677997767,
            "covariance_correlation": {
                "income": {
                    "covariance": -21093.65836734694,
                    "correlation": -0.8352655256272504
                }
            }
        }
    }
}
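
As a quick sanity check on those numbers, the reported correlation can be recomputed from the reported covariance and variances (correlation = covariance / sqrt(variance_income * variance_poverty)); the values below are copied from the response:

import math

covariance = -21093.65836734694
var_income = 7.383377037755103e7
var_poverty = 8.637730612244896

print(covariance / math.sqrt(var_income * var_poverty))  # ~ -0.8352655, matching "correlation" above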

@nknize force-pushed the feature/correlation_agg_WIP branch from 8d1d12e to ea5348e on April 12, 2016
@nknize removed the WIP label on Apr 12, 2016
@nknize (author) commented Apr 12, 2016

@clintongormley how about making this feature experimental for 5.0.0?

delta.put(fieldName, fieldValue * docCount - fieldSum.get(fieldName));

// update running mean, variance, skewness, kurtosis
if (means.containsKey(fieldName) == true) {

Review comment from a contributor:

Could we add a reference here? Some paper, book chapter, wikipedia article or something.
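
Incremental updates of this shape are commonly traced to Welford's online algorithm for the mean and variance, to Chan, Golub, and LeVeque for the pairwise/parallel merge, and to Pébay's one-pass formulas for arbitrary-order moments and comoments; the Wikipedia article "Algorithms for calculating variance" collects them. A scalar Python sketch of the per-value update in that style (illustrative only, not the PR's code; the update helper and the uniform-data check are made up):

import numpy

def update(count, mean, m2, m3, m4, value):
    # one-pass update of the first four central moment sums for a single field
    n1 = count
    count += 1
    delta = value - mean
    delta_n = delta / count
    delta_n2 = delta_n * delta_n
    term1 = delta * delta_n * n1
    mean += delta_n
    m4 += term1 * delta_n2 * (count * count - 3 * count + 3) + 6 * delta_n2 * m2 - 4 * delta_n * m3
    m3 += term1 * delta_n * (count - 2) - 3 * delta_n * m2
    m2 += term1
    return count, mean, m2, m3, m4

# check on uniform data: the (biased, non-excess) kurtosis of a uniform distribution is 9/5 = 1.8,
# which lines up with the ~1.8 kurtosis reported for the uniformly distributed fields above
values = numpy.random.rand(100000)
state = (0, 0.0, 0.0, 0.0, 0.0)
for v in values:
    state = update(*state, v)
count, mean, m2, m3, m4 = state
print("skewness", count ** 0.5 * m3 / m2 ** 1.5)  # ~ 0
print("kurtosis", count * m4 / m2 ** 2)           # ~ 1.8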

nknize added 11 commits May 3, 2016 10:27
Adds the first Multi-Field metric agg that computes the Pearson product-moment correlation coefficient for a given list of numeric fields.
This commit refactors the correlation specific aggregator to a more generic MultiFieldStats aggregator. It also adds the following statistics:

* Field Count (to capture missing values - not yet implemented)
* Mean
* Variance
* Skewness
* Kurtosis
* Variance-Covariance matrix
* Correlation
This commit fixes the following:

* Splits results from running statistics by way of mutable RunningStats and Immutable MultiFieldStatsResults classes
* Updates output to print stats per field
* Updates based on PR feedback
@clintongormley commented:

Superseded by #18300
