
ECOD implementation does not correspond to what is described in the paper (and as it stands the skewness part does nothing here because of it) #453

Open
L-F-A opened this issue Nov 3, 2022 · 3 comments

Comments

L-F-A commented Nov 3, 2022

Summary:

Looking at the implementation of decision_function in ecod.py (and likewise the parallel version), it does not match what is described in the paper (https://arxiv.org/abs/2201.00382).

Furthermore, as a side effect of this implementation, the score obtained when considering the skewness has no effect whatsoever, i.e.:

self.O = np.maximum(self.U_l, self.U_r)
self.O = np.maximum(self.U_skew, self.O)

will always give the same result as just (one can comment out self.O = np.maximum(self.U_skew, self.O) and verify, without even having to think about it, that nothing changes):

self.O = np.maximum(self.U_l, self.U_r)

This is a real problem in the sense that we are doing work for nothing. The method as implemented can still be useful, but it is not ECOD as described in the paper.

Explanation:

Here is an in-depth explanation:

[screenshot: ecod.py code computing the negative log ECDF scores U_l, U_r and U_skew]

This correctly calculates the negative log of the left, right and skewness-corrected CDF for each feature of each data point.
However, the next operation chooses, for each feature of each data point, whichever of the left or right score is larger for that feature and data point and discards the other one.

[screenshot: ecod.py line 153, element-wise maximum of U_l and U_r]

This is not what is described in the paper. Here (line 153) we are left with a matrix where each element is either the left score or the right score, whichever is larger, for that feature. There is no notion of a maximum at the feature level in the paper as far as I can tell; it is at the data point level. And then the next line of code does not do anything at all.

[screenshot: ecod.py line taking the maximum of U_skew and the previous result]

Indeed, self.U_skew as calculated (see lines 150 and 151 above) is a matrix where each element is either the feature's left or right score, depending on the sign of that feature's skewness. But each element of self.O is already the maximum of the left and right scores of that feature, so taking the maximum again with either of them gives the same answer; we already have the maximum for that feature. Therefore, this way of implementing the score eliminates any effect of the skewness score, which I thought was one of the main points of the paper.
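To make this concrete, here is a small self-contained check with random stand-in values (not the PyOD code itself) showing that the U_skew maximum never changes self.O:

import numpy as np

rng = np.random.default_rng(0)
U_l = rng.random((5, 3))                     # stand-in for the per-feature left-tail scores
U_r = rng.random((5, 3))                     # stand-in for the per-feature right-tail scores
skewness = np.sign(rng.normal(size=3))       # per-feature skewness sign
U_skew = np.where(skewness < 0, U_l, U_r)    # per feature, either the left or the right score

O = np.maximum(U_l, U_r)
O_with_skew = np.maximum(U_skew, O)
print(np.array_equal(O, O_with_skew))        # True: the extra maximum is a no-op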

Finally, the sum over features is taken, i.e. for each row (data point) the sum is a mix of left and right scores, since each feature has a different maximum:

[screenshot: ecod.py code summing the scores over the features]

This is not what is described in Eqs. 4, 5 and 6 of the paper.

In conclusion:

PYOD:

The implementation here is: find a left and a right score for each feature of a data point, keep only the larger of the two for each feature, and then sum these per-feature maxima together. This effectively swaps the order of the sum and max operations compared to the paper.
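In short (paraphrasing the current ecod.py, with arrays of shape (n_samples, n_features)):

O = np.maximum(U_l, U_r)      # element-wise max per feature
O = np.maximum(U_skew, O)     # no-op, as argued above
scores = O.sum(axis=1)        # sum of mixed left/right scores per sample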

PAPER:
In the paper it is rather: find a left, a right and a skew score (left or right depending on the sign of the skewness) for each feature of a data point. Sum these scores over the features independently, left with left, right with right and skew with skew. Then choose the maximum of these three sums for each data point. This is very different from what is implemented.
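In symbols (my paraphrase of Eqs. 4-6, notation approximate), with $d$ features, per-feature left/right ECDFs $\hat{F}^{\text{left}}_j$, $\hat{F}^{\text{right}}_j$ and per-feature skewness $\gamma_j$:

$O_{\text{left}}(X_i)  = -\sum_{j=1}^{d} \log \hat{F}^{\text{left}}_j(X_i^{(j)})$
$O_{\text{right}}(X_i) = -\sum_{j=1}^{d} \log \hat{F}^{\text{right}}_j(X_i^{(j)})$
$O_{\text{auto}}(X_i)  = -\sum_{j=1}^{d} \left[ \mathbb{1}\{\gamma_j < 0\}\, \log \hat{F}^{\text{left}}_j(X_i^{(j)}) + \mathbb{1}\{\gamma_j \ge 0\}\, \log \hat{F}^{\text{right}}_j(X_i^{(j)}) \right]$
$O(X_i) = \max\{\, O_{\text{left}}(X_i),\; O_{\text{right}}(X_i),\; O_{\text{auto}}(X_i) \,\}$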

Discussion:

It seems that the way to implement what is in the paper, if one wants that algorithm, would be to replace the code at lines 153 to 159 with something like the following.

Sum across features independently for left, right and skew:

self.O_left = self.U_l.sum(axis=1)
self.O_right = self.U_r.sum(axis=1)
self.O_skew = self.U_skew.sum(axis=1)

For each row, keep the maximum of the left, right and skew scores:

O_tmp = np.concatenate((self.O_left.reshape(-1, 1), self.O_right.reshape(-1, 1), self.O_skew.reshape(-1, 1)), axis=1)

self.O = np.max(O_tmp, axis=1)

The decision score is then just self.O:

if hasattr(self, 'X_train'):
    decision_scores_ = self.O[-original_size:]
else:
    decision_scores_ = self.O
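For reference, here is a minimal self-contained sketch of the paper-style scoring (my own function name, not the PyOD API; the ECDF here is a naive O(n^2) version purely for illustration):

import numpy as np
from scipy.stats import skew

def ecod_scores_paper_style(X):
    # X: array of shape (n_samples, n_features)
    # naive ECDFs evaluated at the sample points (for illustration only)
    cdf_left = (X[None, :, :] <= X[:, None, :]).mean(axis=1)
    cdf_right = (X[None, :, :] >= X[:, None, :]).mean(axis=1)
    U_l = -np.log(cdf_left)
    U_r = -np.log(cdf_right)
    # skewness-corrected score: left tail for left-skewed features, right tail otherwise
    U_skew = np.where(skew(X, axis=0) < 0, U_l, U_r)
    # paper ordering: sum over features first, then take the per-sample maximum
    O_left = U_l.sum(axis=1)
    O_right = U_r.sum(axis=1)
    O_skew = U_skew.sum(axis=1)
    return np.maximum(np.maximum(O_left, O_right), O_skew)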

Lucew (Contributor) commented Dec 2, 2022

General

Hey Louis, Hey Yue,

I went through this issue, read the paper, and compared the formulas. IMHO Louis is correct. The summation definitely comes before the maximum selection (see steps 6 and 7 below), and the maximum should not be taken on the per-feature tail probabilities but on the accumulated log probability.

The steps I'm referring to are shown below in a screenshot from the original paper.

[screenshot: ECOD algorithm from the original paper]

In the algorithmic overview, the left tail probability U_l and the right tail probability U_r are clearly computed by summing over j, the feature dimension index. This is done in step 6. Only after that (in step 7) is the maximum selected. It also makes logical sense to first build the joint (log) probability of a sample across its features and then search for the maximum probability of being an outlier.

IMHO it is the same for the COPOD algorithm! If we decide this is unintentional and fix it, we should touch COPOD as well. See the algorithm excerpt and code below. One can quickly see that COPOD is implemented quite similarly and has the same behavior.

[screenshot: COPOD algorithm from the original COPOD paper]

Copied from the COPOD Implementation

self.U_l = -1 * np.log(np.apply_along_axis(ecdf, 0, X))
self.U_r = -1 * np.log(np.apply_along_axis(ecdf, 0, -X))

skewness = np.sign(skew(X, axis=0))
self.U_skew = self.U_l * -1 * np.sign(skewness - 1) + self.U_r * np.sign(skewness + 1)
self.O = np.maximum(self.U_skew, np.add(self.U_l, self.U_r) / 2)
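If we applied the same sum-before-max ordering to this COPOD excerpt, it might look roughly as follows. This keeps the left/right averaging that the excerpt already uses and only moves the feature-wise sum before the maximum; whether this is exactly what the COPOD paper prescribes would still need to be checked against that paper.

O_skew = self.U_skew.sum(axis=1)
O_avg = np.add(self.U_l, self.U_r).sum(axis=1) / 2
self.O = np.maximum(O_skew, O_avg)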

Additional remarks

Other than that, I think even the U_skew array is not computed entirely correctly, though this is only very minor. The problem is that we add up U_l and U_r in the case where the estimated skewness of a feature column is exactly zero. See the explanation below.

Let's analyze the code:

self.U_l = -1 * np.log(np.apply_along_axis(ecdf, 0, X))
self.U_r = -1 * np.log(np.apply_along_axis(ecdf, 0, -X))

skewness = np.sign(skew(X, axis=0))

self.U_skew = self.U_l * -1 * np.sign(skewness - 1) + self.U_r * np.sign(skewness + 1)

self.O = np.maximum(self.U_l, self.U_r)
self.O = np.maximum(self.U_skew, self.O)

The variable skewness refers to a vector with as many values as we have features per sample. Let's check skewness = np.sign(skew(X, axis=0)):

  1. skew < 0 -> np.sign(skew) = -1
  2. skew = 0 -> np.sign(skew) = 0
  3. skew > 0 -> np.sign(skew) = 1

The algorithm states we should select either the left tail probability or the right tail probability, per sample and feature, based on the feature's skewness.

Specifically, the left tail probability U_l should be selected if the skewness is negative, and the right tail probability U_r should be selected if the skewness is zero or positive. Let's check the values of np.sign(skewness - 1) and np.sign(skewness + 1), which are used in self.U_skew = self.U_l * -1 * np.sign(skewness - 1) + self.U_r * np.sign(skewness + 1):

  1. skew < 0 -> skewness = -1
    a) np.sign(skewness - 1) = np.sign(-1-1) = -1 -> U_l gets selected
    b) np.sign(skewness + 1) = np.sign(-1+1) = 0 -> U_r omitted
  2. skew = 0 -> skewness = 0
    a) np.sign(skewness - 1) = np.sign(0-1) = -1 -> U_l gets selected
    b) np.sign(skewness + 1) = np.sign(0+1) = 1 -> U_r gets selected
  3. skew > 0 -> skewness = 1
    a) np.sign(skewness - 1) = np.sign(1-1) = 0 -> U_l gets omitted
    b) np.sign(skewness + 1) = np.sign(1+1) = 1 -> U_r gets selected

So in the second case, both U_l and U_r get selected, which is not what is written in the paper. The algorithm there states that in the case of skewness = 0, U_r should be selected. I know this is a very unlikely case, as the skewness would have to be exactly, numerically zero, which will most likely never happen for an empirically estimated skewness. But I thought that if we discuss how to fix ECOD, we should consider everything.
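A tiny numeric check of the skewness = 0 case (hypothetical values, scalars instead of arrays for brevity):

import numpy as np

U_l, U_r = 2.0, 3.0
skewness = np.sign(0.0)                                                  # a feature with exactly zero skewness
U_skew = U_l * -1 * np.sign(skewness - 1) + U_r * np.sign(skewness + 1)
print(U_skew)                                                            # 5.0 = U_l + U_r, instead of just U_r = 3.0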

I would maybe fix it like this:

skewness = skew(X, axis=0)
self.U_skew = self.U_l.copy()
self.U_skew[:, skewness >= 0] = self.U_r[:, skewness >= 0]

or

skewness_weight = skew(X, axis=0) < 0
self.U_skew = self.U_l * skewness_weight + self.U_r * (~skewness_weight)
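With this version, a feature with exactly zero estimated skewness gets the right-tail score, as the paper prescribes. A quick check with hypothetical values:

import numpy as np
from scipy.stats import skew

X = np.array([[-1.0], [0.0], [1.0]])           # one feature with exactly zero skewness
U_l = np.full((3, 1), 2.0)
U_r = np.full((3, 1), 3.0)
skewness_weight = skew(X, axis=0) < 0          # array([False]) -> right tail selected
U_skew = U_l * skewness_weight + U_r * (~skewness_weight)
print(U_skew.ravel())                          # [3. 3. 3.], i.e. U_r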

I also have ideas on how to rewrite the estimated ECDF in order to get rid of the statsmodels dependency and only have NlogN runtime instead of the current 2NlogN. This would require implementing the ECDF myself, which is doable.
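For what it's worth, one possible numpy-only version could look like the sketch below (just an illustration, not necessarily how I would implement it): sort each feature column once and use searchsorted to get the counts, which also handles ties correctly.

import numpy as np

def column_ecdf(X):
    # ECDF of every entry within its own feature column:
    # out[i, j] = (# of rows k with X[k, j] <= X[i, j]) / n_samples
    n = X.shape[0]
    X_sorted = np.sort(X, axis=0)              # one O(n log n) sort per column
    out = np.empty(X.shape, dtype=float)
    for j in range(X.shape[1]):
        out[:, j] = np.searchsorted(X_sorted[:, j], X[:, j], side='right')
    return out / n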

If I have time this weekend, I might work on this!
Nice catch Louis!

Best regards
Lucas

yzhao062 (Owner) commented Dec 4, 2022

This is nice. PR welcomed!!!!

Lucew (Contributor) commented Dec 20, 2022

Hey!

Unfortunately, it took me a while... Christmas and Advent of Code got in the way.

I just now implemented an ECDF estimator that runs in less time than the previous one since it reuses the feature columns instead of searching for each value in them.

> I also have ideas on how to rewrite the estimated ECDF in order to get rid of the statsmodels dependency and only have NlogN runtime instead of the current 2NlogN. This would require implementing the ECDF myself, which is doable.

I will open another issue and a pull request for that once the CI tests are successful. This would yield the possibility of completely getting rid of the statsmodels dependency.

After that, I will go and fix the issue we discussed here.

Lucas
