calculate_annotation_association chokes on columns with only one unique value #1

Open
tobsecret opened this issue Sep 9, 2020 · 1 comment
@tobsecret
Collaborator

Expected behavior

Ignore columns in the annotation table that have only one unique value.

data = phdc.ProteomicsData(
    phospho = phospho,
    protein = protein,
    normed_phospho = normed_phospho,
    modules = modules,
    possible_regulator_list = possible_regulator_list,
)
data.add_annotations(
    #Filtering for tumor samples incidentally also makes it so all values in the Type column are 'Tumor'
    annotations.loc[annotations.Type=='Tumor'], 
    pd.Series(col_types)
) 
data.calculate_annotation_association(cat_method='RRA', cont_method='spearmanr')

Either calculate_annotation_association should ignore columns with only a single unique value, or add_annotations should treat them differently, e.g. store them in their own attribute:

################### new code ####################
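        # Columns with a single unique value carry no information for association tests
        self.non_unique_annotations = annotations[
            [column for column in annotations.columns if len(annotations[column].unique()) == 1]]
        # drop() removes rows by default, so the columns have to be named explicitly
        annotations = annotations.drop(columns=self.non_unique_annotations.columns)
        # NOTE: column_types presumably needs the same filtering, since it masks annotations.columns below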
###############################################
        self.categorical_annotations = binarize_categorical(
            annotations,
            annotations.columns[column_types == 0]
        )
        self.continuous_annotations = annotations[
            annotations.columns[column_types == 1]
        ].astype(float)
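
As a rough, standalone sketch of the same idea (not the package's actual code): `split_constant_columns` is a hypothetical helper name, and it assumes `column_types` holds one entry per annotation column, in column order, as the masks above suggest. It filters `column_types` together with the columns so the two stay aligned.

```python
import pandas as pd

def split_constant_columns(annotations: pd.DataFrame, column_types: pd.Series):
    """Hypothetical helper: separate out columns that hold a single unique value."""
    # nunique(dropna=False) counts NaN as a value, matching len(Series.unique())
    is_constant = (annotations.nunique(dropna=False) == 1).to_numpy()
    constant = annotations.loc[:, is_constant]
    informative = annotations.loc[:, ~is_constant]
    # Filter column_types by position so it stays aligned with the kept columns
    kept_types = column_types[~is_constant].reset_index(drop=True)
    return informative, constant, kept_types

# Toy data: after filtering for tumor samples, Type is constant
toy_annotations = pd.DataFrame({
    "Type": ["Tumor", "Tumor", "Tumor"],
    "Stage": ["I", "II", "I"],
    "Age": [50, 61, 47],
})
toy_types = pd.Series([0, 0, 1])  # 0 = categorical, 1 = continuous

informative, constant, kept_types = split_constant_columns(toy_annotations, toy_types)
print(list(constant.columns), list(informative.columns), kept_types.tolist())
```

Keeping the constant columns in their own frame, rather than discarding them, would let a caller still see which annotations were skipped.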

Observed behavior

Columns in the annotation DataFrame that contain only a single unique value cause a KeyError. After each column is converted to a dummy variable, a column that had only one value to begin with contains only True.
As a result, when calculate_annotation_association tries to pull the True and False rows for such a column, it finds no rows containing False:

KeyError                                  Traceback (most recent call last)
<ipython-input-47-e84a77b246a7> in <module>
      1 data.add_annotations(annotations.loc[phospho.columns], pd.Series(col_types))
----> 2 data.calculate_annotation_association(cat_method='RRA', cont_method='spearmanr')

/gpfs/data/ruggleslab/phosphodisco/phosphodisco/phosphodisco/classes.py in calculate_annotation_association(self, cat_method, cont_method, **multitest_kwargs)
    502         cat_annots = self.categorical_annotations
    503 
--> 504         cat = categorical_score_association(
    505             cat_annots,
    506             self.module_scores,

/gpfs/data/ruggleslab/phosphodisco/phosphodisco/phosphodisco/annotation_association.py in categorical_score_association(annotations, module_scores, cat_method, **test_kws)
    151         temp = annotations[col].reset_index()
    152         temp = temp.groupby(col)[indname].apply(list)
--> 153         results[col] = scores.apply(
    154             compare_fn,
    155             axis=1

~/scratch/miniconda3/envs/phdis/lib/python3.8/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
   6876             kwds=kwds,
   6877         )
-> 6878         return op.get_result()
   6879 
   6880     def applymap(self, func) -> "DataFrame":

~/scratch/miniconda3/envs/phdis/lib/python3.8/site-packages/pandas/core/apply.py in get_result(self)
    184             return self.apply_raw()
    185 
--> 186         return self.apply_standard()
    187 
    188     def apply_empty_result(self):

~/scratch/miniconda3/envs/phdis/lib/python3.8/site-packages/pandas/core/apply.py in apply_standard(self)
    293 
    294             try:
--> 295                 result = libreduction.compute_reduction(
    296                     values, self.f, axis=self.axis, dummy=dummy, labels=labels
    297                 )

pandas/_libs/reduction.pyx in pandas._libs.reduction.compute_reduction()

pandas/_libs/reduction.pyx in pandas._libs.reduction.Reducer.get_result()

/gpfs/data/ruggleslab/phosphodisco/phosphodisco/phosphodisco/annotation_association.py in <lambda>(row)
    146 
    147     compare_fn = lambda row: categorial_methods[cat_method](
--> 148         row[temp[True]], row[temp[False]], **test_kws
    149     )[1]
    150     for col in annotations.columns:

~/scratch/miniconda3/envs/phdis/lib/python3.8/site-packages/pandas/core/series.py in __getitem__(self, key)
    869         key = com.apply_if_callable(key, self)
    870         try:
--> 871             result = self.index.get_value(self, key)
    872 
    873             if not is_scalar(result):

~/scratch/miniconda3/envs/phdis/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
   4402         k = self._convert_scalar_indexer(k, kind="getitem")
   4403         try:
-> 4404             return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
   4405         except KeyError as e1:
   4406             if len(self) > 0 and (self.holds_integer() or self.is_boolean()):

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: False
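
The failure can probably be reproduced without the package at all. The sketch below mimics the groupby step shown in the traceback on a toy column; the column name `Type_Tumor`, the index name `sample`, and the sample labels are made up for illustration.

```python
import pandas as pd

# Toy stand-in for one binarized annotation column in which every sample is True
annotations = pd.DataFrame(
    {"Type_Tumor": [True, True, True]},
    index=pd.Index(["s1", "s2", "s3"], name="sample"),
)

col = "Type_Tumor"
temp = annotations[col].reset_index()
# Group sample names by the column's value, as in the traceback above
temp = temp.groupby(col)["sample"].apply(list)

true_samples = temp[True]  # the only group that exists

try:
    temp[False]  # there is no False group to look up
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)
```

A guard such as checking that both True and False are present in `temp.index` before calling the comparison function would be one way to skip such columns inside `categorical_score_association`.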
@tobsecret tobsecret self-assigned this Sep 9, 2020
@tobsecret
Collaborator Author

add_annotations should probably also separate out any categorical columns where every value is different, as can be the case for e.g. Sample.ID or Patient.ID. These columns add a disproportionate number of meaningless statistical tests to our analysis, which comes back to bite us when we correct for multiple testing.
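
A minimal sketch of how such ID-like columns could be detected. `find_all_unique_columns` is a hypothetical helper, and `pd.get_dummies` is used only as a stand-in for the package's binarization step, on the assumption that it behaves similarly.

```python
import pandas as pd

def find_all_unique_columns(annotations: pd.DataFrame, categorical_columns):
    """Hypothetical helper: categorical columns in which no value repeats."""
    cat = annotations[list(categorical_columns)]
    all_unique = cat.nunique(dropna=False) == len(cat)
    return list(all_unique[all_unique].index)

toy_annotations = pd.DataFrame({
    "Sample.ID": ["s1", "s2", "s3", "s4"],
    "Stage": ["I", "II", "I", "II"],
})

id_like = find_all_unique_columns(toy_annotations, ["Sample.ID", "Stage"])

# An all-unique column expands into one dummy column per sample,
# so it would add one extra test per sample to the multiple-testing burden
n_dummy_columns = pd.get_dummies(toy_annotations[id_like]).shape[1]
print(id_like, n_dummy_columns)
```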
