Add counts to annotated XML generation #371

Merged: 20 commits, Apr 26, 2023


@apriltuesday (Contributor) commented Apr 3, 2023

Count and compute simple metrics for comparing CMAT annotations with those in ClinVar. Primary functionality changes are:

  • map_genes.py and map_xrefs.py, which map gene symbols to Ensembl gene IDs (using BioMart) and xrefs to EFO synonyms (using OLS)
  • annotated_clinvar.py and set_metrics.py, which compute the metrics themselves
  • pipeline.nf, which hooks everything together via an optional --evaluate parameter

(The changes in create_efo_table.py and repeat_variant.py are unrelated; they only make some tests less flaky.)

Sample output below:

Overall counts (RCVs):
total                 2352
has_supported_measure 2350
has_supported_trait   1444

Gene annotations:
Total = 2350
        Category  Count  Percent  F1 Score 
     exact_match   2041    86.9%      1.00 
   cmat_superset     33     1.4%      0.67 
     cmat_subset    189     8.0%      0.62 
 divergent_match      1     0.0%      0.50 
        mismatch      1     0.0%      0.00 
 => both_present   2265    96.4%      0.96 
      cv_missing      0     0.0%      0.00 
    cmat_missing     72     3.1%      0.00 
    both_missing     13     0.6%      0.00 

Functional consequences:
Total = 2350
        Category  Count  Percent  F1 Score 
     exact_match   1657    70.5%      1.00 
   cmat_superset      0     0.0%      0.00 
     cmat_subset    608    25.9%      0.64 
 divergent_match      0     0.0%      0.00 
        mismatch      0     0.0%      0.00 
 => both_present   2265    96.4%      0.90 
      cv_missing      0     0.0%      0.00 
    cmat_missing      1     0.0%      0.00 
    both_missing     84     3.6%      0.00 

Trait mappings:
Total = 1686
        Category  Count  Percent  F1 Score 
     exact_match    838    49.7%      1.00 
   cmat_superset     34     2.0%      0.74 
     cmat_subset    420    24.9%      0.66 
 divergent_match     17     1.0%      0.49 
        mismatch    187    11.1%      0.00 
 => both_present   1496    88.7%      0.77 
      cv_missing    162     9.6%      0.00 
    cmat_missing     10     0.6%      0.00 
    both_missing     18     1.1%      0.00 
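
For readers unfamiliar with the categories in the tables above, here is a minimal sketch of how a set-based comparison of this kind could work. It is not the code in set_metrics.py: the category definitions are inferred from the category names, and the function names (`classify`, `f1_score`, `summarise`) are hypothetical. It also assumes each record carries one set of CMAT annotations and one set of ClinVar annotations, with ClinVar treated as the reference for precision and recall.

```python
from collections import Counter


def classify(cmat, clinvar):
    """Assign a comparison category to a pair of annotation sets.

    Category meanings are assumptions inferred from the names in the
    sample output, not taken from the PR's code.
    """
    if not cmat and not clinvar:
        return "both_missing"
    if not cmat:
        return "cmat_missing"
    if not clinvar:
        return "cv_missing"
    if cmat == clinvar:
        return "exact_match"
    if cmat > clinvar:   # CMAT has everything ClinVar has, plus more
        return "cmat_superset"
    if cmat < clinvar:   # CMAT has only part of what ClinVar has
        return "cmat_subset"
    if cmat & clinvar:   # some overlap, but neither contains the other
        return "divergent_match"
    return "mismatch"    # both present, nothing in common


def f1_score(cmat, clinvar):
    """F1 of the CMAT set against the ClinVar set (0.0 when there is no overlap)."""
    overlap = len(cmat & clinvar)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cmat)
    recall = overlap / len(clinvar)
    return 2 * precision * recall / (precision + recall)


def summarise(pairs):
    """Return {category: (count, percent of total)} for (cmat, clinvar) pairs."""
    counts = Counter(classify(cmat, clinvar) for cmat, clinvar in pairs)
    total = sum(counts.values())
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}
```

Under these assumptions, a record where CMAT reports genes {A, B} and ClinVar reports {A} falls into cmat_superset with precision 0.5 and recall 1.0, which gives an F1 of about 0.67. The per-category F1 values in the tables would then be averages over all records in that category.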

@apriltuesday apriltuesday marked this pull request as ready for review April 17, 2023 12:10
@apriltuesday apriltuesday self-assigned this Apr 18, 2023
@apriltuesday (Contributor, Author)

@tcezard @M-casado I've removed the "work in progress" sample output to avoid confusion; the top-level comment has the most up-to-date info. I realise it's a big PR, so feel free to focus your review on the most relevant bits.

@tcezard (Member) left a comment

Looks good. A tiny comment.

bin/evaluation/map_xrefs.py (outdated, resolved)
bin/trait_mapping/create_efo_table.py (outdated, resolved)
@apriltuesday (Contributor, Author)

Merging this to fix the memory issue in ClinVarDataset for evidence string generation. As always, comments are still welcome and will be addressed in a subsequent PR refining the metrics.

@apriltuesday apriltuesday merged commit b1778c9 into EBIvariation:master Apr 26, 2023