Add counts to annotated XML generation #371

apriltuesday · 2023-04-03T15:26:49Z

Count and compute simple metrics for comparing CMAT annotations with those in ClinVar. Primary functionality changes are:

- map_genes.py and map_xrefs.py for mapping gene symbols to Ensembl gene ID (using Biomart) and xrefs to EFO synonyms (using OLS)
- annotated_clinvar.py and set_metrics.py for actually computing the metrics
- pipeline.nf for hooking everything together via an optional --evaluate parameter

(Changes in create_efo_table.py and repeat_variant.py are unrelated; they just make some tests less flaky.)

Sample output below:

Overall counts (RCVs):
total                 2352
has_supported_measure 2350
has_supported_trait   1444

Gene annotations:
Total = 2350
        Category  Count  Percent  F1 Score 
     exact_match   2041    86.9%      1.00 
   cmat_superset     33     1.4%      0.67 
     cmat_subset    189     8.0%      0.62 
 divergent_match      1     0.0%      0.50 
        mismatch      1     0.0%      0.00 
 => both_present   2265    96.4%      0.96 
      cv_missing      0     0.0%      0.00 
    cmat_missing     72     3.1%      0.00 
    both_missing     13     0.6%      0.00 

Functional consequences:
Total = 2350
        Category  Count  Percent  F1 Score 
     exact_match   1657    70.5%      1.00 
   cmat_superset      0     0.0%      0.00 
     cmat_subset    608    25.9%      0.64 
 divergent_match      0     0.0%      0.00 
        mismatch      0     0.0%      0.00 
 => both_present   2265    96.4%      0.90 
      cv_missing      0     0.0%      0.00 
    cmat_missing      1     0.0%      0.00 
    both_missing     84     3.6%      0.00 

Trait mappings:
Total = 1686
        Category  Count  Percent  F1 Score 
     exact_match    838    49.7%      1.00 
   cmat_superset     34     2.0%      0.74 
     cmat_subset    420    24.9%      0.66 
 divergent_match     17     1.0%      0.49 
        mismatch    187    11.1%      0.00 
 => both_present   1496    88.7%      0.77 
      cv_missing    162     9.6%      0.00 
    cmat_missing     10     0.6%      0.00 
    both_missing     18     1.1%      0.00
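
For readers unfamiliar with the categories in the tables above, here is a minimal sketch of how two annotation sets could be categorised and scored. It is not the code in set_metrics.py: the function names `categorise` and `f1_score` are hypothetical, and the category definitions are assumptions inferred from the category names and the sample output (for example, a missing annotation on either side scores 0.00, in line with the "compute f1 only when both annotations present" commit).

```python
from typing import Iterable


def categorise(cmat: Iterable[str], clinvar: Iterable[str]) -> str:
    """Assign a comparison category to a pair of annotation sets.

    Assumed definitions, inferred from the category names in the report.
    """
    cmat_set, cv_set = set(cmat), set(clinvar)
    if not cmat_set and not cv_set:
        return "both_missing"
    if not cmat_set:
        return "cmat_missing"
    if not cv_set:
        return "cv_missing"
    if cmat_set == cv_set:
        return "exact_match"
    if cmat_set > cv_set:   # CMAT has everything in ClinVar, plus extras
        return "cmat_superset"
    if cmat_set < cv_set:   # CMAT has only part of what ClinVar has
        return "cmat_subset"
    if cmat_set & cv_set:   # some overlap, but each side has extras
        return "divergent_match"
    return "mismatch"       # both present, nothing in common


def f1_score(cmat: Iterable[str], clinvar: Iterable[str]) -> float:
    """F1 of the CMAT set against the ClinVar set; 0.0 if either is empty."""
    cmat_set, cv_set = set(cmat), set(clinvar)
    if not cmat_set or not cv_set:
        return 0.0
    overlap = len(cmat_set & cv_set)
    # Equivalent to 2 * precision * recall / (precision + recall)
    return 2 * overlap / (len(cmat_set) + len(cv_set))


# Placeholder identifiers, for illustration only
print(categorise({"GENE_A", "GENE_B"}, {"GENE_A"}))
print(round(f1_score({"GENE_A", "GENE_B"}, {"GENE_A"}), 2))
```

Under these assumptions, a two-element CMAT set that contains the single ClinVar element is a cmat_superset with F1 of about 0.67, and two two-element sets sharing one element are a divergent_match with F1 of 0.50. The "=> both_present" F1 values in the sample output are consistent with a count-weighted mean over the five both-present categories (for functional consequences, (1657 × 1.00 + 608 × 0.64) / 2265 ≈ 0.90), although the PR does not state how that row is computed.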

apriltuesday · 2023-04-18T10:45:38Z

@tcezard @M-casado I've removed the "work in progress" sample output to avoid confusion; the top-level comment has the most up-to-date info. I realise it's a big PR, so feel free to focus your review on the most relevant bits.

tcezard

Looks good. A tiny comment.

bin/evaluation/map_xrefs.py

bin/trait_mapping/create_efo_table.py

apriltuesday · 2023-04-26T08:13:56Z

Merging this to fix the memory issue in ClinVarDataset for evidence string generation. As always, comments are still welcome and will be addressed in a subsequent PR refining the metrics.

apriltuesday marked this pull request as ready for review April 17, 2023 12:10

apriltuesday self-assigned this Apr 18, 2023

apriltuesday requested review from tcezard and M-casado April 18, 2023 10:41

tcezard approved these changes Apr 25, 2023

tcezard approved these changes Apr 25, 2023

apriltuesday added 19 commits April 25, 2023 13:44

add report of counts to annotated xml

e176cd5

initialise counts in iter method

d2f93f6

add counts for genes, consequences, and EFO ids already in ClinVar

9c27bad

add retries to oxo xref fetching

606e2ec

sort hgvs in repeat expansion pipeline

cf4cbdd

add scoring to annotated clinvar

216d5f8

compute f1 only when both annotations present

e080e96

refactor metrics

6cd14a5

reformat Orphanet identifiers for easier comparison

f9ca150

fix errors in counting

e8c9b05

update expected output

aa49c1c

WIP - mapping genes and xrefs

472a2de

fix xref mapping logic, add percentages

8e5bad9

fix test

382f553

cleaning up

fc87856

clarify some comments

b0b3ee3

fix memory issue

e9fecb2

make uri_to_curie slightly more robust

b34f633

address review comments

ed4ddc3

apriltuesday force-pushed the results-for-paper branch from 9b4137b to ed4ddc3 Compare April 25, 2023 12:44

bump pipeline version

18673c7

apriltuesday merged commit b1778c9 into EBIvariation:master Apr 26, 2023
