Scale up evidence string generation #416

Closed · apriltuesday opened this issue Feb 2, 2024 · 1 comment · Fixed by #421
Comments

apriltuesday (Contributor) commented Feb 2, 2024

Evidence string generation time (i.e. excluding the consequence prediction steps) for the entire ClinVar release is now in excess of 24 hours and can only be expected to increase. Furthermore, when validation errors are found, the process crashes (see here) and needs to be restarted from scratch.

We should try to parallelise or otherwise optimise the evidence generation step. For example, splitting the file by RCV should allow us to leverage Nextflow parallelism and resumability. We could also consider handling validation errors as is currently done in the PGKB pipeline (see here), which would let us detect all validation errors in a single run while still failing the process at the end.
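The collect-then-crash pattern described above could be sketched roughly as follows. This is a generic illustration, not the actual PGKB pipeline code: `validate_all` and `crash_if_failures` are hypothetical helpers, and `validate_fn` stands in for whatever per-record validation call the pipeline uses.

```python
# Sketch: collect every validation error in one pass instead of crashing
# on the first one, then still fail the process at the end.
# `validate_fn` is a stand-in for the real per-record validation call.

def validate_all(records, validate_fn):
    """Run validate_fn on every record; return (index, error message)
    pairs instead of raising on the first failure."""
    failures = []
    for i, record in enumerate(records):
        try:
            validate_fn(record)
        except ValueError as err:
            failures.append((i, str(err)))
    return failures


def crash_if_failures(failures):
    """After all errors have been reported, fail the process as a whole."""
    if failures:
        raise RuntimeError(f'{len(failures)} records failed validation')
```

This way a single run surfaces every bad record, rather than one crash per run.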

We should also look for ways to be more proactive about detecting changes in the data that may require schema or code changes.

apriltuesday (Contributor, Author) commented
Ran evidence generation on a small set under cProfile; the time-consuming part of the iteration appears to be simply validating each evidence string. This makes sense, as there are no external queries in this part of the pipeline.

   93779765 function calls (85532202 primitive calls) in 53.396 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   53.414   53.414 {built-in method builtins.exec}
        1    0.001    0.001   53.414   53.414 <string>:1(<module>)
        1    0.000    0.000   53.413   53.413 clinvar_to_evidence_strings.py:125(launch_pipeline)
        1    0.027    0.027   53.400   53.400 clinvar_to_evidence_strings.py:139(clinvar_to_evidence_strings)
     1131    0.007    0.000   50.904    0.045 clinvar_to_evidence_strings.py:111(validate_evidence_string)
     1131    0.009    0.000   50.895    0.045 validators.py:871(validate)
3360626/2262    9.173    0.000   50.420    0.022 validators.py:296(iter_errors)
3319938/96243    2.262    0.000   49.885    0.001 validators.py:343(descend)
714792/58812    3.118    0.000   49.205    0.001 _validators.py:276(properties)
934931/170375    2.104    0.000   47.595    0.000 _validators.py:252(ref)
     1131    0.003    0.000   45.929    0.041 validators.py:291(check_schema)

<snip>
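One thing the profile suggests (note `validators.py:291(check_schema)` accounting for ~46 of the 53 seconds, called once per evidence string): `jsonschema.validate()` re-validates the schema itself on every call. A common workaround, sketched below under the assumption the pipeline uses the `jsonschema` library, is to check the schema once and reuse a single validator instance. `make_validator` is a hypothetical helper, not existing pipeline code.

```python
# Sketch: check the schema once, then reuse one validator instance for
# every evidence string, instead of calling jsonschema.validate() per
# record (which re-runs check_schema each time).
from jsonschema import Draft4Validator


def make_validator(schema):
    """Validate the schema a single time and return a reusable validator."""
    Draft4Validator.check_schema(schema)  # done once, not per record
    return Draft4Validator(schema)
```

Then the loop becomes `validator.validate(evidence)` (or `validator.iter_errors(evidence)` to collect errors without raising), with no per-record schema check.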

This means we should be able to run through the data once to count the records, and then provide ranges (e.g. 10 evenly sized chunks) to run evidence generation in parallel (@tcezard's suggestion). As long as we don't actually generate evidence strings, iterating through the data each time will add some overhead but should still be fast enough. (And presumably that overhead would be less than actually splitting the file.)
