Scale up evidence string generation #416
Ran the evidence generation on a small set using
This means we should be able to run through the data once to count how many records there are, and then provide ranges in e.g. 10 evenly sized chunks to run the evidence generation in parallel (@tcezard's suggestion). As long as we don't actually generate evidence strings, iterating through the data each time will have some overhead but should still be fast enough. (And presumably the overhead would be less than actually splitting the file.)
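For illustration, a minimal sketch of that two-pass chunking idea. `generate_evidence_string` is a hypothetical per-record function, not the pipeline's actual API:

```python
import itertools

def count_records(records):
    """First pass: iterate once just to count records, generating nothing."""
    return sum(1 for _ in records)

def chunk_ranges(total, n_chunks):
    """Split [0, total) into n_chunks evenly sized (start, end) ranges."""
    base, extra = divmod(total, n_chunks)
    ranges, start = [], 0
    for i in range(n_chunks):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

def process_chunk(records, start, end):
    """Second pass: skip ahead to the chunk and generate evidence only within it."""
    for record in itertools.islice(records, start, end):
        generate_evidence_string(record)  # hypothetical per-record generator
```

Each of the e.g. 10 ranges could then be handed to an independent process, which re-iterates the file but only generates evidence within its own range.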
Evidence string generation time (i.e. without the consequence prediction steps) for the entire ClinVar release is now in excess of 24 hours and can only be expected to increase. Furthermore, when validation errors are found, the process crashes (see here) and needs to be restarted from scratch.
We should try to parallelise or otherwise optimise the evidence generation step. For example, splitting the file by RCV should allow us to leverage Nextflow parallelism and resumability. We could also consider handling validation errors as is currently done in the PGKB pipeline (see here), which would let us detect all validation errors in a single run while still failing the process at the end.
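A sketch of that deferred-failure pattern, assuming evidence strings are validated with jsonschema; `generate_evidence_string` is again hypothetical:

```python
from jsonschema import ValidationError, validate

def generate_all(records, schema):
    """Generate and validate every evidence string, deferring failure to the end."""
    errors = []
    for record in records:
        try:
            evidence = generate_evidence_string(record)  # hypothetical generator
            validate(instance=evidence, schema=schema)   # jsonschema validation
        except ValidationError as err:
            # Collect the failure and keep going, so one run surfaces every error
            errors.append((getattr(record, 'accession', '<unknown>'), err.message))
    if errors:
        for accession, message in errors:
            print(f'Validation failed for {accession}: {message}')
        raise RuntimeError(f'{len(errors)} evidence strings failed validation')
```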
We should also look for ways to be more proactive about detecting changes in the data that may require schema or code changes.
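One possible approach, sketched here with a toy schema (the real ClinVar records are far richer): validate incoming records against a strict schema with `additionalProperties` disabled, so newly introduced fields surface as errors instead of being silently ignored.

```python
from jsonschema import Draft7Validator

# Toy schema for illustration only. The key line is
# "additionalProperties": False, which makes any field we don't yet
# model show up as a validation error.
KNOWN_RECORD_SCHEMA = {
    'type': 'object',
    'properties': {
        'accession': {'type': 'string'},
        'clinical_significance': {'type': 'string'},
    },
    'additionalProperties': False,
}

def unexpected_field_messages(record):
    """Return messages for any deviation of a record from the known schema."""
    validator = Draft7Validator(KNOWN_RECORD_SCHEMA)
    return [error.message for error in validator.iter_errors(record)]
```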