Document and test remote inputs #668
Conversation
A previous commit modified what data we upload and removed 3 diagnostic files (the output from `rule diagnostic`). This commit represents the other side of that decision: it removes the ability for builds to define those files as remote inputs (the "to-exclude" key of an input). We also fix a potential bug in `_collect_exclusion_files()`: when starting from a masked input we cannot generate the exclusion file, as that requires access to the alignment. Finally, the `compress_exclusion_file` rule is removed, as it was only used for file upload. Note that this commit does not affect the uploaded `mutation-summary.tsv.xz` file.
We now exclusively download (xz) compressed FASTA files
We previously allowed inputs (origins) to start from an "aligned" or "masked" filepath _only_ if that file was remote (s3://...). This commit allows these filepaths to be local as well, which is what we already allowed for the "filtered" entrypoint.
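As a rough illustration of what this enables, here is a minimal sketch (the function name and the download-target naming are hypothetical, not the workflow's actual implementation) of how an input's starting point can now be resolved whether it is local or remote:

```python
def resolve_origin_starting_point(config: dict, origin: str, stage: str):
    """Return the filepath to use for a given origin's starting point.

    `stage` is one of the entry points an input may define, e.g. "aligned",
    "masked" or "filtered". Remote (s3://) addresses are handled by a download
    rule; local filepaths are now usable for all of these stages.
    """
    path = config["inputs"][origin].get(stage)
    if path is None:
        return None
    if path.startswith("s3://"):
        # Remote inputs are fetched (and decompressed, e.g. from .xz) by a
        # download rule before use -- the target name below is hypothetical.
        return f"data/downloaded_{origin}_{stage}.fasta"
    # Local filepaths (compressed or not) are used as-is, matching what was
    # already allowed for the "filtered" entry point.
    return path
```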
```python
# If the input starting point is "masked" then we also ignore the second file, as the alignment is not available
if config["filter"].get(wildcards["origin"], {}).get("skip_diagnostics", False):
    return [ config["files"]["exclude"] ]
if "masked" in config["inputs"][wildcards["origin"]]:
```
Revisit this once #667 is merged, as that changes rule diagnostic
to no longer depend on the alignment
I really like how this showcases all the data that's made available! I am a bit out-of-sync with the process as a whole (a lot to keep up with!) but I read through and left a few small comments :D
Thank you for putting together these docs and tests, @jameshadfield! 🌈
I made many pedantic suggestions below. The most substantial comment is about using `--directory` with Snakemake to improve how we run functional tests (locally or via CI).
tests/different-inputs.t (outdated)
Before running each test we must clean out these directories, otherwise snakemake may use intermediate files from previous runs, thus producing inconsistent test results. This has the unfortunate side-effect of removing intermediate files from any runs which we may wish to keep.
This approach seems fine for Travis CI runs, but I agree it's frustrating for local tests. There is a Snakemake command line argument `--directory` that allows you to set the working directory for the workflow. We could set up the `tests` directory to mirror the top-level directory (just the files it needs, like `defaults/` and `data/`) and then run the workflow from the top-level directory with `snakemake --directory tests/ ...`. Would you want to try this out, @jameshadfield?
I knew snakemake would have some magic for this. I'll check it out. I'm hesitant to copy & paste defaults etc, as that will lead to divergence and misleading test results, but we can probably make the cram set-up copy over the directory etc.
Completely rewrote the tests to use this approach and all is working 💯 locally. I'm currently invoking cram with `--preserve-env` but there may be alternative / better approaches. Now running `cram --preserve-env tests/different-inputs.t` only modifies the (gitignored) `./tests/output` directory, so it can be run independently of "normal" workflows.
The tests are running locally, but are failing on GitHub CI when running `sanitize_metadata.py` on the (n=1) `references_metadata.tsv`. @huddlej have you seen this before? Relevant logs:
python3 scripts/sanitize_metadata.py --metadata data/references_metadata.tsv ...
Traceback (most recent call last):
...
_csv.Error: Could not determine delimiter
Update: this seems to be stochastic, a subsequent run passed this stage...
There's also some AWS errors related to `aws s3 cp ...` which I don't yet understand. Update: fixed the `aws s3 cp` bug - you must set $AWS_DEFAULT_REGION when running on GitHub CI.
I'm going to take out the GitHub actions running of the tests and create a separate PR. They are failing in strange ways which are hard to debug.
metadata: tests/local-inputs-compressed/data/asia_metadata.tsv.xz
sequences: tests/local-inputs-compressed/data/asia_sequences.fasta.xz
These files are excellent! How did you generate them? Do you have a script to automate the process? I'm just thinking about how out of date the current "example data" are and how they are missing metadata fields we may soon add (`aa_substitutions`) and have fields we no longer need (like `*_exposure`). It would be nice if we could quickly regenerate these test files from more recent data as we need to...
Yeah - that would be really useful. I just manually split the example metadata up into 5 files & used `augur filter` to generate corresponding sequence files. The benefit of having region-specific files was that I could eyeball the build and see whether genomes from each input made it through.
It should be easy to make a small script to generate these, but probably best for a subsequent PR.
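For the record, one possible shape for such a script, sketched under assumptions: the input paths, the "region" metadata column, and the output naming are illustrative only, and the exact `augur filter` flags may differ between augur versions. This is not a script that exists in this PR.

```python
"""Regenerate region-specific test metadata and sequences from the example data."""
import subprocess
from pathlib import Path

import pandas as pd

EXAMPLE_METADATA = Path("data/example_metadata.tsv")      # assumed location
EXAMPLE_SEQUENCES = Path("data/example_sequences.fasta")  # assumed location
OUTPUT_DIR = Path("tests/data")

metadata = pd.read_csv(EXAMPLE_METADATA, sep="\t")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

for region, subset in metadata.groupby("region"):
    slug = region.lower().replace(" ", "-")
    region_metadata = OUTPUT_DIR / f"{slug}_metadata.tsv"
    region_sequences = OUTPUT_DIR / f"{slug}_sequences.fasta"
    subset.to_csv(region_metadata, sep="\t", index=False)
    # Use augur filter to pull out the sequences matching this region's metadata.
    subprocess.run([
        "augur", "filter",
        "--metadata", str(region_metadata),
        "--sequences", str(EXAMPLE_SEQUENCES),
        "--output", str(region_sequences),
    ], check=True)
```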
Force-pushed from ee509ce to d8e8035:
These tests cover the different starting points for sequence data in each input (sequences vs aligned vs masked vs filtered). They also cover uncompressed and (xz) compressed starting points, as well as local and remote (s3) addresses. (Note: `.gz` compression is not yet covered by these tests.) Running of the tests is best described in the `tests/different-inputs.t` file itself.
Force-pushed from 106b77b to 566486e:
With the upcoming availability of GenBank intermediate files, this provides documentation of which assets are uploaded, to where, and how to start builds from these files.
Force-pushed from 566486e to 20a083b:
This PR represents two improvements motivated by work on the `nextstrain-open` profile. It adds three test builds which cover various input stages (sequences, aligned, masked, filtered), uncompressed vs xz compression, and local vs remote filepaths. These tests highlighted some shortcomings in our rules which were tidied up.
There's a cram file, but it's not working currently (at least it isn't for me), and it's not clear how to run tests locally without first wiping out `./results`, which is frustrating if you have intermediate files you wish to keep. Documentation is also added to explain which intermediate files are uploaded and how to use these as inputs to your build. Proof-reading / comments welcome -- treat this as a draft and one part of the upcoming docs around GenBank builds.