Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document and test remote inputs #668

Merged
merged 5 commits into from
Jun 23, 2021
Merged

Document and test remote inputs #668

merged 5 commits into from
Jun 23, 2021

Conversation

jameshadfield
Copy link
Member

This PR represents two improvements motivated by work on nextstrain-open profile.

It adds three test builds which cover various input stages (sequences, aligned, masked, filtered), uncompressed vs xz compression, and local vs remote filepaths. These tests highlighted some shortcomings in out rules which were tidied up.
There's a cram file, but it's not working currently (at least it isn't for me), and it's not clear how to run tests locally without first wiping out ./results which is frustrating if you have intermediate files you wish to keep.

Documentation is also added to explain which intermediate files are uploaded and how to use these as inputs to your build. Proof-reading / comments welcome -- treat this as a draft and one part of the upcoming docs around GenBank builds.

A previous commit modified what data we upload and removed 3 diagnostic
files (the output from `rule diagnostic`). This commit represents the
other side of this decision: it removes the ability for builds to
define those files as remote inputs (the "to-exclude" key of an input).

We also fix a potential bug in `_collect_exclusion_files()` as when
starting from a masked input we do not have the ability to generate
the exclusion file (as that needs access to the alignment).

Finally the `compress_exclusion_file` rule is removed as that was
only used for file upload.

Note that this commit does not affect the uploaded
`mutation-summary.tsv.xz` file.
We now exclusively download (xz) compressed FASTA files
We previously allowed inputs (origins) to start from a "aligned" or
"masked" filepath _only_ if that file was remote (s3://...). This
commit allows these filepaths to be local as well, which is what
we already allowed for the "filtered" entrypoint.
# If the input starting point is "masked" then we also ignore the second file, as the alignment is not available
if config["filter"].get(wildcards["origin"], {}).get("skip_diagnostics", False):
return [ config["files"]["exclude"] ]
if "masked" in config["inputs"][wildcards["origin"]]:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revisit this once #667 is merged, as that changes rule diagnostic to no longer depend on the alignment

docs/remote_inputs.md Outdated Show resolved Hide resolved
docs/remote_inputs.md Outdated Show resolved Hide resolved
@emmahodcroft
Copy link
Member

I really like how this showcases all the data that's made available! I am a bit out-of-sync with the process as a whole (a lot of keep up with!) but I read through and left a few small comments :D

Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for putting together these docs and tests, @jameshadfield! 🌈

I made many pedantic suggestions below. The most substantial comment is about using --directory with Snakemake to improve how we run functional tests (locally or via CI).

docs/remote_inputs.md Outdated Show resolved Hide resolved
docs/remote_inputs.md Outdated Show resolved Hide resolved
docs/remote_inputs.md Outdated Show resolved Hide resolved
docs/remote_inputs.md Outdated Show resolved Hide resolved
docs/remote_inputs.md Outdated Show resolved Hide resolved
tests/different-inputs.t Outdated Show resolved Hide resolved
Comment on lines 9 to 12
Before running each test we must clean out these directories, otherwise
snakemake may use intermediate files from previous runs, thus producing
inconsistent test results. This has the unfortunate side-effect of removing
intermediate files from any runs which we may wish to keep.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach seems fine for Travis CI runs, but I agree it's frustrating for local tests. There is a Snakemake command line argument --directory that allows you to set the working directory for the workflow. We could setup the tests directory to mirror the top-level directory (just the files it needs like defaults/ and data/) and then run the workflow from the top-level directory with snakemake --directory tests/ .... Would you want to try this out, @jameshadfield?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I knew snakemake would have some magic for this. I'll check it out. I'm hesitant to copy & paste defaults etc, as that will lead to divergence and misleading test results, but we can probably make the cram set-up copy over the directory etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely rewrote the tests to use this approach and all is working 💯 locally. I'm currently invoking cram with --preserve-env but there may be alternative / better approaches.

Now running cram --preserve-env tests/different-inputs.t only modifies the (gitignored) ./tests/output directory so it can be run independently of "normal" workflows.

Copy link
Member Author

@jameshadfield jameshadfield Jun 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests are running locally, but are failing on GitHub CI when running sanitize_metadata.py on the (n=1) references_metadata.tsv. @huddlej have you seen this before? Relevant logs:

python3 scripts/sanitize_metadata.py  --metadata data/references_metadata.tsv   ...
Traceback (most recent call last):
...
_csv.Error: Could not determine delimiter

Update: this seems to be stochastic, a subsequent run passed this stage...

There's also some AWS errors related to aws s3 cp ... which I don't yet understand: Fixed the aws s3 cp bug - you must set $AWS_DEFAULT_REGION when running on GitHub CI

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to take out the GitHub actions running of the tests and create a separate PR. They are failing in strange ways which are hard to debug.

tests/different-inputs.t Outdated Show resolved Hide resolved
tests/local-inputs-compressed/builds.yaml Outdated Show resolved Hide resolved
Comment on lines 3 to 4
metadata: tests/local-inputs-compressed/data/asia_metadata.tsv.xz
sequences: tests/local-inputs-compressed/data/asia_sequences.fasta.xz
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These files are excellent! How did you generate them? Do you have a script to automate the process? I'm just thinking about how out of date the current "example data" are and how they are missing metadata fields we may soon add (aa_substitutions) and have fields we no longer need (like *_exposure). It would be nice if we could quickly regenerate these test files from more recent data as we need to...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - that would be really useful. I just manually split the example metadata up into 5 files & used augur filter to generate corresponding sequence files. The benefit of having region-specific files was that I could eyeball the build and see whether genomes from each input made it through.

It should be easy to make a small script to generate these, but probably best for a subsequent PR.

@jameshadfield jameshadfield force-pushed the remote-inputs branch 2 times, most recently from ee509ce to d8e8035 Compare June 23, 2021 03:10
These tests cover the different starting points for sequence data
in each input (sequences vs aligned vs masked vs filtered). They also
cover uncompressed and (xz) compressed starting points, as well as
local and remote (s3) addresses. (Note: `.gz` compression is not yet
covered by these tests.)

Running of the tests is best described in the `tests/different-inputs.t`
file itself.
@jameshadfield jameshadfield force-pushed the remote-inputs branch 14 times, most recently from 106b77b to 566486e Compare June 23, 2021 09:56
With the upcoming availability of GenBank intermediate files,
this provides documentation of which assets are uploaded,
to where, and how to start builds from these files.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants