Document and test remote inputs #668

jameshadfield · 2021-06-22T07:19:35Z

This PR represents two improvements motivated by work on nextstrain-open profile.

It adds three test builds which cover various input stages (sequences, aligned, masked, filtered), uncompressed vs xz compression, and local vs remote filepaths. These tests highlighted some shortcomings in out rules which were tidied up.
There's a cram file, but it's not working currently (at least it isn't for me), and it's not clear how to run tests locally without first wiping out ./results which is frustrating if you have intermediate files you wish to keep.

Documentation is also added to explain which intermediate files are uploaded and how to use these as inputs to your build. Proof-reading / comments welcome -- treat this as a draft and one part of the upcoming docs around GenBank builds.

A previous commit modified what data we upload and removed 3 diagnostic files (the output from `rule diagnostic`). This commit represents the other side of this decision: it removes the ability for builds to define those files as remote inputs (the "to-exclude" key of an input). We also fix a potential bug in `_collect_exclusion_files()` as when starting from a masked input we do not have the ability to generate the exclusion file (as that needs access to the alignment). Finally the `compress_exclusion_file` rule is removed as that was only used for file upload. Note that this commit does not affect the uploaded `mutation-summary.tsv.xz` file.

We now exclusively download (xz) compressed FASTA files

We previously allowed inputs (origins) to start from a "aligned" or "masked" filepath _only_ if that file was remote (s3://...). This commit allows these filepaths to be local as well, which is what we already allowed for the "filtered" entrypoint.

jameshadfield · 2021-06-22T07:28:52Z

workflow/snakemake_rules/main_workflow.smk

+    # If the input starting point is "masked" then we also ignore the second file, as the alignment is not available
+    if config["filter"].get(wildcards["origin"], {}).get("skip_diagnostics", False):
+        return [ config["files"]["exclude"] ]
+    if "masked" in config["inputs"][wildcards["origin"]]:


Revisit this once #667 is merged, as that changes rule diagnostic to no longer depend on the alignment

docs/remote_inputs.md

emmahodcroft · 2021-06-22T08:09:23Z

I really like how this showcases all the data that's made available! I am a bit out-of-sync with the process as a whole (a lot of keep up with!) but I read through and left a few small comments :D

huddlej

Thank you for putting together these docs and tests, @jameshadfield! 🌈

I made many pedantic suggestions below. The most substantial comment is about using --directory with Snakemake to improve how we run functional tests (locally or via CI).

docs/remote_inputs.md

tests/different-inputs.t

huddlej · 2021-06-22T19:08:59Z

tests/different-inputs.t

+Before running each test we must clean out these directories, otherwise
+snakemake may use intermediate files from previous runs, thus producing
+inconsistent test results. This has the unfortunate side-effect of removing
+intermediate files from any runs which we may wish to keep.


This approach seems fine for Travis CI runs, but I agree it's frustrating for local tests. There is a Snakemake command line argument --directory that allows you to set the working directory for the workflow. We could setup the tests directory to mirror the top-level directory (just the files it needs like defaults/ and data/) and then run the workflow from the top-level directory with snakemake --directory tests/ .... Would you want to try this out, @jameshadfield?

I knew snakemake would have some magic for this. I'll check it out. I'm hesitant to copy & paste defaults etc, as that will lead to divergence and misleading test results, but we can probably make the cram set-up copy over the directory etc.

Completely rewrote the tests to use this approach and all is working 💯 locally. I'm currently invoking cram with --preserve-env but there may be alternative / better approaches.

Now running cram --preserve-env tests/different-inputs.t only modifies the (gitignored) ./tests/output directory so it can be run independently of "normal" workflows.

The tests are running locally, but are failing on GitHub CI when running sanitize_metadata.py on the (n=1) references_metadata.tsv. @huddlej have you seen this before? Relevant logs:

python3 scripts/sanitize_metadata.py --metadata data/references_metadata.tsv ... Traceback (most recent call last): ... _csv.Error: Could not determine delimiter

Update: this seems to be stochastic, a subsequent run passed this stage...

~~There's also some AWS errors related to aws s3 cp ... which I don't yet understand:~~ Fixed the aws s3 cp bug - you must set $AWS_DEFAULT_REGION when running on GitHub CI

I'm going to take out the GitHub actions running of the tests and create a separate PR. They are failing in strange ways which are hard to debug.

tests/different-inputs.t

tests/local-inputs-compressed/builds.yaml

huddlej · 2021-06-22T19:12:22Z

tests/local-inputs-compressed/builds.yaml

+    metadata: tests/local-inputs-compressed/data/asia_metadata.tsv.xz
+    sequences: tests/local-inputs-compressed/data/asia_sequences.fasta.xz


These files are excellent! How did you generate them? Do you have a script to automate the process? I'm just thinking about how out of date the current "example data" are and how they are missing metadata fields we may soon add (aa_substitutions) and have fields we no longer need (like *_exposure). It would be nice if we could quickly regenerate these test files from more recent data as we need to...

Yeah - that would be really useful. I just manually split the example metadata up into 5 files & used augur filter to generate corresponding sequence files. The benefit of having region-specific files was that I could eyeball the build and see whether genomes from each input made it through.

It should be easy to make a small script to generate these, but probably best for a subsequent PR.

These tests cover the different starting points for sequence data in each input (sequences vs aligned vs masked vs filtered). They also cover uncompressed and (xz) compressed starting points, as well as local and remote (s3) addresses. (Note: `.gz` compression is not yet covered by these tests.) Running of the tests is best described in the `tests/different-inputs.t` file itself.

With the upcoming availability of GenBank intermediate files, this provides documentation of which assets are uploaded, to where, and how to start builds from these files.

jameshadfield added 3 commits June 22, 2021 17:50

Remove unused deflate param

9a1e592

We now exclusively download (xz) compressed FASTA files

jameshadfield commented Jun 22, 2021

View reviewed changes

emmahodcroft reviewed Jun 22, 2021

View reviewed changes

docs/remote_inputs.md Outdated Show resolved Hide resolved

emmahodcroft reviewed Jun 22, 2021

View reviewed changes

docs/remote_inputs.md Outdated Show resolved Hide resolved

huddlej reviewed Jun 22, 2021

View reviewed changes

jameshadfield force-pushed the remote-inputs branch 2 times, most recently from ee509ce to d8e8035 Compare June 23, 2021 03:10

jameshadfield force-pushed the remote-inputs branch 14 times, most recently from 106b77b to 566486e Compare June 23, 2021 09:56

Document intermediate GenBank files

20a083b

With the upcoming availability of GenBank intermediate files, this provides documentation of which assets are uploaded, to where, and how to start builds from these files.

jameshadfield force-pushed the remote-inputs branch from 566486e to 20a083b Compare June 23, 2021 22:43

jameshadfield merged commit db1d42e into open Jun 23, 2021

jameshadfield deleted the remote-inputs branch June 23, 2021 22:45

jameshadfield mentioned this pull request Jun 23, 2021

Add GitHub action to run tests #671

Closed

jameshadfield mentioned this pull request Jun 12, 2024

Revisit updating AWS CLI to v2 nextstrain/docker-base#216

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Document and test remote inputs #668

Document and test remote inputs #668

jameshadfield commented Jun 22, 2021

jameshadfield Jun 22, 2021

emmahodcroft commented Jun 22, 2021

huddlej left a comment

huddlej Jun 22, 2021

jameshadfield Jun 23, 2021

jameshadfield Jun 23, 2021

jameshadfield Jun 23, 2021 •

edited

Loading

jameshadfield Jun 23, 2021

huddlej Jun 22, 2021

jameshadfield Jun 23, 2021

		metadata: tests/local-inputs-compressed/data/asia_metadata.tsv.xz
		sequences: tests/local-inputs-compressed/data/asia_sequences.fasta.xz

Document and test remote inputs #668

Document and test remote inputs #668

Conversation

jameshadfield commented Jun 22, 2021

jameshadfield Jun 22, 2021

Choose a reason for hiding this comment

emmahodcroft commented Jun 22, 2021

huddlej left a comment

Choose a reason for hiding this comment

huddlej Jun 22, 2021

Choose a reason for hiding this comment

jameshadfield Jun 23, 2021

Choose a reason for hiding this comment

jameshadfield Jun 23, 2021

Choose a reason for hiding this comment

jameshadfield Jun 23, 2021 • edited Loading

Choose a reason for hiding this comment

jameshadfield Jun 23, 2021

Choose a reason for hiding this comment

huddlej Jun 22, 2021

Choose a reason for hiding this comment

jameshadfield Jun 23, 2021

Choose a reason for hiding this comment

jameshadfield Jun 23, 2021 •

edited

Loading