Support default GISAID metadata and sequences #640

huddlej · 2021-05-19T23:49:25Z

Description of proposed changes

We would like to support GISAID's standard metadata and sequence data formats from the "Download packages" interface of EpiCoV. This PR expands the existing "sanitize" scripts for metadata and sequences with the following major changes to support these data:

Remove whitespace from strain names in sequence and metadata (whitespace is not allowed in record ids for FASTA deflines)
Strip metadata from FASTA deflines (these metadata entries should be present already in the corresponding separate metadata file)
Rename GISAID metadata columns to Augur-style names ("Virus name" -> "strain" and "Collection date" -> "date")
Parse region, country, division, and location values from the single Location field in the GISAID metadata
Resolve duplicate metadata records by preferring those with the latest GISAID accession number (with an option to produce an error with a list of all duplicates instead)
Resolve duplicate sequence records by preferring the first sequence encountered (necessary for downstream components of the workflow when the combine-and-dedup step doesn't get run and duplicates exist in the sequence data)
Ingest sequences (.fasta) and metadata (.tsv) files directly from GISAID tarballs (allows users to download data and run the workflow directly on those files without manually extracting/decompressing sequences and metadata)

Currently, these changes are applied to all inputs including data that are not in the GISAID default format.
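
As a rough illustration of the metadata steps listed above (renaming columns, splitting the Location field, and removing whitespace from strain names), here is a minimal pure-Python sketch. It is not the code from this PR: the function name `sanitize_record`, the " / " separated layout of the Location value, and the use of "?" for missing geographic levels are all assumptions made for the example. Only the column names come from the description.

```python
# Column renames named in the PR description.
RENAME = {"Virus name": "strain", "Collection date": "date"}
LOCATION_FIELDS = ("region", "country", "division", "location")


def sanitize_record(record, rename=RENAME, location_field="Location", separator="/"):
    """Return a copy of one metadata row with Augur-style columns (hypothetical helper)."""
    sanitized = {rename.get(key, key): value for key, value in record.items()}

    # Split e.g. "North America / USA / Washington" into separate geographic columns.
    raw_location = sanitized.pop(location_field, "")
    parts = [part.strip() for part in raw_location.split(separator)]
    for index, field in enumerate(LOCATION_FIELDS):
        # Assumption: "?" marks a geographic level that the Location value lacks.
        sanitized[field] = parts[index] if index < len(parts) and parts[index] else "?"

    # Whitespace is not allowed in FASTA record ids, so drop it from strain names.
    sanitized["strain"] = "".join(sanitized.get("strain", "").split())
    return sanitized
```

Renaming happens before the other steps so that later steps can rely on the `strain` column existing under its Augur-style name.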

Testing

Manually tested sanitizer scripts locally with the full GISAID downloads
Ran a small build with 200 random strains from the full downloads
Tested with a full Nextstrain build on a SLURM cluster
Tested with AWS Batch trial build

Release checklist

If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:

Determine the version number for the new release by incrementing the most recent release -> v7
Update docs/change_log.md in this pull request to document these changes and the new version number.
After merging, create a new GitHub release with the new version number as the tag and release title.

emmahodcroft · 2021-05-20T08:09:56Z

Apologies if this is a naive question, is this also stripping out hCoV/ or whatever the header is - so that the root sequence & exclude files will work properly?
Apologies if this is in there and I just missed it while skimming changes 🙃

huddlej · 2021-05-20T15:07:56Z

@emmahodcroft Yeah, we actually do that as part of the current sanitize scripts based on these config parameters. We had to make that change to get the "Augur input" format downloads to work with our standard include/exclude files and then this PR builds on those existing scripts.

emmahodcroft · 2021-05-20T15:31:41Z

Thanks John! I had thought I'd seen this somewhere before, but I just couldn't remember exactly where. Apologies for the repeat!

huddlej · 2021-05-20T22:31:43Z

This PR implements duplicate resolution for metadata using a procedure described as a possible improvement for Augur's own metadata reader. If this approach to resolving duplicates seems reasonable, we can port this function into Augur.
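
For readers who want a concrete picture of "prefer the record with the latest accession", the sketch below shows one way to do it on plain dictionaries. It is a stand-in, not the PR's implementation: the function name, the accession column names `gisaid_epi_isl` and `genbank_accession`, and the `error_on_duplicates` flag are assumptions chosen for the example.

```python
from collections import defaultdict


def resolve_duplicates(records, strain_field="strain",
                       accession_fields=("gisaid_epi_isl", "genbank_accession"),
                       error_on_duplicates=False):
    """Keep one record per strain, preferring the highest accession (hypothetical helper)."""
    by_strain = defaultdict(list)
    for record in records:
        by_strain[record[strain_field]].append(record)

    duplicates = sorted(strain for strain, group in by_strain.items() if len(group) > 1)
    if error_on_duplicates and duplicates:
        raise ValueError("Duplicate strains found: " + ", ".join(duplicates))

    def accession_key(record):
        # Missing accessions sort first, so any record with an accession wins.
        return tuple(record.get(field) or "" for field in accession_fields)

    # sorted() is stable: when accessions tie (or are all missing),
    # the record that appeared last in the input is the one kept.
    return [sorted(group, key=accession_key)[-1] for group in by_strain.values()]
```

Note that accessions are compared as strings here, so "latest" only matches numeric order when accession numbers have the same number of digits.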

This PR does not implement duplicate resolution for sequences (at least, not yet!).

huddlej · 2021-05-24T22:22:12Z

After some further modifications to the workflow, we can now run builds like the following that use GISAID tarballs directly as metadata and sequences inputs:

# Define inputs with preferred sequences/metadata listed last.
inputs:
  - name: north-america
    metadata: data/ncov_north-america.tar.gz
    sequences: data/ncov_north-america.tar.gz
  - name: washington
    metadata: data/gisaid_auspice_input_hcov-19_2021_05_24_21.tar
    sequences: data/gisaid_auspice_input_hcov-19_2021_05_24_21.tar

# Define builds.
builds:
  washington:
    region: North America
    country: USA
    division: Washington
    subsampling_scheme: focal-contextual

# Define subsampling scheme.
subsampling:
  focal-contextual:
    focal:
      query: --query "division == '{division}'"
      max_sequences: 20
    contextual:
      query: --query "division != '{division}'"
      max_sequences: 20
      group_by: region year month
      priorities:
        type: proximity
        focus: focal

Internally, the sanitize_sequences.py script extracts the first .tsv or .fasta file from a given (compressed or uncompressed) tarball, decompresses that file as needed, and returns the corresponding buffer to be consumed downstream.
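
The extraction step described above can be sketched with the standard library's `tarfile` and `lzma` modules. This is an illustrative approximation rather than the script's actual code: the function name `extract_first_member` is hypothetical, and this version reads the whole member into memory for simplicity.

```python
import io
import lzma
import tarfile


def extract_first_member(tarball_path, extensions=(".tsv", ".fasta")):
    """Return a binary buffer for the first matching file in a tarball (hypothetical helper)."""
    # Mode "r:*" lets tarfile detect gzip, bzip2, or xz compression of the tarball itself.
    with tarfile.open(tarball_path, mode="r:*") as tar:
        for member in tar.getmembers():
            inner_xz = member.name.endswith(".xz")
            base_name = member.name[:-3] if inner_xz else member.name
            if member.isfile() and base_name.endswith(extensions):
                data = tar.extractfile(member).read()
                if inner_xz:
                    # Data inside an uncompressed .tar may itself be LZMA-compressed.
                    data = lzma.decompress(data)
                return io.BytesIO(data)

    raise FileNotFoundError(f"No {' or '.join(extensions)} file found in {tarball_path}")
```

The returned buffer holds bytes. A consumer that expects text could wrap it with `io.TextIOWrapper(buffer, encoding="utf-8")`.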

rneher

This looks good to me. Thanks John. I left one comment about the pandas.drop_duplicates

scripts/sanitize_metadata.py

Adds configuration parameters and new arguments/flags to the sanitize metadata and sequences scripts to convert default GISAID metadata and sequences into the format expected by our workflows.

The new sanitize operations for metadata include renaming specific fields, parsing the single location field into separate geographic scale fields (region, country, etc.), replacing whitespace in strain names, and resolving duplicate records. When resolving duplicate records, the script sorts records by strain name and all available database accession fields, groups by strain name, and takes the last record of each group. This approach allows us to handle cases like GenBank metadata, which include the GISAID accession column but may be missing data for specific records in that column. When accessions are missing in all columns, this approach falls back to picking the last record per strain.

The script's help and usage text list the metadata sanitization operations in the order they are applied. This order matters because some operations (e.g., renaming fields) change values that other operations could depend on (e.g., parsing the location field).

We sanitize default GISAID sequences by replacing whitespace in strain names, stripping out additional metadata that appears in the FASTA defline, and dropping duplicate sequences to avoid errors downstream in the workflow. This deduplication is especially important when the workflow runs with a single input and the "combine and dedup" step does not run on the inputs. This implementation copies the deduplication logic of that combine and dedup script into a new function that could eventually be ported into the Augur `io` module.

Finally, we add support for reading data from GISAID tarballs. GISAID provides tarballs (e.g., `.tar.gz` or `.tar.xz`) for packages available through their "Download" interface. These tarballs typically include a README with the GISAID terms and conditions and a metadata file (`.tsv`) or a sequence file (`.fasta`). This commit adds a utility function to look for one of these file types in a tarball and updates the metadata sanitizer to use this function when a tarball is provided as the metadata file.

GISAID provides tarballs in different formats, including gzip-compressed tarballs (`.tar.gz`) with uncompressed data inside (`.fasta`) and uncompressed tarballs (`.tar`) with LZMA-compressed data inside (`.fasta.xz`). To handle compressed data inside the tar file, we need to explicitly decompress those data with the Python LZMA library before trying to process them. Additionally, sequence data need to be decoded prior to consumption by BioPython, while metadata can be consumed by pandas without any prior decoding.
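
The sequence sanitization steps in this commit message (strip defline metadata, remove whitespace from strain names, keep the first sequence per strain) can be sketched as follows. The sketch uses a small hand-written FASTA reader instead of BioPython so that it has no dependencies, and it assumes that extra defline metadata follows a `|` delimiter. Both function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (defline, sequence) pairs from an iterable of FASTA text lines."""
    name, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)


def sanitize_sequences(lines, delimiter="|"):
    """Yield (strain, sequence) pairs with clean names, keeping the first copy of each strain."""
    seen = set()
    for defline, sequence in read_fasta(lines):
        # Keep only the strain name and drop any whitespace inside it.
        strain = "".join(defline.split(delimiter)[0].split())
        if strain in seen:
            continue  # Assumption: later duplicates are dropped without comparing sequences.
        seen.add(strain)
        yield strain, sequence
```

Because names are cleaned before the duplicate check, two deflines that differ only in whitespace or trailing metadata count as the same strain.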

Adds checks to the adjust metadata script for exposure columns before attempting to use those columns. These columns exist in the Nextstrain metadata but not in GISAID metadata.
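
A guard of this kind might look like the sketch below. The exact adjustment the script makes is not shown in this thread, so the "fill empty exposure fields from the sampling geography" step is purely a hypothetical placeholder. The helper names and the three `*_exposure` column names are also assumptions for the example. The point is only the check for column existence before use.

```python
# Assumed names for the exposure columns present in Nextstrain metadata.
EXPOSURE_FIELDS = ("region_exposure", "country_exposure", "division_exposure")


def available_exposure_fields(columns, exposure_fields=EXPOSURE_FIELDS):
    """Return only the exposure columns that this metadata file actually has."""
    return [field for field in exposure_fields if field in columns]


def fill_missing_exposure(record, columns):
    """Hypothetical adjustment that touches exposure fields only when they exist."""
    adjusted = dict(record)
    for exposure_field in available_exposure_fields(columns):
        sampling_field = exposure_field[: -len("_exposure")]
        if not adjusted.get(exposure_field):
            adjusted[exposure_field] = adjusted.get(sampling_field, "")
    return adjusted
```

With GISAID metadata, `available_exposure_fields` returns an empty list and the record passes through unchanged.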

Now that the sequence sanitizer script has to handle duplicates anyway, we no longer need to pipe its output to a separate script that does the same thing. This commit also pipes output from the sanitizer script to xz to speed up compression.

huddlej · 2021-05-27T22:30:23Z

Based on conversation in Slack, I opted to not standardize column names in the metadata in favor of the simpler approach of manually renaming all columns.

huddlej force-pushed the support-default-gisaid-metadata branch from ad43a19 to ee3c1f7 Compare May 20, 2021 22:16

huddlej force-pushed the support-default-gisaid-metadata branch 2 times, most recently from 5579b71 to 3d315e5 Compare May 24, 2021 21:00

huddlej marked this pull request as ready for review May 24, 2021 23:28

huddlej requested review from emmahodcroft, jameshadfield and rneher May 24, 2021 23:29

rneher approved these changes May 25, 2021

huddlej force-pushed the support-default-gisaid-metadata branch 3 times, most recently from e45903e to fba1fac Compare May 27, 2021 15:26

huddlej added 5 commits May 27, 2021 15:04

Compressed adjusted metadata like other metadata files in the workflow

b95a6a8

Do not attempt to adjust nonexistent exposure columns

8970046

Adds checks to the adjust metadata script for exposure columns before attempting to use those columns. These columns exist in the Nextstrain metadata but not in GISAID metadata.

Update change log for v7 release

e153db3

huddlej force-pushed the support-default-gisaid-metadata branch from fba1fac to e153db3 Compare May 27, 2021 22:07

huddlej merged commit c1a2de4 into master May 27, 2021

huddlej deleted the support-default-gisaid-metadata branch May 27, 2021 22:31

huddlej mentioned this pull request May 27, 2021

Handle multiple file inputs in the same precedence order. #639

Merged

ammaraziz mentioned this pull request Nov 12, 2021

Feature request - (de)parse subcommand for formatting fasta/metadata nextstrain/augur#783

Closed
