Add support for one or more sequence files as input to subcommands. As we plan to support multiple inputs to --metadata arguments in the future, we should support multiple inputs for the --sequences arguments that often accompany the metadata arguments.
Background
Historically, the parse command has been our entry point to Nextstrain builds because we did all the heavy lifting to merge sequences and metadata in our lab’s sequence database. A more common entry point for external users is a curated data set (e.g., metadata and sequences from GISAID) and one or more of their own sets of metadata and sequences.
Most Nextstrain workflows assume that all metadata and sequences have been curated before the workflow starts, so that there is only one metadata file and one sequences file.
Internally, if we need to merge two or more FASTA files of sequences, we tend to concatenate these files manually with the UNIX cat command. However, cat has a failure mode when a file is missing its trailing newline, as files written by some (typically non-Unix) editors and other programs are: the last line of that file is joined to the first header line of the next file. Supporting multiple inputs nicely side-steps this issue for sequences.
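The failure mode can be reproduced in a few lines. This is only an illustrative sketch: the file contents and the function names naive_concat and safe_concat are made up for the example and are not part of augur.

```python
def naive_concat(chunks):
    """Mimic `cat`: join file contents exactly as they are."""
    return "".join(chunks)


def safe_concat(chunks):
    """Join file contents, making sure each non-empty chunk ends with a newline."""
    fixed = []
    for chunk in chunks:
        if chunk and not chunk.endswith("\n"):
            chunk += "\n"
        fixed.append(chunk)
    return "".join(fixed)


# Hypothetical file contents; the first file lacks a trailing newline.
first = ">seq_a\nACGT"
second = ">seq_b\nTTGA\n"

# With naive concatenation, the header of seq_b is glued onto the
# sequence line of seq_a, so the merged file holds one corrupt record.
print(naive_concat([first, second]))

# With the newline restored, both records survive.
print(safe_concat([first, second]))
```

Reading each file separately, rather than concatenating first, avoids the problem without having to patch the input files.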
Support for multiple sequence inputs already exists in the augur align command, so this functionality has some precedent.
Possible solutions
Internally, we would need to support reading sequences from multiple files into the same standard data structure. We might implement this with a read_sequences function that behaves similarly to the load_alignments function.
To address the external interface on the command line, one solution would be to identify all augur commands that currently support unaligned sequences as input with the --sequences argument and add support for multiple arguments to the command line interface.
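On the command line side, accepting several files for one option is something argparse supports directly through nargs="+". The sketch below assumes a stand-alone parser; the program name, metavar, help text, and file names are placeholders, and this is not how any particular augur subcommand is currently wired up.

```python
import argparse


def build_parser():
    """Build a toy parser whose --sequences option accepts one or more files."""
    parser = argparse.ArgumentParser(prog="example-subcommand")
    # nargs="+" requires at least one value and collects all values into a list.
    parser.add_argument(
        "--sequences",
        nargs="+",
        required=True,
        metavar="FASTA",
        help="one or more FASTA files of sequences",
    )
    return parser


# Hypothetical invocation with two input files.
args = build_parser().parse_args(["--sequences", "ours.fasta", "gisaid.fasta"])
print(args.sequences)
```

A single file is still accepted, so existing invocations would keep working, but args.sequences is always a list and downstream code would need to handle it as one.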
Another solution would be to encourage users to merge their metadata and sequences as early as possible in their workflow, to avoid multiple sequence inputs and merges downstream. For example, we could add a merge subcommand that knows how to safely merge sequences and metadata into our standard format:
This command would be the entry point for most external users and produce the same standard outputs we expect from augur parse. If we use this approach, we should focus on a minimal set of functionality to merge data without trying to address all possible data sanitization issues that exist in the world.
I forgot that we already implemented a read_sequences function that takes one or more filenames and loads all distinct sequences into a list. Supporting multiple sequence file inputs would then be a matter of calling this function from the augur subcommands where we wish to support multiple inputs.
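For readers unfamiliar with that function, the behavior described here can be approximated with the standard library alone. This is a rough sketch and not augur's implementation: the names parse_fasta and read_sequences_sketch are hypothetical, and it assumes that "distinct" means distinct by record name, with the first occurrence kept.

```python
import os
import tempfile


def parse_fasta(path):
    """Yield (name, sequence) pairs from a single FASTA file."""
    name, parts = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(parts)
                # Assumption: the record name is the first token of the header.
                fields = line[1:].split()
                name, parts = (fields[0] if fields else ""), []
            elif name is not None:
                parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def read_sequences_sketch(*paths):
    """Load records from one or more FASTA files into a single list."""
    seen = set()
    records = []
    for path in paths:
        for name, sequence in parse_fasta(path):
            if name in seen:
                continue  # Assumption: keep only the first record per name.
            seen.add(name)
            records.append((name, sequence))
    return records


# Usage with two hypothetical files; the first has no trailing newline.
with tempfile.TemporaryDirectory() as tmp:
    ours = os.path.join(tmp, "ours.fasta")
    curated = os.path.join(tmp, "curated.fasta")
    with open(ours, "w") as handle:
        handle.write(">seq_a\nACGT")
    with open(curated, "w") as handle:
        handle.write(">seq_a\nACGT\n>seq_b\nTT\nGA\n")
    records = read_sequences_sketch(ours, curated)

print(records)
```

Because each file is parsed on its own, a missing trailing newline in one file cannot corrupt the records of the next.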
Related issues
This issue is related to the issue of supporting multiple metadata inputs through the augur API and, eventually, the command line.