NCBI-Hackathons/ViruSpy

ViruSpy: a pipeline for viral identification from metagenomic samples

Table of Contents

What is ViruSpy?
Why is this important?
ViruSpy Workflow
Quickstart
Installing ViruSpy
ViruSpy Usage
ViruSpy Testing and Validation
Additional Functionality

What is ViruSpy?

ViruSpy is a pipeline designed for virus discovery from metagenomic sequencing data available in NCBI’s SRA database. The first step identifies viral reads in the metagenomic sample with Magic-BLAST, which can search the sample remotely, so the (often quite large) metagenomic dataset does not have to be downloaded. The extracted raw reads are assembled into contigs using MEGAHIT, then annotated for genes by Glimmer and for conserved domains by RPS-tBLASTn. Following annotation, the Building Up Domains (BUD) algorithm determines whether a viral genome is integrated into a host genome (i.e. endogenous) rather than exogenous.

Why is this important?

Viruses make up a large share of the genomic biodiversity on the planet, but only a small fraction of the viruses that exist are known. To help fill this gap in knowledge, we created a pipeline that can identify putative viral sequences from large-scale metagenomic datasets that already exist in the SRA database.

Viruses from multiple virus families are found integrated in host genomes. By including the BUD algorithm in the pipeline, we are able to identify these integrated viruses and distinguish them from exogenous viruses.

ViruSpy Workflow

The ViruSpy pipeline takes two inputs: the SRA ID of the metagenomic sample to be searched and a reference viral genome database. The user can supply the reference database as either a FASTA file or a BLAST database. If neither is provided, ViruSpy defaults to the RefSeq viral genome database and attempts to download those sequences in FASTA format.
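The fallback rule described above can be written as a small decision function. This is a minimal sketch, not ViruSpy's actual code: the function name `choose_reference`, the return labels, and the `DEFAULT_SOURCE` placeholder are hypothetical, and rejecting a FASTA file and a BLAST database given together is an assumption, since the README does not say what happens in that case.

```python
from typing import Optional, Tuple

# Hypothetical placeholder for "RefSeq viral genomes, downloaded as FASTA".
DEFAULT_SOURCE = "refseq_viral"


def choose_reference(fasta: Optional[str] = None,
                     blast_db: Optional[str] = None) -> Tuple[str, str]:
    """Decide which reference viral database a run should use.

    Returns a (kind, value) pair where kind is one of
    "fasta", "blast_db", or "download".
    """
    if fasta and blast_db:
        # Assumption: the two sources are treated as mutually exclusive.
        raise ValueError("provide a FASTA file or a BLAST database, not both")
    if blast_db:
        return ("blast_db", blast_db)
    if fasta:
        return ("fasta", fasta)
    # Neither was supplied: fall back to the RefSeq viral genomes.
    return ("download", DEFAULT_SOURCE)
```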

In the first step, Magic-BLAST returns all of the virus-like sequences from the SRA sample, which are then assembled into contigs using the MEGAHIT assembler.

The contigs are verified as viral sequences through two methods: prediction of open reading frames within the contigs using Glimmer3, and prediction of conserved protein domains using RPS-tBLASTn. Viral conserved domains (CDs) are identified using the NCBI Conserved Domain Database (CDD). Output files from both of these methods are then combined to identify a set of high confidence viral contigs.
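The README does not state how the two outputs are combined. One plausible reading, sketched below under that assumption, is to keep only contigs supported by both lines of evidence (at least one predicted ORF and at least one viral conserved domain). The function name and the input dictionaries are hypothetical; parsing of the Glimmer3 and RPS-tBLASTn output files is left out.

```python
from typing import Dict, List


def high_confidence_contigs(orf_counts: Dict[str, int],
                            domain_hits: Dict[str, List[str]]) -> List[str]:
    """Return contig IDs supported by both ORF and conserved-domain evidence.

    orf_counts:  contig ID -> number of predicted open reading frames
    domain_hits: contig ID -> names of viral conserved domains found
    """
    with_orfs = {contig for contig, n in orf_counts.items() if n > 0}
    with_domains = {contig for contig, hits in domain_hits.items() if hits}
    # Assumption: "combined" means the intersection of the two sets.
    return sorted(with_orfs & with_domains)


# Illustrative inputs only; these are not real ViruSpy results.
orfs = {"contig_1": 3, "contig_2": 0, "contig_3": 1}
domains = {"contig_1": ["capsid"], "contig_2": ["polymerase"], "contig_3": []}
print(high_confidence_contigs(orfs, domains))
```

Only `contig_1` has both kinds of support in this toy input, so it is the only ID kept.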

Starting from the identified viral sequences, endogenous viral sequences within a host are detected with the Building Up Domains (BUD) algorithm. BUD takes as input a preprocessed viral contig identified from a metagenomic dataset and feeds the contig ends from both sides to Magic-BLAST, which searches for overlapping reads in the SRA dataset. The reads are then used to extend the contig in both directions. This process continues until non-viral domains are identified on either side of the original viral contig, implying that the original contig was endogenous in the host, or until a specified number of iterations has been reached (10 by default). This process is depicted below:
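In code form, the extension loop might look like the following sketch. It is not the ViruSpy implementation: the three callables (`find_overlaps`, `extend`, `has_nonviral_flanks`) are hypothetical stand-ins for the Magic-BLAST search, the contig extension, and the domain check, and `end_length` and the early exit when no overlapping reads are found are assumptions. Only the default of 10 iterations comes from the description above.

```python
from typing import Callable, List, Tuple


def bud(contig: str,
        find_overlaps: Callable[[str], List[str]],
        extend: Callable[[str, List[str], List[str]], str],
        has_nonviral_flanks: Callable[[str], bool],
        max_iterations: int = 10,
        end_length: int = 100) -> Tuple[str, bool, int]:
    """Iteratively extend a viral contig and report whether it looks endogenous.

    Returns (final_contig, endogenous, iterations_run).
    """
    iterations_run = 0
    for _ in range(max_iterations):
        # Search for reads overlapping each end of the current contig.
        left_reads = find_overlaps(contig[:end_length])
        right_reads = find_overlaps(contig[-end_length:])
        if not left_reads and not right_reads:
            break  # assumption: stop when nothing is left to extend with
        contig = extend(contig, left_reads, right_reads)
        iterations_run += 1
        # Non-viral domains flanking the contig imply host integration.
        if has_nonviral_flanks(contig):
            return contig, True, iterations_run
    return contig, False, iterations_run


# Toy stand-ins: each round adds one host base "h" to both ends, and the
# flanks count as non-viral once three host bases are present on each side.
result = bud(
    "VVVV",
    find_overlaps=lambda end: ["read"],
    extend=lambda c, left, right: "h" + c + "h",
    has_nonviral_flanks=lambda c: c.startswith("hhh") and c.endswith("hhh"),
)
print(result)
```

Passing the search, extension, and domain check in as functions keeps the loop itself testable without Magic-BLAST or an SRA connection.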

Useful References

Magic-BLAST

BLAST Command Line Manual
Magic-BLAST GitHub repo
Magic-BLAST NCBI Insights

MEGAHIT

MEGAHIT GitHub repo
MEGAHIT Paper

Protein Domain Identification

BLAST Command Line Manual
NCBI Conserved Domain and Protein Classification

Glimmer3

Glimmer3 Page at JHU
Glimmer3 Paper
Glimmer3 Manual

Installing ViruSpy

Required software

The ViruSpy /scripts/ directory should be added to the user's $PATH.

ViruSpy Usage

Example usage

viruspy.sh [-d] [-f viral_genomes.fasta/-b viral_db] -s SRR1553459 -o output_folder

Required arguments:

Option Description
-s SRR accession number from the SRA database
-o Folder to be used for pipeline output

Optional arguments:

Option Description
-f FASTA file containing viral sequences to be used in construction of a BLAST database. If neither this argument nor -b is used, ViruSpy will default to using the RefSeq viral genome database.
-b BLAST database with viral sequences to be used with Magic-BLAST. If neither this argument nor -f is used, ViruSpy will default to using the RefSeq viral genome database.
-d Determines signature of viruses that are integrated into a host genome (runs the BUD algorithm)
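For scripted runs, the options above can be assembled into an argument list programmatically. The helper below is a hypothetical convenience wrapper, not part of ViruSpy; it uses only the flags documented in the tables above and assumes `viruspy.sh` is on the `$PATH`. Treating `-f` and `-b` as mutually exclusive is an assumption based on the usage line.

```python
from typing import List, Optional


def build_viruspy_command(sra_accession: str,
                          output_folder: str,
                          fasta: Optional[str] = None,
                          blast_db: Optional[str] = None,
                          run_bud: bool = False) -> List[str]:
    """Build the argument list for a viruspy.sh invocation."""
    if fasta and blast_db:
        # Assumption: only one reference source may be given per run.
        raise ValueError("use either -f or -b, not both")
    command = ["viruspy.sh"]
    if run_bud:
        command.append("-d")  # run the BUD algorithm
    if fasta:
        command += ["-f", fasta]
    elif blast_db:
        command += ["-b", blast_db]
    # Required arguments: SRA accession and output folder.
    command += ["-s", sra_accession, "-o", output_folder]
    return command


print(" ".join(build_viruspy_command("SRR1553459", "output_folder", run_bud=True)))
```

The returned list can be passed to `subprocess.run(command, check=True)` to launch the pipeline.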

ViruSpy Testing and Validation

Additional Functionality