abs-quanti-nanopore

This repo contains the key code and logic for generating the absolute quantification results in the paper "Rapid Absolute Quantification of Pathogens and ARGs by Nanopore Sequencing" by Yang, Yu et al. 2021.

This paper is published here
Raw sequencing data files are available under BioProject: PRJNA728386
Intended use for research only

Components for reproducible analysis

Construction of the Structured Average Genome Size (SAGS) Database
End-to-End Absolute Quantification workflow
Citations

1. Construction of the `Structured Average Genome Size (SAGS)` Database

Tools used:
- taxonkit
- MetaPhlAn2: merge_metaphlan_tables.py

SAGS is built upon the bacterial and archaeal taxonomy and metadata files from GTDB_r95.

2. End-to-End `Absolute Quantification` workflow

A) Tools used:

- seqtk
- seqkit
- Kraken2
- In case of Illumina metagenomic shotgun reads, Bracken
- Minimap2
- filter_fasta_by_list_of_headers.py

B) Additional files besides original sequence files required: (files bracketed by * should be provided by users):

- Kraken2_gtdb_db: *your Kraken2-compatible GTDB index database files*
- mClover3 fasta file: ./fasta/mClover3.fa
- nucleotide ARG database and the structure file: *nucleotide-ARG-DB.fasta* & *ARG_structure*
- Structured Average Genome Size (SAGS) database: *SAGS*, constructed as above
- Nanopore DNA CS fasta file: ./fasta/DCS.fasta
- Pathogen list: *pathogen.list* original list
  - Please refer to our manuscript for details of the conversion to GTDB taxonomy nomenclature
  - Original data can be obtained upon request

C) Logic flow and key code:

Prepare sequencing reads
- merge reads, convert file types, length filtering by seqtk and seqkit;
- identify (Minimap2) and remove DCS reads if DCS is used in ONT library preparation

seqtk seq -a input.fq > input.fa
seqkit fx2tab -l input.fa
seqkit seq -m 1000 input.fa > input_1kb.fa
minimap2 -cx map-ont ./fasta/DCS.fasta input.fa > output_DCS_minimap.paf
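
The DCS removal step can be sketched in Python as follows. This is only an illustration and not the repository's own script (the tool list names filter_fasta_by_list_of_headers.py for this purpose). It assumes that any read with at least one alignment in the PAF file counts as a DCS read, which may be looser than the filtering used in the paper. The function names `dcs_read_ids` and `drop_reads` are hypothetical.

```python
def dcs_read_ids(paf_lines):
    """Collect read names from PAF records (column 1 is the query name)."""
    ids = set()
    for line in paf_lines:
        if line.strip():
            ids.add(line.split("\t", 1)[0])
    return ids


def drop_reads(fasta_lines, ids_to_drop):
    """Yield FASTA lines, skipping every record whose ID is in ids_to_drop."""
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            # The read ID is the header text up to the first whitespace.
            fields = line[1:].split()
            read_id = fields[0] if fields else ""
            keep = read_id not in ids_to_drop
        if keep:
            yield line
```

Both functions accept any iterable of lines, so open file handles for `output_DCS_minimap.paf` and the FASTA file can be passed directly.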

Kraken2 for rapid taxonomic classification using GTDB r95 database
- compile the taxonomic abundance results and stratify them by taxonomic rank, expressed both as number of bases and as number of genome copies

kraken2 --db Kraken2_gtdb_db input_1kb.fa  --output kraken2_gtdb_r95 --use-names --report kraken2_report_gtdbr95 --unclassified-out kraken2_gtdb_r95_unclassified --classified-out kraken2_gtdb_r95_classified
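
Summing bases per assigned taxon from the per-read Kraken2 output (the file given to `--output`) might look like the sketch below. It relies on the standard tab-delimited Kraken2 output columns (classification flag, read ID, taxon, read length, LCA mapping) and assumes single-end reads, so the length column is a plain integer. The function name `bases_per_taxon` is hypothetical, and stratification to higher ranks is not covered here.

```python
from collections import defaultdict


def bases_per_taxon(kraken_lines):
    """Sum read lengths (bp) for each taxon that Kraken2 assigned."""
    totals = defaultdict(int)
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or fields[0] != "C":
            continue  # skip unclassified ("U") or malformed lines
        taxon = fields[2]  # with --use-names: "name (taxid N)"
        totals[taxon] += int(fields[3])  # single-end read length in bp
    return dict(totals)
```

Reads are counted at the taxon Kraken2 assigned them to (the LCA), so bases assigned at genus level are not included in any species total.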

Spiked marker gene alignment by minimap2
- identify mClover3 reads by Minimap2 and filter the results with the parameters described in our paper;
- calculate the mClover3 gene copy number to approximate the genome copy number of the spiked cells;

minimap2 -cx map-ont ./fasta/mClover3.fa input_1kb.fa > minimap_mClover3_algn.paf
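
One possible way to turn the PAF alignments into a gene copy number is sketched below. The cutoffs `min_identity` and `min_aln_len` are placeholders, not the values from the paper, and counting a copy as the covered fraction of the target gene is an assumption made for illustration. The function name `spike_gene_copies` is hypothetical. The column positions follow the PAF format (target length, start, end in columns 7-9; matching bases and alignment block length in columns 10-11).

```python
def spike_gene_copies(paf_lines, min_identity=0.75, min_aln_len=200):
    """Estimate gene copies as the summed target coverage of passing alignments.

    min_identity and min_aln_len are placeholder cutoffs, not the paper's.
    """
    copies = 0.0
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue  # not a complete PAF record
        tlen, tstart, tend = int(f[6]), int(f[7]), int(f[8])
        nmatch, aln_len = int(f[9]), int(f[10])
        if aln_len == 0 or aln_len < min_aln_len:
            continue
        if nmatch / aln_len < min_identity:
            continue
        # Each alignment contributes the fraction of the gene it covers.
        copies += (tend - tstart) / tlen
    return copies
```

Replace the placeholder cutoffs with the parameters given in the paper before using the result.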

ARG identification by Minimap2 against nucleotide ARG database
- align reads to nucleotide ARG database by Minimap2 and filter results with cutoffs from (here)
- calculate the gene copy number of different ARGs
- keep those ARG-carrying reads with at least an additional 1 kb walkout distance for ARG host tracking

minimap2  -cx map-ont nucleotide-ARG-DB.fasta input_1kb.fa > minimap2_ARG_algn.paf
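
The walkout filter could be sketched as below. Here "walkout distance" is interpreted as the number of read bases lying outside the ARG alignment (upstream plus downstream), which is an assumption; the definition in the paper takes precedence. The function name `reads_with_walkout` is hypothetical, and the alignment quality cutoffs mentioned above are not applied in this sketch.

```python
def reads_with_walkout(paf_lines, min_walkout=1000):
    """Return IDs of reads with >= min_walkout bases outside the ARG alignment."""
    keep = set()
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue  # not a complete PAF record
        qlen, qstart, qend = int(f[1]), int(f[2]), int(f[3])
        # Read bases before the alignment start plus bases after its end.
        flank = qstart + (qlen - qend)
        if flank >= min_walkout:
            keep.add(f[0])
    return keep
```

The returned read IDs can then be matched against the Kraken2 assignments of the same reads to link each ARG to a host taxon.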

Calculation of the absolute abundance of microbial cells in unit sample volume
- refer to our paper for the calculation of the scaling factor for converting sequenced genome copy number into cell number per unit sample volume
- absolute abundance of pathogens and ARG-carrying hosts can then be extracted
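
As a rough orientation, a generic spike-in calculation is sketched below. It is not the formula from the paper: it assumes one genome copy per cell, that the sequenced spike genome copies are approximated by the mClover3 gene copy number, and that the ratio of spiked cells to sequenced spike copies applies to every taxon. All names (`cells_per_unit_volume` and its parameters) are hypothetical.

```python
def cells_per_unit_volume(taxon_bases, avg_genome_size_bp,
                          spike_copies_sequenced, spike_cells_added,
                          sample_volume):
    """Convert sequenced bases of a taxon into cells per unit sample volume.

    Generic spike-in scaling under the stated assumptions, not the paper's formula.
    """
    # Sequenced genome copies of the taxon, using its average genome size.
    genome_copies = taxon_bases / avg_genome_size_bp
    # Cells represented by each sequenced genome copy, taken from the spike.
    scaling = spike_cells_added / spike_copies_sequenced
    return genome_copies * scaling / sample_volume
```

The average genome size would come from the SAGS database entry for the taxon, and the sample volume unit determines the unit of the result.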

D) Running time estimation for major steps:

For an input fasta file of 10 Gb, expect approximately 3-4 hr of data processing time to generate the final microbial absolute quantification results.

- Kraken2 for taxonomic classification -- 30 min with 10 threads and 300 G memory pre-allocated.
- Minimap2 for mClover3 (spiked gene) identification -- 2.5 min with 10 threads and 150 G memory pre-allocated.
- Processing Kraken2 output to convert the sequenced genome copy numbers to the final absolute cell abundance per unit sample volume:
  - Summing bases for all the classified reads to different Kraken2-assigned LCA taxonomic lineages -- 2.5 hr with 10 threads and 150 G memory pre-allocated.
  - Stratifying the summation results above into different taxonomic levels -- 5 min with 10 threads and 150 G memory pre-allocated.
  - Converting the sequenced genome copy numbers into the absolute cell abundance per unit sample volume -- untimed, but approx. 15 min with a single thread.

If you intend to use these commands, please cite these resources:

- GTDB
- Kraken2
- In case of Illumina metagenomic shotgun reads, Bracken
- Nucleotide ARG database
- Minimap2
- taxonkit
- seqtk
- seqkit
- MetaPhlAn: merge_metaphlan_tables.py
- Pathogen list

I have tried to credit all third-party resources, tools, and code. In case of any unintentional infringement, please contact elly.yu.yang@gmail.com.
