This repository provides code and files to reproduce the data and figures from the manuscript "Critical assessment of pan-genomics of metagenome-assembled genomes" by Tang Li and Yanbin Yin* (*corresponding author). The Python and shell scripts cover downloading genome data, simulating metagenome-assembled genomes (MAGs) from complete genomes, analyzing pan-genomes, performing Clusters of Orthologous Groups (COG) functional annotation, and comparing phylogenetic trees. The R code covers reformatting data, generating plots, and combining plots.
- FastANI: Calculate Average Nucleotide Identity (ANI).
- Prokka: Prokaryotic genome annotation.
- BLAST+: Compare sequences against a database.
- Roary: Pan-genome analysis.
- Anvi'o: Pan-genome analysis.
- BPGA: Pan-genome analysis.
- FastTree: Phylogenetic tree construction.
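
Before running the pipeline, it can help to confirm that the tools listed above are installed. The following is a minimal sketch, not part of the repository: the executable names are assumptions based on common installations and may differ on your system. BPGA is left out because its executable name depends on how it was installed.

```python
import shutil

# Executable names are assumptions based on typical installs; adjust as needed.
TOOLS = {
    "FastANI": "fastANI",
    "Prokka": "prokka",
    "BLAST+": "blastp",
    "Roary": "roary",
    "Anvi'o": "anvi-pan-genome",
    "FastTree": "FastTree",
}


def missing_tools(tools=TOOLS):
    """Return the names of tools whose executable is not found on PATH."""
    return [name for name, exe in tools.items() if shutil.which(exe) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

Activate the conda environment first so that tools installed inside it are visible on PATH.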
- The complete data generated in this study is too large to store on GitHub. Example data for Escherichia coli are available online for testing MAG simulation, generating mixed MAG datasets, extracting and comparing core genes, and evaluating downstream analyses.
- Anaconda is used to create the conda environment for running the Python scripts. The required packages are listed in conda_list and can be installed using `conda create --name <env> --file conda_list`.
- Information about the R packages needed to run the R code can be found in R_packages.
- Python_Shell_scripts
- 1.Genome_Data_Collection:
- download_all_complete_genome_fasta.sh: Download complete bacterial genomes listed in assembly_summary.txt.
- download_genus_contaminaton_genomes.sh: Download bacterial genomes to use as contamination datasets.
- fastANI.sh: Calculate average nucleotide identity (ANI) for bacterial species.
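
To illustrate the first step, the sketch below shows one way to select complete genomes from an NCBI assembly_summary.txt file and derive FASTA download URLs. It is not the repository's script: the function name `complete_genome_urls` is hypothetical, and the sketch assumes the standard NCBI layout (a tab-separated file with a `#`-prefixed header row containing the `assembly_accession`, `organism_name`, `assembly_level`, and `ftp_path` columns, and a genomic FASTA named after the last component of `ftp_path`).

```python
def complete_genome_urls(lines):
    """Yield (accession, organism, url) for rows marked 'Complete Genome'.

    `lines` is any iterable of text lines from assembly_summary.txt.
    Columns are looked up by header name, so extra columns are ignored.
    """
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # The header row looks like "# assembly_accession<TAB>bioproject..."
            fields = line.lstrip("#").strip().split("\t")
            if fields[0] == "assembly_accession":
                header = [f.strip() for f in fields]
            continue
        if header is None:
            raise ValueError("header row not found before first data row")
        row = dict(zip(header, line.split("\t")))
        if row.get("assembly_level") != "Complete Genome":
            continue
        ftp_path = row.get("ftp_path", "na")
        if ftp_path == "na":
            continue  # no download location recorded for this assembly
        # Assumed naming: <ftp_path>/<last path component>_genomic.fna.gz
        prefix = ftp_path.rstrip("/").rsplit("/", 1)[-1]
        url = f"{ftp_path.rstrip('/')}/{prefix}_genomic.fna.gz"
        yield row["assembly_accession"], row.get("organism_name", ""), url


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py assembly_summary.txt > urls.tsv
    with open(sys.argv[1], encoding="utf-8") as handle:
        for accession, organism, url in complete_genome_urls(handle):
            print(accession, organism, url, sep="\t")
```

The resulting URL list can then be passed to a downloader such as wget or curl. Looking up columns by name rather than by position keeps the sketch working if NCBI appends new columns.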
- 1.Genome_Data_Collection: