This repository provides the code and files needed to reproduce the data and figures from the manuscript "Critical assessment of pan-genomics of metagenome-assembled genomes", by Tang Li and Yanbin Yin* (*corresponding author). The Python and shell scripts cover downloading genome data, simulating metagenome-assembled genomes (MAGs) from complete genomes, performing pan-genome analysis, performing Clusters of Orthologous Groups (COG) functional annotation, and comparing phylogenetic trees. The R code covers reformatting data, generating plots, and combining plots.
- FastANI: Calculate Average Nucleotide Identity (ANI).
- Prokka: Prokaryotic genome annotation.
- BLAST+: Compare sequences against a database.
- Roary: Pan-genome analysis.
- Anvi'o: Pan-genome analysis.
- BPGA: Pan-genome analysis.
- FastTree: Phylogenetic tree construction.
- The entire dataset generated in this study is too large to store on GitHub. Some example data for Escherichia coli are available online for testing MAG simulation, generating mixed MAG datasets, extracting and comparing core genes, and evaluating downstream analyses.
- Anaconda is used to create the conda environment for running the Python scripts. The required packages, listed in conda_list, can be installed using `conda create --name <env> --file conda_list`.
- Information about the R packages needed to run the R code can be found in R_packages.
- Python_Shell_scripts
-
- Genome_Data_Collection: collect and analyze genome data.
- download_all_complete_genome_fasta.sh: Download complete bacterial genomes listed in assembly_summary.txt.
- download_genus_contaminaton_genomes.sh: Download bacterial genomes to use as contamination datasets.
- fastANI.sh: Calculate average nucleotide identity (ANI) for bacterial species.
-
- 17_species: pan-genome analysis for 17 species.
- prokka.sh: Genome annotation using Prokka.
- gen_gff.sh: Rename .gff files from Prokka results.
- roary_species.sh: Pan-genome analysis using Roary.
- sbatch_roary.sh: Run multiple jobs for pan-genome analysis.
-
- MAG_Simulation: simulate MAGs from complete genomes.
- fragmentation.py: Fragmentation simulation - randomly cut the genome into fragments (random number of fragments).
- fragmentation_avrg_length.py: Fragmentation simulation - randomly cut the genome into fragments (random length of fragments).
- incompleteness.py: Incompleteness simulation - remove a percentage of sequence length from each fragment.
- contamination.py: Contamination simulation - add fragments from other genomes in the same species (intraspecies).
- contamination_genus.py: Contamination simulation - add fragments from other genomes in the same genus (interspecies).
- random_distribution: Generate random numbers following an F distribution for simulation.
- generate_numbers.sh: Generate numbers for the genome list to assign random fragmentation/incompleteness/contamination values.
- simulation.sh: Automatic simulation scripts.
- batch_files.sh: Batch files for simulation.
- multiple_dataset.sh: Generate multiple datasets for testing the dataset variations.
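The three simulation steps listed above (fragmentation, incompleteness, contamination) can be illustrated with a minimal Python sketch. This is not the code in the repository's scripts: the function names, the parameters, trimming from a randomly chosen end, and adding contamination as a single donor piece are all assumptions made for illustration, and the sequences are random toy strings rather than real genomes.

```python
import random


def fragment(seq, n_fragments, rng):
    """Cut seq into n_fragments contiguous pieces at random cut points."""
    if not 1 <= n_fragments <= len(seq):
        raise ValueError("n_fragments must be between 1 and len(seq)")
    cuts = sorted(rng.sample(range(1, len(seq)), n_fragments - 1))
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


def make_incomplete(fragments, fraction, rng):
    """Trim `fraction` of each fragment's length from a randomly chosen end.

    Assumption: the source only says a percentage of each fragment is removed;
    which part is removed is a choice made for this sketch.
    """
    trimmed = []
    for frag in fragments:
        n_remove = int(len(frag) * fraction)
        if rng.random() < 0.5:
            kept = frag[n_remove:]                   # trim the left end
        else:
            kept = frag[:len(frag) - n_remove]       # trim the right end
        if kept:
            trimmed.append(kept)
    return trimmed


def contaminate(fragments, donor_seq, fraction, rng):
    """Append one donor piece whose length is `fraction` of the current total."""
    total = sum(len(f) for f in fragments)
    n_add = min(int(total * fraction), len(donor_seq))
    if n_add == 0:
        return list(fragments)
    start = rng.randint(0, len(donor_seq) - n_add)
    return list(fragments) + [donor_seq[start:start + n_add]]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
genome = "".join(rng.choice("ACGT") for _ in range(10_000))  # toy "complete genome"
donor = "".join(rng.choice("ACGT") for _ in range(10_000))   # toy contaminant genome

mag = fragment(genome, n_fragments=20, rng=rng)
mag = make_incomplete(mag, fraction=0.10, rng=rng)
mag = contaminate(mag, donor, fraction=0.05, rng=rng)
print(len(mag), sum(len(f) for f in mag))
```

Passing a donor from the same species or from another species in the same genus would correspond to the intraspecies and interspecies cases, respectively.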
-
- Mixed_datasets: generate mixed datasets containing MAGs and complete genomes.
- rad_combine.sh: Generate mixed datasets with different percentages of MAGs.
- copy_ori_file.py: Generate mixed datasets by combining the original and simulated MAG datasets.
- Pan-genome_and_summary.sh: Perform pan-genome analysis for mixed datasets.
- loop_rad_combine.sh: Run rad_combine.sh multiple times.
- roary_sum.py: Summarize Roary results for multiple mixed datasets.
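As a rough illustration of how a mixed dataset with a given percentage of MAGs might be assembled, the sketch below randomly picks which genomes are represented by their simulated MAG and which stay as complete genomes. It assumes each complete genome has a simulated MAG counterpart with the same ID; the function name `choose_mag_replacements` and the genome IDs are hypothetical and do not come from the repository's scripts.

```python
import random


def choose_mag_replacements(genome_ids, mag_fraction, rng):
    """Label each genome as 'MAG' or 'complete' for one mixed dataset."""
    if not 0.0 <= mag_fraction <= 1.0:
        raise ValueError("mag_fraction must be between 0 and 1")
    n_mags = round(len(genome_ids) * mag_fraction)
    mag_ids = set(rng.sample(list(genome_ids), n_mags))
    # Keep the input order so the dataset layout stays stable across runs.
    return {gid: ("MAG" if gid in mag_ids else "complete") for gid in genome_ids}


rng = random.Random(42)  # fixed seed so the selection is reproducible
genome_ids = [f"genome_{i:02d}" for i in range(20)]  # hypothetical genome IDs

for fraction in (0.25, 0.5, 0.75):
    labels = choose_mag_replacements(genome_ids, fraction, rng)
    n_mags = sum(1 for v in labels.values() if v == "MAG")
    print(f"{fraction:.0%} MAGs -> {n_mags} of {len(labels)} genomes")
```

Repeating the selection with different seeds would give several mixed datasets at the same MAG percentage, which is one way to look at dataset-to-dataset variation.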
-
- Three_tools:
-