Welcome to the PhyloRooting repository! Here you will find the scripts to infer a phylogenomic rooting neighborhood, as described in https://doi.org/10.1101/758581.
The main analysis is done with pgroot.py. The input of the analysis is a concatenated file of rooted trees whose branch values represent the ancestor deviation (AD). To obtain this format, run MAD (mad.py) with the -u flag.
Important! Sequence headers must use the following format:
>12_1
MDVGKKKTKGC
>12_2
MDVGKKKTKGC
>14_1
MDVGKKKTKGC
Here, the first element is the genome/species id and the second element, after the underscore, is the copy number. In this example, species 12 contains two paralogs of the same protein, while species 14 has only one.
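As an illustration of the header convention, here is a minimal Python sketch that splits each header into a species id and a copy number and counts the copies per species. The function names `parse_header` and `copies_per_species` are hypothetical and are not part of pgroot.py. Splitting on the last underscore is an assumption made so that ids containing underscores would still parse.

```python
from collections import Counter


def parse_header(header):
    """Split a header such as '>12_1' into (species_id, copy_number)."""
    name = header.lstrip(">").strip()
    # Assumption: the copy number follows the LAST underscore in the header.
    species_id, _, copy = name.rpartition("_")
    if not species_id or not copy.isdigit():
        raise ValueError(f"unexpected header format: {header!r}")
    return species_id, int(copy)


def copies_per_species(headers):
    """Count how many sequences each species contributes."""
    return Counter(parse_header(h)[0] for h in headers)


headers = [">12_1", ">12_2", ">14_1"]
counts = copies_per_species(headers)
for species, n in sorted(counts.items()):
    print(species, n)
```

For the example above, species 12 is counted twice and species 14 once. A header without an underscore or with a non-numeric copy number raises a `ValueError`.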
The program has two running modes:
If the flag --inputpartitions is used, you should provide a partition file as input containing the partitions to be tested. The file should look as follows:
Partition_id,358,363,474,367,70,371,365,345,94,263,348,AHKW2b,275
1,1,1,1,1,1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,1,1
3,0,0,0,0,0,1,1,1,1,1,1,0,0
Where the first line contains the OTU ids, and the next lines contain the different partitions to be tested.
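To check a partition file before running the analysis, it can be loaded as in the following Python sketch. The function `read_partitions` is hypothetical and is not taken from pgroot.py. The sketch assumes that 1 and 0 mark the two sides of each partition and that every row has one value per OTU.

```python
import csv
import io


def read_partitions(handle):
    """Return {partition_id: (otus_marked_1, otus_marked_0)} from a CSV handle."""
    reader = csv.reader(handle)
    header = next(reader)
    otus = [cell.strip() for cell in header[1:]]  # first cell is 'Partition_id'
    partitions = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        part_id = row[0].strip()
        flags = [cell.strip() for cell in row[1:]]
        if len(flags) != len(otus):
            raise ValueError(f"partition {part_id}: expected {len(otus)} values")
        # Assumption: '1' and '0' label the two sides of the split.
        side_one = frozenset(o for o, f in zip(otus, flags) if f == "1")
        side_zero = frozenset(otus) - side_one
        partitions[part_id] = (side_one, side_zero)
    return partitions


example = """Partition_id,358,363,474,367,70,371,365,345,94,263,348,AHKW2b,275
1,1,1,1,1,1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,1,1
3,0,0,0,0,0,1,1,1,1,1,1,0,0
"""
parts = read_partitions(io.StringIO(example))
for part_id, (ones, zeros) in parts.items():
    print(part_id, sorted(ones))
```

For a file on disk, pass an open file handle instead of the `io.StringIO` object.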
If the flag --inputpartitions is not used, the candidate root partitions will be identified from the set of single-copy complete gene trees.
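The following Python sketch shows one way to decide whether a gene tree qualifies as single-copy and complete, based on its leaf names. It assumes that "complete" means every species in the data set is present and that "single-copy" means each species appears exactly once. The function `is_single_copy_complete` is hypothetical and may differ from how pgroot.py selects trees.

```python
from collections import Counter


def is_single_copy_complete(leaf_names, all_species):
    """True if every species in all_species occurs exactly once among the leaves."""
    # Leaf names follow the header convention '<species id>_<copy number>'.
    counts = Counter(name.rpartition("_")[0] for name in leaf_names)
    complete = set(counts) == set(all_species)
    single_copy = all(n == 1 for n in counts.values())
    return complete and single_copy


species = {"12", "14"}
print(is_single_copy_complete(["12_1", "14_1"], species))
print(is_single_copy_complete(["12_1", "12_2", "14_1"], species))
print(is_single_copy_complete(["12_1"], species))
```

Only the first call meets both conditions. The second tree contains a paralog of species 12, and the third lacks species 14.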
The program will output two files:
*.df contains a dataframe with the AD values per tree and partition.
*.partitions describes the OTU composition of the candidate root partitions (only provided if the flag --inputpartitions is used).
If the flag --neighborhood is used, the inference of a root neighborhood will be performed.
Additionally, the data sets used in the study can be found here. The analysis for the proteobacteria dataset can be run as follows:
pgroot.py -t Proteobacteria.nwk.AD_unrooted -AD Proteobacteria.df -p Proteobacteria_part.mat --neighborhood
An R script is also provided, which can be used to replicate some of the figures in the manuscript.