Generate consensus NJ tree for each lineage #127
Actually, Biopython has functions for generating consensus trees - we just need to check whether they are valid for unrooted trees.
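Biopython's `Bio.Phylo.Consensus` module does provide a majority-rule consensus function; a minimal sketch of how it can be called, using two toy trees (the taxon names are made up for illustration):

```python
from io import StringIO
from Bio import Phylo
from Bio.Phylo.Consensus import majority_consensus

# Two toy trees over the same four taxa; the (A,B) clade appears in both,
# so it should survive a 50% majority-rule cutoff.
newicks = ["((A,B),(C,D));", "((A,B),C,D);"]
trees = [Phylo.read(StringIO(nwk), "newick") for nwk in newicks]

contree = majority_consensus(trees, cutoff=0.5)
names = sorted(t.name for t in contree.get_terminals())
print(names)  # all four taxa are retained in the consensus
```

Whether the support values are meaningful for unrooted trees still needs checking, since the function traverses clades from an arbitrary root.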
Generating the consensus tree is pretty time-consuming with Biopython - so is generating the pairwise distance matrices:

art@orolo:~/git/covizu$ python3 scripts/clustering.py data/gisaid-human.2.json
loading lineage classifications from database
loading JSON
filtered 12522 problematic features
dropped 17598 records with uncalled bases in excess of 600
[0:00:00.000011] start B.4, 391 entries
[0:00:17.583098] generated distance matrix
[0:00:17.583331] built NJ trees
[0:00:44.149396] built consensus tree
[0:00:44.151085] start A.1, 2102 entries
[0:07:45.064944] generated distance matrix
[0:07:45.065991] built NJ trees
[0:40:23.512280] built consensus tree
[0:40:23.519807] start A, 733 entries
[0:41:25.497257] generated distance matrix
[0:41:25.497792] built NJ trees
[0:43:22.176186] built consensus tree
[0:43:22.179588] start B, 2116 entries
[0:50:34.568322] generated distance matrix
[0:50:34.569634] built NJ trees
[1:13:17.847289] built consensus tree
[1:13:17.854521] start B.2, 1652 entries
[1:17:51.896432] generated distance matrix
[1:17:51.897559] built NJ trees
[1:31:10.700830] built consensus tree
[1:31:10.706830] start B.1, 17381 entries
Eating up all my RAM:

art@orolo:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:            31G         30G        642M        200K        428M        637M
Swap:           31G         22G        9.6G
art@orolo:~$ top -b -n1 | head
top - 15:32:33 up 47 days, 22:51, 2 users, load average: 20.04, 34.05, 41.70
Tasks: 559 total, 1 running, 344 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.1 us, 0.1 sy, 0.0 ni, 99.7 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32852408 total, 454092 free, 32036184 used, 362132 buff/cache
KiB Swap: 33422844 total, 10810840 free, 22612004 used. 410220 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
88079 art 20 0 50.307g 0.029t 460 D 11.8 93.3 99:53.30 python3
93904 art 20 0 42176 4024 3024 R 11.8 0.0 0:00.03 top
90460 root 0 -20 0 0 0 I 5.9 0.0 8:50.79 kworker/u+
This part is eating up most of the RAM:

sym_diffs = {}
for i in range(n-1):
    for j in range(i+1, n):
        sym_diffs[(i, j)] = fvs[i] ^ fvs[j]

Try bitstring encoding instead.
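The thread doesn't show the encoding that was tried, but the bitstring idea can be sketched as follows, assuming `fvs` are sets of genomic features (the `encode` helper and feature names below are hypothetical): pack each set into one Python int bitmask, so the symmetric difference is a single XOR and the distance is a popcount, instead of materializing a set object per pair.

```python
# Hypothetical sketch of bitmask encoding for feature sets.
def encode(feature_set, all_features):
    """Map a set of features to an integer bitmask."""
    index = {f: i for i, f in enumerate(all_features)}
    bits = 0
    for f in feature_set:
        bits |= 1 << index[f]
    return bits

all_features = ["C241T", "A23403G", "G28881A"]  # made-up feature labels
a = encode({"C241T", "A23403G"}, all_features)
b = encode({"A23403G", "G28881A"}, all_features)

sym_diff = a ^ b                      # one machine word per ~64 features
distance = bin(sym_diff).count("1")   # popcount; int.bit_count() needs 3.10+
print(distance)  # 2
```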
Bitstrings were still consuming too much RAM. Using integer indexing instead.
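One reading of "integer indexing" (an assumption, since the code isn't shown) is to replace the dict keyed by `(i, j)` tuples with a flat condensed-upper-triangle array plus index arithmetic, in the style of SciPy's condensed distance matrices; that removes the per-entry tuple and dict overhead:

```python
# Sketch: store only the n-choose-2 upper-triangle entries in a flat list,
# locating pair (i, j) by arithmetic instead of a dict lookup.
def condensed_index(i, j, n):
    """Flat index of pair (i, j), with i < j, in an n-choose-2 array."""
    return i * n - i * (i + 1) // 2 + (j - i - 1)

n = 4
flat = [None] * (n * (n - 1) // 2)
for i in range(n - 1):
    for j in range(i + 1, n):
        flat[condensed_index(i, j, n)] = (i, j)  # marker to verify ordering

print(flat)  # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```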
Crud. Consensus tree crashed:

[0:00:52.695685] 63000000/77413602
[0:00:54.649947] sym diff matrix complete
[0:00:56.759948] allocated dists matrix
[0:01:55.304963] applied weights
[0:05:30.301271] applied weights
[0:09:07.968314] applied weights
[0:12:47.213306] applied weights
[0:16:20.412634] applied weights
[0:20:00.963304] applied weights
[0:23:44.389612] applied weights
[0:27:15.579611] applied weights
[0:30:48.231747] applied weights
[0:34:26.481124] applied weights
Traceback (most recent call last):
File "scripts/clustering.py", line 322, in <module>
contree = majority_consensus(trees, cutoff=0.5)
File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/Consensus.py", line 291, in majority_consensus
terms = first_tree.get_terminals()
File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/BaseTree.py", line 408, in get_terminals
return list(self.find_clades(terminal=True, order=order))
File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/BaseTree.py", line 53, in _preorder_traverse
for elem in dfs(root):
File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/BaseTree.py", line 51, in dfs
for u in dfs(v):
File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/BaseTree.py", line 51, in dfs
for u in dfs(v):
File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/BaseTree.py", line 51, in dfs
for u in dfs(v):
[Previous line repeated 991 more times]
File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/BaseTree.py", line 50, in dfs
for v in get_children(elem):
RecursionError: maximum recursion depth exceeded

At least the code is a lot faster now (compare to above - note the timer is now restarted with every lineage):

art@orolo:~/git/covizu$ python3 scripts/clustering.py data/gisaid-human.2.json
loading lineage classifications from database
loading JSON
filtered 12522 problematic features
dropped 17598 records with uncalled bases in excess of 600
[2020-09-09 22:51:25.229627] start B.4, 391 entries
[0:00:00.000065] generated distance matrix
[0:00:01.372820] built NJ trees
[0:00:01.884189] built consensus tree
[2020-09-09 22:51:27.115067] start A.1, 2102 entries
[0:00:00.000164] generated distance matrix
[0:00:16.466439] built NJ trees
[0:00:29.149093] built consensus tree
[2020-09-09 22:51:56.267855] start A, 733 entries
[0:00:00.001888] generated distance matrix
[0:00:03.235111] built NJ trees
[0:00:04.952541] built consensus tree
[2020-09-09 22:52:01.222126] start B, 2116 entries
[0:00:00.000400] generated distance matrix
[0:00:22.136273] built NJ trees
[0:00:38.917880] built consensus tree
[2020-09-09 22:52:40.146011] start B.2, 1652 entries
[0:00:00.003416] generated distance matrix
[0:00:11.900515] built NJ trees
[0:00:24.812831] built consensus tree
Trying to reproduce with this code:

print('loading lineage classifications from database')
lineages = dump_lineages(args.db)
print('loading JSON')
features = import_json(args.json)
by_lineage = split_by_lineage(features, lineages)

lineage = 'B.1'
lfeatures = by_lineage[lineage]
labels = [row['name'] for row in lfeatures]
trees = []
counter = 0
for dists in get_distances(lfeatures, 10):
    phy = rapidnj(dists, labels)
    Phylo.write(phy, file='B.1-tree{}.nwk'.format(counter), format='newick')
    trees.append(phy)
    counter += 1
contree = majority_consensus(trees, cutoff=0.5)
Phylo.write(contree, file='{}.nwk'.format(lineage), format='newick')
sys.exit()

Running into a different recursion error:

[0:00:52.563452] 63000000/77413602
[0:00:54.539143] sym diff matrix complete
[0:00:56.622507] allocated dists matrix
[0:01:54.642581] applied weights
Traceback (most recent call last):
File "scripts/clustering.py", line 322, in <module>
Phylo.write(phy, file='B.1-tree{}.nwk'.format(counter), format='newick')
File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/_io.py", line 82, in write
n = getattr(supported_formats[format], 'write')(trees, fp, **kwargs)
File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/NewickIO.py", line 58, in write
return Writer(trees).write(handle, plain=plain, **kwargs)
File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/NewickIO.py", line 251, in write
for treestr in self.to_strings(**kwargs):
File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/NewickIO.py", line 289, in to_strings
rawtree = newickize(tree.root) + ';'
File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/NewickIO.py", line 281, in newickize
return '(%s)%s' % (','.join(subtrees),
File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/NewickIO.py", line 280, in <genexpr>
subtrees = (newickize(sub) for sub in clade)
[... repeated many times ...]
File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/NewickIO.py", line 280, in newickize
subtrees = (newickize(sub) for sub in clade)
RecursionError: maximum recursion depth exceeded while calling a Python object
Could not reproduce running one tree manually:

art@orolo:~/git/covizu$ python3
Python 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import Phylo
>>> phy = Phylo.read('test.nwk', 'newick')
>>> phy
Tree(rooted=False, weight=1.0)
>>> Phylo
<module 'Bio.Phylo' from '/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/__init__.py'>
>>> outfile = '/tmp/tmpxxm_sg28'
>>> binpath='rapidnj'
>>> cmd = [binpath, outfile, '-i', 'pd']
>>> cmd.append('--no-negative-length')
>>> import subprocess
>>> stdout = subprocess.check_output(cmd, stderr=subprocess.DEVNULL)
>>> stdout[:20]
b"('hCoV-19/England/CA"
>>> from io import StringIO
>>> with StringIO(stdout.decode('utf-8')) as handle:
... phy = Phylo.read(handle, 'newick')
...
>>> phy
Tree(rooted=False, weight=1.0)
>>>
OK, at least I was able to reproduce the recursion error by running the same script, so hopefully it's not stochastic.
This is a known issue with Biopython noted by the Nextstrain devs. Simple workaround:

>>> sys.getrecursionlimit()
1000
>>> sys.setrecursionlimit(10000)
>>> Phylo.write(phy, file='test.nwk', format='newick')
1
>>> sys.setrecursionlimit(1000)
>>> Phylo.write(phy, file='test.nwk', format='newick')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/_io.py", line 82, in write
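The set/restore dance above can be packaged as a context manager so the limit is always restored even if the write raises; a small sketch (not part of covizu):

```python
import sys
from contextlib import contextmanager

@contextmanager
def recursion_limit(limit):
    """Temporarily raise the recursion limit, restoring it on exit."""
    old = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        yield
    finally:
        sys.setrecursionlimit(old)

before = sys.getrecursionlimit()
with recursion_limit(10000):
    # e.g. Phylo.write(phy, file='B.1.nwk', format='newick')
    inside = sys.getrecursionlimit()
after = sys.getrecursionlimit()
print(inside, after == before)
```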
Successful run!

[0:00:52.513348] 63000000/77413602
[0:00:54.486921] sym diff matrix complete
[0:00:56.525685] allocated dists matrix
[0:01:54.666593] applied weights
[0:05:32.339871] applied weights
[0:09:09.574341] applied weights
[0:12:49.922643] applied weights
[0:16:20.204779] applied weights
[0:19:46.601700] applied weights
[0:23:26.567673] applied weights
[0:27:04.533986] applied weights
[0:30:46.346693] applied weights
[0:34:35.218386] applied weights
Traceback (most recent call last):
File "scripts/clustering.py", line 335, in <module>
sys.exit()
SystemExit
>>>
See the Geneious manual for general guidelines on computing a consensus tree from bootstrap samples:
http://assets.geneious.com/manual/8.1/GeneiousManualsu91.html