Generate consensus NJ tree for each lineage #127

Closed
ArtPoon opened this issue Sep 5, 2020 · 11 comments

ArtPoon commented Sep 5, 2020

See the Geneious manual for general guidelines on computing a consensus tree from bootstrap samples:
http://assets.geneious.com/manual/8.1/GeneiousManualsu91.html

ArtPoon commented Sep 8, 2020

Actually, Biopython has functions for generating consensus trees - we just need to check whether they are valid for unrooted trees.
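
For reference, a minimal sketch of that Biopython route (the file names and the ten replicates here are illustrative assumptions, not the actual CoVizu code):

    # Build a majority-rule consensus from a set of NJ trees with Biopython.
    from Bio import Phylo
    from Bio.Phylo.Consensus import majority_consensus

    # assumes the replicate NJ trees were already written as Newick files
    trees = [Phylo.read('tree{}.nwk'.format(i), 'newick') for i in range(10)]
    contree = majority_consensus(trees, cutoff=0.5)  # 50% majority-rule cutoff
    Phylo.write(contree, 'consensus.nwk', 'newick')

Whether this behaves sensibly on unrooted NJ trees is exactly the caveat above.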

ArtPoon commented Sep 8, 2020

Generating the consensus tree with Biopython is pretty time-consuming - so is generating the pairwise distance matrices:

art@orolo:~/git/covizu$ python3 scripts/clustering.py data/gisaid-human.2.json 
loading lineage classifications from database
loading JSON
filtered 12522 problematic features
dropped 17598 records with uncalled bases in excess of 600
[0:00:00.000011] start B.4, 391 entries
[0:00:17.583098] generated distance matrix
[0:00:17.583331] built NJ trees
[0:00:44.149396] built consensus tree
[0:00:44.151085] start A.1, 2102 entries
[0:07:45.064944] generated distance matrix
[0:07:45.065991] built NJ trees
[0:40:23.512280] built consensus tree
[0:40:23.519807] start A, 733 entries
[0:41:25.497257] generated distance matrix
[0:41:25.497792] built NJ trees
[0:43:22.176186] built consensus tree
[0:43:22.179588] start B, 2116 entries
[0:50:34.568322] generated distance matrix
[0:50:34.569634] built NJ trees
[1:13:17.847289] built consensus tree
[1:13:17.854521] start B.2, 1652 entries
[1:17:51.896432] generated distance matrix
[1:17:51.897559] built NJ trees
[1:31:10.700830] built consensus tree
[1:31:10.706830] start B.1, 17381 entries

ArtPoon commented Sep 8, 2020

Eating up all my RAM:

art@orolo:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:            31G         30G        642M        200K        428M        637M
Swap:           31G         22G        9.6G
art@orolo:~$ top -b -n1 | head
top - 15:32:33 up 47 days, 22:51,  2 users,  load average: 20.04, 34.05, 41.70
Tasks: 559 total,   1 running, 344 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.1 us,  0.1 sy,  0.0 ni, 99.7 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32852408 total,   454092 free, 32036184 used,   362132 buff/cache
KiB Swap: 33422844 total, 10810840 free, 22612004 used.   410220 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 88079 art       20   0 50.307g 0.029t    460 D  11.8 93.3  99:53.30 python3
 93904 art       20   0   42176   4024   3024 R  11.8  0.0   0:00.03 top
 90460 root       0 -20       0      0      0 I   5.9  0.0   8:50.79 kworker/u+

ArtPoon commented Sep 8, 2020

This part is eating up most of the RAM:

    # stores the symmetric difference of feature sets for every pair of
    # genomes - O(n^2) set objects held in memory at once
    sym_diffs = {}
    for i in range(n-1):
        for j in range(i+1, n):
            sym_diffs[(i, j)] = fvs[i] ^ fvs[j]

Try bitstring encoding instead.
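
One way a bitstring encoding could look (a sketch only, assuming fvs is the list of per-genome feature sets; the names here are illustrative, not the CoVizu code):

    # Encode each genome's feature set as one arbitrary-precision Python int,
    # one bit per distinct feature, and take symmetric differences with XOR
    # on demand instead of holding O(n^2) set objects in memory.
    bit_of = {feat: i for i, feat in enumerate(set().union(*fvs))}

    bits = []
    for fv in fvs:
        b = 0
        for feat in fv:
            b |= 1 << bit_of[feat]
        bits.append(b)

    def sym_diff_count(i, j):
        # popcount of the XOR is the size of the symmetric difference
        return bin(bits[i] ^ bits[j]).count('1')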

ArtPoon commented Sep 9, 2020

Bitstrings were still consuming too much RAM. Using integer indexing instead.
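
One reading of "integer indexing" (a sketch only; it may not match the actual implementation): map each distinct feature to a small integer so every genome becomes a compact set of ints, and compute symmetric-difference sizes per pair on demand rather than storing them all:

    # feature -> small integer index; fvs is assumed to be the list of feature sets
    index_of = {feat: i for i, feat in enumerate(set().union(*fvs))}
    ivs = [frozenset(index_of[feat] for feat in fv) for fv in fvs]

    def sym_diff_count(i, j):
        # size of the symmetric difference, computed lazily for one pair
        return len(ivs[i] ^ ivs[j])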

ArtPoon commented Sep 10, 2020

Crud. Consensus tree crashed:

[0:00:52.695685] 63000000/77413602
[0:00:54.649947] sym diff matrix complete
[0:00:56.759948] allocated dists matrix
[0:01:55.304963] applied weights
[0:05:30.301271] applied weights
[0:09:07.968314] applied weights
[0:12:47.213306] applied weights
[0:16:20.412634] applied weights
[0:20:00.963304] applied weights
[0:23:44.389612] applied weights
[0:27:15.579611] applied weights
[0:30:48.231747] applied weights
[0:34:26.481124] applied weights
Traceback (most recent call last):
  File "scripts/clustering.py", line 322, in <module>
    contree = majority_consensus(trees, cutoff=0.5)
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/Consensus.py", line 291, in majority_consensus
    terms = first_tree.get_terminals()
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/BaseTree.py", line 408, in get_terminals
    return list(self.find_clades(terminal=True, order=order))
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/BaseTree.py", line 53, in _preorder_traverse
    for elem in dfs(root):
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/BaseTree.py", line 51, in dfs
    for u in dfs(v):
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/BaseTree.py", line 51, in dfs
    for u in dfs(v):
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/BaseTree.py", line 51, in dfs
    for u in dfs(v):
  [Previous line repeated 991 more times]
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/BaseTree.py", line 50, in dfs
    for v in get_children(elem):
RecursionError: maximum recursion depth exceeded

At least the code is a lot faster now (compare to above - note that the timer is now restarted for every lineage):

art@orolo:~/git/covizu$ python3 scripts/clustering.py data/gisaid-human.2.json 
loading lineage classifications from database
loading JSON
filtered 12522 problematic features
dropped 17598 records with uncalled bases in excess of 600
[2020-09-09 22:51:25.229627] start B.4, 391 entries
[0:00:00.000065] generated distance matrix
[0:00:01.372820] built NJ trees
[0:00:01.884189] built consensus tree
[2020-09-09 22:51:27.115067] start A.1, 2102 entries
[0:00:00.000164] generated distance matrix
[0:00:16.466439] built NJ trees
[0:00:29.149093] built consensus tree
[2020-09-09 22:51:56.267855] start A, 733 entries
[0:00:00.001888] generated distance matrix
[0:00:03.235111] built NJ trees
[0:00:04.952541] built consensus tree
[2020-09-09 22:52:01.222126] start B, 2116 entries
[0:00:00.000400] generated distance matrix
[0:00:22.136273] built NJ trees
[0:00:38.917880] built consensus tree
[2020-09-09 22:52:40.146011] start B.2, 1652 entries
[0:00:00.003416] generated distance matrix
[0:00:11.900515] built NJ trees
[0:00:24.812831] built consensus tree

ArtPoon commented Sep 11, 2020

Trying to reproduce with this code:

    print('loading lineage classifications from database')
    lineages = dump_lineages(args.db)

    print('loading JSON')
    features = import_json(args.json)

    by_lineage = split_by_lineage(features, lineages)

    lineage = 'B.1'
    lfeatures = by_lineage[lineage]
    labels = [row['name'] for row in lfeatures]

    trees = []
    counter = 0
    for dists in get_distances(lfeatures, 10):
        phy = rapidnj(dists, labels)
        Phylo.write(phy, file='B.1-tree{}.nwk'.format(counter), format='newick')
        trees.append(phy)
        counter += 1

    contree = majority_consensus(trees, cutoff=0.5)
    Phylo.write(contree, file='{}.nwk'.format(lineage), format='newick')

    sys.exit()

Running into a different recursion error:

[0:00:52.563452] 63000000/77413602
[0:00:54.539143] sym diff matrix complete
[0:00:56.622507] allocated dists matrix
[0:01:54.642581] applied weights
Traceback (most recent call last):
  File "scripts/clustering.py", line 322, in <module>
    Phylo.write(phy, file='B.1-tree{}.nwk'.format(counter), format='newick')
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/_io.py", line 82, in write
    n = getattr(supported_formats[format], 'write')(trees, fp, **kwargs)
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/NewickIO.py", line 58, in write
    return Writer(trees).write(handle, plain=plain, **kwargs)
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/NewickIO.py", line 251, in write
    for treestr in self.to_strings(**kwargs):
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/NewickIO.py", line 289, in to_strings
    rawtree = newickize(tree.root) + ';'
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/NewickIO.py", line 281, in newickize
    return '(%s)%s' % (','.join(subtrees),
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/NewickIO.py", line 280, in <genexpr>
    subtrees = (newickize(sub) for sub in clade)
[... repeated many times ...]
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/NewickIO.py", line 280, in newickize
    subtrees = (newickize(sub) for sub in clade)
RecursionError: maximum recursion depth exceeded while calling a Python object

ArtPoon commented Sep 11, 2020

Could not reproduce the error by running one tree manually:

art@orolo:~/git/covizu$ python3
Python 3.6.9 (default, Jul 17 2020, 12:50:27) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import Phylo
>>> phy = Phylo.read('test.nwk', 'newick')
>>> phy
Tree(rooted=False, weight=1.0)
>>> Phylo
<module 'Bio.Phylo' from '/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/__init__.py'>
>>> outfile = '/tmp/tmpxxm_sg28'
>>> binpath='rapidnj'
>>> cmd = [binpath, outfile, '-i', 'pd']
>>> cmd.append('--no-negative-length')
>>> import subprocess
>>> stdout = subprocess.check_output(cmd, stderr=subprocess.DEVNULL)
>>> stdout[:20]
b"('hCoV-19/England/CA"
>>> from io import StringIO
>>> with StringIO(stdout.decode('utf-8')) as handle:
...   phy = Phylo.read(handle, 'newick')
... 
>>> phy
Tree(rooted=False, weight=1.0)
>>> 

ArtPoon commented Sep 11, 2020

OK, at least I was able to reproduce the recursion error by running the same script, so hopefully it's not stochastic.

ArtPoon commented Sep 14, 2020

This is a known issue with Biopython noted by the nextstrain devs:
nextstrain/augur#328

Simple workaround:

>>> sys.getrecursionlimit()
1000
>>> sys.setrecursionlimit(10000)
>>> Phylo.write(phy, file='test.nwk', format='newick')
1
>>> sys.setrecursionlimit(1000)
>>> Phylo.write(phy, file='test.nwk', format='newick')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/_io.py", line 82, in write

ArtPoon commented Sep 14, 2020

Successful run!

[0:00:52.513348] 63000000/77413602
[0:00:54.486921] sym diff matrix complete
[0:00:56.525685] allocated dists matrix
[0:01:54.666593] applied weights
[0:05:32.339871] applied weights
[0:09:09.574341] applied weights
[0:12:49.922643] applied weights
[0:16:20.204779] applied weights
[0:19:46.601700] applied weights
[0:23:26.567673] applied weights
[0:27:04.533986] applied weights
[0:30:46.346693] applied weights
[0:34:35.218386] applied weights
Traceback (most recent call last):
  File "scripts/clustering.py", line 335, in <module>
    sys.exit()
SystemExit
>>> 

ArtPoon closed this as completed Sep 14, 2020