Generate consensus NJ tree for each lineage #127

Closed
ArtPoon opened this issue Sep 5, 2020 · 11 comments

ArtPoon commented Sep 5, 2020

See the Geneious manual for general guidelines on computing a consensus tree from bootstrap samples:
http://assets.geneious.com/manual/8.1/GeneiousManualsu91.html

ArtPoon commented Sep 8, 2020

Actually, Biopython has functions for generating consensus trees - we just need to check whether they are valid for unrooted trees.
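
For reference, a minimal sketch of that Biopython route (the file names and the ten replicates here are illustrative assumptions, not the actual CoVizu code):

    # Build a majority-rule consensus from a set of NJ trees with Biopython.
    from Bio import Phylo
    from Bio.Phylo.Consensus import majority_consensus

    # assumes the replicate NJ trees were already written as Newick files
    trees = [Phylo.read('tree{}.nwk'.format(i), 'newick') for i in range(10)]
    contree = majority_consensus(trees, cutoff=0.5)  # 50% majority-rule cutoff
    Phylo.write(contree, 'consensus.nwk', 'newick')

Whether this behaves sensibly on unrooted NJ trees is exactly the caveat above.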

ArtPoon commented Sep 8, 2020

Generating the consensus tree with Biopython is pretty time-consuming - so is generating the pairwise distance matrices:

art@orolo:~/git/covizu$ python3 scripts/clustering.py data/gisaid-human.2.json 
loading lineage classifications from database
loading JSON
filtered 12522 problematic features
dropped 17598 records with uncalled bases in excess of 600
[0:00:00.000011] start B.4, 391 entries
[0:00:17.583098] generated distance matrix
[0:00:17.583331] built NJ trees
[0:00:44.149396] built consensus tree
[0:00:44.151085] start A.1, 2102 entries
[0:07:45.064944] generated distance matrix
[0:07:45.065991] built NJ trees
[0:40:23.512280] built consensus tree
[0:40:23.519807] start A, 733 entries
[0:41:25.497257] generated distance matrix
[0:41:25.497792] built NJ trees
[0:43:22.176186] built consensus tree
[0:43:22.179588] start B, 2116 entries
[0:50:34.568322] generated distance matrix
[0:50:34.569634] built NJ trees
[1:13:17.847289] built consensus tree
[1:13:17.854521] start B.2, 1652 entries
[1:17:51.896432] generated distance matrix
[1:17:51.897559] built NJ trees
[1:31:10.700830] built consensus tree
[1:31:10.706830] start B.1, 17381 entries

ArtPoon commented Sep 8, 2020

Eating up all my RAM:

art@orolo:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:            31G         30G        642M        200K        428M        637M
Swap:           31G         22G        9.6G
art@orolo:~$ top -b -n1 | head
top - 15:32:33 up 47 days, 22:51,  2 users,  load average: 20.04, 34.05, 41.70
Tasks: 559 total,   1 running, 344 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.1 us,  0.1 sy,  0.0 ni, 99.7 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32852408 total,   454092 free, 32036184 used,   362132 buff/cache
KiB Swap: 33422844 total, 10810840 free, 22612004 used.   410220 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 88079 art       20   0 50.307g 0.029t    460 D  11.8 93.3  99:53.30 python3
 93904 art       20   0   42176   4024   3024 R  11.8  0.0   0:00.03 top
 90460 root       0 -20       0      0      0 I   5.9  0.0   8:50.79 kworker/u+

ArtPoon commented Sep 8, 2020

This part is eating up most of the RAM:

    # stores the symmetric difference of feature sets for every pair of
    # genomes - O(n^2) set objects held in memory at once
    sym_diffs = {}
    for i in range(n-1):
        for j in range(i+1, n):
            sym_diffs[(i, j)] = fvs[i] ^ fvs[j]

Try bitstring encoding instead.
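
One way a bitstring encoding could look (a sketch only, assuming fvs is the list of per-genome feature sets; the names here are illustrative, not the CoVizu code):

    # Encode each genome's feature set as one arbitrary-precision Python int,
    # one bit per distinct feature, and take symmetric differences with XOR
    # on demand instead of holding O(n^2) set objects in memory.
    bit_of = {feat: i for i, feat in enumerate(set().union(*fvs))}

    bits = []
    for fv in fvs:
        b = 0
        for feat in fv:
            b |= 1 << bit_of[feat]
        bits.append(b)

    def sym_diff_count(i, j):
        # popcount of the XOR is the size of the symmetric difference
        return bin(bits[i] ^ bits[j]).count('1')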

ArtPoon commented Sep 9, 2020

Bitstrings were still consuming too much RAM. Using integer indexing instead.
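
One reading of "integer indexing" (a sketch only; it may not match the actual implementation): map each distinct feature to a small integer so every genome becomes a compact set of ints, and compute symmetric-difference sizes per pair on demand rather than storing them all:

    # feature -> small integer index; fvs is assumed to be the list of feature sets
    index_of = {feat: i for i, feat in enumerate(set().union(*fvs))}
    ivs = [frozenset(index_of[feat] for feat in fv) for fv in fvs]

    def sym_diff_count(i, j):
        # size of the symmetric difference, computed lazily for one pair
        return len(ivs[i] ^ ivs[j])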

ArtPoon commented Sep 10, 2020

Crud. Consensus tree crashed:

[0:00:52.695685] 63000000/77413602
[0:00:54.649947] sym diff matrix complete
[0:00:56.759948] allocated dists matrix
[0:01:55.304963] applied weights
[0:05:30.301271] applied weights
[0:09:07.968314] applied weights
[0:12:47.213306] applied weights
[0:16:20.412634] applied weights
[0:20:00.963304] applied weights
[0:23:44.389612] applied weights
[0:27:15.579611] applied weights
[0:30:48.231747] applied weights
[0:34:26.481124] applied weights
Traceback (most recent call last):
  File "scripts/clustering.py", line 322, in <module>
    contree = majority_consensus(trees, cutoff=0.5)
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/Consensus.py", line 291, in majority_consensus
    terms = first_tree.get_terminals()
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/BaseTree.py", line 408, in get_terminals
    return list(self.find_clades(terminal=True, order=order))
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/BaseTree.py", line 53, in _preorder_traverse
    for elem in dfs(root):
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/BaseTree.py", line 51, in dfs
    for u in dfs(v):
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/BaseTree.py", line 51, in dfs
    for u in dfs(v):
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/BaseTree.py", line 51, in dfs
    for u in dfs(v):
  [Previous line repeated 991 more times]
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/BaseTree.py", line 50, in dfs
    for v in get_children(elem):
RecursionError: maximum recursion depth exceeded

At least the code is a lot faster now (compare to above - note that the timer is now restarted for every lineage):

art@orolo:~/git/covizu$ python3 scripts/clustering.py data/gisaid-human.2.json 
loading lineage classifications from database
loading JSON
filtered 12522 problematic features
dropped 17598 records with uncalled bases in excess of 600
[2020-09-09 22:51:25.229627] start B.4, 391 entries
[0:00:00.000065] generated distance matrix
[0:00:01.372820] built NJ trees
[0:00:01.884189] built consensus tree
[2020-09-09 22:51:27.115067] start A.1, 2102 entries
[0:00:00.000164] generated distance matrix
[0:00:16.466439] built NJ trees
[0:00:29.149093] built consensus tree
[2020-09-09 22:51:56.267855] start A, 733 entries
[0:00:00.001888] generated distance matrix
[0:00:03.235111] built NJ trees
[0:00:04.952541] built consensus tree
[2020-09-09 22:52:01.222126] start B, 2116 entries
[0:00:00.000400] generated distance matrix
[0:00:22.136273] built NJ trees
[0:00:38.917880] built consensus tree
[2020-09-09 22:52:40.146011] start B.2, 1652 entries
[0:00:00.003416] generated distance matrix
[0:00:11.900515] built NJ trees
[0:00:24.812831] built consensus tree

ArtPoon commented Sep 11, 2020

Trying to reproduce with this code:

    print('loading lineage classifications from database')
    lineages = dump_lineages(args.db)

    print('loading JSON')
    features = import_json(args.json)

    by_lineage = split_by_lineage(features, lineages)

    lineage = 'B.1'
    lfeatures = by_lineage[lineage]
    labels = [row['name'] for row in lfeatures]

    trees = []
    counter = 0
    for dists in get_distances(lfeatures, 10):
        phy = rapidnj(dists, labels)
        Phylo.write(phy, file='B.1-tree{}.nwk'.format(counter), format='newick')
        trees.append(phy)
        counter += 1

    contree = majority_consensus(trees, cutoff=0.5)
    Phylo.write(contree, file='{}.nwk'.format(lineage), format='newick')

    sys.exit()

Running into a different recursion error:

[0:00:52.563452] 63000000/77413602
[0:00:54.539143] sym diff matrix complete
[0:00:56.622507] allocated dists matrix
[0:01:54.642581] applied weights
Traceback (most recent call last):
  File "scripts/clustering.py", line 322, in <module>
    Phylo.write(phy, file='B.1-tree{}.nwk'.format(counter), format='newick')
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/_io.py", line 82, in write
    n = getattr(supported_formats[format], 'write')(trees, fp, **kwargs)
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/NewickIO.py", line 58, in write
    return Writer(trees).write(handle, plain=plain, **kwargs)
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/NewickIO.py", line 251, in write
    for treestr in self.to_strings(**kwargs):
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/NewickIO.py", line 289, in to_strings
    rawtree = newickize(tree.root) + ';'
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/NewickIO.py", line 281, in newickize
    return '(%s)%s' % (','.join(subtrees),
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/NewickIO.py", line 280, in <genexpr>
    subtrees = (newickize(sub) for sub in clade)
[... repeated many times ...]
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/NewickIO.py", line 280, in newickize
    subtrees = (newickize(sub) for sub in clade)
RecursionError: maximum recursion depth exceeded while calling a Python object

ArtPoon commented Sep 11, 2020

Could not reproduce the error by running one tree manually:

art@orolo:~/git/covizu$ python3
Python 3.6.9 (default, Jul 17 2020, 12:50:27) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import Phylo
>>> phy = Phylo.read('test.nwk', 'newick')
>>> phy
Tree(rooted=False, weight=1.0)
>>> Phylo
<module 'Bio.Phylo' from '/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/__init__.py'>
>>> outfile = '/tmp/tmpxxm_sg28'
>>> binpath='rapidnj'
>>> cmd = [binpath, outfile, '-i', 'pd']
>>> cmd.append('--no-negative-length')
>>> import subprocess
>>> stdout = subprocess.check_output(cmd, stderr=subprocess.DEVNULL)
>>> stdout[:20]
b"('hCoV-19/England/CA"
>>> from io import StringIO
>>> with StringIO(stdout.decode('utf-8')) as handle:
...   phy = Phylo.read(handle, 'newick')
... 
>>> phy
Tree(rooted=False, weight=1.0)
>>> 

ArtPoon commented Sep 11, 2020

OK, at least I was able to reproduce the recursion error by running the same script, so hopefully it's not stochastic.

ArtPoon commented Sep 14, 2020

This is a known issue with Biopython noted by the nextstrain devs:
nextstrain/augur#328

Simple workaround:

>>> sys.getrecursionlimit()
1000
>>> sys.setrecursionlimit(10000)
>>> Phylo.write(phy, file='test.nwk', format='newick')
1
>>> sys.setrecursionlimit(1000)
>>> Phylo.write(phy, file='test.nwk', format='newick')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/Bio/Phylo/_io.py", line 82, in write

ArtPoon commented Sep 14, 2020

Successful run!

[0:00:52.513348] 63000000/77413602
[0:00:54.486921] sym diff matrix complete
[0:00:56.525685] allocated dists matrix
[0:01:54.666593] applied weights
[0:05:32.339871] applied weights
[0:09:09.574341] applied weights
[0:12:49.922643] applied weights
[0:16:20.204779] applied weights
[0:19:46.601700] applied weights
[0:23:26.567673] applied weights
[0:27:04.533986] applied weights
[0:30:46.346693] applied weights
[0:34:35.218386] applied weights
Traceback (most recent call last):
  File "scripts/clustering.py", line 335, in <module>
    sys.exit()
SystemExit
>>> 

ArtPoon closed this as completed Sep 14, 2020