Move backend processing back from Paphlagon to BEVi #499

Closed
ArtPoon opened this issue Dec 7, 2023 · 21 comments

@ArtPoon
Contributor

ArtPoon commented Dec 7, 2023

Now that I've done the RAM upgrade, we should restore the data processing workflow on the cluster.

@ArtPoon
Contributor Author

ArtPoon commented Dec 12, 2023

We'll need to install R package dependencies to handle the number of infections model, which hasn't been run on BEVi before.

@GopiGugan
Contributor

[gopigugan@BEVi ~]$ python3 --version
Python 3.6.8
[gopigugan@BEVi ~]$ R --version
R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.

We are currently running older versions of Python and R. Should these be updated on BEVi?

@ArtPoon
Contributor Author

ArtPoon commented Jan 9, 2024

My subscription for the clusterware system has expired, so I cannot use the package manager to update Python on BEVi. The easiest workaround would probably be a local installation of a newer version of Python (i.e., into /usr/local/bin), making sure that it is in the $PATH. The same approach should work to get R up to version 4.0.
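
One way to confirm that the local installations are the ones found first on $PATH is to resolve each command and parse its version banner. This is only a minimal sketch: the helper names parse_version and resolved_version are hypothetical and not part of the pipeline.

```python
import re
import shutil
import subprocess


def parse_version(banner):
    """Return the first dotted version number in a banner as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", banner)
    if match is None:
        raise ValueError("no version number found in: %r" % banner)
    return tuple(int(part) for part in match.groups() if part is not None)


def resolved_version(command):
    """Locate `command` on $PATH and return (path, version), or None if absent."""
    path = shutil.which(command)
    if path is None:
        return None
    # Some tools print their banner to stderr, so merge both streams
    result = subprocess.run([path, "--version"], stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    return path, parse_version(result.stdout)
```

Calling resolved_version("python3") and resolved_version("R") after the install should report paths under /usr/local if the $PATH ordering is correct.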

@ArtPoon
Contributor Author

ArtPoon commented Jan 16, 2024

  • Updated R to version 4.3 at /usr/local
  • Updated Python to version 3.11 at /usr/local
  • Ran into SSL certificate issues when downloading packages

@ArtPoon
Contributor Author

ArtPoon commented Jan 16, 2024

@GopiGugan to run some tests on BEVi before we switch back over

@GopiGugan
Contributor

GopiGugan commented Jan 21, 2024

Running into issues installing the tidyquant R package on BEVi:

[gopigugan@BEVi ~]$ R -e "install.packages('tidyquant',dependencies=TRUE, repos='http://cran.rstudio.com/')"
...
ERROR: dependency ‘textshaping’ is not available for package ‘ragg’
...
ERROR: dependency ‘ragg’ is not available for package ‘tidyverse’
...
ERROR: dependency ‘tidyverse’ is not available for package ‘tidyquant’

It looks like some package dependencies are not available under R version 4.3.2. We are currently using version 4.2.2 on Paphlagon.

Downgrading R from version 4.3.2 to 4.2.2 on BEVi.

@GopiGugan
Contributor

GopiGugan commented Jan 28, 2024

  • gcc v4.8.5 is the latest version available through the package manager
  • R package tidyquant has dependencies that fail to install because gcc is too old. Compiled gcc version 11.2.0 locally (/usr/local), but gfortran was missing. Reinstalling gcc with the following:
    • ./configure --prefix=/usr/local --enable-languages=c,c++,fortran --disable-multilib
    • make
    • make install

@GopiGugan
Contributor

Successfully installed the R packages. Now running into an error when installing the rpy2 package:

# pip3 install .
Processing /home/gopigugan/rpy2-RELEASE_3_5_14
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [45 lines of output]
      R was not built as a library
      /home/gopigugan/rpy2-RELEASE_3_5_14/./rpy2/situation.py:335: UserWarning: No libraries as -l arguments to the compiler.
        warnings.warn('No libraries as -l arguments to the compiler.')
      R was not built as a library
      /home/gopigugan/rpy2-RELEASE_3_5_14/./rpy2/situation.py:322: UserWarning: No include specified
        warnings.warn('No include specified')
      /tmp/tmp_pw_r_7nwyffu6/test_pw_r.c:1:10: fatal error: Rinterface.h: No such file or directory
          1 | #include <Rinterface.h>

@GopiGugan
Contributor

Issue seems to be the following: R was not built as a library

Reinstalling R version 4.2.2 with the --enable-R-shlib option:

make clean
./configure --prefix=/usr/local --enable-R-shlib
make
make install

@GopiGugan
Contributor

rpy2 installed successfully, but importing it raises an error:

# python3
Python 3.11.3 (main, Jan 16 2024, 01:12:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from rpy2.robjects import pandas2ri
Error in glue(.Internal(R.home()), "library", "base", "R", "base", sep = .Platform$file.sep) :
  4 arguments passed to .Internal(paste) which requires 3
Error: could not find function "attach"
Error: object '.ArgsEnv' not found
Fatal error: unable to initialize the JIT

Had to set the following variable to resolve the error: export LD_LIBRARY_PATH="$(python3 -m rpy2.situation LD_LIBRARY_PATH)":${LD_LIBRARY_PATH}
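
As far as I know, the dynamic loader reads LD_LIBRARY_PATH when a process starts, so changing it inside an already running interpreter is generally too late for the rpy2 import. A hedged alternative to exporting it in every shell is a small launcher that starts the pipeline in a child process with the adjusted environment. The function names below are hypothetical; only the rpy2.situation call and batch.py come from this thread.

```python
import os
import subprocess
import sys


def with_library_path(env, extra, var="LD_LIBRARY_PATH"):
    """Return a copy of `env` with `extra` prepended to the search path `var`."""
    updated = dict(env)
    current = updated.get(var, "")
    updated[var] = extra + os.pathsep + current if current else extra
    return updated


def rpy2_library_path():
    """Ask rpy2 where libR lives (same command as the shell export above)."""
    result = subprocess.run(
        [sys.executable, "-m", "rpy2.situation", "LD_LIBRARY_PATH"],
        stdout=subprocess.PIPE, universal_newlines=True, check=True)
    return result.stdout.strip()


def run_pipeline(args):
    """Launch batch.py in a child process that inherits the adjusted path."""
    env = with_library_path(os.environ, rpy2_library_path())
    return subprocess.run([sys.executable, "batch.py"] + list(args), env=env)
```

For example, run_pipeline(["--dry-run"]) would be the launcher equivalent of exporting the variable and then running python3 batch.py --dry-run.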

@GopiGugan
Contributor

Pipeline ran successfully with the test data file:

[covizu@BEVi covizu]$ python3 batch.py --dry-run --infile dev.2000.json.xz
🏄 [0:00:01.038814] Processing GISAID feed data
🏄 [0:00:03.346096] aligned 0 records
🏄 [0:00:03.430148] filtered 1066 problematic features
🏄 [0:00:03.430193]          671 genomes with excess missing sites
🏄 [0:00:03.430204]          163 genomes with excess divergence
🏄 [0:00:03.430838] Parsing Pango lineage designations
🏄 [0:00:05.122239] Identifying lineage representative genomes
🏄 [0:00:05.185900] Reconstructing tree with fasttree2
FastTree Version 2.1.11 Double precision (No SSE3)
...
🏄 [0:01:58.282415][5/56] starting BA.2.1
🏄 [0:02:04.622877][0/56] starting BF.7.5
🏄 [0:02:04.949943][0/56] starting BA.5.1.3
🏄 [0:02:05.022366] Parsing output files
R[write to console]: In addition:
R[write to console]: There were 50 or more warnings (use warnings() to see the first 50)
R[write to console]:

R[write to console]: In addition:
R[write to console]: There were 50 or more warnings (use warnings() to see the first 50)
R[write to console]:

🏄 [0:03:30.864626] All done!

Initiated a dry run to verify there are no issues: nohup python3 batch.py --dry-run > ~/iss499.log &

@ArtPoon
Contributor Author

ArtPoon commented Feb 6, 2024

@GopiGugan reports a successful run

@ArtPoon
Contributor Author

ArtPoon commented Feb 13, 2024

  • where are we going to store the database?
  • BEVi has a couple of RAIDs:
    • /dev/md126 is mounted at /home and is a RAID1 of two drives for 1.8TB storage (790GB currently available)
    • /dev/md127 is mounted at /data and is a RAID5 of four drives for 11TB storage (9.8 TB currently available)
  • I don't think there would be a latency difference in writing to one RAID versus another, other than the performance hit of different formats.

@ArtPoon
Contributor Author

ArtPoon commented Feb 13, 2024

  • current estimate for database size is on the order of 10 GB
  • I think we can get away with storing it on /home for now for the improved write performance of RAID1 over RAID5
  • can do database dump backups to /data

@ArtPoon
Contributor Author

ArtPoon commented Feb 27, 2024

Obviously this is on hold until we can get the damn cluster back online (#516)

@GopiGugan
Contributor

Currently building database on BEVi (#493, #485)

@GopiGugan
Contributor

Investigating a KeyError while building the database:

Traceback (most recent call last):
  File "/home/covizu/covizu/batch.py", line 250, in <module>
    by_lineage = process_feed(args, cur, cb.callback)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/covizu/covizu/batch.py", line 179, in process_feed
    return gisaid_utils.sort_by_lineage(filtered, callback=callback)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/covizu/covizu/covizu/utils/gisaid_utils.py", line 277, in sort_by_lineage
    for i, record in enumerate(records):
  File "/home/covizu/covizu/covizu/utils/gisaid_utils.py", line 220, in filter_problematic
    for record in records:
  File "/home/covizu/covizu/covizu/utils/gisaid_utils.py", line 179, in extract_features
    record = new_records[qname]
             ~~~~~~~~~~~^^^^^^^
KeyError: 'hCoV-19/South'

@GopiGugan
Contributor

The issue is that when the virus name (qname) contains a space, e.g. hCoV-19/South Africa/....., it gets cut off at the space in the minimap2 output, so the qname parsed here is only hCoV-19/South:

for line in output:
    if line == '' or line.startswith('@'):
        # split on \n leaves empty line; @ prefix header lines
        continue
    qname, flag, rname, rpos, _, cigar, _, _, _, seq = \
        line.strip('\n').split('\t')[:10]

So the pipeline was failing when it tried to retrieve a record by the truncated qname:

for qname, diffs, missing in result:
    # reconcile minimap2 output with GISAID record
    record = new_records[qname]
    record.update({'diffs': diffs, 'missing': missing})
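
To illustrate the truncation and one possible workaround (not necessarily the fix adopted in the repo), the sketch below mimics cutting the query name at the first whitespace, then keys records by accession instead. The field names covv_virus_name, covv_accession_id and sequence are assumptions about the record layout, and the records themselves are made up.

```python
def fasta_key(header):
    """Mimic how aligners name a query: text up to the first whitespace."""
    return header.split(None, 1)[0]


def index_by_accession(records):
    """Key records by accession, which is assumed unique and free of spaces."""
    return {record["covv_accession_id"]: record for record in records}


def to_fasta(records):
    """Write accessions, not virus names, as FASTA headers for the aligner."""
    return "".join(
        ">{}\n{}\n".format(record["covv_accession_id"], record["sequence"])
        for record in records
    )


# Hypothetical records for illustration only
records = [
    {"covv_virus_name": "hCoV-19/South Africa/EXAMPLE-1/2021",
     "covv_accession_id": "EPI_ISL_0000001", "sequence": "ACGT"},
    {"covv_virus_name": "hCoV-19/Canada/EXAMPLE-2/2021",
     "covv_accession_id": "EPI_ISL_0000002", "sequence": "ACGA"},
]

# The virus name loses everything after the space; the accession survives
truncated = fasta_key(records[0]["covv_virus_name"])
by_accession = index_by_accession(records)
```

With accessions as headers, the name that comes back from the aligner can be looked up directly in by_accession, regardless of spaces in the virus name.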

The pipeline is also failing because we retrieve and insert database records by qname instead of by accession ID, and qname is not unique:

if cur:
    cur.execute("SELECT * FROM SEQUENCES WHERE qname = '%s'"%qname)
    result = cur.fetchone()

record = new_records[qname]
record.update({'diffs': diffs, 'missing': missing})
if cur:
    # inserting diffs and missing as json strings
    cur.execute("INSERT INTO SEQUENCES VALUES(%s, %s, %s, %s, %s, %s, %s)",
                [json.dumps(v) if k in ['diffs', 'missing'] else v for k, v in record.items()])
    cur.execute("INSERT INTO NEW_RECORDS VALUES(%s, %s)",
                [qname, record['covv_lineage']])
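
A sketch of what keying on the accession could look like, with the lookup value bound as a query parameter rather than formatted into the SQL string (which also avoids breakage on names containing quote characters). It uses the standard-library sqlite3 module so that it runs anywhere; the actual database driver is not shown in this thread, and its placeholder syntax may be %s rather than ?. The schema, column names and records are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Hypothetical schema: accession is the primary key, qname may repeat
cur.execute("CREATE TABLE SEQUENCES (accession TEXT PRIMARY KEY, qname TEXT, "
            "lineage TEXT, diffs TEXT, missing TEXT)")


def insert_record(cur, record):
    """Insert one record; values are bound by the driver, never interpolated."""
    cur.execute(
        "INSERT INTO SEQUENCES VALUES (?, ?, ?, ?, ?)",
        (record["accession"], record["qname"], record["lineage"],
         json.dumps(record["diffs"]), json.dumps(record["missing"])))


def fetch_record(cur, accession):
    """Look up a record by its unique accession rather than by qname."""
    cur.execute("SELECT * FROM SEQUENCES WHERE accession = ?", (accession,))
    return cur.fetchone()


# Two hypothetical records sharing one qname, which contains a quote character
shared = "hCoV-19/O'Example/1/2021"
insert_record(cur, {"accession": "EPI_ISL_0000001", "qname": shared,
                    "lineage": "BA.2.1", "diffs": [], "missing": []})
insert_record(cur, {"accession": "EPI_ISL_0000002", "qname": shared,
                    "lineage": "BF.7.5", "diffs": [], "missing": []})
row = fetch_record(cur, "EPI_ISL_0000002")
```

Both rows insert cleanly despite the duplicate qname, and the lookup returns exactly one of them.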

@ArtPoon
Contributor Author

ArtPoon commented Apr 16, 2024

Let's write database dumps to the filesystem on the following basis:

  • weekly (with every run)
  • for dumps older than 3 months, erase 3 out of every 4 weekly dumps (i.e., retain only monthly dumps once they are more than 3 months old)
  • in the long run (?), retain only quarterly dumps beyond 3 years
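
One reading of this retention schedule, sketched as a function that decides which dump dates to erase. The thresholds (90 days for "3 months", 3 × 365 days for "3 years") and the choice to retain the first dump of each month or quarter are assumptions, and the function names are hypothetical.

```python
from datetime import date, timedelta


def dumps_to_keep(dump_dates, today):
    """Return the set of dump dates retained under the tiered policy."""
    recent_cutoff = today - timedelta(days=90)        # assumed "3 months"
    monthly_cutoff = today - timedelta(days=3 * 365)  # assumed "3 years"
    keep = set()
    seen_buckets = set()
    for dumped in sorted(dump_dates):
        if dumped > recent_cutoff:
            keep.add(dumped)  # weekly dumps are all retained
            continue
        if dumped > monthly_cutoff:
            bucket = ("month", dumped.year, dumped.month)
        else:
            bucket = ("quarter", dumped.year, (dumped.month - 1) // 3)
        if bucket not in seen_buckets:
            seen_buckets.add(bucket)  # first dump of the period is retained
            keep.add(dumped)
    return keep


def dumps_to_delete(dump_dates, today):
    """Return the dump dates that the policy allows to be erased, sorted."""
    return sorted(set(dump_dates) - dumps_to_keep(dump_dates, today))
```

A cleanup script could map each dump file to its date, call dumps_to_delete, and remove the corresponding files.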

@ArtPoon
Contributor Author

ArtPoon commented Apr 23, 2024

@GopiGugan is testing a script for clearing out expired logs

@ArtPoon
Contributor Author

ArtPoon commented Apr 30, 2024

@GopiGugan to push the cleanup script to the repo and close
