Restore local database for extracted features #485
Created a temporary

@SandeepThokala to update this issue with the current database schema (tables, fields, etc.)
Database schema for now: (embedded schema snippet not captured in this export; a hypothetical sketch follows)
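Since the schema itself did not survive the export, here is a purely hypothetical sketch of the kind of layout the thread describes: a table keyed by a unique record identifier that stores the extracted feature vector so alignment can be skipped on re-runs. All table and column names are assumptions.

```python
import sqlite3

# Hypothetical sketch only -- the actual schema posted here was not captured.
conn = sqlite3.connect("features.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS records (
        accession TEXT PRIMARY KEY,  -- assumed unique key per record
        name      TEXT,              -- virus name used for lookups
        lineage   TEXT,              -- Pango lineage designation
        features  TEXT               -- serialized feature vector (e.g. JSON)
    )
""")
conn.commit()
```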
Facing the following error if the same test file is used a second time. (error output not captured)

@SandeepThokala - can you please push your updates to the
Reprocessing the exact same input data feed is an edge case (we almost always expect there to be new data in production), so we should just have some exception handling for this situation.
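A minimal sketch of what that exception handling could look like, assuming an SQLite table with a unique key on the record identifier (table and column names carried over from the hypothetical schema above):

```python
import sqlite3

def insert_record(conn, accession, name, features):
    """Insert a record, treating duplicates from a re-processed feed as a no-op."""
    try:
        conn.execute(
            "INSERT INTO records (accession, name, features) VALUES (?, ?, ?)",
            (accession, name, features),
        )
    except sqlite3.IntegrityError:
        # Same feed processed twice: the record is already stored, so the
        # duplicate primary key is expected rather than a fatal error.
        pass
```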
@SandeepThokala when you have the exception handling in place, can you please post some timing results on (1) processing a data feed de novo (i.e., with an empty database), versus (2) reprocessing the same data feed with a full database. In other words, we need to measure the overhead cost of querying the database for sample identifiers.
The test data used has a total of 2,001 records.

Time taken with a full database: (timing output not captured)
@SandeepThokala reports that building the database took about 3 minutes for 2,000 sequences on his machine. Extrapolating that out to 10 million records (factor of 5000), that would be roughly 250 hours or about 10 days.
Would there be any advantage (speed or space efficiency) in switching from SQLite to a more advanced database system like PostgreSQL?
Timing results on
Thanks @SandeepThokala, can you also post timings for rerunning the same data feed with the complete database?
Timing results for rerunning the 1-million-record dataset with the complete database:

(covizu) [sandeep@Paphlagon covizu]$ time python3 batch.py --infile provision.1m.json.xz --dry-run
🏄 [0:00:00.819510] Processing GISAID feed data
🏄 [0:03:59.062307] filtered 0 problematic features
🏄 [0:03:59.062413] 0 genomes with excess missing sites
...
...
0.00 -TreeAnc: set-up
TreeAnc: tree in /tmp/cvz_tt_fv16_7wj as only 1 tips. Please check your tree!
Tree loading/building failed.
No tree -- exiting.
🏄 [0:04:02.186036] Provided records are already stored.
🏄 [0:04:02.279103] Recoding features, compressing variants..
🏄 [0:04:02.279667] start MPI on minor lineages
🏄 [0:04:06.207955] Parsing output files
🏄 [0:04:08.992934] All done!
real 4m10.655s
user 3m47.634s
sys 5m31.654s

When comparing the
Thanks for posting this timing info. Something strange is going on: the treetime step is complaining about only having a single tip in the tree, which implies that nearly all records are being discarded.
@GopiGugan is investigating a NoSQL database to reduce the number of transactions, i.e., insert operations to multiple tables per record.
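A sketch of the one-document-per-record idea: a single upsert replaces the several per-table INSERTs of the relational layout. Database, collection, and field names here are assumptions.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
records = client["covizu"]["records"]  # database/collection names assumed

def store_record(record):
    # One write per record instead of inserts into multiple tables.
    records.update_one(
        {"accession": record["accession"]},  # match on the unique key
        {"$set": record},                    # write the whole record at once
        upsert=True,                         # insert if not already present
    )
```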
Once past the initial step of updating the database with new records from the provision feed, we should be streaming JSON from the database with
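The sentence above is cut off, but one way to stream records lazily from an SQLite store, rather than loading everything into memory, is a generator over the cursor (table and column names are assumptions):

```python
import json
import sqlite3

def stream_features(db_path="features.db"):
    """Yield (name, feature_vector) pairs one row at a time."""
    conn = sqlite3.connect(db_path)
    try:
        for name, features in conn.execute("SELECT name, features FROM records"):
            yield name, json.loads(features)
    finally:
        conn.close()
```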
@GopiGugan estimates the NoSQL database will require about 15 GB to store a JSON of 13 million records.
Timing results on

For the covizu/covizu/utils/gisaid_utils.py snippet (lines 172 to 175 at a1c5f3b): (snippet not captured)
The above timings represent running the entire analysis (extracting feature vectors, calculating distances and inferring clusters). We need a breakdown of the time consumed by the different stages of the analysis, especially the feature extraction (sequence alignment) stage that the local database is supposed to streamline. @SandeepThokala can you please re-run these tests and measure the time required for each stage, or at least the first step? It might be easier to just run the first step.
If you have saved the console output with timing messages, then you should be able to figure out how long it takes to get to the step where features have been extracted from aligned genomes, compressed into unique vectors, and sorted by lineage.
For an

🏄 [0:00:00.833155] Processing GISAID feed data
🏄 [0:00:03.705643] aligned 0 records
🏄 [0:00:04.183092] filtered 1066 problematic features
🏄 [0:00:04.183195] 671 genomes with excess missing sites
🏄 [0:00:04.183209] 163 genomes with excess divergence
🏄 [0:00:04.183306] Parsing Pango lineage designations
🏄 [0:00:05.930795] Identifying lineage representative genomes
🏄 [0:00:06.017177] Reconstructing tree with fasttree2

Console output with

🏄 [0:00:00.918631] Processing GISAID feed data
🏄 [0:00:02.674425] aligned 0 records
🏄 [0:00:02.926458] filtered 1066 problematic features
🏄 [0:00:02.926625] 671 genomes with excess missing sites
🏄 [0:00:02.926641] 163 genomes with excess divergence
🏄 [0:00:02.926736] Parsing Pango lineage designations
🏄 [0:00:04.859742] Identifying lineage representative genomes
🏄 [0:00:04.926775] Reconstructing tree with fasttree2
For an

🏄 [0:00:00.903292] Processing GISAID feed data
🏄 [0:00:03.725789] aligned 0 records
🏄 [0:00:21.246651] filtered 4387 problematic features
🏄 [0:00:21.247484] 5503 genomes with excess missing sites
🏄 [0:00:21.247500] 424 genomes with excess divergence
🏄 [0:00:21.247601] Parsing Pango lineage designations
🏄 [0:00:23.049981] Identifying lineage representative genomes
🏄 [0:00:23.216430] Reconstructing tree with fasttree2

Console output with

🏄 [0:00:00.923452] Processing GISAID feed data
🏄 [0:00:04.967376] aligned 0 records
🏄 [0:00:17.769201] filtered 4387 problematic features
🏄 [0:00:17.769395] 5503 genomes with excess missing sites
🏄 [0:00:17.769414] 424 genomes with excess divergence
🏄 [0:00:17.769542] Parsing Pango lineage designations
🏄 [0:00:19.787453] Identifying lineage representative genomes
🏄 [0:00:19.936859] Reconstructing tree with fasttree2
For an

🏄 [0:00:00.891205] Processing GISAID feed data
🏄 [0:00:03.464269] aligned 0 records
🏄 [0:01:00.020073] aligned 10000 records
🏄 [0:02:34.053302] aligned 20000 records
🏄 [0:04:01.141220] aligned 30000 records
🏄 [0:06:13.659268] aligned 40000 records
🏄 [0:09:36.927096] aligned 50000 records
🏄 [0:11:21.948412] filtered 68125 problematic features
🏄 [0:11:21.948540] 38432 genomes with excess missing sites
🏄 [0:11:21.948553] 6358 genomes with excess divergence
🏄 [0:11:21.948665] Parsing Pango lineage designations
🏄 [0:11:23.708734] Identifying lineage representative genomes
🏄 [0:11:25.330877] Reconstructing tree with fasttree2

Console output with

🏄 [0:00:00.760735] Processing GISAID feed data
🏄 [0:00:34.837885] aligned 0 records
🏄 [0:06:27.779398] aligned 10000 records
🏄 [0:12:10.327561] aligned 20000 records
🏄 [0:16:20.266935] aligned 30000 records
🏄 [0:21:52.947085] aligned 40000 records
🏄 [0:27:49.226798] aligned 50000 records
🏄 [0:29:35.536452] filtered 68125 problematic features
🏄 [0:29:35.536590] 38432 genomes with excess missing sites
🏄 [0:29:35.536609] 6358 genomes with excess divergence
🏄 [0:29:35.536694] Parsing Pango lineage designations
🏄 [0:29:37.601106] Identifying lineage representative genomes
🏄 [0:29:39.239793] Reconstructing tree with fasttree2
Trying to get results for 1 million sequences on an empty database. Still running after:

🏄 [8:47:48.732020] aligned 300000 records
@SandeepThokala can you make a plot summarizing your timing results (time versus number of sequences)? Also, I need you to analyze why re-running from a full database is not faster.
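A sketch of the requested summary plot. The values below are placeholders loosely read off the console timestamps above (which series corresponds to the empty versus full database was truncated in this export), so they should be replaced with the actual measurements:

```python
import matplotlib.pyplot as plt

n_records = [2000, 10000, 50000]  # dataset sizes (placeholders)
run_1 = [6.0, 23.2, 685.3]        # seconds to reach fasttree2, first run (placeholders)
run_2 = [4.9, 19.9, 1779.2]       # seconds to reach fasttree2, second run (placeholders)

plt.plot(n_records, run_1, marker="o", label="first run")
plt.plot(n_records, run_2, marker="s", label="re-run")
plt.xlabel("number of sequences")
plt.ylabel("time to extract features (seconds)")
plt.legend()
plt.savefig("timing_results.png", dpi=150)
```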
We're using the
Currently working on analyzing why the re-run from a full database is not faster.
Sequentially searching for every record by virus name (line 120), I think, is causing the delay.

covizu/covizu/utils/gisaid_utils.py, lines 116 to 121 at a1c5f3b: (snippet not captured)
So, I tried adding an index on the
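The column name is cut off above; assuming the lookups are on the virus-name column, the index would look something like:

```python
# Index the column used for the per-record lookups so each query is a
# B-tree search instead of a full table scan ("name" is an assumption).
conn.execute("CREATE INDEX IF NOT EXISTS idx_records_name ON records (name)")
conn.commit()
```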
It's surprising to me that there is not much difference in computing time between the empty and full database scenarios. Are you sure that minimap2 is being run when a record is not found in the database?
@SandeepThokala can you please add timings for running this part of the pipeline WITHOUT the database (aligning everything with minimap2)?
We also need to figure out why running with an empty database is giving us new alignments "for free" (at no additional cost to computing time). This does not make sense to me! The other possibility is that retrieving the alignment results ("feature vectors") from the database for each record is taking so much time that it is roughly equivalent in cost to re-aligning each sequence de novo!
Captured the time taken to run the following function (lines 106 to 116 at 0254ac4): (snippet not captured)
MongoDB results:

Empty database: (timing not captured)

Populated database: (timing not captured)
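As a side note, the per-function timings could be captured with a small decorator, matching the "Function ... took ... seconds to execute" lines that appear in the console output further down (whether the actual measurement was done this way is an assumption):

```python
import time
from functools import wraps

def timed(func):
    """Report the wall-clock time of each call to the wrapped function."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        elapsed = time.time() - start
        print(f"Function {func.__name__} took {elapsed:.4f} seconds to execute")
        return result
    return wrapper

# usage: decorate the function under test, e.g.
# @timed
# def sort_by_lineage(...): ...
```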
I think we might need to be more aggressive about reducing data processing and filter the provisioned JSON stream by record submission date (i.e., cutoff by the last CoVizu run date). This bypasses database transactions, and it also gives us a way of tracking which lineages have been updated (see #493). The downside of this approach is that any records that have been removed from the database would still be in our data set.
@SandeepThokala can you give this a try with your test data please?
There is a possibility that new records in the database have submission dates in the past, to the degree that they are earlier than our last processing date. We could handle this by extending our date cutoff by a week or more. We could also check records in this extended range for accession numbers.
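A sketch of that cutoff-with-grace-window idea. The GISAID field names and the seen_accessions lookup are assumptions:

```python
from datetime import date, timedelta

def filter_new_records(records, last_run, seen_accessions, slack_days=7):
    """Yield records worth processing: anything submitted since the last run,
    plus records in a grace window whose accessions we have not seen before."""
    cutoff = last_run - timedelta(days=slack_days)
    for record in records:
        submitted = date.fromisoformat(record["covv_subm_date"])  # field name assumed
        if submitted >= last_run:
            yield record  # definitely new
        elif submitted >= cutoff and record["covv_accession_id"] not in seen_accessions:
            yield record  # late submission we have not processed yet
```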
Using dataset with 2,000 records (dev.2000.json.xz)

I selected a date (2022-01-20) so that approximately 600 of the 2,000 sequences have an older submission date.

(covizu) [sandeep@Paphlagon covizu]$ python3 batch.py --infile dev.2000.json.xz --mindate 2022-01-20 --dry-run
🏄 [0:00:00.878787] Processing GISAID feed data
🏄 [0:00:02.926217] aligned 0 records
🏄 [0:00:03.350381] filtered 612 problematic features
🏄 [0:00:03.350517] 493 genomes with excess missing sites
🏄 [0:00:03.350567] 142 genomes with excess divergence
Function sort_by_lineage took 2.4718 seconds to execute
🏄 [0:00:03.350973] Parsing Pango lineage designations
(covizu) [sandeep@Paphlagon covizu]$ python3 batch.py --infile dev.2000.json.xz --mindate 2022-01-20 --dry-run
🏄 [0:00:00.761366] Processing GISAID feed data
🏄 [0:00:01.507638] aligned 0 records
🏄 [0:00:01.682629] filtered 612 problematic features
🏄 [0:00:01.682715] 493 genomes with excess missing sites
🏄 [0:00:01.682726] 142 genomes with excess divergence
Function sort_by_lineage took 0.9213 seconds to execute
🏄 [0:00:01.683012] Parsing Pango lineage designations
Using dataset with 10,000 records (dev.10000.json.xz)

I selected a date (2022-01-29) so that approximately 3,000 of the 10,000 sequences have an older submission date.

(covizu) [sandeep@Paphlagon covizu]$ python3 batch.py --infile dev.10000.json.xz --mindate 2022-01-29 --dry-run
🏄 [0:00:00.923995] Processing GISAID feed data
🏄 [0:00:04.266356] aligned 0 records
🏄 [0:00:12.119832] filtered 3432 problematic features
🏄 [0:00:12.119925] 3820 genomes with excess missing sites
🏄 [0:00:12.119941] 356 genomes with excess divergence
Function sort_by_lineage took 11.1960 seconds to execute
🏄 [0:00:12.120137] Parsing Pango lineage designations
(covizu) [sandeep@Paphlagon covizu]$ python3 batch.py --infile dev.10000.json.xz --mindate 2022-01-29 --dry-run
🏄 [0:00:01.085012] Processing GISAID feed data
🏄 [0:00:02.514592] aligned 0 records
🏄 [0:00:05.223317] filtered 3432 problematic features
🏄 [0:00:05.223389] 3820 genomes with excess missing sites
🏄 [0:00:05.223402] 356 genomes with excess divergence
Function sort_by_lineage took 4.1384 seconds to execute
🏄 [0:00:05.223610] Parsing Pango lineage designations
@SandeepThokala can you please make sure that the outputs (
Facing the following error at line 160 when providing no new sequences by specifying a recent mindate.

(covizu) [sandeep@Paphlagon covizu]$ nohup python3 batch.py --infile dev.2000.json.xz --mindate 2023-11-26
Traceback (most recent call last):
File "/home/sandeep/Documents/covizu/batch.py", line 158, in <module>
timetree, residuals = build_timetree(by_lineage, args, cb.callback)
File "/home/sandeep/Documents/covizu/covizu/utils/batch_utils.py", line 87, in build_timetree
return covizu.treetime.parse_nexus(nexus_file, fasta)
File "/home/sandeep/Documents/covizu/covizu/treetime.py", line 160, in parse_nexus
rmean = statistics.mean(rvals)
File "/home/sandeep/miniconda3/envs/covizu/lib/python3.10/statistics.py", line 328, in mean
raise StatisticsError('mean requires at least one data point')
statistics.StatisticsError: mean requires at least one data point

Lines 158 to 163 at 4ab1ba4: (snippet not captured)
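A minimal sketch of a guard for this crash, patched around the mean computation in covizu/treetime.py: with no new sequences there are no rate values to average, so bail out before calling statistics.mean() on an empty list (the early-exit behaviour is an assumption about what the pipeline should do; rvals comes from the surrounding parse_nexus code):

```python
import statistics
import sys

if not rvals:
    # No new sequences means no clock-rate values to average; exit cleanly
    # instead of raising StatisticsError from statistics.mean([]).
    sys.exit("No new records to process; nothing to update.")
rmean = statistics.mean(rvals)
```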
When a recent mindate is specified: (output not captured)

When a different outdir is specified along with a recent mindate: (output not captured)
During the initial run, the 'missing' data is stored using lists, whereas on the subsequent run, it is stored using tuples.
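A sketch of one way to make the two runs comparable: coerce the intervals to a single representation before storing or comparing (the exact structure of 'missing' is an assumption):

```python
def normalize_missing(missing):
    """Coerce 'missing' intervals to tuples so values deserialized from the
    database (JSON lists) compare equal to freshly computed tuples."""
    return [tuple(interval) for interval in missing]
```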
The previous results were inaccurate due to my lack of experience with git stashing while switching between the

These are the final results: (results not captured)
Thanks, that's much better. Looks like for the lookups that we are doing, we should be sticking to SQL.
File size for the database vs. the number of records: (table not captured)
Pending merge into
We've been burning a lot of CPU re-aligning millions of sequences every time we process a new provision file. Even though this is pretty fast with minimap2, there are a lot of sequences! For the sake of our hardware, we should think about bringing back the database scheme implemented by @ewong347 ages ago. This is the general idea (see the sketch below):
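The list spelling out the idea did not survive this export, but based on the discussion above, a minimal sketch of the lookup-else-align flow (table/column names, the GISAID field name, and the align_and_extract helper are all assumptions):

```python
import json

def get_features(record, conn, aligner):
    """Return the feature vector for a record, aligning only on a cache miss."""
    name = record["covv_virus_name"]  # field name assumed
    row = conn.execute(
        "SELECT features FROM records WHERE name = ?", (name,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # cached feature vector, no minimap2 call
    features = align_and_extract(record, aligner)  # hypothetical minimap2 wrapper
    conn.execute(
        "INSERT INTO records (name, features) VALUES (?, ?)",
        (name, json.dumps(features)),
    )
    conn.commit()
    return features
```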
This should be implemented in an experimental branch.