Restore local database for extracted features #485
Created a temporary

@SandeepThokala to update this issue with the current database schema (tables, fields, etc.)
Database schema for now: (embedded schema snippet not captured in this export; a hypothetical sketch follows)
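Since the schema itself did not survive the export, here is a purely hypothetical sketch of the kind of layout the thread describes: a table keyed by a unique record identifier that stores the extracted feature vector so alignment can be skipped on re-runs. All table and column names are assumptions.

```python
import sqlite3

# Hypothetical sketch only -- the actual schema posted here was not captured.
conn = sqlite3.connect("features.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS records (
        accession TEXT PRIMARY KEY,  -- assumed unique key per record
        name      TEXT,              -- virus name used for lookups
        lineage   TEXT,              -- Pango lineage designation
        features  TEXT               -- serialized feature vector (e.g. JSON)
    )
""")
conn.commit()
```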
Facing the following error if the same test file is used a second time. (error output not captured)

@SandeepThokala - can you please push your updates to the
Reprocessing the exact same input data feed is an edge case (we almost always expect there to be new data in production), so we should just have some exception handling for this situation.
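A minimal sketch of what that exception handling could look like, assuming an SQLite table with a unique key on the record identifier (table and column names carried over from the hypothetical schema above):

```python
import sqlite3

def insert_record(conn, accession, name, features):
    """Insert a record, treating duplicates from a re-processed feed as a no-op."""
    try:
        conn.execute(
            "INSERT INTO records (accession, name, features) VALUES (?, ?, ?)",
            (accession, name, features),
        )
    except sqlite3.IntegrityError:
        # Same feed processed twice: the record is already stored, so the
        # duplicate primary key is expected rather than a fatal error.
        pass
```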
@SandeepThokala when you have the exception handling in place, can you please post some timing results on (1) processing a data feed de novo (i.e., with an empty database), versus (2) reprocessing the same data feed with a full database. In other words, we need to measure the overhead cost of querying the database for sample identifiers.
The test data used has a total of 2,001 records.

Time taken with a full database: (timing output not captured)
@SandeepThokala reports that building the database took about 3 minutes for 2,000 sequences on his machine. Extrapolating that out to 10 million records (factor of 5000), that would be roughly 250 hours or about 10 days.
Would there be any advantage (speed or space efficiency) in switching from SQLite to a more advanced database system like PostgreSQL?
Timing results on
Thanks @SandeepThokala, can you also post timings for rerunning the same data feed with the complete database?
Timing results for rerunning the 1-million-record dataset with the complete database:

(covizu) [sandeep@Paphlagon covizu]$ time python3 batch.py --infile provision.1m.json.xz --dry-run
🏄 [0:00:00.819510] Processing GISAID feed data
🏄 [0:03:59.062307] filtered 0 problematic features
🏄 [0:03:59.062413] 0 genomes with excess missing sites
...
...
0.00 -TreeAnc: set-up
TreeAnc: tree in /tmp/cvz_tt_fv16_7wj as only 1 tips. Please check your tree!
Tree loading/building failed.
No tree -- exiting.
🏄 [0:04:02.186036] Provided records are already stored.
🏄 [0:04:02.279103] Recoding features, compressing variants..
🏄 [0:04:02.279667] start MPI on minor lineages
🏄 [0:04:06.207955] Parsing output files
🏄 [0:04:08.992934] All done!
real 4m10.655s
user 3m47.634s
sys 5m31.654s

When comparing the
Thanks for posting this timing info. Something strange is going on: the treetime step is complaining about only having a single tip in the tree, which implies that nearly all records are being discarded.
@GopiGugan is investigating a NoSQL database to reduce the number of transactions, i.e., insert operations to multiple tables per record.
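A sketch of the one-document-per-record idea: a single upsert replaces the several per-table INSERTs of the relational layout. Database, collection, and field names here are assumptions.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
records = client["covizu"]["records"]  # database/collection names assumed

def store_record(record):
    # One write per record instead of inserts into multiple tables.
    records.update_one(
        {"accession": record["accession"]},  # match on the unique key
        {"$set": record},                    # write the whole record at once
        upsert=True,                         # insert if not already present
    )
```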
Once past the initial step of updating the database with new records from the provision feed, we should be streaming JSON from the database with
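The sentence above is cut off, but one way to stream records lazily from an SQLite store, rather than loading everything into memory, is a generator over the cursor (table and column names are assumptions):

```python
import json
import sqlite3

def stream_features(db_path="features.db"):
    """Yield (name, feature_vector) pairs one row at a time."""
    conn = sqlite3.connect(db_path)
    try:
        for name, features in conn.execute("SELECT name, features FROM records"):
            yield name, json.loads(features)
    finally:
        conn.close()
```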
@GopiGugan estimates the NoSQL database will require about 15 GB to store a JSON of 13 million records.
Timing results on

For the covizu/covizu/utils/gisaid_utils.py snippet (lines 172 to 175 at a1c5f3b): (snippet not captured)
The above timings represent running the entire analysis (extracting feature vectors, calculating distances and inferring clusters). We need a breakdown of the time consumed by the different stages of the analysis, especially the feature extraction (sequence alignment) stage that the local database is supposed to streamline. @SandeepThokala can you please re-run these tests and measure the time required for each stage, or at least the first step? It might be easier to just run the first step.
If you have saved the console output with timing messages, then you should be able to figure out how long it takes to get to the step where features have been extracted from aligned genomes, compressed into unique vectors, and sorted by lineage.
For an

🏄 [0:00:00.833155] Processing GISAID feed data
🏄 [0:00:03.705643] aligned 0 records
🏄 [0:00:04.183092] filtered 1066 problematic features
🏄 [0:00:04.183195] 671 genomes with excess missing sites
🏄 [0:00:04.183209] 163 genomes with excess divergence
🏄 [0:00:04.183306] Parsing Pango lineage designations
🏄 [0:00:05.930795] Identifying lineage representative genomes
🏄 [0:00:06.017177] Reconstructing tree with fasttree2

Console output with

🏄 [0:00:00.918631] Processing GISAID feed data
🏄 [0:00:02.674425] aligned 0 records
🏄 [0:00:02.926458] filtered 1066 problematic features
🏄 [0:00:02.926625] 671 genomes with excess missing sites
🏄 [0:00:02.926641] 163 genomes with excess divergence
🏄 [0:00:02.926736] Parsing Pango lineage designations
🏄 [0:00:04.859742] Identifying lineage representative genomes
🏄 [0:00:04.926775] Reconstructing tree with fasttree2
For an

🏄 [0:00:00.903292] Processing GISAID feed data
🏄 [0:00:03.725789] aligned 0 records
🏄 [0:00:21.246651] filtered 4387 problematic features
🏄 [0:00:21.247484] 5503 genomes with excess missing sites
🏄 [0:00:21.247500] 424 genomes with excess divergence
🏄 [0:00:21.247601] Parsing Pango lineage designations
🏄 [0:00:23.049981] Identifying lineage representative genomes
🏄 [0:00:23.216430] Reconstructing tree with fasttree2

Console output with

🏄 [0:00:00.923452] Processing GISAID feed data
🏄 [0:00:04.967376] aligned 0 records
🏄 [0:00:17.769201] filtered 4387 problematic features
🏄 [0:00:17.769395] 5503 genomes with excess missing sites
🏄 [0:00:17.769414] 424 genomes with excess divergence
🏄 [0:00:17.769542] Parsing Pango lineage designations
🏄 [0:00:19.787453] Identifying lineage representative genomes
🏄 [0:00:19.936859] Reconstructing tree with fasttree2
For an

🏄 [0:00:00.891205] Processing GISAID feed data
🏄 [0:00:03.464269] aligned 0 records
🏄 [0:01:00.020073] aligned 10000 records
🏄 [0:02:34.053302] aligned 20000 records
🏄 [0:04:01.141220] aligned 30000 records
🏄 [0:06:13.659268] aligned 40000 records
🏄 [0:09:36.927096] aligned 50000 records
🏄 [0:11:21.948412] filtered 68125 problematic features
🏄 [0:11:21.948540] 38432 genomes with excess missing sites
🏄 [0:11:21.948553] 6358 genomes with excess divergence
🏄 [0:11:21.948665] Parsing Pango lineage designations
🏄 [0:11:23.708734] Identifying lineage representative genomes
🏄 [0:11:25.330877] Reconstructing tree with fasttree2

Console output with

🏄 [0:00:00.760735] Processing GISAID feed data
🏄 [0:00:34.837885] aligned 0 records
🏄 [0:06:27.779398] aligned 10000 records
🏄 [0:12:10.327561] aligned 20000 records
🏄 [0:16:20.266935] aligned 30000 records
🏄 [0:21:52.947085] aligned 40000 records
🏄 [0:27:49.226798] aligned 50000 records
🏄 [0:29:35.536452] filtered 68125 problematic features
🏄 [0:29:35.536590] 38432 genomes with excess missing sites
🏄 [0:29:35.536609] 6358 genomes with excess divergence
🏄 [0:29:35.536694] Parsing Pango lineage designations
🏄 [0:29:37.601106] Identifying lineage representative genomes
🏄 [0:29:39.239793] Reconstructing tree with fasttree2
Trying to get results for 1 million sequences on an empty database. Still running after:

🏄 [8:47:48.732020] aligned 300000 records
@SandeepThokala can you make a plot summarizing your timing results (time versus number of sequences)? Also, I need you to analyze why re-running from a full database is not faster.
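A sketch of the requested summary plot. The values below are placeholders loosely read off the console timestamps above (which series corresponds to the empty versus full database was truncated in this export), so they should be replaced with the actual measurements:

```python
import matplotlib.pyplot as plt

n_records = [2000, 10000, 50000]  # dataset sizes (placeholders)
run_1 = [6.0, 23.2, 685.3]        # seconds to reach fasttree2, first run (placeholders)
run_2 = [4.9, 19.9, 1779.2]       # seconds to reach fasttree2, second run (placeholders)

plt.plot(n_records, run_1, marker="o", label="first run")
plt.plot(n_records, run_2, marker="s", label="re-run")
plt.xlabel("number of sequences")
plt.ylabel("time to extract features (seconds)")
plt.legend()
plt.savefig("timing_results.png", dpi=150)
```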
We're using the
Currently working on analyzing why the re-run from a full database is not faster.
Sequentially searching for every record by virus name (line 120), I think, is causing the delay.

covizu/covizu/utils/gisaid_utils.py, lines 116 to 121 at a1c5f3b: (snippet not captured)
So, I tried adding an index on the
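The column name is cut off above; assuming the lookups are on the virus-name column, the index would look something like:

```python
# Index the column used for the per-record lookups so each query is a
# B-tree search instead of a full table scan ("name" is an assumption).
conn.execute("CREATE INDEX IF NOT EXISTS idx_records_name ON records (name)")
conn.commit()
```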
It's surprising to me that there is not much difference in computing time between the empty and full database scenarios. Are you sure that minimap2 is being run when a record is not found in the database?
@SandeepThokala can you please add timings for running this part of the pipeline WITHOUT the database (aligning everything with minimap2)?
We also need to figure out why running with an empty database is giving us new alignments "for free" (at no additional cost to computing time). This does not make sense to me! The other possibility is that retrieving the alignment results ("feature vectors") from the database for each record is taking so much time that it is roughly equivalent in cost to re-aligning each sequence de novo!
Captured the time taken to run the following function (lines 106 to 116 at 0254ac4): (snippet not captured)
MongoDB results:

Empty database: (timing not captured)

Populated database: (timing not captured)
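As a side note, the per-function timings could be captured with a small decorator, matching the "Function ... took ... seconds to execute" lines that appear in the console output further down (whether the actual measurement was done this way is an assumption):

```python
import time
from functools import wraps

def timed(func):
    """Report the wall-clock time of each call to the wrapped function."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        elapsed = time.time() - start
        print(f"Function {func.__name__} took {elapsed:.4f} seconds to execute")
        return result
    return wrapper

# usage: decorate the function under test, e.g.
# @timed
# def sort_by_lineage(...): ...
```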
I think we might need to be more aggressive about reducing data processing and filter the provisioned JSON stream by record submission date (i.e., cutoff by the last CoVizu run date). This bypasses database transactions, and it also gives us a way of tracking which lineages have been updated (see #493). The downside of this approach is that any records that have been removed from the database would still be in our data set.
@SandeepThokala can you give this a try with your test data please?
There is a possibility that new records in the database have submission dates in the past, to the degree that they are earlier than our last processing date. We could handle this by extending our date cutoff by a week or more. We could also check records in this extended range for accession numbers.
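A sketch of that cutoff-with-grace-window idea. The GISAID field names and the seen_accessions lookup are assumptions:

```python
from datetime import date, timedelta

def filter_new_records(records, last_run, seen_accessions, slack_days=7):
    """Yield records worth processing: anything submitted since the last run,
    plus records in a grace window whose accessions we have not seen before."""
    cutoff = last_run - timedelta(days=slack_days)
    for record in records:
        submitted = date.fromisoformat(record["covv_subm_date"])  # field name assumed
        if submitted >= last_run:
            yield record  # definitely new
        elif submitted >= cutoff and record["covv_accession_id"] not in seen_accessions:
            yield record  # late submission we have not processed yet
```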
Using dataset with 2,000 records (dev.2000.json.xz)

I selected a date (2022-01-20) so that approximately 600 of the 2,000 sequences have an older submission date.

(covizu) [sandeep@Paphlagon covizu]$ python3 batch.py --infile dev.2000.json.xz --mindate 2022-01-20 --dry-run
🏄 [0:00:00.878787] Processing GISAID feed data
🏄 [0:00:02.926217] aligned 0 records
🏄 [0:00:03.350381] filtered 612 problematic features
🏄 [0:00:03.350517] 493 genomes with excess missing sites
🏄 [0:00:03.350567] 142 genomes with excess divergence
Function sort_by_lineage took 2.4718 seconds to execute
🏄 [0:00:03.350973] Parsing Pango lineage designations
(covizu) [sandeep@Paphlagon covizu]$ python3 batch.py --infile dev.2000.json.xz --mindate 2022-01-20 --dry-run
🏄 [0:00:00.761366] Processing GISAID feed data
🏄 [0:00:01.507638] aligned 0 records
🏄 [0:00:01.682629] filtered 612 problematic features
🏄 [0:00:01.682715] 493 genomes with excess missing sites
🏄 [0:00:01.682726] 142 genomes with excess divergence
Function sort_by_lineage took 0.9213 seconds to execute
🏄 [0:00:01.683012] Parsing Pango lineage designations
Using dataset with 10,000 records (dev.10000.json.xz)

I selected a date (2022-01-29) so that approximately 3,000 of the 10,000 sequences have an older submission date.

(covizu) [sandeep@Paphlagon covizu]$ python3 batch.py --infile dev.10000.json.xz --mindate 2022-01-29 --dry-run
🏄 [0:00:00.923995] Processing GISAID feed data
🏄 [0:00:04.266356] aligned 0 records
🏄 [0:00:12.119832] filtered 3432 problematic features
🏄 [0:00:12.119925] 3820 genomes with excess missing sites
🏄 [0:00:12.119941] 356 genomes with excess divergence
Function sort_by_lineage took 11.1960 seconds to execute
🏄 [0:00:12.120137] Parsing Pango lineage designations
(covizu) [sandeep@Paphlagon covizu]$ python3 batch.py --infile dev.10000.json.xz --mindate 2022-01-29 --dry-run
🏄 [0:00:01.085012] Processing GISAID feed data
🏄 [0:00:02.514592] aligned 0 records
🏄 [0:00:05.223317] filtered 3432 problematic features
🏄 [0:00:05.223389] 3820 genomes with excess missing sites
🏄 [0:00:05.223402] 356 genomes with excess divergence
Function sort_by_lineage took 4.1384 seconds to execute
🏄 [0:00:05.223610] Parsing Pango lineage designations
@SandeepThokala can you please make sure that the outputs (
Facing the following error at line 160 when providing no new sequences by specifying a recent mindate.

(covizu) [sandeep@Paphlagon covizu]$ nohup python3 batch.py --infile dev.2000.json.xz --mindate 2023-11-26
Traceback (most recent call last):
File "/home/sandeep/Documents/covizu/batch.py", line 158, in <module>
timetree, residuals = build_timetree(by_lineage, args, cb.callback)
File "/home/sandeep/Documents/covizu/covizu/utils/batch_utils.py", line 87, in build_timetree
return covizu.treetime.parse_nexus(nexus_file, fasta)
File "/home/sandeep/Documents/covizu/covizu/treetime.py", line 160, in parse_nexus
rmean = statistics.mean(rvals)
File "/home/sandeep/miniconda3/envs/covizu/lib/python3.10/statistics.py", line 328, in mean
raise StatisticsError('mean requires at least one data point')
statistics.StatisticsError: mean requires at least one data point

Lines 158 to 163 at 4ab1ba4: (snippet not captured)
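A minimal sketch of a guard for this crash, patched around the mean computation in covizu/treetime.py: with no new sequences there are no rate values to average, so bail out before calling statistics.mean() on an empty list (the early-exit behaviour is an assumption about what the pipeline should do; rvals comes from the surrounding parse_nexus code):

```python
import statistics
import sys

if not rvals:
    # No new sequences means no clock-rate values to average; exit cleanly
    # instead of raising StatisticsError from statistics.mean([]).
    sys.exit("No new records to process; nothing to update.")
rmean = statistics.mean(rvals)
```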
When a recent mindate is specified: (output not captured)

When a different outdir is specified along with a recent mindate: (output not captured)
During the initial run, the 'missing' data is stored using lists, whereas on the subsequent run, it is stored using tuples.
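A sketch of one way to make the two runs comparable: coerce the intervals to a single representation before storing or comparing (the exact structure of 'missing' is an assumption):

```python
def normalize_missing(missing):
    """Coerce 'missing' intervals to tuples so values deserialized from the
    database (JSON lists) compare equal to freshly computed tuples."""
    return [tuple(interval) for interval in missing]
```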
The previous results were inaccurate due to my lack of experience with git stashing while switching between the

These are the final results: (results not captured)
Thanks, that's much better. Looks like for the lookups that we are doing, we should be sticking to SQL.
File size for the database vs. the number of records: (table not captured)
Pending merge into
We've been burning a lot of CPU re-aligning millions of sequences every time we process a new provision file. Even though this is pretty fast with minimap2, there are a lot of sequences! For the sake of our hardware, we should think about bringing back the database scheme implemented by @ewong347 ages ago. This is the general idea (see the sketch below):
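The list spelling out the idea did not survive this export, but based on the discussion above, a minimal sketch of the lookup-else-align flow (table/column names, the GISAID field name, and the align_and_extract helper are all assumptions):

```python
import json

def get_features(record, conn, aligner):
    """Return the feature vector for a record, aligning only on a cache miss."""
    name = record["covv_virus_name"]  # field name assumed
    row = conn.execute(
        "SELECT features FROM records WHERE name = ?", (name,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # cached feature vector, no minimap2 call
    features = align_and_extract(record, aligner)  # hypothetical minimap2 wrapper
    conn.execute(
        "INSERT INTO records (name, features) VALUES (?, ?)",
        (name, json.dumps(features)),
    )
    conn.commit()
    return features
```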
This should be implemented in an experimental branch.