
Restore local database for extracted features #485

Closed
ArtPoon opened this issue Jun 28, 2023 · 54 comments

@ArtPoon
Contributor

ArtPoon commented Jun 28, 2023

We've been burning a lot of CPU re-aligning millions of sequences every time we process a new provision file. Even though this is pretty fast with minimap2, there are a lot of sequences! For the sake of our hardware, we should think about bringing back the database scheme implemented by @ewong347 ages ago. This is the general idea:

  • While iterating through the lzma-decompressed JSON provision stream, query each EPI accession against the database.
  • If it is in the database, continue to the next record.
  • If it is not, do pairwise alignment to the reference via minimap2 as usual, extract features, and store them in the database along with the metadata we retain from the JSON.
  • Since the resulting database will be enormous, we might need to merge sequence records with identical feature vectors (which will probably mean reworking Emmanuel's schema).

This should be implemented in an experimental branch.
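A minimal sketch of that loop, assuming sqlite3 and a line-delimited JSON provision file; `align_and_extract()` is a hypothetical stand-in for the minimap2 alignment and feature extraction step, and the `covv_*` field names are assumptions:

```python
import json
import lzma
import sqlite3

def update_database(provision_xz, db_file="gisaid.db"):
    """Align and store only the records we have not seen before."""
    con = sqlite3.connect(db_file)
    cur = con.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS sequences "
                "(accession TEXT PRIMARY KEY, name TEXT, features TEXT)")
    with lzma.open(provision_xz, 'rt') as handle:
        for line in handle:  # one JSON record per line
            record = json.loads(line)
            acc = record['covv_accession_id']
            cur.execute("SELECT 1 FROM sequences WHERE accession = ?", (acc,))
            if cur.fetchone():
                continue  # already in the database, skip re-alignment
            feats = align_and_extract(record['sequence'])  # hypothetical minimap2 step
            cur.execute("INSERT INTO sequences VALUES (?, ?, ?)",
                        (acc, record['covv_virus_name'], json.dumps(feats)))
    con.commit()
```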

@SandeepThokala
Collaborator

Created a temporary gsaid.db file with a SEQUENCES table that is populated every time a new sequence is given.
Updated code is in a new branch, iss485.
Working on merging sequence records with identical feature vectors.

@GopiGugan
Contributor

@SandeepThokala to update this issue with the current database schema (tables, fields, etc.)

@SandeepThokala
Collaborator

Database schema for now:

  • sequences

    • accession (primary key)
    • name
    • lineage
    • date
    • location
    • sequence
  • features

    • accession
    • name
    • lineage
    • date
    • location
    • vectors (primary key)

The features table is the result of merging records with identical feature vectors from the sequences table.
For example, if there are two records with identical feature vectors, the merged record in the features table would look like:

| accession | name | location | lineage | vectors (pk) |
| --- | --- | --- | --- | --- |
| accession1,accession2 | name1,name2 | location1,location2 | lineage1,lineage2 | fvecs string |
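For concreteness, a sqlite3 rendering of the schema above (a sketch only; column types are assumptions):

```python
import sqlite3

con = sqlite3.connect("gsaid.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS sequences (
    accession TEXT PRIMARY KEY,
    name TEXT, lineage TEXT, date TEXT, location TEXT,
    sequence TEXT
);
CREATE TABLE IF NOT EXISTS features (
    accession TEXT,  -- comma-separated accessions of the merged records
    name TEXT, lineage TEXT, date TEXT, location TEXT,
    vectors TEXT PRIMARY KEY  -- serialized feature vector shared by all members
);
""")
```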

@ArtPoon
Contributor Author

ArtPoon commented Jul 25, 2023

  • I don't think we should store raw sequences in the database (sequences table).
  • We should have a separate table linking members of a "variant" (a set of sequences with the same feature vector) to the variant ID; alternatively, we could store a key to the feature vector in the sequences table.

@SandeepThokala
Collaborator

Facing the following error if the same test file is used a second time.

File "./covizu/batch.py", line 153, in <module>
    timetree, residuals = build_timetree(by_lineage, args, cb.callback)
  File "./covizu/covizu/utils/batch_utils.py", line 87, in build_timetree
    return covizu.treetime.parse_nexus(nexus_file, fasta)
  File "./covizu/covizu/treetime.py", line 160, in parse_nexus
    rmean = statistics.mean(rvals)
  File "/usr/lib/python3.10/statistics.py", line 328, in mean
    raise StatisticsError('mean requires at least one data point')
statistics.StatisticsError: mean requires at least one data point

@GopiGugan
Contributor

> Facing the following error if the same test file is used a second time.

@SandeepThokala - can you please push your updates to the iss485 branch so we can investigate this issue further?

@ArtPoon
Contributor Author

ArtPoon commented Aug 15, 2023

Reprocessing the exact same input data feed is an edge case (we almost always expect there to be new data in production), so we should just have some exception handling for this situation.

@ArtPoon
Contributor Author

ArtPoon commented Aug 15, 2023

@SandeepThokala when you have the exception handling in place, can you please post some timing results on (1) processing a data feed de novo (i.e., with an empty database), versus (2) reprocessing the same data feed with a full database. In other words, we need to measure what the overhead cost of querying the database for sample identifiers is.

@SandeepThokala
Collaborator

Test data used has a total of 2001 records.
Time taken with an empty database: 3 mins 31 secs

🏄 [0:03:31.892925] All done!

Time taken with a full database: 7 secs

🏄 [0:00:07.021282]

@ArtPoon
Contributor Author

ArtPoon commented Aug 29, 2023

@SandeepThokala reports that building the database took about 3 minutes for 2,000 sequences on his machine. Extrapolating that out to 10 million records (factor of 5000), that would be roughly 250 hours or about 10 days.
We also need to consider the size of the database.
@SandeepThokala can you report on:

  1. additional, more precise timing of building the database on Paphlagon
  2. the size of the database for a given number of records

Would there be any advantage (speed or space efficiency) in switching from sqlite to a more advanced database system like postgres?

@ArtPoon
Contributor Author

ArtPoon commented Sep 5, 2023

  • use `nohup <cmd> &` to run a process in the background, so that if your shell is logged out the process will continue to run
  • use the UNIX `time <cmd>` to measure computing time precisely

@SandeepThokala
Collaborator

Timing results on Paphlagon using postgresql for a dataset of 1 million sequences
Time taken: 4hrs 50mins

🏄 [0:00:00.754457] Processing GISAID feed data
🏄 [0:00:03.848454] aligned 0 records
🏄 [0:01:48.123322] aligned 10000 records
🏄 [0:03:36.460588] aligned 20000 records
..
..
🏄 [1:18:56.193301] aligned 600000 records
🏄 [1:19:57.546790] aligned 610000 records
🏄 [1:19:57.844676] filtered 758998 problematic features
🏄 [1:19:57.844722]          292531 genomes with excess missing sites
🏄 [1:19:57.844734]          45545 genomes with excess divergence
🏄 [1:19:57.844926] Parsing Pango lineage designations
🏄 [1:19:59.502800] Identifying lineage representative genomes
🏄 [1:20:09.876231] Reconstructing tree with fasttree2
..
..
--- divergence tree saved in nexus format as  
	 data/divergence_tree.nexus

🏄 [1:34:08.394258] Recoding features, compressing variants..
🏄 [1:34:27.818366] start MPI on minor lineages
🏄 [1:34:32.566257][100/128] starting AY.4.8
🏄 [1:34:32.598005][96/128] starting B.1.311
..
..
🏄 [3:54:31.695501] start AY.3, 5719 entries
🏄 [3:58:41.902696] Parsing output files
R[write to console]: Error in Edges[, 1] : incorrect number of dimensions
..
..
R[write to console]: Error in optimx.check(par, optcfg$ufn, optcfg$ugr, optcfg$uhess, lower,  : 
  Function returned is infinite or NA (non-computable)

R[write to console]: Error in optimx.check(par, optcfg$ufn, optcfg$ugr, optcfg$uhess, lower,  : 
  Function returned is infinite or NA (non-computable)

🏄 [4:50:48.659124] All done!

@ArtPoon
Contributor Author

ArtPoon commented Sep 12, 2023

Thanks @SandeepThokala. Can you also post timings for rerunning the same data feed with the complete database?
Also, is it possible to compare the output data files from both versions?

@SandeepThokala
Collaborator

Timing results for rerunning the 1 million dataset with complete database

(covizu) [sandeep@Paphlagon covizu]$ time python3 batch.py --infile provision.1m.json.xz --dry-run
🏄 [0:00:00.819510] Processing GISAID feed data
🏄 [0:03:59.062307] filtered 0 problematic features
🏄 [0:03:59.062413]          0 genomes with excess missing sites
...
...
0.00    -TreeAnc: set-up
TreeAnc: tree in /tmp/cvz_tt_fv16_7wj as only 1 tips. Please check your tree!
Tree loading/building failed.
No tree -- exiting.
🏄 [0:04:02.186036] Provided records are already stored.
🏄 [0:04:02.279103] Recoding features, compressing variants..
🏄 [0:04:02.279667] start MPI on minor lineages
🏄 [0:04:06.207955] Parsing output files
🏄 [0:04:08.992934] All done!

real    4m10.655s
user    3m47.634s
sys     5m31.654s

When comparing the clusters.json output file before and after the run, there were no changes detected in the file.

@ArtPoon
Contributor Author

ArtPoon commented Sep 26, 2023

Thanks for posting this timing info. Something strange is going on - the treetime step is complaining about only having a single tip in the tree, which implies that nearly all records are being discarded.
Why are the filters reading zero? Are you checking inputs from the data feed against records in the database?

@ArtPoon
Contributor Author

ArtPoon commented Sep 26, 2023

@GopiGugan investigating nosql database to reduce the number of transactions, i.e., insert operations to multiple tables per record

@ArtPoon
Contributor Author

ArtPoon commented Sep 26, 2023

Once past the initial step of updating the database with new records from the provision feed, we should be streaming JSON from the database with {name, diffs, missing}, one row per record, to the next stage of the pipeline - i.e., the output of extract_features.
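A hypothetical generator for that handoff (the SEQUENCES table and its column names are assumptions carried over from the comments above):

```python
import json

def stream_features(cur):
    # one row per record, mirroring the output of extract_features
    cur.execute("SELECT qname, diffs, missing FROM SEQUENCES")
    for qname, diffs, missing in cur:
        yield {'name': qname,
               'diffs': json.loads(diffs),
               'missing': json.loads(missing)}
```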

@ArtPoon
Contributor Author

ArtPoon commented Oct 3, 2023

@GopiGugan estimates the NoSQL database will require about 15GB to store a JSON of 13 million records.

@SandeepThokala
Collaborator

Timing results on Paphlagon using PostgreSQL:

| No. of records | Time (empty DB) | Time (full DB) |
| --- | --- | --- |
| 2,000 | 2m 49s | 2m 46s |
| 10,000 | 8m 00s | 8m 01s |
| 100,000 | 2h 48m 02s | 3h 07m 14s |

For the infile containing 100,000 sequences, the run took longer with the full database. I think this is because of the way I'm storing diffs and missing, which involves JSON strings.

```python
# inserting diffs and missing as JSON strings
cur.execute("INSERT INTO SEQUENCES VALUES (%s, %s, %s, %s, %s, %s, %s)",
            [json.dumps(v) if k in ['diffs', 'missing'] else v
             for k, v in record.items()])
yield record
```
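For what it's worth, psycopg2 can adapt Python objects to json/jsonb columns directly, which might avoid some of the serialization overhead; a sketch, assuming the diffs and missing columns were declared as jsonb:

```python
from psycopg2.extras import Json

# Json(v) lets psycopg2 adapt the value to a json/jsonb column directly
cur.execute("INSERT INTO SEQUENCES VALUES (%s, %s, %s, %s, %s, %s, %s)",
            [Json(v) if k in ['diffs', 'missing'] else v
             for k, v in record.items()])
```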

@ArtPoon
Contributor Author

ArtPoon commented Oct 17, 2023

The above timings represent running the entire analysis (extracting feature vectors, calculating distances and inferring clusters). We need a breakdown of the time consumed by the different stages of the analysis, especially the feature extraction (sequence alignment) stage that having a local database is supposed to streamline. @SandeepThokala can you please re-run these tests and measure the time required for each stage, or at least the first step? It might be easier to just run the first step.

@ArtPoon
Contributor Author

ArtPoon commented Oct 17, 2023

If you have saved the console output with timing messages, then you should be able to figure out how long it takes to get to the step where features have been extracted from aligned genomes, compressed into unique vectors, and sorted by lineage.

@SandeepThokala
Collaborator

SandeepThokala commented Oct 22, 2023

For an infile containing 2,000 records:
Console output with empty database. Time taken to get to the step where features have been extracted: 4.18s

🏄 [0:00:00.833155] Processing GISAID feed data
🏄 [0:00:03.705643] aligned 0 records
🏄 [0:00:04.183092] filtered 1066 problematic features
🏄 [0:00:04.183195]          671 genomes with excess missing sites
🏄 [0:00:04.183209]          163 genomes with excess divergence
🏄 [0:00:04.183306] Parsing Pango lineage designations
🏄 [0:00:05.930795] Identifying lineage representative genomes
🏄 [0:00:06.017177] Reconstructing tree with fasttree2

Console output with full database. Time taken to get to the step where features have been extracted: 2.92s

🏄 [0:00:00.918631] Processing GISAID feed data
🏄 [0:00:02.674425] aligned 0 records
🏄 [0:00:02.926458] filtered 1066 problematic features
🏄 [0:00:02.926625]          671 genomes with excess missing sites
🏄 [0:00:02.926641]          163 genomes with excess divergence
🏄 [0:00:02.926736] Parsing Pango lineage designations
🏄 [0:00:04.859742] Identifying lineage representative genomes
🏄 [0:00:04.926775] Reconstructing tree with fasttree2

@SandeepThokala
Collaborator

For an infile containing 10,000 records:
Console output with empty database. Time taken to get to the step where features have been extracted: 21.24s

🏄 [0:00:00.903292] Processing GISAID feed data
🏄 [0:00:03.725789] aligned 0 records
🏄 [0:00:21.246651] filtered 4387 problematic features
🏄 [0:00:21.247484]          5503 genomes with excess missing sites
🏄 [0:00:21.247500]          424 genomes with excess divergence
🏄 [0:00:21.247601] Parsing Pango lineage designations
🏄 [0:00:23.049981] Identifying lineage representative genomes
🏄 [0:00:23.216430] Reconstructing tree with fasttree2

Console output with full database. Time taken to get to the step where features have been extracted: 17.76s

🏄 [0:00:00.923452] Processing GISAID feed data
🏄 [0:00:04.967376] aligned 0 records
🏄 [0:00:17.769201] filtered 4387 problematic features
🏄 [0:00:17.769395]          5503 genomes with excess missing sites
🏄 [0:00:17.769414]          424 genomes with excess divergence
🏄 [0:00:17.769542] Parsing Pango lineage designations
🏄 [0:00:19.787453] Identifying lineage representative genomes
🏄 [0:00:19.936859] Reconstructing tree with fasttree2

@SandeepThokala
Collaborator

For an infile containing 100,000 records:
Console output with empty database. Time taken to get to the step where features have been extracted: 11m 21.94s

🏄 [0:00:00.891205] Processing GISAID feed data
🏄 [0:00:03.464269] aligned 0 records
🏄 [0:01:00.020073] aligned 10000 records
🏄 [0:02:34.053302] aligned 20000 records
🏄 [0:04:01.141220] aligned 30000 records
🏄 [0:06:13.659268] aligned 40000 records
🏄 [0:09:36.927096] aligned 50000 records
🏄 [0:11:21.948412] filtered 68125 problematic features
🏄 [0:11:21.948540]          38432 genomes with excess missing sites
🏄 [0:11:21.948553]          6358 genomes with excess divergence
🏄 [0:11:21.948665] Parsing Pango lineage designations
🏄 [0:11:23.708734] Identifying lineage representative genomes
🏄 [0:11:25.330877] Reconstructing tree with fasttree2

Console output with full database. Time taken to get to the step where features have been extracted: 29m 35.53s

🏄 [0:00:00.760735] Processing GISAID feed data
🏄 [0:00:34.837885] aligned 0 records
🏄 [0:06:27.779398] aligned 10000 records
🏄 [0:12:10.327561] aligned 20000 records
🏄 [0:16:20.266935] aligned 30000 records
🏄 [0:21:52.947085] aligned 40000 records
🏄 [0:27:49.226798] aligned 50000 records
🏄 [0:29:35.536452] filtered 68125 problematic features
🏄 [0:29:35.536590]          38432 genomes with excess missing sites
🏄 [0:29:35.536609]          6358 genomes with excess divergence
🏄 [0:29:35.536694] Parsing Pango lineage designations
🏄 [0:29:37.601106] Identifying lineage representative genomes
🏄 [0:29:39.239793] Reconstructing tree with fasttree2

@SandeepThokala
Collaborator

Trying to get results for 1 million sequences on an empty database; still running after 8h 47m:

🏄 [8:47:48.732020] aligned 300000 records

@ArtPoon
Contributor Author

ArtPoon commented Oct 25, 2023

@SandeepThokala can you make a plot summarizing your timing results (time versus number of sequences)? Also I need you to analyze why re-running from a full database is not faster.

@GopiGugan
Contributor

> @GopiGugan what branch is this database work being done on? I might need to use it to work on #493

We're using the iss485 and iss485-nosql branches for development.

@SandeepThokala
Collaborator

> @SandeepThokala can you make a plot summarizing your timing results (time versus number of sequences)? Also I need you to analyze why re-running from a full database is not faster.

[plot: runtime versus number of sequences]

Currently working on analyzing why the re-run from a full database is not faster.

@SandeepThokala
Collaborator

SandeepThokala commented Oct 31, 2023

I think that sequentially searching for every record by virus name (line 120) is causing the delay.

```python
for i, record in enumerate(gen, 1):
    qname = record['covv_virus_name']
    sequence = record.pop('sequence')
    cur.execute("SELECT * FROM SEQUENCES WHERE qname = '%s'" % qname)
    result = cur.fetchone()
```

So, I tried adding an index on the qname column and repeated the experiment.
Solid lines represent the results after adding the index.
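The index in question, as it might be created from psycopg2 (the index name is an assumption):

```python
# speeds up the SELECT ... WHERE qname = ... lookup above
cur.execute("CREATE INDEX IF NOT EXISTS sequences_qname_idx ON SEQUENCES (qname)")
```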

@ArtPoon
Contributor Author

ArtPoon commented Oct 31, 2023

It's surprising to me that there is not much difference in computing time between the empty and full database scenarios. Are you sure that minimap2 is being run when a record is not found in the database?

@SandeepThokala
Collaborator

Just got to test on the full database with 1m sequences.

@ArtPoon
Contributor Author

ArtPoon commented Nov 7, 2023

@SandeepThokala can you please add timings for running this part of the pipeline WITHOUT the database (aligning everything with minimap2)?

@ArtPoon
Contributor Author

ArtPoon commented Nov 7, 2023

We also need to figure out why running with an empty database is giving us new alignments "for free" (at no additional cost in computing time). This does not make sense to me! The other possibility is that retrieving the alignment results ("feature vectors") from the database for each record takes so much time that it is roughly equivalent in cost to re-aligning each sequence de novo.

@GopiGugan
Contributor

> @GopiGugan to repeat experiment with MongoDB (NoSQL) instead of SQL to see if it is faster

Captured the time taken to run the following function:

covizu/batch.py

Lines 106 to 116 in 0254ac4

```python
def process_feed(args, collection, callback=None):
    """ Process feed data """
    if callback:
        callback("Processing GISAID feed data")
    loader = gisaid_utils.load_gisaid(args.infile, minlen=args.minlen, mindate=args.mindate)
    batcher = gisaid_utils.batch_fasta(loader, collection, size=args.batchsize)
    aligned = gisaid_utils.extract_features(batcher, collection, ref_file=args.ref,
                                            binpath=args.mmbin, nthread=args.mmthreads,
                                            minlen=args.minlen)
    filtered = gisaid_utils.filter_problematic(aligned, vcf_file=args.vcf,
                                               cutoff=args.poisson_cutoff,
                                               callback=callback)
    return gisaid_utils.sort_by_lineage(filtered, callback=callback)
```

MongoDB results:

Empty database:

| Number of sequences | time (min) |
| --- | --- |
| 2,000 | 0.0774 |
| 10,000 | 0.3622 |
| 100,000 | 3.7555 |
| 1,000,000 | 33.2657 |

Populated database:

| Number of sequences | time (min) |
| --- | --- |
| 2,000 | 0.0282 |
| 10,000 | 0.1390 |
| 100,000 | 1.9545 |
| 1,000,000 | 19.0912 |

[plot: MongoDB timing]
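A hedged sketch of what the MongoDB lookup could look like with pymongo (the database, collection, and field names are assumptions, and `align_and_extract()` is a hypothetical helper for the minimap2 step):

```python
from pymongo import MongoClient

collection = MongoClient()['covizu']['records']
collection.create_index('qname')  # analogous to the SQL index on qname

def lookup_or_align(record):
    hit = collection.find_one({'qname': record['covv_virus_name']})
    if hit:
        return hit['diffs'], hit['missing']
    diffs, missing = align_and_extract(record['sequence'])  # hypothetical helper
    collection.insert_one({'qname': record['covv_virus_name'],
                           'diffs': diffs, 'missing': missing})
    return diffs, missing
```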

@ArtPoon
Contributor Author

ArtPoon commented Nov 15, 2023

I think we might need to be more aggressive about reducing data processing and filter the provisioned JSON stream by record submission date (i.e., cutoff by the last CoVizu run date). This bypasses database transactions, and it also gives us a way of tracking which lineages have been updated (see #493). The downside of this approach is that any records that have been removed from the database would still be in our data set.

@ArtPoon
Contributor Author

ArtPoon commented Nov 15, 2023

@SandeepThokala can you give this a try with your test data please?

@ArtPoon
Contributor Author

ArtPoon commented Nov 15, 2023

There is a possibility that new records in the feed have submission dates far enough in the past that they fall before our last processing date. We could handle this by extending our date cutoff by a week or more. We could also check the accession numbers of records in this extended range against the database.
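A sketch of the padded cutoff (the covv_subm_date field name and the example dates are assumptions):

```python
from datetime import date, timedelta

last_run = date(2023, 11, 1)           # hypothetical date of the last CoVizu run
cutoff = last_run - timedelta(days=7)  # pad by a week for late submissions

def filter_by_submission(records):
    for record in records:
        if date.fromisoformat(record['covv_subm_date']) >= cutoff:
            yield record
```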

@SandeepThokala
Collaborator

Using the dataset with 2,000 records (dev.2000.json.xz):

I selected a date (2022-01-20) so that approximately 600 of the 2,000 sequences have an older submission date.
Timing results on an empty database:

(covizu) [sandeep@Paphlagon covizu]$ python3 batch.py --infile dev.2000.json.xz --mindate 2022-01-20 --dry-run
🏄 [0:00:00.878787] Processing GISAID feed data
🏄 [0:00:02.926217] aligned 0 records
🏄 [0:00:03.350381] filtered 612 problematic features
🏄 [0:00:03.350517]          493 genomes with excess missing sites
🏄 [0:00:03.350567]          142 genomes with excess divergence
Function sort_by_lineage took 2.4718 seconds to execute
🏄 [0:00:03.350973] Parsing Pango lineage designations

Populated database:

(covizu) [sandeep@Paphlagon covizu]$ python3 batch.py --infile dev.2000.json.xz --mindate 2022-01-20 --dry-run
🏄 [0:00:00.761366] Processing GISAID feed data
🏄 [0:00:01.507638] aligned 0 records
🏄 [0:00:01.682629] filtered 612 problematic features
🏄 [0:00:01.682715]          493 genomes with excess missing sites
🏄 [0:00:01.682726]          142 genomes with excess divergence
Function sort_by_lineage took 0.9213 seconds to execute
🏄 [0:00:01.683012] Parsing Pango lineage designations

@SandeepThokala
Collaborator

Using the dataset with 10,000 records (dev.10000.json.xz):

I selected a date (2022-01-29) so that approximately 3,000 of the 10,000 sequences have an older submission date.
Timing results on an empty database:

(covizu) [sandeep@Paphlagon covizu]$ python3 batch.py --infile dev.10000.json.xz --mindate 2022-01-29 --dry-run
🏄 [0:00:00.923995] Processing GISAID feed data
🏄 [0:00:04.266356] aligned 0 records
🏄 [0:00:12.119832] filtered 3432 problematic features
🏄 [0:00:12.119925]          3820 genomes with excess missing sites
🏄 [0:00:12.119941]          356 genomes with excess divergence
Function sort_by_lineage took 11.1960 seconds to execute
🏄 [0:00:12.120137] Parsing Pango lineage designations

Populated database:

(covizu) [sandeep@Paphlagon covizu]$ python3 batch.py --infile dev.10000.json.xz --mindate 2022-01-29 --dry-run
🏄 [0:00:01.085012] Processing GISAID feed data
🏄 [0:00:02.514592] aligned 0 records
🏄 [0:00:05.223317] filtered 3432 problematic features
🏄 [0:00:05.223389]          3820 genomes with excess missing sites
🏄 [0:00:05.223402]          356 genomes with excess divergence
Function sort_by_lineage took 4.1384 seconds to execute
🏄 [0:00:05.223610] Parsing Pango lineage designations

@ArtPoon
Contributor Author

ArtPoon commented Nov 21, 2023

@SandeepThokala can you please make sure that the outputs (by_lineage dict) are identical between runs?
Can you compare to the nosql database, and get some more detailed timing? What happens when none of the sequences are new?

@SandeepThokala
Collaborator

Facing the following error at line 160 when providing no new sequences, by specifying a recent mindate.

(covizu) [sandeep@Paphlagon covizu]$ nohup python3 batch.py --infile dev.2000.json.xz --mindate 2023-11-26 
Traceback (most recent call last):
  File "/home/sandeep/Documents/covizu/batch.py", line 158, in <module>
    timetree, residuals = build_timetree(by_lineage, args, cb.callback)
  File "/home/sandeep/Documents/covizu/covizu/utils/batch_utils.py", line 87, in build_timetree
    return covizu.treetime.parse_nexus(nexus_file, fasta)
  File "/home/sandeep/Documents/covizu/covizu/treetime.py", line 160, in parse_nexus
    rmean = statistics.mean(rvals)
  File "/home/sandeep/miniconda3/envs/covizu/lib/python3.10/statistics.py", line 328, in mean
    raise StatisticsError('mean requires at least one data point')
statistics.StatisticsError: mean requires at least one data point

covizu/covizu/treetime.py

Lines 158 to 163 in 4ab1ba4

```python
# normalize residuals and append to tip labels
rvals = residuals.values()
rmean = statistics.mean(rvals)
rstdev = statistics.stdev(rvals)
for tip, resid in residuals.items():
    residuals[tip] = (resid - rmean) / rstdev
```
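One possible guard, as an assumption rather than the committed fix: skip the normalization when there are too few residuals instead of raising:

```python
rvals = list(residuals.values())
if len(rvals) > 1:  # mean and stdev both need data points
    rmean = statistics.mean(rvals)
    rstdev = statistics.stdev(rvals)
    for tip, resid in residuals.items():
        residuals[tip] = (resid - rmean) / rstdev
```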

@SandeepThokala
Collaborator

NoSQL vs. PostgreSQL timing results
[plot: NoSQL vs. PostgreSQL timing]

@ArtPoon
Contributor Author

ArtPoon commented Nov 28, 2023

  • try serializing the dictionary as JSON with a non-zero indent argument to induce line breaks, if you are going to export these as text files for inspection
  • it might be easier to have both versions of the by_lineage dictionary in memory in the same instance of Python, compare the key sets and then step through the associated values to identify differences
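A minimal sketch of the in-memory comparison described in the second bullet (the function name is hypothetical):

```python
import json

def diff_by_lineage(a, b):
    # serializing with sort_keys normalizes key order (and tuples vs. lists)
    for key in sorted(a.keys() | b.keys()):
        if key not in a or key not in b:
            print(f"{key}: present in only one version")
        elif json.dumps(a[key], sort_keys=True) != json.dumps(b[key], sort_keys=True):
            print(f"{key}: values differ")
```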

@SandeepThokala
Collaborator

When a recent mindate is specified:

  • by_lineage dict is empty.
  • The code never executes line 145, resulting in empty residuals.
  • Resulted in a statistics.StatisticsError when trying to calculate the mean of an empty list.

    covizu/covizu/treetime.py

    Lines 137 to 145 in 4ab1ba4

```python
with open(nexus_file) as handle:
    for line in handle:
        for m in pat.finditer(line):
            node_name, branch_length, date_est = m.groups()
            coldate = coldates.get(node_name, None)  # internal nodes have no collection date
            if coldate:
                # if estimated date is more recent (higher) than actual date,
                # lineage has more substitutions than expected under molecular clock
                residuals.update({node_name: float(date_est) - coldate})
```

When a different outdir is specified along with a recent mindate:

  • The timetree function returned None, which was stored in nexus_file.
  • Resulted in a TypeError when trying to open nexus_file:

```python
nexus_file = covizu.treetime.treetime(nwk, fasta, outdir=args.outdir, binpath=args.ttbin,
                                      clock=args.clock, verbosity=0)
```

@SandeepThokala
Collaborator

> • try serializing the dictionary as JSON with a non-zero indent argument to induce line breaks, if you are going to export these as text files for inspection
> • it might be easier to have both versions of the by_lineage dictionary in memory in the same instance of Python, compare the key sets and then step through the associated values to identify differences

During the initial run, the 'missing' data is stored using lists, whereas on the subsequent run it is stored using tuples.

[screenshot: by_lineage comparison showing lists vs. tuples]
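One possible normalization, assuming it is acceptable to coerce tuples back to lists before comparing (the helper name is hypothetical):

```python
def normalize(obj):
    # recursively convert tuples to lists so both runs serialize identically
    if isinstance(obj, (list, tuple)):
        return [normalize(x) for x in obj]
    if isinstance(obj, dict):
        return {k: normalize(v) for k, v in obj.items()}
    return obj
```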

@SandeepThokala
Collaborator

Timing results, PostgreSQL vs. NoSQL:

| Database | Scenario | time (s) | time (s) | time (s) |
| --- | --- | --- | --- | --- |
| Postgres | Empty | 4.84 | 19.63 | 787 |
| Postgres | Populated | 2.1 | 15.7 | 1980 |
| NoSQL | Empty | 4.9 | 22.8 | 214 |
| NoSQL | Populated | 2.8 | 7.8 | 89 |

[plot: PostgreSQL vs. NoSQL timing]

@ArtPoon
Contributor Author

ArtPoon commented Dec 5, 2023

  • it's still confusing why the empty nosql database runs are always slower, it would be nice to determine what the limiting step is
  • can you please add timings for the current version (i.e., no database) to the above plot?

@SandeepThokala
Collaborator

The previous results were inaccurate due to my lack of experience with git stashing. While switching between the iss485 and iss485-nosql branches, I accidentally lost the changes where I had added an index on the qname column. As a result, the runs on the populated database cost more time.

These are the final results (times in seconds).

| Database | Scenario | 2k | 10k | 100k | 1m |
| --- | --- | --- | --- | --- | --- |
| Postgres | Empty | 4.35 | 16.34 | 155.35 | 1344.49 |
| Postgres | Populated | 1.99 | 6.05 | 41.27 | 338.7 |
| NoSQL | Empty | 6.4 | 24.91 | 240.41 | 2074.86 |
| NoSQL | Populated | 3.1 | 8.79 | 88.72 | 843.27 |

[plot: final timing results, PostgreSQL vs. NoSQL]

@SandeepThokala
Collaborator

Compared timing for PostgreSQL vs. a run without the database (times in seconds):

| Database | Scenario | 2k | 10k | 100k | 1m |
| --- | --- | --- | --- | --- | --- |
| Postgres | Empty | 4.12 | 15.65 | 156.25 | 1343.14 |
| Postgres | Populated | 2.13 | 5.0 | 45.72 | 385.24 |
| No DB | | 3.28 | 13.29 | 122.69 | 1044.63 |

[plot: PostgreSQL vs. no-database timing]

@ArtPoon
Contributor Author

ArtPoon commented Dec 12, 2023

Thanks, that's much better. Looks like for the lookups that we are doing, we should be sticking to SQL.

  • check on file size for database and how it scales with number of records
  • clean up iss485 branch and prepare a pull request to dev

@SandeepThokala
Collaborator

File size of the database vs. the number of records:

| records | file size |
| --- | --- |
| 2k | 2904 kB |
| 10k | 14 MB |
| 100k | 158 MB |
| 1m | 1408 MB |

@ArtPoon
Contributor Author

ArtPoon commented Jan 16, 2024

Pending merge into dev
