RAM-based dataset segmentation #754
Conversation
```diff
@@ -55,7 +55,7 @@
     "include_modules": ["engine/sm"],
     "data_cleaner": true,
     "data_limit": false,
-    "workers": 2
+    "workers": 4
```
CircleCI seems to share CPU resources: even though our selected instance says "2 CPUs", using 4 workers sometimes (but not always) halves the run time. This accounted for much of the performance difference between the Spark and Lithops implementations on CircleCI.
```diff
-# Replace this with the branch to run sci-test against
-      only: feat/ibm-cloud
+      only:
+        - master
```
I had accidentally merged `only: feat/ibm-cloud`, which disabled sci-test for master. This re-enables it, and hopefully using a list here prevents that mistake from happening again.
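For reference, a branch whitelist in list form looks something like this in CircleCI config (a sketch of the `filters.branches.only` syntax; the workflow and job names here are illustrative, not copied from the repo):

```yaml
workflows:
  sci-test:
    jobs:
      - sci-test:
          filters:
            branches:
              only:
                # list form: adding another branch means appending an item,
                # rather than replacing the whole value
                - master
```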
```diff
@@ -102,7 +102,7 @@ def test_index_ds_works(sm_config, test_db, es_dsl_search, sm_index, ds_config,
         ds_id=ds_id, moldb=moldb, isocalc=isocalc_mock,
     )

-    wait_for_es(sec=1)
+    wait_for_es(sec=1.5)
```
This is an unrelated fix. I found that CircleCI spontaneously started having intermittent failures in this test. It seems like ElasticSearch might not have enough time to finish indexing the document, e.g.:
- https://app.circleci.com/pipelines/github/metaspace2020/metaspace/1183/workflows/a1e13dbe-df28-4021-9ee2-73948873b3db/jobs/13823
- https://app.circleci.com/pipelines/github/metaspace2020/metaspace/1162/workflows/f0947cb7-e718-49e3-b545-b993c7b9fc92/jobs/13696
Not sure what has changed, but hopefully increasing the delay makes it stable again.
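A fixed sleep always risks being either too short (flaky) or too long (slow). An alternative is to poll until the indexed document is visible, with a timeout. A minimal, generic sketch (the helper name and signature are my own, not from the codebase):

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True if the predicate succeeded, False on timeout. Compared to a
    fixed sleep, this is robust to variable indexing latency without always
    paying the worst-case delay.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    # One final check so a predicate that became true at the deadline passes
    return predicate()
```

In the test, the predicate would be a query against ES for the expected document (e.g. a wrapper around `es_dsl_search`), so the test proceeds as soon as indexing completes.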
LGTM
The problem with dataset segmentation is that peaks in the imzML files are sorted by `spectrum_idx, mz`, but the annotation step needs them in `mz, spectrum_idx` order for efficient processing. Additionally, the spectra are split into 128MB segments so that small ranges of m/z values can be loaded easily.

Due to memory constraints, the old code would read chunks of the input file and distribute them across a set of temporary files, then re-read each temporary file, sort the peaks, and save a segment to COS. Unfortunately, disk access was so slow (~50MB/s) that large datasets would time out.

This PR changes the segmentation to run completely in RAM with no temporary files. This requires much more RAM (2-4x the size of the `.ibd` file), but it is overall much faster because it doesn't have the I/O bottleneck.
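The in-RAM re-sort described above can be sketched with NumPy. This is a simplified illustration, not the PR's actual code: the array names are hypothetical, the data is synthetic, and a small segment size is used so the split is visible (the real pipeline uses 128MB segments read from the `.ibd` file):

```python
import numpy as np

# Hypothetical peak data; in the real pipeline these arrays come from the
# .ibd file and arrive effectively sorted by (spectrum_idx, mz).
rng = np.random.default_rng(0)
n_peaks = 200_000
spectrum_idx = rng.integers(0, 500, n_peaks)
mz = rng.uniform(100.0, 1000.0, n_peaks)
intensity = rng.random(n_peaks)

# Re-sort to (mz, spectrum_idx) order entirely in RAM -- no temp files.
# np.lexsort treats its LAST key as the primary sort key.
order = np.lexsort((spectrum_idx, mz))
peaks = np.stack([mz[order], spectrum_idx[order], intensity[order]], axis=1)

# Split into roughly fixed-size segments so the annotation step can later
# load a narrow m/z range without reading the whole dataset. 1MiB here
# purely for demonstration; the PR uses 128MB segments.
segment_size_bytes = 1 * 2**20
rows_per_segment = segment_size_bytes // peaks[0].nbytes
segments = [peaks[i:i + rows_per_segment]
            for i in range(0, len(peaks), rows_per_segment)]
```

Because each segment covers a contiguous, sorted m/z range, loading "all peaks between m/z 400 and 402" only touches the one or two segments whose ranges overlap, which is the access pattern the annotation step needs.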