Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RAM-based dataset segmentation #754

Merged
merged 6 commits into from
Feb 15, 2021
Merged

Conversation

LachlanStuart
Copy link
Contributor

The problem of dataset segmentation is that peaks in the imzML files are sorted by spectrum_idx, mz, and the annotation step needs them in mz, spectrum_idx order for efficient processing. Additionally, the spectrum is split into 128MB segments for to allow for easily loading small ranges of m/z values.

Due to memory constraints, the old code would read chunks of the input file and distribute it across a set of temporary files, then re-read each temporary file, sort the peaks and save a segment to COS. Unfortunately disk access was so slow (~50MB/s) that large datasets would timeout.

This PR changes it to do the segmentation completely in RAM with no temporary files. This requires much more RAM(2-4x the size of the .ibd file), but it is overall much faster because it doesn't have the I/O bottleneck.

@@ -55,7 +55,7 @@
"include_modules": ["engine/sm"],
"data_cleaner": true,
"data_limit": false,
"workers": 2
"workers": 4
Copy link
Contributor Author

@LachlanStuart LachlanStuart Feb 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CircleCI seems to share CPU resources. Even though our selected instance says "2 CPUs", using 4 workers sometimes (but not always) causes it to take half as much time. This was responsible for a lot of the performance difference between the Spark and Lithops implementations on CircleCI

# Replace this with the branch to run sci-test against
only: feat/ibm-cloud
only:
- master
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had accidentally merged only: feat/ibm-cloud, which disabled sci-test for master. This re-enables it and hopefully using a list here prevents that mistake from happening again.

@@ -102,7 +102,7 @@ def test_index_ds_works(sm_config, test_db, es_dsl_search, sm_index, ds_config,
ds_id=ds_id, moldb=moldb, isocalc=isocalc_mock,
)

wait_for_es(sec=1)
wait_for_es(sec=1.5)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an unrelated fix. I found that CircleCI spontaneously started having intermittent failures in this test. It seems like ElasticSearch might not have enough time to finish indexing the document. e.g.

Not sure what has changed, but hopefully increasing the delay makes it stable again.

Copy link
Contributor

@sergii-mamedov sergii-mamedov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@LachlanStuart LachlanStuart merged commit fb48626 into master Feb 15, 2021
@LachlanStuart LachlanStuart deleted the feat/lithops-ds-sort-in-ram branch February 15, 2021 16:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants