RAM-based dataset segmentation #754
Conversation
```diff
@@ -55,7 +55,7 @@
     "include_modules": ["engine/sm"],
     "data_cleaner": true,
     "data_limit": false,
-    "workers": 2
+    "workers": 4
```
CircleCI seems to share CPU resources: even though our selected instance says "2 CPUs", using 4 workers sometimes (but not always) halves the run time. This accounted for much of the performance difference between the Spark and Lithops implementations on CircleCI.
```diff
-# Replace this with the branch to run sci-test against
-      only: feat/ibm-cloud
+      only:
+        - master
```
I had accidentally merged `only: feat/ibm-cloud`, which disabled sci-test for master. This re-enables it, and hopefully using a list here prevents that mistake from happening again.
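For reference, a branch whitelist in list form looks something like this in CircleCI config (a sketch of the `filters.branches.only` syntax; the workflow and job names here are illustrative, not copied from the repo):

```yaml
workflows:
  sci-test:
    jobs:
      - sci-test:
          filters:
            branches:
              only:
                # list form: adding another branch means appending an item,
                # rather than replacing the whole value
                - master
```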
```diff
@@ -102,7 +102,7 @@ def test_index_ds_works(sm_config, test_db, es_dsl_search, sm_index, ds_config,
         ds_id=ds_id, moldb=moldb, isocalc=isocalc_mock,
     )

-    wait_for_es(sec=1)
+    wait_for_es(sec=1.5)
```
This is an unrelated fix. I found that CircleCI spontaneously started having intermittent failures in this test. It seems like ElasticSearch might not have enough time to finish indexing the document, e.g.:
- https://app.circleci.com/pipelines/github/metaspace2020/metaspace/1183/workflows/a1e13dbe-df28-4021-9ee2-73948873b3db/jobs/13823
- https://app.circleci.com/pipelines/github/metaspace2020/metaspace/1162/workflows/f0947cb7-e718-49e3-b545-b993c7b9fc92/jobs/13696
Not sure what has changed, but hopefully increasing the delay makes it stable again.
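A fixed sleep always risks being either too short (flaky) or too long (slow). An alternative is to poll until the indexed document is visible, with a timeout. A minimal, generic sketch (the helper name and signature are my own, not from the codebase):

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True if the predicate succeeded, False on timeout. Compared to a
    fixed sleep, this is robust to variable indexing latency without always
    paying the worst-case delay.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    # One final check so a predicate that became true at the deadline passes
    return predicate()
```

In the test, the predicate would be a query against ES for the expected document (e.g. a wrapper around `es_dsl_search`), so the test proceeds as soon as indexing completes.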
LGTM
The problem with dataset segmentation is that peaks in the imzML files are sorted by `spectrum_idx, mz`, but the annotation step needs them in `mz, spectrum_idx` order for efficient processing. Additionally, the spectra are split into 128MB segments so that small ranges of m/z values can be loaded easily.

Due to memory constraints, the old code would read chunks of the input file and distribute them across a set of temporary files, then re-read each temporary file, sort the peaks, and save a segment to COS. Unfortunately, disk access was so slow (~50MB/s) that large datasets would time out.

This PR changes the segmentation to run completely in RAM with no temporary files. This requires much more RAM (2-4x the size of the `.ibd` file), but it is overall much faster because it doesn't have the I/O bottleneck.
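The in-RAM re-sort described above can be sketched with NumPy. This is a simplified illustration, not the PR's actual code: the array names are hypothetical, the data is synthetic, and a small segment size is used so the split is visible (the real pipeline uses 128MB segments read from the `.ibd` file):

```python
import numpy as np

# Hypothetical peak data; in the real pipeline these arrays come from the
# .ibd file and arrive effectively sorted by (spectrum_idx, mz).
rng = np.random.default_rng(0)
n_peaks = 200_000
spectrum_idx = rng.integers(0, 500, n_peaks)
mz = rng.uniform(100.0, 1000.0, n_peaks)
intensity = rng.random(n_peaks)

# Re-sort to (mz, spectrum_idx) order entirely in RAM -- no temp files.
# np.lexsort treats its LAST key as the primary sort key.
order = np.lexsort((spectrum_idx, mz))
peaks = np.stack([mz[order], spectrum_idx[order], intensity[order]], axis=1)

# Split into roughly fixed-size segments so the annotation step can later
# load a narrow m/z range without reading the whole dataset. 1MiB here
# purely for demonstration; the PR uses 128MB segments.
segment_size_bytes = 1 * 2**20
rows_per_segment = segment_size_bytes // peaks[0].nbytes
segments = [peaks[i:i + rows_per_segment]
            for i in range(0, len(peaks), rows_per_segment)]
```

Because each segment covers a contiguous, sorted m/z range, loading "all peaks between m/z 400 and 402" only touches the one or two segments whose ranges overlap, which is the access pattern the annotation step needs.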