RFC: Representing sequencing library preparations in the HCA DCP metadata standard #87

malloryfreeberg · 2019-07-16T16:13:11Z

Summary

The current HCA DCP metadata model explicitly represents cell suspensions (single cells or multiple cells suspended in some media) but not the sequencing library preparations derived from them. This is creating challenges for contributors, consumers, and DCP implementation teams when submitting, processing, and interpreting sequencing data. Here we propose a solution for explicitly identifying library preparation biomaterials in a sequencing experiment by making them a first-class biomaterial type entity in the metadata standard.

July 31: Last call for community review
August 15: Last call for oversight review

diekhans · 2019-07-16T16:20:17Z

Please include source files for images so that the RFC can be maintained when the author goes on to new adventures.

malloryfreeberg · 2019-07-16T16:26:35Z

Please include source files for images so that the RFC can be maintained when the author goes on to new adventures.

Added!

hewgreen · 2019-07-16T16:44:04Z

This turned out really nice, thanks metadata team. This gets my vote because the new entity is the perfect pivot for consuming metadata downstream. Another unmentioned benefit is that traversing the whole graph isn't as varied because you expect to find this lib prep entity.

With this new entity, are there any fields currently on the library_preparation_protocol that would better fit onto this new biomaterial? Many of those fields are the most essential to downstream analysis, so it may be nicer to have them on the main backbone and as prominent as possible.

Will sequence_file.library_prep_id be removed? I think this is implied and it should be.

malloryfreeberg · 2019-07-17T08:27:33Z

Will sequence_file.library_prep_id be removed? I think this is implied and it should be.

Yep, I can add this explicitly in the RFC.

With this new entity are there any fields currently on the library_preparation_protocol that would better fit onto this new biomaterial?

We can think about this. From a practical perspective, I can't immediately think of any fields that would benefit from being indicated on each library prep (which means potentially copying the value many times) and that can't be satisfied by putting the field once in the protocol. But it's an interesting point to consider.

mckinsel · 2019-07-22T18:29:27Z

rfcs/text/0000-rfc-library-preparation.md

+
+### Definitions
+
+**Logical unit** - a set of metadata and data files that are consumed together to provide a context for understanding, processing, and interpreting data. In the HCA DCP, a logical unit is instantiated by a primary or secondary “bundle”, although the meaning of bundle is not stable or clearly defined at the moment.


This seems like a pretty broad concept for what we're talking about here. Isn't the core problem to solve just figuring out what data should be processed together in a secondary analysis pipeline?

It's true that the impetus for this RFC was to support processing more than one set of sequence data files together. My goal was to have a term that basically represented a "bundle", but I didn't want to use the word "bundle" given that the definition of bundle isn't concrete at the moment. I wanted a term to use throughout the RFC that essentially means a bundle but wouldn't preclude talking about groups of things to be processed together if the definition of bundle changes (or we move away from bundles as they exist now). Is there another term or definition that might be less broad than "logical unit"?

The fact that we don't know the semantics of a fundamental data structure is alarming. Rather than create new terminology, I think a rigorous definition of the DCP's use of a bundle should be considered a blocker, and this RFC and data submissions of this type should be stalled until it is resolved.

It isn't clear to me if FASTQs end up in both library prep and assays bundles or if they only end up in one or the other.

@diekhans apologies for the confusion. My intention was not to create new DCP terminology, but rather provide a term to use just in the RFC to aid readability (have updated RFC to reflect this). Essentially a "logical unit" is a bundle, but I was hoping to avoid the DCP-specific bundle term because, as you say, bundle definition is ongoing work.

I don't think it's unreasonable to ask for the bundle definition RFC to be accepted prior to this RFC, but in that case, the bundle definition RFC needs a Shepherd!

It isn't clear to me if FASTQs end up in both library prep and assays bundles or if they only end up in one or the other.

Fastqs will end up in the same bundles they currently do: always in primary bundles; also in assay/secondary bundles since we currently implement copy-forward. I am not proposing a separate "library prep" bundle type here - only proposing that the library prep, not the sequencing assay process, be the entity that defines the scope of a bundle.
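As a rough illustration of "the library prep defines the scope of a bundle", here is a minimal sketch that groups sequence files by the library preparation they came from, regardless of which sequencing run produced them. The `(file_name, library_prep_id)` pair shape, the file names, and the function name are all assumptions made for this example and are not part of the metadata standard or any DCP API.

```python
from collections import defaultdict


def group_files_by_library_prep(links):
    """Group file names by library preparation ID.

    `links` is an iterable of (file_name, library_prep_id) pairs.
    This shape is hypothetical and only used for illustration.
    """
    groups = defaultdict(list)
    for file_name, library_prep_id in links:
        groups[library_prep_id].append(file_name)
    # Sort the file names so the result is deterministic.
    return {lp_id: sorted(names) for lp_id, names in groups.items()}


# One library prep (lp_1) sequenced in two runs, another (lp_2) sequenced once.
links = [
    ("run1_R1.fastq.gz", "lp_1"),
    ("run1_R2.fastq.gz", "lp_1"),
    ("run2_R1.fastq.gz", "lp_1"),
    ("run2_R2.fastq.gz", "lp_1"),
    ("run3_R1.fastq.gz", "lp_2"),
]
bundles = group_files_by_library_prep(links)
```

Under this grouping, the files from both runs of `lp_1` land in the same logical unit, which is the behaviour the analysis pipelines are asking for.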

the alarming part is about the DCP, not the RFC!!

I think "library" rather than "library_prep" would be more common terminology. Library prep is what you do; library is what you end up with.

@diekhans I think I would agree with Mark here, in principle.

yeah, a quick google would suggest that calling the biomaterial library and the protocol the library prep seems like a good way to disambiguate

mckinsel · 2019-07-22T18:30:08Z

rfcs/text/0000-rfc-library-preparation.md

+
+**Logical unit** - a set of metadata and data files that are consumed together to provide a context for understanding, processing, and interpreting data. In the HCA DCP, a logical unit is instantiated by a primary or secondary “bundle”, although the meaning of bundle is not stable or clearly defined at the moment.
+
+**Library preparation** (biomaterial) - A collection of DNA fragments that have been prepared for sequencing from a single biological sample. DNA fragments can be prepared from genomic DNA or transcribed from RNA and are usually ligated to adapter and barcode oligonucleotides.


Is "from a single biological sample" always true?

I guess not necessarily. I can reword this to not be as specific.

mckinsel · 2019-07-22T18:31:17Z

rfcs/text/0000-rfc-library-preparation.md

+"biomaterial_description": {
+    "description": "A general description of the biomaterial.",
+},
+"ncbi_taxon_id" : {


This is maybe out of scope for this RFC, but why does taxon id get attached to every biomaterial? Isn't it just inherited from the donor_organism?

+1 to Marcus' comment. I would recommend removing the taxon ID from anything down from the donor, unless there's a reason or occasion for this to be necessary.

Technically, ncbi_taxon_id gets added to each biomaterial because it is in the biomaterial_core schema, which is inherited by all biomaterials. I proposed adding ncbi_taxon_id to the library preparation entity (at least for now) to be consistent with all other biomaterials, which all contain biomaterial_core and thus all have ncbi_taxon_id.

There are legitimate, if rare, reasons to have ncbi taxon id on both donor and specimen, such as a humanized mouse with a human tumour but I agree that downwards from specimen it would seem unlikely. I know @simonjupp had plans which might make it easier to remove the field from other places, we could certainly create convenience tools to make it easier for contributors to not need to provide that info more times than needed.

Those feel like things which should go into a separate feature request though and not derail this process

I'll add another note below the schema about considering other ways to populate this field so it's easier for contributors, but I will keep the recommendation about including biomaterial_core in the new LP schema.
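One possible shape for such a convenience tool, sketched under assumptions: a biomaterial that does not state its own `ncbi_taxon_id` takes the value from its nearest ancestor that does. The `derived_from` key and the flat dictionary of biomaterials are hypothetical simplifications for this example; the real metadata links biomaterials through processes.

```python
def resolve_taxon_ids(biomaterial_id, biomaterials):
    """Return the taxon IDs for a biomaterial, inheriting from ancestors.

    `biomaterials` maps an ID to a dict that may contain "ncbi_taxon_id"
    (a list of ints) and "derived_from" (the parent ID). Both the mapping
    and the "derived_from" key are assumptions made for this sketch.
    """
    node = biomaterials[biomaterial_id]
    while node is not None:
        taxa = node.get("ncbi_taxon_id")
        if taxa:
            # An explicit value on the biomaterial itself wins.
            return taxa
        parent_id = node.get("derived_from")
        node = biomaterials.get(parent_id) if parent_id else None
    return []


biomaterials = {
    "donor_1": {"ncbi_taxon_id": [9606]},
    "specimen_1": {"derived_from": "donor_1"},
    "cell_suspension_1": {"derived_from": "specimen_1"},
    "library_prep_1": {"derived_from": "cell_suspension_1"},
    # Humanized mouse case: the specimen overrides the donor's taxon.
    "mouse_donor": {"ncbi_taxon_id": [10090]},
    "tumour_specimen": {"derived_from": "mouse_donor", "ncbi_taxon_id": [9606]},
}
```

With this approach a contributor would only need to state the taxon where it differs from the parent, as in the humanized mouse case mentioned above.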

mckinsel · 2019-07-22T18:31:57Z

rfcs/text/0000-rfc-library-preparation.md

+
+#### Added/changed library preparation-specific fields
+
+The new `library_preparation.json` schema will include a field for capturing INSDC experiment accessions if the project is already archived. This field (`process.insdc_experiment.insdc_experiment_accession`) generally represents a single library preparation in the archives (ENA, SRA, DDBJ). This field currently lives awkwardly in a process module for historical reasons and should be moved to the new library preparation schema.


Are there scenarios where this isn't an SRX id?

If the experiment has been submitted to one of the mentioned archives, then it will necessarily have an SRX/ERX/DRX accession, AFAIK. Of course, some submitted data might not be in an archive yet, in which case it won't have this accession.


mckinsel · 2019-07-22T18:33:44Z

rfcs/text/0000-rfc-library-preparation.md

+
+```
+
+will return all bundles (primary and secondary) that contain the specified library preparation.


The number of bundles that a query like this could return is pretty limited right? Seems like it would be 0, 1, or 2.

For now, that is correct. It might return more if we start to have matrix bundles, for example if the data from a library preparation is put into a project-wide matrix or into matrix bundles of some other granularity. The point of this statement was to clarify that you would always get all bundles, regardless of how many there are.
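To make the "always get all bundles" point concrete, here is a small in-memory sketch. It is not the ES query from the RFC and not a DSS API call; the bundle records, the `library_preparation_ids` key, and the function name are assumptions for illustration only.

```python
def bundles_containing(library_prep_id, bundles):
    """Return the UUIDs of every bundle that references the library prep.

    `bundles` is a list of dicts with hypothetical keys "bundle_uuid",
    "kind", and "library_preparation_ids".
    """
    return [
        bundle["bundle_uuid"]
        for bundle in bundles
        if library_prep_id in bundle["library_preparation_ids"]
    ]


bundles = [
    {"bundle_uuid": "b-primary", "kind": "primary", "library_preparation_ids": ["lp_1"]},
    {"bundle_uuid": "b-secondary", "kind": "secondary", "library_preparation_ids": ["lp_1"]},
    {"bundle_uuid": "b-other", "kind": "primary", "library_preparation_ids": ["lp_2"]},
]
matches = bundles_containing("lp_1", bundles)
```

The lookup does not filter on bundle kind, so a future matrix bundle that references the same library preparation would simply appear as one more match.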

mckinsel · 2019-07-22T18:39:26Z

rfcs/text/0000-rfc-library-preparation.md

+
+#### Matrix Service
+
+- Is anything coded against cell suspension?


The matrix service (and azul/data browser I believe) does make the assumption that the "lowest" biomaterial for a given cell is a cell_suspension. So there will be a little work required to shift that down to library prep. Also, it might get a little confusing if we use "library prep" as a shorthand to refer to both the library preparation protocol and the library preparation biomaterial.

The matrix service (and azul/data browser I believe) does make the assumption that the "lowest" biomaterial for a given cell is a cell_suspension. So there will be a little work required to shift that down to library prep.

I appreciate this change might need to be made. I don't know exactly what the assumptions are that these services make, but I wonder if cell suspension could still be used for some things (like assigning cell IDs) or whether it just makes it easier to shift everything down to LP.

Also, it might get a little confusing if we use "library prep" as a shorthand to refer to both the library preparation protocol and the library preparation biomaterial.

I agree that it's confusing to have both "Library preparation protocol" and "Library preparation" (the biomaterial). Based on my background of both working with and making library preparations, I feel these are the most "accurate" terms to describe these concepts, but am open to doing some research to see if there are alternative, but still accurate, terms to use. This RFC could probably be approved w/o resolving the naming issue.

I want to actually do some research to see how submitters/researchers refer to this entity.
I also want to see how they keep and differentiate this information themselves.

Am I right that the single cell expression atlas call this thing an assay? (not saying that's the right name, just wondered if it's the same thing.)

If there are any current submitters who will benefit from having the library prep ID to correctly represent their experiment (if they did 2 libraries, e.g.), maybe we can briefly chat to them?

Lastly, what is the assumption - do they keep this ID themselves and we're benefiting from this, or are we making them make it up on the spot?

Gabs.

Am I right that the single cell expression atlas call this thing an assay?

Based on the fact that in 1 project I checked for the SCEA they have SRR accessions in the Assay column, they use the term "Assay" to refer to 1 "run". A library preparation (biomaterial) can have 1 or more runs (the process that follows the sequencing protocol).

If there are any current submitters who will benefit from having the library prep ID to correctly represent their experiment, maybe we can briefly chat to them?

Don't see why not! Ask in the wrangling channel, as I'm not working with any current contributors.

what is the assumption - do they keep this ID themselves and we're benefiting from this, or are we making them make it up on the spot?

For the most part, I imagine that if a simple 1 cell suspension -> 1 library prep experiment is done, then the ID they keep track of might be the same for both of these. If the design is 1 cell suspension -> 1+ library preps, I imagine they would have a new ID for the LPs (maybe by adding "rep1", "rep2" or something like that to the end of the cell suspension ID).
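A sketch of that guessed convention, purely for illustration: it is an assumption about how contributors might name things, not something the standard would enforce, and the function name and the `_repN` suffix format are hypothetical.

```python
def library_prep_ids(cell_suspension_id, n_preps):
    """Derive library prep IDs from a cell suspension ID.

    With one library prep the cell suspension ID is reused as-is; with
    several, a "_repN" suffix is appended. Both behaviours are assumed
    conventions, not requirements of the metadata standard.
    """
    if n_preps < 1:
        raise ValueError("n_preps must be at least 1")
    if n_preps == 1:
        return [cell_suspension_id]
    return [f"{cell_suspension_id}_rep{i}" for i in range(1, n_preps + 1)]
```

For example, `library_prep_ids("cs_A", 2)` gives `["cs_A_rep1", "cs_A_rep2"]`, while a single prep keeps the plain `"cs_A"`.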

mckinsel · 2019-07-22T18:40:32Z

rfcs/text/0000-rfc-library-preparation.md

+#### Matrix Service
+
+- Is anything coded against cell suspension?
+- Will there be disruptions to how Matrix Service identifies single cells if we insert a library preparation entity between cell suspension and sequence file (thinking especially for plate-based technologies)?


The matrix service uses the cell suspension uuid as the cell id for smart-seq2 experiments. I think we could actually just keep doing that and not break anything, because there is a one-to-one relationship between cell suspension and library preparation for ss2.

Sounds good! Certainly don't want to make it harder to generate cell IDs for plate-based assays.
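The assumption that makes this safe could be checked before reusing the cell suspension uuid as the cell id. Below is a minimal sketch of such a check; the list of `(cell_suspension_uuid, library_prep_uuid)` pairs and the function name are hypothetical, and this is not Matrix Service code.

```python
def is_one_to_one(pairs):
    """Return True if each cell suspension maps to exactly one library
    prep and each library prep maps back to exactly one cell suspension.

    `pairs` is an iterable of (cell_suspension_uuid, library_prep_uuid)
    tuples; this input shape is an assumption for illustration.
    """
    suspension_to_prep = {}
    prep_to_suspension = {}
    for suspension, prep in pairs:
        # setdefault stores the first partner seen and returns it after that.
        if suspension_to_prep.setdefault(suspension, prep) != prep:
            return False
        if prep_to_suspension.setdefault(prep, suspension) != suspension:
            return False
    return True
```

If the check fails for a dataset (for example, one cell suspension with two library preps), the cell suspension uuid alone would no longer identify a single set of files.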

mckinsel · 2019-07-22T18:42:11Z

rfcs/text/0000-rfc-library-preparation.md

+
+#### To solve during RFC process
+- How will the DCP enforce the requirement of having the library preparation entity in every logical unit (i.e. bundle)?
+- How will current production sequencing datasets be updated? All of the current datasets will need to be updated as all of them will need to have library preparation entities added. This is a complex update - change in "bundle" structure by adding a `library_preparation_0.json` entity - so it can not be done with simple AUDR. If the update happens after GA but before complex AUDR, the UUIDs will have to be maintained somehow.


Not even thinking about AUDR constraints, is there a way to automatically migrate projects, even in theory? For example, I assume there would need to be some bundle mergers that would require a conversation with the data contributor.

I think we must come up with a plan which doesn't require complete re-ingestion for this to work. There have been some internal discussions within ingest, and we think this might be possible, but it probably won't be pretty.

Ensuring we are correctly representing the library prep strategy might need a discussion with some contributors but we should never need to discuss the bundle structure with a contributor

I'd really prefer to take the time/resources required to make it pretty. A more ad hoc pseudo-re-ingestion strategy will introduce a lot of confusion to downstream components.

I assume there would need to be some bundle mergers that would require a conversation with the data contributor.

We have addressed all the merged bundles with the most recent re-ingestion. So, all the current bundles are correct now in terms of containing the correct data files. What would need to be changed is to insert a library preparation biomaterial JSON into each bundle, and update the linking to put the LP biomaterial JSON in the correct place in the experimental graph.
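The linking update described here can be sketched as a small graph rewrite: edges that go directly from a cell suspension to its sequence files are replaced by an edge to the new library preparation plus edges from the library preparation to the files. The flat `(source, target)` edge list is a deliberate simplification and an assumption of this sketch; actual bundle links connect entities through processes and protocols.

```python
def insert_library_prep(edges, cell_suspension, library_prep, files):
    """Insert a library prep node between a cell suspension and its files.

    `edges` is a list of (source, target) tuples. This flat representation
    is hypothetical and ignores the process entities in real bundle links.
    """
    files = set(files)
    # Keep every edge except the direct cell suspension -> file links.
    kept = [
        (source, target)
        for source, target in edges
        if not (source == cell_suspension and target in files)
    ]
    # Re-route those links through the new library prep node.
    rerouted = [(cell_suspension, library_prep)]
    rerouted += [(library_prep, file_id) for file_id in sorted(files)]
    return kept + rerouted


edges = [("specimen", "cs"), ("cs", "f1"), ("cs", "f2")]
migrated = insert_library_prep(edges, "cs", "lp", ["f1", "f2"])
```

Everything upstream of the cell suspension is untouched, so existing UUIDs for those entities would not need to change under this kind of rewrite.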

malloryfreeberg · 2019-07-29T13:07:19Z

@gabsie @mckinsel @lauraclarke I have tried to respond to everyone's comments, some with updates to the RFC. The last day for comments is July 31, which is also my last day with the DCP. If I have adequately addressed your comment/concern, can you please Resolve the comment? If not, can you let me know what questions/concerns you still have?

Thank you!

@brianraymor how is a Shepherd assigned for the RFC?

malloryfreeberg · 2019-07-30T09:44:56Z

@justincc has volunteered to be Shepherd.

diekhans

The new metadata entity is great.

I am concerned about the bundling.

diekhans · 2019-07-30T23:28:36Z

rfcs/text/0000-rfc-library-preparation.md

+### Overview
+
+We will resolve the challenges described above by **creating a first-class library preparation biomaterial entity** as part of the HCA metadata standard. This approach will allow the metadata model to explicitly represent which sequencing data files came from the same library preparation even if the data files were generated across multiple sequencing runs (i.e. multiple processes). This approach would also enable data consumers to more easily identify all data files associated with the same library preparation (i.e. from one logical unit) which is required for correct data processing. Unless and until the fundamental way that DNA sequencing experiments are performed changes, all sequencing-based experiments will involve generating library preparations; therefore, the addition of this entity to the metadata model will be used by all current and future sequence-based datasets.
+


I find it concerning that we are solving a downstream processing problem by changing the storage structure of the data. The lack of any separation between storage and processing is severely limiting. This is not sustainable if we have other or conflicting analysis needs that require being notified in a different manner.

This is the only short-term hack currently available, so I am not suggesting that it not be done, but it shouldn't be considered a general solution.

It also creates complexity as it is creating bundles that look different to the user.

it is creating bundles that look different to the user.

While I agree that implementation of this RFC means technically bundles might look different to the user (some bundles might have a single triplet of fastq files if an LP was sequenced once, while other bundles might have multiple triplets of fastq files if an LP was sequenced more than once), the overwhelming opinion I get from the comp bios I've talked to (e.g. @barkasn @kishorikonwar) is that "all fastq files from the same library preparation need to be processed together", not just in the DCP but as a general rule for processing single cell sequencing data anywhere.

Thus, from my perspective, providing an efficient way to get all fastq files per LP is preferred over providing that exact same structure within a bundle (e.g. 1 triplet per bundle, what we've been doing up until now). We already know the structure within a bundle is going to be different for imaging data, anyway.

I went ahead and created a scratch DCP terminology page. Here I've written that a bundle is just a collection of file references so if you don't agree please edit/comment :)

We also have the Bundle Definition RFC! #93

I doubt that an rfc was the right mechanism for that. Why is andrey the only reviewer?

And in the latest part of this comedy it turns out there is already a Google doc

@justincc I believe that RFC is still in Draft mode. Re. google docs: we should be slowly migrating the important info from google docs to RFCs to avoid the google drive morass.

This isn't suitable for an RFC since it's something that changes in increments over time. You can't start an RFC every time you want to add a definition; that just means no one ever writes anything because the process barrier is too high.

diekhans · 2019-07-30T23:34:22Z

rfcs/text/0000-rfc-library-preparation.md

+
+### Definitions
+
+**Logical unit** - a set of metadata and data files that are consumed together to provide a context for understanding, processing, and interpreting data. In the HCA DCP, a logical unit is instantiated by a primary or secondary “bundle”, although the meaning of bundle is not stable or clearly defined at the moment.


The fact that we don't know the semantics of a fundamental data structure is alarming. Rather than create new terminology, I think a rigorous definition of the DCP's use of a bundle should be considered a blocker, and this RFC and data submissions of this type should be stalled until it is resolved.

It isn't clear to me if FASTQs end up in both library prep and assays bundles or if they only end up in one or the other.


malloryfreeberg added 20 commits July 16, 2019 16:07

First draft library prep RFC.

860687d

Fixed image links.

a6fab3b

Fixed formatting of things.

e189bf7

Fixed formatting of things.

e514cd4

Tested resizing image.

ad190ca

Tested resizing image.

eab833c

Tested resizing image.

7489fbb

Tested resizing image.

2cbc221

Tested resizing image.

cdfd533

Fixed formatting of things.

8c59981

Fixed formatting of things.

36de394

Fixed formatting of things.

b8fb931

Debugging.

21bc457

Debugging.

7f23eba

Debugging.

05dcbab

Debugging.

17db6f8

Debugging.

42bfd26

Debugging.

40cc466

Debugging.

17b0b6c

Final formatting fixes.

8bbcb85

malloryfreeberg added rfc-community-review Architecture labels Jul 16, 2019

malloryfreeberg changed the title ~~Representing sequencing library preparations in the HCA DCP metadata standard~~ RFC: Representing sequencing library preparations in the HCA DCP metadata standard Jul 16, 2019

malloryfreeberg added 2 commits July 16, 2019 17:24

Fixed author formatting.

14b7200

Added graffle image files.

66dac51

Table formatting fixes.

d1d3e3c

Updated graffle file.

7ffd25b

brianraymor mentioned this pull request Jul 18, 2019

[spike] Enable data files from the same sequencing library to be collected as a logical unit HumanCellAtlas/dcp#415

Open

mckinsel reviewed Jul 22, 2019

View reviewed changes

malloryfreeberg added 2 commits July 24, 2019 16:27

Updated ES query.

e6f5036

Updated definition of LP biomaterial.

4fb9d46

lauraclarke mentioned this pull request Jul 29, 2019

Processing Datasets that Span Multiple Data Collection Runs #88

Merged

Added caveat about ncbi_taxon_id.

f1cd7fb

malloryfreeberg added 2 commits July 30, 2019 10:46

Added Justin as Shepherd.

694201e

Cleaned up author/shepherd formatting.

96a8317

diekhans reviewed Jul 30, 2019

View reviewed changes

Updated logical unit definition.

6e5c9ae

mshadbolt self-assigned this Aug 2, 2019

spelling fix in surname

884346c

justincc approved these changes Aug 6, 2019

View reviewed changes

add last call for oversight

2073de7

brianraymor added rfc-oversight-review and removed rfc-community-review labels Aug 15, 2019

justincc added rfc-approved and removed rfc-oversight-review labels Aug 21, 2019

justincc added 2 commits August 21, 2019 17:33

Rename 0000-rfc-library-preparation.md to 0010-rfc-library-preparatio…

41c6957

…n.md

add link

68f0c42

justincc merged commit 00f0acd into master Aug 21, 2019

justincc deleted the mfreeberg-rfc-library-preparation branch August 21, 2019 16:36

This was referenced Sep 12, 2019

Enable data files from the same sequencing library to be collected as a logical unit HumanCellAtlas/dcp#413

Open

[spike] Implementation plan for Representing sequencing library preparations in the HCA DCP metadata standard HumanCellAtlas/dcp#511

Open

jlchang mentioned this pull request Sep 18, 2019

How to determine expected metadata content in Matrix Service loom files HumanCellAtlas/matrix-service#381

Open

arschat mentioned this pull request Sep 27, 2023

HCA Tier 1 metadata mapping to DCP metadata fields ebi-ait/hca-ebi-wrangler-central#1178

Open


