Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Representing sequencing library preparations in the HCA DCP metadata standard #87

Merged
merged 36 commits into from
Aug 21, 2019

Conversation

malloryfreeberg
Copy link
Member

@malloryfreeberg malloryfreeberg commented Jul 16, 2019

Summary

The current HCA DCP metadata model explicitly represents cell suspensions (single cells or multiple cells suspended in some media) but not the sequencing library preparations derived from them. This is creating challenges for contributors, consumers, and DCP implementation teams when submitting, processing, and interpreting sequencing data. Here we propose a solution for explicitly identifying library preparation biomaterials in a sequencing experiment by making them a first-class biomaterial type entity in the metadata standard.

July 31: Last call for community review
August 15: Last call for oversight review

@diekhans
Copy link
Contributor

Please include source files for images so that the RFC can be maintained when the author goes on to new adventures.

@malloryfreeberg malloryfreeberg changed the title Representing sequencing library preparations in the HCA DCP metadata standard RFC: Representing sequencing library preparations in the HCA DCP metadata standard Jul 16, 2019
@malloryfreeberg
Copy link
Member Author

Please include source files for images so that the RFC can be maintained when the author goes on to new adventures.

Added!

@hewgreen
Copy link

This turned out really nice, thanks metadata team. This gets my vote because the new entity is the perfect pivot for consuming metadata downstream. Another unmentioned benefit is that traversing the whole graph isn't as varied because you expect to find this lib prep entity.

With this new entity are there any fields currently on the library_preparation_protocol that would better fit onto this new biomaterial? Many of those fields are the most essential to downstream analysis so it may be nicer to have then on the main backbone and as prominent as possible.

Will sequence_file.library_prep_id be removed? I think this is implied and it should be.

@malloryfreeberg
Copy link
Member Author

Will sequence_file.library_prep_id be removed? I think this is implied and it should be.

Yep, I can add this explicitly in the RFC.

With this new entity are there any fields currently on the library_preparation_protocol that would better fit onto this new biomaterial?

We can think about this. From a practical perspective, I can't immediately think of any fields the would benefit from being able to indicate them on each library prep (which means having to copy the value potentially many times) that can't be satisfied by putting that field once in the protocol. But, it's an interesting point to consider.


### Definitions

**Logical unit** - a set of metadata and data files that are consumed together to provide a context for understanding, processing, and interpreting data. In the HCA DCP, a logical unit is instantiated by a primary or secondary “bundle”, although the meaning of bundle is not stable or clearly defined at the moment.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a pretty broad concept for what we're talking about here. Isn't the core problem to solve just figuring out what data should be processed together in a secondary analysis pipeline?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true that the impetus for this RFC was to support being able to processing more than one set of sequence data files together. My goal was to have a term that basically represented a "bundle", but I didn't want to use the word "bundle" given that the definition of bundle isn't concrete at the moment. I wanted to provide a term to use throughout the RFC that essentially was talking about a bundle but wouldn't preclude talking about groups of things to be processed together if definition of bundle changes (or we move away from bundles as they exist now). Is there another term or definition that might be less broad than "logical unit"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that we don't know the semantics of a fundamental data structure is alarming and I think rather than create new terminology, having a rigorous definition of the DCP's use of a bundle should be considered a blocker and this RRC and data submissions of this type stalled until it is resolved.

It isn't clear to me if FASTQs end up in both library prep and assays bundles or if they only end up in one or the other.

Copy link
Member Author

@malloryfreeberg malloryfreeberg Jul 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@diekhans apologies for the confusion. My intention was not to create new DCP terminology, but rather provide a term to use just in the RFC to aid readability (have updated RFC to reflect this). Essentially a "logical unit" is a bundle, but I was hoping to avoid the DCP-specific bundle term because, as you say, bundle definition is ongoing work.

I don't think it's unreasonable to ask for the bundle definition RFC to be accepted prior to this RFC, but in that case, the bundle definition RFC needs a Shepherd!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't clear to me if FASTQs end up in both library prep and assays bundles or if they only end up in one or the other.

Fastqs will end up in the same bundles they currently do: always in primary bundles; also in assay/secondary bundles since we currently implement copy-forward. I am not proposing a separate "library prep" bundle type here - only proposing that the library prep, not the sequencing assay process, be the entity that defines the scope of a bundle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the alarming part is about the DCP, not the RFC!!

Copy link
Contributor

@diekhans diekhans Aug 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "library" rather than "library_prep" would be more common terminology. Library prep is what you do, library is what you end to up with. I

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@diekhans I think I would agree with Mark here, in principle.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, a quick google would suggest that calling the biomaterial library and the protocol the library prep seems like a good way to disambiguate


**Logical unit** - a set of metadata and data files that are consumed together to provide a context for understanding, processing, and interpreting data. In the HCA DCP, a logical unit is instantiated by a primary or secondary “bundle”, although the meaning of bundle is not stable or clearly defined at the moment.

**Library preparation** (biomaterial) - A collection of DNA fragments that have been prepared for sequencing from a single biological sample. DNA fragments can be prepared from genomic DNA or transcribed from RNA and are usually ligated to adapter and barcode oligonucleotides.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "from a single biological sample" always true?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess not necessarily. I can reword this to not be as specific.

"biomaterial_description": {
"description": "A general description of the biomaterial.",
},
"ncbi_taxon_id" : {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is maybe out of scope for this RFC, but why does taxon id get attached to every biomaterial? Isn't it just inherited from the donor_organism?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to Marcus' comment. I would recommend removing the taxon ID from anything down from the donor, unless there's a reason or occasion for this to be necessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, ncbi_taxon_id gets added to each biomaterial because it is in the biomaterial_core schema, which is inherited by all biomaterials. I proposed adding ncbi_taxon_id to the library preparation entity (at least for now) to be consistent with all other biomaterials, which all contain biomaterial_core and thus all have ncbi_taxon_id.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are legitimate, if rare, reasons to have ncbi taxon id on both donor and specimen, such as a humanized mouse with a human tumour but I agree that downwards from specimen it would seem unlikely. I know @simonjupp had plans which might make it easier to remove the field from other places, we could certainly create convenience tools to make it easier for contributors to not need to provide that info more times than needed.

Those feel like things which should go into a separate feature request though and not derail this process

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add another note below the schema about considering other ways to populate this field so it's easier for contributors, but I will keep the recommendation about including biomaterial_core in the new LP schema.


#### Added/changed library preparation-specific fields

The new `library_preparation.json` schema will include a field for capturing INSDC experiment accessions if the project is already archived. This field (`process.insdc_experiment.insdc_experiment_accession`) generally represents a single library preparation in the archives (ENA, SRA, DDJB). This field currently lives awkwardly in a process module for historical reasons and should be moved to the new library preparation schema.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there scenarios where this isn't an SRX id?

Copy link
Member Author

@malloryfreeberg malloryfreeberg Jul 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the experiment has been submitted to one of the mentioned archives, then it will necessarily have a SRX/ERX/DRX accession, AFAIK. Of course, some submitted data might not be in an archive yet, in which case it won't have this accession.

rfcs/text/0000-rfc-library-preparation.md Outdated Show resolved Hide resolved

```

will return all bundles (primary and secondary) that contain the specified library preparation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The number of bundles that a query like this could return is pretty limited right? Seems like it would be 0, 1, or 2.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, that is correct. Might return more if we start to have matrix bundles, if the data from a library preparation is put into a project-wide matrix and possibly other granularity of matrix bundles. The point of this statement was to clarify that you would always get all bundles, regardless of how many there are.


#### Matrix Service

- Is anything coded against cell suspension?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The matrix service (and azul/data browser I believe) does make the assumption that the "lowest" biomaterial for a given cell is a cell_suspension. So there will be a little work required to shift that down to library prep. Also, it might get a little confusing if we use "library prep" as a shorthand to refer to both the library preparation protocol and the library preparation biomaterial.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The matrix service (and azul/data browser I believe) does make the assumption that the "lowest" biomaterial for a given cell is a cell_suspension. So there will be a little work required to shift that down to library prep.

I appreciate this change might need to be made. I don't know exactly what the assumptions are that these services make, but I wonder if cell suspension could still be used for some things (like assigning cell IDs) or whether it just makes it easier to shift everything down to LP.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it might get a little confusing if we use "library prep" as a shorthand to refer to both the library preparation protocol and the library preparation biomaterial.

I agree that it's confusing to have both "Library preparation protocol" and "Library preparation" (the biomaterial). Based on my background of both working with and making library preparations, I feel these are the most "accurate" terms to describe these concepts, but am open to doing some research to see if there are alternative, but still accurate, terms to use. This RFC could probably be approved w/o resolving the naming issue.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to actually do some research to see how submitters/researchers refer to this entity.
I also want to see how they keep and differentiate this information themselves.

Am I right that the single cell expression atlas call this thing an assay? (not saying that's the right name, just wondered if it's the same thing.)

If there are any current submitters who will benefit from having the library prep ID to correctly represent their experiment (if they did 2 libraries, e.g.), maybe we can briefly chat to them?

Lastly, what is the assumption - do they keep this ID themselves and we're benefiting from this, or are we making them make it up on the spot?

Gabs.

Copy link
Member Author

@malloryfreeberg malloryfreeberg Jul 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I right that the single cell expression atlas call this thing an assay?

Based on the fact that in 1 project I checked for the SCEA they have SRR accessions in the Assay column, they use the term "Assay" to refer to 1 "run". A library preparation (biomaterial) can have 1 or more runs (the process that follows the sequencing protocol).

If there are any current submitters who will benefit from having the library prep ID to correctly represent their experiment, maybe we can briefly chat to them?

Don't see why not! Ask in the wrangling channel, as I'm not working with any current contributors.

what is the assumption - do they keep this ID themselves and we're benefiting from this, or are we making them make it up on the spot?

For the most part, I imaging if a simple 1 cell suspension -> 1 library prep experiment is done, then the ID they keep track of might be the same for both of these. If the design is 1 cell suspension -> 1+ library preps, I imagine they would have a new ID for the LPs (maybe by adding "rep1", "rep2" or something like that to the end of the cell suspension ID).

#### Matrix Service

- Is anything coded against cell suspension?
- Will there be disruptions to how Matrix Service identifies single cells if we insert a library preparation entity between cell suspension and sequence file (thinking especially for plate-based technologies)?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The matrix service uses the cell suspension uuid as the cell id for smart-seq2 experiments. I think we could actually just keep doing that and not break anything because there a one-to-one relationship between cell suspension and library preparation for ss2.

Copy link
Member Author

@malloryfreeberg malloryfreeberg Jul 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! Certainly don't want to make it harder to generate cell IDs for plate-based assays.


#### To solve during RFC process
- How will the DCP enforce the requirement of having the library preparation entity in every logical unit (i.e. bundle)?
- How will current production sequencing datasets be updated? All of the current datasets will need to be updated as all of them will need to have library preparation entities added. This is a complex update - change in "bundle" structure by adding a `library_preparation_0.json` entity - so it can not be done with simple AUDR. If the update happens after GA but before complex AUDR, the UUIDs will have to be maintained somehow.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not even thinking about AUDR constraints, is there a way to automatically migrate projects, even in theory? For example, I assume there would need to be some bundle mergers that would require a conversation with the data contributor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we must come up with a plan which doesn't require complete re-ingestion for this to work. There has been some internal discussions within ingest and we think this might be possible but it probably won't be pretty.

Ensuring we are correctly representing the library prep strategy might need a discussion with some contributors but we should never need to discuss the bundle structure with a contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really prefer to take the time/resources required to make it pretty. A more ad hoc pseudo-re-ingestion strategy will introduce a lot of confusion to downstream components.

Copy link
Member Author

@malloryfreeberg malloryfreeberg Jul 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume there would need to be some bundle mergers that would require a conversation with the data contributor.

We have addressed all the merged bundles with the most recent re-ingestion. So, all the current bundles are correct now in terms of containing the correct data files. What would need to be changed is to insert a library preparation biomaterial JSON into each bundle, and update the linking to put the LP biomaterial JSON in the correct place in the experimental graph.

@malloryfreeberg
Copy link
Member Author

@gabsie @mckinsel @lauraclarke I have tried to respond to everyone's comments, some with updates to the RFC. The last day for comments in July 31, which is also my last day with the DCP. Please, if I have adequately addressed your comment/concern, can you Resolve the comment? If not, can you let me know what questions/concerns you still have?

Thank you!

@brianraymor how is a Shepherd assigned for the RFC?

@malloryfreeberg
Copy link
Member Author

@justincc has volunteered to be Shepherd.

Copy link
Contributor

@diekhans diekhans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new metadata entity is great.

I am concerned about the bundling.

### Overview

We will resolve the challenges described above by **creating a first-class library preparation biomaterial entity** as part of the HCA metadata standard. This approach will allow the metadata model to explicitly represent which sequencing data files came from the same library preparation even if the data files were generated across multiple sequencing runs (i.e. multiple processes). This approach would also enable data consumers to more easily identify all data files associated with the same library preparation (i.e. from one logical unit) which is required for correct data processing. Unless and until the fundamental way that DNA sequencing experiments are performed changes, all sequencing-based experiments will involve generating library preparations; therefore, the addition of this entity to the metadata model will be used by all current and future sequence-based datasets.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it concerning that we are solving a downstream processing problem by changing the storage structure of the data. The lack of any separation between storage and processing is severely limiting. This is not sustainable if we have other or conflicting analysis needs that require being notified in a different manner.

This is the only short-term hack currently available, so I am not suggesting that it not be done, but that it shouldn't be considered a general solutuon.

It also creates complexity as it is creating bundles that look different to the user.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is creating bundles that look different to the user.

While I agree that implementation of this RFC means technically bundles might look different to the user (some bundles might have a single triplet of fastq files if an LP was sequenced once, while other bundles might have multiple triplets of fastq files if an LP was sequenced more than once), the overwhelming opinion I get from the comp bios I've talked to (e.g. @barkasn @kishorikonwar) is that "all fastq files from the same library preparation need to be processed together", not just in the DCP but as a general rule for processing single cell sequencing data anywhere.

Thus, from my perspective, providing an efficient way to get all fastq files per LP is preferred over providing that exact same structure within a bundle (e.g. 1 triplet per bundle, what we've been doing up until now). We already know the structure within a bundle is going to be different for imaging data, anyway.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and created a scratch DCP terminology page. Here I've written that a bundle is just a collection of file references so if you don't agree please edit/comment :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also have the Bundle Definition RFC! #93

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt that an rfc was the right mechanism for that. Why is andrey the only reviewer?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And in the latest part of this comedy it turns out there is already a Google doc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justincc I believe that RFC is still in Draft mode. Re. google docs: we should be slowly migrating the important info from google docs to RFCs to avoid the google drive morass.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't suitable for an RFC since it's something that changes in increments over time, You can't start an RFC every time you want to add a definition - that just means no one every writes anything because the process barrier is too high.


### Definitions

**Logical unit** - a set of metadata and data files that are consumed together to provide a context for understanding, processing, and interpreting data. In the HCA DCP, a logical unit is instantiated by a primary or secondary “bundle”, although the meaning of bundle is not stable or clearly defined at the moment.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that we don't know the semantics of a fundamental data structure is alarming and I think rather than create new terminology, having a rigorous definition of the DCP's use of a bundle should be considered a blocker and this RRC and data submissions of this type stalled until it is resolved.

It isn't clear to me if FASTQs end up in both library prep and assays bundles or if they only end up in one or the other.

@mshadbolt mshadbolt self-assigned this Aug 2, 2019
@justincc justincc merged commit 00f0acd into master Aug 21, 2019
@justincc justincc deleted the mfreeberg-rfc-library-preparation branch August 21, 2019 16:36
diekhans pushed a commit to diekhans/dcp-community that referenced this pull request Oct 31, 2019
…data standard (HumanCellAtlas#87)

* First draft library prep RFC.

* Fixed image links.

* Fixed formatting of things.

* Fixed formatting of things.

* Tested resizing image.

* Tested resizing image.

* Tested resizing image.

* Tested resizing image.

* Tested resizing image.

* Fixed formatting of things.

* Fixed formatting of things.

* Fixed formatting of things.

* Debugging.

* Debugging.

* Debugging.

* Debugging.

* Debugging.

* Debugging.

* Debugging.

* Final formatting fixes.

* Fixed author formatting.

* Added graffle image files.

* Table formatting fixes.

* Added section about removing old field.

* Fixed table formatting.

* Updated graffle file.

* Updated ES query.

* Updated definition of LP biomaterial.

* Added caveat about ncbi_taxon_id.

* Added Justin as Shepherd.

* Cleaned up author/shepherd formatting.

* Updated logical unit definition.

* spelling fix in surname

* add last call for oversight

* Rename 0000-rfc-library-preparation.md to 0010-rfc-library-preparation.md

* add link
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants