Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index lm4 in LungMAP #5769

Closed
14 tasks done
bvizzier-ucsc opened this issue Dec 7, 2023 · 9 comments
Closed
14 tasks done

Index lm4 in LungMAP #5769

bvizzier-ucsc opened this issue Dec 7, 2023 · 9 comments
Assignees
Labels
- [priority] Medium enh [type] New feature or request infra [subject] Project infrastructure like CI/CD, build and deployment scripts no demo [process] Not to be demonstrated at the end of the sprint operator [process] To be addressed by whoever is operator orange [process] Done by the Azul team

Comments

@bvizzier-ucsc
Copy link

bvizzier-ucsc commented Dec 7, 2023

There are 7 new datasets snapshots to be added to LungMAP , for a total of 13 see Daniel's edit below.

Please see the lm4 column in the LungMAP Prod tab of the "TDR Dataset and snapshots" spreadsheet.

Slack thread

[edit: @dsotirho-ucsc: Of the 7 snapshots, 6 are updates to existing projects and 1 is for a new project (4ae8c5c9). Catalog lm4 will have 7 total projects.]

  • Security design review completed; the Resolution of this issue does not
    • … affect authentication; for example:
      • OAuth 2.0 with the application (API or Swagger UI)
      • Authentication of developers with Google Cloud APIs
      • Authentication of developers with AWS APIs
      • Authentication with a GitLab instance in the system
      • Password and 2FA authentication with GitHub
      • API access token authentication with GitHub
      • Authentication with
    • … affect the permissions of internal users like access to
      • Cloud resources on AWS and GCP
      • GitLab repositories, projects and groups, administration
      • an EC2 instance via SSH
      • GitHub issues, pull requests, commits, commit statuses, wikis, repositories, organizations
    • … affect the permissions of external users like access to
      • TDR snapshots
    • … affect permissions of service or bot accounts
      • Cloud resources on AWS and GCP
    • … affect audit logging in the system, like
      • adding, removing or changing a log message that represents an auditable event
      • changing the routing of log messages through the system
    • … affect monitoring of the system
    • … introduce a new software dependency like
      • Python packages on PYPI
      • Command-line utilities
      • Docker images
      • Terraform providers
    • … add an interface that exposes sensitive or confidential data at the security boundary
    • … affect the encryption of data at rest
    • … require persistence of sensitive or confidential data that might require encryption at rest
    • … require unencrypted transmission of data within the security boundary
    • … affect the network security layer; for example by
      • modifying, adding or removing firewall rules
      • modifying, adding or removing security groups
      • changing or adding a port a service, proxy or load balancer listens on
  • Documentation on any unchecked boxes is provided in comments below
@bvizzier-ucsc bvizzier-ucsc added enh [type] New feature or request infra [subject] Project infrastructure like CI/CD, build and deployment scripts orange [process] Done by the Azul team no demo [process] Not to be demonstrated at the end of the sprint operator [process] To be addressed by whoever is operator - [priority] Medium labels Dec 7, 2023
@dsotirho-ucsc dsotirho-ucsc self-assigned this Dec 8, 2023
hannes-ucsc added a commit that referenced this issue Dec 21, 2023
This reverts commit 3acd9f3, reversing
changes made to 2f9d2cd.
@hannes-ucsc
Copy link
Member

Apparently, lm4 now uses non-null DRS URIs in file descriptors but we weren't informed and didn't implement #3631 ahead of time. We had to back this change out. It was slated to be promoted to prod today.

@hannes-ucsc
Copy link
Member

hannes-ucsc commented Dec 21, 2023

This also needs to be reverted on develop.

@hannes-ucsc
Copy link
Member

We decided in stand-up that we would do a minimal fix to ignore the DRS URIs in the file descriptors. This would allow for publishing lm4 soon but the files with non-null DRSs URIs in their descriptors (presumably those pointing at BDCat) would not be downloadable from the DB for LungMap until we resolve #3631.

@hannes-ucsc
Copy link
Member

An example descriptor is

{
    "content_type": ".gz",
    "crc32c": "f994e3b0",
    "describedBy": "https://raw.githubusercontent.com/HumanCellAtlas/metadata-schema/lungmap-phantom-files/json_schema/system/file_descriptor.json",
    "drs_uri": "drs://dg.4503:dg.4503/6282d0a2-732a-4949-a35d-e822581a705e",
    "file_id": "8035f992-ebfd-4546-8330-a18cdd33ac8e",
    "file_name": "donor2_periph_rep2_R1.fastq.gz",
    "file_version": "2023-09-01T15:15:51.659382Z",
    "schema_type": "file_descriptor",
    "schema_version": "2.1.1",
    "sha256": "725c15f885ac44cde9010b5dcfb85ce200d4e62b8084bb0d64809b9e51a36346",
    "size": 4332059752
}

The resulting Azul failure is an requirement error

[WARNING]	2023-12-21T05:01:00.991Z	6b2b43e9-95a0-57fa-a7d7-1e145d3a5ea3	azul.indexer.index_controller	Worker failed to handle message {'action': 'add', 'notification': {'transaction_id': 'a30ba14a-02e4-4960-99d5-e89453d4abb3', 'bundle_fqid': {'uuid': '31244706-60f8-3043-8db7-a39fd7081139', 'version': '2022-03-11T17:13:52.129998Z', 'source': {'id': '15324e53-b8d3-40b7-92fb-abfeb35e6515', 'spec': 'tdr:datarepo-3f332829:snapshot/lungmap_prod_f899709cae2c4bb988f0131142e6c7ec__20220310_20231207_lm4:/0'}}}, 'catalog': 'lm4-it'}.

Traceback (most recent call last):

  File "/var/task/azul/indexer/index_controller.py", line 179, in contribute

    contributions = self.transform(catalog, notification, delete)

                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/var/task/azul/indexer/index_controller.py", line 217, in transform

    bundle = service.fetch_bundle(catalog, bundle_fqid)

             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/var/task/azul/indexer/index_service.py", line 183, in fetch_bundle

    return plugin.fetch_bundle(bundle_fqid)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/var/task/azul/plugins/repository/tdr.py", line 211, in fetch_bundle

    bundle = self._emulate_bundle(bundle_fqid)

             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/var/task/azul/plugins/repository/tdr_hca/__init__.py", line 384, in _emulate_bundle

    bundle.add_entity(entity_key=f'{entity_type}_{i}.json',

  File "/var/task/azul/plugins/repository/tdr_hca/__init__.py", line 212, in add_entity

    drs_uri=self._parse_drs_uri(entity_row['file_id'], descriptor))

            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/var/task/azul/plugins/repository/tdr_hca/__init__.py", line 292, in _parse_drs_uri

    require(external_drs_uri is None,

  File "/var/task/azul/__init__.py", line 1526, in require

    reject(not condition, *args, exception=exception)

  File "/var/task/azul/__init__.py", line 1545, in reject

    raise exception(*args)

azul.RequirementError: ('Non-null `drs_uri` in file descriptor', 'drs://dg.4503:dg.4503/5b9d3de7-9d29-4e6a-9f2c-c3af1fdce900')

Note the drs_uri property which is not supported by the HCA metadata schema. The PR intended to add support for that property is HumanCellAtlas/metadata-schema#1437 but looks abandoned. The corresponding spec change is PR HumanCellAtlas/dcp2#55 which has been merged. The descriptor validates because it refers to the schema from the yet to be approved and merged PR branch, but using a tag (lungmap-phantom-files).

So we are now in a precarious state: the spec change was merged, the actual schema change was not and LungMap is providing JSON metadata referring to a non-standard, unapproved schema at an ad-hoc URL. The tag could be deleted, rendering the JSON referring to it invalid.

As outlined above, we will remove the requirement that's causing the error in Azul and ignore the DRS URI (#5824) so that we can go forward with lm4. Before we can move forward with #3631, the schema PR would have to be merged. Updates LungMap snapshots would also have to be cut so that the descriptors refer to the updated, official and released schema.

@hannes-ucsc
Copy link
Member

lm4 has been indexed in prod. It can be previewed here:

https://data-browser.lungmap.net/explore/projects?catalog=lm4

@hannes-ucsc
Copy link
Member

hannes-ucsc commented Jan 16, 2024

Can we remove lm3 and change the defautl to lm4?

@achave11-ucsc
Copy link
Member

achave11-ucsc commented Jan 16, 2024

Assignee to create follow-up ticket.

@achave11-ucsc
Copy link
Member

@hannes-ucsc: "Also remember to assign @dsotirho-ucsc when done."

@bvizzier-ucsc
Copy link
Author

CC has completed there work so this can be moved forward with making LM4 the default.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
- [priority] Medium enh [type] New feature or request infra [subject] Project infrastructure like CI/CD, build and deployment scripts no demo [process] Not to be demonstrated at the end of the sprint operator [process] To be addressed by whoever is operator orange [process] Done by the Azul team
Projects
None yet
Development

No branches or pull requests

4 participants