Refactor Data Guide #162

Merged: 11 commits merged into main on Aug 22, 2024

Conversation

@jbusecke (Contributor) commented Jun 28, 2024

This is an effort to update the data guide with new instructions reflecting the Pangeo Forge ingestion pipeline and the now-available catalog.

  • Closes #169 (Define LEAP Affiliated course and link in EDU guide)

  • Reorganize content to follow the Diátaxis categories more strictly (might factor that out into another PR). A sketched-out plan is attached as an image (IMG_3380).

  • Break up Data and Compute Guide

  • Create all the empty links referenced in here.

  • Break up and move the auth/transfer instructions


github-actions bot commented Jun 28, 2024

👋 Thanks for opening this PR! The Cookbook will be automatically built with GitHub Actions. To see the status of your deployment, click below.
🔍 Git commit SHA: 55ef5d9
✅ Deployment Preview URL: https://leap-stc.github.io/_preview/162

@jbusecke (Contributor, Author)

OK, I finally finished this major refactor of the docs. @SammyAgrawal @norlandrhagen, if you could review the preview above (once it is done building), that would be tremendously helpful.

If you think the Catalog and Ingestion parts in particular are helpful enough, we can finally unblock the official release of the catalog!

@norlandrhagen (Contributor)

Nice work @jbusecke!

A few things:

  • This section describing the data catalog seems outdated. We can probably point to the existing catalog instead of the Radiant Earth link.

  • Under the Ingesting Datasets into Cloud Storage section, there is a markdown formatting error in the [!note] tag

  • In Ingesting Datasets into Cloud Storage #2, bullet 3, the link to Pangeo-Forge brings you to this page:

(screenshot omitted)

Maybe it can link directly to the pangeo-forge-recipes docs. Same with the Zarr link.

  • In the types of data supported, under "Linking an existing (public, egress-free) ARCO dataset to the [Data Catalog](https://leap-stc.github.io/_preview/162/explanation/architecture.html#explanation-architecture-catalog)": not sure if it's worth mentioning, but zarr-proxy can get around CORS restrictions when accessing outside data (see the sketch after this list).

  • I'm not super clear on the data library / data catalog differences. It seems like the catalog has kind of become the library instead of a STAC catalog?
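
For context on the "linking" bullet above, here is a minimal sketch of what consuming an existing public ARCO dataset looks like from the user side. This is an illustration only: the bucket path and the zarr-proxy URL are hypothetical placeholders, not real LEAP endpoints.

```python
import xarray as xr

# Open an existing, publicly hosted ARCO (Zarr) dataset in place, without copying it.
# The store path is a placeholder; gs:// access also requires gcsfs to be installed.
store = "gs://some-public-bucket/some-arco-dataset.zarr"

# chunks={} keeps the data lazy (dask-backed), so nothing is downloaded yet.
ds = xr.open_dataset(store, engine="zarr", chunks={})
print(ds)

# If the hosting server does not send CORS headers (a problem for browser-based
# tools), the same store could in principle be read through a zarr-proxy
# deployment instead. The URL below is purely hypothetical:
# ds = xr.open_dataset(
#     "https://zarr-proxy.example.org/some-public-bucket/some-arco-dataset.zarr",
#     engine="zarr",
#     chunks={},
# )
```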

@SammyAgrawal (Contributor)

I really like the addition of the data guide, I think that is really awesome!

  • I wonder whether certain "How to" sections could be punchier, as self-contained guides for accomplishing a specific task?
  • Is there an idea of what we want to include in the policies and how they are separated from the guides, given that the high-level descriptions of how these fit into Pangeo are subsections of the Architecture?
  • I think the Ingesting Datasets section remains somewhat confusing.

There are some small edits I want to make (language cleanup) and will circle back on some organizing stuff. Overall I think this is great though!

@jbusecke (Contributor, Author)

Thanks for all the feedback @norlandrhagen and @SammyAgrawal.

An overall issue is that some of the text is really old, as you pointed out, and we should work to replace it gradually. I'm not sure if we want to do it all in one PR.

> I'm not super clear on the data library / data catalog differences. It seems like the catalog has kind of become the library instead of a STAC catalog?

I personally see the library = catalog + data storage. Is there some way to highlight this?
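
To make that distinction concrete, here is a tiny sketch of the relationship, assuming an intake-style catalog whose entries point at Zarr stores. The catalog URL and entry name are hypothetical placeholders, and the `.to_dask()` call assumes the intake-xarray driver is installed.

```python
import intake

# The *catalog* is a lightweight index of dataset entries (names, metadata, URLs).
# The *library* is that index plus the actual ARCO stores sitting in cloud storage.
# Both the catalog URL and the entry name below are hypothetical placeholders.
cat = intake.open_catalog("https://example.org/leap-data-library.yaml")

print(list(cat))                    # browse what the library currently holds
ds = cat["some_dataset"].to_dask()  # lazily open the underlying Zarr store (intake-xarray)
print(ds)
```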

> This section describing the data catalog seems outdated. We can probably point to the existing catalog instead of the Radiant Earth link.

This is sort of emblematic of that issue: we need to find a way to reconcile the old 'vision' language with what is actually implemented. For now I have kept most of the old language, but I agree this is a bit confusing.

> In the types of data supported, under "Linking an existing (public, egress-free) ARCO dataset to the Data Catalog": not sure if it's worth mentioning, but zarr-proxy can get around CORS restrictions when accessing outside data.

@norlandrhagen It would be great if you could add an admonition with some detail on that here!

> In Ingesting Datasets into Cloud Storage #2, bullet 3, the link to Pangeo-Forge brings you to this page:

Good question. I was thinking that we might at some point add a short description in the reference, plus a link. That might be better for people to quickly look things up while staying on the same website. But on the other hand I see your point... not quite sure how to handle this, TBH.

> Under the Ingesting Datasets into Cloud Storage section, there is a markdown formatting error in the [!note] tag.

Fixed. Thanks for spotting this.

> I wonder whether certain "How to" sections could be punchier, as self-contained guides for accomplishing a specific task?

Certainly room for improvement, but I would love to split that into another PR. Could you lead that effort (maybe just start with an issue linked to this one)?

> Is there an idea of what we want to include in the policies and how they are separated from the guides, given that the high-level descriptions of how these fit into Pangeo are subsections of the Architecture?

I have some ideas but am not 100% sure; I am happy to work on this, though. I think of these more as "how do we deal with data" guidelines, e.g. always ask before using someone's data, always try to be as open as possible, etc.

> I think the Ingesting Datasets section remains somewhat confusing.

This is really relevant! Can you be a bit more specific about what you find confusing? Inline comments/suggestions on the code/text would be most helpful here. Thank you.

@SammyAgrawal (Contributor)

Can we merge this?

To start ingesting a dataset follow these steps:

1. Add a new [dataset_request](https://github.com/leap-stc/data-management/issues/new?assignees=&labels=dataset&projects=&template=new_dataset.yaml&title=New+Dataset+%5BDataset+Name%5D) in the [data-management](https://github.com/leap-stc/data-management) repo so there is a central place where people can suggest datasets for ingestion and follow progress.
1. Start a feedstock for your dataset. We organize any kind of data that is part of the [](explanation.architecture.data-library) in its own repository under the `leap-stc` github organization. Please use [our Template](https://github.com/leap-stc/LEAP_template_feedstock) to get started. Based on the 3 [types of data](explanation.data-policy.types) we host in the [](explanation.architecture.data-library) there are different ways of ingesting data (with specific instructions provided in the [template feedstock](https://github.com/leap-stc/LEAP_template_feedstock)):
A contributor commented (inline review):

I think this is an information-dense paragraph, with almost every sentence linking to something. Could we break it up into maybe two bullet points with more explanation?

Like: "start a feedstock for your dataset. A feedstock is ____".

I think too many links also make things scarier? It creates the impression that there are lots of moving parts; maybe we could explain/link to "data library" at the top and then not have the text linked.

@SammyAgrawal (Contributor) commented Aug 22, 2024

I would suggest sentence structure like:

  1. Start a feedstock for your dataset. Any kind of data that is part of the Data Library is organized via its own repository under the leap-stc GitHub organization. Pangeo Forge recipes are deployed via GitHub Actions in these repositories, referred to as feedstocks. Please use our Template to get started.

(Separate the data-specific parts into a third bullet point:)

  3. Based on the 3 types of data we host in the Data Library, there are different ways of ingesting data (with specific instructions provided in the template feedstock):

  • LEAP Curated: data from an existing (public, egress-free) ARCO dataset should be linked to the Data Catalog.
  • LEAP Ingested: data which exists in legacy formats (e.g. netCDF) and is transformed into an ARCO copy in cloud storage. The preferred method for that is Pangeo Forge (see the recipe sketch below).
  • (Work in Progress): creating a virtual Zarr store from existing publicly hosted legacy-format data (e.g. netCDF).
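
To make the "LEAP Ingested" bullet concrete, here is a minimal sketch of what such a Pangeo Forge recipe typically looks like with pangeo-forge-recipes (0.10-style Apache Beam transforms). The input URLs, chunking, and store name are placeholders, not a real LEAP dataset; the actual layout and deployment details come from the template feedstock.

```python
import apache_beam as beam
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

# Placeholder list of legacy netCDF files, one per year (hypothetical URLs).
input_urls = [
    f"https://data.example.org/archive/temperature_{year}.nc"
    for year in range(2000, 2003)
]

# Describe how the files combine into one dataset: concatenate along "time".
pattern = pattern_from_file_sequence(input_urls, concat_dim="time")

# The recipe pipeline: fetch each file, open it with xarray, and write one
# consolidated Zarr store as the ARCO copy.
recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        store_name="temperature.zarr",          # placeholder output store name
        combine_dims=pattern.combine_dim_keys,  # combine along the pattern's dims
        target_chunks={"time": 120},            # placeholder chunking
    )
)
```

In a feedstock, a recipe module of this kind is what the GitHub Actions deployment picks up; follow the template feedstock's instructions for the exact file layout and configuration rather than this sketch.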

@jbusecke (Contributor, Author)

Merging this now.

@jbusecke merged commit 8ef2f9c into main on Aug 22, 2024
2 checks passed
github-actions bot pushed a commit that referenced this pull request Aug 22, 2024