Refactor Data Guide #162

Merged: 11 commits merged into main on Aug 22, 2024

Conversation

@jbusecke (Contributor) commented Jun 28, 2024

This is an effort to update the data guide with new instructions reflecting the Pangeo Forge ingestion pipeline and the now-available catalog.

  • Closes #169 (Define LEAP Affiliated course and link in EDU guide)

  • Reorganize content to follow the Diátaxis categories more strictly (might factor that out into another PR). A sketched-out plan is attached as an image (IMG_3380).

  • Break up Data and Compute Guide

  • Create all the empty links referenced in here.

  • Break up and move the auth/transfer instructions


github-actions bot commented Jun 28, 2024

👋 Thanks for opening this PR! The Cookbook will be automatically built with GitHub Actions. To see the status of your deployment, click below.
🔍 Git commit SHA: 55ef5d9
✅ Deployment Preview URL: https://leap-stc.github.io/_preview/162

@jbusecke (Contributor, Author)

OK, I finally finished this major refactor of the docs. @SammyAgrawal @norlandrhagen, if you could review the preview above (once it is done building), that would be tremendously helpful.

If you think the Catalog and Ingestion parts in particular are helpful enough, we can finally unblock the official release of the catalog!

@norlandrhagen (Contributor)

Nice work @jbusecke!

A few things:

  • This section describing the data catalog seems outdated. We can probably point to the existing catalog instead of the Radiant Earth link.

  • Under the Ingesting Datasets into Cloud Storage section, there is a markdown formatting error in the [!note] tag

  • In Ingesting Datasets into Cloud Storage #2, bullet 3, the link to Pangeo-Forge brings you to this page:

(screenshot omitted)

Maybe it can link directly to the pangeo-forge-recipes docs. Same with the Zarr link.

  • In the types of data supported, under "Linking an existing (public, egress-free) ARCO dataset to the [Data Catalog](https://leap-stc.github.io/_preview/162/explanation/architecture.html#explanation-architecture-catalog)": not sure if it's worth mentioning, but zarr-proxy can get around CORS restrictions when accessing outside data (see the sketch after this list).

  • I'm not super clear on the data library / data catalog differences. It seems like the catalog has kind of become the library instead of a STAC catalog?
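
For context on the "linking" bullet above, here is a minimal sketch of what consuming an existing public ARCO dataset looks like from the user side. This is an illustration only: the bucket path and the zarr-proxy URL are hypothetical placeholders, not real LEAP endpoints.

```python
import xarray as xr

# Open an existing, publicly hosted ARCO (Zarr) dataset in place, without copying it.
# The store path is a placeholder; gs:// access also requires gcsfs to be installed.
store = "gs://some-public-bucket/some-arco-dataset.zarr"

# chunks={} keeps the data lazy (dask-backed), so nothing is downloaded yet.
ds = xr.open_dataset(store, engine="zarr", chunks={})
print(ds)

# If the hosting server does not send CORS headers (a problem for browser-based
# tools), the same store could in principle be read through a zarr-proxy
# deployment instead. The URL below is purely hypothetical:
# ds = xr.open_dataset(
#     "https://zarr-proxy.example.org/some-public-bucket/some-arco-dataset.zarr",
#     engine="zarr",
#     chunks={},
# )
```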

@SammyAgrawal (Contributor)

I really like the addition of the data guide, I think that is really awesome!

  • I wonder whether certain "How to" sections could be punchier, as self-contained guides for accomplishing a specific task?
  • Is there an idea of what we want to include in the policies and how they are separated from the guides, given that the high-level descriptions of how these fit into Pangeo are subsections of the Architecture?
  • I think the Ingesting Datasets section remains somewhat confusing.

There are some small edits I want to make (language cleanup) and will circle back on some organizing stuff. Overall I think this is great though!

@jbusecke (Contributor, Author)

Thanks for all the feedback @norlandrhagen and @SammyAgrawal.

An overall issue is that some of the text is really old, as you pointed out, and we should work to replace it gradually. I'm not sure if we want to do it all in one PR.

> I'm not super clear on the data library / data catalog differences. It seems like the catalog has kind of become the library instead of a STAC catalog?

I personally see the library = catalog + data storage. Is there some way to highlight this?
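
To make that distinction concrete, here is a tiny sketch of the relationship, assuming an intake-style catalog whose entries point at Zarr stores. The catalog URL and entry name are hypothetical placeholders, and the `.to_dask()` call assumes the intake-xarray driver is installed.

```python
import intake

# The *catalog* is a lightweight index of dataset entries (names, metadata, URLs).
# The *library* is that index plus the actual ARCO stores sitting in cloud storage.
# Both the catalog URL and the entry name below are hypothetical placeholders.
cat = intake.open_catalog("https://example.org/leap-data-library.yaml")

print(list(cat))                    # browse what the library currently holds
ds = cat["some_dataset"].to_dask()  # lazily open the underlying Zarr store (intake-xarray)
print(ds)
```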

> This section describing the data catalog seems outdated. We can probably point to the existing catalog instead of the Radiant Earth link.

This is sort of emblematic of that issue: we need to find a way to reconcile the old 'vision' language with what is actually implemented. For now I have kept most of the old language, but I agree this is a bit confusing.

> In the types of data supported, under "Linking an existing (public, egress-free) ARCO dataset to the Data Catalog": not sure if it's worth mentioning, but zarr-proxy can get around CORS restrictions when accessing outside data.

@norlandrhagen It would be great if you could add an admonition with some detail on that here!

> In Ingesting Datasets into Cloud Storage #2, bullet 3, the link to Pangeo-Forge brings you to this page:

Good question. I was thinking that we might at some point add a short description in the reference, plus a link. That might be better for people to quickly look things up while staying on the same website. But on the other hand I see your point... not quite sure how to handle this, TBH.

> Under the Ingesting Datasets into Cloud Storage section, there is a markdown formatting error in the [!note] tag.

Fixed. Thanks for spotting this.

> I wonder whether certain "How to" sections could be punchier, as self-contained guides for accomplishing a specific task?

Certainly room for improvement, but I would love to split that into another PR. Could you lead that effort (maybe just start with an issue linked to this one)?

> Is there an idea of what we want to include in the policies and how they are separated from the guides, given that the high-level descriptions of how these fit into Pangeo are subsections of the Architecture?

I have some ideas but am not 100% sure; I am happy to work on this, though. I think of these more as "how do we deal with data" guidelines, e.g. always ask before using someone's data, always try to be as open as possible, etc.

> I think the Ingesting Datasets section remains somewhat confusing.

This is really relevant! Can you be a bit more specific about what you find confusing? Inline comments/suggestions on the code/text would be most helpful here. Thank you.

@SammyAgrawal (Contributor)

Can we merge this?

To start ingesting a dataset follow these steps:

1. Add a new [dataset_request](https://github.com/leap-stc/data-management/issues/new?assignees=&labels=dataset&projects=&template=new_dataset.yaml&title=New+Dataset+%5BDataset+Name%5D) in the [data-management](https://github.com/leap-stc/data-management) repo so there is a central place where people can suggest datasets for ingestion and follow progress.
1. Start a feedstock for your dataset. We organize any kind of data that is part of the [](explanation.architecture.data-library) in its own repository under the `leap-stc` github organization. Please use [our Template](https://github.com/leap-stc/LEAP_template_feedstock) to get started. Based on the 3 [types of data](explanation.data-policy.types) we host in the [](explanation.architecture.data-library) there are different ways of ingesting data (with specific instructions provided in the [template feedstock](https://github.com/leap-stc/LEAP_template_feedstock)):
A contributor commented (inline review):

I think this is an information-dense paragraph, with almost every sentence linking to something. Could we break it up into maybe two bullet points with more explanation?

Like: "start a feedstock for your dataset. A feedstock is ____".

I think too many links also make things scarier? It creates the impression that there are lots of moving parts; maybe we could explain/link to "data library" at the top and then not have the text linked.

@SammyAgrawal (Contributor) commented Aug 22, 2024

I would suggest sentence structure like:

  1. Start a feedstock for your dataset. Any kind of data that is part of the Data Library is organized via its own repository under the leap-stc GitHub organization. Pangeo Forge recipes are deployed via GitHub Actions in these repositories, referred to as feedstocks. Please use our Template to get started.

(Separate the data-specific parts into a third bullet point:)

  3. Based on the 3 types of data we host in the Data Library, there are different ways of ingesting data (with specific instructions provided in the template feedstock):

  • LEAP Curated: data from an existing (public, egress-free) ARCO dataset should be linked to the Data Catalog.
  • LEAP Ingested: data which exists in legacy formats (e.g. netCDF) and is transformed into an ARCO copy in cloud storage. The preferred method for that is Pangeo Forge (see the recipe sketch below).
  • (Work in Progress): creating a virtual Zarr store from existing publicly hosted legacy-format data (e.g. netCDF).
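
To make the "LEAP Ingested" bullet concrete, here is a minimal sketch of what such a Pangeo Forge recipe typically looks like with pangeo-forge-recipes (0.10-style Apache Beam transforms). The input URLs, chunking, and store name are placeholders, not a real LEAP dataset; the actual layout and deployment details come from the template feedstock.

```python
import apache_beam as beam
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

# Placeholder list of legacy netCDF files, one per year (hypothetical URLs).
input_urls = [
    f"https://data.example.org/archive/temperature_{year}.nc"
    for year in range(2000, 2003)
]

# Describe how the files combine into one dataset: concatenate along "time".
pattern = pattern_from_file_sequence(input_urls, concat_dim="time")

# The recipe pipeline: fetch each file, open it with xarray, and write one
# consolidated Zarr store as the ARCO copy.
recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        store_name="temperature.zarr",          # placeholder output store name
        combine_dims=pattern.combine_dim_keys,  # combine along the pattern's dims
        target_chunks={"time": 120},            # placeholder chunking
    )
)
```

In a feedstock, a recipe module of this kind is what the GitHub Actions deployment picks up; follow the template feedstock's instructions for the exact file layout and configuration rather than this sketch.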

@jbusecke (Contributor, Author)

Merging this now.

@jbusecke merged commit 8ef2f9c into main on Aug 22, 2024
2 checks passed
github-actions bot pushed a commit that referenced this pull request Aug 22, 2024