deploy: f2b495b
dependabot[bot] committed Oct 1, 2023
1 parent f48624d commit c88e23c
Showing 95 changed files with 35,661 additions and 0 deletions.
4 changes: 4 additions & 0 deletions _preview/88/.buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: fe86a250aa39242c7e298ca3ae89f448
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file added _preview/88/_images/LEAP_knowledge_graph.png
Binary file added _preview/88/_images/email_org_invite.png
Binary file added _preview/88/_images/gh_org_invite_1.png
Binary file added _preview/88/_images/gh_org_invite_2.png


@@ -0,0 +1,7 @@
:root {
--tabs-color-label-active: hsla(231, 99%, 66%, 1);
--tabs-color-label-inactive: rgba(178, 206, 245, 0.62);
--tabs-color-overline: rgb(207, 236, 238);
--tabs-color-underline: rgb(207, 236, 238);
--tabs-size-label: 1rem;
}
5 changes: 5 additions & 0 deletions _preview/88/_sources/contact.md
@@ -0,0 +1,5 @@
# Contact Us

(contact.data_compute_manager)=
## Manager for Data and Compute
You can contact Julius Busecke on [Slack](https://leap-nsf-stc.slack.com/team/U03MSCLCTRA).
29 changes: 29 additions & 0 deletions _preview/88/_sources/guides/education.md
@@ -0,0 +1,29 @@
# LEAP-Pangeo for Education

## Running classes on the JupyterHub

🚧 Full Guide coming soon ... If you are a LEAP educator and want to run your class on the hub, please reach out to [](contact.data_compute_manager).

(education:sing_up)=
### How to sign up students

Students should be signed up to the appropriate user [categories](users.categories) ahead of the class. Please direct your students to this documentation and try to ensure that everyone has [access to the Hub](hub:server:login) before the class starts.

#### Troubleshooting

**Students cannot sign on**

Check whether the students are part of the [appropriate GitHub teams](users:categories).

If they **are not** follow these steps:
- [ ] Did the student [sign up for LEAP membership]()?
- [ ] Did the student receive a GitHub invite? [Here](users.invite) is how to check.
- [ ] Check again whether they are part of the [appropriate GitHub teams](users:categories).
- If these steps do not work, please reach out to [](contact.data_compute_manager).

If they **are**, ask them to try the following steps:
- [ ] Clear the browser cache
- [ ] Try a different browser
- [ ] Restart the computer
- If these steps do not work, please reach out to [](contact.data_compute_manager).

18 changes: 18 additions & 0 deletions _preview/88/_sources/how_to_cite.md
@@ -0,0 +1,18 @@
# How to cite LEAP-Pangeo

If you use any of the LEAP resources, please follow these guidelines to recognize our work.

## Add your publication to our [LEAP publication tracker]()

## Cite LEAP-Pangeo Platform
If you used the JupyterHub platform to perform analysis, please add a statement similar to the following to the acknowledgments section of your paper:
```
We acknowledge the computing and storage resources provided by the
`NSF Science and Technology Center (STC) Learning the Earth with
Artificial Intelligence and Physics (LEAP)` (Award # 2019625).
```
## Cite Data
Please cite all datasets used in your work, using the DOI of each individual dataset.

## Don't forget to cite your open source packages
Please take the time to cite all packages used in your work, so that the essential contribution of open-source developers to open science is properly recognized.
15 changes: 15 additions & 0 deletions _preview/88/_sources/intro.md
@@ -0,0 +1,15 @@
# LEAP Technical Documentation

This website is the home for all technical documentation related to LEAP and LEAP-Pangeo.

## Dashboard

| Update Status | Contributors | Deployment Status |
| -- | -- | -- |
| [![GitHub last commit](https://img.shields.io/github/last-commit/leap-stc/leap-stc.github.io)](https://github.com/leap-stc/leap-stc.github.io) | ![GitHub contributors](https://img.shields.io/github/contributors/leap-stc/leap-stc.github.io) | [![publish-book](https://github.com/leap-stc/leap-stc.github.io/actions/workflows/publish-book.yaml/badge.svg?style=flat-square)](https://github.com/leap-stc/leap-stc.github.io/actions/workflows/publish-book.yaml) |


## Contents

```{tableofcontents}
```
200 changes: 200 additions & 0 deletions _preview/88/_sources/leap-pangeo/architecture.md
@@ -0,0 +1,200 @@
# LEAP-Pangeo Architecture


LEAP-Pangeo is a cloud-based data and computing platform that will be used to support research, education, and knowledge transfer within the LEAP program.

## Motivation

The motivation and justification for developing LEAP-Pangeo are laid out in two recent peer-reviewed publications: {cite}`AbernatheyEtAl2021` and {cite}`GentemannEtAl2021`.
To summarize these arguments, a shared data and computing platform will:
- Empower LEAP participants with instant access to high-performance computing and analysis-ready data in order to support ambitious research objectives
- Facilitate seamless collaboration between project members around data-intensive science, accelerating research progress
- Enable rich data-driven classroom experiences for learners, helping them transition successfully from coursework to research
- Place actionable data in the hands of LEAP partners to support knowledge transfer

## Design Principles

In the proposal, we committed to building this platform in a way that enables the tools and infrastructure to be reused and remixed.
The challenge for LEAP-Pangeo, then, is to deploy an "enterprise-quality" platform built entirely out of open-source tools, and to make this platform as reusable and useful as possible for the broader climate science community.
We committed to the following design principles:
- Open source
- Modular system: built out of smaller, standalone pieces which interoperate through clearly documented interfaces / standards
- Agile development on GitHub
- Following industry-standard best practices for continuous deployment, testing, etc.
- Reuse of existing technologies and contribution to "upstream" open-source projects on which LEAP-Pangeo depends
(rather than development of new tools purely for their own sake).
This is a key part of our sustainability plan.

## Related Tools and Platforms


It’s useful to understand the recent history and related efforts in this space.

- **[Google Colab](https://research.google.com/colaboratory/faq.html)** is a free notebook-in-the-cloud service run by Google.
It is built around the open source Jupyter project, but with advanced notebook sharing capabilities (like Google Docs).
- **[Google Earth Engine](https://earthengine.google.org/)** is a reference point for all cloud geospatial analytics platforms.
It is a standalone application, separate from Google Cloud: a single instance of a highly customized, black-box (i.e. not open-source) application that enables parallel computing on distributed data.
It’s very good at what it was designed for (analyzing satellite images), but isn’t easily adapted to other applications, such as machine learning.
- **[Columbia IRI Data Library](https://iridl.ldeo.columbia.edu/index.html)** is a powerful and freely accessible online data repository and analysis tool that allows a user to view, analyze, and download hundreds of terabytes of climate-related data through a standard web browser.
Due to its somewhat outdated architecture, IRI data library cannot easily be updated or adapted to new projects.
- **[Pangeo](http://pangeo.io/)** is an open science community oriented around open-source python tools for big-data geoscience.
It is a loose ecosystem of interoperable python packages including [Jupyter](https://jupyter.org/), [Xarray](http://xarray.pydata.org/), [Dask](http://dask.pydata.org/), and [Zarr](https://zarr.readthedocs.io/).
The Pangeo tools have been deployed in nearly all commercial clouds (AWS, GCP, Azure) as well as HPC environments.
[Pangeo Cloud](https://pangeo.io/cloud.html) is a publicly accessible data-proximate computing environment based on Pangeo tools.
Pangeo is used heavily within NCAR.
- **[Microsoft Planetary Computer](https://planetarycomputer.microsoft.com/)** is a collection of datasets and computational tools hosted by Microsoft in the Azure cloud.
It combines Pangeo-style computing environments with a data library based on the [SpatioTemporal Asset Catalog (STAC)](https://stacspec.org/).
- **[Radiant Earth ML Hub](https://www.radiant.earth/mlhub/)** is a cloud-based open library dedicated to Earth observation training data for use with machine learning algorithms.
It focuses mostly on data access and curation.
Data are cataloged using STAC.
- **[Pangeo Forge](https://pangeo-forge.org/)** is a new initiative, funded by the NSF EarthCube program, to build a platform for
"crowdsourcing" the production of analysis-ready, cloud-optimized data.
Once operational, Pangeo Forge will be a useful tool for many different projects which need data in the cloud.

Of these different tools, we opt to build on Pangeo because of its open-source, grassroots
foundations in the climate data science community, strong uptake within NCAR, and track record of support from NSF.

## Design and Architecture

```{figure} https://i.imgur.com/PVhoQUu.png
---
name: architecture-diagram
---
LEAP-Pangeo high-level architecture diagram
```

There are four primary components to LEAP-Pangeo.

### The Data Library

The data library will provide analysis-ready, cloud-optimized data for all aspects of LEAP.
The data library is directly inspired by the [IRI Data Library](https://iridl.ldeo.columbia.edu) mentioned above; however, LEAP-Pangeo data will be hosted in the cloud, for maximum impact, accessibility, and interoperability.

The contents of the data library will evolve dynamically based on the needs of the project.
Examples of data that may become part of the library are
- NOAA [OISST](https://www.ncei.noaa.gov/products/optimum-interpolation-sst) sea-surface temperature data,
used in workshops and classes to illustrate the fundamentals of geospatial data science.
- High-resolution climate model simulations from the [NCAR "EarthWorks"](https://news.ucar.edu/132760/csu-ncar-develop-high-res-global-model-community-use)
project, used by LEAP researchers to develop machine-learning parameterizations of climate processes like cloud and ocean eddies.
- Machine-learning "challenge datasets," published by the LEAP Team and accessible to the world, to help broaden participation
by ML researchers in climate science.
- Easily accessible syntheses of climate projections from [CMIP6 data](https://esgf-node.llnl.gov/projects/cmip6/), produced by the LEAP team,
for use by industry partners for business strategy and decision making.
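
As a rough, hedged illustration of what "analysis-ready, cloud-optimized" access could feel like from a notebook, the sketch below lazily opens a hypothetical Zarr store with xarray. The bucket path, dataset name, and variable are placeholders rather than real LEAP catalog entries, and opening `gs://` paths assumes `gcsfs` is installed.

```python
# Minimal sketch, assuming a hypothetical Zarr store published via the data library.
import xarray as xr

store = "gs://leap-example-bucket/oisst.zarr"  # placeholder path, not a real dataset

# Open lazily: only metadata is read until values are actually computed.
ds = xr.open_zarr(store, consolidated=True)

# Example analysis: a global-mean sea-surface temperature time series.
sst_global_mean = ds["sst"].mean(dim=["lat", "lon"])
print(sst_global_mean)
```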

### Data Storage Service

The underlying technology for the LEAP data catalog will be cloud object storage (e.g. Amazon S3),
which enables high-throughput concurrent access by many simultaneous users over the public internet.
Cloud object storage is the most performant, cost-effective, and simple way to serve such large volumes of data.

Initially, the LEAP data will be stored in Google Cloud Storage, in the same cloud region
as the JupyterHub.
Going forward, we will work with NCAR to obtain an [Open Storage Network](https://www.openstoragenetwork.org/)
pod, which will allow data to be accessible from both Google Cloud and NCAR's computing systems.
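
To make the object-storage model a bit more concrete, here is a hedged sketch of interacting with a Google Cloud Storage bucket through `gcsfs`; the bucket name and prefix are placeholders, and anonymous access is assumed purely for illustration.

```python
# Sketch of direct object-storage access with gcsfs; the bucket below is hypothetical.
import gcsfs

# token="anon" requests anonymous access; real LEAP buckets may require credentials.
fs = gcsfs.GCSFileSystem(token="anon")

# List objects under a (hypothetical) prefix, much like listing files in a directory.
for path in fs.ls("leap-example-bucket/ocean"):
    print(path)

# Objects are read over HTTPS and behave like ordinary file objects.
with fs.open("leap-example-bucket/ocean/readme.txt", "r") as f:
    print(f.read())
```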

#### Pangeo Forge

```{figure} https://raw.githubusercontent.com/pangeo-forge/flow-charts/main/renders/architecture.png
---
width: 600px
name: pangeo-forge-flow
---
Pangeo Forge high-level workflow. Diagram from https://github.com/pangeo-forge/flow-charts
```

A central tool for the population and maintenance of the LEAP-Pangeo data catalog is
[Pangeo Forge](https://pangeo-forge.readthedocs.io/en/latest/).
Pangeo Forge is an open-source tool for data extraction, transformation, and loading (ETL).
The goal of Pangeo Forge is to make it easy to extract data from traditional data repositories and deposit it in cloud object storage in analysis-ready, cloud-optimized (ARCO) format.

Pangeo Forge works by allowing domain scientists to define "recipes" that describe data transformation pipelines.
These recipes are stored in GitHub repositories.
Continuous integration monitors GitHub and automatically executes the data pipelines when needed.
The use of distributed, cloud-based processing allows very large volumes of data to be processed quickly.
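
For a feel of what a recipe involves, here is a minimal sketch using the "classic" `pangeo_forge_recipes` API as documented at the time of writing; the upstream URL pattern, dataset, and chunking are hypothetical, and exact class names or signatures may differ between Pangeo Forge versions.

```python
# Sketch of a Pangeo Forge recipe (classic API); URLs and names are hypothetical.
import pandas as pd
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# One NetCDF file per day in a hypothetical upstream archive.
dates = pd.date_range("2020-01-01", "2020-12-31", freq="D")

def make_url(time):
    return f"https://data.example.org/sst/{time:%Y/%m}/sst_{time:%Y%m%d}.nc"

pattern = FilePattern(make_url, ConcatDim("time", dates, nitems_per_file=1))

# Combine the daily files into a single analysis-ready Zarr store.
recipe = XarrayZarrRecipe(pattern, target_chunks={"time": 30})
```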

Pangeo Forge is a new project, funded by the NSF EarthCube program.
LEAP-Pangeo will provide a high-impact use case for Pangeo Forge, and Pangeo Forge
will empower and enhance LEAP research.
This synergistic relationship will be mutually beneficial to both NSF-sponsored projects.
Using Pangeo Forge effectively will require LEAP scientists and data engineers to engage
with the open-source development process around Pangeo Forge and related technologies.

#### Catalog

A [STAC](https://stacspec.org/) data catalog will be used to enumerate all LEAP-Pangeo datasets and provide this information to the public.
The catalog will store all relevant metadata about LEAP datasets following established metadata standards (e.g. CF Conventions).
It will also provide direct links to raw data in cloud object storage.

The catalog will facilitate several different modes of access:
- Searching, crawling, and opening datasets from within notebooks or scripts
- "Crawling" by search indexes or other machine-to-machine interfaces
- A pretty web front-end interface for interactive public browsing

The [Radiant Earth MLHub](https://mlhub.earth/) is a great reference for how we imagine the LEAP data catalog will eventually look.
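
As a hedged sketch of the first access mode — searching and opening datasets from a notebook — the example below uses `pystac-client` against a hypothetical STAC API endpoint; the URL and collection id are placeholders, since the LEAP catalog does not exist yet.

```python
# Sketch of programmatic STAC access; the endpoint and collection are hypothetical.
from pystac_client import Client

catalog = Client.open("https://catalog.leap.example.org/stac")  # placeholder URL

# Search a collection, constrained to a time range.
search = catalog.search(
    collections=["oisst"],            # hypothetical collection id
    datetime="2020-01-01/2020-12-31",
)

for item in search.items():
    # Each STAC item carries metadata plus "assets" linking to raw data in object storage.
    print(item.id, list(item.assets))
```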

### The Hub

```{figure} https://jupyter.org/assets/homepage/labpreview.webp
---
width: 400px
name: jupyterlab-preview
---
Screenshot from JupyterLab. From <https://jupyter.org/>
```

Jupyter Notebook / Lab has emerged as the standard tool for doing interactive data science.
Jupyter supports combining rich text, code, and generated outputs (e.g. figures) into a single document, creating a way to communicate and share complete data-science research projects.

```{figure} https://jupyterhub.readthedocs.io/en/stable/_images/jhub-fluxogram.jpeg
---
width: 400px
name: jupyterhub-architecture
---
JupyterHub architecture. From <https://jupyterhub.readthedocs.io/>
```

JupyterHub is a multi-user Jupyter Notebook / Lab environment that runs on a server.
JupyterHub provides a gateway to highly customized software environments backed by dedicated computing with specified resources (CPU, RAM, GPU, etc.).
Running in the cloud, JupyterHub can scale up to accommodate any number of simultaneous users with no degradation in
performance.
JupyterHub environments can support essentially [every existing programming language](https://github.com/jupyter/jupyter/wiki/Jupyter-kernels).
We anticipate that LEAP users will primarily use the **Python**, **R**, and **Julia** programming languages.
In addition to Jupyter Notebook / Lab, JupyterHub also supports launching [RStudio](https://www.rstudio.com/).
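
As a small, hedged illustration of the "specified resources" behind each Hub session, the snippet below inspects the CPU and memory visible from a running notebook; it assumes only the standard library plus `psutil`, which is commonly present in scientific Python environments, and in containerized sessions the numbers may describe the underlying node rather than per-user limits.

```python
# Sketch: inspect the compute resources visible to a Hub notebook session.
import multiprocessing

import psutil  # commonly installed alongside dask/distributed; an assumption here

print("CPU cores visible:", multiprocessing.cpu_count())

mem = psutil.virtual_memory()
print(f"Memory: {mem.total / 1e9:.1f} GB total, {mem.available / 1e9:.1f} GB available")

# In a Kubernetes-backed Hub, these values may reflect the whole node rather than
# the per-user limits configured by the Hub administrators.
```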

The Pangeo project already provides [curated Docker images](https://github.com/pangeo-data/pangeo-docker-images)
with full-featured Python software environments for environmental data science.
These environments will be the starting point for LEAP environments.
They may be augmented with more specific software as LEAP evolves and research projects require it.

User management and access control for the Hub are described in {doc}`/policies/users_roles`.
We use GitHub for identity management, in order to make it easy to include participants
from any partner institution.

### The Knowledge Graph

LEAP "outputs" will be of four main types:

- **Datasets** (covered above)
- **Papers** - traditional scientific publications
- **Project Code** - the code behind the papers, used to actually generate the scientific results
- **Trained ML Models** - models that can be used directly for inference by others
- **Educational Modules** - used for teaching

All of these objects must be tracked and cataloged in a uniform way.
The {doc}`/policies/code_policy` and {doc}`/policies/data_policy` will help set these standards.

```{figure} LEAP_knowledge_graph.png
---
width: 600px
name: knowledge-graph
---
LEAP Knowledge Graph
```

By tracking the linked relationships between datasets, papers, code, models, and educational modules, we will generate a “knowledge graph”.
This graph will reveal the dynamic, evolving state of the outputs of LEAP research and the relationships between different elements of the project.
By also tracking participants (i.e. humans), we will build a novel and inspiring track record of LEAP's impacts throughout the project lifetime.

This is the most open-ended aspect of our infrastructure.
Organizing and displaying this information effectively is a challenging problem in
information architecture and systems design.