deploy: f2b495b
dependabot[bot] committed Oct 1, 2023
1 parent f48624d commit c88e23c
Showing 95 changed files with 35,661 additions and 0 deletions.
4 changes: 4 additions & 0 deletions _preview/88/.buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: fe86a250aa39242c7e298ca3ae89f448
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file added _preview/88/_images/LEAP_knowledge_graph.png
Binary file added _preview/88/_images/email_org_invite.png
Binary file added _preview/88/_images/gh_org_invite_1.png
Binary file added _preview/88/_images/gh_org_invite_2.png


@@ -0,0 +1,7 @@
:root {
--tabs-color-label-active: hsla(231, 99%, 66%, 1);
--tabs-color-label-inactive: rgba(178, 206, 245, 0.62);
--tabs-color-overline: rgb(207, 236, 238);
--tabs-color-underline: rgb(207, 236, 238);
--tabs-size-label: 1rem;
}
5 changes: 5 additions & 0 deletions _preview/88/_sources/contact.md
@@ -0,0 +1,5 @@
# Contact Us

(contact.data_compute_manager)=
## Manager for Data and Compute
You can contact Julius Busecke on [Slack](https://leap-nsf-stc.slack.com/team/U03MSCLCTRA).
29 changes: 29 additions & 0 deletions _preview/88/_sources/guides/education.md
@@ -0,0 +1,29 @@
# LEAP-Pangeo for Education

## Running classes on the JupyterHub

🚧 Full Guide coming soon ... If you are a LEAP educator and want to run your class on the hub, please reach out to [](contact.data_compute_manager).

(education:sing_up)=
### How to sign up students

Students should be signed up to the appropriate user [categories](users.categories) ahead of the class. Please direct your students to this documentation and try to ensure that everyone has [access to the Hub](hub:server:login) before the class starts.

#### Troubleshooting

**Students cannot sign on**

Check whether the students are part of the [appropriate GitHub teams](users:categories).

If they **are not** follow these steps:
- [ ] Did the student [sign up for LEAP membership]()?
- [ ] Did the student receive a GitHub invite? [Here](users.invite) is how to check.
- [ ] Check again whether they are part of the [appropriate GitHub teams](users:categories).
- If these steps do not work, please reach out to [](contact.data_compute_manager).

If they **are**, ask them to try the following steps:
- [ ] Clear the browser cache
- [ ] Try a different browser
- [ ] Restart the computer
- If these steps do not work, please reach out to [](contact.data_compute_manager).

18 changes: 18 additions & 0 deletions _preview/88/_sources/how_to_cite.md
@@ -0,0 +1,18 @@
# How to cite LEAP-Pangeo

If you use any of the LEAP resources, please follow these guidelines to recognize our work.

## Add your publication to our [LEAP publication tracker]()

## Cite LEAP-Pangeo Platform
If you used the JupyterHub platform to perform analysis, please add a statement similar to the following to the acknowledgments section of your paper:
```
We acknowledge the computing and storage resources provided by the
`NSF Science and Technology Center (STC) Learning the Earth with
Artificial Intelligence and Physics (LEAP)` (Award # 2019625).
```
## Cite Data
Please cite all datasets used in your work, using the DOI of each individual dataset.

## Don't forget to cite your open source packages
Please take the time to cite all packages used in your work, so that the essential contribution of open-source developers to open science is properly recognized.
15 changes: 15 additions & 0 deletions _preview/88/_sources/intro.md
@@ -0,0 +1,15 @@
# LEAP Technical Documentation

This website is the home for all technical documentation related to LEAP and LEAP-Pangeo.

## Dashboard

| Update Status | Contributors | Deployment Status |
| -- | -- | -- |
| [![GitHub last commit](https://img.shields.io/github/last-commit/leap-stc/leap-stc.github.io)](https://github.com/leap-stc/leap-stc.github.io) | ![GitHub contributors](https://img.shields.io/github/contributors/leap-stc/leap-stc.github.io) | [![publish-book](https://github.com/leap-stc/leap-stc.github.io/actions/workflows/publish-book.yaml/badge.svg?style=flat-square)](https://github.com/leap-stc/leap-stc.github.io/actions/workflows/publish-book.yaml) |


## Contents

```{tableofcontents}
```
200 changes: 200 additions & 0 deletions _preview/88/_sources/leap-pangeo/architecture.md
@@ -0,0 +1,200 @@
# LEAP-Pangeo Architecture


LEAP-Pangeo is a cloud-based data and computing platform that will be used to support research, education, and knowledge transfer within the LEAP program.

## Motivation

The motivation and justification for developing LEAP-Pangeo are laid out in two recent peer-reviewed publications: {cite}`AbernatheyEtAl2021` and {cite}`GentemannEtAl2021`.
To summarize these arguments, a shared data and computing platform will:
- Empower LEAP participants with instant access to high-performance computing and analysis-ready data in order to support ambitious research objectives
- Facilitate seamless collaboration between project members around data-intensive science, accelerating research progress
- Enable rich data-driven classroom experiences for learners, helping them transition successfully from coursework to research
- Place actionable data in the hands of LEAP partners to support knowledge transfer

## Design Principles

In the proposal, we committed to building this platform in a way that enables the tools and infrastructure to be reused and remixed.
The challenge for LEAP-Pangeo, then, is to deploy an "enterprise-quality" platform built entirely out of open-source tools, and to make this platform as reusable and useful as possible for the broader climate science community.
We committed to the following design principles:
- Open source
- Modular system: built out of smaller, standalone pieces which interoperate through clearly documented interfaces / standards
- Agile development on GitHub
- Following industry-standard best practices for continuous deployment, testing, etc.
- Reuse of existing technologies and contribution to "upstream" open-source projects on which LEAP-Pangeo depends
(rather than development of new tools purely for their own sake).
This is a key part of our sustainability plan.

## Related Tools and Platforms


It’s useful to understand the recent history and related efforts in this space.

- **[Google Colab](https://research.google.com/colaboratory/faq.html)** is a free notebook-in-the-cloud service run by Google.
It is built around the open source Jupyter project, but with advanced notebook sharing capabilities (like Google Docs).
- **[Google Earth Engine](https://earthengine.google.org/)** is a reference point for all cloud geospatial analytics platforms.
It is a standalone application, separate from Google Cloud: a single instance of a highly customized, black-box (i.e. not open-source) application that enables parallel computing on distributed data.
It’s very good at what it was designed for (analyzing satellite images), but isn’t easily adapted to other applications, such as machine learning.
- **[Columbia IRI Data Library](https://iridl.ldeo.columbia.edu/index.html)** is a powerful and freely accessible online data repository and analysis tool that allows a user to view, analyze, and download hundreds of terabytes of climate-related data through a standard web browser.
Due to its somewhat outdated architecture, IRI data library cannot easily be updated or adapted to new projects.
- **[Pangeo](http://pangeo.io/)** is an open science community oriented around open-source python tools for big-data geoscience.
It is a loose ecosystem of interoperable python packages including [Jupyter](https://jupyter.org/), [Xarray](http://xarray.pydata.org/), [Dask](http://dask.pydata.org/), and [Zarr](https://zarr.readthedocs.io/).
The Pangeo tools have been deployed in nearly all commercial clouds (AWS, GCP, Azure) as well as HPC environments.
[Pangeo Cloud](https://pangeo.io/cloud.html) is a publicly accessible data-proximate computing environment based on Pangeo tools.
Pangeo is used heavily within NCAR.
- **[Microsoft Planetary Computer](https://planetarycomputer.microsoft.com/)** is a collection of datasets and computational tools hosted by Microsoft in the Azure cloud.
It combines Pangeo-style computing environments with a data library based on the [SpatioTemporal Asset Catalog (STAC)](https://stacspec.org/).
- **[Radiant Earth ML Hub](https://www.radiant.earth/mlhub/)** is a cloud-based open library dedicated to Earth observation training data for use with machine learning algorithms.
It focuses mostly on data access and curation.
Data are cataloged using STAC.
- **[Pangeo Forge](https://pangeo-forge.org/)** is a new initiative, funded by the NSF EarthCube program, to build a platform for
"crowdsourcing" the production of analysis-ready, cloud-optimized data.
Once operational, Pangeo Forge will be a useful tool for many different projects which need data in the cloud.

Of these different tools, we opt to build on Pangeo because of its open-source, grassroots
foundations in the climate data science community, strong uptake within NCAR, and track record of support from NSF.

## Design and Architecture

```{figure} https://i.imgur.com/PVhoQUu.png
---
name: architecture-diagram
---
LEAP-Pangeo high-level architecture diagram
```

There are four primary components to LEAP-Pangeo.

### The Data Library

The data library will provide analysis-ready, cloud-optimized data for all aspects of LEAP.
The data library is directly inspired by the [IRI Data Library](https://iridl.ldeo.columbia.edu) mentioned above; however, LEAP-Pangeo data will be hosted in the cloud, for maximum impact, accessibility, and interoperability.

The contents of the data library will evolve dynamically based on the needs of the project.
Examples of data that may become part of the library are
- NOAA [OISST](https://www.ncei.noaa.gov/products/optimum-interpolation-sst) sea-surface temperature data,
used in workshops and classes to illustrate the fundamentals of geospatial data science.
- High-resolution climate model simulations from the [NCAR "EarthWorks"](https://news.ucar.edu/132760/csu-ncar-develop-high-res-global-model-community-use)
project, used by LEAP researchers to develop machine-learning parameterizations of climate processes like cloud and ocean eddies.
- Machine-learning "challenge datasets," published by the LEAP Team and accessible to the world, to help broaden participation
by ML researchers in climate science.
- Easily accessible syntheses of climate projections from [CMIP6 data](https://esgf-node.llnl.gov/projects/cmip6/), produced by the LEAP team,
for use by industry partners for business strategy and decision making.
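
As a rough, hedged illustration of what "analysis-ready, cloud-optimized" access could feel like from a notebook, the sketch below lazily opens a hypothetical Zarr store with xarray. The bucket path, dataset name, and variable are placeholders rather than real LEAP catalog entries, and opening `gs://` paths assumes `gcsfs` is installed.

```python
# Minimal sketch, assuming a hypothetical Zarr store published via the data library.
import xarray as xr

store = "gs://leap-example-bucket/oisst.zarr"  # placeholder path, not a real dataset

# Open lazily: only metadata is read until values are actually computed.
ds = xr.open_zarr(store, consolidated=True)

# Example analysis: a global-mean sea-surface temperature time series.
sst_global_mean = ds["sst"].mean(dim=["lat", "lon"])
print(sst_global_mean)
```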

### Data Storage Service

The underlying technology for the LEAP data catalog will be cloud object storage (e.g. Amazon S3),
which enables high-throughput concurrent access by many simultaneous users over the public internet.
Cloud object storage is the most performant, cost-effective, and simple way to serve such large volumes of data.

Initially, the LEAP data will be stored in Google Cloud Storage, in the same cloud region
as the JupyterHub.
Going forward, we will work with NCAR to obtain an [Open Storage Network](https://www.openstoragenetwork.org/)
pod, which will allow data to be accessible from both Google Cloud and NCAR's computing systems.
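
To make the object-storage model a bit more concrete, here is a hedged sketch of interacting with a Google Cloud Storage bucket through `gcsfs`; the bucket name and prefix are placeholders, and anonymous access is assumed purely for illustration.

```python
# Sketch of direct object-storage access with gcsfs; the bucket below is hypothetical.
import gcsfs

# token="anon" requests anonymous access; real LEAP buckets may require credentials.
fs = gcsfs.GCSFileSystem(token="anon")

# List objects under a (hypothetical) prefix, much like listing files in a directory.
for path in fs.ls("leap-example-bucket/ocean"):
    print(path)

# Objects are read over HTTPS and behave like ordinary file objects.
with fs.open("leap-example-bucket/ocean/readme.txt", "r") as f:
    print(f.read())
```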

#### Pangeo Forge

```{figure} https://raw.githubusercontent.com/pangeo-forge/flow-charts/main/renders/architecture.png
---
width: 600px
name: pangeo-forge-flow
---
Pangeo Forge high-level workflow. Diagram from https://github.com/pangeo-forge/flow-charts
```

A central tool for the population and maintenance of the LEAP-Pangeo data catalog is
[Pangeo Forge](https://pangeo-forge.readthedocs.io/en/latest/).
Pangeo Forge is an open-source tool for data extraction, transformation, and loading (ETL).
The goal of Pangeo Forge is to make it easy to extract data from traditional data repositories and deposit it in cloud object storage in analysis-ready, cloud-optimized (ARCO) format.

Pangeo Forge works by allowing domain scientists to define "recipes" that describe data transformation pipelines.
These recipes are stored in GitHub repositories.
Continuous integration monitors GitHub and automatically executes the data pipelines when needed.
The use of distributed, cloud-based processing allows very large volumes of data to be processed quickly.
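
For a feel of what a recipe involves, here is a minimal sketch using the "classic" `pangeo_forge_recipes` API as documented at the time of writing; the upstream URL pattern, dataset, and chunking are hypothetical, and exact class names or signatures may differ between Pangeo Forge versions.

```python
# Sketch of a Pangeo Forge recipe (classic API); URLs and names are hypothetical.
import pandas as pd
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# One NetCDF file per day in a hypothetical upstream archive.
dates = pd.date_range("2020-01-01", "2020-12-31", freq="D")

def make_url(time):
    return f"https://data.example.org/sst/{time:%Y/%m}/sst_{time:%Y%m%d}.nc"

pattern = FilePattern(make_url, ConcatDim("time", dates, nitems_per_file=1))

# Combine the daily files into a single analysis-ready Zarr store.
recipe = XarrayZarrRecipe(pattern, target_chunks={"time": 30})
```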

Pangeo Forge is a new project, funded by the NSF EarthCube program.
LEAP-Pangeo will provide a high-impact use case for Pangeo Forge, and Pangeo Forge
will empower and enhance LEAP research.
This synergistic relationship will be mutually beneficial to both NSF-sponsored projects.
Using Pangeo Forge effectively will require LEAP scientists and data engineers to engage
with the open-source development process around Pangeo Forge and related technologies.

#### Catalog

A [STAC](https://stacspec.org/) data catalog will be used to enumerate all LEAP-Pangeo datasets and provide this information to the public.
The catalog will store all relevant metadata about LEAP datasets following established metadata standards (e.g. CF Conventions).
It will also provide direct links to raw data in cloud object storage.

The catalog will facilitate several different modes of access:
- Searching, crawling, and opening datasets from within notebooks or scripts
- "Crawling" by search indexes or other machine-to-machine interfaces
- A pretty web front-end interface for interactive public browsing

The [Radiant Earth MLHub](https://mlhub.earth/) is a great reference for how we imagine the LEAP data catalog will eventually look.
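
As a hedged sketch of the first access mode — searching and opening datasets from a notebook — the example below uses `pystac-client` against a hypothetical STAC API endpoint; the URL and collection id are placeholders, since the LEAP catalog does not exist yet.

```python
# Sketch of programmatic STAC access; the endpoint and collection are hypothetical.
from pystac_client import Client

catalog = Client.open("https://catalog.leap.example.org/stac")  # placeholder URL

# Search a collection, constrained to a time range.
search = catalog.search(
    collections=["oisst"],            # hypothetical collection id
    datetime="2020-01-01/2020-12-31",
)

for item in search.items():
    # Each STAC item carries metadata plus "assets" linking to raw data in object storage.
    print(item.id, list(item.assets))
```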

### The Hub

```{figure} https://jupyter.org/assets/homepage/labpreview.webp
---
width: 400px
name: jupyterlab-preview
---
Screenshot from JupyterLab. From <https://jupyter.org/>
```

Jupyter Notebook / Lab has emerged as the standard tool for doing interactive data science.
Jupyter supports combining rich text, code, and generated outputs (e.g. figures) into a single document, creating a way to communicate and share complete data-science research projects.

```{figure} https://jupyterhub.readthedocs.io/en/stable/_images/jhub-fluxogram.jpeg
---
width: 400px
name: jupyterhub-architecture
---
JupyterHub architecture. From <https://jupyterhub.readthedocs.io/>
```

JupyterHub is a multi-user Jupyter Notebook / Lab environment that runs on a server.
JupyterHub provides a gateway to highly customized software environments backed by dedicated computing with specified resources (CPU, RAM, GPU, etc.).
Running in the cloud, JupyterHub can scale up to accommodate any number of simultaneous users with no degradation in
performance.
JupyterHub environments can support essentially [every existing programming language](https://github.com/jupyter/jupyter/wiki/Jupyter-kernels).
We anticipate that LEAP users will primarily use the **Python**, **R**, and **Julia** programming languages.
In addition to Jupyter Notebook / Lab, JupyterHub also supports launching [RStudio](https://www.rstudio.com/).
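
As a small, hedged illustration of the "specified resources" behind each Hub session, the snippet below inspects the CPU and memory visible from a running notebook; it assumes only the standard library plus `psutil`, which is commonly present in scientific Python environments, and in containerized sessions the numbers may describe the underlying node rather than per-user limits.

```python
# Sketch: inspect the compute resources visible to a Hub notebook session.
import multiprocessing

import psutil  # commonly installed alongside dask/distributed; an assumption here

print("CPU cores visible:", multiprocessing.cpu_count())

mem = psutil.virtual_memory()
print(f"Memory: {mem.total / 1e9:.1f} GB total, {mem.available / 1e9:.1f} GB available")

# In a Kubernetes-backed Hub, these values may reflect the whole node rather than
# the per-user limits configured by the Hub administrators.
```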

The Pangeo project already provides [curated Docker images](https://github.com/pangeo-data/pangeo-docker-images)
with full-featured Python software environments for environmental data science.
These environments will be the starting point for LEAP environments.
They may be augmented with more specific software as LEAP evolves and research projects require it.

User management and access control for the Hub are described in {doc}`/policies/users_roles`.
We use GitHub for identity management, in order to make it easy to include participants
from any partner institution.

### The Knowledge Graph

LEAP "outputs" will be of four main types:

- **Datasets** (covered above)
- **Papers** - traditional scientific publications
- **Project Code** - the code behind the papers, used to actually generate the scientific results
- **Trained ML Models** - models that can be used directly for inference by others
- **Educational Modules** - used for teaching

All of these objects must be tracked and cataloged in a uniform way.
The {doc}`/policies/code_policy` and {doc}`/policies/data_policy` will help set these standards.

```{figure} LEAP_knowledge_graph.png
---
width: 600px
name: knowledge-graph
---
LEAP Knowledge Graph
```

By tracking the linked relationships between datasets, papers, code, models, and educational modules, we will generate a “knowledge graph”.
This graph will reveal the dynamic, evolving state of the outputs of LEAP research and the relationships between different elements of the project.
By also tracking participants (i.e. humans), we will build a novel and inspiring track record of LEAP's impacts throughout the project lifetime.

This is the most open-ended aspect of our infrastructure.
Organizing and displaying this information effectively is a challenging problem in
information architecture and systems design.