Virtualizarr + Xarray-beam as an ARCO dataset creation option #27

Open
3 tasks
norlandrhagen opened this issue Sep 19, 2024 · 0 comments

I created an example repo to explore how you can use virtualizarr + xarray-beam to create an ARCO dataset from a collection of NetCDF files.

In this example, I used a pre-generated virtualizarr reference as the input dataset. It was fed into an Apache Beam pipeline that uses xarray-beam PTransforms to open the virtual dataset, rechunk it, and materialize it as a Zarr store.
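For readers who have not used xarray-beam, here is a minimal sketch of what such a pipeline could look like. It is not the code from the example repo. It assumes the virtualizarr reference was written in kerchunk format and can be opened through fsspec's `reference://` filesystem, that the dataset has a dimension named `time`, and that all variables share one chunking scheme. The function names, the `remote_protocol` default, and the 365-step target chunk are hypothetical.

```python
from typing import Dict, List, Mapping, Optional


def target_chunks_for(dim_sizes: Mapping[str, int], time_chunk: int = 365) -> Dict[str, int]:
    """Hypothetical target layout: chunk along "time", keep other dims whole."""
    return {
        dim: min(time_chunk, size) if dim == "time" else size
        for dim, size in dim_sizes.items()
    }


def rechunk_virtual_to_zarr(
    reference_path: str,
    output_store: str,
    remote_protocol: str = "gs",  # assumption: where the NetCDF files live
    pipeline_args: Optional[List[str]] = None,
) -> None:
    # Heavy dependencies are imported lazily so the helper above stays
    # usable without Beam installed.
    import apache_beam as beam
    import fsspec
    import xarray_beam as xbeam

    # Expose the reference as a Zarr-like key/value store.
    mapper = fsspec.get_mapper(
        "reference://", fo=reference_path, remote_protocol=remote_protocol
    )

    # Open lazily (no dask) and infer the on-disk chunks of the source.
    source, source_chunks = xbeam.open_zarr(mapper, consolidated=False)

    template = xbeam.make_template(source)
    target_chunks = target_chunks_for(source.sizes)
    itemsize = max(var.dtype.itemsize for var in source.data_vars.values())

    with beam.Pipeline(argv=pipeline_args) as pipeline:
        (
            pipeline
            | xbeam.DatasetToChunks(source, source_chunks, split_vars=True)
            | xbeam.Rechunk(source.sizes, source_chunks, target_chunks, itemsize=itemsize)
            | xbeam.ChunksToZarr(output_store, template, target_chunks)
        )
```

Running on Dataflow would then be a matter of passing the usual Beam runner flags through `pipeline_args`; the specific flags and machine types used for the timings in this issue are not shown here.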

Rough and optimized timings on Dataflow:

  • 1.3 TB dataset
  • 13 variables (merge op)
  • 60 years of daily data
  • 33 minutes on Dataflow using ARM instances
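To put these numbers in perspective, the back-of-the-envelope calculation below turns them into an aggregate throughput. It is only a rough sketch: it assumes 1 TB = 10^12 bytes, that 1.3 TB is the amount of data moved in the 33 minutes, and an average of 365.25 days per year. The variable names are mine.

```python
# Figures quoted in the issue.
DATASET_BYTES = 1.3e12        # 1.3 TB, assuming decimal terabytes
WALL_CLOCK_SECONDS = 33 * 60  # 33 minutes
YEARS_OF_DAILY_DATA = 60

# Aggregate throughput across all workers, in MB/s.
throughput_mb_s = DATASET_BYTES / WALL_CLOCK_SECONDS / 1e6

# Approximate number of daily time steps, assuming 365.25 days per year.
approx_time_steps = round(YEARS_OF_DAILY_DATA * 365.25)

print(f"aggregate throughput: ~{throughput_mb_s:.0f} MB/s")
print(f"daily time steps: ~{approx_time_steps}")
```

That works out to roughly 657 MB/s in aggregate over about 21,915 daily time steps. Worker count is not given in the issue, so per-worker throughput cannot be derived from these figures.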

ToDo:

  • Clean up repo.
  • Add a stage that generates the virtualizarr references.
  • Compare (a subset) against standard pangeo-forge recipe.

cc @jbusecke @SammyAgrawal
