Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Total refactor #27

Merged
merged 34 commits into from
Jan 22, 2021
Merged

Total refactor #27

merged 34 commits into from
Jan 22, 2021

Conversation

rabernat
Copy link
Contributor

@rabernat rabernat commented Dec 2, 2020

This is the beginning of a complete rewrite of pangeo forge, learning from our initial steps.

The core of the new code is a Recipe class, which is used like this:

r = recipe.NetCDFtoZarrSequentialRecipe(
    input_urls=netcdf_local_paths,
    sequence_dim="time",
    inputs_per_chunk=1,
    nitems_per_input=daily_xarray_dataset.attrs["items_per_file"],
    target=tmp_target,
    input_cache=tmp_cache,
)

# manual execution of a recipe
r.prepare()
for input_key in r.iter_inputs():
    r.cache_input(input_key)
for chunk_key in r.iter_chunks():
    r.store_chunk(chunk_key)
r.finalize()

The idea is that we will then implement a RecipeExecutor class which runs these recipes. This will work because the methods prepare, cache_input, store_chunk and finalize are not regular methods. The are actually properties that return serializable functions.

There is significant work to do here, but I feel pretty good about where this is going...

@rabernat
Copy link
Contributor Author

rabernat commented Dec 4, 2020

Black and flake8 both pass on my local env but are failing in this action. Not what I want to be debugging right now.

Copy link
Collaborator

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still catching up here so I've mostly asked questions to improve my understanding. Generally, I'm a fan of the redesign with my only concerns being how we leave space for extension to pipelines that may not use xarray or zarr.

@@ -1,6 +1,6 @@
from pkg_resources import DistributionNotFound, get_distribution

from pangeo_forge.pipelines import AbstractPipeline
# from pangeo_forge.pipelines import AbstractPipeline
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reminder to remove this.

from ..recipe import DatasetRecipe


class PrefectExecutor:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to define a base Executor class? Neither class has an init method in this PR. Are these just wrappers or do we think we'll want to parameterize the executor at some point?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this a bit more, I think I'd like to see init and repr methods on all these classes.

from functools import partial
from typing import Callable, Iterable

# from ..types import Pipeline, Stage, Task
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reminder to remove this

from .storage import InputCache, Target
from .utils import chunked_iterable, fix_scalar_attr_encoding

# logger = logging.getLogger(__name__)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this not working?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does work. I'll put it back

Comment on lines 90 to 93
# Notes about dataclasses:
# - https://www.python.org/dev/peps/pep-0557/#inheritance
# - https://stackoverflow.com/questions/51575931/class-inheritance-in-python-3-7-dataclasses
# This means that, for now, I can't get default arguments to work.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what do dataclasses give us here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dataclasses reduce the amount of boilerplate we have to write / maintain. None of these classes needs init methods.

I believe we can fix the default arguments problem by tweaking the mixin order. This requires me to understand python's method resolution order 🤯.



@dataclass
class StandardSequentialRecipe(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's think about a more informative name here. Once concern I have is that we'll get to Xarray->Zarr focused. While I'm quite happy to support that workflow as the primary and initial implementation, I want to make sure we leave room for a Rasterio->COG workflow or Pandas->Parquet workflow.



@dataclass
class Target:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be called a ZarrTarget?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe MapperTarget? There is nothing zarr specific about it...

Comment on lines 49 to 50
def _full_path(self, path):
return os.path.join(self.prefix, _hash_path(path))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what its worth, I've found that hashing paths like this can make it difficult to debug failed workflows. Maybe you can explain a bit more why we need to hash all paths like this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. My thinking was: I want there to be unique mapping between inputs and paths in the cache. Hashing achieves this. The input paths may be urls with lots of forbidden characters in them. But I'll play with some alternatives that are more readable / debuggable.

Comment on lines 169 to 173
# do we really want to just delete all encoding?
# for v in ds.variables:
# ds[v].encoding = {}

# TODO: maybe do some chunking here?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make these options? I think that yes, we generally want to remove netcdf encoding before writing to zarr. But there are probably cases where this isn't true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes agreed.

@rabernat
Copy link
Contributor Author

rabernat commented Dec 9, 2020

Thanks a lot Joe! I'll first reply to some of your questions and then update my PR in response to your comments.

@rabernat
Copy link
Contributor Author

In retrospect, I feel like this approach of creating dozens of mixins and multiple inheritance is premature complexification / abstractifiction. I recently read this blog post and felt like it was speaking directly to me! 😆

Now I think that what we should do is create a basic working recipe class that implements the methods executors expect. As we define new recipes based that don't fit this mold, we should slowly refactor this class to make it more generalizable (rather than trying to generalize everything from day 1.)

Working on this now.

@jhamman
Copy link
Collaborator

jhamman commented Dec 18, 2020

In retrospect, I feel like this approach of creating dozens of mixins and multiple inheritance is premature complexification / abstractifiction.

Couldn't agree more. I think I went through this same realization a few months ago. I think for now, the executors may be enough of an abstraction to allow us to move forward. If we find that recipes are frequently sharing elements, we can pull those out one by one.

@rabernat
Copy link
Contributor Author

rabernat commented Jan 4, 2021

Just a note: in pangeo-data/rechunker#77 I am working on an update to rechunker that intersects with this.

@joshmoore
Copy link

#27 (review) my only concerns being how we leave space for extension to pipelines that may not use xarray or zarr.

This was my first impression as well. I've not yet been able to make multiscales representations openable by xarray.

@davidbrochart
Copy link
Contributor

@joshmoore what other tool are you thinking about to generate multiscale representations?

@joshmoore
Copy link

@davidbrochart : really anything producing (largish) images (or generally spatial arrays?) would benefit from a multiscale representation. For other datatypes, I'd defer to other domains whether it's useful or not.

@rabernat rabernat marked this pull request as ready for review January 18, 2021 18:08
@rabernat
Copy link
Contributor Author

The tests are hanging intermittently. But I think everything is working.

@charlesbluca charlesbluca mentioned this pull request Jan 20, 2021
@rabernat
Copy link
Contributor Author

The tests have become so unreliable for this PR. I think it has to do with starting the HTTP server.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants