Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add public API for Dataset._copy_listed #3894

Open
nbren12 opened this issue Mar 26, 2020 · 15 comments
Open

Add public API for Dataset._copy_listed #3894

nbren12 opened this issue Mar 26, 2020 · 15 comments

Comments

@nbren12
Copy link
Contributor

nbren12 commented Mar 26, 2020

In my data pipelines, I have been repeatedly burned using indexing notation to grab a few variables from a dataset in the following way:

ds = xr.Dataset(...)
vars = ('a' , 'b', 'c')
ds[vars] # this errors
ds[list(vars)] # this is ok

Moreover, because Dataset__getitem__ is type unstable, it makes it hard to detect this kind of error using mypy, so it often appears 30 minutes into a long data pipeline. It would be great to have a type-stable method that can take any sequence of variable names and return the Dataset consisting of those variables and their coordinates only. In fact, this method already exists, but it currently not public API. Could we make it so? Thanks.

@max-sixty
Copy link
Collaborator

Would that be different from ensuring the input is a list?

Moreover, because Dataset__getitem__ is type unstable,

I very much empathize with the pain from methods being type unstable; indeed I think that's one of the biggest benefits of xarray over pandas. Here, it's stable over the same typed inputs. i.e. if supplied with a list, it returns with a dataset, otherwise it returns a DataArray. (or am I missing something?)

it makes it hard to detect this kind of error using mypy

Is there a way in mypy we could use something like overload to specify the above contract here, as an alternative to another method?

@dcherian
Copy link
Contributor

What's the reasoning for not returning a Dataset when __getitem__ is passed an Iterable like _copy_listed?

@shoyer
Copy link
Member

shoyer commented Aug 15, 2020

I agree, this API is too overloaded. It would be better to have an explicit method for subsetting Dataset variables. Maybe subset or sel_vars?

In early versions of xarray (back when it was called xray), we actually had a select method but I was concerned it was too confusing with indexing: http://xarray.pydata.org/en/v0.1.1/generated/xray.Dataset.select.html#xray.Dataset.select

What's the reasoning for not returning a Dataset when __getitem__ is passed an Iterable like _copy_listed?

The current check uses hashability to determine whether to try to make a DataArray. In theory, you could put a variable with the name ('a', 'b', 'c') into a Dataset.

@nbren12
Copy link
Contributor Author

nbren12 commented Aug 17, 2020

Is there a way in mypy we could use something like overload to specify the above contract here, as an alternative to another method?

That is correct. The output type is predictable from the inputs types. With #4144, mypy may have a chance at detecting this kind of error. Still, I don't know how many users use type-checking. I expect most will only discover this problem at runtime.

It would be better to have an explicit method for subsetting Dataset variables.

I agree. sel_vars is more clear IMO since subset could apply to the coordinates too e.g. a spatial subset.

@nbren12
Copy link
Contributor Author

nbren12 commented Aug 17, 2020

Or maybe "get" since it's a synonym of "select" that isn't overloaded with spatial indexing in the code base.

@nbren12
Copy link
Contributor Author

nbren12 commented Aug 17, 2020

NVM, get is already a method. Maybe we could overload it?

@shoyer
Copy link
Member

shoyer commented Aug 17, 2020

It would be better to have an explicit method for subsetting Dataset variables.

I agree. sel_vars is more clear IMO since subset could apply to the coordinates too e.g. a spatial subset.

We did a similar splitting of functionality recently with drop, into drop_vars and drop_sel.

So this would leave us with:

  • sel/drop_sel for indices
  • sel_vars/drop_vars for variables

The naming doesn't have an obvious pattern here, which seems non-ideal. I can't think of anything much better at the moment, but perhaps it would help to avoid reusing sel. Maybe get_vars or subset_vars?

@dcherian
Copy link
Contributor

I think avoiding sel is a good idea (but no strong thoughts here).

how about pick_vars or take_vars?

@max-sixty
Copy link
Collaborator

I do think having a Dataset behave similarly to a dict / Mapping is generally good, and allows new users to bring existing understandings around those data structures to the xarray data model.

I recognize that a hashable iterable (e.g. ('a', 'b', 'c') is an annoying corner case, though.

@max-sixty
Copy link
Collaborator

A level down, re the name — I thought .vars might be decent — but potentially it's too similar to .variables — which is a mapping to the actual variables (and can't take multiple selections)

@jthielen
Copy link
Contributor

jthielen commented Aug 19, 2020

Would this proposal mean that subsetting variables with __getitem__ would no longer work? If so, I'd make the humble request as a downstream user/library contributor for it to have a very generous deprecation period, since it is core functionality I rely on a lot (and include in many tutorials).

@keewis
Copy link
Collaborator

keewis commented Aug 19, 2020

IIUC, what we're discussing here is adding a new method that treats all sequences the same (__getitem__ won't be touched, except maybe the type hints).

@shoyer
Copy link
Member

shoyer commented Aug 19, 2020

At most, I would require using the new method if you want your code to type-check properly.

@stale
Copy link

stale bot commented Apr 16, 2022

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

@stale stale bot added the stale label Apr 16, 2022
@nbren12
Copy link
Contributor Author

nbren12 commented Apr 18, 2022

I think the issue is still valid, we just couldn't think of what to name the new API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants