Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

datatype()s can declare default values #6374

Closed

Conversation

cosmicexplorer
Copy link
Contributor

@cosmicexplorer cosmicexplorer commented Aug 21, 2018

Problem

It's true that we can define default values and other logic for the arguments of a datatype by overriding the __new__ method, but it's False that it's the most ergonomic, declarative, or composable way to do that. What if we could reach for the stars?

Solution

# This is the recommended style of importing this class.
from pants.util.objects import DatatypeFieldDecl as F
from pants.util.objects import datatype

class MyType(datatype([F('an_int', int, default_value=3)])): pass

assert(MyType().an_int == 3)
assert(MyType(4).an_int == 4)
MyType('asdf') # raises TypeError

# The above is the same as:

from pants.util.objects import Exactly

class MyType(datatype([F(
  field_name='an_int', 
  type_constraint=Exactly(int), 
  default_value=3, 
  has_default_value=True,
)])): pass

# You can still use the tuple syntax:

# This is the same syntax allowed by Python 3.7's typing.NamedTuple!
class MyOtherType(datatype([('x', int)])): pass

MyOtherType('asdf') # raises TypeError

Result

There are a few elements that I have since removed and split off into separate changes to make this mergeable, but which would help to showcase how neat and compositional this feature is. For example, I have implemented (again, not in this PR) an optional() method producing a type constraint which will accept None, or an object matching the given type constraint (if given), as well as convert(), which will check if an object matches a type constraint, or if not, coerce it in a structured way. This allows succinctly expressing a large variety of the usages of datatype() in this repo which were previously overriding __new__. With those changes, I was able to wipe away a lot of / almost all __new__ methods entirely, which seemed like a huge win in (de)clar(at)i(vi)ty.

Also, I added a TODO to use python 3 dataclasses when we can restrict the python version on our python 3 test shard to CPython>=3.6.

Please let me know your thoughts on the syntax and implementation!

@cosmicexplorer cosmicexplorer force-pushed the datatype-default-values branch 3 times, most recently from 569881f to 6cdc180 Compare August 21, 2018 09:42
@jsirois
Copy link
Contributor

jsirois commented Aug 21, 2018

I want to make sure you were aware of https://www.python.org/dev/peps/pep-0557/. One path to the stars is to standardize our codebase with dataclasses if our flip to python 3 can require at least 3.6 (backport here https://github.com/ericvsmith/dataclasses).

@Eric-Arellano
Copy link
Contributor

Good point about dataclasses.

One concern I have about our own data type implementation is that static type checkers like MyPy won’t understand it. I’m hoping once we drop Py2 we can start adopting Py3’s type hints, and we’ll get more meaningful results if we use dataclasses.

Other than that concern, this looks like a very helpful and sensible change!

@cosmicexplorer
Copy link
Contributor Author

I want to conform to existing standards as much as posssible and will review the links provided.

@cosmicexplorer
Copy link
Contributor Author

In response to the comments, my plan for this PR now is to transition to use the dataclasses library for py2 and the py3 builtin one, and to get an idea of how and where this change affects perf. For things like checking whether args are strings or ints or something else that can be described in types, this can be partially checked statically (depending on the annotation quality and the type checker's abilities), so I can immediately see one reason why we might expect better perf from moving to the dataclass system (by ensuring those kinds of conditionals don't need even a probably-branch-predicted check at runtime).

@cosmicexplorer cosmicexplorer changed the title [WIP] Datatype default values [WIP] use dataclass 3rdparty library to implement datatype() Aug 25, 2018
@cosmicexplorer cosmicexplorer changed the title [WIP] use dataclass 3rdparty library to implement datatype() datatype()s can declare default values Sep 1, 2018
@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Sep 1, 2018

After realizing that the dataclass library is only available for 3.6+, and our python 3 shard doesn't require that version yet, in a fit of cowardice I reneged on the approach above and instead just went to make the original PR pass CI (and just accidentally hit the "comment and close" button). I left a TODO describing the use of dataclasses once we can appropriately bound the python 3 shard's interpreter version.

@cosmicexplorer
Copy link
Contributor Author

I'm going to revert everything that's not strictly about default values and make it into a followup PR to make this easier to review.

@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Sep 1, 2018

I reverted 200 lines and kept this diff to the changes necessary to use default values, and a small number of examples, instead of any deeper code. Will publish a followup PR that allows us to completely remove a large number of __new__ overrides, once this PR is merged.

My main concern right now is that the DataType._parse_args_kwargs() classmethod (which parses out named and positional arguments in order to set our default values, because the namedtuple constructor would otherwise raise a TypeError for missing arguments) might be slow (which I can profile). While shepherding this through CI, I'm going to look into whether we could support mypy type annotations in datatype classes in a separate PR, or make an issue if I can't figure it out.

Copy link
Contributor

@jsirois jsirois left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I count 6 lines of __new__ code removed as a result of this feature (there were other __new__'s removed that were not needed in the 1st place). This seems like a high cost (+439/-117) for that. Are you convinced its not worth waiting for python3.6 and dataclasses?

'time',
# TODO: an `optional(type_constraint=None)` type constraint factory which matches values
# matching the given constraint, as well as None.
('failure', None, None),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The degree of mysteriousness here is very high. The natural question from the reader is why ('failure', None, None), instead of 'failure',; ie: what does 'name', above do that this does not or vice-versa? Maybe the TODO changes 'failure' in such away that None is no longer tossed about?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One alternative is to use the DatatypeFieldDecl() constructor directly, with named arguments. One thing I did notice is that in the python 3 typing module, the NamedTuple class accepts precisely the syntax we currently (on master) have for typed fields:

from typing import NamedTuple
ClassName = NamedTuple('ClassName', [('x', int)])
# versus
from pants.util.objects import datatype
class ClassName(datatype([('x', int)])): pass
# compare to:
from pants.util.objects import DatatypeFieldDecl as F
class ClassName(datatype([
    F(field_name='failure', default_value=None, has_default_value=True),
])): pass

So while we have actually so far kept compatible field declaration syntax with NamedTuple, this PR allowing mere positional tuples with this mysterious extended syntax as you've pointed out here goes past that. I think using the DatatypeFieldDecl constructor directly with named arguments, as above, might help this situation, and ensure we make it clear when we're writing something incompatible.

@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Sep 3, 2018

Well, I can show what it's like with all of those TODOs implemented, which would make more convincing a case (mostly the Convert wrapper mentioned, for things like converting inputs to tuple if they're not already, and providing a default value of e.g. tuple()). That wrapper introduces some more mysteriousness (we rely on the return value of validate_satisfied_by to do this transformation -- we already do this, but we didn't rely on it before), but that's what really cuts down on the need for most __new__ implementations, and I actually think the result with that is very terse and effective without too much magic, since the Convert(tuple) is pretty explicitly declared in the datatype() call. I've had that percolating and can try putting it back, it didn't require a lot of extra code but I was trying to cut this down if possible. I explicitly reverted all the __new__ implementations that I was able to remove in the name of cutting this diff's complexity down so it is very easy for me to add them back (meaning, to remove the __new__ override using this new functionality).

@cosmicexplorer cosmicexplorer changed the title datatype()s can declare default values [WIP] datatype()s can declare default values Sep 3, 2018
Copy link
Contributor

@jsirois jsirois left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll step away and let others review this change.

@cosmicexplorer
Copy link
Contributor Author

This is back as a WIP after making several sweeping changes. Please take a look at the added tests in test_objects.py, starting here, to see what functionality was added (along with the edits to many datatypes existing in the codebase).

Note that we are not doing anything weird anymore with parsing arguments to __new__ that might be bad for performance. I think this all looks neat and I can very easily split out e.g. optional, convert, and convert_default into a separate PR if this seems like a useful effort.

@cosmicexplorer cosmicexplorer changed the title [WIP] datatype()s can declare default values datatype()s can declare default values Sep 4, 2018
@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Sep 4, 2018

@Eric-Arellano It will be very exciting when python 2 can be removed! I'm only complaining because I can't figure out how to run python 3 tests locally, or at least, I keep seeing:
FAILURE: Unable to detect a suitable interpreter for compatibilities: CPython>=3.4,<3.7 (Conflicting targets: tests/python/pants_test/util:objects)

I spent some time investigating this yesterday but didn't end up anywhere, I'm sure I can figure it out I just haven't yet.

@stuhood
Copy link
Sponsor Member

stuhood commented Sep 5, 2018

This has gotten really large, and probably needs some discussion. In particular, at least one reviewer has resigned due to the size, and I'm concerned that this is overkill when compared to defining __new__, which is really not much of a burden.

@cosmicexplorer
Copy link
Contributor Author

I think I can slim this down to a much smaller number of lines given the epiphany I mentioned earlier in setting default values. I've pushed up the current version to cosmicexplorer:datatype-default-values-stress-me-out and will rebase this and slim down this PR now substantially.

The intention with writing all of this code was not to have someone just accept it, but to then have a discussion about it after the iterative process of figuring out the interface I wanted for more complex typing. I did not have any idea in my head of what this could look like until I spent the time banging it out. It's also useful to note that 90% of the changes are just rearrangement of objects.py and adding tests, which I don't consider to be adding complexity whatsoever. However, I definitely should not have removed the [WIP] prefix, and it would have been extremely helpful to explain what each subsequent commit was intended to do. I know ~exactly what this can be reduced to and will do that now.

@cosmicexplorer cosmicexplorer force-pushed the datatype-default-values branch 2 times, most recently from b68b46c to c622065 Compare September 8, 2018 22:49
@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Sep 8, 2018

I wiped away all the extraneous stuff and avoided reordering anything in objects.py, which leads to a much smaller diff. I have also updated the OP. I would like to respond to @stuhood:

This has gotten really large, and probably needs some discussion. In particular, at least one reviewer has resigned due to the size, and I'm concerned that this is overkill when compared to defining __new__, which is really not much of a burden.

Maybe defining __new__ isn't much of a burden, but reading a __new__ override to understand what values it accepts and what coercions it may perform to the arguments can be a burden.

The intent of the PR before I cut it down just now is not just to make it easy to add default values, but to add a syntax to declaratively specify what arguments are accepted, and what those arguments will eventually resolve to (which is actually completely orthogonal to default values -- I didn't realize this until I had been able to iterate on it). For example, convert is one of the type constraint factories that is now removed from this PR. Declaring a field with ('field_name', convert(tuple)) states:

  1. The field is named field_name.
  2. The field will always contain a tuple.
  3. The default value of the field, if not provided to the constructor, is tuple().
  4. Any value for this field not matching SubclassesOf(tuple) has the function tuple() applied to it.

These type constraints are now composable. The goal is not to do any of the work that should be done in a rule, but to make it easier to look at a datatype definition and understand what values it accepts and what values it doesn't. For example, lists aren't hashable, but it's really not necessary for the argument to the constructor to be a tuple, in most cases. However, this is currently required for e.g. ExecuteProcessRequest, exposing the implementation detail that "tuple is hashable" as part of the interface. These can be overcome in __new__ overrides, but to me the much larger previous iteration seems to have several examples of making the code significantly easier to follow (because it's declarative) compared to a __new__ override, which usually tells you nothing "at a glance" without spending time to understand the precise logic in the constructor.

I can understand being concerned about complexity of implementation, but it seems like the result is fantastic -- consider the current definition of AddressMapper:

class AddressMapper(datatype([
  'parser',
  'build_patterns',
  'build_ignore_patterns',
  'exclude_target_regexps',
  'subproject_roots',
])):
  """Configuration to parse build files matching a filename pattern."""

  def __new__(cls,
              parser,
              build_patterns=None,
              build_ignore_patterns=None,
              exclude_target_regexps=None,
              subproject_roots=None):
    """Create an AddressMapper.

    Both the set of files that define a mappable BUILD files and the parser used to parse those
    files can be customized.  See the `pants.engine.parsers` module for example parsers.

    :param parser: The BUILD file parser to use.
    :type parser: An instance of :class:`pants.engine.parser.Parser`.
    :param tuple build_patterns: A tuple of fnmatch-compatible patterns for identifying BUILD files
                                 used to resolve addresses.
    :param list build_ignore_patterns: A list of path ignore patterns used when searching for BUILD files.
    :param list exclude_target_regexps: A list of regular expressions for excluding targets.
    """
    build_patterns = tuple(build_patterns or ['BUILD', 'BUILD.*'])
    build_ignore_patterns = tuple(build_ignore_patterns or [])
    exclude_target_regexps = tuple(exclude_target_regexps or [])
    subproject_roots = tuple(subproject_roots or [])
    return super(AddressMapper, cls).__new__(
        cls,
        parser,
        build_patterns,
        build_ignore_patterns,
        exclude_target_regexps,
        subproject_roots
      )
  # ...

Now consider the version that was in the previous version of this PR (with the exact same behavior, except we also check that our parser is a Parser):

class AddressMapper(datatype([
  ('parser', SubclassesOf(Parser)),
  ('build_patterns', convert_default(tuple, default_value=('BUILD', 'BUILD.*'))),
  ('build_ignore_patterns', convert_default(tuple)),
  ('exclude_target_regexps', convert_default(tuple)),
  ('subproject_roots', convert_default(tuple)),
])):
  """Configuration to parse build files matching a filename pattern.""

It's not cheating to remove the __new__ docstring -- that's the idea. convert_default was another type constraint factory that's not in this PR, which provides None as the default value to the constructor, but if None is provided as the argument value, it will instead return default_value, if provided, or tuple() (for the case of convert_default(tuple)).

A final interesting part to me is that we can do something like:

IntoTuple = convert_default(tuple)

class MyType(datatype([('arg', IntoTuple)])): pass

This makes it easy to package up a lot of related coercion-esque logic into a single object that can be imported and shared across multiple datatype() classes. Making coercion less ad-hoc and more canonical is to me a very significant and ongoing benefit.

Copy link
Contributor

@illicitonion illicitonion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a reasonable step in a good direction - I think there's a little more we can kill off :)

('env', tuple),
('output_files', tuple),
('output_directories', tuple),
# TODO: allow inferring a default value if a type is provided which has a 0-arg constructor.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be happy to just remove this TODO; inferring defaults is scary...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussion on slack, agreed and will do.

('output_files', tuple),
('output_directories', tuple),
# TODO: allow inferring a default value if a type is provided which has a 0-arg constructor.
F('env', Exactly(tuple, dict, type(None)), default_value=None),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this Noneable? I would remove type(None) from its type constraints, and default it to () or {}...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like a correct analysis and I will do that.

# NB: timeout_seconds covers the whole remote operation including queuing and setup.
('timeout_seconds', Exactly(float, int)),
('jdk_home', Exactly(text_type, type(None))),
F('timeout_seconds', Exactly(float, int), default_value=_default_timeout_seconds),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inline _default_timeout_seconds?

timeout_seconds=_default_timeout_seconds,
jdk_home=None,
):
def __new__(cls, *args, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this this implementation to be more work for me to read, rather than less... Before it massaged param, and then constructed an object as I'd expect; now it has multiple layers of validation, and the provision on the type definition for an intermediate state which would be invalid outside the constructor (having a dict as a field value)...

Maybe the answer here is to fall back to the automatically implemented __new__, and have a with_env class-function which does the env munging and passes the tuple to the constructor?

from pants.util.meta import AbstractClass


# TODO: when we can restrict the python version to >= 3.6 in our python 3 shard, we can use the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly that's not a sufficient condition; we also need to have dropped py2 support entirely

Copy link
Contributor Author

@cosmicexplorer cosmicexplorer Sep 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I was thinking there could be some parallel development (not too much) of py3-only code which would be using dataclasses. I will elaborate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parallel development of Py2 and Py3 branches is heavily discouraged in Python community, so might not be the best approach. It used to be the recommend approach in the early days, and it led to extreme complexity for library developers.

So, the new best practice is using Python 3 first code that still works on Python 2, which is what we've been doing with the future library and backports.


# NB: it is *not* recommended to rely on the ordering of the tuple returned by this method.
@memoized_classproperty
def _supertype_keyword_only_cached_constructor(cls):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One caller? Let's inline.

return super(DataType, cls).__new__(cls, **kw)
return ctor

def _replace(self, **kwds):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/kwds/kwargs/g?


This method upholds 2 contracts:
1. If no keyword arguments are provided, return the original object.
2. Do not call __new__() -- the parent class's __new__() is used instead (skipping e.g. type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would much rather just not allow copying in __new__?

@@ -316,6 +421,9 @@ def satisfied_by_type(self, obj_type):
:rtype: bool
"""

# TODO: This method could be extended to allow for coercion -- we could change the interface to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove the TODO; it's really specific detail for something which may or may not happen :)

If the argument `default_value` is provided (only by keyword), `has_default_value` is set to
True, and the argument `default_value` is used as this field's `default_value`. If
`default_value` is not provided (as when parsing a field declaration from a tuple), but
`type_constraint` is a TypeConstraint with the property `has_default_value` evaluating to True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do any TypeConstraints actually have default values? If not, let's remove the abstract method, property, and this bit of pydoc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right! Will do.

@cosmicexplorer
Copy link
Contributor Author

We can close this due to the progress made by @Eric-Arellano on #7074!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants