
absent-vs-null caltrops #191

Open
warpfork opened this issue Jun 8, 2021 · 10 comments
Labels
P3 Low: Not priority right now

Comments

@warpfork
Collaborator

warpfork commented Jun 8, 2021

We have some errata about how Absent vs Null are handled. This is a distinction that is sometimes useful, but also very subtle.

(Absent is somewhat similar to javascript's undefined. I hope we pull it off better than javascript did, frankly -- but javascript's history with this concept is certainly a valid cautionary tale about the degree of subtlety, too.)

At least one piece of code that touches on these corner cases resulted in pain today: a codec coerced an Absent to a Null and serialized data that way. The resulting serial data could then not be parsed back: because the schema declared "optional" rather than "nullable" for the affected fields, a null token was (correctly!) rejected in that position as data that was not matching the schema.

It would be better to not be able to end up in this position (or, at minimum, make it much harder to end up in this position accidentally).

I have three sets of ideas, in increasing levels of invasiveness, which could be relevant:


The first change is the most direct.

  • Codecs should absolutely not silently coerce an ipld.Absent node to be encoded as a null token.
    • In the past: I thought this might be "helpful", because it would mean you could use e.g. the dag-json codec as a "debug" view on the typed data (even if you have a low-readability representation strategy somewhere in the schema).
    • In the present: It is now clear that this footgun is way bigger and more painful than the hypothetical upside. So let's cut this out.
    • What it should do instead: error immediately upon detecting an Absent. Erroring is much better than silently coercing values.

We can make a norm of having codec implementations do this check, and flunk that data.

(Unfortunately, we can only make it a strongly recommended feature for codec implementations to do this check. I can't think of a way to make the type system help us enforce this rule for any third party codecs. But maybe that's fine.)

(It's also still concerningly easy for a codec to forget to check this, because absent nodes report their Kind as ipld.Kind_Null, and you have to check IsAbsent or == ipld.Absent to disambiguate. But again, maybe that's fine (for now), and we just try to make a norm of it anyway.)
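To make the norm concrete, here's a minimal sketch of what the check could look like at the start of a codec's token emission. The `Kind`, `Node`, `nullNode`, and `absent` types below are toy stand-ins invented for this sketch (the real `ipld.Node` interface carries many more methods); only the shape of the check is the point.

```go
package main

import (
	"errors"
	"fmt"
)

// Toy stand-ins for ipld.Kind / ipld.Node; hypothetical, for illustration only.
type Kind int

const (
	Kind_Invalid Kind = iota
	Kind_Null
)

type Node interface {
	Kind() Kind
	IsAbsent() bool
}

type nullNode struct{}

func (nullNode) Kind() Kind     { return Kind_Null }
func (nullNode) IsAbsent() bool { return false }

type absent struct{}

// Note the trap: an absent node still reports Kind_Null...
func (absent) Kind() Kind { return Kind_Null }

// ...so the codec must ask IsAbsent explicitly to disambiguate.
func (absent) IsAbsent() bool { return true }

var ErrAbsent = errors.New("codec: refusing to encode absent value as null; absent only exists at the type level")

// encodeScalar is where a codec's token emission would start: flunk absent
// data immediately instead of silently coercing it to a null token.
func encodeScalar(n Node) (string, error) {
	if n.IsAbsent() {
		return "", ErrAbsent
	}
	switch n.Kind() {
	case Kind_Null:
		return "null", nil
	default:
		return "", fmt.Errorf("unhandled kind %d", n.Kind())
	}
}

func main() {
	_, err := encodeScalar(absent{})
	fmt.Println(err) // the coercion is now an immediate, loud error
}
```

Because the `IsAbsent` branch comes before the kind switch, a codec written this way can't fall into the `Kind_Null` case by accident.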

This will detect this particular situation of accidental coercion of absent values into nulls. It should stop people from reaching a situation where they have serial data which they believe is correct but was actually serialized from a level they didn't intend. That's the most important goal. (However, also remember that this only kicks in if there's Absent values in the data -- it's not a general detector for accidentally serializing type-level data when one meant to serialize representation-level data... so, depending on where you're putting your goalposts, this may be seen as either a full solution to the specific problem, or a "partial" solution to a bigger problem.)

There's also one downside, which is that the ability to use a codec as a "debug" view is lost if the data contains any Absent values. However, that's a relatively minor downside. (And honestly, that was probably a "buggy" "feature" anyway -- for example, it was ambiguous for a field which is both optional and nullable.)


The second possible change area is a little more complicated.
The goal would be to increase clarity for encode functions, preventing accidental misapplication of them on type-level data (when instead the encode function should've been applied on representation-level data, which is... the vast majority of the time):

  • Have codecs check for the schema.TypedNode interface on the ipld.Node value, and just say "no" in that case.
    • However, we may not want to do this without still also providing a way to do things like this (again, debug output is valuable), so, next point:
  • Add an options parameter of some kind to encoders which lets you opt into this.
    • Or a whole alternate function, which explicitly accepts schema.TypedNode.
    • In general, just make it require intentional engagement, rather than easy to do this accidentally.
  • A similar feature might be advisable for ADLs. Accidentally attempting to serialize a large ADL into one block would also generally be a pretty nasty bug.
    • (Also similarly, though... while it's rare, there may be times when that's exactly what you meant to do. So again override options would be wanted.)
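A sketch of the "say no unless opted in" variant might look like the following. Every name here is invented for illustration (`EncodeOptions`, `AllowTyped`, and the toy `Node`/`TypedNode` interfaces are not existing go-ipld-prime API); the point is just the denylist check plus the explicit opt-in.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical minimal shapes; the real ipld.Node and schema.TypedNode
// interfaces are much larger.
type Node interface {
	Describe() string
}

type TypedNode interface {
	Node
	Representation() Node // the representation-level view of the same data
}

type EncodeOptions struct {
	AllowTyped bool // opt-in escape hatch for debug-style output
}

func encode(n Node, opts EncodeOptions) (string, error) {
	// Denylist approach: detect the fancier node and say "no" unless the
	// caller explicitly engaged the override.
	if _, isTyped := n.(TypedNode); isTyped && !opts.AllowTyped {
		return "", errors.New("refusing to encode a typed node; did you mean to encode n.Representation()?")
	}
	return n.Describe(), nil
}

// --- demo fixtures ---

type plainNode struct{ s string }

func (p plainNode) Describe() string { return p.s }

type typedNode struct{ repr plainNode }

func (t typedNode) Describe() string     { return "type-level view" }
func (t typedNode) Representation() Node { return t.repr }

func main() {
	tn := typedNode{repr: plainNode{s: "representation-level view"}}
	if _, err := encode(tn, EncodeOptions{}); err != nil {
		fmt.Println("blocked:", err)
	}
	out, _ := encode(tn.Representation(), EncodeOptions{})
	fmt.Println("encoded:", out)
}
```

Note that because Go interfaces are structural, the check is a denylist by necessity: anything with a `Representation` method matches, and everything else passes through.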

Like above, in the first change concept, I can't actually think of a good way to use the type system to enforce that all codecs do this.
It's intended that the typed view of data (and ADLs, similarly) provide at least all the methods of ipld.Node.
The consequence is that there's no interface that says "this is 'just' a node": we're stuck detecting the fancier nodes and denylisting them, rather than acceptlisting the things that are known to be sensible.
But maybe this is fine, and we should make norms out of this regardless.

I feel a bit weird about aiming in this direction, because it's kind of intentionally crippling a couple of things that are composable, just because of the encounter rate with some footguns that composability enables.
However, the footguns are pretty big, so it's still worth considering.


A possible even more aggressive further pursuit, focused on the null-vs-absent situation:

  • Remove IsAbsent from ipld.Node entirely.
  • Add IsAbsent to schema.TypedNode, instead.
  • A node that is absent must now report its Kind property as ipld.Kind_Invalid?
    • This might break some code. (But most of it is probably within go-ipld-prime (or is codegen outputs that can be regenerated).)
      • (The scariest thing about this is that the code which becomes logically broken won't be broken in a way that's compile-time obvious, so there may be a burn-in period of bug fixes after this.)
    • It might alternatively be better to do this by introducing a new kind of ipld.Kind_Absent. Clearer. (Leaves ipld.Kind_Invalid only as an extremely clear indicator that You Goofed somewhere and have uninitialized or seriously bonkers data.)
      • On the third hand: "absent" isn't a "data model kind", it's a type kind, so adding it to the kind enum may be problematically confusion-generating.
    • We could also do just this, and leave the IsAbsent predicate where it is, if that turns out easier for some reason.
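To see why the `ipld.Kind_Absent` variant changes the failure mode, here's a sketch with a hypothetical kind enum (again, invented for illustration; no such kind exists today). With a dedicated kind, even a codec whose author never thought about absence falls into an error path rather than the null case:

```go
package main

import "fmt"

// Hypothetical kind enum sketching the "introduce Kind_Absent" variant.
// Today absent nodes report Kind_Null; here they'd get their own value,
// so a codec's kind switch can't conflate them with real nulls.
type Kind int

const (
	Kind_Invalid Kind = iota // uninitialized / seriously bonkers data
	Kind_Null
	Kind_Absent // type-level-only marker; no serial token exists for it
)

func tokenFor(k Kind) (string, error) {
	switch k {
	case Kind_Null:
		return "null", nil
	case Kind_Absent:
		// A codec that forgot about absence entirely still lands here,
		// instead of silently emitting a null token.
		return "", fmt.Errorf("no serial token exists for absent values")
	default:
		return "", fmt.Errorf("invalid kind %d", k)
	}
}

func main() {
	_, err := tokenFor(Kind_Absent)
	fmt.Println(err)
}
```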

This would decrease the odds of someone accidentally writing a codec in the future which emits null tokens when encountering absent values.
(It would fix the current problem of how concerningly easy it is to make this coercion by accident because absent nodes report their Kind as ipld.Kind_Null, and you have to check IsAbsent or == ipld.Absent to disambiguate.)

Moving the IsAbsent predicate to schema.TypedNode would also make it clearer in general that "absent" should not occur in the regular data model, which could be a bonus win.

As you can see by how the bullet points started bifurcating into a choice forest there, I'm less sure about this. There are a couple different options here. It's not clear which is best. They seem to vary in terms of how error-prone they are in golang vs how well they match the data model spec, which is pretty disconcerting.


Overall I think change #1 is the most reasonable. It fixes the biggest and most painful of the footguns, and is clear enough to execute on.

Maybe with #1 implemented, then any need for #2 will actually fade. The other consequences of accidentally encoding something from the type-level view instead of representation-level view aren't very bad, as long as they don't hit this coercion stuff. (Usually, you can rehydrate that data again: just use the type-level prototypes when loading it, and you're matched up.) (The absent->null coercion is especially nasty just because it's (potentially) lossy, and because it doesn't roundtrip at all anytime there's any optional fields with absent values in play.)

At any rate, all these are now documented for consideration.

@warpfork
Collaborator Author

warpfork commented Jun 8, 2021

I really do not appreciate that github thinks that a #1 in prose is reasonable to turn into a link to issues without prompting.

@willscott
Member

can we not attempt to see if there is a .Representation() method on the passed in node and encode that in the codec unless explicitly told otherwise?

@warpfork
Collaborator Author

warpfork commented Jun 9, 2021

That's basically option 2, except s/say "no"/quietly call .Representation()/. So yes, it's definitely possible.

@warpfork
Collaborator Author

warpfork commented Jun 9, 2021

I think I lean towards preferring "say no" (with an error message that provides a helpful suggestion, like "did you mean to call Representation first, and encode that?") vs quietly calling a method, if we pursue option 2.

The quietly-call-the-method route would result in wider changes of behavior from smaller/less-direct changes in arguments if one did pass additional config later, and my current thought is that this would be more magical looking and more confusing than just nudging people to make sure not to forget the .Representation() step. Not sure if that's a strongly held belief though.

@willscott
Member

i think this will remain a caltrop. i feel like the default thing when i encode a node is that i want the representation encoded as i expect. the debug functionality is something i should opt into.
making me remember this extra weird detail every time has.. failed so far

@mvdan
Contributor

mvdan commented Jun 9, 2021

i feel like the default thing when i encode a node is that i want the representation encoded

I fully agree - that's what I've argued before and in #190 :) the Go APIs should do what most users expect by default.

@warpfork
Collaborator Author

warpfork commented Jun 9, 2021

I think maybe we'll want to look at how the code would turn out before deciding on this detail.

My concern with making all codecs do.... anything, really... is mostly in terms of the odds of defect rates in third party codecs; and how odd the results will be for any defects that are likely.

We don't have one good single point to wrap all codecs. ipld.Encoder is just a function interface. We could wrap the calls when they're passing through the multicodec system, sure; but that's not a universal entrypoint, so I'm afraid it will make things weirder.

Resolving to consider it a bug (and definitely not a feature) when a codec coerces an absent to a null is fairly clear. As mentioned in the original post, I'm still concerned with the potential defect rate here, and wish our APIs made those odds less, but steps on that are hard. Importantly, though, if we do have a defect in the future where a codec incorrectly does this coercion, it's very clear where to report the bug.

If we resolved to make "check for schema.TypedNode or ipld.ADL and then transform your arguments before proceeding" be a feature of every codec, we'll have to copy and paste that (or make a mixin function; details) into the start of every codec function. And count on third party codecs to do the same. The odds of defect here seem slightly higher, because the "correct" thing would now require additional action (whereas handling tokens, in the above story, is already key to the job). And then it requires even more action to add back in the ability to override this, so that it's possible to encode the type level representation if the user knows what they're asking for. Then we get to the even bigger concern: if a third party codec doesn't do this... how odd are the results? I'm afraid very odd: the user will be able to "patch" their code to do the right thing by calling .Representation() (and they probably will; "path of least resistance"), but now it'll be inconsistent depending on what codec they're using... oh boy.

I think we might get considerably better consistency by the chiding approach, because if a third party codec doesn't implement that, it doesn't produce odd results wherein the path of least resistance which a user will take to compensate for the issue then creates other problems.

@warpfork
Collaborator Author

I've revisited part of my last comment...

if a third party codec doesn't do this... how odd are the results?

What I was worrying about is that if a third party codec didn't do this, an end user would add a .Representation() call to their side... and that's still true... but what I was really worrying about is that then if the user switched to another codec which did have the additional internal transformation, then they'd have to remove code again so it didn't get transformed twice. And that's not true. It's a fixed point. You can't .Representation().Representation().

So there's a lot less of a problem there than I had been imagining previously.
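The fixed-point property can be seen in a small sketch of the quietly-shift-gears variant. The `reprOf` helper and the toy interfaces below are invented for this sketch; the point is that plain nodes pass through untouched, so applying the shift twice is a no-op:

```go
package main

import "fmt"

// Toy stand-ins echoing ipld.Node and schema.TypedNode; hypothetical.
type Node interface{ Label() string }

type TypedNode interface {
	Node
	Representation() Node
}

// reprOf quietly shifts to the representation view when handed a typed node.
// Plain nodes pass through untouched, which is what makes this a fixed point:
// reprOf(reprOf(n)) == reprOf(n) for any n.
func reprOf(n Node) Node {
	if tn, ok := n.(TypedNode); ok {
		return tn.Representation()
	}
	return n
}

type plain struct{ l string }

func (p plain) Label() string { return p.l }

type typed struct{ repr plain }

func (t typed) Label() string        { return "typed" }
func (t typed) Representation() Node { return t.repr }

func main() {
	n := typed{repr: plain{l: "repr"}}
	once := reprOf(n)
	twice := reprOf(once)
	fmt.Println(once.Label(), twice.Label()) // repr repr: already at the fixed point
}
```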

@warpfork
Collaborator Author

There's a lot going on in this issue, and some of it pretty heady -- but, I believe the state of play here has at least partially, tangentially, become improved:

#232 will be introducing new helper functions for encoding and decoding which do indeed automatically shift gears into using representation nodes and nodebuilders whenever typed nodes are seen in play.

While that addresses none of the deeper theoretical soundness issues here, it does have the potential to drastically decrease the frequency of painful end-user encounters with this subtlety, and it does it immediately and universally for users of the new API, rather than requiring all codecs to adopt it individually.

The linked PR is for newly added APIs. I have not drafted a similar change to the existing methods on LinkSystem yet, but that would also seem like a reasonable step to take in the near future.

@willscott
Copy link
Member

We hit this again in the indexer wire format. The times one is supposed to call .Representation continue to be non-obvious and lead to hard-to-understand errors
