
'retval' proposal re: calls and multiple return values #280

Closed
kg opened this issue Jul 23, 2015 · 22 comments

@kg (Contributor) commented Jul 23, 2015

In #278, Dan suggested some good refinements for our thinking about calls and return values. Implied in this is the potential for multiple return values to show up in the future. I think it's relatively inevitable that multiple return values will creep their way in eventually, even if C/C++ never use them. In the event that they do, this has some nasty implications for our current model:

We need a way to represent the idea of an expression yielding multiple distinct values of different types.
We need a way to handle those value tuples: Assign them out into locals, etc.
A function needs a way to construct those value tuples so it can return them.

My suggestion for tackling this and some other issues is that we should define a distinct concept separate from local variables that I'll refer to as 'retval'. Effectively, retval is a tiny set of registers that contain the return value(s) of the most recent function call, with various restrictions to make it simple to reason about.

A rough sketch of how retval would work:

  • Every function call produces 0 - N return values. In the MVP, a function call is either void (no retvals) or has a single retval of type T.
  • After a function call, retval N is valid so long as N < the number of return values from the previous function. The type of this expression depends on the signature of the most recent function call, and it evaluates to the contents of that numbered return value register.
  • retval N can only be used once per return value per invocation; it consumes the result. To use a return value multiple times, store it into a local.
  • Crossing a branch invalidates the return values from the most recent function call.
    • This is probably a pain from a compiler perspective but seems valuable in terms of making the runtime side simpler. No need for SSA or phis or whatever.
  • Call expressions with a single return value evaluate to the equivalent of (call @a ...), (retval 0) - essentially this is a simple transform that can happen in the compiler, decoder, or native runtime. I'd probably put it in the decoder.
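The rules above can be simulated with a minimal Python sketch (purely hypothetical, not any real decoder or engine): retvals are populated by a call, each one is consumed at most once, and crossing a branch invalidates the whole bank.

```python
class RetvalError(Exception):
    pass

class RetvalBank:
    """Models the proposed 'retval' register bank (a sketch)."""
    def __init__(self):
        self._values = []      # values from the most recent call
        self._consumed = []    # one single-use flag per retval

    def on_call(self, results):
        # A call replaces the entire bank with its return values.
        self._values = list(results)
        self._consumed = [False] * len(results)

    def on_branch(self):
        # Crossing a branch invalidates all pending retvals.
        self._values = []
        self._consumed = []

    def retval(self, n):
        if n >= len(self._values):
            raise RetvalError(f"retval {n} out of range")
        if self._consumed[n]:
            raise RetvalError(f"retval {n} already consumed")
        self._consumed[n] = True   # single-use: consuming clears it
        return self._values[n]

bank = RetvalBank()
bank.on_call([3, 7])       # e.g. a call returning two values
print(bank.retval(0))      # -> 3
print(bank.retval(1))      # -> 7
```

Consuming the same retval twice, or reading any retval after a branch, raises an error in this model, matching the restrictions above.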

So, (complete strawman), existing code that looks like this:

local @a
setlocal @a (call @sin (const 0.75f))
call @print (getlocal @a)

would look something like this:

call @sin (const 0.75f)
call @print (retval 0)

(EDIT) or this:

call @print (call @sin (const 0.75f))

So, why? Various reasons:

  • Without this design change, function calls end up producing a temporary local for every return value. This bloats the number of locals in large functions tremendously, unless compilers pool temporary locals, in which case...
    • Now static analysis is more complicated because you have to SSA or GVN or some other horrifying three-letter-acronym to figure out the lifetime of return values, when stack/register space is needed, etc.
  • Transforming between nested and non-nested calls is simpler.
    • In the old model, hoisting a nested function call out into a statement would require you to introduce temporary locals for its return values. In this model, you just undo the return value transform described above (call @a (call @b ...) -> call @b ...; call @a (retval 0)).
  • The lifetime of return values is trivial to reason about.
    • It becomes trivial to identify unused return values. Just look for a retval.
    • It becomes easier to do optimizations in the vein of 'move the call to the location where its result is used, if it's safe', etc
    • It's way easier (IMO) to write a naive runtime with reasonable performance this way - you don't need to aggressively nuke temporaries quite so badly.
    • It's easier to figure out whether to shove return values into a register.
  • Forwarding multiple return values into a call immediately afterwards becomes easier.
    • call @a (call @b) is pretty easy to expand in the case that @b returns say, 3 values: call @b; call @a (retval 0) (retval 1) (retval 2).
  • Deduplication (nullary macros) becomes way more effective in this model.
    • Common sub-expressions that operate on the result(s) of a function will trivially deduplicate, where before the expressions were operating on arbitrary numbered locals (or on function call expressions directly), which needs parameterized macros to deduplicate. Yuck.
    • A compiler can generate whatever it wants, and a smart 'wasm optimizer' can attempt to convert calls between the two representations (inline vs retval) to see whether one of the two produces better deduplication results (and thus, smaller file size).
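The un-nesting transform described above can be sketched on a toy s-expression AST (hypothetical helper; Python tuples stand in for wasm expressions):

```python
# Toy AST: ("call", name, arg0, ...), ("retval", n), ("const", v)

def hoist_nested_calls(expr):
    """Flatten one level of nesting: (call @a (call @b ...)) becomes
    the statement list [(call @b ...), (call @a (retval 0))].
    A sketch of the transform described in the proposal."""
    if expr[0] != "call":
        return [expr]
    stmts = []
    new_args = []
    for arg in expr[2:]:
        if arg[0] == "call":
            stmts.append(arg)               # hoist the inner call out
            new_args.append(("retval", 0))  # its result is now retval 0
        else:
            new_args.append(arg)
    stmts.append(("call", expr[1]) + tuple(new_args))
    return stmts

nested = ("call", "@print", ("call", "@sin", ("const", 0.75)))
print(hoist_nested_calls(nested))
# -> [('call', '@sin', ('const', 0.75)), ('call', '@print', ('retval', 0))]
```

Note this sketch only round-trips safely for a single nested call; with two nested call arguments the second call would clobber the first's retvals, which is exactly the composition concern raised later in this thread.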

I think this is especially important to think about, because many future languages that will (probably) target wasm have multiple return values or something equivalent. Go uses multiple return values extensively, and C# somewhat-extensively uses something roughly equivalent (out parameters).

As far as returning multiple values from a function at once goes - I think we punt on that entirely until after the MVP, but this model probably makes sense there too (opcodes for storing values into your result registers, or something similar).

If people like this proposal I can try to draft up a PR for it later.

@jfbastien (Member)

C++ does allow you to return multiple values with structs. You can even std::tie and use std::tuple!

I think the main issue you're discussing is: with just one return value, a call generates only a single temporary. With multiple return values, these temporaries become awkward.

(set_local @a (call foo @bar)) ; Make it a local.
(call bar (call foo @bar)) ; Notice the temporary!

Versus:

(set_locals (@a @b) (call baz @bar)) ; Make two locals.

What if I want to use the two temporaries? Tough luck!

I'm not sure I like the proposed solution, but I can't think of any alternative for now...

@titzer commented Jul 23, 2015

For an outside datapoint, Virgil supports arbitrary tuples [https://code.google.com/p/virgil/wiki/TutorialTuples], which are flattened in the middle of the compiler. That is the broadest generalization of value types that I can think of. It requires the compiler to support multiple-output value nodes, which TurboFan does to some extent.

Restricting multiple-value nodes in some sense would be good here, I think. I can imagine a baseline JIT or interpreter being very, very frustrated with generalized tuples.

That said, the implicit dataflow from the last call to the next retval usage in this proposal could cause some problems. If we can achieve a more explicit dataflow here, it will make things more composable.


@kg (Contributor, Author) commented Jul 23, 2015

One mitigation for some of your concerns about the dataflow would be for retval to literally just be syntax sugar that can boil down to temporaries managed by the runtime/decoder instead of the compiler. That lets VMs decide how best to handle it. A naive JS polyfill would just generate locals for every function call, and then map the retvals to the appropriate locals, for example. We still get the size/clarity wins in the IR.

But if it just makes things harder on the runtime author, that's no good. Maybe there's a way to retain most of the size wins without doing that.

@bar commented Jul 23, 2015

@jfbastien what? why me?

@jfbastien (Member)

I think what I want to avoid is creating first-class tuple types, and tuple types that can exist as locals. I feel like the retval idea tries to work around that restriction, but isn't really better than (tie local0 local1 (call multi_return_func args)).

I'm leaning towards liking tie to locals better...

@kg (Contributor, Author) commented Jul 23, 2015

Well, retval has a few advantages that led me to propose it (a smaller AST, more-compressible values, the ability to discard return values), but (tie x y (call ...)) is definitely a minimal-pain solution. We can do that post-MVP when we actually do multiple return values, and just punt entirely on the issue for now.

@titzer commented Jul 23, 2015

Maybe we could have a multi-call bytecode that has the tie built into it?

@qwertie commented Jul 23, 2015

@jfbastien Why do you want to avoid creating first-class tuple types?

@jfbastien (Member)

First-class tuples are the same as structs. The backend needs to start being smarter, and we need to define a bunch more than what we have now.

@lukewagner (Member)

I'm concerned that retval seems to require some amount of control-flow analysis to validate, which will complicate decoding (type-based validation has otherwise eliminated all flow-dependent validation criteria). To avoid the analysis, we could say that retval must immediately follow a call, but then we might as well make it a single opcode. I'm also concerned about composition: if I have a nested call f(g()) I can use the retval of g() as the arg of f(), but if I have f(g(), h()) then g()'s retval needs to be "spilled" to a local before it is clobbered by h(). Now you have to think of retval as an optimization that you try before falling back to set_local, and in general this makes me think of bad times with x86/ARM condition flags.
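The clobbering concern can be made concrete with a toy evaluation trace (hypothetical sketch, assuming one shared retval bank as in the proposal):

```python
# One shared retval bank, as in the proposal: each call overwrites it.
retvals = []

def call_g():
    global retvals
    retvals = [10]      # g's return value lands in the bank

def call_h():
    global retvals
    retvals = [20]      # ...and h's call clobbers the bank

# Compiling f(g(), h()) naively:
call_g()
# g's result must be spilled to a "local" *now*, before h runs:
spilled_g = retvals[0]
call_h()
f_args = (spilled_g, retvals[0])   # (10, 20): correct only due to the spill
print(f_args)
```

Without the spill, both arguments would read the post-h bank, so retval degenerates into an optimization that only applies when exactly one argument is a call.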

I do like the idea of combining the destructuring of a returned tuple with a call opcode as @titzer suggested. Two benefits:

  • aesthetically, it seems symmetric with arguments (where the creation of the args tuple also happens at the call)
  • practically, by forcing the returned tuple to be immediately consumed, we avoid having to implement annoying corner cases like phis of tuples.

One issue is that call isn't the only opcode where we want multiple return values (think add_with_overflow or div_mod), so we'd have to explicitly do this for every multi-return opcode. While this sounds like duplication, I expect we could define it orthogonally to allow reuse in the spec and impl (kinda like how modr/m is (sorta, kinda) defined orthogonal to opcode in x86).

One idea that I didn't see brought up that I've been assuming for multi-return:
Instead of requiring every element of the returned tuple to be assigned to locals, we could allow one element to be the result of the expression. For example, when compiling this clang C++:

return __builtin_uadd_overflow(x, y, &sum) ? UINT32_MAX : sum;

then the AST could be:

return (conditional (i32.add_with_overflow @x @y @sum) UINT32_MAX @sum)

which is pretty nice, I think. To be clear what's happening: as a binary-return op, i32.add_with_overflow takes 1 extra immediate operand (the index of the local to store the sum).
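That convention can be sketched in Python (hypothetical: the op yields the overflow flag as its expression result and writes the sum through an out-param local index, mirroring the AST above):

```python
UINT32_MAX = 0xFFFFFFFF

def i32_add_with_overflow(x, y, locals_, sum_local):
    """Binary-return op sketch: returns the overflow flag as the
    expression result, stores the wrapped sum into locals_[sum_local]."""
    total = x + y
    locals_[sum_local] = total & UINT32_MAX
    return 1 if total > UINT32_MAX else 0

# return __builtin_uadd_overflow(x, y, &sum) ? UINT32_MAX : sum;
def checked_add(x, y):
    locals_ = [0]                                  # local 0 holds @sum
    overflowed = i32_add_with_overflow(x, y, locals_, 0)
    return UINT32_MAX if overflowed else locals_[0]

print(checked_add(1, 2))              # -> 3
print(checked_add(UINT32_MAX, 1))     # -> 4294967295 (saturated)
```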

Addressing some specific points of the OP:

Without this design change, function calls end up producing a temporary local for every return value. This bloats the number of locals in large functions tremendously, unless compilers pool temporary locals, in which case. Now static analysis is more complicated because you have to SSA or GVN or some other horrifying three-letter-acronym to figure out the lifetime of return values, when stack/register space is needed, etc.

It is definitely the plan for LLVM to attempt to maximally reuse local slots. This is good for a number of reasons:

  • forgetting this multi-return issue, if allocations aren't pooled, the SSA form of LLVM would generate huge numbers of locals (one per def/phi)
  • a natural impl on a dumb compiler is going to give one stack slot per local, which means huge stack frames
  • for a dumb compiler on a register-rich arch, locals can be mapped to registers allowing the dumb compiler to reuse the regalloc done by the beefy AOT compiler
  • for a beefy compiler, various algorithms can have cost proportional to the number of locals (such as the structured, single-pass SSA generation in IonMonkey) and reducing the number of locals is already an optimization Emscripten does to reduce this compilation cost
  • no fancy analyses (SSA, GVN) are needed to deal with reuse of locals; b/c of (1) typed locals, (2) structured control flow, the simplest single-pass algorithm can easily track the current def associated with a local and generate SSA as it goes.

Thus, I think we were already assuming that locals would be aggressively reused, and this was actually seen as a strength of our pseudo-register approach compared to mandating SSA form.

Transforming between nested and non-nested calls is simpler.

I think most of this analysis would be done in LLVM IR, before locals/wasm enter the picture. In general, any wasm compiler will invariably need the ability to decompose an expression using temporaries. A natural impl would be to maintain a logical stack of temporaries to maximize local reuse.
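That "logical stack of temporaries" could look something like this minimal sketch (hypothetical allocator; one free list per type, since wasm locals are typed):

```python
from collections import defaultdict

class TempPool:
    """Reuses local slots: releasing a temp makes its index available
    for the next request of the same type (a sketch)."""
    def __init__(self):
        self.next_index = 0
        self.free = defaultdict(list)   # type -> free local indices

    def acquire(self, ty):
        if self.free[ty]:
            return self.free[ty].pop()
        idx = self.next_index
        self.next_index += 1
        return idx

    def release(self, ty, idx):
        self.free[ty].append(idx)

pool = TempPool()
a = pool.acquire("i32")   # 0: fresh slot
pool.release("i32", a)
b = pool.acquire("i32")   # 0 again: the slot is reused
c = pool.acquire("f64")   # 1: fresh slot for a different type
print(a, b, c)            # -> 0 0 1
```

Releasing and re-acquiring in stack order keeps the total local count close to the maximum number of temporaries live at once, rather than one local per definition.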

The lifetime of return values is trivial to reason about.

I still think it involves some small amount of analysis which means that a dumb compiler won't benefit from it (assuming the dumb compiler is doing single pass generate-code-as-you-go). OTOH, a beefy compiler won't have any problems with locals.

Forwarding multiple return values into a call immediately afterwards becomes easier.

It's important to qualify "easier". I think there is limited value to trying to make wasm more ergonomic to write by hand (that's definitely not a tier-1 priority compared to simplicity, speed of decoding and size). OTOH, "easier" could mean "requires less bytes", but I think there are multiple ways to achieve that goal in terms of tuples (think splat) if we measure this as a significant opportunity.

Deduplication (nullary macros) becomes way more effective in this model.

Maybe, but needs measurement and again I expect we could introduce operations on tuples to a similar effect if we saw there were significant savings from multi-return.

@kg (Contributor, Author) commented Jul 25, 2015

Good points on all of this, Luke. Re: deduplication the advantage isn't from tuple operations, though, it's that it normalizes more expressions so that they all contain retval 0 and retval 1 instead of arbitrary locals. Once you get rid of that, every expression is using some arbitrary local that the call result(s) got spilled into, and the only way to deduplicate those expressions is with parameterized macros.

It's not clear to me how my proposal requires control flow analysis. The requirement that retval not persist across branches was designed to eliminate that. What did I overlook?

The point about nesting two function calls is a good one. There are a number of scenarios where you'd still need temporaries, and maybe that alone makes it problematic enough to not be worthwhile.

@lukewagner (Member)

On the subject of deduplication: a compiler could take deduplication into account when assigning local variable indices. In the limit, it could effectively simulate retval by reserving certain locals for return values. Certainly more experimentation to do in this space.

You're right that the "control flow analysis" is super-simple given that it's intra-basic-block. I guess my spider senses were going off because it is a categorically different kind of validation criteria than the rest of our type-based validation which is only sensitive to the signature of ops and the types of child expressions.

@kg (Contributor, Author) commented Jul 28, 2015

How would y'all feel about a variant of retval that is strongly typed? So instead of (retval 0), (retval 1), hypothetically:

(retval:int32 0), (retval:float32 0). So the type of a retval expression is always known and there's basically a block of virtual registers for function return values segmented by type. 'What happens with multiple calls used as arguments' is still a tricky problem but I think in that case you just fall back to storing into a local and it's no worse than things were before.

After some thought I think in practice retval is an alternate representation of temporary locals and function return values. The alternate representation is more regular so it compresses better, and it has additional constraints that can potentially make it easier for an engine to deal with (single-use, doesn't cross over branches, etc.) A naive implementation can just lazily convert function calls and retval expressions to always use locals under the hood (especially if the opcode is typed).
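A naive lowering of that kind can be sketched as follows (hypothetical: each typed retval register is lazily mapped onto a dedicated ordinary local, so the engine only ever sees set_local/get_local):

```python
class TypedRetvalLowering:
    """Sketch of a decoder pass that rewrites typed retval references
    into plain locals, one reserved slot per (type, index) pair."""
    def __init__(self, num_locals):
        self.num_locals = num_locals
        self.mapping = {}     # (type, retval_index) -> local index

    def local_for(self, ty, n):
        key = (ty, n)
        if key not in self.mapping:
            self.mapping[key] = self.num_locals   # reserve a fresh local
            self.num_locals += 1
        return self.mapping[key]

    def lower_call_result(self, ty, n):
        # call ... ; (retval:ty n)  becomes a store/load on one slot
        idx = self.local_for(ty, n)
        return ("set_local", idx), ("get_local", idx)

lower = TypedRetvalLowering(num_locals=2)   # function already has 2 locals
print(lower.lower_call_result("i32", 0))    # -> (('set_local', 2), ('get_local', 2))
print(lower.lower_call_result("f32", 0))    # -> (('set_local', 3), ('get_local', 3))
print(lower.lower_call_result("i32", 0))    # same (i32, 0) slot reused
```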

@jfbastien (Member)

IIUC, typed retval is indeed the same as storing to reserved locals, but it gains some regularity by being automatic. Sounds potentially worthwhile, though I wonder if it trades ease of optimization for that regularity? Probably not.

@sunfishcode modified the milestone: MVP (Jul 29, 2015)
@lukewagner (Member)

If we set aside the size/compression argument, then I think it's simpler and more regular not to introduce a new special kind of local with special semantics.

If the only argument is size/compression, it seems like we need some motivating data and arguments why layer 1 (viz. macros) can't achieve the same effect. In particular, I think we could simulate retval at layer 1 by (1) reserving a set of locals to serve as retvals, (2) defining macros for any hot multi-return ops that name those locals. Honestly, I'm not expecting multi-returns to be a significant % of total bytes in any large app.

@titzer commented Jul 29, 2015

It's clear that multiple return values need some addition to the AST
semantics we have now, and the solutions discussed lie roughly on a
continuum of generality. I'd sketch the space a bit like this:

(GeneralTuples) On one end is general tuples that are flattened by the
implementation to be scalars. That is the approach I took in Virgil,
generalizing all functions to be single parameter, single return and
mapping multi-argument functions onto functions that take and return
tuples. That gives a really nice flexible source semantics, but the
implementation burden is to flatten tuples and deal with them in the
compiler IR. There's lots of fun with tuples inside of tuples, arrays of
tuples, tuples of arrays, etc. The result is that the generated code is
really efficient, as native calling conventions are used so that arguments
and return values go in registers and on the stack, regardless of how
complex the tuples at the source level are.

(ScalarsOnly) On the extreme other end is no support at all for tuples,
only scalars like we have now. A function has a fixed number of scalar
parameters and a fixed number of scalar return values. Local variables and
expressions can only have scalar types.

(OnlyLocalTuples) In the middle of the continuum, we could allow tuples
within a function body so that local variables and expressions can have
tuple types. A call to a function that returns multiple values has a tuple
type. That tuple could be stored in a local variable. Operations to access
elements of tuples are required. If only flat (non-nested) tuples are
allowed, then tuple access always produces a scalar.

(RetVal) is another solution in the middle. I see it as basically a
short-distance tuple that expires after some intervening operation. Local
variables still can only have scalar types, but the special "retval"
expressions have tuple types. There are rules about when a "retval" can be
used (e.g. some places are illegal), and the return values that a "retval"
accesses are determined not by nesting or numbering, but implicitly through
control flow.

I think we want to avoid (GeneralTuples). As I discussed above, it's a nice
source-level semantics, but is kind of hell for implementations. I think we
want to keep WebAssembly simple and machine-like, leaving overly-general
source niceties up to a higher-level compiler. For point of reference, if I
were to target Virgil to WebAssembly, I'd still flatten tuples in my
compiler before generating wasm, so I don't think leaving generalized
tuples out of wasm would preclude any nice source languages.

So without (GeneralTuples), we're left with (ScalarsOnly),
(OnlyLocalTuples), and (RetVal).

I'd argue that (OnlyLocalTuples) has most of the same problems as
(GeneralTuples), even if we restrict the local tuples to be flat.
Implementations still have to track the types of local variables and
expressions when they are tuples, check tuple accesses, and map a local
tuple to multiple machine registers or stack slots. It probably makes a
baseline JIT even more tricky.

That leaves (ScalarsOnly) and (RetVal), which is basically where this
discussion is now.

Luke and I seem to be leaning towards (ScalarsOnly) with the assumption
that we need a special "call_multiple" expression that assigns to multiple
locals. This design at least preserves the nice properties of only having
scalar types for local variables and expressions with the potential cost of
forcing wasm producers to create lots of temporaries and encode big-fat
call_multiple bytecodes.

StrawMan1:

call_multiple %0 %1...%K [func] [arg0] [arg1]...[argN]

The semantics would be to call the function [func] either directly or
indirectly, with the results of evaluating arg0...argN, and then assign the
results to temporaries %0 to %K. The overall result of the call_multiple
expression would be the value %0.

StrawMan2:

call_multiple %1...%K [func] [arg0] [arg1]...[argN]

The semantics would be to call the function [func] either directly or
indirectly, with the results of evaluating arg0...argN, and then assign the
results to temporaries %1 to %K, with the difference being that the first
return value does not require a temporary. The overall result of the
call_multiple expression would be the first return value of the call (i.e.
what would have been %0 in StrawMan1). StrawMan2 is potentially more
compact than StrawMan1, since the first return value is implicitly
returned, and we thus need 1 less temporary. For a likely case of a 2-tuple
return value, this only requires a single extra local.
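The encoding difference between the two strawmen can be sketched directly (hypothetical tuple encoding; the only point is how many temporaries each must name):

```python
def encode_strawman1(num_results, func, args):
    # call_multiple %0 %1...%K [func] [arg0]...[argN]
    temps = list(range(num_results))        # every result gets a temp
    return ("call_multiple", temps, func, args)

def encode_strawman2(num_results, func, args):
    # call_multiple %1...%K: the first result is the expression's own value
    temps = list(range(1, num_results))     # one fewer temp to name
    return ("call_multiple", temps, func, args)

s1 = encode_strawman1(2, "@divmod", ["@a", "@b"])
s2 = encode_strawman2(2, "@divmod", ["@a", "@b"])
print(len(s1[1]), len(s2[1]))   # -> 2 1
```

For the likely 2-tuple case, StrawMan2 names one temporary where StrawMan1 names two, matching the compactness argument above.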


@sunfishcode (Member)

I like StrawMan2. While I see the appeal of StrawMan1 in allowing all return values to be handled consistently as named variables, this can be achieved with StrawMan2 by adding a set_local:

set_local %0, (call_multiple %1...%K [func] [arg0] [arg1]...[argN])

which is slightly less pretty but achieves the same effect. And, it avoids the awkwardness in StrawMan1 of producing a value into two output locations.

@lukewagner (Member)

Agreed

@rossberg (Member)

FWIW, the ML prototype currently implements a simpler variation of (OnlyLocalTuples) that I would call (SecondClassTuples), and which doesn't have most of the problems @titzer mentions.

In that approach, values or variables cannot have tuple type, only expressions can. In general, any expression produces a sequence of values, most of them unary or empty. Calls can produce an n-ary expression type. Such types can be consumed by a Destruct operator (an n-ary SetLocal), by Return, or, if you want, by other Calls (as their argument).

It is not possible to get a first-class handle on the tuple itself. Nor can tuples be nested, because their components are values.

That's more modular in several ways: (1) it allows multiple ways of consuming tuples; (2) it also allows introducing additional expressions that produce n-ary results; (3) it allows instructions that are generic in arity. In particular, it doesn't require introducing a _multiple variant of each Call instruction (with the strawmen above, you'd seem to need to add at least both call_multiple and call_indirect_multiple?).
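That typing discipline can be sketched as a tiny arity checker (hypothetical: every expression has an arity; only Destruct and similar consumers may accept an arity greater than one, and tuples cannot flow through ordinary ops):

```python
class ArityError(Exception):
    pass

# Minimal expression forms for the sketch:
#   ("const", v)            unary value
#   ("call", k)             call yielding k values
#   ("destruct", k, expr)   consumes a k-ary expr into k locals
#   ("add", a, b)           ordinary op over unary operands
def arity(expr):
    op = expr[0]
    if op == "const":
        return 1
    if op == "call":
        return expr[1]                    # declared result count
    if op == "destruct":
        if arity(expr[2]) != expr[1]:
            raise ArityError("destruct arity mismatch")
        return 0                          # results all went into locals
    if op == "add":
        for operand in expr[1:]:
            if arity(operand) != 1:       # tuples can't feed ordinary ops
                raise ArityError("n-ary value used where scalar expected")
        return 1
    raise ArityError(f"unknown op {op}")

print(arity(("destruct", 2, ("call", 2))))          # -> 0: well-typed
print(arity(("add", ("const", 1), ("const", 2))))   # -> 1: well-typed
```

Under this sketch no value or local ever has tuple type; only the transient result of a call does, and it must be consumed immediately by an arity-aware consumer.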

@titzer commented Aug 14, 2015

If expressions can produce tuples, it means that decoders have to track the number of outputs from expressions. Concretely, the decoder stack (and potential SSA renaming environment) tracked during descent is no longer unary. That's a pain point.

I don't think we gain (or need) much expressive power by allowing tuples to feed directly into other calls. StrawMan1/2 basically build Destruct directly into CallMultiple.

@rossberg (Member)

A couple of things you could gain in the future are e.g. primitive operators with multiple results (e.g. divmod), or potential forms of control transfer with multiple values (e.g. exceptions or a generalised Break). None of these are relevant for the MVP, but it would still be nice to have a factorisation that is somewhat future-proof (and avoids duplication of opcodes).

As for decoding complications: true. Not sure that is too terrible, though. Moreover, it could be avoided (for the most part) by disallowing multiple values from flowing through other expressions (such as conditionals), in which case only the top of the stack could be n-ary. That leaves the option to generalise later.

@lukewagner (Member)

Definitely interested in multiple-return primitive ops, but it seems like these ops could just do the same thing as call_multiple, which is to take the out-param's local index as an operand (this was my i32.add_with_overflow example above).

The more interesting case as you point out is if we had generalized tuple return from control flow primitives. @sunfishcode and I were talking about this a long time ago in the context of reducing the high % of bytes spent on get_local/set_local by trying to give more opportunity to build bigger trees. It'd be nice to do some experiments once llvm-wasm is emitting full programs to see how much this would help in practice.
