[RFC] Clarify the Summer RFC and create an executive overview

[ci skip]
cockroachdb · Apr 28, 2016 · 8879af7
1 parent 1c8c73e
commit 8879af7
Showing 1 changed file with 92 additions and 9 deletions.
diff --git a/docs/RFCS/typing.md b/docs/RFCS/typing.md
@@ -346,18 +346,92 @@ Where Summer will always pick a type and be able to explain it.
 
 ## Proposed typing strategy
 
+### High-level overview
+
+To explain Summer to a newcomer it would be mostly correct to say
+"Summer first determines the types of the operands of a complex
+expression, then based on the operand types decides the type of the
+complex expression", i.e. the intuitive description of bottom-up type
+inference.
+
+The reason why Summer is more complex than this in reality (and the principle
+underlying its design) is threefold:
+
+- Expressions containing placeholders often contain insufficient
+  information to determine a proper type in a bottom-up fashion. For
+  example in the expression `floor($1 * $2)` we cannot type the
+  placeholders unless we take into account the accepted argument types
+  of `floor`.
+
+- SQL number literals are usually valid values in multiple types
+  (`int`, `float`, `decimal`). Users expect a minimum amount of
+  automatic type coercion, so that expressions like `1.5 + 123` are
+  not rejected. There is also a conflict of interest between
+  flexibility for the SQL user (which suggests picking the largest
+  type) and performance (which suggests picking the smallest type).
+  Summer does extra work to strike a balance between the two. For example
+  `greatest(1, 1.2)` will pick `float` whereas `greatest(1,
+  1.2e10000)` will pick `decimal`.
+
+- SQL has overloaded functions. If there are multiple candidates and
+  the operand types do not match the candidates' expected types
+  "exactly", Summer does extra work to find an acceptable candidate.
+
+So another way to explain Summer that is somewhat less incorrect
+than the naive explanation above would be:
+
+1. the types of constant literals (numbers, strings, null) and
+   placeholders are mostly determined by their parent expression
+   depending on other rules (especially the expected type at that
+   position), not by themselves. For example Summer does not "know"
+   (determine) the constant `123` to be an `int` until it looks at
+   its parent in the syntax tree. For complex expressions involving
+   number constants, this requires Summer to first perform constant
+   folding so that the immediate parent of a constant, often an
+   overloaded operator, has enough information from its other
+   operand(s) to decide a type for the constant. This constant folding
+   is performed using exact arithmetic.
+
+2. for functions that require homogeneous types (e.g. `GREATEST`, `CASE
+   .. THEN` etc.), the type expected by the context, if any, is used to
+   restrict the operand types (rule 6.2); otherwise the first operand
+   with a "possibly useful" type is used to restrict the type of the
+   other operands (rules 6.3 and 6.4).
+
+3. during overload resolution, the candidate list is first restricted
+   to the candidates that are *compatible* with the arguments (rules
+   7.1 to 7.3), then filtered down by compatibility between the
+   candidate return types and the context (7.4), then by minimizing
+   the number of type conversions for literals (7.5), then by
+   preferring homogeneous argument lists (7.6).
+
 ### Language extension
 
-We introduce a new expression node "type annotation".
+In order to clarify the typing rules below and to exercise
+the proposed system, we found it useful to "force" certain
+expressions to be of a certain type.
+
+Unfortunately the SQL cast expression (`CAST(... AS ...)` or
+`...::...`) is not appropriate for this, because although it
+guarantees a type to the surrounding expression it does not constrain
+its argument. For example `sign(1.2)::int` does not disambiguate which
+overload of `sign` to use.
+
+Therefore we propose the following SQL extension, which is not
+required to implement the typing system but offers opportunities to
+better exercise it in tests.  The examples below also use
+this extension for explanatory purposes.
+
+The extension is a new expression node "type annotation".
 
-We also introduce the new SQL syntax for this: "E : T".
+We also propose the following SQL syntax for this: "E : T".
 
 For example: `1:int` of `1 : int`.
 
-The meaning of this at a first order approximation is "interpret the expression on the left giving
-a preference to the type on the right".
+The meaning of this at a first order approximation is "interpret the
+expression on the left using the type on the right".
 
-This is different from casts, see below.
+This is different from casts, as explained below.
 
 The need for this type of extension is also implicitly
 present/expressed in the alternate proposals Rick and Morty.
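
The number-literal behavior described in the high-level overview of this hunk (`greatest(1, 1.2)` resolving to `float`, `greatest(1, 1.2e10000)` resolving to `decimal`) can be illustrated with a minimal Python sketch. This is not the RFC's implementation: the function names, the int64 range check, the float range check and the "smallest type first" preference order are all assumptions made for illustration.

```python
from decimal import Decimal
import sys

# Assumed preference order: smallest (cheapest) type first.
PREFERENCE = ["int", "float", "decimal"]
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1
# Largest finite float, converted exactly to a Decimal.
FLOAT_MAX = Decimal(sys.float_info.max)


def resolvable_types(literal):
    """Return the set of types a number literal could take (hypothetical helper)."""
    value = Decimal(literal)  # exact: the constructor does not round
    types = {"decimal"}       # every number literal fits in decimal
    if value.copy_abs() <= FLOAT_MAX:
        types.add("float")
    if value == value.to_integral_value() and INT64_MIN <= value <= INT64_MAX:
        types.add("int")
    return types


def common_literal_type(literals):
    """Pick the smallest type accepted by every literal (hypothetical helper)."""
    candidates = set(PREFERENCE)
    for lit in literals:
        candidates &= resolvable_types(lit)
    # "decimal" is always present, so a match always exists.
    return next(t for t in PREFERENCE if t in candidates)


print(common_literal_type(["1", "1.2"]))        # operands of greatest(1, 1.2)
print(common_literal_type(["1", "1.2e10000"]))  # operands of greatest(1, 1.2e10000)
```

The sketch only covers the case where all operands are number literals; the rules in the RFC additionally take the desired type from the context and non-literal operands into account.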
@@ -494,7 +568,7 @@ tree where each node is unable to be properly introspect about its own return
 type into a typed tree which can provide its inferred result type, and as such 
 can be evaluated later. 
 
-#### Implementation Note
+#### Implementation Notes
 
 _In an effort to make this distinction clearer in code, a `TypedExpr` interface 
 will be created, which is a superset of the `Expr` interface, but also has the 
@@ -574,18 +648,27 @@ subsequent step, we check the remaining overload set:
    Then the overload candidates are filtered based on the resulting types. If any argument of the call 
    receives type null, then it is not used for filtering.
 
+   For example: `select mod(extract(seconds from now()), $1*20)`. There
+   are 3 candidates for `mod`, for `int`, `float` and `decimal`. The
+   first argument `extract` is typed without a desired type and
+   resolves to `int`. This selects the candidate `mod(int, int)`. From then on only one candidate
+   remains, so `$1*20` gets typed using desired type `int` and `$1` gets typed as `int`.
+
 3. (7.3) candidates are filtered based on the resolvable type set types of constant number literals. 
    Remember at this point all constant literals already have a resolvable type set since constant folding.
 
    The filtering is done left to right, eliminating at each argument all candidates that do not accept
    one of the types in the resolvable set at that position.
-  
+
+   Example: `select sign(1.2)`. `sign` has 3 candidates for `int`, `float` and `decimal`. Step 7.3 eliminates
+   the candidate for `int`.
+
    After this point,
    the number of candidates left will be checked now and after each following step.
 
 4. (7.4) candidates are filtered based on the desired return type, if one is provided
 
-   Example: `insert into (str_col) values (left($1, 1))
+   Example: `insert into (str_col) values (left($1, 1))`
    With only rules 7.2 and 7.3 above we still have 2 candidates: `left(string, int)` and `left(bytes, int)`.
    With rule 7.4 `left(string, int)` is selected.
 
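
The filtering in steps 7.3 and 7.4 above can be sketched in Python using the two examples from the text, `sign(1.2)` and `left($1, 1)`. This is an illustrative sketch only: the `Overload` record, both function names, the use of `None` for "no constraint at this position" and the return types assumed for `sign` are hypothetical and not taken from the RFC.

```python
from collections import namedtuple

# Hypothetical record describing one overload candidate.
Overload = namedtuple("Overload", ["params", "returns"])

# Assumption: each sign overload returns its own argument type.
SIGN = [
    Overload(("int",), "int"),
    Overload(("float",), "float"),
    Overload(("decimal",), "decimal"),
]
LEFT = [
    Overload(("string", "int"), "string"),
    Overload(("bytes", "int"), "bytes"),
]


def filter_by_literal_types(candidates, arg_type_sets):
    """Step 7.3: going left to right, drop candidates that do not accept
    one of the resolvable types at that position. None means the argument
    is not a number literal and imposes no constraint here."""
    for pos, type_set in enumerate(arg_type_sets):
        if type_set is None:
            continue
        candidates = [c for c in candidates if c.params[pos] in type_set]
    return candidates


def filter_by_desired_type(candidates, desired):
    """Step 7.4: keep candidates whose return type matches the desired
    type, if one is provided."""
    if desired is None:
        return candidates
    return [c for c in candidates if c.returns == desired]


# sign(1.2): the literal 1.2 can be float or decimal, so int is eliminated.
after_73 = filter_by_literal_types(SIGN, [{"float", "decimal"}])
print([c.params for c in after_73])

# left($1, 1) in a string column: 7.3 keeps both, 7.4 picks left(string, int).
remaining = filter_by_literal_types(LEFT, [None, {"int", "float", "decimal"}])
print(filter_by_desired_type(remaining, "string"))
```

Keeping each step as a separate function mirrors the text, where the number of remaining candidates is checked after every step.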
@@ -615,7 +698,7 @@ subsequent step, we check the remaining overload set:
    that accepts this type in the yet untyped positions,
    choose that candidate.
 
-   Example: `select (1 + $1)`
+   Example: `select div(1, $1)` still has candidates for `int`, `float` and `decimal`.
 
 Another approach would be to go through each overload and attempt to type check each 
 argument expression with the parameter's type. If any of these expressions type checked to a