Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix parsing of 'with' inside streaming operator #693 #697

Closed
wants to merge 4 commits into from
Closed

fix parsing of 'with' inside streaming operator #693 #697

wants to merge 4 commits into from

Conversation

hzeller
Copy link
Collaborator

@hzeller hzeller commented Mar 11, 2021

The stream expression has an expression followed by
an optional 'with' '[' array_range_expression ']'
(LRM 11.4.14)

There are a few other cases where an expression can
be trailed with 'with',

  • with_constraint_block: there we expect an
    identifier list in parenthesis after 'with'
  • similar with array methods with predicate
    with '(' ...)

Since the parser with one look-ahead can't see beyond
the 'with', it runs into a shift/reduce conflict
as it does not know if '(' or '[' is to follow.

Disambiguate that in the lexer by creating a
separate token TK_with__followed_by_bracket which is
a TK_with, where the lexer already looked ahead and
knows that it will be followed by '['.

After seeing the '[', everything but the "with" is put
back into the stream to be processed separately.

Changed lexer and parser as well as associated unit tests including formatter.

Fixing #693

Signed-off-by: Henner Zeller h.zeller@acm.org

The stream expression has an expression followed by
an optional 'with' '[' array_range_expression ']'
(LRM 11.4.14)

There are a few other cases where an expression can
be trailed with 'with',
 * with_constraint_block: there we expect an
   identifier list in parenthesis after 'with'
 * similar with array methods with predicate
   with '(' ...)

Since the parser with one look-ahead can't see beyond
the 'with', it runs into a shift/reduce conflict
as it does not know if '(' or '[' is to follow.

Disambiguate that in the lexer by creating a
separate token TK_with__followed_by_bracket which is
a TK_with, where the lexer already looked ahead and
knows that it will be followed by '['.

After seeing the '[', everything but the "with" is put
back into the stream to be processed separately.

TODO:
  [ ] Extend what can be between 'with' and '[' to
      include arbitrary whitespace (including: comments)
  [ ] Make formatter work; currently
      {<< {unpacked_arr with [i*4 +: 4]}};
      is formatted
      {<<{unpacked_arrwith[i*4+:4]}};

Signed-off-by: Henner Zeller <h.zeller@acm.org>
@googlebot googlebot added the cla: yes All contributors in pull request have signed the CLA with Google. label Mar 11, 2021
@hzeller hzeller requested a review from fangism March 11, 2021 21:39
@hzeller
Copy link
Collaborator Author

hzeller commented Mar 11, 2021

Implemented the suggestion to yyless back to only consume with for the rest to be consumed later.

Next step is to see if it is easy to implement in flex to have everything between with and [ just be some arbitrary whitespace including comments (Currently simplistically implemented as [ \t]*).

Also the formatter needs to be able to deal with this token sequence it seems ... right now it removes all whitespace.

Signed-off-by: Henner Zeller <h.zeller@acm.org>
@hzeller
Copy link
Collaborator Author

hzeller commented Mar 12, 2021

Upcoming: Fix formatting.

Right now, for

module test;
  initial begin
    logic unpacked_arr [16] = '{0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1};
    logic [3:0] nibble_arr[4];
    for (int i = 0; i < 4; i++) begin
      nibble_arr[i] = {<< {unpacked_arr with [i*4 +: 4]}};
    end
  end
endmodule

... the token partition tree looks like:

Full token partition tree:
{ ([<auto>], policy: always-expand) @{}
  { ([<auto>], policy: always-expand) @{0}, (origin: "module test...d
endmodule")
    { ([module test ;], policy: fit-else-expand, (origin: "module test;")) }
    { (>>[<auto>], policy: fit-else-expand) @{0,1}, (origin: "initial beg...  end
  end")
      { (>>[initial begin], policy: fit-else-expand) }
      { (>>>>[<auto>], policy: tabular-alignment) @{0,1,1}, (origin: "logic unpac...}};
    end")
        { (>>>>[logic unpacked_arr [ 16 ] = '{ 0 , 1 , 0 , 1 , 0 , 1 , 0 , 1 , 0 , 1 , 0 , 1 , 0 , 1 , 0 , 1 } ;], policy: fit-else-expand, (origin: "logic unpac..., 1, 0, 1};")) }
        { (>>>>[logic [ 3 : 0 ] nibble_arr [ 4 ] ;], policy: fit-else-expand, (origin: "logic [3:0]...ble_arr[4];")) }
        { (>>>>[<auto>], policy: fit-else-expand) @{0,1,1,2}, (origin: "for (int i ...}};
    end")
          { (>>>>[<auto>], policy: fit-else-expand) @{0,1,1,2,0}, (origin: "for (int i ...i < 4; i++)")
            { (>>>>[for (], policy: fit-else-expand) }
            { (>>>>>>>>[<auto>], policy: fit-else-expand) @{0,1,1,2,0,1}, (origin: "int i = 0; i < 4; i++")
              { (>>>>>>>>[int i = 0 ;], policy: fit-else-expand, (origin: "int i = 0")) }
              { (>>>>>>>>[i < 4 ;], policy: fit-else-expand, (origin: "i < 4")) }
              { (>>>>>>>>[i ++], policy: fit-else-expand, (origin: "i++")) }
            }
            { (>>>>[) begin], policy: uninitialized) }
          }
          { (>>>>>>[nibble_arr [ i ] = { << { unpacked_arr with [ i * 4 +: 4 ] } } ;], policy: fit-else-expand, (origin: "nibble_arr[...*4 +: 4]}};")) }
          { (>>>>[end], policy: always-expand, (origin: "end")) }
        }
      }
      { (>>[end], policy: always-expand, (origin: "end")) }
    }
    { ([endmodule], policy: always-expand, (origin: "endmodule")) }
  }
}

With that, the formatting is not adding any spaces in the relevant line; resulting in the following faulty formatting:

module test;
  initial begin
    logic unpacked_arr[16] = '{0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1};
    logic [3:0] nibble_arr[4];
    for (int i = 0; i < 4; i++) begin
      nibble_arr[i] = {<<{unpacked_arrwith[i*4+:4]}};
    end
  end
endmodule

"with"[ \t]*"[" {
/* TODO: formulate the ws sequence as any whitespace+comment sequence*/
yyless(4);
"with"({TraditionalCommentOrSpace}|{EndOfLineComment})*"[" {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be easier with an <AFTER_WITH> state, where with returns a TK_width after pushing a new state.
It will be easier to re-use rules inside states than glue them together like this in monolithic regular expressions.
Inside this new state, you handle whitespace, comments, and [, and for everything else, you can pop the state and rewind, to defer to default <INITIAL> state handling (or whatever was previously on the state stack).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will also want lexer unit-tests for this. :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, will need to familiarize myself how states work. They did look somewhat complicated.

The beauty if this is, that I don't have to handle anything special, including '['; I just look ahead, find that this is the next token and TK_with needs to be a new token.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(lexer unit test added)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the AFTER_DOT example, which is somewhat involved. I have to play with it to see if it is more advantageous.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking over this again, I still stand by this first comment. Generally, the options are:

  1. track state within the lexer to alter the enumeration of the latter token
  2. track state inside the lexical context class to alter the enumeration of the earlier token (effectively achieving lookahead) -- this approach is pretty localized, you're defining a tiny state-machine that can run along side other independent state machines.

"{>>8 {foo, bar with \t [ a +: b]}} = nums;\n"
"endfunction",
"function void unpack_id_with(int nums);\n"
"{>>8 {foo, bar with /* some comment */ [ a +: b]}} = nums;\n"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More test case ideas:

  • multiple comments in between with and [.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@fangism
Copy link
Collaborator

fangism commented Mar 12, 2021

Upcoming: Fix formatting.

Right now, for

module test;
  initial begin
    logic unpacked_arr [16] = '{0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1};
    logic [3:0] nibble_arr[4];
    for (int i = 0; i < 4; i++) begin
      nibble_arr[i] = {<< {unpacked_arr with [i*4 +: 4]}};
    end
  end
endmodule

... the token partition tree looks like:

Full token partition tree:
{ ([<auto>], policy: always-expand) @{}
  { ([<auto>], policy: always-expand) @{0}, (origin: "module test...d
endmodule")
    { ([module test ;], policy: fit-else-expand, (origin: "module test;")) }
    { (>>[<auto>], policy: fit-else-expand) @{0,1}, (origin: "initial beg...  end
  end")
      { (>>[initial begin], policy: fit-else-expand) }
      { (>>>>[<auto>], policy: tabular-alignment) @{0,1,1}, (origin: "logic unpac...}};
    end")
        { (>>>>[logic unpacked_arr [ 16 ] = '{ 0 , 1 , 0 , 1 , 0 , 1 , 0 , 1 , 0 , 1 , 0 , 1 , 0 , 1 , 0 , 1 } ;], policy: fit-else-expand, (origin: "logic unpac..., 1, 0, 1};")) }
        { (>>>>[logic [ 3 : 0 ] nibble_arr [ 4 ] ;], policy: fit-else-expand, (origin: "logic [3:0]...ble_arr[4];")) }
        { (>>>>[<auto>], policy: fit-else-expand) @{0,1,1,2}, (origin: "for (int i ...}};
    end")
          { (>>>>[<auto>], policy: fit-else-expand) @{0,1,1,2,0}, (origin: "for (int i ...i < 4; i++)")
            { (>>>>[for (], policy: fit-else-expand) }
            { (>>>>>>>>[<auto>], policy: fit-else-expand) @{0,1,1,2,0,1}, (origin: "int i = 0; i < 4; i++")
              { (>>>>>>>>[int i = 0 ;], policy: fit-else-expand, (origin: "int i = 0")) }
              { (>>>>>>>>[i < 4 ;], policy: fit-else-expand, (origin: "i < 4")) }
              { (>>>>>>>>[i ++], policy: fit-else-expand, (origin: "i++")) }
            }
            { (>>>>[) begin], policy: uninitialized) }
          }
          { (>>>>>>[nibble_arr [ i ] = { << { unpacked_arr with [ i * 4 +: 4 ] } } ;], policy: fit-else-expand, (origin: "nibble_arr[...*4 +: 4]}};")) }
          { (>>>>[end], policy: always-expand, (origin: "end")) }
        }
      }
      { (>>[end], policy: always-expand, (origin: "end")) }
    }
    { ([endmodule], policy: always-expand, (origin: "endmodule")) }
  }
}

With that, the formatting is not adding any spaces in the relevant line; resulting in the following faulty formatting:

module test;
  initial begin
    logic unpacked_arr[16] = '{0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1};
    logic [3:0] nibble_arr[4];
    for (int i = 0; i < 4; i++) begin
      nibble_arr[i] = {<<{unpacked_arrwith[i*4+:4]}};
    end
  end
endmodule

When the tokens mistakenly fused, did the formatter catch it as data corruption and bail out? (If not, that itself is a bug.)

This is controlled by inter-token spacing (token_annotator). There is a general rule that keywords and identifiers required a minimum 1 space separation. You will likely need a classification entry to make the special [ look and feel like a regular [.

There was a special case for streaming operators that
reduced spaces around the various components. However, for
'with', we need to special-special case that to actually
keep a space around.

Signed-off-by: Henner Zeller <h.zeller@acm.org>
Signed-off-by: Henner Zeller <h.zeller@acm.org>
@hzeller
Copy link
Collaborator Author

hzeller commented Mar 15, 2021

Yes, the formatter was bailing out correctly, finding that it created a syntax error after reparsing.

The issue in this particular case in the annotator was, that there was a special case in the annotator that removed spaces around streaming operators components. But we do have to keep them around the 'with' keyword. So that needed to be special-special cased.

Anyway, I think this is ready to review. It still uses the lexer without extra state machine though, will have to investigate it that would feel better.
The bug however would be fixed with the current state already.

@hzeller hzeller changed the title Work in progress: fix #693 fix parsing of with inside streaming operator #693 Mar 15, 2021
@hzeller hzeller linked an issue Mar 15, 2021 that may be closed by this pull request
@hzeller hzeller marked this pull request as ready for review March 15, 2021 23:33
@hzeller hzeller requested a review from fangism March 15, 2021 23:33
@hzeller hzeller changed the title fix parsing of with inside streaming operator #693 fix parsing of 'with' inside streaming operator #693 Mar 16, 2021
@hzeller
Copy link
Collaborator Author

hzeller commented Apr 6, 2021

I have created #749 to follow-up on the implementation improvement suggestion by @fangism . I'd like to submit this rather now to be able to deal with the immediate parse situations, but then look for the nicer implementation later when dust in other corners of the project settled.

@hzeller hzeller requested a review from tgorochowik April 6, 2021 15:30
@hzeller
Copy link
Collaborator Author

hzeller commented Apr 8, 2021

@tgorochowik I handed the review baton on to you as David focuses on other things currently.

Copy link
Member

@tgorochowik tgorochowik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest I agree with David - this solution looks pretty hacky and I think it would be best to avoid doing it like this.

However it does solve the issue, so I guess it's okay to merge it and revisit at a later stage. So let me approve this in hope it won't start the "broken window theory" thing.

@fangism
Copy link
Collaborator

fangism commented Apr 9, 2021

Sorry, it's been a while since looking at this. I should have prefaced my initial comment with: if it can be encoded using standard techniques for expressing precedence and associativity in LR grammars, do that first. Because with appears in different contexts, you might have to create an auxiliary rule like expression_followed_by_with as one way to avoid/delay the conflict, but this is not always easy. Based on early conversations, I believe Henner attempted some of these approaches?

@hzeller
Copy link
Collaborator Author

hzeller commented Apr 9, 2021

Ok, will not submit now and revisit when there is a bit more time.

@hzeller
Copy link
Collaborator Author

hzeller commented Oct 12, 2022

This branch got lost, so continuing in #1504.

@hzeller hzeller closed this Oct 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes All contributors in pull request have signed the CLA with Google.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Syntax errors due to streaming with operator
4 participants