fix parsing of 'with' inside streaming operator #693 #697

hzeller · 2021-03-11T21:38:28Z

The stream expression has an expression followed by
an optional 'with' '[' array_range_expression ']'
(LRM 11.4.14)

There are a few other cases where an expression can
be followed by 'with':

* with_constraint_block: there we expect an
  identifier list in parentheses after 'with'
* similarly, array methods with a predicate:
  'with' '(' ... ')'

Since the parser with one look-ahead can't see beyond
the 'with', it runs into a shift/reduce conflict
as it does not know if '(' or '[' is to follow.

Disambiguate that in the lexer by creating a
separate token TK_with__followed_by_bracket which is
a TK_with, where the lexer already looked ahead and
knows that it will be followed by '['.

After seeing the '[', everything but the "with" is put
back into the stream to be processed separately.
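
The lookahead idea can be illustrated outside of flex. The following is a minimal hand-written sketch, not the actual verilog.lex rule: the `Tok` enum, `Lex`, `BracketFollows`, and `IsWordChar` are hypothetical names made up for this example, and only `kWithFollowedByBracket` corresponds to the TK_with__followed_by_bracket token described above. It peeks past spaces and tabs after 'with', picks the token kind based on whether '[' follows, and consumes only the four characters of "with" (the effect yyless(4) has in the flex rule).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds for this sketch. Only kWithFollowedByBracket
// corresponds to a name from the description (TK_with__followed_by_bracket).
enum class Tok { kWith, kWithFollowedByBracket, kWord, kSymbol };

struct Token {
  Tok kind;
  std::string text;
};

// True if only spaces or tabs separate position `pos` from a '['.
// This mirrors the simplistic [ \t]* lookahead; comments are not handled.
bool BracketFollows(const std::string& src, std::size_t pos) {
  assert(pos <= src.size());
  while (pos < src.size() && (src[pos] == ' ' || src[pos] == '\t')) ++pos;
  return pos < src.size() && src[pos] == '[';
}

bool IsWordChar(char c) {
  return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Splits `src` into words and single-character symbols.
std::vector<Token> Lex(const std::string& src) {
  std::vector<Token> out;
  std::size_t i = 0;
  while (i < src.size()) {
    if (std::isspace(static_cast<unsigned char>(src[i]))) {
      ++i;
      continue;
    }
    if (IsWordChar(src[i])) {
      std::size_t j = i;
      while (j < src.size() && IsWordChar(src[j])) ++j;
      const std::string word = src.substr(i, j - i);
      Tok kind = Tok::kWord;
      if (word == "with") {
        // Look ahead, but consume only "with": the whitespace and the '['
        // stay in the input and are lexed by the normal rules afterwards.
        kind = BracketFollows(src, j) ? Tok::kWithFollowedByBracket
                                      : Tok::kWith;
      }
      out.push_back({kind, word});
      i = j;
      continue;
    }
    out.push_back({Tok::kSymbol, std::string(1, src[i])});
    ++i;
  }
  return out;
}
```

With this, the parser sees two different terminals for the two contexts, so it no longer has to decide between '(' and '[' with a single token of lookahead.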

Changed lexer and parser as well as associated unit tests including formatter.

Fixing #693

Signed-off-by: Henner Zeller <h.zeller@acm.org>

The stream expression has an expression followed by an optional 'with' '[' array_range_expression ']' (LRM 11.4.14)

There are a few other cases where an expression can be trailed with 'with',

* with_constraint_block: there we expect an identifier list in parenthesis after 'with'
* similar with array methods with predicate with '(' ...)

Since the parser with one look-ahead can't see beyond the 'with', it runs into a shift/reduce conflict as it does not know if '(' or '[' is to follow.

Disambiguate that in the lexer by creating a separate token TK_with__followed_by_bracket which is a TK_with, where the lexer already looked ahead and knows that it will be followed by '['.

After seeing the '[', everything but the "with" is put back into the stream to be processed separately.

TODO:

* [ ] Extend what can be between 'with' and '[' to include arbitrary whitespace (including: comments)
* [ ] Make formatter work; currently {<< {unpacked_arr with [i*4 +: 4]}}; is formatted {<<{unpacked_arrwith[i*4+:4]}};

Signed-off-by: Henner Zeller <h.zeller@acm.org>

hzeller · 2021-03-11T21:41:56Z

Implemented the suggestion to use yyless to rewind so that only 'with' is consumed; the rest is lexed later as usual.

Next step is to see whether it is easy in flex to allow arbitrary whitespace, including comments, between 'with' and '[' (currently implemented simplistically as [ \t]*).

The formatter also needs to be able to deal with this token sequence; right now it removes all whitespace.

Signed-off-by: Henner Zeller <h.zeller@acm.org>

hzeller · 2021-03-12T02:51:43Z

Upcoming: Fix formatting.

Right now, for

module test;
  initial begin
    logic unpacked_arr [16] = '{0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1};
    logic [3:0] nibble_arr[4];
    for (int i = 0; i < 4; i++) begin
      nibble_arr[i] = {<< {unpacked_arr with [i*4 +: 4]}};
    end
  end
endmodule

... the token partition tree looks like:

Full token partition tree:
{ ([<auto>], policy: always-expand) @{}
  { ([<auto>], policy: always-expand) @{0}, (origin: "module test...d
endmodule")
    { ([module test ;], policy: fit-else-expand, (origin: "module test;")) }
    { (>>[<auto>], policy: fit-else-expand) @{0,1}, (origin: "initial beg...  end
  end")
      { (>>[initial begin], policy: fit-else-expand) }
      { (>>>>[<auto>], policy: tabular-alignment) @{0,1,1}, (origin: "logic unpac...}};
    end")
        { (>>>>[logic unpacked_arr [ 16 ] = '{ 0 , 1 , 0 , 1 , 0 , 1 , 0 , 1 , 0 , 1 , 0 , 1 , 0 , 1 , 0 , 1 } ;], policy: fit-else-expand, (origin: "logic unpac..., 1, 0, 1};")) }
        { (>>>>[logic [ 3 : 0 ] nibble_arr [ 4 ] ;], policy: fit-else-expand, (origin: "logic [3:0]...ble_arr[4];")) }
        { (>>>>[<auto>], policy: fit-else-expand) @{0,1,1,2}, (origin: "for (int i ...}};
    end")
          { (>>>>[<auto>], policy: fit-else-expand) @{0,1,1,2,0}, (origin: "for (int i ...i < 4; i++)")
            { (>>>>[for (], policy: fit-else-expand) }
            { (>>>>>>>>[<auto>], policy: fit-else-expand) @{0,1,1,2,0,1}, (origin: "int i = 0; i < 4; i++")
              { (>>>>>>>>[int i = 0 ;], policy: fit-else-expand, (origin: "int i = 0")) }
              { (>>>>>>>>[i < 4 ;], policy: fit-else-expand, (origin: "i < 4")) }
              { (>>>>>>>>[i ++], policy: fit-else-expand, (origin: "i++")) }
            }
            { (>>>>[) begin], policy: uninitialized) }
          }
          { (>>>>>>[nibble_arr [ i ] = { << { unpacked_arr with [ i * 4 +: 4 ] } } ;], policy: fit-else-expand, (origin: "nibble_arr[...*4 +: 4]}};")) }
          { (>>>>[end], policy: always-expand, (origin: "end")) }
        }
      }
      { (>>[end], policy: always-expand, (origin: "end")) }
    }
    { ([endmodule], policy: always-expand, (origin: "endmodule")) }
  }
}

With that, the formatter adds no spaces in the relevant line, resulting in the following faulty formatting:

module test;
  initial begin
    logic unpacked_arr[16] = '{0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1};
    logic [3:0] nibble_arr[4];
    for (int i = 0; i < 4; i++) begin
      nibble_arr[i] = {<<{unpacked_arrwith[i*4+:4]}};
    end
  end
endmodule

fangism · 2021-03-12T04:48:30Z

verilog/parser/verilog.lex

-"with"[ \t]*"[" {
-  /* TODO: formulate the ws sequence as any whitespace+comment sequence*/
-  yyless(4);
+"with"({TraditionalCommentOrSpace}|{EndOfLineComment})*"[" {


This might be easier with an <AFTER_WITH> state, where 'with' returns a TK_with after pushing a new state.
It will be easier to re-use rules inside states than glue them together like this in monolithic regular expressions.
Inside this new state, you handle whitespace, comments, and [, and for everything else, you can pop the state and rewind, to defer to default <INITIAL> state handling (or whatever was previously on the state stack).

You will also want lexer unit-tests for this. :)

Ok, I will need to familiarize myself with how states work. They did look somewhat complicated.

The beauty of this is that I don't have to handle anything specially, including '['; I just look ahead, find that this is the next token, and know that TK_with needs to become a new token.

(lexer unit test added)

I looked at the AFTER_DOT example, which is somewhat involved. I have to play with it to see if it is more advantageous.

Looking over this again, I still stand by this first comment. Generally, the options are:

* track state within the lexer to alter the enumeration of the latter token
* track state inside the lexical context class to alter the enumeration of the earlier token (effectively achieving lookahead). This approach is pretty localized: you're defining a tiny state machine that can run alongside other independent state machines.
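
To make the second option concrete, here is a rough sketch of such a tiny state machine running over an already-lexed token stream. It is an illustration only and does not use the real lexical context class: `Kind`, `Token`, and `RewriteWithBeforeBracket` are hypothetical names. The pass remembers the most recent 'with', skips whitespace and comment tokens, and retags the remembered token once it sees that the next significant token is '['.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token model for this sketch.
enum class Kind { kWith, kWithFollowedByBracket, kSpace, kComment, kOther };

struct Token {
  Kind kind;
  std::string text;
};

// Retags a 'with' token when the next significant token is '['.
// Whitespace and comments in between are skipped, so any number of them
// may separate the two tokens.
void RewriteWithBeforeBracket(std::vector<Token>& tokens) {
  const std::size_t kNone = tokens.size();  // sentinel: no pending 'with'
  std::size_t pending_with = kNone;
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    const Kind kind = tokens[i].kind;
    if (kind == Kind::kWith) {
      pending_with = i;  // state: seen 'with', waiting for the next token
      continue;
    }
    if (kind == Kind::kSpace || kind == Kind::kComment) {
      continue;  // insignificant tokens do not change the state
    }
    if (pending_with != kNone && tokens[i].text == "[") {
      tokens[pending_with].kind = Kind::kWithFollowedByBracket;
    }
    pending_with = kNone;  // any significant token resets the state
  }
}
```

Because the state is a single index, this pass stays independent of other context tracking and needs no special regular expression for the text between 'with' and '['.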

fangism · 2021-03-12T04:50:14Z

verilog/parser/verilog_parser_unittest.cc

+    "{>>8 {foo, bar with \t [ a +: b]}} = nums;\n"
+    "endfunction",
+    "function void unpack_id_with(int nums);\n"
+    "{>>8 {foo, bar with /* some comment */ [ a +: b]}} = nums;\n"


More test case ideas:

multiple comments in between with and [.

fangism · 2021-03-12T04:53:27Z

When the tokens mistakenly fused, did the formatter catch it as data corruption and bail out? (If not, that itself is a bug.)

This is controlled by inter-token spacing (token_annotator). There is a general rule that keywords and identifiers require a minimum of 1 space of separation. You will likely need a classification entry to make the special [ look and feel like a regular [.

There was a special case for streaming operators that reduced spaces around the various components. However, for 'with', we need to special-special case that to actually keep a space around.

Signed-off-by: Henner Zeller <h.zeller@acm.org>

hzeller · 2021-03-15T23:20:59Z

Yes, the formatter was bailing out correctly, finding that it created a syntax error after reparsing.
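
As a simplified illustration of why fused tokens are detectable: the check described here reparses the formatted output, whereas the sketch below swaps in a cheaper stand-in that only compares the whitespace-insensitive word sequence of input and output. `Words` and `SameTokens` are hypothetical helpers for this example, not formatter functions.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Splits text into maximal word runs ([A-Za-z0-9_]+) and single symbols,
// dropping all whitespace.
std::vector<std::string> Words(const std::string& s) {
  std::vector<std::string> out;
  std::size_t i = 0;
  while (i < s.size()) {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    if (std::isspace(c)) {
      ++i;
      continue;
    }
    std::size_t j = i + 1;
    if (std::isalnum(c) || c == '_') {
      while (j < s.size() &&
             (std::isalnum(static_cast<unsigned char>(s[j])) || s[j] == '_')) {
        ++j;
      }
    }
    out.push_back(s.substr(i, j - i));
    i = j;
  }
  return out;
}

// True if formatting changed nothing but whitespace between tokens.
bool SameTokens(const std::string& original, const std::string& formatted) {
  return Words(original) == Words(formatted);
}
```

Applied to the example above, the fused `unpacked_arrwith` becomes a single word that does not appear in the input, so the comparison fails and the formatter can refuse to emit the result.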

The issue in this particular case was a special case in the annotator that removes spaces around the components of streaming operators. We do have to keep the spaces around the 'with' keyword, though, so that needed to be special-special cased.

Anyway, I think this is ready to review. It still uses the lexer without an extra state machine, though; I will have to investigate whether that would feel better.
The bug, however, would already be fixed in the current state.
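
The spacing rule can be sketched as follows. This is not the token_annotator code: `IsWordLike`, `SpacesInsideStreaming`, and `Render` are hypothetical names, tokens are plain strings, and the compact result for the rest of the expression is an assumption based on the faulty output shown earlier.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// True for tokens that start like an identifier, keyword, or number.
bool IsWordLike(const std::string& t) {
  return !t.empty() &&
         (std::isalnum(static_cast<unsigned char>(t[0])) || t[0] == '_');
}

// Number of spaces required between two adjacent tokens inside a
// streaming expression (the only context modeled by this sketch).
int SpacesInsideStreaming(const std::string& left, const std::string& right) {
  // The "special-special" case: always keep a space around 'with'.
  if (left == "with" || right == "with") return 1;
  // General rule: never fuse two word-like tokens.
  if (IsWordLike(left) && IsWordLike(right)) return 1;
  // Existing special case: streaming components are packed tightly.
  return 0;
}

// Joins tokens using the spacing rule above.
std::string Render(const std::vector<std::string>& tokens) {
  std::string out;
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    if (i > 0 && SpacesInsideStreaming(tokens[i - 1], tokens[i]) > 0) {
      out += ' ';
    }
    out += tokens[i];
  }
  return out;
}
```

Checking 'with' before the tight-packing default is what keeps `unpacked_arr`, `with`, and `[` apart while the remaining tokens stay compact.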

hzeller · 2021-04-06T15:27:46Z

I have created #749 to follow up on the implementation improvement suggested by @fangism. I'd rather submit this now to be able to deal with the immediate parse situations, and then look for the nicer implementation later, once the dust in other corners of the project has settled.

hzeller · 2021-04-08T15:35:05Z

@tgorochowik I handed the review baton on to you as David focuses on other things currently.

tgorochowik

To be honest I agree with David - this solution looks pretty hacky and I think it would be best to avoid doing it like this.

However it does solve the issue, so I guess it's okay to merge it and revisit at a later stage. So let me approve this in hope it won't start the "broken window theory" thing.

fangism · 2021-04-09T03:31:36Z

Sorry, it's been a while since looking at this. I should have prefaced my initial comment with: if it can be encoded using standard techniques for expressing precedence and associativity in LR grammars, do that first. Because with appears in different contexts, you might have to create an auxiliary rule like expression_followed_by_with as one way to avoid/delay the conflict, but this is not always easy. Based on early conversations, I believe Henner attempted some of these approaches?

hzeller · 2021-04-09T15:07:12Z

Ok, will not submit now and revisit when there is a bit more time.

hzeller · 2022-10-12T14:21:13Z

This branch got lost, so continuing in #1504.

googlebot added the "cla: yes" label (All contributors in pull request have signed the CLA with Google.) Mar 11, 2021

hzeller requested a review from fangism March 11, 2021 21:39

Extend notion of whitespace between 'with' and '[' to comments.

14776bf

Signed-off-by: Henner Zeller <h.zeller@acm.org>

fangism reviewed Mar 12, 2021

hzeller added 2 commits March 15, 2021 15:46

Add lexer unit test. Extend parser unit test.

3f2c80d

Signed-off-by: Henner Zeller <h.zeller@acm.org>

hzeller changed the title ~~Work in progress: fix #693~~ fix parsing of with inside streaming operator #693 Mar 15, 2021

hzeller linked an issue Mar 15, 2021 that may be closed by this pull request

Syntax errors due to streaming with operator #693

Open

hzeller marked this pull request as ready for review March 15, 2021 23:33

hzeller requested a review from fangism March 15, 2021 23:33

hzeller changed the title ~~fix parsing of with inside streaming operator #693~~ fix parsing of 'with' inside streaming operator #693 Mar 16, 2021

This was referenced Apr 5, 2021

Format OpenTitan codebase with Verible SystemVerilog code formatter #748

Open

Consider using lexer states for #697 #749

Open

hzeller requested a review from tgorochowik April 6, 2021 15:30

tgorochowik approved these changes Apr 8, 2021

nikhiljha pushed a commit to nikhiljha/verible that referenced this pull request Sep 27, 2022

feat(health): check GitHub rate limit (chipsalliance#697)

86def32

hzeller mentioned this pull request Oct 12, 2022

Fix parsing of 'with' inside streaming operator. #1504

Draft

hzeller closed this Oct 12, 2022
