Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lossless syntax tree prototype and discussion #189

Closed
wants to merge 13 commits into from

Conversation

nickolay
Copy link
Contributor

@nickolay nickolay commented Jun 7, 2020

I'm opening this draft PR to share the prototype I have at https://github.com/nickolay/sqlparser-rs/tree/wip-cst

I hope that people, who contribute, already use, or want to use sqlparser-rs, react here (if only via a GitHub reaction) so that we can judge if this is something worth pursuing as part of this library or if I'm completely off-track. There's a detailed write-up in the branch's README, but the short version is that current AST is lossy (w.r.t whitespace, comments, unimportant syntax, and source code locations) and to fix that I propose an additional intermediate layer, which is lossless. Some changes to the AST will be necessary too, but to what extent is up for discussion.

Apart from gauging the general mood and collecting feedback, I hope that documenting what I have so far will enable someone to beat me at coming up with the specific design for the CST/AST, as I have not been able to dedicate much time to this.

I'll include a quote from the motivational section of the README:

Preserving full source code information (#161) would enable SQL rewriting/refactoring tools based on sqlparser-rs. For example:

  1. Error reporting, both in the parser and in later stages of query processing, would benefit from knowing the source code location of SQL constructs (Add line/column information consistently to every parser message. #179)
  2. SQL pretty-printing requires comments to be preserved in AST (see Comment handling #175, mentioning forma)
  3. Refactoring via AST transformations would also benefit from having full control over serialization, a possible solution for dialect-specific "writers" (Implement SQL writer #18)
  4. Analyzing partially invalid code may be useful in the context of an IDE or other tooling.

I think that adopting rust-analyzer's design, that includes a lossless syntax tree, is the right direction for sqlparser-rs. In addition to solving the use-cases described above, it helps in other ways:

  1. We can omit syntax that does not affect semantics of the query (e.g. ROW vs ROWS) from the typed AST by default, reducing the implementation effort.
  2. Having a homogenous syntax tree also alleviates the need for a "visitor" (Add an AST visitor #114), also reducing the burden on implementors of new syntax

`Parser::parse_sql()` is intended to be only a helper, the user should
be able to use it or not, without having to re-implement parts of the
parsing logic.
Adjusting parser's `self.index` always felt like a hack, and it becomes
more important as I need to store more state in my "lossless syntax
tree" prototype.
Part of the `cst` branch, see its README for details.
These are re-enabled in the "Support parser backtracking in the
GreenNodeBuilder" commit.
This adds the common code to allow building a lossless syntax tree based
on the `rowan` crate, in parallel with our current AST. The motivation
is discussed in the README. The CST is a flat list of tokens by default,
structure can be added incrementally in future commits.

Two changes are split into their own commits, that should be squashed
with this, once the GreenNodeBuilder change is upstreamed:

- Fix `TBD: rowan's builder does not allow reverting to a checkpoint`
- Hide rowan behind a non-default `cst` feature

Other tasks to be done later, marked with `TBD` in the code:

- Fix `Token`/`SyntaxKind` duplication, changing the former to
    store a slice of the original source code, e.g.
    `(SyntaxKind, SmolStr)`

  This should also fix the currently disabled test-cases where `Token`'s
  `to_string()` does not return the original string:

    * `parse_escaped_single_quote_string_predicate`
    * `parse_literal_string`  (the case of `HexLiteralString`)

- Fix the hack in `parse_keyword()` to remap the token type (RA has
  `bump_remap() for this)

- Fix the `Parser::pending` hack (needs rethinking parser's API)

    Probably related is the issue of handling whitespace and comments:
    the way this prototype handles it looks wrong.
Copy GreenNodeBuilder from rowan with the minimal changes to make it
work as part of our crate.
This adds a `reset()` method to the GreenNodeBuilder previously
imported from rowan and uses it to fix the CST in the testcases where
parser has to backtrack in `parse_table_factor`.
The naming scheme for `SyntaxKind`s and the CST structure is only an
example, see the README for the discussion.

Note also the 'TBD' markers in the patch.
@Dandandan
Copy link
Contributor

I am using sqlparser-rs for a prototype.

Locations of ast nodes (for parser error messages, but also generated type errors, linting messages, etc.) will be incredibly valuable for this!

I will try reviewing this prototype.

@nickolay nickolay mentioned this pull request Jun 7, 2020
@maxcountryman
Copy link
Contributor

This is great. I'm very interested in this sort of thing since it opens the door to all sorts of things that simply aren't possible with the current implementation. I'm personally interested in using this for a project to format SQL in a way that preserves comments.

@panarch
Copy link
Contributor

panarch commented Jun 9, 2020

This is awesome, it is obviously worth fully enough, I'd really appreciate.

I'm currently making a new sql database engine based on key-value storage, planning to release the first version in Aug, this year under MPL 2.0 license.

Attractive points to see may be like these below,

  • Purely written in Rust
  • All core sql executing code is fully written in functional, no even single mut is used.
  • Easily swappable storage engine - At first I'll use sled as a default storage, but anyone else can attach their own storage engine easily by implementing simple traits it provides.
  • WebAssembly support, so it can be used in web frontend development. I'm expecting this db to be used in state management.

And the parser I've used was nom-sql but it didn't have many necessary features I need and looks not updated recently, so I was considering to find some alternatives.
Then by Materialize db announcement, I got to know about this project.
So, what I'm currently working is replacing the parser from nom-sql to this sqlparser-rs.

I'll really look forward to see this new change. thanks.

@hwchen
Copy link

hwchen commented Jun 29, 2020

This looks pretty interesting. I'm wondering if this will also increase performance by being zero-copy? (If I'm misunderstanding the design, let me know). For my use-case, speed is a priority over features, assuming basic parsing is equal.

@Dandandan
Copy link
Contributor

@hwchen I think the main reason of the change is not performance, but rather functionality: easily allow tools to generate the same output, keeping comments, capitalization, spacing, etc. This allows for better errors, tools like formatters, etc.
I think you are right that zero copy could improve performance by a bit, although it probably won't be a lot in this case.

If you are interested in contributing to performance (I think sqlparser is already quite fast in the general case), I would suggest to see if you can add something to the benchmarks https://github.com/ballista-compute/sqlparser-rs/tree/main/sqlparser_bench , see if you can improve the performance through some profiling or maybe try to solve a performance issue #216

@SKalt
Copy link

SKalt commented Mar 21, 2021

HI, what steps are still needed to get this PR ready for review? Specifically, how might I help?

@maxcountryman
Copy link
Contributor

HI, what steps are still needed to get this PR ready for review? Specifically, how might I help?

I'd also love to help.

@SKalt
Copy link

SKalt commented Mar 23, 2021

Looks like the 1st item of business would be updating the CST-experiment branch with the features and fixes now in the mainline. The 0th item of business would be identifying what those features/fixes are.

@SKalt
Copy link

SKalt commented Mar 24, 2021

There's a fair amount of work on sqlparser-rs that's happened since June (that's a good thing!). I'm cautiously optimistic that a careful merge and will work rather than redoing the existing rowan/CST implementation.

git log --oneline wip-cst..real/main output if you're interested
43fef23 (cargo-release) start next development iteration 0.8.1-alpha.0
34cd794 (cargo-release) version 0.8.0
a868ff6 Add release notes
add8991 feat: support sqlite insert or statement (#281)
07342d5 Support parsing multiple show variables. (#290)
f40955e Parse floats without leading number (#294)
6f0b2dc Implement SUBSTRING(col [FROM <expr>] [FOR <expr>]) syntax (#293)
8a214f9 Implement Hive QL Parsing (#235)
17f8eb9 Fix clippy lints (#287)
200ed5e (cargo-release) start next development iteration 0.7.1-alpha.0
e11b80e (cargo-release) version 0.7.0
97cd1c0 Release 0.7.0 instead
9930bdf (cargo-release) start next development iteration 0.6.3-alpha.0
26c281e (cargo-release) version 0.6.2
d66294f Add date
e18e8dc Prepare 0.6.2
94ff468 Support ANALYZE TABLE syntax (#285)
17f2930 Introduce support for EXPLAIN [ANALYZE] [VERBOSE] <STATEMENT> syntax
cbd3c6b Merge pull request #263 from ballista-compute/documentation/cargo-release-config
929fc67 Merge pull request #260 from eyalleshem/single_tables_in_parens
ad72cda [snowflake] Support specifying an alias after `FROM (table_factor)`
d9e044a Extend one_statement_parses_to to also test parsing canonical SQL
4128dfe Introduce tests/test_utils/mod.rs and use it consistently
9f772f0 Add support for Recursive CTEs (#278)
99fb633 Move existing SF tests to sqlparser_snowflake.rs
54be391 Update CHANGELOG
7dc5d4c Follow-up to #275: Bump simple_logger version in Cargo.toml
1ac2083 Support IF NOT EXISTS for CREATE SCHEMA (#276)
926b03a Add parsing for PostgreSQL math operators (#267)
0ac343a Don't publish on the push of any tag
cc4f51f Update releasing.md docs
2f71324 Fix deprecated way of initializing SimpleLogger (#275)
cf7263c Fix typo in the README
01a2a6b Update CHANGELOG (#261)
1c6077c [snowflake] Support single line comments starting with '#' or '//' (#264)
e9aa87f Update bigdecimal requirement from 0.1 to 0.2 (#268)
a5b7524 Fix clippy linting error, use enumerate (#266)
038ef98 Add Ballista to the README as a user of this crate (#262)
fcf1eb1 add a note about cargo release config
118a345 Merge pull request #238 from maxcountryman/feature/gh-action-releases
f500a42 Add snowflake dialect (#259)
2c6c295 Refactor <column definition> parsing (#251)
66505eb Don't fail parsing a column definition with unexpected tokens
23f5c7e Move parse_column_def below parse_columns
1b46e82 Enable dialect specific behaviours in the parser (#254)
3871bbc Enable dependabot for this repository (#256)
5f3a40e Dependency updates (#255)
61431b0 Support TABLE functions in FROM (#253)
a246d5d Undo accidental commit
caeb046 Enable dependabot for this repository
1cc3bf4 Support named arguments in function invocations (#250)
9351efb update release instructions
580e4b1 Merge pull request #252 from ballista-compute/fixup/use-nightly-fmt
6b37c16 fix typo
1a70c6e document initial release process
cac3a8e provide missing license header
76a911b ensure we use nightly with fmt
9c1a5a7 Don't fail parsing ALTER TABLE ADD COLUMN ending with a semicolon (#246)
f8feff4 Add SQLite dialect (#248)
4452f9b Support specifying ASC/DESC in index columns (#249)
9e7e302 Support identifiers quoted with backticks in the MySQL dialect (#247)
1337820 Merge pull request #245 from nickolay/cleanups
a6e30b3 Fix typo in JSONFILE serialization
9a2d86d Change CREATE INDEX serialization to not end with a semicolon
9371652 Fix "unused stmt" warning in tests, with default features
3e880b5 Use consistent style for Display impls
d0db8a2 Run cargo fmt
09ca14f Support dialect-specific auto-increment column options for MySQL and SQLite (#234)
8020b2e Add Postgres-specific PREPARE, EXECUTE and DEALLOCATE (#243)
d2e4340 Support create or replace view/table (#239)
bc9bfae automate crate publishing
f053383 Release 0.6.1
3a42b69 Release 0.6.0 (#232)
583f22b Remove PostgreSQL version of assert (#229)
c24b0e0 Implement ASSERT statement (#226)
5cab189 Add TPCH reggression tests (#221)
f3b9edc update travis badge to point to actions status (#219)
8cc7702 update branch references to `main` (#215)
2a6d5f2 update cargo manifest (#214)
a53f1d2 Support SQLite `CREATE VIRTUAL TABLE` (#209)
0c83e5d Support SQLite's WITHOUT ROWID in CREATE TABLE (#208)
0c82be5 Follow-up to the recent release (CHANGELOG and the bench crate)
1946791 (cargo-release) start next development iteration 0.5.2-alpha.0
05f8992 (cargo-release) version 0.5.1
15d5f71 Add CREATE TABLE AS support (#206)
26361fd Implement ALTER TABLE DROP COLUMN (#148)
faeb7d4 Implement ALTER TABLE ADD COLUMN and RENAME (#203)
fab6e28 Output DataType capitalized (#202)
b24dbe5 Replace FromStr with normal parser function for FileFormat (#201)
68afa2a Make FileFormat case insensitive (#200)
a0f076a Make the cli example print JSON (#197)
f4fbd9b Take slice as input for parse_keywords (#199)
6cdd4a1 Support general "typed string" literals (#187)
34548e8 Change Word::keyword to a enum (#193)
0fe3a8e Use Token::EOF instead of Option<Token> (#195)
2f10153 Add serde support to AST structs and enums (#196)
d9a7491 Various follow-ups to recent pushes
846c52f Allow omitting units after INTERVAL (#184)
d842f49 Add line and column number to TokenizerError (#194)
10b0b7f Update CHANGELOG (#192)
a42121d Use binary search to speed up matching keywords (#191)
af54eb0 Rework github actions, add code coverage (#186)
6e6fae7 Add benchmarks using cargo bench / criterion (#190)

@SKalt
Copy link

SKalt commented Apr 3, 2021

requesting debugging: I merged the main branch into the wip-cst branch, leaving the product on branch cst-refresh of skalt/sqlparser-rs, PR'd against wip-cst. I'm new to both rust and this codebase1, so I could use help debugging a failing unit test, parser::tests::test_prev_index. It looks like a call to prev_token() is chopping the "version" off of "SELECT version" despite the best efforts of parser.syntax() to consume any remaining tokens.

1: This codebase has amazing comments!

@nickolay
Copy link
Contributor Author

nickolay commented Apr 5, 2021

@SKalt Thanks for taking an interest in this and sorry to keep you waiting!

Regarding the failing test, one of the problems is that you didn't sync up SyntaxNode with Token.

More generally, I suggest rebasing instead of trying to blindly merge this:

  • Otherwise it's really hard to spot the merge mistakes in a huge merge commit.
  • Redoing the patches is also a good way to get familiar with the code,
  • You'll be able to see what change introduces the problem
  • The first few commits from this branch are not really related to the CST work and could probably be landed separately - rebasing will make it possible to land them independently of finishing up the rest of the work.

As for the next steps, I tried to detail the current status and the outstanding tasks in the branch's README, the most important and the hardest one being coming up with a new typed AST design. Try working with raw SyntaxNodes to see what I mean.

@alamb
Copy link
Contributor

alamb commented Aug 20, 2021

Hi @nickolay -- sorry for the delay in review. I am going to help out now with this repo and we are working to clear the backlog. Is this PR still something you would like to work on to help contribute?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants