Packrat Parsing

About packrat parsing

Parseq is implemented as a backtracking recursive descent parser. In the worst case, this kind of parser requires exponential execution time (see for example here). Packrat parsing is a method which guarantees linear execution time and a backtracking recursive decent parser can be adapted to use this method. Packrat parsing was introduced together with parsing expression grammars (PEG) by Bryan Ford in 2002.

To enable packrat parsing in a backtracking recursive descent parser, memoization has to be performed. This means that when a parsing expression (PE) is evaluated at a given position in the sequence, the position and the result are stored (memoized). If (later on) the same PE is again about to be evaluated at the same position, the previous result is recalled instead of actually evaluating the PE. The obvious drawback is that the parser has to store data during parsing. This amount of data is proporitional to the length of the input sequence.

Apart from increased memory consumption, there are other challenges:

The parsing result that is stored may be a large tree or other structure.
The parsing result should be copied for storage (unless the code is fully functional).
Side effects of parsing expressions will break because the parsing code will only be executed once.

Parseq as a packrat parser

Parseq is not ideally suited for packrat parsing for the following reasons:

Parseq uses more than a pointer into the sequence, because it allows parsing of trees. Therefore the storage key for the memoization includes a tree pointer (which is a tuple).
Parsing expressions in parseq can have arguments. These affect the behaviour and/or the result of the expression. Therefore, the arguments need to be part of the memoization storage key as well.
Variables declared with (:external ...) may also affect the result of the expression, therefore these variables must be part of the memoization key too.
Processing options allow parsing to be non-functional. Memoization would break the operation of the parser in this case.

The first three items mean that the storage key may be a compound object that potentially contains a lot of data. Key lookup (which must be done each time a call to a parsing expression is made) requires the comparison of keys. With the key being a compound object, the comparison between keys may be just as expensive as evaluating the parsing expression instead.

Implementation

Packrat is implemented in parseq as an option. Furthermore, it is activated for individual parsing expressions instead of all at once. It has been shown that memoizing all PE will make the performance worse (at least for the java grammar).

With this kind of implementation, the disadvantages can be avoided mostly. Because the PE are compiled, there is not even an overhead when packrat parsing is disabled.

Should you use packrat parsing?

It is my humble opinion that you should not. Exponential time may seem frightening but I believe that most real world grammars do not have this problem. If a situation with exponential time occurs, I am quite confident that the grammar can be tweaked in such a way that the problem disappears.

Feel free to decide against this advice.

Cases that lead to exponential processing time

I made an example which appears to require exponential time. Packrat parsing reduces the number of processing steps significantly. However, there is another (equivalent) grammar that is even more efficient (without packrat parsing).

The same example can be found in the exponential.lisp example.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly