Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(tokenizer): Introduce events #404

Merged
merged 19 commits into from
Feb 27, 2022

Conversation

fb55
Copy link
Collaborator

@fb55 fb55 commented Feb 11, 2022

Update the tokenizer to produce events. See #403 for details.

Copy link
Collaborator

@wooorm wooorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a lot of changes. Including what seems like refactoring. That make it hard to review.
If possible, can you move part of this work to separate PRs?
Can you point me towards subsets that you want me to review?

protected override _handleToken(token: Token): boolean {
if (!super._handleToken(token)) {
this.emitRaw(this._getRawHtml(token.location!));
protected override emitIfListenerExists(eventName: string, token: SaxToken): boolean {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SAX parser now handles events, so _handleToken is gone. We can override emitIfListenerExists instead for the same effect.

protected parserFeedbackSimulator: ParserFeedbackSimulator;
private pendingText: CharacterToken | null = null;
private pendingText: Text | null = null;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With _handleToken gone, we keep a Text object here for simplicity.


constructor(private tokenizer: Tokenizer) {
constructor(options: TokenizerOptions, private handler: TokenHandler) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of passing tokens on after a getNextToken call, the ParserFeedbackSimulator now mirrors tokens received from the tokenizer to a passed handler.

'ParserFeedbackSimulator',
'ParserFeedbackSimulator',
feedbackPath.pathname,
(handler) => new ParserFeedbackSimulator({}, handler).tokenizer
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generateTokenizationTests now only needs a tokenizer to be returned.

@@ -1,4 +1,4 @@
import { Tokenizer, TokenizerMode } from '../tokenizer/index.js';
import { QueuedTokenizer, TokenizerMode } from '../tokenizer/index.js';
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parser is mostly unchanged — only the wrapped tokenizer is imported.

generateTokenizationTests('tokenizer', 'Tokenizer', dataPath.pathname, ({ errors }) => {
const tokenizer = new Tokenizer({
sourceCodeLocationInfo: true,
onParseError(err): void {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The onParseError method for the tokenizer moved to the handler. This is not a breaking change for anyone not explicitly using the tokenizer (and there are a lot of breaking changes here for those that are).

@@ -40,7 +40,41 @@ function appendToken(dest: Token[], token: Token): void {
dest.push(token);
}

function collectParserTokens(html: string): ReturnType<typeof convertTokenToHtml5Lib>[] {
function convertTokenToHtml5Lib(token: Token): HtmlLibToken {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convertTokenToHtml5Lib was no longer needed in test/utils/generate-tokenization-tests.ts, and therefore moved here.

initialState: getTokenizerSuitableStateName(initialStateName),
initialStateName,
lastStartTag: descr.lastStartTag,
expectedErrors: descr.errors || [],
expectedErrors: TestsWithBrokenErrors[descr.description]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Broken tests are now referenced here, instead of when comparing once all errors are gathered.

@@ -141,7 +164,7 @@ interface TestDescription {
description: string;
input: string;
lastStartTag: string;
errors?: string[];
Copy link
Collaborator Author

@fb55 fb55 Feb 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errors was wrongly typed; it should have been a TokenError to begin with.

@fb55
Copy link
Collaborator Author

fb55 commented Feb 11, 2022

This is a lot of changes. Including what seems like refactoring. That make it hard to review.
If possible, can you move part of this work to separate PRs?

I have pushed some commits that might help make this more digestible.

I was debating whether or not to include updates to other modules & tests. I felt like they help to illustrate how code looks like after adopting this change, so I kept them for now. But happy to move them to a separate PR if needed.

Can you point me towards subsets that you want me to review?

packages/parse5/lib/tokenizer/index.ts and packages/parse5/lib/tokenizer/queued.ts are the two files that are the most crucial.

The tokenizer unfortunately has a lot of changes due to all emitToken calls being replaced with the one for the specific token type. These changes follow a very consistent pattern and are hopefully easy to follow.

@fb55 fb55 merged commit 722f429 into inikulin:master Feb 27, 2022
@fb55 fb55 deleted the refactor/tokenizer-events branch February 27, 2022 07:35
@wooorm
Copy link
Collaborator

wooorm commented Feb 27, 2022

sorry, thanks for putting in that work, sorry that I didn’t review this! :(

@fb55
Copy link
Collaborator Author

fb55 commented Feb 27, 2022

sorry, thanks for putting in that work, sorry that I didn’t review this! :(

No worries — hope it's okay that I moved forward with this after two weeks.

I have opened #419 to wrap up this refactoring and would appreciate a review there (if you have the time).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants