Add UTF-16 input parsing #3538

Merged: 6 commits, Jan 5, 2024

@raskad raskad commented Dec 26, 2023

We currently handle UTF-16 at runtime through our JsString type, but we have no support for UTF-16 input itself. One place where this matters for boa is the handling of eval: because we only allow UTF-8 input, we have to convert the input of eval to UTF-8 before parsing it.

To solve this, I added a ReadChar trait to the parser crate that allows the parser to handle different input encodings. The parsing itself is done on Unicode code points. To make this work, I removed all of the parsing that was done on bytes, since it assumed that we only handle UTF-8 input. Most of that work is in 55752f2.
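For readers who want a concrete picture, here is a minimal sketch of what an encoding-agnostic character source along these lines could look like. Only the trait name `ReadChar` comes from the description above; the method name `next_char`, its signature, and the types `Utf8Input`, `Utf16Input` and `collect_code_points` are assumptions made for illustration and may not match the code in this PR.

```rust
use std::io;

/// Hypothetical shape of the trait: yields one Unicode code point at a time.
/// Code points are `u32` rather than `char` so that lone surrogates,
/// which can occur in UTF-16 input, can still be represented.
trait ReadChar {
    /// Returns the next code point, or `None` at the end of the input.
    fn next_char(&mut self) -> io::Result<Option<u32>>;
}

/// Illustrative UTF-8 source backed by a `&str`.
struct Utf8Input<'a> {
    chars: std::str::Chars<'a>,
}

impl<'a> Utf8Input<'a> {
    fn new(src: &'a str) -> Self {
        Self { chars: src.chars() }
    }
}

impl ReadChar for Utf8Input<'_> {
    fn next_char(&mut self) -> io::Result<Option<u32>> {
        Ok(self.chars.next().map(u32::from))
    }
}

/// Illustrative UTF-16 source backed by a slice of code units.
struct Utf16Input<'a> {
    units: &'a [u16],
    pos: usize,
}

impl<'a> Utf16Input<'a> {
    fn new(units: &'a [u16]) -> Self {
        Self { units, pos: 0 }
    }
}

impl ReadChar for Utf16Input<'_> {
    fn next_char(&mut self) -> io::Result<Option<u32>> {
        let Some(&unit) = self.units.get(self.pos) else {
            return Ok(None);
        };
        self.pos += 1;

        // A high surrogate followed by a low surrogate forms one code point.
        if (0xD800..=0xDBFF).contains(&unit) {
            if let Some(&low) = self.units.get(self.pos) {
                if (0xDC00..=0xDFFF).contains(&low) {
                    self.pos += 1;
                    let high_bits = (u32::from(unit) - 0xD800) << 10;
                    let low_bits = u32::from(low) - 0xDC00;
                    return Ok(Some(0x10000 + high_bits + low_bits));
                }
            }
        }

        // Any other unit, including a lone surrogate, is passed through as-is.
        Ok(Some(u32::from(unit)))
    }
}

/// A consumer written against the trait does not care about the encoding.
fn collect_code_points<R: ReadChar>(mut input: R) -> io::Result<Vec<u32>> {
    let mut out = Vec::new();
    while let Some(cp) = input.next_char()? {
        out.push(cp);
    }
    Ok(out)
}

fn main() -> io::Result<()> {
    let text = "a😀";
    let utf16: Vec<u16> = text.encode_utf16().collect();

    // Both sources produce the same code point sequence.
    let from_utf8 = collect_code_points(Utf8Input::new(text))?;
    let from_utf16 = collect_code_points(Utf16Input::new(&utf16))?;
    assert_eq!(from_utf8, from_utf16);
    Ok(())
}
```

In this sketch the lexer-side code (`collect_code_points`) is generic over the source, which is the property the description relies on: once everything downstream works on code points, adding another encoding only means adding another implementation of the trait.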

I added a UTF-16 input type. To get some first positive results, I also adjusted the regex parsing to work for non-UTF-8 inputs. That should give us some eval and regex related tests that now pass.
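As a general illustration of why reading UTF-16 directly can differ from converting to UTF-8 first, the sketch below compares the two routes on input containing a lone surrogate. It uses only Rust standard library calls; the function names `via_utf8` and `direct` are hypothetical, and whether boa's previous eval conversion behaved exactly like the lossy route is not stated in this thread.

```rust
/// Route 1: convert the UTF-16 units to a Rust `String` (UTF-8) first.
/// Unpaired surrogates cannot be encoded in UTF-8, so the lossy
/// conversion replaces them with U+FFFD.
fn via_utf8(units: &[u16]) -> Vec<u32> {
    String::from_utf16_lossy(units)
        .chars()
        .map(u32::from)
        .collect()
}

/// Route 2: decode the UTF-16 units straight into code points,
/// keeping an unpaired surrogate as its raw value.
fn direct(units: &[u16]) -> Vec<u32> {
    char::decode_utf16(units.iter().copied())
        .map(|result| match result {
            Ok(c) => u32::from(c),
            Err(e) => u32::from(e.unpaired_surrogate()),
        })
        .collect()
}

fn main() {
    // 'a', the surrogate pair for U+1F600, then a lone high surrogate.
    let units = [0x0061, 0xD83D, 0xDE00, 0xD800];

    assert_eq!(via_utf8(&units), vec![0x61, 0x1F600, 0xFFFD]);
    assert_eq!(direct(&units), vec![0x61, 0x1F600, 0xD800]);
}
```

The two routes agree on well-formed input and differ only on the lone surrogate, which the direct route preserves.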

@raskad raskad added the enhancement (New feature or request) and lexer (Issues surrounding the lexer) labels Dec 26, 2023
@raskad raskad added this to the v0.18.0 milestone Dec 26, 2023

codecov bot commented Dec 26, 2023

Codecov Report

Attention: 73 lines in your changes are missing coverage. Please review.

Comparison is base (35df2de) 47.42% compared to head (e81c12b) 47.37%.
Report is 3 commits behind head on main.

Files                                          Patch %   Lines
core/parser/src/lexer/string.rs                71.66%    17 Missing ⚠️
core/parser/src/lexer/mod.rs                   67.85%    9 Missing ⚠️
core/parser/src/lexer/number.rs                80.00%    9 Missing ⚠️
core/parser/src/lexer/regex.rs                 67.85%    9 Missing ⚠️
core/parser/src/source/utf16.rs                43.75%    9 Missing ⚠️
core/parser/src/lexer/template.rs              45.45%    6 Missing ⚠️
core/parser/src/lexer/cursor.rs                87.50%    5 Missing ⚠️
core/parser/src/source/utf8.rs                 78.26%    5 Missing ⚠️
core/engine/src/module/mod.rs                  0.00%     1 Missing ⚠️
core/parser/src/lexer/private_identifier.rs    0.00%     1 Missing ⚠️
... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3538      +/-   ##
==========================================
- Coverage   47.42%   47.37%   -0.05%     
==========================================
  Files         470      472       +2     
  Lines       45690    45643      -47     
==========================================
- Hits        21667    21625      -42     
+ Misses      24023    24018       -5     

Test262 conformance changes

Test result    main count    PR count    difference
Total          95,960        95,960      0
Passed         76,534        76,542      +8
Ignored        18,477        18,477      0
Failed         949           941         -8
Panics         0             0           0
Conformance    79.76%        79.76%      +0.01%
Fixed tests (8):
test/language/literals/regexp/S7.8.5_A1.4_T2.js [strict mode] (previously Failed)
test/language/literals/regexp/S7.8.5_A1.4_T2.js (previously Failed)
test/language/literals/regexp/S7.8.5_A2.1_T2.js [strict mode] (previously Failed)
test/language/literals/regexp/S7.8.5_A2.1_T2.js (previously Failed)
test/language/literals/regexp/S7.8.5_A2.4_T2.js [strict mode] (previously Failed)
test/language/literals/regexp/S7.8.5_A2.4_T2.js (previously Failed)
test/language/literals/regexp/S7.8.5_A1.1_T2.js [strict mode] (previously Failed)
test/language/literals/regexp/S7.8.5_A1.1_T2.js (previously Failed)

@raskad raskad requested a review from a team December 26, 2023 19:21
@raskad raskad added the waiting-on-review (Waiting on reviews from the maintainers) label Dec 26, 2023

@jedel1043 jedel1043 left a comment

Really great to finally have proper UTF-16 parsing! I just have some nitpicks that don't block merging.

core/parser/src/source/utf8.rs: 2 review threads (outdated, resolved)
@jedel1043 jedel1043 requested a review from a team January 5, 2024 00:21

@HalidOdat HalidOdat left a comment

Great work @raskad!

Just a nitpick :)

core/parser/src/lexer/string.rs: 1 review thread (outdated, resolved)
@raskad raskad added this pull request to the merge queue Jan 5, 2024
Merged via the queue into main with commit 84a5e45 Jan 5, 2024
14 checks passed
@raskad raskad deleted the lexer-utf-16 branch January 5, 2024 12:56
@jedel1043 jedel1043 removed the waiting-on-review (Waiting on reviews from the maintainers) label Sep 3, 2024