Incorrect handling of multibyte UTF-16 encodings #2080

Vtec234 · 2020-08-14T21:35:40Z

Describe the bug
It seems the handling of multibyte UTF-16 encodings is incorrect in lsp-mode.

To Reproduce
Create a file with just these contents: "🍋" - a single lemon emoji. Place the cursor after the lemon and type a character ("l", say, for lemon). Most emojis, including this one, are represented by two UTF-16 code units (a surrogate pair), so since LSP specifies offsets in terms of a UTF-16 string representation, the cursor is at column 2.

But lsp-mode sends column 1:

{"jsonrpc":"2.0","method":"textDocument/didChange","params":{"textDocument":{"uri":"file:///home/w/utf16.lean","version":1},"contentChanges":[{"range":{"start":{"line":0,"character":1},"end":{"line":0,"character":1}},"rangeLength":0,"text":"l"}]}}

Expected behavior
Compare with e.g. the VSCode sample client, which sends 2 as it should:

{"jsonrpc":"2.0","method":"textDocument/didChange","params":{"textDocument":{"uri":"file:///home/w/utf16.lean","version":56},"contentChanges":[{"range":{"start":{"line":0,"character":2},"end":{"line":0,"character":2}},"rangeLength":0,"text":"l"}]}}
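
For reference, the expected column can be checked outside the editor. This is a minimal Python sketch, not lsp-mode code; the helper name `utf16_column` is made up for illustration. It counts UTF-16 code units in the text before the cursor:

```python
def utf16_column(line: str, char_index: int) -> int:
    """Return the UTF-16 code-unit offset of char_index within line.

    Hypothetical helper for illustration only; lsp-mode itself is Emacs Lisp.
    """
    # Encoding without a BOM gives exactly 2 bytes per UTF-16 code unit,
    # so code points above U+FFFF (surrogate pairs) contribute 2 units.
    return len(line[:char_index].encode("utf-16-le")) // 2


line = "🍋"
print(len(line))              # number of code points before the cursor
print(utf16_column(line, 1))  # number of UTF-16 code units before the cursor
```

The first value corresponds to what lsp-mode sent, the second to what the VSCode client sent.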

Which Language Server did you use
Custom one added via the tutorial. lsp-mode version 7.0.1.

OS
Linux

matklad · 2021-01-19T12:21:32Z

We are hitting this in rust-analyzer: rust-lang/rust-analyzer#4263 (comment).

Out of curiosity, how hard would it be for Emacs to report offsets in utf8 coordinate space (I believe Emacs uses that internally)? I'd be willing to implement a protocol extension to use utf-8 offsets throughout.

I could, but I would be reluctant to implement unicode codepoint coordinates: utf8 skips coordinate transcoding altogether, while unicode would just replace "map utf8 to utf16" with "map utf8 to unicode", which imo isn't meaningfully different.

yyoncho · 2021-01-19T12:47:42Z

I don't think it will be hard to implement. We haven't fixed that issue because it has a relatively low impact and fixing it will affect performance.

bkchr · 2021-01-19T16:07:49Z

Currently the problem for me is, that this makes rust-analyzer panic :D

yyoncho · 2021-01-19T16:30:44Z

So, the priority of the issue now is bigger.

bkchr · 2021-01-26T13:09:37Z

@yyoncho could you maybe give me some pointer into the code? So, I could try to fix this issue?

yyoncho · 2021-01-26T13:52:35Z

@bkchr you have to fix both functions lsp--position-to-point and lsp--point-to-position. Also, AFAIK eglot has that function implemented.
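
As an illustration of the server-to-editor direction (what a function like lsp--position-to-point has to do for the column part), here is a hedged Python sketch. It is not the lsp-mode or eglot implementation, and `utf16_col_to_index` is a hypothetical name. It walks the line, counting UTF-16 code units until the requested column is reached:

```python
def utf16_col_to_index(line: str, utf16_col: int) -> int:
    """Map a UTF-16 code-unit column to a code-point index in line.

    Illustrative sketch only. A column that falls inside a surrogate pair
    is resolved to the position after that character, and a column past
    the end of the line is clamped to the line length.
    """
    units = 0
    for index, ch in enumerate(line):
        if units >= utf16_col:
            return index
        # Code points above U+FFFF occupy two UTF-16 code units.
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(line)


print(utf16_col_to_index("🍋l", 2))  # index of the "l" after the emoji
```

The clamping choices are assumptions; a real client may prefer to signal an error for out-of-range columns.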

nbfalcon · 2021-01-26T13:53:38Z

As a side note, clangd supports UTF-8 directly, if a special option is specified.

matklad · 2021-01-26T14:16:23Z

@nbfalcon thanks, I didn't realize that the extension already exists. I actually would prefer to fix this on the server-side then, to create social pressure to actually officially adopt that into the protocol.

EDIT: rust-lang/rust-analyzer#7453

nbfalcon · 2021-01-26T14:20:56Z

Here is the relevant source. @matklad I agree that this would be great, since many servers/editors/libraries assume UTF-8 (well, everything aside from VSCode and JavaScript). There is an upstream bug about this already: microsoft/language-server-protocol#376.

yyoncho · 2021-01-30T08:42:24Z

@bkchr

@yyoncho could you maybe give me some pointer into the code? So, I could try to fix this issue?

Did you start working on that?

bkchr · 2021-01-30T08:43:03Z

No

yyoncho · 2021-01-30T08:45:03Z

Ok, I will take a look.

yyoncho · 2021-01-31T08:32:01Z

@bkchr - pushed a proposed fix at - yyoncho@aac9b6a

bkchr · 2021-01-31T08:33:11Z

Nice ty

petr-tik · 2021-02-06T00:27:28Z

As a side note, clangd supports UTF-8 directly, if a special option is specified.

Now that we have integration tests for clangd-9 on linux, this could be added as a regression test that adds offsetEncoding: ["utf-8"] in clientCapabilities during initialization
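
A rough sketch of what such a test could send and check, written in Python rather than Emacs Lisp. It assumes the shape of clangd's extension as I understand it: an `offsetEncoding` list inside the client `capabilities`, and a single `offsetEncoding` string in the initialize result. The function names are hypothetical:

```python
import json


def initialize_params(root_uri: str) -> dict:
    """Build minimal initialize params advertising the clangd extension."""
    return {
        "processId": None,
        "rootUri": root_uri,
        "capabilities": {
            # Assumed extension field: encodings the client accepts,
            # in order of preference.
            "offsetEncoding": ["utf-8", "utf-16"],
        },
    }


def chosen_encoding(initialize_result: dict) -> str:
    """Read the encoding picked by the server from its initialize result."""
    # A server without the extension omits the field, in which case the
    # protocol default (UTF-16) applies.
    return initialize_result.get("offsetEncoding", "utf-16")


print(json.dumps(initialize_params("file:///tmp/project"), indent=2))
```

A regression test would then assert that the chosen encoding is "utf-8" and that positions after an emoji use byte offsets.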

itamarst · 2021-10-08T16:08:39Z

Hi, what's the status of this? Just had rust-analyzer crash due to adding emoji in my code (on lsp-mode from a couple weeks ago).

matklad · 2021-10-08T16:20:19Z

On rust-analyzer's side, we fully implemented clangd's extension for UTF-8 offsets, so, on Emacs side you can either:

do utf-16 translation, which needs some code and CPU time
expose underlying byte offsets at zero cost (I believe Emacs internally uses something sufficiently close to utf-8 for this to work)
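
To make the difference between the two options concrete, here is a small Python sketch (the helper `column_in` is hypothetical, and it does not model Emacs's internal buffer representation). It computes the column of the same cursor position in each coordinate space:

```python
def column_in(line: str, char_index: int, encoding: str) -> int:
    """Return the column of char_index in the given coordinate space."""
    prefix = line[:char_index]
    if encoding == "utf-8":
        # Byte offset: what a UTF-8 based buffer could report directly.
        return len(prefix.encode("utf-8"))
    if encoding == "utf-16":
        # Code-unit offset: requires transcoding the prefix.
        return len(prefix.encode("utf-16-le")) // 2
    if encoding == "utf-32":
        # Code-point offset.
        return len(prefix)
    raise ValueError(f"unsupported encoding: {encoding}")


line = "é🍋x"
for enc in ("utf-8", "utf-16", "utf-32"):
    print(enc, column_in(line, 2, enc))
```

For pure ASCII lines all three values coincide, which is why the bug only shows up with non-ASCII text.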

yyoncho · 2021-10-09T07:47:19Z

@itamarst the fix proposed at yyoncho@aac9b6a should work (after rebasing).

acowley · 2021-10-13T20:07:40Z

The patch @yyoncho links there seems to work for me with both clangd and rust-analyzer.

yyoncho · 2021-10-13T20:39:06Z

I have to benchmark it and, if it is slow, at least allow using utf8 when the server supports it.

wagk · 2022-07-06T03:06:22Z

Hi, I'm running into this issue right now, and I'm wondering what is currently blocking the patch from being merged into master?

scohen · 2022-11-16T04:21:46Z

@yyoncho Sorry for the ping, I've been working on the elixir language server, and just found this issue.

If the reason it hasn't been merged is solely due to performance, then I have a suggestion.
You don't need to encode every character that you visit, you merely need to encode non-ascii characters (those that are >= 128) as you find them. Given the fact that the vast, vast majority of source code is ascii text, this shouldn't be a problem perf-wise.
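
A minimal sketch of that idea in Python, not the proposed lsp-mode patch; `utf16_column_fast` is a made-up name. It assumes that only code points above U+FFFF need an extra code unit, so the prefix is never re-encoded:

```python
def utf16_column_fast(line: str, char_index: int) -> int:
    """UTF-16 column of char_index, with a fast path for ASCII text."""
    prefix = line[:char_index]
    if prefix.isascii():
        # Common case for source code: one code unit per character.
        return len(prefix)
    # Otherwise add one extra unit for each character outside the BMP,
    # since those are encoded as surrogate pairs.
    return len(prefix) + sum(1 for ch in prefix if ord(ch) > 0xFFFF)


print(utf16_column_fast("let x = 1;", 5))
print(utf16_column_fast("🍋l", 1))
```

Note that non-ASCII characters inside the BMP (for example "é") still count as a single UTF-16 code unit; only the slow path has to look at them.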

If you haven't merged the fix for some other reason, please ping me, this is holding me up and causes the elixir-ls to produce incorrect results under lsp-mode. Utf8 on the server isn't really a fix, since older clients, windows and the lsp spec require utf-16.

Also, thanks for lsp-mode! It's great.

matklad · 2022-11-16T08:49:22Z

the lsp spec require utf-16.

These days, LSP spec allows utf8 support.

It also formally requires utf16 support, but that seems like a bad design decision in the original protocol, so I personally would be quite happy with server/clients supporting only utf8 and pushing everyone else to do so =P
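
For context, a hedged sketch of how a server might pick an encoding under the LSP 3.17 negotiation, assuming the client lists its preferences in `general.positionEncodings` and that a missing list means UTF-16 only. The function name and the default `server_supported` tuple are illustrative:

```python
def negotiate_position_encoding(client_capabilities: dict,
                                server_supported=("utf-8", "utf-16")) -> str:
    """Pick the first client-preferred encoding the server also supports."""
    general = client_capabilities.get("general", {})
    # Clients that do not announce anything are treated as UTF-16 only.
    offered = general.get("positionEncodings") or ["utf-16"]
    for encoding in offered:
        if encoding in server_supported:
            return encoding
    # UTF-16 is the mandatory fallback.
    return "utf-16"


caps = {"general": {"positionEncodings": ["utf-8", "utf-16"]}}
print(negotiate_position_encoding(caps))
print(negotiate_position_encoding({}))
```

The server would report the result back in its own capabilities so both sides agree on the coordinate space.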

scohen · 2022-11-16T15:03:09Z

Allowing utf8 and requiring utf16 aren't mutually exclusive 😉
Agreed about utf16 being a bad decision, but that's what the spec says compliant clients need to support, and you should.
I don't see a reason to hold off other than performance, and I'll be glad to work with anyone to make the perf impact tiny. Indeed, if you don't reallocate the entire buffer, and only reallocate single multibyte characters when you encounter them, there will likely be no performance impact at all for most documents.
the fix is almost there, we can get it over the line

scohen · 2022-11-16T15:12:30Z

personally would be quite happy with server/clients supporting only utf8 and pushing everyone else to do so

To be quite clear, such servers wouldn't be compliant with the lsp spec. Maybe a future version would remove utf16 support, but for now, servers are stuck handling both.

"This is the default and must always be supported by servers". It's wildly annoying, but we (server contributors) have to support it. Can you please make our job a little easier?

scohen · 2022-11-18T23:01:48Z

@yyoncho can we have a resolution for this? This bug has been open for two years, and you have a fix for it that needs a tiny optimization. I’m willing to help get you there, but I don’t know elisp very well. I’d be willing to help via zoom chat, pop in to irc, submit pseudocode, anything.

Failing that, you should mark this as wontfix, and throw an exception when multibyte characters are encountered in a file when the server is in utf-16 mode, as well as clearly indicating that lsp-mode doesn’t support utf-16 ranges.

The current implementation is broken and causes emacs to produce invalid code. This is not a tenable situation.

teor2345 · 2023-02-03T04:40:50Z

I just ran into this issue today, it still crashes rust-analyzer.

Can we get at least one of these fixes applied?

correct offsets: yyoncho@aac9b6a
declaring UTF-8 lsp mode: Feature Request: UTF-8 support #3344 (comment)

flying-sheep · 2023-02-21T10:35:12Z

A workaround exists now with #3958

For spec compliance, utf-16 offset support still has to be implemented (this issue is not fixed)
If emacs internally uses a utf-8 buffer, utf-8 offset support would have the best performance with language servers that also use utf-8 buffers

ericdallo added the bug label Aug 14, 2020

mukovnin mentioned this issue Aug 30, 2020

Completion doesn't work if C source files contain some Unicode symbols. #2126

Closed

matklad mentioned this issue Jan 19, 2021

Incremental text sync panics rust-lang/rust-analyzer#4263

Closed

yyoncho self-assigned this Jan 30, 2021

bkchr mentioned this issue Feb 5, 2021

Implement utf-8 offsets extension from clangd rust-lang/rust-analyzer#7453

Closed

jturner314 mentioned this issue Feb 11, 2021

Unicode-related panic when editing comment rust-lang/rust-analyzer#7635

Closed

lnicola mentioned this issue Feb 23, 2021

rust-analyzer crashes when used with a file containing the vomit emoji (🤮) rust-lang/rust-analyzer#7761

Closed

yyoncho mentioned this issue May 18, 2021

CP1252 encoded documents cause lsp-mode to send invalid characters, crashing clangd #2870

Open

lnicola mentioned this issue Jun 3, 2021

Panic on textDocument/didChange containing a multi-byte unicode character rust-lang/rust-analyzer#9121

Closed

teor2345 mentioned this issue Jul 6, 2021

Fix Orchard implementation, refactor, and add more test vectors ZcashFoundation/zebra#2445

Merged

2 tasks

michaelmesser mentioned this issue Nov 1, 2021

Server sends incorrect column number idris-community/idris2-lsp#110

Open

bjorn3 mentioned this issue Nov 11, 2021

rust-analyzer crashes when inputting a particular character in a comment. rust-lang/rust-analyzer#10746

Closed

yyoncho mentioned this issue Feb 5, 2022

Inserting emoji in using Elixir LS causes lsp-mode to crash #3343

Closed

3 tasks

axelson mentioned this issue Feb 5, 2022

Feature Request: UTF-8 support #3344

Open

lnicola mentioned this issue Apr 9, 2022

cargo check diagnostics are mapped incorrectly with non-BMP codepoints rust-lang/rust-analyzer#11945

Closed

lnicola mentioned this issue Dec 2, 2022

Typing let c = '🐰 causes immediate rust-analyzer crash (before I can type the close quote) rust-lang/rust-analyzer#13709

Closed

david-christiansen mentioned this issue Oct 10, 2023

Support LSP encoding negotiation leanprover/lean4#2646

Draft

vikigenius mentioned this issue Sep 18, 2024

LSP ruff shows outdated syntax errors #4547

Closed

3 tasks
