Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect handling of multibyte UTF-16 encodings #2080

Open
Vtec234 opened this issue Aug 14, 2020 · 28 comments
Open

Incorrect handling of multibyte UTF-16 encodings #2080

Vtec234 opened this issue Aug 14, 2020 · 28 comments
Assignees
Labels

Comments

@Vtec234
Copy link
Contributor

Vtec234 commented Aug 14, 2020

Describe the bug
It seems the handling of multibyte UTF-16 encodings is incorrect in lsp-mode.

To Reproduce
Create a file with just these contents: "🍋" - a single lemon emoji. Place the cursor after the lemon and type a character ("l", say, for lemon). Most emojis including this one are represented by two UTF-16 bytes, so since LSP specifies offsets as in a UTF-16 string representation, this is at column 2.

But lsp-mode sends column 1:

{"jsonrpc":"2.0","method":"textDocument/didChange","params":{"textDocument":{"uri":"file:///home/w/utf16.lean","version":1},"contentChanges":[{"range":{"start":{"line":0,"character":1},"end":{"line":0,"character":1}},"rangeLength":0,"text":"l"}]}}

Expected behavior
Compare with e.g. the VSCode sample client, which sends 2 as it should:

{"jsonrpc":"2.0","method":"textDocument/didChange","params":{"textDocument":{"uri":"file:///home/w/utf16.lean","version":56},"contentChanges":[{"range":{"start":{"line":0,"character":2},"end":{"line":0,"character":2}},"rangeLength":0,"text":"l"}]}}

Which Language Server did you use
Custom one added via the tutorial. lsp-mode version 7.0.1.

OS
Linux

@matklad
Copy link
Contributor

matklad commented Jan 19, 2021

We are hitting this in rust-analyzer: rust-lang/rust-analyzer#4263 (comment).

Out of curiosity, how hard would it be for Emacs to report offsets in ut8 coordinate space (I believe Emacs uses that internally)? I'd be willing to implement a protocol extension to use utf-8 offsets throughout.

I could, but I would be reluctant to implement unicode codepoints coordinates: utt8 skips coordinates transcoding altogether, while unicode would just replace "map utf8 to utf16" with "map utf8 to unicode", which imo isn't meaningfully different.

@yyoncho
Copy link
Member

yyoncho commented Jan 19, 2021

I don't think it will be hard to implement. We havennt fixed that issue due to the fact that it has a relatively low impact and it will affect performance.

@bkchr
Copy link
Contributor

bkchr commented Jan 19, 2021

Currently the problem for me is, that this makes rust-analyzer panic :D

@yyoncho
Copy link
Member

yyoncho commented Jan 19, 2021

So, the priority of the issue now is bigger.

@bkchr
Copy link
Contributor

bkchr commented Jan 26, 2021

@yyoncho could you maybe give me some pointer into the code? So, I could try to fix this issue?

@yyoncho
Copy link
Member

yyoncho commented Jan 26, 2021

@bkchr you have to fix both functions lsp--position-to-point and lsp--point-to-position. Also, AFAIK eglot has that function implemented.

@nbfalcon
Copy link
Member

As a side note, clangd supports UTF-8 directly, if a special option is specified.

@matklad
Copy link
Contributor

matklad commented Jan 26, 2021

@nbfalcon thanks, I didn't realize that the extension already exists. I actually would prefer to fix this on the server-side than, to create social pressure to actually officially adopt that into the protocol.

EDIT: rust-lang/rust-analyzer#7453

@nbfalcon
Copy link
Member

nbfalcon commented Jan 26, 2021

Here is the relevant source. @matklad I agree that this would be great, since many servers/editors/libraries assume UTF-8 (well, everything aside from VSCode and JavaScript). There is an upstream bug about this already: microsoft/language-server-protocol#376.

@yyoncho
Copy link
Member

yyoncho commented Jan 30, 2021

@bkchr

@yyoncho could you maybe give me some pointer into the code? So, I could try to fix this issue?

Did you start working on that?

@bkchr
Copy link
Contributor

bkchr commented Jan 30, 2021

No

@yyoncho yyoncho self-assigned this Jan 30, 2021
@yyoncho
Copy link
Member

yyoncho commented Jan 30, 2021

Ok, I will take a look.

@yyoncho
Copy link
Member

yyoncho commented Jan 31, 2021

@bkchr - pushed a proposed fix at - yyoncho@aac9b6a

@bkchr
Copy link
Contributor

bkchr commented Jan 31, 2021

Nice ty

@petr-tik
Copy link
Contributor

petr-tik commented Feb 6, 2021

As a side note, clangd supports UTF-8 directly, if a special option is specified.

Now that we have integration tests for clangd-9 on linux, this could be added as a regression test that adds offsetEncoding: ["utf-8"] in clientCapabilities during initialization

@itamarst
Copy link
Contributor

itamarst commented Oct 8, 2021

Hi, what's the status of this? Just had rust-analyzer crash due to adding emoji in my code (on lsp-mode from a couple weeks ago).

@matklad
Copy link
Contributor

matklad commented Oct 8, 2021

On rust-analyzer's side, we fully implemented clangd's extension for UTF-8 offsets, so, on Emacs side you an either:

  • do utf-16 translation, which needs some code and CPU time
  • expose underlying byte offsets at zero cost (I believe Emacs internally uses something sufficiently close to utf-8 for this to work)

@yyoncho
Copy link
Member

yyoncho commented Oct 9, 2021

@itamarst the fix proposed at yyoncho@aac9b6a should work(after rebasing).

@acowley
Copy link
Contributor

acowley commented Oct 13, 2021

The patch @yyoncho links there seems to work for me with both clangd and rust-analyzer.

@yyoncho
Copy link
Member

yyoncho commented Oct 13, 2021

I have to benchmark it and if it is slow at least allow using utf8 when server supports it

@wagk
Copy link

wagk commented Jul 6, 2022

Hi, I'm running into this issue right now, and I'm wondering what is currently blocking the patch from being merged into master?

@scohen
Copy link

scohen commented Nov 16, 2022

@yyoncho Sorry for the ping, I've been working on the elixir language server, and just found this issue.

If the reason it hasn't been merged is solely due to performance, then I have a suggestion.
You don't need to encode every character that you visit, you merely need to encode non-ascii characters (those that are >= 128) as you find them. Given the fact that the vast, vast majority of source code is ascii text, this shouldn't be a problem perf-wise.

If you haven't merged the fix for some other reason, please ping me, this is holding me up and causes the elixir-ls to produce incorrect results under lsp-mode. Utf8 on the server isn't really a fix, since older clients, windows and the lsp spec require utf-16.

Also, thanks for lsp-mode! It's great.

@matklad
Copy link
Contributor

matklad commented Nov 16, 2022

the lsp spec require utf-16.

These days, LSP spec allows utf8 support.

It also formally requires utf16 support, but that seems like a bad design decision in the original protocol, so I personally would be quite happy with server/clients supporting only utf8 and pushing everyone else to do so =P

@scohen
Copy link

scohen commented Nov 16, 2022

Allowing utf8 and requiring utf16 aren't mutually exclusive 😉
Agreed about utf16 being a bad decision, but that's what the spec says compliant clients need to support, and you should.
I don't see a reason to hold off other than performance, and I'll be glad to work with anyone to make the perf impact tiny. Indeed, if you don't reallocate the entire buffer, and only reallocate single multibyte characters when you encounter them, there will likely be no performance impact at all for most documents.
the fix is almost there, we can get it over the line

@scohen
Copy link

scohen commented Nov 16, 2022

personally would be quite happy with server/clients supporting only utf8 and pushing everyone else to do so

To be quite clear, such servers wouldn't be compliant with the lsp spec. Maybe a future version would remove utf16 support, but for now, servers are stuck handling both.

"This is the default and must always be supported by servers". It's wildly annoying, but we (server contributors) have to support it. Can you please make our job a little easier?

@scohen
Copy link

scohen commented Nov 18, 2022

@yyoncho can we have a resolution for this? This bug has been open for two years, and you have a fix for it that needs a tiny optimization. I’m willing to help get you there, but I don’t know elisp very well. I’d be willing to help via zoom chat, pop in to irc, submit pseudocode, anything.

Failing that, you should mark this as wontfix, and throw an exception when multibyte characters are encountered in a file when the server is in utf-16 mode, as well as clearly indicating that lsp-mode doesn’t support utf-16 ranges.

The current implementation is broken and causes emacs to produce invalid code. This is not a tenable situation.

@teor2345
Copy link

teor2345 commented Feb 3, 2023

I just ran into this issue today, it still crashes rust-analyzer.

Can we get at least one of these fixes applied?

@flying-sheep
Copy link

flying-sheep commented Feb 21, 2023

A workaround exists now with #3958

  • For spec compliance, utf-16 offset support still has to be implemented (this issue is not fixed)
  • If emacs internally uses a utf-8 buffer, utf-8 offset support would have the best performance with language servers that also use utf-8 buffers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests