-
-
Notifications
You must be signed in to change notification settings - Fork 884
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incorrect handling of multibyte UTF-16 encodings #2080
Comments
We are hitting this in rust-analyzer: rust-lang/rust-analyzer#4263 (comment). Out of curiosity, how hard would it be for Emacs to report offsets in ut8 coordinate space (I believe Emacs uses that internally)? I'd be willing to implement a protocol extension to use utf-8 offsets throughout. I could, but I would be reluctant to implement unicode codepoints coordinates: utt8 skips coordinates transcoding altogether, while unicode would just replace "map utf8 to utf16" with "map utf8 to unicode", which imo isn't meaningfully different. |
I don't think it will be hard to implement. We havennt fixed that issue due to the fact that it has a relatively low impact and it will affect performance. |
Currently the problem for me is, that this makes rust-analyzer panic :D |
So, the priority of the issue now is bigger. |
@yyoncho could you maybe give me some pointer into the code? So, I could try to fix this issue? |
@bkchr you have to fix both functions lsp--position-to-point and lsp--point-to-position. Also, AFAIK eglot has that function implemented. |
As a side note, |
@nbfalcon thanks, I didn't realize that the extension already exists. I actually would prefer to fix this on the server-side than, to create social pressure to actually officially adopt that into the protocol. |
Here is the relevant source. @matklad I agree that this would be great, since many servers/editors/libraries assume UTF-8 (well, everything aside from VSCode and JavaScript). There is an upstream bug about this already: microsoft/language-server-protocol#376. |
No |
Ok, I will take a look. |
@bkchr - pushed a proposed fix at - yyoncho@aac9b6a |
Nice ty |
Now that we have integration tests for clangd-9 on linux, this could be added as a regression test that adds |
Hi, what's the status of this? Just had rust-analyzer crash due to adding emoji in my code (on lsp-mode from a couple weeks ago). |
On rust-analyzer's side, we fully implemented clangd's extension for UTF-8 offsets, so, on Emacs side you an either:
|
@itamarst the fix proposed at yyoncho@aac9b6a should work(after rebasing). |
The patch @yyoncho links there seems to work for me with both |
I have to benchmark it and if it is slow at least allow using utf8 when server supports it |
Hi, I'm running into this issue right now, and I'm wondering what is currently blocking the patch from being merged into master? |
@yyoncho Sorry for the ping, I've been working on the elixir language server, and just found this issue. If the reason it hasn't been merged is solely due to performance, then I have a suggestion. If you haven't merged the fix for some other reason, please ping me, this is holding me up and causes the elixir-ls to produce incorrect results under lsp-mode. Utf8 on the server isn't really a fix, since older clients, windows and the lsp spec require utf-16. Also, thanks for lsp-mode! It's great. |
These days, LSP spec allows utf8 support. It also formally requires utf16 support, but that seems like a bad design decision in the original protocol, so I personally would be quite happy with server/clients supporting only utf8 and pushing everyone else to do so =P |
Allowing utf8 and requiring utf16 aren't mutually exclusive 😉 |
To be quite clear, such servers wouldn't be compliant with the lsp spec. Maybe a future version would remove utf16 support, but for now, servers are stuck handling both. "This is the default and must always be supported by servers". It's wildly annoying, but we (server contributors) have to support it. Can you please make our job a little easier? |
@yyoncho can we have a resolution for this? This bug has been open for two years, and you have a fix for it that needs a tiny optimization. I’m willing to help get you there, but I don’t know elisp very well. I’d be willing to help via zoom chat, pop in to irc, submit pseudocode, anything. Failing that, you should mark this as wontfix, and throw an exception when multibyte characters are encountered in a file when the server is in utf-16 mode, as well as clearly indicating that lsp-mode doesn’t support utf-16 ranges. The current implementation is broken and causes emacs to produce invalid code. This is not a tenable situation. |
I just ran into this issue today, it still crashes rust-analyzer. Can we get at least one of these fixes applied?
|
A workaround exists now with #3958
|
Describe the bug
It seems the handling of multibyte UTF-16 encodings is incorrect in
lsp-mode
.To Reproduce
Create a file with just these contents: "🍋" - a single lemon emoji. Place the cursor after the lemon and type a character ("l", say, for lemon). Most emojis including this one are represented by two UTF-16 bytes, so since LSP specifies offsets as in a UTF-16 string representation, this is at column 2.
But
lsp-mode
sends column 1:Expected behavior
Compare with e.g. the VSCode sample client, which sends 2 as it should:
Which Language Server did you use
Custom one added via the tutorial.
lsp-mode
version 7.0.1.OS
Linux
The text was updated successfully, but these errors were encountered: