Failure when converting to UTF-8 (WITH SAMPLE REPO) #7081
Comments
Our UTF-8 implementation is strict in order to keep compatibility between platforms. Java has the same issue, but instead of raising an exception it fails silently by substituting a replacement value for any incomplete surrogate characters.
Example of Java behavior with several incomplete surrogate characters:
Incomplete surrogates might come from incomplete backend data or incorrect string manipulation. The work-around you propose is not really a work-around; it is what you are supposed to do if you want to be safe.
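To make the contrast between strict and silent handling concrete, here is a minimal Kotlin sketch using only JVM standard library calls. The function names `encodesStrictly` and `encodeLeniently` are made up for illustration and are not part of Realm. It assumes the JVM's default behavior: a fresh `CharsetEncoder` reports malformed input, while `String.toByteArray` substitutes the encoder's replacement byte.

```kotlin
import java.nio.CharBuffer
import java.nio.charset.CharacterCodingException

// Strict: a fresh CharsetEncoder reports malformed input, so encoding a string
// with an unpaired surrogate throws and this function returns false.
fun encodesStrictly(text: String): Boolean = try {
    Charsets.UTF_8.newEncoder().encode(CharBuffer.wrap(text))
    true
} catch (e: CharacterCodingException) {
    false
}

// Lenient: String.toByteArray never throws; each unpaired surrogate is
// silently replaced by the encoder's replacement byte.
fun encodeLeniently(text: String): ByteArray = text.toByteArray(Charsets.UTF_8)
```

Calling both functions on `"\uD83D"` shows the difference: the strict check rejects the string, while the lenient path returns bytes without any indication that data was lost.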
Is there a reason why Realm wouldn't handle this by falling back to the Java UTF-8 conversion when this error is encountered? The use case where I see this crash is a value returned by the Android contacts provider. I think it makes sense to allow that conversion to happen behind the scenes, rather than requiring us to sanitize possibly every string to ensure it is actually UTF-8 compliant, since Realm is already doing a conversion to UTF-8.
@bfranks Our UTF implementation is strict since we must guarantee compatibility between platforms. I understand your frustration. These errors show that you don't have valid UTF strings, which means that you are losing information because of incomplete surrogates and introducing unintended characters. We have raised your concern with the core team. For now, we have improved the error messaging for UTF encoding errors to be more descriptive. One more question: what is the source of such strings? Do you manipulate them yourself?
We do not manipulate the problematic strings ourselves. The flow is as follows:
We have found that a few of the contact names have these invalid surrogate pairs, which is annoying since the OS is returning these invalid characters. As such, we now sanitize all contact names using toByteArray(Charsets.UTF_8), which seems to be the only reasonable solution. What I'm suggesting is that, when a string is detected as not strictly valid UTF-8, instead of returning an error the Java portion of the library retries by converting the problematic field to UTF-8 using the Java platform method, since that is my interpretation of what Java expects to happen. This would allow the core UTF-8 handling to stay strict and not introduce compatibility problems while still adhering to standard Java behaviour. It would also make sense for this to be an optional config field on the Realm instance that enables this behaviour (i.e. saveAllowingStringDataLoss or something equally scary sounding, off by default).
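The retry-on-failure idea could look roughly like the sketch below. This is only an illustration of the proposal, not Realm's API: `encodeForStorage` and its `allowStringDataLoss` flag are hypothetical names, and the strict step is approximated with the JVM's own `CharsetEncoder` rather than Realm's core implementation.

```kotlin
import java.nio.CharBuffer
import java.nio.charset.CharacterCodingException

// Hypothetical helper, not a Realm API: encode strictly first, and fall back
// to the JVM's lossy conversion only when the caller has opted in.
fun encodeForStorage(text: String, allowStringDataLoss: Boolean = false): ByteArray {
    return try {
        // A fresh encoder reports malformed input by throwing.
        val buffer = Charsets.UTF_8.newEncoder().encode(CharBuffer.wrap(text))
        ByteArray(buffer.remaining()).also { buffer.get(it) }
    } catch (e: CharacterCodingException) {
        if (!allowStringDataLoss) {
            throw IllegalArgumentException("String contains an unpaired surrogate", e)
        }
        // Lossy path: unpaired surrogates become the encoder's replacement byte.
        text.toByteArray(Charsets.UTF_8)
    }
}
```

With the flag left at its default, an invalid string still fails loudly; passing `allowStringDataLoss = true` accepts the data loss explicitly, which keeps the strict behavior as the safe default.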
@bfranks I understand this can be an annoying issue. We would rather not introduce an automatic conversion unless it has been opted into, since it can have consequences on other platforms that might read the strings, but we also talked internally about having a configuration option for this. I have created #7101 to track this feature so we can discuss exactly how to solve it there. In the meantime, we also modified the error messages in #7093, so it should now be a lot clearer exactly what is going on. I'm going to close this issue as fixed through #7093, with a potentially better solution being tracked through #7101.
We received a crash report with the following truncated stack trace:
Steps & Code to Reproduce
We were able to reproduce this by attempting to insert the following string into realm:
val problemString = "\uD83D"
This is the workaround we are using at present:
String(badText.toByteArray(Charsets.UTF_8))
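Wrapped in a small helper, the workaround might look like the sketch below. The names `sanitizeForRealm` and `losesData` are hypothetical; the body is just the round trip through UTF-8 bytes shown above, with the charset made explicit on both sides. Note that the round trip is lossy: each unpaired surrogate comes back as a replacement character, so the second function is a way to detect that a string was changed.

```kotlin
// Hypothetical helper name; the body is the round-trip workaround from this report.
// Unpaired surrogates are replaced during encoding, so the result is safe to store.
fun sanitizeForRealm(text: String): String =
    String(text.toByteArray(Charsets.UTF_8), Charsets.UTF_8)

// Returns true when sanitizing changed the string, i.e. data was lost.
fun losesData(text: String): Boolean = sanitizeForRealm(text) != text
```

Valid strings, including properly paired surrogates such as emoji, pass through unchanged, so the helper can be applied to every incoming contact name.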
Version of Realm and tooling
Realm version(s): 7.0.2
Repo demonstrating crash
https://github.com/rm8x/realm-utf8-issue