Failure when converting to UTF-8 (WITH SAMPLE REPO) #7081

Closed
rm8x opened this issue Sep 3, 2020 · 5 comments

rm8x commented Sep 3, 2020

We received a crash report with the following truncated stack trace:

Fatal Exception: java.lang.IllegalArgumentException: Illegal Argument: Failure when converting to UTF-8; error_code = 6;  0x0050 0x0061 0x0070 0x0061 0x0020 0x0047 0x0020 0xd83c
Exception backtrace:
<backtrace not supported on this platform> in /Users/cm/Realm/realm-java/realm/realm-library/src/main/cpp/io_realm_internal_Table.cpp line 798
       at io.realm.internal.Table.nativeFindFirstString(Table.java)
       at io.realm.internal.Table.findFirstString(Table.java:583)
       at io.realm.com_coolapp_CoolObjectRealmProxy.copyOrUpdate(com_coolapp_CoolObjectRealmProxy.java:507)
       at io.realm.LibraryModuleMediator.copyOrUpdate(LibraryModuleMediator.java:105)
       at io.realm.Realm.copyOrUpdate(Realm.java:1700)
       at io.realm.Realm.copyToRealmOrUpdate(Realm.java:1296)

Steps & Code to Reproduce

We were able to reproduce this by attempting to insert the following string into realm:
val problemString = "\uD83D"

This is the workaround we are using at present:
String(badText.toByteArray(Charsets.UTF_8))
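For readers on plain Java, the same round-trip workaround can be sketched as follows. The class and method names are hypothetical; the only library behavior relied on is that `String.getBytes(Charset)` replaces malformed input, such as an unpaired surrogate, with the charset's replacement byte (`?` for UTF-8).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Realm.
final class Utf16Sanitizer {
    private Utf16Sanitizer() {}

    /**
     * Round-trips the text through UTF-8. Well-formed text comes back
     * unchanged; each unpaired surrogate is replaced by '?', so the
     * original code unit is lost.
     */
    static String sanitize(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

Applied to the string from the stack trace above ("Papa G " followed by the lone surrogate 0xd83c), this would yield "Papa G ?". The conversion is lossy, so it is best applied only to strings from untrusted sources.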

Version of Realm and tooling

Realm version(s): 7.0.2

Repo demonstrating crash

https://github.com/rm8x/realm-utf8-issue

@clementetb
Collaborator

Error code 6 is raised when Realm finds an incomplete surrogate pair while converting UTF-16 text to UTF-8. \uD83D is a high surrogate, which is not valid on its own: it must be followed by a low surrogate to form a complete pair. This is part of the UTF-16 specification.
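As an illustration of the rule described above, here is a minimal Java sketch that locates the first unpaired surrogate in a string. It is not Realm's implementation (the strict check lives in native code); the class and method names are hypothetical, and only the standard `Character.isHighSurrogate` and `Character.isLowSurrogate` methods are used.

```java
// Hypothetical helper, not part of Realm.
final class SurrogateCheck {
    private SurrogateCheck() {}

    /**
     * Returns the index of the first unpaired surrogate code unit,
     * or -1 if the text is well-formed UTF-16.
     */
    static int firstUnpairedSurrogate(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)) {
                boolean hasLow = i + 1 < text.length()
                        && Character.isLowSurrogate(text.charAt(i + 1));
                if (!hasLow) {
                    return i; // high surrogate with no low surrogate after it
                }
                i++; // skip the low half of a valid pair
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate with no high surrogate before it
            }
        }
        return -1;
    }
}
```

A check like this could be run before writing a string, to log or reject bad input rather than silently altering it.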

Our UTF-8 implementation is strict in order to keep compatibility between platforms. Java has the same issue, but instead of raising an exception it fails silently, replacing each incomplete surrogate with the value 63 (the ASCII code for the character ?).

Example of Java behavior with several incomplete surrogate characters:
"\uD83D\uD831\uD85D\uD93D".toByteArray(Charsets.UTF_8) => [63, 63, 63, 63] => "????"

Incomplete surrogates might come from incomplete backend data or incorrect string manipulation. The workaround you propose is not really a workaround; it is what you are supposed to do if you want to be safe.

bfranks commented Sep 9, 2020

Is there a reason Realm couldn't fall back to the Java UTF-8 conversion when this error is encountered? The case where I see this crash involves a value returned by the Android contacts provider. Since Realm is already converting to UTF-8, I think it makes sense for that conversion to happen behind the scenes, rather than requiring us to sanitize potentially every string to ensure it is valid.

@clementetb
Collaborator

@bfranks Our UTF implementation is strict because we must guarantee compatibility between platforms.

I understand your frustration. These errors show that you don't have valid UTF strings, which means you are already losing information because of the incomplete surrogates, and introducing unintended characters.

We have raised your concern with the core team. In the meantime, we have made the error messages for UTF encoding errors more descriptive.

One more question: what is the source of these strings? Do you manipulate them yourself?

bfranks commented Sep 11, 2020

We do not manipulate the problematic strings ourselves. The flow is as follows:

  • Get the device contacts from the Android system via a provider (a query on the device's contacts)
  • Do some formatting of phone numbers and joining of contacts to allow fast lookup of contact names by number

We have found that a few of the contact names contain these invalid surrogate pairs, which is annoying since the OS itself is returning the invalid characters. As such, we now sanitize all contact names using toByteArray(Charsets.UTF_8), which seems to be the only reasonable solution.

What I'm suggesting is that, when a string is detected as not strictly valid, the Java portion of the library retries instead of returning an error: it would convert the problematic field to UTF-8 using the Java platform method, since that is my interpretation of what Java expects to happen. This would allow the core UTF-8 implementation to remain strict without introducing compatibility problems, while still adhering to standard Java behaviour. It would also make sense for this to be an optional config field on the Realm instance that enables the behaviour (e.g. saveAllowingStringDataLoss, or something equally scary sounding, that is off by default).
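The retry idea could be sketched at the application level as follows. This is only an illustration under stated assumptions: `LossyRetry`, `writeWithFallback`, and the `allowDataLoss` flag are hypothetical names, and `strictWrite` stands in for any operation that throws `IllegalArgumentException` on malformed UTF-16, as the Realm call in the stack trace does. It does not reflect any existing Realm API.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical wrapper, not part of Realm.
final class LossyRetry {
    private LossyRetry() {}

    /**
     * Tries the strict write first. If it rejects the value and data loss
     * has been opted into, retries once with the value round-tripped
     * through Java's lenient UTF-8 encoder (unpaired surrogates become '?').
     */
    static <R> R writeWithFallback(String value,
                                   boolean allowDataLoss,
                                   Function<String, R> strictWrite) {
        try {
            return strictWrite.apply(value);
        } catch (IllegalArgumentException e) {
            if (!allowDataLoss) {
                throw e; // default: keep the strict behaviour
            }
            String lossy = new String(
                    value.getBytes(StandardCharsets.UTF_8),
                    StandardCharsets.UTF_8);
            return strictWrite.apply(lossy);
        }
    }
}
```

Keeping the flag off by default preserves the strict behaviour unless the caller explicitly accepts that invalid code units will be replaced.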

@cmelchior
Contributor

@bfranks I understand this can be an annoying issue. We would rather not introduce an automatic conversion unless it has been opted into, since it can have consequences on other platforms that might read the strings. However, we have also talked internally about adding a configuration option to RealmConfiguration, with a name like automaticConvertIllegalUTF16() or useJavaUTF16CompatibilityRules().

I have created #7101 to track this feature so we can discuss exactly how to solve it there. In the meantime, we also modified the error messages in #7093, so it should now be much clearer exactly what is going on.

I'm going to close this issue as fixed through #7093, with a potentially better solution being tracked in #7101.
