Failure when converting to UTF-8 (WITH SAMPLE REPO) #7081

rm8x · 2020-09-03T19:19:25Z

We received a crash report with the following truncated stack trace:

Fatal Exception: java.lang.IllegalArgumentException: Illegal Argument: Failure when converting to UTF-8; error_code = 6;  0x0050 0x0061 0x0070 0x0061 0x0020 0x0047 0x0020 0xd83c
Exception backtrace:
<backtrace not supported on this platform> in /Users/cm/Realm/realm-java/realm/realm-library/src/main/cpp/io_realm_internal_Table.cpp line 798
       at io.realm.internal.Table.nativeFindFirstString(Table.java)
       at io.realm.internal.Table.findFirstString(Table.java:583)
       at io.realm.com_coolapp_CoolObjectRealmProxy.copyOrUpdate(com_coolapp_CoolObjectRealmProxy.java:507)
       at io.realm.LibraryModuleMediator.copyOrUpdate(LibraryModuleMediator.java:105)
       at io.realm.Realm.copyOrUpdate(Realm.java:1700)
       at io.realm.Realm.copyToRealmOrUpdate(Realm.java:1296)

Steps & Code to Reproduce

We were able to reproduce this by attempting to insert the following string into Realm:
val problemString = "\uD83D"

This is the workaround we are using at present:
String(badText.toByteArray(Charsets.UTF_8))
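
For callers on the Java side, the same round trip can be sketched as below. The class and method names (`Utf8RoundTrip`, `sanitize`) are made up for illustration; only `String.getBytes(Charset)` and the `String(byte[], Charset)` constructor are real JDK calls. Note that the conversion is lossy: each unpaired surrogate comes back as a literal `?`, while well-formed pairs are preserved.

```java
import java.nio.charset.StandardCharsets;

final class Utf8RoundTrip {
    /**
     * Encodes to UTF-8 and decodes back. getBytes(Charset) replaces every
     * unpaired surrogate with '?', so the result is always encodable.
     */
    static String sanitize(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(sanitize("\uD83D"));             // prints "?"
        System.out.println(sanitize("ok").equals("ok"));    // prints "true"
    }
}
```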

Version of Realm and tooling

Realm version(s): 7.0.2

Repo demonstrating crash

https://github.com/rm8x/realm-utf8-issue


clementetb · 2020-09-09T11:29:49Z

Error code 6 is raised when Realm finds an incomplete surrogate pair while converting UTF-16 text to UTF-8. \uD83D is a high surrogate, which is not valid on its own: it must be followed by a low surrogate to form a complete pair. This is part of the UTF-16 specification.
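
To make the rule concrete, here is a minimal sketch of a scan for unpaired surrogates. It is not Realm's implementation; `SurrogateCheck` and `firstUnpairedSurrogate` are hypothetical names, and the check relies only on the JDK's `Character.isHighSurrogate` and `Character.isLowSurrogate`. The first sample string matches the code units shown in the crash report.

```java
final class SurrogateCheck {
    /** Returns the index of the first unpaired surrogate in s, or -1 if there is none. */
    static int firstUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair: skip its low half
                } else {
                    return i; // high surrogate with no low surrogate after it
                }
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate with no high surrogate before it
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstUnpairedSurrogate("Papa G \uD83C"));       // prints 7
        System.out.println(firstUnpairedSurrogate("Papa G \uD83C\uDF89")); // prints -1
    }
}
```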

Our UTF-8 implementation is strict in order to keep compatibility between platforms. Java runs into the same problem, but instead of raising an exception it fails silently, replacing each incomplete surrogate with the value 63 (the ASCII code for the character ?).

Example of Java behavior with several incomplete surrogate characters:
"\uD83D\uD831\uD85D\uD93D".toByteArray(Charsets.UTF_8) => [63, 63, 63, 63] => "????"
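
The difference between the two behaviors can also be reproduced with the JDK alone. The sketch below is an illustration, not Realm code: `encodeLenient` and `encodeStrict` are hypothetical helper names. The strict variant uses a `CharsetEncoder` with `CodingErrorAction.REPORT`, which roughly mirrors a strict converter by throwing instead of substituting.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LenientVsStrict {
    /** Lenient: String.getBytes(Charset) replaces each unpaired surrogate with '?' (63). */
    static byte[] encodeLenient(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Strict: an encoder set to REPORT throws on malformed input such as a lone surrogate. */
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buf = encoder.encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        String broken = "\uD83D\uD831\uD85D\uD93D";
        System.out.println(Arrays.toString(encodeLenient(broken))); // prints [63, 63, 63, 63]
        try {
            encodeStrict(broken);
        } catch (CharacterCodingException e) {
            System.out.println("strict encoder rejected the string");
        }
    }
}
```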

Incomplete surrogates might come from incomplete backend data or incorrect string manipulation. The workaround you propose is not really a workaround; it is what you are supposed to do if you want to be safe.

bfranks · 2020-09-09T21:43:11Z

Is there a reason why Realm wouldn't fall back to the Java UTF-8 conversion when this error is encountered? The case where I see this crash is a value returned by the Android contacts provider. Since Realm is already converting to UTF-8, I think it makes sense to let that conversion happen behind the scenes, rather than requiring us to sanitize possibly every string to make sure it is actually UTF-8 compliant.

clementetb · 2020-09-10T11:18:24Z

@bfranks Our UTF implementation is strict because we must guarantee compatibility between platforms.

I understand your frustration. These errors show that you don't have valid UTF strings. This means that you are losing information because of the incomplete surrogates, and introducing unintended characters.

We have raised your concern with the core team. For now, we have improved the error messages for UTF encoding errors to be more descriptive.

One more question: what is the source of these strings? Do you manipulate them yourself?

bfranks · 2020-09-11T17:06:12Z

We do not manipulate the problematic strings ourselves. The flow is as follows:

1. Get the device contacts from the Android system via a provider (this runs a query on the device's contacts).
2. Do some formatting of phone numbers, and join the contacts to allow a fast lookup of contact names by number.

We have found that a few of the contact names contain these invalid surrogate pairs, which is annoying since it is the OS that returns the invalid characters. As such, we now sanitize all contact names using toByteArray(Charsets.UTF_8), which seems to be the only reasonable solution.

What I'm suggesting is that, when a string is detected that cannot be strictly converted to UTF-8, the Java portion of the library retries instead of returning an error, converting the problematic field to UTF-8 with the Java platform method, since that is my interpretation of what Java expects to happen. This would allow the core UTF-8 handling to stay strict and not introduce compatibility problems, while still adhering to standard Java behaviour. It would also make sense for this to be an optional config field on the Realm instance that enables the behaviour (e.g. saveAllowingStringDataLoss, or something equally scary sounding, that can be off by default).
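
A rough sketch of that retry idea, written outside of Realm, could look like the following. Everything here is hypothetical: `FallbackWriter`, the `allowStringDataLoss` flag, and `strictInto` are made-up names, and the `Consumer<String>` is only a stand-in for a strict write path that throws `IllegalArgumentException` the way the crash report shows. None of it is existing Realm API.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class FallbackWriter {
    private final boolean allowStringDataLoss;  // hypothetical opt-in flag, off by default in spirit
    private final Consumer<String> strictStore; // hypothetical stand-in for the strict write path

    FallbackWriter(boolean allowStringDataLoss, Consumer<String> strictStore) {
        this.allowStringDataLoss = allowStringDataLoss;
        this.strictStore = strictStore;
    }

    /** Tries the strict path first; retries with the lossy Java conversion only if opted in. */
    void save(String value) {
        try {
            strictStore.accept(value);
        } catch (IllegalArgumentException e) {
            if (!allowStringDataLoss) {
                throw e; // keep the strict behavior unless the caller opted in
            }
            // Lossy retry: unpaired surrogates become '?'.
            strictStore.accept(
                    new String(value.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8));
        }
    }

    /** Stand-in strict store: rejects strings that a REPORT-mode UTF-8 encoder cannot encode. */
    static Consumer<String> strictInto(List<String> sink) {
        return s -> {
            try {
                StandardCharsets.UTF_8.newEncoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .encode(CharBuffer.wrap(s));
            } catch (CharacterCodingException e) {
                throw new IllegalArgumentException("Failure when converting to UTF-8", e);
            }
            sink.add(s);
        };
    }

    public static void main(String[] args) {
        List<String> stored = new ArrayList<>();
        new FallbackWriter(true, strictInto(stored)).save("Papa G \uD83C");
        System.out.println(stored); // prints [Papa G ?]
    }
}
```

Catching every `IllegalArgumentException` is deliberately crude in this sketch; a real implementation would need a way to tell encoding failures apart from other argument errors.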

cmelchior · 2020-09-15T11:23:12Z

@bfranks I understand this can be an annoying issue. We would rather not introduce an automatic conversion unless it has been opted into, since it can have consequences on other platforms that might read the strings. We did also talk internally about having a configuration option on the RealmConfiguration, with something like automaticConvertIllegalUTF16() or useJavaUTF16CompatibilityRules().

I have created #7101 to track this feature so we can discuss exactly how to solve it there. In the meantime, we also modified the error messages in #7093, so it should now be a lot clearer exactly what is going on.

I'm going to close this issue as fixed through #7093, with a potentially better solution being tracked through #7101.

realm-probot bot added the O-Community label Sep 3, 2020

clementetb mentioned this issue Sep 9, 2020

Show error messages for UTF encoding exceptions #7091

Closed

clementetb mentioned this issue Sep 10, 2020

Show error messages for UTF decoding issues #7093

Merged

cmelchior mentioned this issue Sep 15, 2020

Add RealmConfiguration.enableJavaUTF16CompatibiltyMode #7101

Open

4 tasks

cmelchior closed this as completed Sep 15, 2020

cmelchior mentioned this issue Sep 23, 2020

Caused by io.realm.exceptions.RealmError Unrecoverable error. Failure when converting long string to UTF-16 error_code = 2; retcode = 0; #6879

Closed

sync-by-unito bot assigned clementetb Nov 14, 2022

github-actions bot locked as resolved and limited conversation to collaborators Mar 15, 2024
