io::IoError should carry info on the invalid byte sequence on non-utf8 InvalidInput #12113

pnkfelix · 2014-02-08T19:00:42Z

If you feed in a byte stream that is almost UTF-8 but has errors, a looped series of calls to fn read_char will eventually return an IoError with kind == InvalidInput.

Unfortunately, the returned IoError does not include any information about what the bytes were that were invalid (nor does it include information like how many bytes were read from the input before the error was encountered).

It seems like it would not be that bad to change IoError so that its detail field could be an Option<Either<~str, ~[u8]>>, or something along those lines. In this scenario, InvalidInput would imply that one could look at the detail field to determine which byte sequence caused the problem, and the client code would then have the option of substituting a different character sequence specific to the byte sequence that failed.

(Alternatively, we could change IoErrorKind so that the InvalidInput variant carried an Option<~[u8]>, but then the IoErrorKind would no longer be a C-like enum.)

I believe that this is strictly more expressive than just mapping every invalid sequence to a single replacement character, as is done by from_utf8_lossy (#12062).
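
For illustration, here is a minimal sketch of the idea in present-day Rust rather than the 2014 API. The `InvalidSequence` type and this free-standing `read_char` function are hypothetical names invented for the example; only `str::from_utf8`, `Utf8Error::valid_up_to`, and `Utf8Error::error_len` come from the standard library. The error carries both the offending bytes and the offset at which decoding stopped.

```rust
use std::str;

/// Hypothetical error type (not part of std): records where decoding
/// stopped and which bytes could not be decoded.
#[derive(Debug, PartialEq)]
pub struct InvalidSequence {
    /// Offset of the first undecodable byte within the input.
    pub offset: usize,
    /// The bytes that could not be decoded.
    pub bytes: Vec<u8>,
}

/// Decodes one char starting at `offset`. Returns the char and its encoded
/// length, `Ok(None)` at end of input, or the offending bytes on failure.
pub fn read_char(
    input: &[u8],
    offset: usize,
) -> Result<Option<(char, usize)>, InvalidSequence> {
    let rest = &input[offset..];
    if rest.is_empty() {
        return Ok(None);
    }
    // A UTF-8 sequence is at most 4 bytes long, so a 4-byte window suffices.
    let window = &rest[..rest.len().min(4)];
    let valid = match str::from_utf8(window) {
        Ok(s) => s,
        // The window starts with at least one well-formed char.
        Err(e) if e.valid_up_to() > 0 => {
            str::from_utf8(&window[..e.valid_up_to()]).unwrap()
        }
        Err(e) => {
            // `error_len` is `None` when the input ends mid-sequence.
            let len = e.error_len().unwrap_or(window.len());
            return Err(InvalidSequence {
                offset,
                bytes: window[..len].to_vec(),
            });
        }
    };
    let c = valid.chars().next().unwrap();
    Ok(Some((c, c.len_utf8())))
}
```

A caller can resume after an error by advancing `offset` past `bytes.len()` and substituting whatever it likes for those bytes.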

pnkfelix · 2014-02-08T19:04:29Z

(I would also be satisfied with a variant of read_char that allowed one to pass in a closure callback for invalid byte sequences; essentially recreating a condition-like API, but with more explicit condition handlers threaded through.)

(After further review, the above proposal seems similar to the very old #1675.)
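
A rough sketch of what such a callback-based interface could look like, written against today's standard library. The name `decode_with` and its signature are assumptions made for this example, not an existing API; it decodes a whole byte slice rather than a stream to keep the sketch short.

```rust
use std::str;

/// Hypothetical helper (not part of std): decodes `input`, calling
/// `on_invalid` with each undecodable byte sequence and splicing in
/// whatever string the callback returns.
pub fn decode_with<F>(mut input: &[u8], mut on_invalid: F) -> String
where
    F: FnMut(&[u8]) -> String,
{
    let mut out = String::new();
    loop {
        match str::from_utf8(input) {
            Ok(s) => {
                out.push_str(s);
                return out;
            }
            Err(e) => {
                // Everything before `valid_up_to` is well-formed UTF-8.
                let (valid, rest) = input.split_at(e.valid_up_to());
                out.push_str(str::from_utf8(valid).unwrap());
                // `error_len` is `None` when the input ends mid-sequence;
                // in that case the whole remainder is handed to the callback.
                let bad_len = e.error_len().unwrap_or(rest.len());
                out.push_str(&on_invalid(&rest[..bad_len]));
                input = &rest[bad_len..];
            }
        }
    }
}
```

For example, a handler such as `|bad| bad.iter().map(|&b| b as char).collect()` treats each stray byte as a Latin-1 character, which is one plausible recovery strategy for mixed-encoding logs.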

alexcrichton · 2014-02-08T23:49:22Z

One possibility would be to add a new IoErrorKind with a uint payload saying where things went wrong, but it's still not transmitting the discarded bytes. We have a few other methods (read_until for example) which discard partially read bytes if an error is encountered.

pnkfelix · 2014-02-09T00:53:34Z

I was making a little program to post-process my IRC logs, which unfortunately for some reason have non-UTF-8 bytes mixed in, so it was important to me to have a reasonable way to recover from these scenarios and resume the parsing.

I hacked up something that worked for me, but I doubt it's clean enough to be put into the standard lib. (I was happy that Rust's stdlib does at least expose enough functionality for me to get the job done, e.g. by making helpers like char::len_utf8_bytes and str::utf8_char_width pub instead of priv. Yay!)

The experience showed me that this is not as trivial a problem as I was making it out to be (e.g. I think a fully general interface needs to allow one to feed in a prefix sequence of characters that were left over from a previous failed call to read_char; and likewise, failures need to pass back a slice of the intermediate 4-byte buffer so that a suffix of it can be used as that prefix).
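
The carry-over requirement can be sketched as follows, again in present-day Rust. `ChunkDecoder`, `feed`, and `finish` are hypothetical names chosen for this example; the sketch assumes input arrives in arbitrary byte chunks and keeps the incomplete tail of one chunk as the prefix of the next.

```rust
use std::str;

/// Hypothetical incremental decoder (not part of std): remembers the
/// incomplete tail of one chunk so it can be prepended to the next.
#[derive(Default)]
pub struct ChunkDecoder {
    /// Bytes of a sequence that was cut off at the end of the last chunk.
    pending: Vec<u8>,
}

impl ChunkDecoder {
    /// Feeds one chunk. Decoded text is appended to `out`; every invalid
    /// byte sequence encountered is returned to the caller.
    pub fn feed(&mut self, chunk: &[u8], out: &mut String) -> Vec<Vec<u8>> {
        // Start from the leftover prefix, then add the new bytes.
        let mut buf = std::mem::take(&mut self.pending);
        buf.extend_from_slice(chunk);

        let mut invalid = Vec::new();
        let mut input = &buf[..];
        loop {
            match str::from_utf8(input) {
                Ok(s) => {
                    out.push_str(s);
                    break;
                }
                Err(e) => {
                    let (valid, rest) = input.split_at(e.valid_up_to());
                    out.push_str(str::from_utf8(valid).unwrap());
                    match e.error_len() {
                        // A definitely invalid sequence: report and skip it.
                        Some(n) => {
                            invalid.push(rest[..n].to_vec());
                            input = &rest[n..];
                        }
                        // The chunk ended mid-sequence: keep the tail.
                        None => {
                            self.pending = rest.to_vec();
                            break;
                        }
                    }
                }
            }
        }
        invalid
    }

    /// Returns any bytes still pending at end of input; if non-empty,
    /// they form a truncated sequence.
    pub fn finish(self) -> Vec<u8> {
        self.pending
    }
}
```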

Unfortunately, my main past experiences with such problems (e.g. in Flash) were just further instances where the provided APIs were not flexible enough.

Anyway, hopefully I'll iterate more on this and come up with something palatable.

steveklabnik · 2015-01-23T02:56:55Z

With IO reform, IoError is being renamed, and some of it is being made private. So we'll see how this shakes out.

steveklabnik · 2016-02-02T19:15:51Z

Triage: too much time has passed and too much has changed, so I don't actually remember what the right thing is here.

I believe this boils down to #27802, i.e., we still haven't decided what happens when you get an invalid char while reading.

@rust-lang/libs, opinions?

sfackler · 2016-02-02T19:19:55Z

We certainly now have the ability to do this via the custom error payload you can pass into io::Error::new. It seems pretty reasonable to include the offending byte in that.
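
A minimal sketch of that approach, assuming a payload type of our own. `InvalidUtf8`, `invalid_utf8_error`, and `offending_bytes` are hypothetical names for this example; `io::Error::new`, `io::Error::get_ref`, and `downcast_ref` are the standard library pieces being exercised. The choice of `ErrorKind::InvalidData` is also an assumption made here.

```rust
use std::error::Error;
use std::fmt;
use std::io;

/// Hypothetical payload type (not part of std) carrying the bad bytes
/// and the offset at which they were found.
#[derive(Debug)]
pub struct InvalidUtf8 {
    pub offset: usize,
    pub bytes: Vec<u8>,
}

impl fmt::Display for InvalidUtf8 {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "invalid UTF-8 sequence {:X?} at byte offset {}",
            self.bytes, self.offset
        )
    }
}

impl Error for InvalidUtf8 {}

/// Wraps the payload in an `io::Error`.
pub fn invalid_utf8_error(offset: usize, bytes: &[u8]) -> io::Error {
    io::Error::new(
        io::ErrorKind::InvalidData,
        InvalidUtf8 { offset, bytes: bytes.to_vec() },
    )
}

/// Recovers the offending bytes from an `io::Error`, if it carries them.
pub fn offending_bytes(err: &io::Error) -> Option<&[u8]> {
    err.get_ref()?
        .downcast_ref::<InvalidUtf8>()
        .map(|payload| payload.bytes.as_slice())
}
```

Client code that does not care about recovery can ignore the payload and treat the error like any other `io::Error`.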

pnkfelix · 2016-02-09T07:39:50Z

I agree that this essentially could become a sub-issue of #27802, but the discussion of that issue seems to focus on the semantics of a char iterator's interaction with an underlying byte stream (which is very important), while this issue is more of an "it would be nice if the error object that gets bubbled up actually carried enough info for a user to do recovery" request.

We haven't always done such a great job in this respect, IMO, so I'm trying to be explicit about it here.

Mark-Simulacrum · 2018-07-28T21:59:10Z

Read::chars has been deprecated and the functionality this asks for is primarily provided through str::from_utf8; I'm going to close this issue.
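
As a small illustration of what `str::from_utf8` exposes, the sketch below locates the first invalid sequence in a byte slice. The helper name `first_invalid` is hypothetical; `Utf8Error::valid_up_to` and `Utf8Error::error_len` are the standard library methods that supply the position and length.

```rust
use std::str;

/// Hypothetical helper (not part of std): returns the offset and bytes of
/// the first invalid UTF-8 sequence in `bytes`, or `None` if all is valid.
pub fn first_invalid(bytes: &[u8]) -> Option<(usize, &[u8])> {
    let err = str::from_utf8(bytes).err()?;
    // Number of leading bytes that are well-formed UTF-8.
    let start = err.valid_up_to();
    // `error_len` is `None` when the input ends in the middle of a
    // sequence; treat the whole remainder as the offending bytes then.
    let len = err.error_len().unwrap_or(bytes.len() - start);
    Some((start, &bytes[start..start + len]))
}
```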

pnkfelix mentioned this issue Feb 11, 2014

Expose an UTF-8 checking function that returns the index of the error #12168

Closed

steveklabnik added the A-libs label Jan 23, 2015

pnkfelix added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Feb 9, 2016

troplin mentioned this issue May 5, 2016

Tracking issue for Read::chars #27802

Closed

steveklabnik removed the A-libs label Mar 24, 2017

Mark-Simulacrum added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 20, 2017

Mark-Simulacrum closed this as completed Jul 28, 2018

bors added a commit to rust-lang-ci/rust that referenced this issue Jul 25, 2022

Auto merge of rust-lang#12113 - jtracey:patch-1, r=lnicola

5f1ed3c

small typo in log message
