Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

io::IoError should carry info on the invalid byte sequence on non-utf8 InvalidInput #12113

Closed
pnkfelix opened this issue Feb 8, 2014 · 8 comments
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@pnkfelix
Copy link
Member

pnkfelix commented Feb 8, 2014

If you feed in a byte stream that is almost utf-8 but has errors, a looped series of calls to fn read_char will eventually return an IoError with kind == InvalidResult.

Unfortunately, the returned IoError does not include any information about what the bytes were that were invalid (nor does it include information like how many bytes were read from the input before the error was encountered).

It seems like it would not be that bad to change IoError so that its detail field could be an Option<Either<~str, ~[u8]>>, or something along those lines, so that in this scenario, the InvalidResult would imply that one could look at the detail field to determine what the byte sequence was that caused the problem (and then the client code would have the option of substituting in a different character sequence specific to the byte sequence that failed).

(Alternatively, we could change IoErrorKind so that the InvalidResult variant carried an Option<~[u8]>, but then the IoErrorKind would no longer be a C-like enum.)

I believe that this is strictly more expressive than just mapping every replacement to a single replacement character, as is done by from_utf8_lossy (#12062).

@pnkfelix
Copy link
Member Author

pnkfelix commented Feb 8, 2014

(I would also be satisfied with an variant of read_char that allowed one to pass in a closure callback for invalid byte sequences; essentially recreating a condition-like API, but with more explicit condition handlers threaded through.)

((after further review, the above proposal seems similar to the very old #1675 ))

@alexcrichton
Copy link
Member

One possibility would be to add a new IoErrorKind with a uint payload saying where things went wrong, but it's still not transmitting the discarded bytes. We have a few other methods (read_until for example) which discard partially read bytes if an error is encountered.

@pnkfelix
Copy link
Member Author

pnkfelix commented Feb 9, 2014

I was making a little program to post-process my irc-logs, which unfortunately for some reason have non-utf8 mixed in, so it was important to me to have a reasonable way to recover from these scenarios and resume the parsing.

I hacked up something that worked for me, but I doubt its clean enough to be put into the standard lib. (I was happy that Rust's stdlib does at least expose enough functionality for me to get the job done, e.g. by making helpers like char::len_utf8_bytes and str::utf8_char_width pub instead of priv. Yay!)

The experience showed me that this is not as trivial a problem as I was making it out to be (e.g. I think a fully general interface needs to allow one to feed in a prefix sequence of characters that were left over from a previous failed call to read_char; and likewise, failures need to pass back a slice of the intermediate 4-byte buffer so that a suffix of it can be used as that prefix).

Unfortunately my main experiences in the past with such problems (e.g. in Flash) were only just further instances where the provided API's were not flexible enough.

Anyway, hopefully I'll iterate more on this and come up with something palatable.

@steveklabnik
Copy link
Member

With IO reform, IoError is being renamed, and some of it is being made private. So we'll see how this shakes out.

@steveklabnik
Copy link
Member

Triage: too much time has passed and too much has changed, so I don't actually remember what the right thing is here.

I believe this boils down to #27802, ie, we still haven't decided what happens when you get an invalid char while reading.

@rust-lang/libs, opinions?

@sfackler
Copy link
Member

sfackler commented Feb 2, 2016

We certainly now have the ability to do this via the custom error payload you can pass into io::Error::new. It seems pretty reasonable to include the offending byte in that.

@pnkfelix
Copy link
Member Author

pnkfelix commented Feb 9, 2016

I agree that this essentially could become a sub issue of #27802 ; but the discussion of that issue seems to focus on the semantics of a char iterator's interaction with an underlying byte stream (which is very important), while this issue is more of a "it would be nice if the error object that gets bubbled up actually carried enough info for a user to do recovery"

We haven't always done such a great job in this respect, IMO, so I'm trying to be explicit about it here.

@pnkfelix pnkfelix added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Feb 9, 2016
@Mark-Simulacrum Mark-Simulacrum added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 20, 2017
@Mark-Simulacrum
Copy link
Member

Read::chars has been deprecated and the functionality this asks for is primarily provided through str::from_utf8; I'm going to close this issue.

bors added a commit to rust-lang-ci/rust that referenced this issue Jul 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

5 participants