Confusing Debug of chars #62947

Closed
max-sixty opened this issue Jul 24, 2019 · 4 comments · Fixed by #63000
Labels
A-iterators Area: Iterators C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@max-sixty
Contributor

max-sixty commented Jul 24, 2019

Currently, the Debug of Chars prints the underlying bytes, rather than the chars:


#![allow(unused)]
fn main() {
    let s = String::from(" é 😀 ");
    let c = s.chars();
    dbg!("Debug of Chars: ", &c);
    dbg!("Debug of each char: ");
    for x in c {
        dbg!(x);
    }
}

Returns:

[src/main.rs:5] "Debug of Chars: " = "Debug of Chars: "
[src/main.rs:5] &c = Chars {
    iter: Iter(
        [
            32,
            195,
            169,
            32,
            240,
            159,
            152,
            128,
            32,
        ],
    ),
}
[src/main.rs:6] "Debug of each char: " = "Debug of each char: "
[src/main.rs:8] x = ' '
[src/main.rs:8] x = 'é'
[src/main.rs:8] x = ' '
[src/main.rs:8] x = '😀'
[src/main.rs:8] x = ' '

As I was trying to work out what chars was (whether it was Unicode code points or bytes or something else), the first output was very confusing. Is there a reason we don't print something like the second case?

Would you take a PR to change this?

I couldn't find any previous discussion on this - #49283 was the closest I could find.

@jonas-schievink jonas-schievink added A-iterators Area: Iterators C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Jul 24, 2019
@ExpHP
Contributor

ExpHP commented Jul 25, 2019

The first one is UTF-8 bytes. You see this because the Debug impl for Chars is auto-generated:

#[derive(Clone, Debug)]
pub struct Chars<'a> {
    iter: slice::Iter<'a, u8>
}

Strictly speaking, since the bytes inside of Chars should always be valid UTF-8, this could have a custom Debug impl that makes it pretend to contain a string by formatting the member as str::from_utf8(self.iter.as_slice()).unwrap().
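A minimal sketch of that idea, written outside the standard library. `CharsAsStr` is a hypothetical wrapper introduced only for illustration, and it uses the public `Chars::as_str` accessor instead of reaching into the private `iter` field, so no re-validation is needed:

```rust
use std::fmt;
use std::str::Chars;

/// Hypothetical wrapper (not a std type): shows a `Chars` as the
/// string slice it has not yet consumed.
struct CharsAsStr<'a>(Chars<'a>);

impl fmt::Debug for CharsAsStr<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // `as_str` returns the remaining, already-valid UTF-8 as a &str.
        f.debug_tuple("Chars").field(&self.0.as_str()).finish()
    }
}

fn main() {
    let s = String::from(" é 😀 ");
    let mut c = s.chars();
    c.next(); // consume the leading space
    println!("{:?}", CharsAsStr(c));
}
```

Only the unconsumed part of the string is shown, which matches what the iterator will still yield.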


The reason it doesn't display as individual chars is because it doesn't have individual chars; determining their boundaries is the entire point of the Chars iterator. I suppose this same argument could be used against the call to str::from_utf8, which needs to scan the whole string to validate it.

(but then the solution seems to be to use str::from_utf8_unchecked, which seems awfully heavy-handed for a Debug impl. And perhaps it doesn't even matter, because the cost of most io::Write impls probably outweighs the cost of this validation)

I guess that, questionable concerns of efficiency aside, my main concern is simply that showing a list of individual chars would be... dishonest, I guess.
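For comparison, the other option discussed here, listing the decoded chars, could be sketched as follows. `CharsAsList` is again a hypothetical wrapper for illustration; the decoding happens on a clone of the iterator each time the value is formatted, so the original is not advanced:

```rust
use std::fmt;
use std::str::Chars;

/// Hypothetical wrapper (not a std type): shows a `Chars` as the list
/// of chars it would yield if iterated to the end.
struct CharsAsList<'a>(Chars<'a>);

impl fmt::Debug for CharsAsList<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Chars(")?;
        // Clone so that formatting does not consume the iterator.
        f.debug_list().entries(self.0.clone()).finish()?;
        write!(f, ")")
    }
}

fn main() {
    let s = String::from(" é 😀 ");
    println!("{:?}", CharsAsList(s.chars()));
}
```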

@max-sixty
Contributor Author

> I guess that, questionable concerns of efficiency aside, my main concern is simply that showing a list of individual chars would be... dishonest, I guess.

Do str & String contain more data than Chars? I had thought they both contained the underlying bytes and then these were decoded as needed - including as part of the Display & Debug implementations

@ExpHP
Contributor

ExpHP commented Jul 25, 2019

No, they contain the same data. They're all just UTF-8 bytes. And the Display implementation of str simply writes the bytes contained in the str directly to the io::Write instance.
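A small check of that claim, using a `Vec<u8>` as a stand-in writer. The helper name `display_bytes` is made up for this example:

```rust
use std::io::Write;

/// Hypothetical helper: format `s` with `Display` into an in-memory
/// writer and return the bytes that were written.
fn display_bytes(s: &str) -> Vec<u8> {
    let mut sink: Vec<u8> = Vec::new();
    write!(sink, "{}", s).expect("writing to a Vec<u8> does not fail");
    sink
}

fn main() {
    let s = " é 😀 ";
    // The bytes written by Display are the str's own UTF-8 bytes...
    assert_eq!(display_bytes(s), s.as_bytes());
    // ...and a fresh Chars iterator views exactly the same bytes.
    assert_eq!(s.chars().as_str().as_bytes(), s.as_bytes());
    println!("{:?}", display_bytes(s));
}
```

The printed list is the same nine bytes that appear in the derived Debug output of Chars at the top of this issue.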


Suppose we are writing to STDOUT. On UNIX, the io::Write impl for Stdout writes these bytes directly to the underlying file descriptor with no processing:

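impl io::Write for Stdout {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Wrap fd 1 without taking ownership, then forward the bytes as-is.
        ManuallyDrop::new(FileDesc::new(libc::STDOUT_FILENO)).write(buf)
    }
    // (excerpt; the rest of the impl is not shown)

// The FileDesc write method that the call above forwards to:
pub fn write(&self, buf: &[u8]) -> io::Result<usize> {
    let ret = cvt(unsafe {
        libc::write(self.fd,
                    buf.as_ptr() as *const c_void,
                    cmp::min(buf.len(), max_len()))
    })?;
    Ok(ret as usize)
}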

I would imagine this is because the console on any UNIX platform almost certainly uses UTF-8.[1] It is your terminal application that is then responsible for decoding these bytes and producing glyphs. Considering that UTF-8 dominates much of the web space as well, it's quite possible that even on the playground, these bytes are ultimately sent over the wire to your PC with minimal processing, where your browser is responsible for decoding and displaying them.

On Windows, io::Write for Stdout transcodes the UTF-8 into the UTF-16 format expected by the windows APIs:

let mut utf16 = [0u16; MAX_BUFFER_SIZE / 2];
let mut len_utf16 = 0;
for (chr, dest) in utf8.encode_utf16().zip(utf16.iter_mut()) {
    *dest = chr;
    len_utf16 += 1;
}
let utf16 = &utf16[..len_utf16];
let mut written = write_u16s(handle, &utf16)?;

and then Windows does whatever it does with those UTF-16 code units. (Quite likely, it hands them directly to the console, which is then responsible for decoding and displaying them)
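To make the transcoding step concrete, here is a small standalone sketch of what `str::encode_utf16` produces for the string from the original report. The helper `to_utf16` is a made-up name for this example, and nothing here touches the Windows console APIs:

```rust
/// Hypothetical helper: collect the UTF-16 code units of a &str.
fn to_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

fn main() {
    let units = to_utf16(" é 😀 ");
    // 'é' (U+00E9) fits in one code unit; '😀' (U+1F600) is outside the
    // Basic Multilingual Plane, so it becomes a surrogate pair.
    assert_eq!(units, [0x0020, 0x00E9, 0x0020, 0xD83D, 0xDE00, 0x0020]);
    for u in &units {
        println!("{:#06x}", u);
    }
}
```

Nine UTF-8 bytes become six UTF-16 code units for the same five chars, which is why the code above counts units as it fills the buffer.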


Footnotes

  1. (I think in actuality UNIX accepts arbitrary strings of bytes, and then the portions of these strings which are valid UTF-8 are rendered appropriately by the console. I don't know; doesn't really matter)

@max-sixty
Contributor Author

OK, so given that - is there still an objection to displaying the unicode characters for Chars but not String?

Centril added a commit to Centril/rust that referenced this issue Jul 29, 2019
…hton

Impl Debug for Chars

Closes rust-lang#62947, making `Debug` more consistent with the struct's output and purpose

Let me know any feedback!