Test C locale breaks test suite #6564

Open
TysonStanley opened this issue Oct 9, 2024 · 1 comment

Comments

@TysonStanley
Member

Something found during the release process, to be addressed after the patch release (I don't believe it should block the current patch release):

# Test C locale doesn't break test suite (#2771)
echo LC_ALL=C > ~/.Renviron
R
Sys.getlocale()=="C"
q("no")
R CMD check data.table_1.16.2.tar.gz
rm ~/.Renviron

This makes test.data.table() fail on macOS (Apple Silicon) at test 2194.7:

**** Full long double accuracy is not available. Tests using this will be skipped.

Running test id 2194.7
Test 2194.7 produced 0 errors but expected 1
Expected: Internal error.*types or lengths incorrect
@aitap
Contributor

aitap commented Oct 10, 2024

Can also be reproduced on amd64 Linux (although multiple other tests also break due to <U+????> substitutions in conversions from UTF-8 to native encoding):

.libPaths(c('data.table.Rcheck', .libPaths()))
library(data.table)
trace(data.table:::endsWithAny, quote(if(identical(y, 'B')) browser())) # test 2194.7 compares with 'B'
test.data.table()
# same file as data.table/inst/tests/issue_563_fread.txt
Browse[1]> readLines(parent.frame(8)$env$testDir('issue_563_fread.txt'))
[1] "A,B"
Browse[1]> c
# later, at top level again
> readLines('inst/tests/issue_563_fread.txt')
[1] "A,B"               "\304\205,\305\276" "\305\253,\304\257"
[4] "\305\263,\304\227" "\305\241,\304\231"

Rconn_fgetc returns EOF after the first line because it's set to decode from UTF-8 into the native encoding, and iconv() fails to decode non-ASCII characters. This comes from file(encoding = getOption("encoding")), which is indeed set to UTF-8 by test.data.table:

oldOptions = options(
datatable.verbose = verbose,
encoding = "UTF-8", # just for tests 708-712 on Windows
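
The conversion failure can be seen in isolation with R's iconv() function. This is only an illustration at the R level, not the code path inside the connection. The target name "ASCII" is an assumption standing in for the native encoding of a session running under LC_ALL=C, and the exact result may vary with the platform's iconv implementation.

```r
# Second line of issue_563_fread.txt, written as raw UTF-8 bytes (octal escapes)
x <- "\304\205,\305\276"

# With the default sub = NA, a string that cannot be converted becomes NA
strict <- iconv(x, from = "UTF-8", to = "ASCII")

# sub = "byte" replaces each non-convertible byte with its hex code instead
lenient <- iconv(x, from = "UTF-8", to = "ASCII", sub = "byte")

print(strict)
print(lenient)
```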

When readLines is given a file path, there is no way to stop it from calling file() with the default encoding=, so tests.Rraw will have to either open the file manually with a different encoding (in which the contents will be invalid!) or construct a different string to pass to endsWithAny. In particular, ?file recommends creating an unopened connection marked as UTF-8 (file(open = '', encoding = 'UTF-8')) and giving it to readLines in order to read UTF-8 in an R session incapable of representing UTF-8 natively:

# context: options(encoding = 'UTF-8'), LC_ALL=C
con <- file('inst/tests/issue_563_fread.txt', open = '')
readLines(con)
# [1] "A,B"           "<U+0105>,<U+017E>" "<U+016B>,<U+012F>" "<U+0173>,<U+0117>"
# [5] "<U+0161>,<U+0119>"
close(con)

Unfortunately, readLines won't do this by itself when given a file path: it uses file(open='r'), which initialises the UTF-8 → ASCII conversion and breaks.
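
A minimal sketch of how the unopened-connection approach could be wrapped for reuse. The helper name read_utf8_lines is hypothetical and not part of data.table or its test suite. It passes encoding = "UTF-8" explicitly so that it does not rely on options(encoding = ...).

```r
# Hypothetical helper (not part of data.table): read a UTF-8 file without
# depending on the session locale or on getOption("encoding").
read_utf8_lines <- function(path) {
  # open = "" leaves the connection unopened, so readLines() opens it itself
  con <- file(path, open = "", encoding = "UTF-8")
  # readLines() closes but does not destroy the connection; close() destroys it
  on.exit(close(con))
  readLines(con)
}

# Self-contained demo: write two lines of UTF-8 bytes to a temporary file
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("A,B\n\304\205,\305\276\n"), tmp)
lines <- read_utf8_lines(tmp)
unlink(tmp)

print(length(lines))
```

Comparing with charToRaw() rather than printing avoids depending on how the session displays non-ASCII characters (for example as <U+0105> under LC_ALL=C).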
