Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

typeorder for LGLSXP and RAWSXP in src/init.c is in conflict with R's coercion rules #4172

Closed
sritchie73 opened this issue Jan 13, 2020 · 4 comments

Comments

@sritchie73
Copy link
Contributor

Copying from a discussion chain with @MichaelChirico in #4144 :

In src/init.c the type order is laid out as follows:

static void setSizes() {
  for (int i=0; i<100; ++i) { sizes[i]=0; typeorder[i]=0; }
  // only these types are currently allowed as column types :
  sizes[LGLSXP] =  sizeof(int);       typeorder[LGLSXP] =  0;
  sizes[RAWSXP] =  sizeof(Rbyte);     typeorder[RAWSXP] =  1;
  sizes[INTSXP] =  sizeof(int);       typeorder[INTSXP] =  2;   // integer and factor
  sizes[REALSXP] = sizeof(double);    typeorder[REALSXP] = 3;   // numeric and integer64
  sizes[CPLXSXP] = sizeof(Rcomplex);  typeorder[CPLXSXP] = 4;
  sizes[STRSXP] =  sizeof(SEXP *);    typeorder[STRSXP] =  5;
  sizes[VECSXP] =  sizeof(SEXP *);    typeorder[VECSXP] =  6;   // list column
  if (sizeof(char *)>8) error(_("Pointers are %d bytes, greater than 8. We have not tested on any architecture greater than 64bit yet."), sizeof(char *));
  // One place we need the largest sizeof is the working memory malloc in reorder.c
}

The order of LGLSXP and RAWSXP are in conflict with R's coercion rules:

c(as.raw(0), TRUE)
# [1] FALSE  TRUE

As you can see, the raw is coerced to logical, not the other way round as listed in src/init.c

I believe the coercion rules work this way because a logical vector may contain NA, whereas raw vectors cannot:

as.raw(NA)
# [1] 00
# Warning message:
# out-of-range values treated as 0 in coercion to raw

Changing the order of LGLSXP and RAWSXP in typeorder has minimal impact, breaking only the following tests: 2006.1, 2006.2, and 2129.

These tests all related to the rbind/rbindlist functionality. Tests 2006.1 and 2006.2 are a simple fix: they simply expect the wrong output: when the typeorder is changed, rbindlist(..., fill = TRUE) with missing raw colums are now filled with NA in line with the functionality for all other column types, rather than 00 as is in the expected test output. Its not clear to me why 2129 fails, as the input data.tables do not contain raw columns, but the test is for rbind(), so this would take a little debugging.

@sritchie73 sritchie73 changed the title typeorder for LGLSXP and RAWSXP wrong in src/init.c typeorder for LGLSXP and RAWSXP in src/init.c is in conflict with R's coercion rules Jan 13, 2020
@MichaelChirico
Copy link
Member

For 2129, that's related to #546 / #3811. As pointed out there, the issue wasn't closed by the PR, so the test failure may well be related to that issue being unsolved.

@mattdowle
Copy link
Member

mattdowle commented Feb 19, 2020

That example is only with 0 in raw.
Consider in base R c(as.raw(12), as.raw(42), TRUE) converts 12 and 42 to TRUE and loses the distinction between 12 and 42, which I don't think makes sense. When folk use raw, they're likely to be using the full range 0:255.
In very many (most?) real-world cases, the logical column doesn't contain any NA anyway, and coercing 0:1 to the 0:255 range makes more sense.
I'm aware of base R's hierarchy and I decided to be different.
Where choices like this have been made I think we should resist changing it for the sole reason of being consistent with base R. If there's an example from a user, or a problem demonstrated with real-world data, then that's different. We certainly shouldn't change it after only considering c(as.raw(0), TRUE). And in terms of priorities, raw is a less used type that tends not to be combined with other types anyway, which is why we don't see this coercion aspect coming up in practice.

@sritchie73
Copy link
Contributor Author

Ah, I hadn't realised that base R's coercion rules differed depending on the value of the raw, nor that a design decision had been made here. I'm happy for the issue and correspondign PR to be closed, I agree that it doesn't make sense to downcast the non-zero raw values to TRUE.

@MichaelChirico
Copy link
Member

Thanks for the explanation Matt, let's add something to that effect as a comment there -- I looked around for an explanation of the ordering there before recommending to try switching but didn't find one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants