Fast data.table to matrix conversion in C [Depends: #4196] #4144

sritchie73 · 2019-12-28T14:20:04Z

Following on from PR #4134 I've started working on C code for faster conversion of data.tables to matrices.

The current code so far works for numeric matrices only, assuming all columns in the data.table are already numeric.

TODO:

- Modify C code so that it works on all atomic types, not just numeric data
- Handle type conversion for data.tables with multiple types
- Parallelise C code across input data.table columns
- Revisit R code to reduce unnecessary column checks
- bit64 support
- Fallback unlist method for types not implemented in C
- Add tests
- Check compatibility with changes to matrix in upcoming R 4.0.0


sritchie73 · 2019-12-28T14:22:30Z

At this point it would be useful to have someone's eyes on the code I've written, to sanity check that I haven't done something drastically wrong. It's my first time writing C extensions for R (I've worked quite a bit with Rcpp and have learnt C in the past), and also my first time working with C code in an R package setting.

codecov · 2019-12-28T14:28:44Z

Codecov Report

Merging #4144 (bef9fa1) into master (4bda6da) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4144      +/-   ##
==========================================
+ Coverage   99.60%   99.61%   +0.01%     
==========================================
  Files          72       73       +1     
  Lines       13918    14043     +125     
==========================================
+ Hits        13863    13989     +126     
+ Misses         55       54       -1

Impacted Files	Coverage Δ
src/init.c	`100.00% <ø> (ø)`
R/data.table.R	`100.00% <100.00%> (ø)`
src/matrix.c	`100.00% <100.00%> (ø)`
src/utils.c	`98.09% <0.00%> (-0.04%)`	⬇️
src/fread.c	`99.52% <0.00%> (-0.01%)`	⬇️
R/xts.R	`100.00% <0.00%> (ø)`
R/fcast.R	`100.00% <0.00%> (ø)`
R/frank.R	`100.00% <0.00%> (ø)`
R/utils.R	`100.00% <0.00%> (ø)`
src/frank.c	`100.00% <0.00%> (ø)`
... and 11 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4bda6da...53846b8.


sritchie73 · 2019-12-29T12:23:53Z

The C implementation of as.matrix now works on multiple column types, provided the input data.table's columns are all the same type (data.tables with mixed column types are next on the TODO list). The types I've tested so far, all covered by test.data.table(), are: logical, integer, numeric, complex, character, raw, and list.

To achieve this, I've made the C asmatrix function a dispatch function that simply detects the type from the first column of the data.table, then calls the appropriate asmatrix_ function. This unfortunately means there's quite a bit of code duplication. I don't know if there is a better way of doing this, especially given that character vectors and lists have to be handled quite differently in C to the other atomic types.
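A rough sketch of the dispatch pattern described above, in plain C without the R headers. The `col_type` enum, the `column` struct, and the `asmatrix_int` / `asmatrix_real` names are hypothetical stand-ins for illustration (the real code would inspect the column with R's `TYPEOF()`), and only two types are shown:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type tags standing in for R's SEXP types. */
typedef enum { COL_INT, COL_REAL } col_type;

typedef struct {
    col_type type;
    const void *data;   /* nrow elements of the tagged type */
} column;

/* One worker per type: copy each column into a column-major buffer. */
static void asmatrix_int(const column *cols, size_t ncol, size_t nrow, int *out)
{
    for (size_t j = 0; j < ncol; ++j)
        memcpy(out + j * nrow, cols[j].data, nrow * sizeof(int));
}

static void asmatrix_real(const column *cols, size_t ncol, size_t nrow, double *out)
{
    for (size_t j = 0; j < ncol; ++j)
        memcpy(out + j * nrow, cols[j].data, nrow * sizeof(double));
}

/* Dispatcher: assumes all columns share the type of the first column.
 * Returns 0 on success, -1 if the type has no worker. */
int asmatrix_dispatch(const column *cols, size_t ncol, size_t nrow, void *out)
{
    assert(cols != NULL && ncol > 0);
    switch (cols[0].type) {          /* the first column is index 0 in C */
    case COL_INT:
        asmatrix_int(cols, ncol, nrow, (int *)out);
        return 0;
    case COL_REAL:
        asmatrix_real(cols, ncol, nrow, (double *)out);
        return 0;
    }
    return -1;
}
```

The duplication mentioned above is visible even here: the two workers differ only in element type. Character and list columns could not use `memcpy` in the real R API, which is part of why they need separate handling.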


Covers test 2074.07: data.tables may be constructed such that an individual column is itself multi-column, e.g. a data.table or matrix. Since the test notes that these can only occur when data.tables are constructed incorrectly, I've just reverted to the old non-C code to handle this special case.

sritchie73 · 2019-12-29T14:25:20Z

For data.tables with mixed column types, I now explicitly handle the type conversion in R prior to handing off to C. I'm not sure if this is an improvement over the old method of simply calling unlist(), since unlist() handles type conversion in its internal C code.
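To illustrate what "type conversion across columns" has to decide, here is a small sketch in plain C that finds the target type for a set of columns and reports whether any coercion is needed. The enum and function name are hypothetical; the ordering follows R's usual coercion hierarchy (logical < integer < numeric < complex < character < list), with raw left out:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tags, declared in order of R's coercion hierarchy. */
typedef enum {
    T_LOGICAL, T_INTEGER, T_NUMERIC, T_COMPLEX, T_CHARACTER, T_LIST
} col_type;

/* Returns the "highest" type among the columns. *needs_coercion is set
 * to 1 if at least one column differs from the others, else 0. */
col_type highest_type(const col_type *types, size_t ncol, int *needs_coercion)
{
    assert(types != NULL && needs_coercion != NULL && ncol > 0);
    col_type hi = types[0];
    *needs_coercion = 0;
    for (size_t j = 1; j < ncol; ++j) {
        if (types[j] != hi)
            *needs_coercion = 1;   /* mixed types: some column must be converted */
        if (types[j] > hi)
            hi = types[j];
    }
    return hi;
}
```

When `needs_coercion` comes back 0, the columns can be copied straight into the matrix, which is the single-type fast path the earlier comments describe.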

R/data.table.R

MichaelChirico · 2019-12-29T14:54:26Z

I like what you've done to split out the some-columns-has-dimensions edge case. Given that AFAIK we don't support columns with dimensions (?) maybe worth to just error for that case & not bother with the added complexity.

Once you've coerced your columns-as-list-elements to the "highest" type, how does do.call(cbind) perform?


R/data.table.R

sritchie73 · 2019-12-29T15:03:35Z

> I like what you've done to split out the some-columns-has-dimensions edge case. Given that AFAIK we don't support columns with dimensions (?) maybe worth to just error for that case & not bother with the added complexity.
>
> Once you've coerced your columns-as-list-elements to the "highest" type, how does do.call(cbind) perform?

I've only implemented this to cover test 2074.07:

    ## as.matrix.data.table when a column has columns (only possible when constructed incorrectly)
    DT = structure(list(a=1:5, d=data.table(b=6:10, c=11:15), m=matrix(16:25, ncol=2L)), class = c('data.table', 'data.frame'))
    test(2074.07, as.matrix(DT), matrix(1:25, ncol=5L, dimnames=list(NULL, c('a', 'd.b', 'd.c', 'm.1', 'm.2'))))

I'd otherwise prefer to throw an error, but perhaps better here to maintain backwards compatibility. I'm not particularly concerned with increasing efficiency for this edge case which can apparently only arise erroneously.

MichaelChirico · 2019-12-29T15:14:07Z

I know for sure 2074.07 was only added for codecov purposes (I wrote it 😂), I wouldn't have any issue removing that test. setDT warns for such columns and as.data.table "unnests" them, the only way to get such an object now is through the convoluted approach I made for the test. Don't think we should be catering too much for that.

sritchie73 · 2019-12-30T11:18:35Z

In that case, I'm now wondering about lines 1909-1917, which look like they are also specific to this erroneous edge case:

    if (length(dj <- dim(xj)) == 2L && dj[2L] > 1L) {
      if (inherits(xj, "data.table"))
        xj = X[[j]] = as.matrix(X[[j]])
      dnj = dimnames(xj)[[2L]]
      collabs[[j]] = paste(collabs[[j]], if (length(dnj) > 0L) dnj else seq_len(dj[2L]), sep = ".")
    }

I think we can remove the body of this if statement, and instead replace it with a call to setDT on X after the for loop has run across each column, if any columns with dimensions are detected. Alternatively, we could eliminate this check entirely, and just run setDT regardless.

Eliminated the costly check of each column for dimensions, replacing it with a single (very fast) call to setDT to check for columns with multiple columns (e.g. a column that contains a matrix). In the rare edge case where this is detected (see test 2074.07) we now use as.data.table to unpack these columns. This eliminated the need for the later check for matrix and data.table columns and the subsequent use of the old unlist method rather than the new C method. Test 2074.07 required a minor modification to the expected column names of the output matrix.

sritchie73 · 2019-12-30T12:43:51Z

I think I've come up with a much nicer solution to that problem which eliminates the need for checking each column's dimensions.

Previously, when the user supplied as.matrix with a column to use as the rownames, there was an expensive copy call `x = X[,.SD,.SDcols=cn[-rownames]]` which would create a copy of the input data.table excluding the rownames column. Now instead the as.matrix function simply keeps track of the rownames column to exclude, and skips over that column throughout the rest of the column checks and while creating the matrix.
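The idea of tracking the rownames column instead of copying the table can be sketched in plain C as below. This is an illustration under assumptions, not the PR's code: the function name, the `rn_col` parameter, and the use of `-1` for "no rownames column" are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fill a column-major matrix from ncol input columns, skipping the one
 * at index rn_col (0-based; -1 means no column is skipped).
 * Returns the number of columns written to out. */
size_t fill_matrix_skip(const double *const *cols, size_t ncol, size_t nrow,
                        long rn_col, double *out)
{
    assert(cols != NULL && out != NULL);
    size_t k = 0;                        /* next output column */
    for (size_t j = 0; j < ncol; ++j) {
        if ((long)j == rn_col)
            continue;                    /* rownames column: never copied */
        const double *src = cols[j];
        double *dst = out + k * nrow;
        for (size_t i = 0; i < nrow; ++i)
            dst[i] = src[i];
        ++k;
    }
    return k;
}
```

The input columns are only read, so no intermediate copy of the table is made; the output buffer needs room for one column fewer than the input when a rownames column is given.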

sritchie73 · 2020-02-19T06:43:31Z

RE: discussion of raw types in #4172, I changed the as.matrix implementation here to mirror the behaviour of as.matrix.data.frame: the resulting matrix is raw type only if all columns are raw type, otherwise the matrix is coerced to character.
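A minimal sketch of the raw rule as stated in this comment, in plain C. The enum and function are hypothetical, and one detail is an assumption of the sketch rather than something confirmed here: when raw columns are mixed with list columns, the sketch keeps the result as list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tags; T_RAW is handled specially rather than by order. */
typedef enum {
    T_RAW, T_LOGICAL, T_INTEGER, T_NUMERIC, T_COMPLEX, T_CHARACTER, T_LIST
} col_type;

/* Result type when raw columns may be present:
 * all raw -> raw; raw mixed with other atomic types -> character. */
col_type result_type_with_raw(const col_type *types, size_t ncol)
{
    assert(types != NULL && ncol > 0);
    col_type hi = types[0];
    int any_raw = 0, all_raw = 1;
    for (size_t j = 0; j < ncol; ++j) {
        if (types[j] == T_RAW)
            any_raw = 1;
        else
            all_raw = 0;
        if (types[j] > hi)
            hi = types[j];
    }
    if (any_raw && !all_raw && hi < T_CHARACTER)
        return T_CHARACTER;   /* raw mixed with e.g. integer becomes character */
    return hi;                /* assumption: a list column still wins over raw */
}
```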


sritchie73 · 2020-03-19T12:08:31Z

One thing that would be nice to do in this PR as well is to group all the as.matrix tests in the test file. Currently these are spread out across tests.Rraw. Is there a way of doing this without having to renumber all the tests?

MichaelChirico · 2020-03-19T13:21:46Z

it should be fine to leave the old test numbers.

otoh, it's just a for loop through the new file to renumber 😁

ethanbsmith · 2020-05-20T02:14:44Z

wondering if you would consider optimizing xts / zoo <-> data.table as part of this. Since xts and zoo are really just a matrix with an index attribute, apart from the index column, the rest would really just be a single-type matrix conversion.

data.table's in-place updates, grouping, indexing and handling of mixed types are so powerful that many of my analytics functions convert my xts structures to a data.table internally for processing, but then need to reconvert to xts on the way out for compatibility. so, this is a very common use case (for me)

jangorecki · 2020-05-20T02:17:16Z

good idea, but better to keep it as a separate PR probably

MichaelChirico · 2024-02-19T04:29:12Z

Setting as draft instead of label:WIP.

sritchie73 added 2 commits December 28, 2019 13:46

Began implementing a fast as.matrix in C

a0c09dc

Basic code implemented for numeric matrices for data.tables whose columns are all numeric

Made loop code more efficient

a614743

Helper functions (e.g. REAL) are now only called once for each column/object and objects are accessed by pointer in the for loops.
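The optimisation in this commit message can be sketched in plain C as follows. This is an illustration only: `cols[j]` stands in for the pointer that an accessor such as `REAL()` would return for column j, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Copy ncol columns of length nrow into a column-major matrix buffer.
 * Each column pointer is looked up once, outside the inner loop. */
void fill_matrix(const double *const *cols, size_t ncol, size_t nrow, double *out)
{
    assert(cols != NULL && out != NULL);
    for (size_t j = 0; j < ncol; ++j) {
        const double *src = cols[j];    /* one lookup per column */
        double *dst = out + j * nrow;   /* start of column j in the matrix */
        for (size_t i = 0; i < nrow; ++i)
            dst[i] = src[i];            /* inner loop touches only raw pointers */
    }
}
```

Because R matrices are stored column by column, each input column lands in one contiguous block of the output, so the inner loop is a straight sequential copy.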

sritchie73 added the WIP label Dec 28, 2019

sritchie73 self-assigned this Dec 28, 2019

jangorecki reviewed Dec 28, 2019

src/asmatrix.c (outdated)

sritchie73 added 5 commits December 29, 2019 11:30

Casmatrix now works on simple types

a7a22a0

C implementation of as.matrix now works on simple atomic types: logical, integer, numeric, and complex, provided all columns in the data.table are of the same type.

Implemented Casmatrix for character vectors

27fa71c

Implemented Casmatrix for list type columns

0d6742a

Bugfix: no longer crashes on 1-column matrices

89e4239

Bug was caused by the dispatch function attempting to access a non-existent second column, due to forgetting that array indexing in C starts at 0 not 1.

removed spurious line of code

e101b0e

sritchie73 added 3 commits December 29, 2019 12:26

Style fix: changed <- to =

dff21b5

as.matrix now handles type conversion

a7118b7

For data.tables with multiple column types, as.matrix.data.table performs type conversion across columns before handing the data.table to the C implementation of as.matrix

MichaelChirico reviewed Dec 29, 2019

R/data.table.R (outdated)

MichaelChirico reviewed Dec 29, 2019

R/data.table.R (outdated)

MichaelChirico reviewed Dec 29, 2019

R/data.table.R (outdated)

Bugfix

7b4891a

Explicit integer use + correct handling of multi-type columns

MichaelChirico reviewed Dec 29, 2019

R/data.table.R (outdated)

sritchie73 added 2 commits December 30, 2019 12:47

Style fix: Numeric -> Explicit integer

cc08c5d

sritchie73 changed the title ~~Fast data.table to matrix conversion in C [Depends: #4195, #4196, #4203]~~ Fast data.table to matrix conversion in C [Depends: #4196, #4203] Feb 19, 2020

sritchie73 changed the title ~~Fast data.table to matrix conversion in C [Depends: #4196, #4203]~~ Fast data.table to matrix conversion in C [Depends: #4196] Feb 19, 2020

sritchie73 added 3 commits February 19, 2020 17:28

Fixed detection of coercion required

93563a3

base type is logical not raw

7236060

Fixed raw type coercion rules

f5c5b46

sritchie73 added 8 commits February 19, 2020 17:45

raw now works with list columns

dd81334

Fixed ncol incrementer

2d30f8a

bugfix raw type detection and coercion

61fcc0a

Refactored preprocess in asmatrix

6a5cb04

Previous version had lots of interacting parts making it (1) buggy, (2) difficult to test, and (3) impossible to maintain. This refactor makes the logic clearer and more modular so that the logic is easier to follow and requires fewer tests to cover.

Updated integer64 tests for raw rules

7668701

Fixed initialisation rules

8519f31

*wd is a pointer to an INTEGER SEXP array not int64_t

fdea714

missed an int64_t case

53846b8

mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020

jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022

jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023

MichaelChirico removed the WIP label Feb 19, 2024

MichaelChirico marked this pull request as draft February 19, 2024 04:29

MichaelChirico modified the milestones: 1.16.0, 1.17.0 Jul 10, 2024

ben-schwen mentioned this pull request Oct 12, 2024

FR: setDT for matrices #6565

Open
