[FEA] Add Host Memory Retry Columnar To Row Conversion #8886
Labels
reliability
Features to improve reliability or bugs that severely impact the reliability of the plugin
task
Work required that improves the product but is not user facing
Is your feature request related to a problem? Please describe.
GpuColumnarToRowExec is a little complicated because we have a number of different optimizations around it. Ultimately we need a good way to limit the amount of host memory that it can use.
#9862 should give us hard limits, but we also want to be able to retry a host allocation if it fails, as we do with the GPU retry framework. This issue is to add that retry where needed.
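To make the retry idea concrete, here is a minimal sketch in Python rather than the plugin's own code. Everything in it (`HostBudget`, `HostOutOfMemory`, `with_retry`, `spill_held`) is a hypothetical name invented for illustration. It only shows the shape of the idea: an allocation that fails against a hard limit triggers an attempt to free memory and is then tried again.

```python
class HostOutOfMemory(Exception):
    """Hypothetical signal that a host allocation exceeded the hard limit."""


class HostBudget:
    """Toy stand-in for a hard host memory limit (not the plugin's API)."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def allocate(self, size):
        if self.used_bytes + size > self.limit_bytes:
            raise HostOutOfMemory(f"need {size} bytes, limit would be exceeded")
        self.used_bytes += size
        return bytearray(size)

    def free(self, size):
        self.used_bytes -= size


def with_retry(attempt, make_room, max_attempts=3):
    """Run attempt(); on HostOutOfMemory call make_room() and try again."""
    for i in range(max_attempts):
        try:
            return attempt()
        except HostOutOfMemory:
            # Give up on the last attempt, or if nothing could be freed.
            if i == max_attempts - 1 or not make_room():
                raise


budget = HostBudget(limit_bytes=100)
held = budget.allocate(80)  # another consumer already holds 80 bytes


def spill_held():
    """Hypothetical spill: release the 80 held bytes, report if anything was freed."""
    if budget.used_bytes >= 80:
        budget.free(80)
        return True
    return False


# The first attempt fails (80 + 64 > 100); after the spill the retry succeeds.
buf = with_retry(lambda: budget.allocate(64), spill_held)
```

In the real plugin the "make room" step would presumably be spilling or blocking until other tasks release memory, which is where the deadlock concern comes from.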
The accelerated transpose case (AcceleratedColumnarToRowIterator) converts the data to one or more HostColumnVector instances that are lists of bytes, which hold a row format similar to UnsafeRow in Spark.
The non-accelerated case (ColumnarToRowIterator) will just copy the data to the host and then walk that data one row at a time.
The primary goal here is to limit the host memory without deadlocking. The easiest way to do this would be to take the HostColumnVectors and make them spillable. Then we would get the wrapped column vector each time we needed to read some data and release it when we were done. This would work, but it is far from ideal, especially if we have a lot of columns and spilling is happening regularly.
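The get-and-release pattern for a spillable host column could look roughly like the following Python sketch. `SpillableHostColumn` and its methods are hypothetical and are not the plugin's spill framework. The point is only that the data is pinned while a reader holds it and can be spilled at any other time.

```python
from contextlib import contextmanager


class SpillableHostColumn:
    """Hypothetical wrapper: data is either in host memory or in a spill store."""

    def __init__(self, data):
        self._in_memory = bytes(data)
        self._spilled = None
        self._pins = 0  # number of active readers

    def spill(self):
        """Move the data out of host memory if nobody is reading it."""
        if self._pins > 0 or self._in_memory is None:
            return False
        self._spilled = self._in_memory  # stand-in for writing to disk
        self._in_memory = None
        return True

    @contextmanager
    def borrowed(self):
        """Materialize the data for one read, then make it spillable again."""
        if self._in_memory is None:
            self._in_memory = self._spilled  # stand-in for reading back in
            self._spilled = None
        self._pins += 1
        try:
            yield self._in_memory
        finally:
            self._pins -= 1


col = SpillableHostColumn(b"row0row1row2")
spilled_while_idle = col.spill()  # allowed: no readers
with col.borrowed() as data:
    first_row = data[0:4]
    spilled_while_reading = col.spill()  # refused: a reader holds the data
```

Each `borrowed()` call after a spill pays to read the whole column back in, which is the cost that makes this approach less attractive when there are many columns and spilling is frequent.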
It would be better if we could chunk the data on demand into smaller chunks.
For the accelerated case I think we could adjust the limits it currently has in place so we could have a target size that we pass down to the kernels. They would then return chunks of rows that are about that size. When we copy them back to the host, there would be more overhead on the heap to keep track of more objects, but it would also reduce the maximum amount of non-spillable memory that we have at any point in time, and it would also reduce the amount of memory that might need to be read back in each time.
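As a rough illustration of chunking toward a target size, the Python sketch below plans row ranges from per-row byte sizes. The function name `plan_row_chunks` and the greedy policy are assumptions made for this example. The actual kernels may pick boundaries differently.

```python
def plan_row_chunks(row_sizes, target_bytes):
    """Return [start, end) row ranges whose total size is about target_bytes."""
    chunks = []
    start = 0
    current = 0
    for i, size in enumerate(row_sizes):
        # Close the current chunk before it would exceed the target, but
        # never emit an empty chunk: a single oversized row stands alone.
        if current > 0 and current + size > target_bytes:
            chunks.append((start, i))
            start = i
            current = 0
        current += size
    if start < len(row_sizes):
        chunks.append((start, len(row_sizes)))
    return chunks


chunks = plan_row_chunks([40, 40, 40, 100, 10], target_bytes=100)
# chunks == [(0, 2), (2, 3), (3, 4), (4, 5)]
```

A smaller target gives more chunks, and so more objects to track on the heap, but each chunk that is pinned or read back from spill is correspondingly smaller.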
For the non-accelerated case I think we would need a contiguous-split-like API or something similar. I am not super happy with the idea that we would need to do more computation, but perhaps we could do this dynamically, as we do for spill with bounce buffers and the chunked pack API there. This might need to be a follow-on piece of work, as it is likely a lot more complicated.
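The bounce-buffer idea can be sketched as follows, again in Python and under simplifying assumptions: rows are fixed width, the source length is a multiple of the row width, and a plain slice copy stands in for the device-to-host copy. `rows_via_bounce_buffer` is a hypothetical helper, not the chunked pack API.

```python
def rows_via_bounce_buffer(source, row_width, bounce_size):
    """Yield fixed-width rows from source, staged through one reusable buffer."""
    if bounce_size < row_width:
        raise ValueError("bounce buffer must hold at least one row")
    bounce = bytearray(bounce_size)
    # Copy only whole rows per step so no row straddles two copies.
    step = (bounce_size // row_width) * row_width
    view = memoryview(source)
    for offset in range(0, len(source), step):
        n = min(step, len(source) - offset)
        bounce[:n] = view[offset:offset + n]  # stand-in for a device-to-host copy
        for r in range(0, n, row_width):
            yield bytes(bounce[r:r + row_width])


source = bytes(range(12))  # 4 rows of 3 bytes each
rows = list(rows_via_bounce_buffer(source, row_width=3, bounce_size=7))
```

Peak staging memory is bounded by `bounce_size` regardless of how large the batch is, at the price of more copies and the extra work of laying the data out so that it can be consumed in pieces.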