[BUG] Can't name subset of columns with read_csv #8973

efajardo-nv opened this issue Aug 5, 2021 · 6 comments · Fixed by #12018

Labels: bug (Something isn't working), cuIO (cuIO issue), libcudf (Affects libcudf (C++/CUDA) code)


Describe the bug
Our framework uses a wrapper around cuDF read_csv to read CSV file input into our data pipelines. Ideally, we should be able to read in any CSV with a single call to read_csv and the appropriate arguments. We have run into an issue where read_csv raises an error when we try to name columns selected via the usecols argument. The pandas equivalent works.

Steps/Code to reproduce bug

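import cudf
import pandas as pd

filename = 'foo.csv'
# Sample rows were elided in the original post; these are assumed,
# taken from the reproduction given later in this thread.
lines = [
    "num1,datetime,text",
    "123,2018-11-13T12:00:00,abc",
    "456,2018-11-14T12:35:01,def",
    "789,2018-11-15T18:02:59,ghi",
]

with open(filename, 'w') as fp:
    fp.write('\n'.join(lines) + '\n')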

>>> cudf.read_csv(filename, skiprows=1, header=None, usecols=[2], names=['renamed_text_col'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/rapids/lib/python3.8/", line 75, in inner
    return func(*args, **kwds)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/io/", line 70, in read_csv
    return libcudf.csv.read_csv(
  File "cudf/_lib/csv.pyx", line 393, in cudf._lib.csv.read_csv
RuntimeError: basic_string::_M_construct null not valid


>>> pd.read_csv(filename, skiprows=1, header=None, usecols=[2], names=['renamed_text_col'])
0              abc
1              def
2              ghi

Expected behavior
Same result as pandas
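
To make the expected semantics concrete, here is a minimal standard-library sketch of what "select by index, then name the selection" would mean. It is not how pandas or cuDF implement it; `read_selected` is a hypothetical helper, and the sample text is assumed from the data posted later in this thread.

```python
import csv
import io


def read_selected(text, usecols, names, skiprows=0):
    """Hypothetical helper: pick columns by index and label them with `names`."""
    if len(usecols) != len(names):
        raise ValueError("need exactly one name per selected column")
    # Parse every row, then drop the first `skiprows` rows (here: the header).
    rows = list(csv.reader(io.StringIO(text)))[skiprows:]
    # One list of values per selected column, keyed by the caller-supplied name.
    return {name: [row[idx] for row in rows] for idx, name in zip(usecols, names)}


# Assumed sample data (same shape as the reproduction in this thread).
sample = (
    "num1,datetime,text\n"
    "123,2018-11-13T12:00:00,abc\n"
    "456,2018-11-14T12:35:01,def\n"
    "789,2018-11-15T18:02:59,ghi\n"
)

result = read_selected(sample, usecols=[2], names=["renamed_text_col"], skiprows=1)
print(result)
```

The single selected column comes back under the supplied name, which is the behavior requested from cudf.read_csv here.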

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of cuDF install: Docker
    • docker pull rapidsai/rapidsai-nightly:21.08-cuda11.0-runtime-ubuntu18.04-py3.8

Environment details

 ***OS Information***
 VERSION="18.04.5 LTS (Bionic Beaver)"
 PRETTY_NAME="Ubuntu 18.04.5 LTS"
 Linux EFAJARDO-DT 5.4.0-77-generic #86~18.04.1-Ubuntu SMP Fri Jun 18 01:23:22 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
 ***GPU Information***
 Thu Aug  5 17:48:05 2021
 | NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
 | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
 |                               |                      |               MIG M. |
 |   0  NVIDIA Quadro R...  On   | 00000000:15:00.0 Off |                  Off |
 | 33%   38C    P8    32W / 260W |   8114MiB / 48601MiB |      0%      Default |
 |                               |                      |                  N/A |
 | Processes:                                                                  |
 |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
 |        ID   ID                                                   Usage      |
 Architecture:        x86_64
 CPU op-mode(s):      32-bit, 64-bit
 Byte Order:          Little Endian
 CPU(s):              12
 On-line CPU(s) list: 0-11
 Thread(s) per core:  2
 Core(s) per socket:  6
 Socket(s):           1
 NUMA node(s):        1
 Vendor ID:           GenuineIntel
 CPU family:          6
 Model:               85
 Model name:          Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz
 Stepping:            4
 CPU MHz:             1786.574
 CPU max MHz:         3700.0000
 CPU min MHz:         1200.0000
 BogoMIPS:            6800.00
 Virtualization:      VT-x
 L1d cache:           32K
 L1i cache:           32K
 L2 cache:            1024K
 L3 cache:            19712K
 NUMA node0 CPU(s):   0-11
 Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke md_clear flush_l1d
 Python 3.8.10
 ***Environment Variables***
 PATH                            : /opt/conda/envs/rapids/bin:/opt/conda/condabin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
 LD_LIBRARY_PATH                 : /usr/local/nvidia/lib:/usr/local/nvidia/lib64
 NUMBAPRO_NVVM                   :
 NUMBAPRO_LIBDEVICE              :
 CONDA_PREFIX                    : /opt/conda/envs/rapids
 PYTHON_PATH                     :

charlesbluca commented Aug 11, 2021

It looks like specifying names somehow causes an illegal memory access:

In [7]: import cudf
   ...: import pandas as pd
   ...: filename = 'foo.csv'
   ...: lines = [
   ...:   "num,text",
   ...:   "123,abc",
   ...:   "456,def",
   ...:   "789,ghi"
   ...: ]
   ...: with open(filename, 'w') as fp:
   ...:     fp.write('\n'.join(lines)+'\n')

In [8]: cudf.read_csv(filename, usecols=[1])
0  abc
1  def
2  ghi

In [9]: cudf.read_csv(filename, usecols=[1], names=[0])
RuntimeError                              Traceback (most recent call last)
<ipython-input-9-0a0dfd4ac3ce> in <module>
----> 1 cudf.read_csv(filename, usecols=[1], names=[0])

~/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/ in inner(*args, **kwds)
     73         def inner(*args, **kwds):
     74             with self._recreate_cm():
---> 75                 return func(*args, **kwds)
     76         return inner

~/cudf/python/cudf/cudf/io/ in read_csv(filepath_or_buffer, lineterminator, quotechar, quoting, doublequote, header, mangle_dupe_cols, usecols, sep, delimiter, delim_whitespace, skipinitialspace, names, dtype, skipfooter, skiprows, dayfirst, compression, thousands, decimal, true_values, false_values, nrows, byte_range, skip_blank_lines, parse_dates, comment, na_values, keep_default_na, na_filter, prefix, index_col, **kwargs)
     68         na_values = [na_values]
---> 70     return libcudf.csv.read_csv(
     71         filepath_or_buffer,
     72         lineterminator=lineterminator,

~/cudf/python/cudf/cudf/_lib/csv.pyx in cudf._lib.csv.read_csv()
    392     cdef table_with_metadata c_result
    393     with nogil:
--> 394         c_result = move(cpp_read_csv(read_csv_options_c))
    396     meta_names = [name.decode() for name in c_result.metadata.column_names]

RuntimeError: reduce failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

While looking at this, I also noticed that it's relatively easy to get an illegal memory access by passing an out-of-range column index in usecols:

In [4]: cudf.read_csv(filename, usecols=[2])
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-97a4a25f179e> in <module>
----> 1 cudf.read_csv(filename, usecols=[2])

~/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/ in inner(*args, **kwds)
     73         def inner(*args, **kwds):
     74             with self._recreate_cm():
---> 75                 return func(*args, **kwds)
     76         return inner

~/cudf/python/cudf/cudf/io/ in read_csv(filepath_or_buffer, lineterminator, quotechar, quoting, doublequote, header, mangle_dupe_cols, usecols, sep, delimiter, delim_whitespace, skipinitialspace, names, dtype, skipfooter, skiprows, dayfirst, compression, thousands, decimal, true_values, false_values, nrows, byte_range, skip_blank_lines, parse_dates, comment, na_values, keep_default_na, na_filter, prefix, index_col, **kwargs)
     68         na_values = [na_values]
---> 70     return libcudf.csv.read_csv(
     71         filepath_or_buffer,
     72         lineterminator=lineterminator,

~/cudf/python/cudf/cudf/_lib/csv.pyx in cudf._lib.csv.read_csv()
    392     cdef table_with_metadata c_result
    393     with nogil:
--> 394         c_result = move(cpp_read_csv(read_csv_options_c))
    396     meta_names = [name.decode() for name in c_result.metadata.column_names]

RuntimeError: reduce failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

I can open a separate issue for that.
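
In the meantime, a caller-side check could at least turn the out-of-range case into an ordinary Python exception. This is only a sketch of a possible workaround: `check_usecols` is a hypothetical helper, not part of cuDF, and it assumes the first line of the file has the same number of fields as the data rows.

```python
import csv
import io


def check_usecols(first_line, usecols, sep=","):
    """Hypothetical pre-check: raise if any index in `usecols` is out of range."""
    # Count the fields in the first line; an empty line counts as zero columns.
    fields = next(csv.reader(io.StringIO(first_line), delimiter=sep), [])
    num_columns = len(fields)
    bad = [i for i in usecols if not 0 <= i < num_columns]
    if bad:
        raise IndexError(f"usecols {bad} out of range for {num_columns} columns")
    return num_columns


# The two-column file from the example above: index 1 is valid, index 2 is not.
print(check_usecols("num,text", [1]))
try:
    check_usecols("num,text", [2])
except IndexError as err:
    print("rejected:", err)
```

A wrapper could run this on the first line of the file before handing the same usecols to read_csv.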

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

mattf commented Oct 21, 2022

this is a segfault in 22.10 (cc @beckernick)

$ python3.9 -m IPython
Python 3.9.14 (main, Sep  7 2022, 23:43:48) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import io, cudf, pandas as pd

In [2]: cudf.__version__
Out[2]: '22.10.00a+392.g1558403753'

In [3]: f = lambda: io.StringIO("""
   ...: num1,datetime,text
   ...: 123,2018-11-13T12:00:00,abc
   ...: 456,2018-11-14T12:35:01,def
   ...: 789,2018-11-15T18:02:59,ghi
   ...: """)

In [4]: pd.read_csv(f(), skiprows=1, header=None, usecols=[2], names=['renamed_text_col'])
0             text
1              abc
2              def
3              ghi

In [5]: cudf.read_csv(f(), skiprows=1, header=None, usecols=[2], names=['renamed_text_col'])
Segmentation fault (core dumped)

$ nvidia-smi 
Fri Oct 21 13:43:38 2022       
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P8    14W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|  No running processes found                                                 |

shwina commented Oct 21, 2022

@galipremsagar any chance you could look into this?

galipremsagar self-assigned this Oct 21, 2022
vuule self-assigned this Oct 26, 2022
rapids-bot bot pushed a commit that referenced this issue Nov 17, 2022
…csv` (#12018)

closes #8973
CSV reader has a few gaps in the logic for column selection and user specified column names:
1. Users cannot specify names for only the selected columns;
2. The reader fails in unpredictable ways when only a subset of column names is passed (without column selection).

This PR fixes the issues above. Users can now specify column names (there can be fewer names than actual columns) or names for columns selected via their indices (the number of names must match the number of indices). If selection via indices is used, the number of column names has to match either the actual number of columns or the number of selected columns.
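
The counting rules above can be sketched in plain Python. This is an illustration of the rules as worded here, not the libcudf implementation: `resolve_names` is a hypothetical function, and pairing names with indices in the order given is an assumption.

```python
def resolve_names(num_columns, names, usecols=None):
    """Hypothetical sketch: map column index -> name under the rules described above."""
    if usecols is None:
        # No selection: names may cover fewer columns than the file has.
        if len(names) > num_columns:
            raise ValueError("more names than columns")
        # Only the named columns are returned; the rest are left out of this sketch.
        return dict(enumerate(names))
    if len(names) == len(usecols):
        # One name per selected index (assumed to pair up in the order given).
        return dict(zip(usecols, names))
    if len(names) == num_columns:
        # Names given for every column: keep only the selected ones.
        return {i: names[i] for i in usecols}
    raise ValueError(
        "number of names must match the number of columns or of selected columns"
    )


# The case from this issue: three columns, select index 2, supply one name.
print(resolve_names(3, ["renamed_text_col"], usecols=[2]))
```

Any other combination of counts with usecols is rejected up front rather than reaching the reader.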

Also fixed a test error that went unnoticed due to the issues above.

Authors:
  - Vukasin Milovanovic

Approvers:
  - Karthikeyan
  - Vyas Ramasubramani
  - Nghia Truong

URL: #12018