[BUG] Can't name subset of columns with read_csv #8973
Comments
It looks like specifying `names` together with `usecols` fails:
In [7]: import cudf
   ...: import pandas as pd
   ...:
   ...: filename = 'foo.csv'
   ...: lines = [
   ...:     "num,text",
   ...:     "123,abc",
   ...:     "456,def",
   ...:     "789,ghi"
   ...: ]
   ...:
   ...: with open(filename, 'w') as fp:
   ...:     fp.write('\n'.join(lines)+'\n')
   ...:
In [8]: cudf.read_csv(filename, usecols=[1])
Out[8]:
  text
0  abc
1  def
2  ghi
In [9]: cudf.read_csv(filename, usecols=[1], names=[0])
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-9-0a0dfd4ac3ce> in <module>
----> 1 cudf.read_csv(filename, usecols=[1], names=[0])
~/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/contextlib.py in inner(*args, **kwds)
73 def inner(*args, **kwds):
74 with self._recreate_cm():
---> 75 return func(*args, **kwds)
76 return inner
77
~/cudf/python/cudf/cudf/io/csv.py in read_csv(filepath_or_buffer, lineterminator, quotechar, quoting, doublequote, header, mangle_dupe_cols, usecols, sep, delimiter, delim_whitespace, skipinitialspace, names, dtype, skipfooter, skiprows, dayfirst, compression, thousands, decimal, true_values, false_values, nrows, byte_range, skip_blank_lines, parse_dates, comment, na_values, keep_default_na, na_filter, prefix, index_col, **kwargs)
68 na_values = [na_values]
69
---> 70 return libcudf.csv.read_csv(
71 filepath_or_buffer,
72 lineterminator=lineterminator,
~/cudf/python/cudf/cudf/_lib/csv.pyx in cudf._lib.csv.read_csv()
392 cdef table_with_metadata c_result
393 with nogil:
--> 394 c_result = move(cpp_read_csv(read_csv_options_c))
395
396 meta_names = [name.decode() for name in c_result.metadata.column_names]
RuntimeError: reduce failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
While looking at this, I also noticed that it's relatively easy to get an illegal memory access by passing an out-of-range column index in `usecols`:
In [4]: cudf.read_csv(filename, usecols=[2])
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-4-97a4a25f179e> in <module>
----> 1 cudf.read_csv(filename, usecols=[2])
~/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/contextlib.py in inner(*args, **kwds)
73 def inner(*args, **kwds):
74 with self._recreate_cm():
---> 75 return func(*args, **kwds)
76 return inner
77
~/cudf/python/cudf/cudf/io/csv.py in read_csv(filepath_or_buffer, lineterminator, quotechar, quoting, doublequote, header, mangle_dupe_cols, usecols, sep, delimiter, delim_whitespace, skipinitialspace, names, dtype, skipfooter, skiprows, dayfirst, compression, thousands, decimal, true_values, false_values, nrows, byte_range, skip_blank_lines, parse_dates, comment, na_values, keep_default_na, na_filter, prefix, index_col, **kwargs)
68 na_values = [na_values]
69
---> 70 return libcudf.csv.read_csv(
71 filepath_or_buffer,
72 lineterminator=lineterminator,
~/cudf/python/cudf/cudf/_lib/csv.pyx in cudf._lib.csv.read_csv()
392 cdef table_with_metadata c_result
393 with nogil:
--> 394 c_result = move(cpp_read_csv(read_csv_options_c))
395
396 meta_names = [name.decode() for name in c_result.metadata.column_names]
RuntimeError: reduce failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
I can open a separate issue for that.
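Until the reader rejects bad indices itself, a caller-side guard can keep an out-of-range index from ever reaching the GPU reader. The following is a minimal sketch, not part of cuDF: `validate_usecols` is a hypothetical helper that reads only the first row with Python's `csv` module and raises before `read_csv` is called. It assumes integer indices and a delimited text file whose first row has the full column count.

```python
import csv


def validate_usecols(path, usecols, delimiter=","):
    """Hypothetical guard: raise if any integer index in `usecols` is out of range.

    Returns the number of columns found in the first row of the file.
    """
    with open(path, newline="") as fp:
        # Only the first row is read; an empty file yields zero columns.
        first_row = next(csv.reader(fp, delimiter=delimiter), [])
    num_columns = len(first_row)
    bad = [i for i in usecols if not 0 <= i < num_columns]
    if bad:
        raise IndexError(
            f"usecols {bad} out of range for a file with {num_columns} columns"
        )
    return num_columns
```

For the two-column `foo.csv` above, `validate_usecols(filename, [2])` would raise `IndexError`, while `validate_usecols(filename, [1])` passes and the file can then be handed to `cudf.read_csv`.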
This is a segfault in 22.10 (cc @beckernick).
@galipremsagar any chance you could look into this?
Yup
…csv` (#12018)
closes #8973
The CSV reader has a few gaps in the logic for column selection and user-specified column names:
1. Users cannot specify names for only the selected columns;
2. The reader fails in unpredictable ways when only a subset of column names is passed (without column selection).
This PR fixes the issues above. Users can now specify column names (the count can be lower than the actual number of columns) or names of columns selected via their indices (the count must match the number of indices). If selection via indices is used, the number of column names has to match either the actual number of columns or the number of selected columns.
Also fixed a test error that went unnoticed due to the issues above.
Authors:
- Vukasin Milovanovic (https://github.com/vuule)
Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Karthikeyan (https://github.com/karthikeyann)
- Vyas Ramasubramani (https://github.com/vyasr)
- Nghia Truong (https://github.com/ttnghia)
- https://github.com/nvdbaranec
URL: #12018
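The name-count rule for index-based selection described in this PR can be illustrated with a small pure-Python sketch. This is not the libcudf implementation: `names_for_selected` is a hypothetical function, and it assumes that names pair with the selected columns in file order (indices are de-duplicated and sorted), which the PR description does not state.

```python
def names_for_selected(num_columns, usecols, names):
    """Map `names` onto the columns picked by integer `usecols`.

    Illustrative only: follows the rule stated in the PR description.
    Returns a dict of {column index: column name} for the selected columns.
    """
    # Assumption: selected columns are considered in file order.
    selected = sorted(set(usecols))
    if any(not 0 <= i < num_columns for i in selected):
        raise IndexError("usecols index out of range")
    if len(names) == len(selected):
        # One name per selected column.
        return dict(zip(selected, names))
    if len(names) == num_columns:
        # One name per file column; keep only the selected ones.
        return {i: names[i] for i in selected}
    raise ValueError(
        "number of names must match either the number of columns in the file "
        "or the number of selected columns"
    )
```

Under these assumptions, the call from the issue (`usecols=[1], names=[0]` on a two-column file) resolves to `{1: 0}`, and passing two names for one selected column out of three is rejected with `ValueError`.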
Describe the bug
Our framework uses a wrapper around cuDF `read_csv` to read CSV file input into our data pipelines. Ideally, we should be able to read in any CSV using a single call to `read_csv` with the appropriate arguments. We have run into an issue where `read_csv` returns an error when trying to name columns selected via the `usecols` argument. The `pandas` equivalent works.
Steps/Code to reproduce bug
cuDF:
pandas:
Expected behavior
Same result as pandas
Environment overview (please complete the following information)
Environment details