[FEA] perform shift operator with string column #9150

Closed
rnyak opened this issue Aug 31, 2021 · 2 comments
Labels
feature request: New feature or request
libcudf: Affects libcudf (C++/CUDA) code.

Comments

@rnyak
Contributor

rnyak commented Aug 31, 2021

Is your feature request related to a problem? Please describe.

I am getting the following error from my code (see below):

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<timed exec> in <module>

/opt/conda/lib/python3.8/site-packages/cudf/core/frame.py in shift(self, periods, freq, axis, fill_value)
   1640         """
   1641         assert axis in (None, 0) and freq is None
-> 1642         return self._shift(periods)
   1643 
   1644     def _shift(self, offset, fill_value=None):

/opt/conda/lib/python3.8/site-packages/cudf/core/frame.py in _shift(self, offset, fill_value)
   1645         data_columns = (col.shift(offset, fill_value) for col in self._columns)
   1646         data = zip(self._column_names, data_columns)
-> 1647         return self.__class__._from_table(Frame(data, self._index))
   1648 
   1649     def __array__(self, dtype=None):

cudf/_lib/table.pyx in cudf._lib.table.Table.__init__()

/opt/conda/lib/python3.8/site-packages/cudf/core/column_accessor.py in __init__(self, data, multiindex, level_names)
    119             self._data = {}
    120             if data:
--> 121                 data = dict(data)
    122                 # Faster than next(iter(data.values()))
    123                 column_length = len(data[next(iter(data))])

/opt/conda/lib/python3.8/site-packages/cudf/core/frame.py in <genexpr>(.0)
   1643 
   1644     def _shift(self, offset, fill_value=None):
-> 1645         data_columns = (col.shift(offset, fill_value) for col in self._columns)
   1646         data = zip(self._column_names, data_columns)
   1647         return self.__class__._from_table(Frame(data, self._index))

/opt/conda/lib/python3.8/site-packages/cudf/core/column/column.py in shift(self, offset, fill_value)
    386 
    387     def shift(self, offset: int, fill_value: ScalarLike) -> ColumnBase:
--> 388         return libcudf.copying.shift(self, offset, fill_value)
    389 
    390     @property

cudf/_lib/copying.pyx in cudf._lib.copying.shift()

RuntimeError: cuDF failure at: /workspace/build-env/cpp/src/copying/shift.cu:51: shift does not support non-fixed-width types.

Describe the solution you'd like

You can reproduce the error with the following code:

import cudf

df = cudf.read_parquet('df_toy.parquet')

# Sort the dataframe by session and timestamp so that consecutive repetitions are adjacent
df = df.sort_values(['user_session', 'event_time_ts']).reset_index(drop=True)

print("Count with in-session repeated interactions: {}".format(len(df)))
# Shift by one row to get the previous row's product and session
df['product_id_past'] = df['product_id'].shift(1).fillna(0)
# user_session is a string column; this is the shift that raises the RuntimeError above
df['session_id_past'] = df['user_session'].shift(1).fillna(0)
# Keep repeated interactions on the same item, removing only consecutive in-session repetitions
df = df[~((df['user_session'] == df['session_id_past']) &
          (df['product_id'] == df['product_id_past']))]
print("Count after removing consecutive in-session repeated interactions: {}".format(len(df)))
del df['product_id_past']
del df['session_id_past']
Dataset can be downloaded from here.

Describe alternatives you've considered

As a workaround, I have to label-encode the user_session column first so that the shift operation can run on it.
The same pipeline runs in pandas without changes, even with long strings.
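
For readers who want to see the logic of that workaround, here is a minimal sketch in plain Python (no cudf or pandas), assuming the goal is the same as in the snippet above: drop a row only when it repeats the previous row's session and product. The helper names `label_encode`, `shift`, and `drop_consecutive_repeats` are hypothetical and are not part of cudf; the sketch only illustrates encoding strings to integers before shifting.

```python
from typing import Dict, List, Sequence, Tuple


def label_encode(values: Sequence[str]) -> List[int]:
    """Map each distinct value to an integer code, in order of first appearance."""
    codes: Dict[str, int] = {}
    return [codes.setdefault(v, len(codes)) for v in values]


def shift(values: Sequence[int], periods: int = 1, fill_value: int = -1) -> List[int]:
    """Shift values down by `periods` rows, padding the top with `fill_value`."""
    if periods < 0:
        raise ValueError("this sketch only supports non-negative periods")
    return ([fill_value] * periods + list(values))[: len(values)]


def drop_consecutive_repeats(
    sessions: Sequence[str], products: Sequence[str]
) -> List[Tuple[str, str]]:
    """Drop rows whose (session, product) equals that of the previous row.

    Rows are assumed to be sorted by session and timestamp already.
    """
    session_codes = label_encode(sessions)
    product_codes = label_encode(products)
    # Codes are >= 0, so the fill value -1 can never match a real code.
    prev_sessions = shift(session_codes, 1, fill_value=-1)
    prev_products = shift(product_codes, 1, fill_value=-1)
    rows = zip(sessions, products, session_codes, product_codes,
               prev_sessions, prev_products)
    return [
        (s, p)
        for s, p, sc, pc, ps, pp in rows
        if not (sc == ps and pc == pp)
    ]


if __name__ == "__main__":
    sessions = ["s1", "s1", "s1", "s2", "s2"]
    products = ["a", "a", "b", "b", "b"]
    print(drop_consecutive_repeats(sessions, products))
```

Using a fill value outside the code range (here -1) avoids accidentally matching the first row against a real code, which a fill value of 0 could do after encoding.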

@rnyak added the Needs Triage and feature request labels Aug 31, 2021
@galipremsagar changed the title from "[FEA] perform shift operator with columns including long strings" to "[FEA] perform shift operator with string column" Aug 31, 2021
@galipremsagar galipremsagar added the libcudf Affects libcudf (C++/CUDA) code. label Aug 31, 2021
@davidwendt
Contributor

Strings column support was added to cudf::shift() in release 21.08 in PR #8648.
The runtime error included in the description

RuntimeError: cuDF failure at: /workspace/build-env/cpp/src/copying/shift.cu:51: shift does not support non-fixed-width types.

indicates this was tested on a build prior to 21.08.
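
One way to check this before calling shift on a string column is to compare the installed version against 21.08. The sketch below is only an illustration: `supports_string_shift` is a hypothetical helper, not a cudf function, and it assumes the version string starts with `major.minor`, as in `21.08.00`. In practice the string would come from something like `cudf.__version__`.

```python
import re


def supports_string_shift(version: str) -> bool:
    """Return True if `version` is 21.08 or later.

    Hypothetical helper: only the leading "major.minor" part is parsed,
    so suffixes such as ".00" or "a+12.gabc" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison orders by major first, then minor.
    return (major, minor) >= (21, 8)


if __name__ == "__main__":
    for v in ("21.06.01", "21.08.00", "21.10.00a+12.gabc"):
        print(v, supports_string_shift(v))
```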

@beckernick
Member

If you run into any issues with shift after upgrading versions, please feel free to file an issue. For now, I'm going to close this issue as answered.

@galipremsagar removed the Needs Triage label Aug 31, 2021