From dd68db3b644c4448f9c87a43dcb303e9fb055ad4 Mon Sep 17 00:00:00 2001 From: Ashwin Srinath <3190405+shwina@users.noreply.github.com> Date: Wed, 4 May 2022 17:39:26 -0400 Subject: [PATCH] Reorganize cuDF Python docs (#10691) This PR is composed of two high-level changes: * Replaces the use of ReStructuredText with [MyST Markdown](https://myst-parser.readthedocs.io/en/latest/). I used [rst2myst](https://github.com/executablebooks/rst2myst) for this and it worked pretty well. The rationale for this change is simple: we use `myst-nb` to render notebooks into documentation, and for consistency, it's nice to use `myst-parser` to parse the rest of our docs too. As a matter of opinion, I think Markdown is simpler and more familiar to most developers. * Reorganizes the docs (see below): Prior to this PR, the cuDF documentation was divided into 3 sections: * A user guide * A "Basics" section * API reference The distinction between the first two sections was never clear. I've gone ahead and merged those into a single section named "User Guide". This is also more consistent with Pandas. This PR also makes a couple of other changes: - Renamed the "Basics" page under the previous "Basics" section to "Data Types", as that reflects its contents more accurately. I also modified the content here a bit. - Renamed the "10 minutes to CuPy and CuDF" notebook to "Interoperability between CuPy and CuDF" as that more accurately describes what that page is about. ---- Compare the TOC from this PR (below) with our [currently published docs](https://docs.rapids.ai/api/cudf/stable/). 
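To make the first change concrete, here is a rough sketch of what the RST-to-MyST conversion looks like (the page names below are illustrative, not copied from this PR). An RST `toctree` such as:

```rst
.. toctree::
   :maxdepth: 2

   groupby
   internals
```

is expressed in MyST Markdown as a fenced directive:

````markdown
```{toctree}
:maxdepth: 2

groupby
internals
```
````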
Screen Shot 2022-04-20 at 1 13 04 PM Authors: - Ashwin Srinath (https://github.com/shwina) - Mike McCarty (https://github.com/mmccarty) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Mike McCarty (https://github.com/mmccarty) - Bradley Dice (https://github.com/bdice) URL: https://github.com/rapidsai/cudf/pull/10691 --- docs/cudf/source/_static/params.css | 8 +- docs/cudf/source/basics/PandasCompat.rst | 4 - docs/cudf/source/basics/basics.rst | 62 -- docs/cudf/source/basics/dask-cudf.rst | 107 ---- docs/cudf/source/basics/groupby.rst | 274 -------- docs/cudf/source/basics/index.rst | 15 - docs/cudf/source/basics/internals.rst | 216 ------- .../cudf/source/basics/io-gds-integration.rst | 42 -- .../source/basics/io-nvcomp-integration.rst | 27 - docs/cudf/source/basics/io.rst | 13 - docs/cudf/source/index.rst | 1 - docs/cudf/source/user_guide/10min.ipynb | 371 +++++++---- docs/cudf/source/user_guide/PandasCompat.md | 5 + ...min-cudf-cupy.ipynb => cupy-interop.ipynb} | 246 ++++--- docs/cudf/source/user_guide/dask-cudf.md | 104 +++ docs/cudf/source/user_guide/data-types.md | 153 +++++ docs/cudf/source/user_guide/groupby.md | 273 ++++++++ .../source/user_guide/guide-to-udfs.ipynb | 149 ++++- docs/cudf/source/user_guide/index.md | 16 + docs/cudf/source/user_guide/index.rst | 12 - docs/cudf/source/user_guide/internals.md | 212 +++++++ .../io.md} | 113 +++- ...-missing-data.ipynb => missing-data.ipynb} | 598 ++++++++++-------- 23 files changed, 1738 insertions(+), 1283 deletions(-) delete mode 100644 docs/cudf/source/basics/PandasCompat.rst delete mode 100644 docs/cudf/source/basics/basics.rst delete mode 100644 docs/cudf/source/basics/dask-cudf.rst delete mode 100644 docs/cudf/source/basics/groupby.rst delete mode 100644 docs/cudf/source/basics/index.rst delete mode 100644 docs/cudf/source/basics/internals.rst delete mode 100644 docs/cudf/source/basics/io-gds-integration.rst delete mode 
100644 docs/cudf/source/basics/io-nvcomp-integration.rst delete mode 100644 docs/cudf/source/basics/io.rst create mode 100644 docs/cudf/source/user_guide/PandasCompat.md rename docs/cudf/source/user_guide/{10min-cudf-cupy.ipynb => cupy-interop.ipynb} (87%) create mode 100644 docs/cudf/source/user_guide/dask-cudf.md create mode 100644 docs/cudf/source/user_guide/data-types.md create mode 100644 docs/cudf/source/user_guide/groupby.md create mode 100644 docs/cudf/source/user_guide/index.md delete mode 100644 docs/cudf/source/user_guide/index.rst create mode 100644 docs/cudf/source/user_guide/internals.md rename docs/cudf/source/{basics/io-supported-types.rst => user_guide/io.md} (69%) rename docs/cudf/source/user_guide/{Working-with-missing-data.ipynb => missing-data.ipynb} (87%) diff --git a/docs/cudf/source/_static/params.css b/docs/cudf/source/_static/params.css index 9e6be7ca75f..17c9d5accbd 100644 --- a/docs/cudf/source/_static/params.css +++ b/docs/cudf/source/_static/params.css @@ -50,11 +50,17 @@ table.io-supported-types-table thead{ } +/* Used to make special-table scrollable when it overflows */ +.special-table-wrapper { + width: 100%; + overflow: auto !important; +} + .special-table td, .special-table th { border: 1px solid #dee2e6; } -/* Needed to resolve https://github.com/executablebooks/jupyter-book/issues/1611 */ +/* Needed to resolve https://github.com/executablebooks/jupyter-book/issues/1611 */ .output.text_html { overflow: auto; } diff --git a/docs/cudf/source/basics/PandasCompat.rst b/docs/cudf/source/basics/PandasCompat.rst deleted file mode 100644 index fe9161e49c3..00000000000 --- a/docs/cudf/source/basics/PandasCompat.rst +++ /dev/null @@ -1,4 +0,0 @@ -Pandas Compatibility Notes -========================== - -.. 
pandas-compat-list:: diff --git a/docs/cudf/source/basics/basics.rst b/docs/cudf/source/basics/basics.rst deleted file mode 100644 index 9b8983fba49..00000000000 --- a/docs/cudf/source/basics/basics.rst +++ /dev/null @@ -1,62 +0,0 @@ -Basics -====== - - -Supported Dtypes ----------------- - -cuDF uses dtypes for Series or individual columns of a DataFrame. cuDF uses NumPy dtypes, NumPy provides support for ``float``, ``int``, ``bool``, -``'timedelta64[s]'``, ``'timedelta64[ms]'``, ``'timedelta64[us]'``, ``'timedelta64[ns]'``, ``'datetime64[s]'``, ``'datetime64[ms]'``, -``'datetime64[us]'``, ``'datetime64[ns]'`` (note that NumPy does not support timezone-aware datetimes). - - -The following table lists all of cudf types. For methods requiring dtype arguments, strings can be specified as indicated. See the respective documentation sections for more on each type. - -.. rst-class:: special-table -.. table:: - - +-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+ - | Kind of Data | Data Type | Scalar | String Aliases | - +=================+==================+==============================================================+==============================================+ - | Integer | | np.int8_, np.int16_, np.int32_, np.int64_, np.uint8_, | ``'int8'``, ``'int16'``, ``'int32'``, | - | | | np.uint16_, np.uint32_, np.uint64_ | ``'int64'``, ``'uint8'``, ``'uint16'``, | - | | | | ``'uint32'``, ``'uint64'`` | - +-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+ - | Float | | np.float32_, np.float64_ | ``'float32'``, ``'float64'`` | - +-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+ - | Strings | | `str `_ | ``'string'``, ``'object'`` | - 
+-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+ - | Datetime | | np.datetime64_ | ``'datetime64[s]'``, ``'datetime64[ms]'``, | - | | | | ``'datetime64[us]'``, ``'datetime64[ns]'`` | - +-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+ - | Timedelta | | np.timedelta64_ | ``'timedelta64[s]'``, ``'timedelta64[ms]'``, | - | (duration type) | | | ``'timedelta64[us]'``, ``'timedelta64[ns]'`` | - +-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+ - | Categorical | CategoricalDtype | (none) | ``'category'`` | - +-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+ - | Boolean | | np.bool_ | ``'bool'`` | - +-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+ - | Decimal | Decimal32Dtype, | (none) | (none) | - | | Decimal64Dtype, | | | - | | Decimal128Dtype | | | - +-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+ - | Lists | ListDtype | list | ``'list'`` | - +-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+ - | Structs | StructDtype | dict | ``'struct'`` | - +-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+ - -**Note: All dtypes above are Nullable** - -.. _np.int8: -.. _np.int16: -.. _np.int32: -.. _np.int64: -.. _np.uint8: -.. _np.uint16: -.. _np.uint32: -.. _np.uint64: -.. _np.float32: -.. 
_np.float64: -.. _np.bool: https://numpy.org/doc/stable/user/basics.types.html -.. _np.datetime64: https://numpy.org/doc/stable/reference/arrays.datetime.html#basic-datetimes -.. _np.timedelta64: https://numpy.org/doc/stable/reference/arrays.datetime.html#datetime-and-timedelta-arithmetic diff --git a/docs/cudf/source/basics/dask-cudf.rst b/docs/cudf/source/basics/dask-cudf.rst deleted file mode 100644 index a9c65dfbfae..00000000000 --- a/docs/cudf/source/basics/dask-cudf.rst +++ /dev/null @@ -1,107 +0,0 @@ -Multi-GPU with Dask-cuDF -======================== - -cuDF is a single-GPU library. For Multi-GPU cuDF solutions we use -`Dask `__ and the `dask-cudf -package `__, -which is able to scale cuDF across multiple GPUs on a single machine, or -multiple GPUs across many machines in a cluster. - -`Dask DataFrame `__ was -originally designed to scale Pandas, orchestrating many Pandas -DataFrames spread across many CPUs into a cohesive parallel DataFrame. -Because cuDF currently implements only a subset of Pandas’s API, not all -Dask DataFrame operations work with cuDF. - -The following is tested and expected to work: - -What works ----------- - -- Data ingestion - - - ``dask_cudf.read_csv`` - - Use standard Dask ingestion with Pandas, then convert to cuDF (For - Parquet and other formats this is often decently fast) - -- Linear operations - - - Element-wise operations: ``df.x + df.y``, ``df ** 2`` - - Assignment: ``df['z'] = df.x + df.y`` - - Row-wise selections: ``df[df.x > 0]`` - - Loc: ``df.loc['2001-01-01': '2005-02-02']`` - - Date time/string accessors: ``df.timestamp.dt.dayofweek`` - - ... 
and most similar operations in this category that are already - implemented in cuDF - -- Reductions - - - Like ``sum``, ``mean``, ``max``, ``count``, and so on, on - ``Series`` objects - - Support for reductions on full dataframes - - \ ``std``\ - - Custom reductions with - `dask.dataframe.reduction `__ - -- Groupby aggregations - - - On single columns: ``df.groupby('x').y.max()`` - - With custom aggregations: - - groupby standard deviation - - grouping on multiple columns - - groupby agg for multiple outputs - -- Joins: - - - On full unsorted columns: ``left.merge(right, on='id')`` - (expensive) - - On sorted indexes: - ``left.merge(right, left_index=True, right_index=True)`` (fast) - - On large and small dataframes: ``left.merge(cudf_df, on='id')`` - (fast) - -- Rolling operations -- Converting to and from other forms - - - Dask + Pandas to Dask + cuDF - ``df.map_partitions(cudf.from_pandas)`` - - Dask + cuDF to Dask + Pandas - ``df.map_partitions(lambda df: df.to_pandas())`` - - cuDF to Dask + cuDF: - ``dask.dataframe.from_pandas(df, npartitions=20)`` - - Dask + cuDF to cuDF: ``df.compute()`` - -Additionally, all generic Dask operations, like ``compute``, ``persist``, -``visualize``, and so on, work regardless. - -Developing the API ------------------ - -Above we mention the following: - - and most similar operations in this category that are already - implemented in cuDF - -This is because it is difficult to create a comprehensive list of -operations in the cuDF and Pandas libraries. The API is large enough to -be difficult to track effectively. For any operation that operates -row-wise, like ``fillna`` or ``query``, things will likely, but not -certainly, work. If operations don't work, it is often due to a slight -inconsistency between Pandas and cuDF that is generally easy to fix. We -encourage users to look at the `cuDF issue -tracker `__ to see if their -issue has already been reported and, if not, `raise a new -issue `__.
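The conversion recipes above all follow one pattern: run a conversion function once per partition. For readers without a GPU, that per-partition model can be sketched in pure Python (a hypothetical stand-in for Dask's lazy `map_partitions`, not the real implementation):

```python
# Minimal stand-in for Dask's per-partition model: a "dataframe" is a
# list of partitions, and map_partitions applies a function to each one.
# Illustration only -- real Dask builds a lazy task graph instead.

def map_partitions(partitions, func):
    """Apply `func` to every partition, returning the new partitions."""
    return [func(part) for part in partitions]

def compute(partitions):
    """Concatenate partitions into one result (like Dask's `.compute()`)."""
    out = []
    for part in partitions:
        out.extend(part)
    return out

# Two partitions of plain Python values standing in for Pandas/cuDF frames.
dask_like = [[1, 2, 3], [4, 5]]

# "Convert" each partition; for Pandas -> cuDF this role is played by
# cudf.from_pandas in the list above.
converted = map_partitions(dask_like, lambda part: [x * 10 for x in part])

print(converted)           # [[10, 20, 30], [40, 50]]
print(compute(converted))  # [10, 20, 30, 40, 50]
```

The same shape explains why `df.compute()` converts Dask + cuDF back to a single cuDF object: it is the concatenation step at the end.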
- -Navigating the API ------------------ - -This project reuses the `Dask -DataFrame `__ project, -which was originally designed for Pandas, with the newer library cuDF. -Because we use the same Dask classes for both projects, there are often -methods that are implemented for Pandas but not yet for cuDF. As a -result, browsing the full Dask DataFrame API can be misleading, and often -leads to frustration when operations that are advertised in the -Dask API do not work as expected with cuDF. We apologize for this in -advance. diff --git a/docs/cudf/source/basics/groupby.rst b/docs/cudf/source/basics/groupby.rst deleted file mode 100644 index f74853769f6..00000000000 --- a/docs/cudf/source/basics/groupby.rst +++ /dev/null @@ -1,274 +0,0 @@ -.. _basics.groupby: - -GroupBy ======= - -cuDF supports a small (but important) subset of Pandas' `groupby -API `__. - -Summary of supported operations ------------------------------- - -1. Grouping by one or more columns -2. Basic aggregations such as "sum", "mean", etc. -3. Quantile aggregation -4. A "collect" or ``list`` aggregation for collecting values in a group - into lists -5. Automatic exclusion of columns with unsupported dtypes ("nuisance" - columns) when aggregating -6. Iterating over the groups of a GroupBy object -7. ``GroupBy.groups`` API that returns a mapping of group keys to row - labels -8. ``GroupBy.apply`` API for performing arbitrary operations on each - group. Note that this has very limited functionality compared to the - equivalent Pandas function. See the section on - `apply <#groupby-apply>`__ for more details. -9. ``GroupBy.pipe`` similar to - `Pandas `__. - -Grouping -------- - -A GroupBy object is created by grouping the values of a ``Series`` or -``DataFrame`` by one or more columns: - -..
code:: python - - import cudf - - >>> df = cudf.DataFrame({'a': [1, 1, 1, 2, 2], 'b': [1, 1, 2, 2, 3], 'c': [1, 2, 3, 4, 5]}) - >>> df - >>> gb1 = df.groupby('a') # grouping by a single column - >>> gb2 = df.groupby(['a', 'b']) # grouping by multiple columns - >>> gb3 = df.groupby(cudf.Series(['a', 'a', 'b', 'b', 'b'])) # grouping by an external column - -.. warning:: - - cuDF uses `sort=False` by default to achieve better performance, which provides no guarantee of the group order in the output. This deviates from Pandas' default behavior. - - For example: - - .. code-block:: python - - >>> df = cudf.DataFrame({'a' : [2, 2, 1], 'b' : [42, 21, 11]}) - >>> df.groupby('a').sum() - b - a - 2 63 - 1 11 - >>> df.to_pandas().groupby('a').sum() - b - a - 1 11 - 2 63 - - Setting `sort=True` will produce Pandas-like output, but with some performance penalty: - - .. code-block:: python - - >>> df.groupby('a', sort=True).sum() - b - a - 1 11 - 2 63 - -Grouping by index levels ~~~~~~~~~~~~~~~~~~~~~~~~ - -You can also group by one or more levels of a MultiIndex: - -.. code:: python - - >>> df = cudf.DataFrame( - ... {'a': [1, 1, 1, 2, 2], 'b': [1, 1, 2, 2, 3], 'c': [1, 2, 3, 4, 5]} - ... ).set_index(['a', 'b']) - ... - >>> df.groupby(level='a') - -The ``Grouper`` object ~~~~~~~~~~~~~~~~~~~~~~ - -A ``Grouper`` can be used to disambiguate between columns and levels -when they have the same name: - -.. code:: python - - >>> df - b c - b - 1 1 1 - 1 1 2 - 1 2 3 - 2 2 4 - 2 3 5 - >>> df.groupby('b', level='b') # ValueError: Cannot specify both by and level - >>> df.groupby([cudf.Grouper(key='b'), cudf.Grouper(level='b')]) # OK - -Aggregation ----------- - -Aggregations on groups are supported via the ``agg`` method: - -..
code:: python - - >>> df - a b c - 0 1 1 1 - 1 1 1 2 - 2 1 2 3 - 3 2 2 4 - 4 2 3 5 - >>> df.groupby('a').agg('sum') - b c - a - 1 4 6 - 2 5 9 - >>> df.groupby('a').agg({'b': ['sum', 'min'], 'c': 'mean'}) - b c - sum min mean - a - 1 4 1 2.0 - 2 5 2 4.5 - >>> df.groupby("a").corr(method="pearson") - b c - a - 1 b 1.000000 0.866025 - c 0.866025 1.000000 - 2 b 1.000000 1.000000 - c 1.000000 1.000000 - -The following table summarizes the available aggregations and the types -that support them: - -.. rst-class:: special-table -.. table:: - - +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ - | Aggregations / dtypes | Numeric | Datetime | String | Categorical | List | Struct | Interval | Decimal | - +====================================+===========+============+==========+===============+========+==========+============+===========+ - | count | ✅ | ✅ | ✅ | ✅ | | | | ✅ | - +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ - | size | ✅ | ✅ | ✅ | ✅ | | | | ✅ | - +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ - | sum | ✅ | ✅ | | | | | | ✅ | - +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ - | idxmin | ✅ | ✅ | | | | | | ✅ | - +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ - | idxmax | ✅ | ✅ | | | | | | ✅ | - +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ - | min | ✅ | ✅ | ✅ | | | | | ✅ | - +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ - | max | ✅ | ✅ | ✅ | | | | | ✅ | 
- +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ - | mean | ✅ | ✅ | | | | | | | - +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ - | var | ✅ | ✅ | | | | | | | - +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ - | std | ✅ | ✅ | | | | | | | - +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ - | quantile | ✅ | ✅ | | | | | | | - +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ - | median | ✅ | ✅ | | | | | | | - +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ - | nunique | ✅ | ✅ | ✅ | ✅ | | | | ✅ | - +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ - | nth | ✅ | ✅ | ✅ | | | | | ✅ | - +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ - | collect | ✅ | ✅ | ✅ | | ✅ | | | ✅ | - +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ - | unique | ✅ | ✅ | ✅ | ✅ | | | | | - +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ - | corr | ✅ | | | | | | | ✅ | - +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ - | cov | ✅ | | | | | | | ✅ | - 
+------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ - -GroupBy apply ------------- - -To apply a function to each group, use the ``GroupBy.apply()`` method: - -.. code:: python - - >>> df - a b c - 0 1 1 1 - 1 1 1 2 - 2 1 2 3 - 3 2 2 4 - 4 2 3 5 - >>> df.groupby('a').apply(lambda x: x.max() - x.min()) - a b c - a - 0 0 1 2 - 1 0 1 1 - -Limitations ~~~~~~~~~~~ - -- ``apply`` works by applying the provided function to each group - sequentially, and concatenating the results together. **This can be - very slow**, especially for a large number of small groups. For a - small number of large groups, it can give acceptable performance. - -- The results may not always match Pandas exactly. For example, cuDF - may return a ``DataFrame`` containing a single column where Pandas - returns a ``Series``. Some post-processing may be required to match - Pandas behavior. - -- cuDF does not support some of the exceptional cases that Pandas - supports with ``apply``, such as calling |describe|_ inside the - callable. - - .. |describe| replace:: ``describe`` - .. _describe: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#flexible-apply - - -Transform --------- - -The ``.transform()`` method aggregates per group, and broadcasts the -result to the group size, resulting in a Series/DataFrame that is of -the same size as the input Series/DataFrame. - -.. code:: python - - >>> import cudf - >>> df = cudf.DataFrame({'a': [2, 1, 1, 2, 2], 'b': [1, 2, 3, 4, 5]}) - >>> df.groupby('a').transform('max') - b - 0 5 - 1 3 - 2 3 - 3 5 - 4 5 - - -Rolling window calculations --------------------------- - -Use the ``GroupBy.rolling()`` method to perform rolling window -calculations on each group: - -.. code:: python - - >>> df - a b c - 0 1 1 1 - 1 1 1 2 - 2 1 2 3 - 3 2 2 4 - 4 2 3 5 - -Rolling window sum on each group with a window size of 2: - -..
code:: python - - >>> df.groupby('a').rolling(2).sum() - a b c - a - 1 0 - 1 2 2 3 - 2 2 3 5 - 2 3 - 4 4 5 9 diff --git a/docs/cudf/source/basics/index.rst b/docs/cudf/source/basics/index.rst deleted file mode 100644 index a29866d7e32..00000000000 --- a/docs/cudf/source/basics/index.rst +++ /dev/null @@ -1,15 +0,0 @@ -====== -Basics -====== - - -.. toctree:: - :maxdepth: 2 - - basics - io.rst - groupby.rst - PandasCompat.rst - dask-cudf.rst - internals.rst - \ No newline at end of file diff --git a/docs/cudf/source/basics/internals.rst b/docs/cudf/source/basics/internals.rst deleted file mode 100644 index 96ef40d51e6..00000000000 --- a/docs/cudf/source/basics/internals.rst +++ /dev/null @@ -1,216 +0,0 @@ -cuDF internals -============== - -The cuDF API closely matches that of the -`Pandas `__ library. Thus, we have the types -``cudf.Series``, ``cudf.DataFrame`` and ``cudf.Index`` which look and -feel very much like their Pandas counterparts. - -Under the hood, however, cuDF uses data structures very different from -Pandas. In this document, we describe these internal data structures. - -Column ------- - -Columns are cuDF's core data structure and they are modeled after the -`Apache Arrow Columnar -Format `__. - -A column represents a sequence of values, any number of which may be -"null". Columns are specialized based on the type of data they contain. -Thus we have ``NumericalColumn``, ``StringColumn``, ``DatetimeColumn``, -etc., - -A column is composed of the following: - -- A **data type**, specifying the type of each element. -- A **data buffer** that may store the data for the column elements. - Some column types do not have a data buffer, instead storing data in - the children columns. -- A **mask buffer** whose bits represent the validity (null or not - null) of each element. Columns whose elements are all "valid" may not - have a mask buffer. Mask buffers are padded to 64 bytes. 
-- A tuple of **children** columns, which enable the representation of - complex types, such as columns with non-fixed-width elements like - strings or lists. -- A **size** indicating the number of elements in the column. -- An integer **offset**: a column may represent a "slice" of another - column, in which case this offset represents the first element of the - slice. The size of the column then gives the extent of the slice. A - column that is not a slice has an offset of 0. - -For example, the ``NumericalColumn`` backing a Series with 1000 elements -of type 'int32' and containing nulls is composed of: - -1. A data buffer of size 4000 bytes (sizeof(int32) \* 1000) -2. A mask buffer of size 128 bytes (1000/8 padded to a multiple of 64 - bytes) -3. No children columns - -As another example, the ``StringColumn`` backing the Series -``['do', 'you', 'have', 'any', 'cheese?']`` is composed of: - -1. No data buffer -2. No mask buffer as there are no nulls in the Series -3. Two children columns: - - - A column of UTF-8 characters - ``['d', 'o', 'y', 'o', 'u', 'h', ... '?']`` - - A column of "offsets" to the characters column (in this case, - ``[0, 2, 5, 9, 12, 19]``) - -Buffer ------ - -The data and mask buffers of a column represent data in GPU memory -(a.k.a. *device memory*), and are objects of type -``cudf.core.buffer.Buffer``. - -Buffers can be constructed from array-like objects that live either on -the host (e.g., numpy arrays) or the device (e.g., cupy arrays). Arrays -must be of ``uint8`` dtype or viewed as such. - -When constructing a Buffer from a host object such as a numpy array, new -device memory is allocated: - -.. code:: python - - >>> import numpy as np - >>> from cudf.core.buffer import Buffer - >>> buf = Buffer(np.array([1, 2, 3], dtype='int64').view("uint8")) - >>> print(buf.ptr) # address of new device memory allocation - 140050901762560 - >>> print(buf.size) - 24 - >>> print(buf._owner) - - -cuDF uses the `RMM `__ library for -allocating device memory.
You can read more about device memory -allocation with RMM -`here `__. - -When constructing a Buffer from a device object such as a CuPy array, no -new device memory is allocated. Instead, the Buffer points to the -existing allocation, keeping a reference to the device array: - -.. code:: python - - >>> import cupy as cp - >>> c_ary = cp.asarray([1, 2, 3], dtype='int64') - >>> buf = Buffer(c_ary.view("uint8")) - >>> print(c_ary.data.mem.ptr) - 140050901762560 - >>> print(buf.ptr) - 140050901762560 - >>> print(buf.size) - 24 - >>> print(buf._owner is c_ary) - True - -An uninitialized block of device memory can be allocated with -``Buffer.empty``: - -.. code:: python - - >>> buf = Buffer.empty(10) - >>> print(buf.size) - 10 - >>> print(buf._owner) - - -ColumnAccessor --------------- - -cuDF ``Series``, ``DataFrame`` and ``Index`` are all subclasses of an -internal ``Frame`` class. The underlying data structure of ``Frame`` is -an ordered, dictionary-like object known as ``ColumnAccessor``, which -can be accessed via the ``._data`` attribute: - -.. code:: python - - >>> a = cudf.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']}) - >>> a._data - ColumnAccessor(OrderedColumnDict([('x', ), ('y', )]), multiindex=False, level_names=(None,)) - -ColumnAccessor is an ordered mapping of column labels to columns. In -addition to behaving like an OrderedDict, it supports things like -selecting multiple columns (both by index and label), as well as -hierarchical indexing. - -.. code:: python - - >>> from cudf.core.column_accessor import ColumnAccessor - -The values of a ColumnAccessor are coerced to Columns during -construction: - -.. code:: python - - >>> ca = ColumnAccessor({'x': [1, 2, 3], 'y': ['a', 'b', 'c']}) - >>> ca['x'] - - >>> ca['y'] - - >>> ca.pop('x') - - >>> ca - ColumnAccessor(OrderedColumnDict([('y', )]), multiindex=False, level_names=(None,)) - -Columns can be inserted at a specified location: - -.. 
code:: python - - >>> ca.insert('z', [3, 4, 5], loc=1) - >>> ca - ColumnAccessor(OrderedColumnDict([('x', ), ('z', ), ('y', )]), multiindex=False, level_names=(None,)) - -Selecting columns by index: - -.. code:: python - - >>> ca = ColumnAccessor({'x': [1, 2, 3], 'y': ['a', 'b', 'c'], 'z': [4, 5, 6]}) - >>> ca.select_by_index(1) - ColumnAccessor(OrderedColumnDict([('y', )]), multiindex=False, level_names=(None,)) - >>> ca.select_by_index([0, 1]) - ColumnAccessor(OrderedColumnDict([('x', ), ('y', )]), multiindex=False, level_names=(None,)) - >>> ca.select_by_index(slice(1, 3)) - ColumnAccessor(OrderedColumnDict([('y', ), ('z', )]), multiindex=False, level_names=(None,)) - -Selecting columns by label: - -.. code:: python - - >>> ca.select_by_label(['y', 'z']) - ColumnAccessor(OrderedColumnDict([('y', ), ('z', )]), multiindex=False, level_names=(None,)) - >>> ca.select_by_label(slice('x', 'y')) - ColumnAccessor(OrderedColumnDict([('x', ), ('y', )]), multiindex=False, level_names=(None,)) - -A ColumnAccessor with tuple keys (and constructed with -``multiindex=True``) can be hierarchically indexed: - -.. code:: python - - >>> ca = ColumnAccessor({('a', 'b'): [1, 2, 3], ('a', 'c'): [2, 3, 4], 'b': [4, 5, 6]}, multiindex=True) - >>> ca.select_by_label('a') - ColumnAccessor(OrderedColumnDict([('b', ), ('c', )]), multiindex=False, level_names=(None,)) - >>> ca.select_by_label(('a', 'b')) - ColumnAccessor(OrderedColumnDict([(('a', 'b'), )]), multiindex=False, level_names=(None,)) - -"Wildcard" indexing is also allowed: - -.. code:: python - - >>> ca = ColumnAccessor({('a', 'b'): [1, 2, 3], ('a', 'c'): [2, 3, 4], ('d', 'b'): [4, 5, 6]}, multiindex=True) - >>> ca.select_by_label((slice(None), 'b')) - ColumnAccessor(OrderedColumnDict([(('a', 'b'), ), (('d', 'b'), )]), multiindex=True, level_names=(None, None)) - -Finally, ColumnAccessors can convert to Pandas ``Index`` or -``MultiIndex`` objects: - -.. 
code:: python - - >>> ca.to_pandas_index() - MultiIndex([('a', 'b'), - ('a', 'c'), - ('d', 'b')], - ) diff --git a/docs/cudf/source/basics/io-gds-integration.rst b/docs/cudf/source/basics/io-gds-integration.rst deleted file mode 100644 index ce774453386..00000000000 --- a/docs/cudf/source/basics/io-gds-integration.rst +++ /dev/null @@ -1,42 +0,0 @@ -GPUDirect Storage Integration ============================= - -Many IO APIs can use the GPUDirect Storage (GDS) library to optimize IO operations. -GDS enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU. -GDS also has a compatibility mode that allows the library to fall back to copying through a CPU bounce buffer. -The SDK is available for download `here `_. -GDS is also included in CUDA Toolkit 11.4 and higher. - -Use of GPUDirect Storage in cuDF is enabled by default, but can be disabled through the environment variable ``LIBCUDF_CUFILE_POLICY``. -This variable also controls the GDS compatibility mode. - -There are four valid values for the environment variable: - -- "GDS": Enable GDS use; GDS compatibility mode is *off*. -- "ALWAYS": Enable GDS use; GDS compatibility mode is *on*. -- "KVIKIO": Enable GDS through `KvikIO `_. -- "OFF": Completely disable GDS use. - -If no value is set, behavior will be the same as the "GDS" option. - -This environment variable also affects how cuDF treats GDS errors. -When ``LIBCUDF_CUFILE_POLICY`` is set to "GDS" and a GDS API call fails for any reason, cuDF falls back to the internal implementation with bounce buffers. -When ``LIBCUDF_CUFILE_POLICY`` is set to "ALWAYS" and a GDS API call fails for any reason (unlikely, given that the compatibility mode is on), -cuDF throws an exception to propagate the error to the user.
-When ``LIBCUDF_CUFILE_POLICY`` is set to "KVIKIO" and a KvikIO API call fails for any reason (unlikely, given that KvikIO implements its own compatibility mode), cuDF throws an exception to propagate the error to the user. -For more information about error handling, compatibility mode, and tuning parameters in KvikIO, see: https://github.com/rapidsai/kvikio - -Operations that support the use of GPUDirect Storage: - -- :py:func:`cudf.read_avro` -- :py:func:`cudf.read_parquet` -- :py:func:`cudf.read_orc` -- :py:meth:`cudf.DataFrame.to_csv` -- :py:meth:`cudf.DataFrame.to_parquet` -- :py:meth:`cudf.DataFrame.to_orc` - -Several parameters that can be used to tune the performance of GDS-enabled I/O are exposed through environment variables: - -- ``LIBCUDF_CUFILE_THREAD_COUNT``: Integral value, maximum number of parallel reads/writes per file (default 16); -- ``LIBCUDF_CUFILE_SLICE_SIZE``: Integral value, maximum size of each GDS read/write, in bytes (default 4MB). - Larger I/O operations are split into multiple calls. diff --git a/docs/cudf/source/basics/io-nvcomp-integration.rst b/docs/cudf/source/basics/io-nvcomp-integration.rst deleted file mode 100644 index fc24e0c15f4..00000000000 --- a/docs/cudf/source/basics/io-nvcomp-integration.rst +++ /dev/null @@ -1,27 +0,0 @@ -nvCOMP Integration ============================= - -Some types of compression/decompression can be performed using either the `nvCOMP library `_ or the internal implementation. - -Which implementation is used by default depends on the data format and the compression type. -Behavior can be influenced through the environment variable ``LIBCUDF_NVCOMP_POLICY``. - -There are three valid values for the environment variable: - -- "STABLE": Only enable nvCOMP in places where it has been deemed stable for production use. - "ALWAYS": Enable all available uses of nvCOMP, including new, experimental combinations. - "OFF": Disable nvCOMP use whenever possible and use the internal implementations instead.
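Both ``LIBCUDF_CUFILE_POLICY`` and ``LIBCUDF_NVCOMP_POLICY`` boil down to reading an environment variable against a set of valid values, with a documented default when the variable is unset. A rough pure-Python sketch of that lookup (a hypothetical helper for illustration; libcudf's actual implementation is in C++):

```python
import os

# Valid values and documented defaults for the two policy variables,
# as described in the sections above.
POLICIES = {
    "LIBCUDF_CUFILE_POLICY": ({"GDS", "ALWAYS", "KVIKIO", "OFF"}, "GDS"),
    "LIBCUDF_NVCOMP_POLICY": ({"STABLE", "ALWAYS", "OFF"}, "STABLE"),
}

def resolve_policy(name, environ=None):
    """Return the configured policy, falling back to the documented default."""
    environ = os.environ if environ is None else environ
    valid, default = POLICIES[name]
    value = environ.get(name, default)
    if value not in valid:
        raise ValueError(f"{name} must be one of {sorted(valid)}, got {value!r}")
    return value

# Unset variable -> documented default.
print(resolve_policy("LIBCUDF_NVCOMP_POLICY", environ={}))  # STABLE
# Explicitly set variable -> that value, after validation.
print(resolve_policy("LIBCUDF_CUFILE_POLICY",
                     environ={"LIBCUDF_CUFILE_POLICY": "KVIKIO"}))  # KVIKIO
```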
- -If no value is set, behavior will be the same as the "STABLE" option. - - -.. table:: Current policy for nvCOMP use for different types - :widths: 20 15 15 15 15 15 15 15 15 15 - - +-----------------------+--------+--------+--------+--------+---------+--------+--------+--------+--------+ - | | CSV | Parquet | JSON | ORC | AVRO | - +-----------------------+--------+--------+--------+--------+---------+--------+--------+--------+--------+ - | Compression Type | Writer | Reader | Writer | Reader | Writer¹ | Reader | Writer | Reader | Reader | - +=======================+========+========+========+========+=========+========+========+========+========+ - | snappy | ❌ | ❌ | Stable | Stable | ❌ | ❌ | Stable | Stable | ❌ | - +-----------------------+--------+--------+--------+--------+---------+--------+--------+--------+--------+ diff --git a/docs/cudf/source/basics/io.rst b/docs/cudf/source/basics/io.rst deleted file mode 100644 index ee3d997d664..00000000000 --- a/docs/cudf/source/basics/io.rst +++ /dev/null @@ -1,13 +0,0 @@ -~~~~~~~~~~~~~~ -Input / Output -~~~~~~~~~~~~~~ - -This page contains Input / Output related APIs in cuDF. - -.. toctree:: - :maxdepth: 2 - :caption: Contents: - - io-supported-types.rst - io-gds-integration.rst - io-nvcomp-integration.rst \ No newline at end of file diff --git a/docs/cudf/source/index.rst b/docs/cudf/source/index.rst index 90b287bd1b6..2c1df4a0c12 100644 --- a/docs/cudf/source/index.rst +++ b/docs/cudf/source/index.rst @@ -14,7 +14,6 @@ the details of CUDA programming. 
:caption: Contents: user_guide/index - basics/index api_docs/index diff --git a/docs/cudf/source/user_guide/10min.ipynb b/docs/cudf/source/user_guide/10min.ipynb index 9bb95406e8a..080fce3c55c 100644 --- a/docs/cudf/source/user_guide/10min.ipynb +++ b/docs/cudf/source/user_guide/10min.ipynb @@ -2,6 +2,7 @@ "cells": [ { "cell_type": "markdown", + "id": "e9357872", "metadata": {}, "source": [ "10 Minutes to cuDF and Dask-cuDF\n", @@ -26,6 +27,7 @@ { "cell_type": "code", "execution_count": 1, + "id": "92eed4cb", "metadata": {}, "outputs": [], "source": [ @@ -45,6 +47,7 @@ }, { "cell_type": "markdown", + "id": "ed6c6047", "metadata": {}, "source": [ "Object Creation\n", @@ -53,6 +56,7 @@ }, { "cell_type": "markdown", + "id": "aeedd961", "metadata": {}, "source": [ "Creating a `cudf.Series` and `dask_cudf.Series`." @@ -61,6 +65,7 @@ { "cell_type": "code", "execution_count": 2, + "id": "cf8b08e5", "metadata": {}, "outputs": [ { @@ -87,6 +92,7 @@ { "cell_type": "code", "execution_count": 3, + "id": "083a5898", "metadata": {}, "outputs": [ { @@ -112,6 +118,7 @@ }, { "cell_type": "markdown", + "id": "6346e1b1", "metadata": {}, "source": [ "Creating a `cudf.DataFrame` and a `dask_cudf.DataFrame` by specifying values for each column." 
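The notebook cell above builds a frame by specifying values for each column. cuDF follows the pandas constructor API here, so the same column-dict pattern applies to both; a small sketch using pandas as a stand-in (swap `pd` for `cudf` on a GPU machine — the sample values are illustrative only):

```python
import pandas as pd

# Construct a frame from a dict of column values, as the notebook does with
# cudf.DataFrame; the constructor signature is the same in both libraries.
df = pd.DataFrame({"a": list(range(5)), "b": list(reversed(range(5)))})
```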
@@ -120,6 +127,7 @@ { "cell_type": "code", "execution_count": 4, + "id": "83d1e7f5", "metadata": {}, "outputs": [ { @@ -313,6 +321,7 @@ { "cell_type": "code", "execution_count": 5, + "id": "71b61d62", "metadata": {}, "outputs": [ { @@ -502,6 +511,7 @@ }, { "cell_type": "markdown", + "id": "c7cb5abc", "metadata": {}, "source": [ "Creating a `cudf.DataFrame` from a pandas `Dataframe` and a `dask_cudf.Dataframe` from a `cudf.Dataframe`.\n", @@ -512,6 +522,7 @@ { "cell_type": "code", "execution_count": 6, + "id": "07a62244", "metadata": {}, "outputs": [ { @@ -586,6 +597,7 @@ { "cell_type": "code", "execution_count": 7, + "id": "f5cb0c65", "metadata": {}, "outputs": [ { @@ -658,6 +670,7 @@ }, { "cell_type": "markdown", + "id": "025eac40", "metadata": {}, "source": [ "Viewing Data\n", @@ -666,6 +679,7 @@ }, { "cell_type": "markdown", + "id": "47a567e8", "metadata": {}, "source": [ "Viewing the top rows of a GPU dataframe." @@ -674,6 +688,7 @@ { "cell_type": "code", "execution_count": 8, + "id": "ab8cbdb8", "metadata": {}, "outputs": [ { @@ -737,6 +752,7 @@ { "cell_type": "code", "execution_count": 9, + "id": "2e923d8a", "metadata": {}, "outputs": [ { @@ -799,6 +815,7 @@ }, { "cell_type": "markdown", + "id": "61257b4b", "metadata": {}, "source": [ "Sorting by values." @@ -807,6 +824,7 @@ { "cell_type": "code", "execution_count": 10, + "id": "512770f9", "metadata": {}, "outputs": [ { @@ -996,6 +1014,7 @@ { "cell_type": "code", "execution_count": 11, + "id": "1a13993f", "metadata": {}, "outputs": [ { @@ -1184,6 +1203,7 @@ }, { "cell_type": "markdown", + "id": "19bce4c4", "metadata": {}, "source": [ "Selection\n", @@ -1194,6 +1214,7 @@ }, { "cell_type": "markdown", + "id": "ba55980e", "metadata": {}, "source": [ "Selecting a single column, which initially yields a `cudf.Series` or `dask_cudf.Series`. Calling `compute` results in a `cudf.Series` (equivalent to `df.a`)." 
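The selection described above — a single column yielding a `Series`, with bracket and attribute access equivalent — works identically in pandas, which cuDF mirrors; a sketch with pandas and hypothetical sample data (with Dask-cuDF, `compute()` would additionally be needed to materialize the result):

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2], "b": [3, 4, 5]})

# Bracket and attribute access both yield the same Series for column "a".
s1 = df["a"]
s2 = df.a
```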
@@ -1202,6 +1223,7 @@ { "cell_type": "code", "execution_count": 12, + "id": "885989a6", "metadata": {}, "outputs": [ { @@ -1242,6 +1264,7 @@ { "cell_type": "code", "execution_count": 13, + "id": "14a74255", "metadata": {}, "outputs": [ { @@ -1281,6 +1304,7 @@ }, { "cell_type": "markdown", + "id": "498d79f2", "metadata": {}, "source": [ "## Selection by Label" @@ -1288,6 +1312,7 @@ }, { "cell_type": "markdown", + "id": "4b8b8e13", "metadata": {}, "source": [ "Selecting rows from index 2 to index 5 from columns 'a' and 'b'." @@ -1296,6 +1321,7 @@ { "cell_type": "code", "execution_count": 14, + "id": "d40bc19c", "metadata": {}, "outputs": [ { @@ -1368,6 +1394,7 @@ { "cell_type": "code", "execution_count": 15, + "id": "7688535b", "metadata": {}, "outputs": [ { @@ -1439,6 +1466,7 @@ }, { "cell_type": "markdown", + "id": "8a64ce7a", "metadata": {}, "source": [ "## Selection by Position" @@ -1446,6 +1474,7 @@ }, { "cell_type": "markdown", + "id": "dfba2bb2", "metadata": {}, "source": [ "Selecting via integers and integer slices, like numpy/pandas. Note that this functionality is not available for Dask-cuDF DataFrames." @@ -1454,6 +1483,7 @@ { "cell_type": "code", "execution_count": 16, + "id": "fb8d6d43", "metadata": {}, "outputs": [ { @@ -1477,6 +1507,7 @@ { "cell_type": "code", "execution_count": 17, + "id": "263231da", "metadata": {}, "outputs": [ { @@ -1542,6 +1573,7 @@ }, { "cell_type": "markdown", + "id": "2223b089", "metadata": {}, "source": [ "You can also select elements of a `DataFrame` or `Series` with direct index access." 
@@ -1550,6 +1582,7 @@ { "cell_type": "code", "execution_count": 18, + "id": "13f6158b", "metadata": {}, "outputs": [ { @@ -1613,6 +1646,7 @@ { "cell_type": "code", "execution_count": 19, + "id": "3cf4aa26", "metadata": {}, "outputs": [ { @@ -1634,6 +1668,7 @@ }, { "cell_type": "markdown", + "id": "ff633b2d", "metadata": {}, "source": [ "## Boolean Indexing" @@ -1641,6 +1676,7 @@ }, { "cell_type": "markdown", + "id": "bbdef48f", "metadata": {}, "source": [ "Selecting rows in a `DataFrame` or `Series` by direct Boolean indexing." @@ -1649,6 +1685,7 @@ { "cell_type": "code", "execution_count": 20, + "id": "becb916f", "metadata": {}, "outputs": [ { @@ -1726,6 +1763,7 @@ { "cell_type": "code", "execution_count": 21, + "id": "b9475c43", "metadata": {}, "outputs": [ { @@ -1802,6 +1840,7 @@ }, { "cell_type": "markdown", + "id": "ecf982f5", "metadata": {}, "source": [ "Selecting values from a `DataFrame` where a Boolean condition is met, via the `query` API." @@ -1810,6 +1849,7 @@ { "cell_type": "code", "execution_count": 22, + "id": "fc2fc9f9", "metadata": {}, "outputs": [ { @@ -1866,6 +1906,7 @@ { "cell_type": "code", "execution_count": 23, + "id": "1a05a07f", "metadata": {}, "outputs": [ { @@ -1921,6 +1962,7 @@ }, { "cell_type": "markdown", + "id": "7f8955a0", "metadata": {}, "source": [ "You can also pass local variables to Dask-cuDF queries, via the `local_dict` keyword. With standard cuDF, you may either use the `local_dict` keyword or directly pass the variable via the `@` keyword. Supported logical operators include `>`, `<`, `>=`, `<=`, `==`, and `!=`." @@ -1929,6 +1971,7 @@ { "cell_type": "code", "execution_count": 24, + "id": "49485a4b", "metadata": {}, "outputs": [ { @@ -1986,6 +2029,7 @@ { "cell_type": "code", "execution_count": 25, + "id": "0f3a9116", "metadata": {}, "outputs": [ { @@ -2042,6 +2086,7 @@ }, { "cell_type": "markdown", + "id": "c355af07", "metadata": {}, "source": [ "Using the `isin` method for filtering." 
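The two filtering idioms just described — passing a local variable to `query` via `@`, and membership filtering with `isin` — can be sketched as follows. pandas is used as a stand-in for cuDF's matching API, and the data and `threshold` variable are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2, 3, 4]})

threshold = 2
above = df.query("a > @threshold")   # local variable referenced via @
subset = df[df["a"].isin([0, 4])]    # keep only rows whose value is in the list
```

With Dask-cuDF, the `@` form is not available; the equivalent is `query("a > @val", local_dict={"val": threshold})`.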
@@ -2050,6 +2095,7 @@ { "cell_type": "code", "execution_count": 26, + "id": "f44a5a57", "metadata": {}, "outputs": [ { @@ -2112,6 +2158,7 @@ }, { "cell_type": "markdown", + "id": "79a50beb", "metadata": {}, "source": [ "## MultiIndex" @@ -2119,6 +2166,7 @@ }, { "cell_type": "markdown", + "id": "14e70234", "metadata": {}, "source": [ "cuDF supports hierarchical indexing of DataFrames using MultiIndex. Grouping hierarchically (see `Grouping` below) automatically produces a DataFrame with a MultiIndex." @@ -2127,6 +2175,7 @@ { "cell_type": "code", "execution_count": 27, + "id": "882973ed", "metadata": {}, "outputs": [ { @@ -2153,6 +2202,7 @@ }, { "cell_type": "markdown", + "id": "c10971cc", "metadata": {}, "source": [ "This index can back either axis of a DataFrame." @@ -2161,6 +2211,7 @@ { "cell_type": "code", "execution_count": 28, + "id": "5417aeb9", "metadata": {}, "outputs": [ { @@ -2238,6 +2289,7 @@ { "cell_type": "code", "execution_count": 29, + "id": "4d6fb4ff", "metadata": {}, "outputs": [ { @@ -2311,6 +2363,7 @@ }, { "cell_type": "markdown", + "id": "63dc11d8", "metadata": {}, "source": [ "Accessing values of a DataFrame with a MultiIndex. Note that slicing is not yet supported." @@ -2319,6 +2372,7 @@ { "cell_type": "code", "execution_count": 30, + "id": "3644920c", "metadata": {}, "outputs": [ { @@ -2340,6 +2394,7 @@ }, { "cell_type": "markdown", + "id": "697a9a36", "metadata": {}, "source": [ "Missing Data\n", @@ -2348,6 +2403,7 @@ }, { "cell_type": "markdown", + "id": "86655274", "metadata": {}, "source": [ "Missing data can be replaced by using the `fillna` method." 
@@ -2356,6 +2412,7 @@ { "cell_type": "code", "execution_count": 31, + "id": "28b06c52", "metadata": {}, "outputs": [ { @@ -2381,6 +2438,7 @@ { "cell_type": "code", "execution_count": 32, + "id": "7fb6a126", "metadata": {}, "outputs": [ { @@ -2405,6 +2463,7 @@ }, { "cell_type": "markdown", + "id": "7a0b732f", "metadata": {}, "source": [ "Operations\n", @@ -2413,6 +2472,7 @@ }, { "cell_type": "markdown", + "id": "1e8b0464", "metadata": {}, "source": [ "## Stats" @@ -2420,6 +2480,7 @@ }, { "cell_type": "markdown", + "id": "7523512b", "metadata": {}, "source": [ "Calculating descriptive statistics for a `Series`." @@ -2428,6 +2489,7 @@ { "cell_type": "code", "execution_count": 33, + "id": "f7cb604e", "metadata": {}, "outputs": [ { @@ -2448,6 +2510,7 @@ { "cell_type": "code", "execution_count": 34, + "id": "b8957a5f", "metadata": {}, "outputs": [ { @@ -2467,6 +2530,7 @@ }, { "cell_type": "markdown", + "id": "71fa928a", "metadata": {}, "source": [ "## Applymap" @@ -2474,6 +2538,7 @@ }, { "cell_type": "markdown", + "id": "d98d6f7b", "metadata": {}, "source": [ "Applying functions to a `Series`. Note that applying user defined functions directly with Dask-cuDF is not yet implemented. For now, you can use [map_partitions](http://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.map_partitions.html) to apply a function to each partition of the distributed dataframe." @@ -2482,6 +2547,7 @@ { "cell_type": "code", "execution_count": 35, + "id": "5e627811", "metadata": {}, "outputs": [ { @@ -2533,6 +2599,7 @@ { "cell_type": "code", "execution_count": 36, + "id": "96cf628e", "metadata": {}, "outputs": [ { @@ -2572,6 +2639,7 @@ }, { "cell_type": "markdown", + "id": "cd69c00a", "metadata": {}, "source": [ "## Histogramming" @@ -2579,6 +2647,7 @@ }, { "cell_type": "markdown", + "id": "39982866", "metadata": {}, "source": [ "Counting the number of occurrences of each unique value of variable." 
@@ -2587,6 +2656,7 @@ { "cell_type": "code", "execution_count": 37, + "id": "62808675", "metadata": {}, "outputs": [ { @@ -2627,6 +2697,7 @@ { "cell_type": "code", "execution_count": 38, + "id": "5b2a42ce", "metadata": {}, "outputs": [ { @@ -2666,6 +2737,7 @@ }, { "cell_type": "markdown", + "id": "2d7e62e4", "metadata": {}, "source": [ "## String Methods" @@ -2673,6 +2745,7 @@ }, { "cell_type": "markdown", + "id": "4e704eca", "metadata": {}, "source": [ "Like pandas, cuDF provides string processing methods in the `str` attribute of `Series`. Full documentation of string methods is a work in progress. Please see the cuDF API documentation for more information." @@ -2681,6 +2754,7 @@ { "cell_type": "code", "execution_count": 39, + "id": "c73e70bb", "metadata": {}, "outputs": [ { @@ -2711,6 +2785,7 @@ { "cell_type": "code", "execution_count": 40, + "id": "697c1c94", "metadata": {}, "outputs": [ { @@ -2740,6 +2815,7 @@ }, { "cell_type": "markdown", + "id": "dfc1371e", "metadata": {}, "source": [ "## Concat" @@ -2747,6 +2823,7 @@ }, { "cell_type": "markdown", + "id": "f6fb9b53", "metadata": {}, "source": [ "Concatenating `Series` and `DataFrames` row-wise." @@ -2755,6 +2832,7 @@ { "cell_type": "code", "execution_count": 41, + "id": "60538bbd", "metadata": {}, "outputs": [ { @@ -2786,6 +2864,7 @@ { "cell_type": "code", "execution_count": 42, + "id": "17953847", "metadata": {}, "outputs": [ { @@ -2816,6 +2895,7 @@ }, { "cell_type": "markdown", + "id": "27f0d621", "metadata": {}, "source": [ "## Join" @@ -2823,6 +2903,7 @@ }, { "cell_type": "markdown", + "id": "fd35f1a7", "metadata": {}, "source": [ "Performing SQL style merges. Note that the dataframe order is not maintained, but may be restored post-merge by sorting by the index." 
@@ -2831,6 +2912,7 @@ { "cell_type": "code", "execution_count": 43, + "id": "52ada00a", "metadata": {}, "outputs": [ { @@ -2924,6 +3006,7 @@ { "cell_type": "code", "execution_count": 44, + "id": "409fcf92", "metadata": {}, "outputs": [ { @@ -3011,6 +3094,7 @@ }, { "cell_type": "markdown", + "id": "d9dcb86b", "metadata": {}, "source": [ "## Append" @@ -3018,6 +3102,7 @@ }, { "cell_type": "markdown", + "id": "1f896819", "metadata": {}, "source": [ "Appending values from another `Series` or array-like object." @@ -3026,6 +3111,7 @@ { "cell_type": "code", "execution_count": 45, + "id": "9976c1ce", "metadata": {}, "outputs": [ { @@ -3064,6 +3150,7 @@ { "cell_type": "code", "execution_count": 46, + "id": "fe5c54ab", "metadata": {}, "outputs": [ { @@ -3093,6 +3180,7 @@ }, { "cell_type": "markdown", + "id": "9fa10ef3", "metadata": {}, "source": [ "## Grouping" @@ -3100,6 +3188,7 @@ }, { "cell_type": "markdown", + "id": "8a6e41f5", "metadata": {}, "source": [ "Like pandas, cuDF and Dask-cuDF support the Split-Apply-Combine groupby paradigm." @@ -3108,6 +3197,7 @@ { "cell_type": "code", "execution_count": 47, + "id": "2a8cafa7", "metadata": {}, "outputs": [], "source": [ @@ -3119,6 +3209,7 @@ }, { "cell_type": "markdown", + "id": "0179d60c", "metadata": {}, "source": [ "Grouping and then applying the `sum` function to the grouped data." @@ -3127,6 +3218,7 @@ { "cell_type": "code", "execution_count": 48, + "id": "7c56d186", "metadata": {}, "outputs": [ { @@ -3201,6 +3293,7 @@ { "cell_type": "code", "execution_count": 49, + "id": "f8823b30", "metadata": {}, "outputs": [ { @@ -3274,6 +3367,7 @@ }, { "cell_type": "markdown", + "id": "a84cb883", "metadata": {}, "source": [ "Grouping hierarchically then applying the `sum` function to grouped data." 
@@ -3282,6 +3376,7 @@ { "cell_type": "code", "execution_count": 50, + "id": "2184e3ad", "metadata": {}, "outputs": [ { @@ -3372,6 +3467,7 @@ { "cell_type": "code", "execution_count": 51, + "id": "4ec311c1", "metadata": {}, "outputs": [ { @@ -3461,6 +3557,7 @@ }, { "cell_type": "markdown", + "id": "dedfeb1b", "metadata": {}, "source": [ "Grouping and applying statistical functions to specific columns, using `agg`." @@ -3469,6 +3566,7 @@ { "cell_type": "code", "execution_count": 52, + "id": "2563d8b2", "metadata": {}, "outputs": [ { @@ -3539,6 +3637,7 @@ { "cell_type": "code", "execution_count": 53, + "id": "22c77e75", "metadata": {}, "outputs": [ { @@ -3608,6 +3707,7 @@ }, { "cell_type": "markdown", + "id": "6d074822", "metadata": {}, "source": [ "## Transpose" @@ -3615,6 +3715,7 @@ }, { "cell_type": "markdown", + "id": "16c0f0a8", "metadata": {}, "source": [ "Transposing a dataframe, using either the `transpose` method or `T` property. Currently, all columns must have the same type. Transposing is not currently implemented in Dask-cuDF." @@ -3623,6 +3724,7 @@ { "cell_type": "code", "execution_count": 54, + "id": "e265861e", "metadata": {}, "outputs": [ { @@ -3690,6 +3792,7 @@ { "cell_type": "code", "execution_count": 55, + "id": "1fe9b972", "metadata": {}, "outputs": [ { @@ -3752,14 +3855,16 @@ }, { "cell_type": "markdown", + "id": "9ce02827", "metadata": {}, "source": [ "Time Series\n", - "------------\n" + "------------" ] }, { "cell_type": "markdown", + "id": "fec907ff", "metadata": {}, "source": [ "`DataFrames` supports `datetime` typed columns, which allow users to interact with and filter data based on specific timestamps." 
@@ -3768,6 +3873,7 @@ { "cell_type": "code", "execution_count": 56, + "id": "7a425d3f", "metadata": {}, "outputs": [ { @@ -3847,6 +3953,7 @@ { "cell_type": "code", "execution_count": 57, + "id": "87f0e56e", "metadata": {}, "outputs": [ { @@ -3919,6 +4026,7 @@ }, { "cell_type": "markdown", + "id": "0d0e541c", "metadata": {}, "source": [ "Categoricals\n", @@ -3927,6 +4035,7 @@ }, { "cell_type": "markdown", + "id": "a36f9543", "metadata": {}, "source": [ "`DataFrames` support categorical columns." @@ -3935,6 +4044,7 @@ { "cell_type": "code", "execution_count": 58, + "id": "05bd8be8", "metadata": {}, "outputs": [ { @@ -4021,6 +4131,7 @@ { "cell_type": "code", "execution_count": 59, + "id": "676b4963", "metadata": {}, "outputs": [ { @@ -4105,6 +4216,7 @@ }, { "cell_type": "markdown", + "id": "e24f2e7b", "metadata": {}, "source": [ "Accessing the categories of a column. Note that this is currently not supported in Dask-cuDF." @@ -4113,6 +4225,7 @@ { "cell_type": "code", "execution_count": 60, + "id": "06310c36", "metadata": {}, "outputs": [ { @@ -4132,6 +4245,7 @@ }, { "cell_type": "markdown", + "id": "4eb6f858", "metadata": {}, "source": [ "Accessing the underlying code values of each categorical observation." @@ -4140,6 +4254,7 @@ { "cell_type": "code", "execution_count": 61, + "id": "0f6db260", "metadata": {}, "outputs": [ { @@ -4166,6 +4281,7 @@ { "cell_type": "code", "execution_count": 62, + "id": "b87c4375", "metadata": {}, "outputs": [ { @@ -4191,6 +4307,7 @@ }, { "cell_type": "markdown", + "id": "3f816916", "metadata": {}, "source": [ "Converting Data Representation\n", @@ -4199,6 +4316,7 @@ }, { "cell_type": "markdown", + "id": "64a17f6d", "metadata": {}, "source": [ "## Pandas" @@ -4206,6 +4324,7 @@ }, { "cell_type": "markdown", + "id": "3acdcacc", "metadata": {}, "source": [ "Converting a cuDF and Dask-cuDF `DataFrame` to a pandas `DataFrame`." 
@@ -4214,6 +4333,7 @@ { "cell_type": "code", "execution_count": 63, + "id": "d1fed919", "metadata": {}, "outputs": [ { @@ -4310,6 +4430,7 @@ { "cell_type": "code", "execution_count": 64, + "id": "567c7363", "metadata": {}, "outputs": [ { @@ -4405,6 +4526,7 @@ }, { "cell_type": "markdown", + "id": "c2121453", "metadata": {}, "source": [ "## Numpy" @@ -4412,6 +4534,7 @@ }, { "cell_type": "markdown", + "id": "a9faa2c5", "metadata": {}, "source": [ "Converting a cuDF or Dask-cuDF `DataFrame` to a numpy `ndarray`." @@ -4420,6 +4543,7 @@ { "cell_type": "code", "execution_count": 65, + "id": "5490d226", "metadata": {}, "outputs": [ { @@ -4459,6 +4583,7 @@ { "cell_type": "code", "execution_count": 66, + "id": "b77ac8ae", "metadata": {}, "outputs": [ { @@ -4497,6 +4622,7 @@ }, { "cell_type": "markdown", + "id": "1d24d30f", "metadata": {}, "source": [ "Converting a cuDF or Dask-cuDF `Series` to a numpy `ndarray`." @@ -4505,6 +4631,7 @@ { "cell_type": "code", "execution_count": 67, + "id": "f71a0ba3", "metadata": {}, "outputs": [ { @@ -4526,6 +4653,7 @@ { "cell_type": "code", "execution_count": 68, + "id": "a45a74b5", "metadata": {}, "outputs": [ { @@ -4546,6 +4674,7 @@ }, { "cell_type": "markdown", + "id": "0d78a4d2", "metadata": {}, "source": [ "## Arrow" @@ -4553,6 +4682,7 @@ }, { "cell_type": "markdown", + "id": "7e35b829", "metadata": {}, "source": [ "Converting a cuDF or Dask-cuDF `DataFrame` to a PyArrow `Table`." 
@@ -4561,6 +4691,7 @@ { "cell_type": "code", "execution_count": 69, + "id": "bb9e9a2a", "metadata": {}, "outputs": [ { @@ -4592,6 +4723,7 @@ { "cell_type": "code", "execution_count": 70, + "id": "4d020de7", "metadata": {}, "outputs": [ { @@ -4622,14 +4754,16 @@ }, { "cell_type": "markdown", + "id": "ace7b4f9", "metadata": {}, "source": [ "Getting Data In/Out\n", - "------------------------\n" + "------------------------" ] }, { "cell_type": "markdown", + "id": "161abb12", "metadata": {}, "source": [ "## CSV" @@ -4637,6 +4771,7 @@ }, { "cell_type": "markdown", + "id": "7e5dc381", "metadata": {}, "source": [ "Writing to a CSV file." @@ -4645,6 +4780,7 @@ { "cell_type": "code", "execution_count": 71, + "id": "3a59715f", "metadata": {}, "outputs": [], "source": [ @@ -4657,6 +4793,7 @@ { "cell_type": "code", "execution_count": 72, + "id": "4ebe98ed", "metadata": {}, "outputs": [], "source": [ @@ -4665,6 +4802,7 @@ }, { "cell_type": "markdown", + "id": "0479fc4f", "metadata": {}, "source": [ "Reading from a csv file." @@ -4673,6 +4811,7 @@ { "cell_type": "code", "execution_count": 73, + "id": "1a70e831", "metadata": {}, "outputs": [ { @@ -4905,6 +5044,7 @@ { "cell_type": "code", "execution_count": 74, + "id": "4c3d9ca3", "metadata": {}, "outputs": [ { @@ -5136,6 +5276,7 @@ }, { "cell_type": "markdown", + "id": "3d739c6e", "metadata": {}, "source": [ "Reading all CSV files in a directory into a single `dask_cudf.DataFrame`, using the star wildcard." @@ -5144,6 +5285,7 @@ { "cell_type": "code", "execution_count": 75, + "id": "cb7187d2", "metadata": {}, "outputs": [ { @@ -5555,6 +5697,7 @@ }, { "cell_type": "markdown", + "id": "c0939a1e", "metadata": {}, "source": [ "## Parquet" @@ -5562,6 +5705,7 @@ }, { "cell_type": "markdown", + "id": "14e6a634", "metadata": {}, "source": [ "Writing to parquet files, using the CPU via PyArrow." 
@@ -5570,6 +5714,7 @@ { "cell_type": "code", "execution_count": 76, + "id": "1812346f", "metadata": {}, "outputs": [], "source": [ @@ -5578,6 +5723,7 @@ }, { "cell_type": "markdown", + "id": "093cd0fe", "metadata": {}, "source": [ "Reading parquet files with a GPU-accelerated parquet reader." @@ -5586,6 +5732,7 @@ { "cell_type": "code", "execution_count": 77, + "id": "2354b20b", "metadata": {}, "outputs": [ { @@ -5817,6 +5964,7 @@ }, { "cell_type": "markdown", + "id": "132c3ff2", "metadata": {}, "source": [ "Writing to parquet files from a `dask_cudf.DataFrame` using PyArrow under the hood." @@ -5825,6 +5973,7 @@ { "cell_type": "code", "execution_count": 78, + "id": "c5d7686c", "metadata": {}, "outputs": [ { @@ -5844,6 +5993,7 @@ }, { "cell_type": "markdown", + "id": "0d73d1dd", "metadata": {}, "source": [ "## ORC" @@ -5851,6 +6001,7 @@ }, { "cell_type": "markdown", + "id": "61b5f466", "metadata": {}, "source": [ "Reading ORC files." @@ -5858,16 +6009,17 @@ }, { "cell_type": "code", - "execution_count": 80, + "execution_count": 79, + "id": "93364ff3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "'/home/mmccarty/sandbox/rapids/cudf/python/cudf/cudf/tests/data/orc/TestOrcFile.test1.orc'" + "'/home/ashwin/workspace/rapids/cudf/python/cudf/cudf/tests/data/orc/TestOrcFile.test1.orc'" ] }, - "execution_count": 80, + "execution_count": 79, "metadata": {}, "output_type": "execute_result" } @@ -5883,7 +6035,8 @@ }, { "cell_type": "code", - "execution_count": 81, + "execution_count": 80, + "id": "2b6785c7", "metadata": {}, "outputs": [ { @@ -5974,7 +6127,7 @@ "1 [{'key': 'chani', 'value': {'int1': 5, 'string... 
" ] }, - "execution_count": 81, + "execution_count": 80, "metadata": {}, "output_type": "execute_result" } @@ -5986,6 +6139,7 @@ }, { "cell_type": "markdown", + "id": "238ce6a4", "metadata": {}, "source": [ "Dask Performance Tips\n", @@ -6000,6 +6154,7 @@ }, { "cell_type": "markdown", + "id": "3de9aeca", "metadata": {}, "source": [ "First, we set up a GPU cluster. With our `client` set up, Dask-cuDF computation will be distributed across the GPUs in the cluster." @@ -6007,17 +6162,16 @@ }, { "cell_type": "code", - "execution_count": 82, + "execution_count": 81, + "id": "e4852d48", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "2022-04-21 10:11:07,360 - distributed.diskutils - INFO - Found stale lock file and directory '/home/mmccarty/sandbox/rapids/cudf/docs/cudf/source/user_guide/dask-worker-space/worker-ghcx5g0e', purging\n", - "2022-04-21 10:11:07,360 - distributed.diskutils - INFO - Found stale lock file and directory '/home/mmccarty/sandbox/rapids/cudf/docs/cudf/source/user_guide/dask-worker-space/worker-wh16f0h3', purging\n", - "2022-04-21 10:11:07,360 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n", - "2022-04-21 10:11:07,388 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n" + "2022-04-21 13:26:06,860 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n", + "2022-04-21 13:26:06,904 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n" ] }, { @@ -6027,7 +6181,7 @@ "
\n", "
\n", "

Client

\n", - "

Client-e3492c89-c17c-11ec-813e-fc3497a62adc

\n", + "

Client-20d00fd5-c198-11ec-906c-c8d9d2247354

\n", " \n", "\n", " \n", @@ -6056,7 +6210,7 @@ " \n", "
\n", "

LocalCUDACluster

\n", - "

db2501e1

\n", + "

47648c26

\n", "
\n", " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", " \n", - " \n", - " \n", + " \n", + " \n", " \n", " \n", " \n", " \n", - " \n", - " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", "
\n", @@ -6093,11 +6247,11 @@ "
\n", "
\n", "

Scheduler

\n", - "

Scheduler-6f476508-e52f-49e9-8f1f-6a8641e177bd

\n", + "

Scheduler-f28bff16-cb70-452c-b8af-b9299a8d7b20

\n", " \n", " \n", " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", "
\n", - " Comm: tcp://127.0.0.1:39755\n", + " Comm: tcp://127.0.0.1:33995\n", " \n", " Workers: 2\n", @@ -6139,7 +6293,7 @@ " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", @@ -6193,7 +6347,7 @@ "
\n", - " Comm: tcp://127.0.0.1:33491\n", + " Comm: tcp://127.0.0.1:40479\n", " \n", " Total threads: 1\n", @@ -6147,7 +6301,7 @@ "
\n", - " Dashboard: http://127.0.0.1:34333/status\n", + " Dashboard: http://127.0.0.1:38985/status\n", " \n", " Memory: 62.82 GiB\n", @@ -6155,13 +6309,13 @@ "
\n", - " Nanny: tcp://127.0.0.1:43093\n", + " Nanny: tcp://127.0.0.1:33447\n", "
\n", - " Local directory: /home/mmccarty/sandbox/rapids/cudf/docs/cudf/source/user_guide/dask-worker-space/worker-jsuvfju4\n", + " Local directory: /home/ashwin/workspace/rapids/cudf/docs/cudf/source/user_guide/dask-worker-space/worker-be7zg92w\n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", @@ -6251,10 +6405,10 @@ "" ], "text/plain": [ - "" + "" ] }, - "execution_count": 82, + "execution_count": 81, "metadata": {}, "output_type": "execute_result" } @@ -6272,6 +6426,7 @@ }, { "cell_type": "markdown", + "id": "181e4d10", "metadata": {}, "source": [ "### Persisting Data\n", @@ -6280,7 +6435,8 @@ }, { "cell_type": "code", - "execution_count": 83, + "execution_count": 82, + "id": "d47a1142", "metadata": {}, "outputs": [ { @@ -6356,7 +6512,7 @@ "" ] }, - "execution_count": 83, + "execution_count": 82, "metadata": {}, "output_type": "execute_result" } @@ -6372,45 +6528,37 @@ }, { "cell_type": "code", - "execution_count": 84, + "execution_count": 83, + "id": "c3cb612a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Thu Apr 21 10:11:07 2022 \n", - "+-----------------------------------------------------------------------------+\n", - "| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |\n", - "|-------------------------------+----------------------+----------------------+\n", - "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", - "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", - "| | | MIG M. 
|\n", - "|===============================+======================+======================|\n", - "| 0 NVIDIA RTX A6000 On | 00000000:01:00.0 On | Off |\n", - "| 30% 48C P2 83W / 300W | 2970MiB / 48651MiB | 7% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - "| 1 NVIDIA RTX A6000 On | 00000000:02:00.0 Off | Off |\n", - "| 30% 36C P2 25W / 300W | 265MiB / 48685MiB | 5% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - " \n", - "+-----------------------------------------------------------------------------+\n", - "| Processes: |\n", - "| GPU GI CI PID Type Process name GPU Memory |\n", - "| ID ID Usage |\n", - "|=============================================================================|\n", - "| 0 N/A N/A 2292 G /usr/lib/xorg/Xorg 871MiB |\n", - "| 0 N/A N/A 2441 G /usr/bin/gnome-shell 316MiB |\n", - "| 0 N/A N/A 1240494 G ...AAAAAAAAA= --shared-files 68MiB |\n", - "| 0 N/A N/A 1240525 G ...RendererForSitePerProcess 41MiB |\n", - "| 0 N/A N/A 1243689 C .../envs/cudf_dev/bin/python 593MiB |\n", - "| 0 N/A N/A 1245502 C .../envs/cudf_dev/bin/python 753MiB |\n", - "| 0 N/A N/A 1245751 C .../envs/cudf_dev/bin/python 257MiB |\n", - "| 1 N/A N/A 2292 G /usr/lib/xorg/Xorg 4MiB |\n", - "| 1 N/A N/A 1245748 C .../envs/cudf_dev/bin/python 257MiB |\n", - "+-----------------------------------------------------------------------------+\n" + "Thu Apr 21 13:26:07 2022 \r\n", + "+-----------------------------------------------------------------------------+\r\n", + "| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |\r\n", + "|-------------------------------+----------------------+----------------------+\r\n", + "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\r\n", + "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\r\n", + "| | | MIG M. 
|\r\n", + "|===============================+======================+======================|\r\n", + "| 0 Quadro GV100 Off | 00000000:15:00.0 Off | Off |\r\n", + "| 39% 52C P2 51W / 250W | 1115MiB / 32508MiB | 0% Default |\r\n", + "| | | N/A |\r\n", + "+-------------------------------+----------------------+----------------------+\r\n", + "| 1 Quadro GV100 Off | 00000000:2D:00.0 Off | Off |\r\n", + "| 43% 57C P2 52W / 250W | 306MiB / 32498MiB | 0% Default |\r\n", + "| | | N/A |\r\n", + "+-------------------------------+----------------------+----------------------+\r\n", + " \r\n", + "+-----------------------------------------------------------------------------+\r\n", + "| Processes: |\r\n", + "| GPU GI CI PID Type Process name GPU Memory |\r\n", + "| ID ID Usage |\r\n", + "|=============================================================================|\r\n", + "+-----------------------------------------------------------------------------+\r\n" ] } ], @@ -6420,6 +6568,7 @@ }, { "cell_type": "markdown", + "id": "b98810c4", "metadata": {}, "source": [ "Because Dask is lazy, the computation has not yet occurred. We can see that there are twenty tasks in the task graph and we've used about 800 MB of memory. We can force computation by using `persist`. By forcing execution, the result is now explicitly in memory and our task graph only contains one task per partition (the baseline)." 
@@ -6427,7 +6576,8 @@ }, { "cell_type": "code", - "execution_count": 85, + "execution_count": 84, + "id": "a929577c", "metadata": {}, "outputs": [ { @@ -6503,7 +6653,7 @@ "" ] }, - "execution_count": 85, + "execution_count": 84, "metadata": {}, "output_type": "execute_result" } @@ -6515,45 +6665,37 @@ }, { "cell_type": "code", - "execution_count": 86, + "execution_count": 85, + "id": "8aa7c079", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Thu Apr 21 10:11:08 2022 \n", - "+-----------------------------------------------------------------------------+\n", - "| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |\n", - "|-------------------------------+----------------------+----------------------+\n", - "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", - "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", - "| | | MIG M. |\n", - "|===============================+======================+======================|\n", - "| 0 NVIDIA RTX A6000 On | 00000000:01:00.0 On | Off |\n", - "| 30% 48C P2 84W / 300W | 2970MiB / 48651MiB | 3% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - "| 1 NVIDIA RTX A6000 On | 00000000:02:00.0 Off | Off |\n", - "| 30% 36C P2 37W / 300W | 265MiB / 48685MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - " \n", - "+-----------------------------------------------------------------------------+\n", - "| Processes: |\n", - "| GPU GI CI PID Type Process name GPU Memory |\n", - "| ID ID Usage |\n", - "|=============================================================================|\n", - "| 0 N/A N/A 2292 G /usr/lib/xorg/Xorg 871MiB |\n", - "| 0 N/A N/A 2441 G /usr/bin/gnome-shell 316MiB |\n", - "| 0 N/A N/A 1240494 G ...AAAAAAAAA= --shared-files 68MiB |\n", - "| 0 N/A N/A 1240525 G 
...RendererForSitePerProcess 41MiB |\n", - "| 0 N/A N/A 1243689 C .../envs/cudf_dev/bin/python 593MiB |\n", - "| 0 N/A N/A 1245502 C .../envs/cudf_dev/bin/python 753MiB |\n", - "| 0 N/A N/A 1245751 C .../envs/cudf_dev/bin/python 257MiB |\n", - "| 1 N/A N/A 2292 G /usr/lib/xorg/Xorg 4MiB |\n", - "| 1 N/A N/A 1245748 C .../envs/cudf_dev/bin/python 257MiB |\n", - "+-----------------------------------------------------------------------------+\n" + "Thu Apr 21 13:26:08 2022 \r\n", + "+-----------------------------------------------------------------------------+\r\n", + "| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |\r\n", + "|-------------------------------+----------------------+----------------------+\r\n", + "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\r\n", + "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\r\n", + "| | | MIG M. |\r\n", + "|===============================+======================+======================|\r\n", + "| 0 Quadro GV100 Off | 00000000:15:00.0 Off | Off |\r\n", + "| 39% 52C P2 52W / 250W | 1115MiB / 32508MiB | 3% Default |\r\n", + "| | | N/A |\r\n", + "+-------------------------------+----------------------+----------------------+\r\n", + "| 1 Quadro GV100 Off | 00000000:2D:00.0 Off | Off |\r\n", + "| 43% 57C P2 51W / 250W | 306MiB / 32498MiB | 0% Default |\r\n", + "| | | N/A |\r\n", + "+-------------------------------+----------------------+----------------------+\r\n", + " \r\n", + "+-----------------------------------------------------------------------------+\r\n", + "| Processes: |\r\n", + "| GPU GI CI PID Type Process name GPU Memory |\r\n", + "| ID ID Usage |\r\n", + "|=============================================================================|\r\n", + "+-----------------------------------------------------------------------------+\r\n" ] } ], @@ -6563,6 +6705,7 @@ }, { "cell_type": "markdown", + "id": "ff9e14b6", "metadata": {}, "source": [ "Because we forced 
computation, we now have a larger object in distributed GPU memory." @@ -6570,6 +6713,7 @@ }, { "cell_type": "markdown", + "id": "bb3b3dee", "metadata": {}, "source": [ "### Wait\n", @@ -6580,7 +6724,8 @@ }, { "cell_type": "code", - "execution_count": 87, + "execution_count": 86, + "id": "ef71bf00", "metadata": {}, "outputs": [], "source": [ @@ -6598,6 +6743,7 @@ }, { "cell_type": "markdown", + "id": "e1099ec0", "metadata": {}, "source": [ "This function will do a basic transformation of every column in the dataframe, but the time spent in the function will vary due to the `time.sleep` statement randomly adding 1-60 seconds of time. We'll run this on every partition of our dataframe using `map_partitions`, which adds the task to our task-graph, and store the result. We can then call `persist` to force execution." @@ -6605,7 +6751,8 @@ }, { "cell_type": "code", - "execution_count": 88, + "execution_count": 87, + "id": "700dd799", "metadata": {}, "outputs": [], "source": [ @@ -6615,6 +6762,7 @@ }, { "cell_type": "markdown", + "id": "5eb83a7e", "metadata": {}, "source": [ "However, some partitions will be done **much** sooner than others. If we had downstream processes that should wait for all partitions to be completed, we can enforce that behavior using `wait`." @@ -6622,16 +6770,17 @@ }, { "cell_type": "code", - "execution_count": 89, + "execution_count": 88, + "id": "73bccf94", "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "DoneAndNotDoneFutures(done={, , , , }, not_done=set())" + "DoneAndNotDoneFutures(done={, , , , }, not_done=set())" ] }, - "execution_count": 89, + "execution_count": 88, "metadata": {}, "output_type": "execute_result" } @@ -6642,21 +6791,22 @@ }, { "cell_type": "markdown", + "id": "447301f5", "metadata": {}, "source": [ - "## With `wait`, we can safely proceed on in our workflow." + "With `wait`, we can safely proceed on in our workflow." 
] }, { "cell_type": "code", "execution_count": null, + "id": "7e06fcf4", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { - "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", @@ -6673,21 +6823,8 @@ "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" - }, - "toc": { - "base_numbering": 1, - "nav_menu": {}, - "number_sections": true, - "sideBar": true, - "skip_h1_title": false, - "title_cell": "Table of Contents", - "title_sidebar": "Contents", - "toc_cell": false, - "toc_position": {}, - "toc_section_display": true, - "toc_window_display": false } }, "nbformat": 4, - "nbformat_minor": 4 + "nbformat_minor": 5 } diff --git a/docs/cudf/source/user_guide/PandasCompat.md b/docs/cudf/source/user_guide/PandasCompat.md new file mode 100644 index 00000000000..a33a354e2f8 --- /dev/null +++ b/docs/cudf/source/user_guide/PandasCompat.md @@ -0,0 +1,5 @@ +# Pandas Compatibility Notes + +```{eval-rst} +.. pandas-compat-list:: +``` diff --git a/docs/cudf/source/user_guide/10min-cudf-cupy.ipynb b/docs/cudf/source/user_guide/cupy-interop.ipynb similarity index 87% rename from docs/cudf/source/user_guide/10min-cudf-cupy.ipynb rename to docs/cudf/source/user_guide/cupy-interop.ipynb index 35ca21f380e..9fbac3b2578 100644 --- a/docs/cudf/source/user_guide/10min-cudf-cupy.ipynb +++ b/docs/cudf/source/user_guide/cupy-interop.ipynb @@ -2,9 +2,10 @@ "cells": [ { "cell_type": "markdown", + "id": "8e5e6878", "metadata": {}, "source": [ - "# 10 Minutes to cuDF and CuPy\n", + "# Interoperability between cuDF and CuPy\n", "\n", "This notebook provides introductory examples of how you can use cuDF and CuPy together to take advantage of CuPy array functionality (such as advanced linear algebra operations)." 
] @@ -12,6 +13,7 @@ { "cell_type": "code", "execution_count": 1, + "id": "8b2d45c3", "metadata": {}, "outputs": [], "source": [ @@ -29,6 +31,7 @@ }, { "cell_type": "markdown", + "id": "e7e64b1a", "metadata": {}, "source": [ "### Converting a cuDF DataFrame to a CuPy Array\n", @@ -45,15 +48,16 @@ { "cell_type": "code", "execution_count": 2, + "id": "45c482ab", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "183 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", - "553 µs ± 6.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n", - "546 µs ± 2.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n" + "118 µs ± 77.2 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n", + "360 µs ± 6.04 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n", + "355 µs ± 722 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n" ] } ], @@ -72,6 +76,7 @@ { "cell_type": "code", "execution_count": 3, + "id": "a565effc", "metadata": {}, "outputs": [ { @@ -98,6 +103,7 @@ }, { "cell_type": "markdown", + "id": "0759ab29", "metadata": {}, "source": [ "### Converting a cuDF Series to a CuPy Array" @@ -105,27 +111,29 @@ }, { "cell_type": "markdown", + "id": "4f35ffbd", "metadata": {}, "source": [ "There are also multiple ways to convert a cuDF Series to a CuPy array:\n", "\n", "1. We can pass the Series to `cupy.asarray` as cuDF Series exposes [`__cuda_array_interface__`](https://docs-cupy.chainer.org/en/stable/reference/interoperability.html).\n", "2. We can leverage the dlpack interface `to_dlpack()`. \n", - "3. We can also use `Series.values` \n" + "3. We can also use `Series.values`" ] }, { "cell_type": "code", "execution_count": 4, + "id": "8f97f304", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "76.8 µs ± 636 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", - "198 µs ± 2.72 µs per loop (mean ± std. dev. 
of 7 runs, 10000 loops each)\n", - "181 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" + "54.4 µs ± 66 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n", + "125 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n", + "119 µs ± 805 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n" ] } ], @@ -140,6 +148,7 @@ { "cell_type": "code", "execution_count": 5, + "id": "f96d5676", "metadata": {}, "outputs": [ { @@ -160,6 +169,7 @@ }, { "cell_type": "markdown", + "id": "c36e5b88", "metadata": {}, "source": [ "From here, we can proceed with normal CuPy workflows, such as reshaping the array, getting the diagonal, or calculating the norm." @@ -168,6 +178,7 @@ { "cell_type": "code", "execution_count": 6, + "id": "2a7ae43f", "metadata": {}, "outputs": [ { @@ -195,6 +206,7 @@ { "cell_type": "code", "execution_count": 7, + "id": "b442a30c", "metadata": {}, "outputs": [ { @@ -219,6 +231,7 @@ { "cell_type": "code", "execution_count": 8, + "id": "be7f4d32", "metadata": {}, "outputs": [ { @@ -238,6 +251,7 @@ }, { "cell_type": "markdown", + "id": "b353bded", "metadata": {}, "source": [ "### Converting a CuPy Array to a cuDF DataFrame\n", @@ -256,13 +270,14 @@ { "cell_type": "code", "execution_count": 9, + "id": "8887b253", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "23.9 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" + "14.3 ms ± 33.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], @@ -273,6 +288,7 @@ { "cell_type": "code", "execution_count": 10, + "id": "08ec4ffa", "metadata": {}, "outputs": [ { @@ -475,6 +491,7 @@ }, { "cell_type": "markdown", + "id": "6804d291", "metadata": {}, "source": [ "We can check whether our array is Fortran contiguous by using cupy.isfortran or looking at the [flags](https://docs-cupy.chainer.org/en/stable/reference/generated/cupy.ndarray.html#cupy.ndarray.flags) of the array." 
@@ -483,6 +500,7 @@ { "cell_type": "code", "execution_count": 11, + "id": "65b8bd0d", "metadata": {}, "outputs": [ { @@ -502,6 +520,7 @@ }, { "cell_type": "markdown", + "id": "151982ad", "metadata": {}, "source": [ "In this case, we'll need to convert it before going to a cuDF DataFrame. In the next two cells, we create the DataFrame by leveraging dlpack and the CUDA array interface, respectively." @@ -510,13 +529,14 @@ { "cell_type": "code", "execution_count": 12, + "id": "27b2f563", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "9.15 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" + "6.57 ms ± 9.08 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], @@ -530,13 +550,14 @@ { "cell_type": "code", "execution_count": 13, + "id": "0a0cc290", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "5.74 ms ± 29.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" + "4.48 ms ± 7.89 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], @@ -550,6 +571,7 @@ { "cell_type": "code", "execution_count": 14, + "id": "0d2c5beb", "metadata": {}, "outputs": [ { @@ -753,6 +775,7 @@ }, { "cell_type": "markdown", + "id": "395e2bba", "metadata": {}, "source": [ "### Converting a CuPy Array to a cuDF Series\n", @@ -763,6 +786,7 @@ { "cell_type": "code", "execution_count": 15, + "id": "d8518208", "metadata": {}, "outputs": [ { @@ -787,6 +811,7 @@ }, { "cell_type": "markdown", + "id": "7e159619", "metadata": {}, "source": [ "### Interweaving CuDF and CuPy for Smooth PyData Workflows\n", @@ -799,6 +824,7 @@ { "cell_type": "code", "execution_count": 16, + "id": "2bb8ed81", "metadata": {}, "outputs": [ { @@ -1000,6 +1026,7 @@ }, { "cell_type": "markdown", + "id": "2f3d4e78", "metadata": {}, "source": [ "We can just transform it into a CuPy array and use the `axis` argument of `sum`." 
@@ -1008,6 +1035,7 @@ { "cell_type": "code", "execution_count": 17, + "id": "2dde030d", "metadata": {}, "outputs": [ { @@ -1035,6 +1063,7 @@ }, { "cell_type": "markdown", + "id": "4450dcc3", "metadata": {}, "source": [ "With just that single line, we're able to seamlessly move between data structures in this ecosystem, giving us enormous flexibility without sacrificing speed." @@ -1042,6 +1071,7 @@ }, { "cell_type": "markdown", + "id": "61bfb868", "metadata": {}, "source": [ "### Converting a cuDF DataFrame to a CuPy Sparse Matrix\n", @@ -1054,6 +1084,7 @@ { "cell_type": "code", "execution_count": 18, + "id": "e531fd15", "metadata": {}, "outputs": [], "source": [ @@ -1072,6 +1103,7 @@ }, { "cell_type": "markdown", + "id": "3f5e6ade", "metadata": {}, "source": [ "We can define a sparsely populated DataFrame to illustrate this conversion to either sparse matrix format." @@ -1080,6 +1112,7 @@ { "cell_type": "code", "execution_count": 19, + "id": "58c7e074", "metadata": {}, "outputs": [], "source": [ @@ -1095,6 +1128,7 @@ { "cell_type": "code", "execution_count": 20, + "id": "9265228d", "metadata": {}, "outputs": [ { @@ -1143,115 +1177,115 @@ " \n", " \n", " \n", - " \n", " \n", " \n", - " \n", " \n", - " \n", - " \n", + " \n", " \n", " \n", " \n", - " \n", " \n", " \n", " \n", " \n", " \n", + " \n", " \n", + " \n", " \n", " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", " \n", " \n", - " \n", " \n", - " \n", - " \n", " \n", " \n", - " \n", + " \n", " \n", " \n", " \n", - " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", " \n", " \n", - " \n", " \n", " \n", " \n", " \n", " \n", " \n", - " \n", " \n", " \n", - " \n", " \n", - " \n", - " \n", " \n", " \n", - " \n", + " \n", " \n", " \n", " \n", " \n", " \n", " \n", + " \n", " \n", + " \n", + " \n", " \n", " \n", - " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", " \n", " \n", - " \n", " \n", - " \n", - " \n", " \n", " \n", - " \n", + " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", + " \n", " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", " \n", " \n", " \n", - " \n", " \n", " \n", - " \n", " \n", - " \n", - " \n", " \n", " \n", - " \n", - " \n", + " \n", + " \n", " \n", " \n", " \n", " \n", " \n", + " \n", " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", @@ -1261,19 +1295,19 @@ "" ], "text/plain": [ - " a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 \\\n", - "0 0.000000 0.0 0.0 0.000000 0.0 9.37476 0.000000 0.0 0.0 0.000000 \n", - "1 0.000000 0.0 0.0 0.000000 0.0 0.00000 0.000000 0.0 0.0 0.000000 \n", - "2 3.232751 0.0 0.0 0.000000 0.0 0.00000 8.341915 0.0 0.0 0.000000 \n", - "3 0.000000 0.0 0.0 0.000000 0.0 0.00000 0.000000 0.0 0.0 0.000000 \n", - "4 0.000000 0.0 0.0 7.743024 0.0 0.00000 0.000000 0.0 0.0 5.987098 \n", + " a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 \\\n", + "0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 \n", + "1 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 -5.241297 0.0 0.0 0.0 \n", + "2 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 \n", + "3 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 \n", + "4 0.0 0.0 0.0 0.0 0.0 0.0 2.526274 0.0 0.0 0.000000 0.0 0.0 0.0 \n", "\n", - " a10 a11 a12 a13 a14 a15 a16 a17 a18 a19 \n", - "0 6.237859 0.0 0.0 0.000000 0.0 0.0 0.00000 0.0 0.0 0.000000 \n", - "1 0.000000 0.0 0.0 0.065878 0.0 0.0 12.35705 0.0 0.0 0.000000 \n", - "2 0.000000 0.0 0.0 0.000000 0.0 0.0 0.00000 0.0 0.0 3.110362 \n", - "3 0.000000 0.0 0.0 0.000000 0.0 0.0 0.00000 0.0 0.0 0.000000 \n", - "4 0.000000 0.0 0.0 0.000000 0.0 0.0 0.00000 0.0 0.0 0.000000 " + " a13 a14 a15 a16 a17 a18 a19 \n", + "0 0.00000 0.000000 0.0 0.0 0.0 0.0 11.308953 \n", + "1 17.58476 0.000000 0.0 0.0 0.0 0.0 0.000000 \n", + "2 0.00000 0.000000 0.0 0.0 0.0 0.0 0.000000 \n", + "3 0.00000 10.869279 0.0 0.0 0.0 0.0 0.000000 \n", + "4 0.00000 0.000000 0.0 0.0 0.0 0.0 0.000000 " ] }, "execution_count": 20, @@ -1288,63 +1322,64 @@ { "cell_type": "code", "execution_count": 21, + "id": "5ba1a551", "metadata": {}, 
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - " (2, 0)\t3.2327506467190874\n", - " (259, 0)\t10.723428115951062\n", - " (643, 0)\t0.47763624588488707\n", - " (899, 0)\t8.857065309921685\n", - " (516, 0)\t8.792407143276648\n", - " (262, 0)\t2.1900894573805396\n", - " (390, 0)\t5.007630701229646\n", - " (646, 0)\t6.630703075588639\n", - " (392, 0)\t5.573713453854357\n", - " (776, 0)\t10.501281989515688\n", - " (904, 0)\t8.261890175181366\n", - " (1033, 0)\t-0.41106824704220446\n", - " (522, 0)\t12.619952511457068\n", - " (139, 0)\t12.753348070606792\n", - " (141, 0)\t4.936902335394504\n", - " (270, 0)\t-1.7695949916946174\n", - " (782, 0)\t4.378746787324408\n", - " (15, 0)\t8.554141682891935\n", - " (527, 0)\t5.1994882136423\n", - " (912, 0)\t2.6101212854793125\n", - " (401, 0)\t5.614628764689268\n", - " (403, 0)\t9.999468341523317\n", - " (787, 0)\t7.6170790481600985\n", - " (404, 0)\t5.105328903336744\n", - " (916, 0)\t1.395526391114967\n", + " (770, 0)\t-1.373354548007899\n", + " (771, 0)\t11.641890592020793\n", + " (644, 0)\t-1.4820515981598015\n", + " (773, 0)\t4.374245789758399\n", + " (646, 0)\t4.58071340724814\n", + " (776, 0)\t5.115792716318899\n", + " (649, 0)\t8.676941295251092\n", + " (522, 0)\t-0.11573951593420229\n", + " (396, 0)\t8.124303607236273\n", + " (652, 0)\t9.359339954077681\n", + " (141, 0)\t8.50710863345112\n", + " (272, 0)\t7.440244879175392\n", + " (1042, 0)\t4.286859524587998\n", + " (275, 0)\t-0.6091666840632348\n", + " (787, 0)\t10.124449357828695\n", + " (915, 0)\t11.391560911074649\n", + " (1043, 0)\t11.478396096078907\n", + " (408, 0)\t11.204049991287349\n", + " (536, 0)\t13.239689100708974\n", + " (26, 0)\t4.951917355877771\n", + " (794, 0)\t2.736556006961319\n", + " (539, 0)\t12.553519350929216\n", + " (412, 0)\t2.8682583361020786\n", + " (540, 0)\t-1.2121388231076713\n", + " (796, 0)\t6.986443354019786\n", " :\t:\n", - " (9328, 19)\t5.938629381103238\n", - " (9457, 19)\t4.463547879031807\n", - " (9458, 
19)\t-0.8034946631917106\n", - " (8051, 19)\t-1.904327616912268\n", - " (8819, 19)\t8.314944347687199\n", - " (7543, 19)\t1.4303204025224376\n", - " (8824, 19)\t5.1559713157589\n", - " (7673, 19)\t7.478681299798863\n", - " (7802, 19)\t0.502526238006068\n", - " (8186, 19)\t-3.824944685072472\n", - " (8570, 19)\t8.442324394481236\n", - " (8571, 19)\t6.204199957873215\n", - " (7420, 19)\t0.297737356585836\n", - " (9212, 19)\t3.934797966994188\n", - " (7421, 19)\t14.26161925450462\n", - " (8574, 19)\t5.826108027573207\n", - " (9214, 19)\t7.209975861932724\n", - " (9825, 19)\t11.155342644729613\n", - " (9702, 19)\t3.55144040779287\n", - " (9578, 19)\t12.638681362546228\n", - " (9712, 19)\t2.3542852760656348\n", - " (9969, 19)\t-2.645175092587592\n", - " (9973, 19)\t-2.2666402312025213\n", - " (9851, 19)\t-4.293381721466055\n", - " (9596, 19)\t6.6580506888430415\n" + " (9087, 19)\t-2.9543770156500395\n", + " (9440, 19)\t3.903613949374532\n", + " (9186, 19)\t0.3141028170017329\n", + " (9571, 19)\t1.7347840594688502\n", + " (9188, 19)\t14.68745562157488\n", + " (9316, 19)\t13.808308442016436\n", + " (9957, 19)\t9.705810918221086\n", + " (9318, 19)\t9.984168186940485\n", + " (9446, 19)\t5.173000114288142\n", + " (9830, 19)\t3.2442816093793607\n", + " (9835, 19)\t5.713078257113576\n", + " (9580, 19)\t5.373437384911853\n", + " (9326, 19)\t10.736403419943093\n", + " (9711, 19)\t-4.003216472911014\n", + " (9200, 19)\t5.560182026578174\n", + " (9844, 19)\t6.17251145210342\n", + " (9333, 19)\t7.085353006324948\n", + " (9208, 19)\t6.789030498520347\n", + " (9464, 19)\t4.314887636528589\n", + " (9720, 19)\t12.446300974563027\n", + " (9594, 19)\t4.317523130615451\n", + " (9722, 19)\t-2.3257161477576336\n", + " (9723, 19)\t1.9288133227037407\n", + " (9469, 19)\t0.268312217498608\n", + " (9599, 19)\t4.100996763787237\n" ] } ], @@ -1355,6 +1390,7 @@ }, { "cell_type": "markdown", + "id": "e8e58cd5", "metadata": {}, "source": [ "From here, we could continue our workflow with a CuPy 
sparse matrix.\n", @@ -1379,9 +1415,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.7" + "version": "3.8.13" } }, "nbformat": 4, - "nbformat_minor": 4 + "nbformat_minor": 5 } diff --git a/docs/cudf/source/user_guide/dask-cudf.md b/docs/cudf/source/user_guide/dask-cudf.md new file mode 100644 index 00000000000..0c0b37f641c --- /dev/null +++ b/docs/cudf/source/user_guide/dask-cudf.md @@ -0,0 +1,104 @@ +# Multi-GPU with Dask-cuDF + +cuDF is a single-GPU library. For Multi-GPU cuDF solutions we use +[Dask](https://dask.org/) and the [dask-cudf +package](https://github.com/rapidsai/cudf/tree/main/python/dask_cudf), +which is able to scale cuDF across multiple GPUs on a single machine, +or multiple GPUs across many machines in a cluster. + +[Dask DataFrame](http://docs.dask.org/en/latest/dataframe.html) was +originally designed to scale Pandas, orchestrating many Pandas +DataFrames spread across many CPUs into a cohesive parallel DataFrame. +Because cuDF currently implements only a subset of the Pandas API, not +all Dask DataFrame operations work with cuDF. + +The following is tested and expected to work: + +## What works + +- Data ingestion + + - `dask_cudf.read_csv` + - Use standard Dask ingestion with Pandas, then convert to cuDF (For + Parquet and other formats this is often decently fast) + +- Linear operations + + - Element-wise operations: `df.x + df.y`, `df ** 2` + - Assignment: `df['z'] = df.x + df.y` + - Row-wise selections: `df[df.x > 0]` + - Loc: `df.loc['2001-01-01': '2005-02-02']` + - Date time/string accessors: `df.timestamp.dt.dayofweek` + - ... 
and most similar operations in this category that are already
+  implemented in cuDF
+
+- Reductions
+
+  - Aggregations like `sum`, `mean`, `max`, `count`, and so on, on
+    `Series` objects
+  - Support for reductions on full dataframes
+  - `std`
+  - Custom reductions with
+    [dask.dataframe.reduction](https://docs.dask.org/en/latest/generated/dask.dataframe.Series.reduction.html)
+
+- Groupby aggregations
+
+  - On single columns: `df.groupby('x').y.max()`
+  - With custom aggregations:
+    - groupby standard deviation
+    - grouping on multiple columns
+    - groupby agg for multiple outputs
+
+- Joins:
+
+  - On full unsorted columns: `left.merge(right, on='id')`
+    (expensive)
+  - On sorted indexes:
+    `left.merge(right, left_index=True, right_index=True)` (fast)
+  - On large and small dataframes: `left.merge(cudf_df, on='id')`
+    (fast)
+
+- Rolling operations
+
+- Converting to and from other forms
+
+  - Dask + Pandas to Dask + cuDF:
+    `df.map_partitions(cudf.from_pandas)`
+  - Dask + cuDF to Dask + Pandas:
+    `df.map_partitions(lambda df: df.to_pandas())`
+  - cuDF to Dask + cuDF:
+    `dask.dataframe.from_pandas(df, npartitions=20)`
+  - Dask + cuDF to cuDF: `df.compute()`
+
+Additionally, all generic Dask operations, like `compute`, `persist`,
+and `visualize`, work regardless.
+
+## Developing the API
+
+Above we mention the following:
+
+> and most similar operations in this category that are already
+> implemented in cuDF
+
+This is because it is difficult to create a comprehensive list of
+operations in the cuDF and Pandas libraries. The API is large enough to
+be difficult to track effectively. For any operation that operates
+row-wise, like `fillna` or `query`, things will likely, but not
+certainly, work. If an operation doesn't work, it is often due to a slight
+inconsistency between Pandas and cuDF that is generally easy to fix.
We
+encourage users to look at the [cuDF issue
+tracker](https://github.com/rapidsai/cudf/issues) to see if their
+issue has already been reported and, if not, [raise a new
+issue](https://github.com/rapidsai/cudf/issues/new).
+
+## Navigating the API
+
+This project reuses the [Dask
+DataFrame](https://docs.dask.org/en/latest/dataframe.html) project,
+which was originally designed for Pandas, with the newer library cuDF.
+Because we use the same Dask classes for both projects, there are often
+methods that are implemented for Pandas, but not yet for cuDF. As a
+result, browsing the full Dask DataFrame API can be misleading to users,
+and can lead to frustration when operations that are advertised in the
+Dask API do not work as expected with cuDF. We apologize for this in
+advance.
diff --git a/docs/cudf/source/user_guide/data-types.md b/docs/cudf/source/user_guide/data-types.md
new file mode 100644
index 00000000000..8963f87d52e
--- /dev/null
+++ b/docs/cudf/source/user_guide/data-types.md
@@ -0,0 +1,153 @@
+# Supported Data Types
+
+cuDF supports many of the data types supported by NumPy and Pandas, including
+numeric, datetime, timedelta, categorical, and string data types. We
+also provide special data types for working with decimal, list-like,
+and dictionary-like data.
+
+All data types in cuDF are [nullable](missing-data).
+
+
+
+| Kind of data         | Data type(s)                                                                      |
+|----------------------|-----------------------------------------------------------------------------------|
+| Signed integer       | `'int8'`, `'int16'`, `'int32'`, `'int64'`                                         |
+| Unsigned integer     | `'uint32'`, `'uint64'`                                                            |
+| Floating-point       | `'float32'`, `'float64'`                                                          |
+| Datetime             | `'datetime64[s]'`, `'datetime64[ms]'`, `'datetime64[us]'`, `'datetime64[ns]'`     |
+| Timedelta (duration) | `'timedelta64[s]'`, `'timedelta64[ms]'`, `'timedelta64[us]'`, `'timedelta64[ns]'` |
+| Category             | `cudf.CategoricalDtype`                                                           |
+| String               | `'object'` or `'string'`                                                          |
+| Decimal              | `cudf.Decimal32Dtype`, `cudf.Decimal64Dtype`, `cudf.Decimal128Dtype`              |
+| List                 | `cudf.ListDtype`                                                                  |
+| Struct               | `cudf.StructDtype`                                                                |
+
+
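The numeric, datetime, and timedelta spellings in the table above are standard NumPy dtype strings, so they can be sanity-checked without a GPU. A minimal sketch using NumPy alone (cuDF accepts the same spellings):

```python
import numpy as np

# The numeric, datetime, and timedelta entries in the table above use
# standard NumPy dtype spellings, so NumPy can parse them directly.
spellings = ["int8", "uint32", "float32", "datetime64[us]", "timedelta64[ms]"]
for s in spellings:
    dt = np.dtype(s)
    print(s, "->", dt.kind, dt.itemsize)
```

The same strings can be passed as the `dtype=` argument of `cudf.Series`.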
+
+## NumPy data types
+
+We use NumPy data types for integer, floating-point, datetime, timedelta,
+and string data. Thus, just like in NumPy,
+`np.dtype("float32")`, `np.float32`, and `"float32"` are all acceptable
+ways to specify the `float32` data type:
+
+```python
+>>> import cudf
+>>> s = cudf.Series([1, 2, 3], dtype="float32")
+>>> s
+0    1.0
+1    2.0
+2    3.0
+dtype: float32
+```
+
+## A note on `object`
+
+The data type associated with string data in cuDF is `"object"`.
+
+```python
+>>> import cudf
+>>> s = cudf.Series(["abc", "def", "ghi"])
+>>> s.dtype
+dtype("object")
+```
+
+This is for compatibility with Pandas, but it can be misleading. In
+both NumPy and Pandas, `"object"` is the data type associated with data
+composed of arbitrary Python objects (not just strings). However,
+cuDF does not support storing arbitrary Python objects.
+
+## Decimal data types
+
+We provide special data types for working with decimal data, namely
+`Decimal32Dtype`, `Decimal64Dtype`, and `Decimal128Dtype`. Use these
+data types when you need to store values with greater precision than
+allowed by floating-point representation.
+
+Decimal data types in cuDF are based on fixed-point representation. A
+decimal data type is composed of a _precision_ and a _scale_. The
+precision represents the total number of digits in each value of this
+dtype. For example, the precision associated with the decimal value
+`1.023` is `4`. The scale is the total number of digits to the right
+of the decimal point. The scale associated with the value `1.023` is
+`3`.
+
+Each decimal data type is associated with a maximum precision:
+
+```python
+>>> cudf.Decimal32Dtype.MAX_PRECISION
+9.0
+>>> cudf.Decimal64Dtype.MAX_PRECISION
+18.0
+>>> cudf.Decimal128Dtype.MAX_PRECISION
+38
+```
+
+One way to create a decimal Series is from values of type
+[decimal.Decimal][python-decimal]:
+
+```python
+>>> from decimal import Decimal
+>>> s = cudf.Series([Decimal("1.01"), Decimal("4.23"), Decimal("0.5")])
+>>> s
+0    1.01
+1    4.23
+2    0.50
+dtype: decimal128
+>>> s.dtype
+Decimal128Dtype(precision=3, scale=2)
+```
+
+Notice the data type of the result: `1.01`, `4.23`, and `0.50` can all be
+represented with a precision of at least 3 and a scale of at least 2.
+
+However, the value `1.234` needs a precision of at least 4 and a
+scale of at least 3, and cannot be fully represented using this data
+type:
+
+```python
+>>> s[1] = Decimal("1.234")  # raises an error
+```
+
+## Nested data types (`List` and `Struct`)
+
+`ListDtype` and `StructDtype` are special data types in cuDF for
+working with list-like and dictionary-like data. These are referred to
+as "nested" data types, because they enable you to store a list of
+lists, a struct of lists, a struct of lists of lists, and so on.
+
+You can create list and struct Series from existing Pandas Series of
+lists and dictionaries respectively:
+
+```python
+>>> psr = pd.Series([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
+>>> psr
+0    {'a': 1, 'b': 2}
+1    {'a': 3, 'b': 4}
+dtype: object
+>>> gsr = cudf.from_pandas(psr)
+>>> gsr
+0    {'a': 1, 'b': 2}
+1    {'a': 3, 'b': 4}
+dtype: struct
+>>> gsr.dtype
+StructDtype({'a': dtype('int64'), 'b': dtype('int64')})
+```
+
+Or by reading them from disk, using a [file format that supports
+nested data](io):
+ +```python +>>> pdf = pd.DataFrame({"a": [[1, 2], [3, 4, 5], [6, 7, 8]]}) +>>> pdf.to_parquet("lists.pq") +>>> gdf = cudf.read_parquet("lists.pq") +>>> gdf + a +0 [1, 2] +1 [3, 4, 5] +2 [6, 7, 8] +>>> gdf["a"].dtype +ListDtype(int64) +``` + +[numpy-dtype]: https://numpy.org/doc/stable/reference/arrays.dtypes.html#arrays-dtypes +[python-decimal]: https://docs.python.org/3/library/decimal.html#decimal.Decimal diff --git a/docs/cudf/source/user_guide/groupby.md b/docs/cudf/source/user_guide/groupby.md new file mode 100644 index 00000000000..66b548727e1 --- /dev/null +++ b/docs/cudf/source/user_guide/groupby.md @@ -0,0 +1,273 @@ +--- +substitutions: + describe: '`describe`' +--- + +(basics-groupby)= + +# GroupBy + +cuDF supports a small (but important) subset of Pandas' [groupby +API](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html). + +## Summary of supported operations + +1. Grouping by one or more columns +2. Basic aggregations such as "sum", "mean", etc. +3. Quantile aggregation +4. A "collect" or `list` aggregation for collecting values in a group + into lists +5. Automatic exclusion of columns with unsupported dtypes ("nuisance" + columns) when aggregating +6. Iterating over the groups of a GroupBy object +7. `GroupBy.groups` API that returns a mapping of group keys to row + labels +8. `GroupBy.apply` API for performing arbitrary operations on each + group. Note that this has very limited functionality compared to the + equivalent Pandas function. See the section on + [apply](#groupby-apply) for more details. +9. `GroupBy.pipe` similar to + [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#piping-function-calls). 
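Items 1-4 above follow the Pandas groupby API. A small sketch of them, shown here with Pandas so it runs on the CPU (the cuDF spellings are the same, except that the list aggregation is written `agg("collect")` in cuDF):

```python
import pandas as pd  # the cuDF calls below are spelled the same way

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [10, 20, 30, 40]})

sums = df.groupby("a").b.sum()          # basic aggregation (item 2)
meds = df.groupby("a").b.quantile(0.5)  # quantile aggregation (item 3)
lists = df.groupby("a").b.agg(list)     # "collect"/list aggregation (item 4)

print(sums.to_dict())   # {1: 30, 2: 70}
print(lists.to_dict())  # {1: [10, 20], 2: [30, 40]}
```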
+ +## Grouping + +A GroupBy object is created by grouping the values of a `Series` or +`DataFrame` by one or more columns: + +```python +>>> import cudf +>>> df = cudf.DataFrame({'a': [1, 1, 1, 2, 2], 'b': [1, 1, 2, 2, 3], 'c': [1, 2, 3, 4, 5]}) +>>> df + a b c +0 1 1 1 +1 1 1 2 +2 1 2 3 +3 2 2 4 +4 2 3 5 +>>> gb1 = df.groupby('a') # grouping by a single column +>>> gb2 = df.groupby(['a', 'b']) # grouping by multiple columns +>>> gb3 = df.groupby(cudf.Series(['a', 'a', 'b', 'b', 'b'])) # grouping by an external column +``` + +````{warning} +Unlike Pandas, cuDF uses `sort=False` by default to achieve better +performance, which does not guarantee any particular group order in +the result. + +For example: + +```python +>>> df = cudf.DataFrame({'a' : [2, 2, 1], 'b' : [42, 21, 11]}) +>>> df.groupby('a').sum() + b +a +2 63 +1 11 +>>> df.to_pandas().groupby('a').sum() + b +a +1 11 +2 63 +``` + +Setting `sort=True` will produce Pandas-like output, but with some performance penalty: + +```python +>>> df.groupby('a', sort=True).sum() + b +a +1 11 +2 63 +``` +```` + +### Grouping by index levels + +You can also group by one or more levels of a MultiIndex: + +```python +>>> df = cudf.DataFrame( +... {'a': [1, 1, 1, 2, 2], 'b': [1, 1, 2, 2, 3], 'c': [1, 2, 3, 4, 5]} +... ).set_index(['a', 'b']) +... 
+>>> df.groupby(level='a') +``` + +### The `Grouper` object + +A `Grouper` can be used to disambiguate between columns and levels +when they have the same name: + +```python +>>> df + b c +b +1 1 1 +1 1 2 +1 2 3 +2 2 4 +2 3 5 +>>> df.groupby('b', level='b') # ValueError: Cannot specify both by and level +>>> df.groupby([cudf.Grouper(key='b'), cudf.Grouper(level='b')]) # OK +``` + +## Aggregation + +Aggregations on groups are supported via the `agg` method: + +```python +>>> df + a b c +0 1 1 1 +1 1 1 2 +2 1 2 3 +3 2 2 4 +4 2 3 5 +>>> df.groupby('a').agg('sum') + b c +a +1 4 6 +2 5 9 +>>> df.groupby('a').agg({'b': ['sum', 'min'], 'c': 'mean'}) + b c + sum min mean +a +1 4 1 2.0 +2 5 2 4.5 +>>> df.groupby("a").corr(method="pearson") + b c +a +1 b 1.000000 0.866025 + c 0.866025 1.000000 +2 b 1.000000 1.000000 + c 1.000000 1.000000 +``` + +The following table summarizes the available aggregations and the types +that support them: + +```{eval-rst} +.. table:: + :class: special-table + + +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ + | Aggregations / dtypes | Numeric | Datetime | String | Categorical | List | Struct | Interval | Decimal | + +====================================+===========+============+==========+===============+========+==========+============+===========+ + | count | ✅ | ✅ | ✅ | ✅ | | | | ✅ | + +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ + | size | ✅ | ✅ | ✅ | ✅ | | | | ✅ | + +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ + | sum | ✅ | ✅ | | | | | | ✅ | + +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ + | idxmin | ✅ | ✅ | | | | | | ✅ | + 
+------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ + | idxmax | ✅ | ✅ | | | | | | ✅ | + +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ + | min | ✅ | ✅ | ✅ | | | | | ✅ | + +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ + | max | ✅ | ✅ | ✅ | | | | | ✅ | + +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ + | mean | ✅ | ✅ | | | | | | | + +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ + | var | ✅ | ✅ | | | | | | | + +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ + | std | ✅ | ✅ | | | | | | | + +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ + | quantile | ✅ | ✅ | | | | | | | + +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ + | median | ✅ | ✅ | | | | | | | + +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ + | nunique | ✅ | ✅ | ✅ | ✅ | | | | ✅ | + +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ + | nth | ✅ | ✅ | ✅ | | | | | ✅ | + +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ + | collect | ✅ | ✅ | ✅ | | ✅ | | | ✅ | + 
+------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ + | unique | ✅ | ✅ | ✅ | ✅ | | | | | + +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ + | corr | ✅ | | | | | | | ✅ | + +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ + | cov | ✅ | | | | | | | ✅ | + +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ +``` + +## GroupBy apply + +To apply a function to each group, use the `GroupBy.apply()` method: + +```python +>>> df + a b c +0 1 1 1 +1 1 1 2 +2 1 2 3 +3 2 2 4 +4 2 3 5 +>>> df.groupby('a').apply(lambda x: x.max() - x.min()) + a b c +a +0 0 1 2 +1 0 1 1 +``` + +### Limitations + +- `apply` works by applying the provided function to each group + sequentially, and concatenating the results together. **This can be + very slow**, especially for a large number of small groups. For a + small number of large groups, it can give acceptable performance. +- The results may not always match Pandas exactly. For example, cuDF + may return a `DataFrame` containing a single column where Pandas + returns a `Series`. Some post-processing may be required to match + Pandas behavior. +- cuDF does not support some of the exceptional cases that Pandas + supports with `apply`, such as calling [describe] inside the + callable. + +## Transform + +The `.transform()` method aggregates per group, and broadcasts the +result to the group size, resulting in a Series/DataFrame that is of +the same size as the input Series/DataFrame.
+ +```python +>>> import cudf +>>> df = cudf.DataFrame({'a': [2, 1, 1, 2, 2], 'b': [1, 2, 3, 4, 5]}) +>>> df.groupby('a').transform('max') + b +0 5 +1 3 +2 3 +3 5 +4 5 +``` + +## Rolling window calculations + +Use the `GroupBy.rolling()` method to perform rolling window +calculations on each group: + +```python +>>> df + a b c +0 1 1 1 +1 1 1 2 +2 1 2 3 +3 2 2 4 +4 2 3 5 +``` + +Rolling window sum on each group with a window size of 2: + +```python +>>> df.groupby('a').rolling(2).sum() + a b c +a +1 0 <NA> <NA> <NA> + 1 2 2 3 + 2 2 3 5 +2 3 <NA> <NA> <NA> + 4 4 5 9 +``` + +[describe]: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#flexible-apply diff --git a/docs/cudf/source/user_guide/guide-to-udfs.ipynb b/docs/cudf/source/user_guide/guide-to-udfs.ipynb index 8026c378156..ef7500a2be9 100644 --- a/docs/cudf/source/user_guide/guide-to-udfs.ipynb +++ b/docs/cudf/source/user_guide/guide-to-udfs.ipynb @@ -2,15 +2,16 @@ "cells": [ { "cell_type": "markdown", + "id": "77149e57", "metadata": {}, "source": [ - "Overview of User Defined Functions with cuDF\n", - "====================================" + "# Overview of User Defined Functions with cuDF" ] }, { "cell_type": "code", "execution_count": 1, + "id": "0c6b65ce", "metadata": {}, "outputs": [], "source": [ @@ -21,6 +22,7 @@ }, { "cell_type": "markdown", + "id": "8826af13", "metadata": {}, "source": [ "Like many tabular data processing APIs, cuDF provides a range of composable, DataFrame style operators.
While out of the box functions are flexible and useful, it is sometimes necessary to write custom code, or user-defined functions (UDFs), that can be applied to rows, columns, and other groupings of the cells making up the DataFrame.\n", @@ -39,10 +41,10 @@ }, { "cell_type": "markdown", + "id": "32a8f4fb", "metadata": {}, "source": [ - "Series UDFs\n", - "--------------\n", + "## Series UDFs\n", "\n", "You can execute UDFs on Series in two ways:\n", "\n", @@ -54,14 +56,15 @@ }, { "cell_type": "markdown", + "id": "49399a84", "metadata": {}, "source": [ - "`cudf.Series.apply`\n", - "---------------------" + "### `cudf.Series.apply`" ] }, { "cell_type": "markdown", + "id": "0a209ea2", "metadata": {}, "source": [ "cuDF provides a similar API to `pandas.Series.apply` for applying scalar UDFs to series objects. Here is a very basic example." @@ -70,6 +73,7 @@ { "cell_type": "code", "execution_count": 2, + "id": "e28d5b82", "metadata": {}, "outputs": [], "source": [ @@ -79,6 +83,7 @@ }, { "cell_type": "markdown", + "id": "48a9fa5e", "metadata": {}, "source": [ "UDFs destined for `cudf.Series.apply` might look something like this:" @@ -87,6 +92,7 @@ { "cell_type": "code", "execution_count": 3, + "id": "96aeb19f", "metadata": {}, "outputs": [], "source": [ @@ -97,6 +103,7 @@ }, { "cell_type": "markdown", + "id": "e61d0169", "metadata": {}, "source": [ "`cudf.Series.apply` is called like `pd.Series.apply` and returns a new `Series` object:" @@ -105,6 +112,7 @@ { "cell_type": "code", "execution_count": 4, + "id": "8ca08834", "metadata": {}, "outputs": [ { @@ -127,14 +135,15 @@ }, { "cell_type": "markdown", + "id": "c98dab03", "metadata": {}, "source": [ - "Functions with Additional Scalar Arguments\n", - "---------------------------------------------------" + "### Functions with Additional Scalar Arguments" ] }, { "cell_type": "markdown", + "id": "2aa3df6f", "metadata": {}, "source": [ "In addition, `cudf.Series.apply` supports `args=` just like pandas, allowing you to write 
UDFs that accept an arbitrary number of scalar arguments. Here is an example of such a function and its API call in both pandas and cuDF:" @@ -143,6 +152,7 @@ { "cell_type": "code", "execution_count": 5, + "id": "8d156d01", "metadata": {}, "outputs": [], "source": [ @@ -153,6 +163,7 @@ { "cell_type": "code", "execution_count": 6, + "id": "1dee82d7", "metadata": {}, "outputs": [ { @@ -176,6 +187,7 @@ }, { "cell_type": "markdown", + "id": "22739e28", "metadata": {}, "source": [ "As a final note, `**kwargs` is not yet supported." @@ -183,14 +195,15 @@ }, { "cell_type": "markdown", + "id": "afbf33dc", "metadata": {}, "source": [ - "Nullable Data\n", - "----------------" + "### Nullable Data" ] }, { "cell_type": "markdown", + "id": "5dc06e8c", "metadata": {}, "source": [ "The null value `NA` propagates through unary and binary operations. Thus, `NA + 1`, `abs(NA)`, and `NA == NA` all return `NA`. To make this concrete, let's look at the same example from above, this time using nullable data:" @@ -199,6 +212,7 @@ { "cell_type": "code", "execution_count": 7, + "id": "bda261dd", "metadata": {}, "outputs": [ { @@ -224,6 +238,7 @@ { "cell_type": "code", "execution_count": 8, + "id": "0123ae07", "metadata": {}, "outputs": [], "source": [ @@ -235,6 +250,7 @@ { "cell_type": "code", "execution_count": 9, + "id": "e95868dd", "metadata": {}, "outputs": [ { @@ -258,6 +274,7 @@ }, { "cell_type": "markdown", + "id": "97372e15", "metadata": {}, "source": [ "Often, however, you want explicit null handling behavior inside the function. cuDF exposes this capability the same way as pandas, by interacting directly with the `NA` singleton object.
Here's an example of a function with explicit null handling:" @@ -266,6 +283,7 @@ { "cell_type": "code", "execution_count": 10, + "id": "6c65241b", "metadata": {}, "outputs": [], "source": [ @@ -280,6 +298,7 @@ { "cell_type": "code", "execution_count": 11, + "id": "ab0f4dbf", "metadata": {}, "outputs": [ { @@ -303,6 +322,7 @@ }, { "cell_type": "markdown", + "id": "bdddc4e8", "metadata": {}, "source": [ "In addition, `cudf.NA` can be returned from a function directly or conditionally. This capability should allow you to implement custom null handling in a wide variety of cases." @@ -310,14 +330,15 @@ }, { "cell_type": "markdown", + "id": "54cafbc0", "metadata": {}, "source": [ - "Lower level control with custom `numba` kernels\n", - "---------------------------------------------------------" + "### Lower level control with custom `numba` kernels" ] }, { "cell_type": "markdown", + "id": "00914f2a", "metadata": {}, "source": [ "In addition to the Series.apply() method for performing custom operations, you can also pass Series objects directly into [CUDA kernels written with Numba](https://numba.pydata.org/numba-doc/latest/cuda/kernels.html).\n", @@ -329,6 +350,7 @@ { "cell_type": "code", "execution_count": 12, + "id": "732434f6", "metadata": {}, "outputs": [], "source": [ @@ -338,6 +360,7 @@ { "cell_type": "code", "execution_count": 13, + "id": "4f5997e5", "metadata": {}, "outputs": [], "source": [ @@ -352,6 +375,7 @@ }, { "cell_type": "markdown", + "id": "d9667a55", "metadata": {}, "source": [ "This kernel will take an input array, multiply it by a configurable value (supplied at runtime), and store the result in an output array. Notice that we wrapped our logic in an `if` statement. Because we can launch more threads than the size of our array, we need to make sure that we don't use threads with an index that would be out of bounds. 
Leaving this out can result in undefined behavior.\n", @@ -362,6 +386,7 @@ { "cell_type": "code", "execution_count": 14, + "id": "ea6008a6", "metadata": {}, "outputs": [], "source": [ @@ -372,6 +397,7 @@ }, { "cell_type": "markdown", + "id": "3fb69909", "metadata": {}, "source": [ "After calling our kernel, our DataFrame is now populated with the result." @@ -380,6 +406,7 @@ { "cell_type": "code", "execution_count": 15, + "id": "183a82ed", "metadata": {}, "outputs": [ { @@ -469,6 +496,7 @@ }, { "cell_type": "markdown", + "id": "ab9c305e", "metadata": {}, "source": [ "This API allows you, in theory, to write arbitrary kernel logic, potentially accessing and using elements of the series at arbitrary indices and using them on cuDF data structures. Advanced developers with some CUDA experience can often use this capability to implement iterative transformations, or spot treat problem areas of a data pipeline with a custom kernel that does the same job faster." @@ -476,28 +504,29 @@ }, { "cell_type": "markdown", + "id": "0acc6ef2", "metadata": {}, "source": [ - "DataFrame UDFs\n", - "--------------------\n", + "## DataFrame UDFs\n", "\n", "Like `cudf.Series`, there are multiple ways of using UDFs on dataframes, which essentially amount to UDFs that expect multiple columns as input:\n", "\n", "- `cudf.DataFrame.apply`, which functions like `pd.DataFrame.apply` and expects a row udf\n", "- `cudf.DataFrame.apply_rows`, which is a thin wrapper around numba and expects a numba kernel\n", - "- `cudf.DataFrame.apply_chunks`, which is similar to `cudf.DataFrame.apply_rows` but offers lower level control.\n" + "- `cudf.DataFrame.apply_chunks`, which is similar to `cudf.DataFrame.apply_rows` but offers lower level control."
] }, { "cell_type": "markdown", + "id": "2102c3ed", "metadata": {}, "source": [ - "`cudf.DataFrame.apply`\n", - "---------------------------" + "### `cudf.DataFrame.apply`" ] }, { "cell_type": "markdown", + "id": "238bec41", "metadata": {}, "source": [ "`cudf.DataFrame.apply` is the main entrypoint for UDFs that expect multiple columns as input and produce a single output column. Functions intended to be consumed by this API are written in terms of a \"row\" argument. The \"row\" is considered to be like a dictionary and contains all of the column values at a certain `iloc` in a `DataFrame`. The function can access these values by key within the function, the keys being the column names corresponding to the desired value. Below is an example function that would be used to add column `A` and column `B` together inside a UDF." @@ -506,6 +535,7 @@ { "cell_type": "code", "execution_count": 16, + "id": "73653918", "metadata": {}, "outputs": [], "source": [ @@ -515,6 +545,7 @@ }, { "cell_type": "markdown", + "id": "b5eb32dd", "metadata": {}, "source": [ "Let's create some very basic toy data containing at least one null." 
@@ -523,6 +554,7 @@ { "cell_type": "code", "execution_count": 17, + "id": "077feb75", "metadata": {}, "outputs": [ { @@ -592,14 +624,16 @@ }, { "cell_type": "markdown", + "id": "609a3da5", "metadata": {}, "source": [ - "Finally call the function as you would in pandas - by using a lambda function to map the UDF onto \"rows\" of the DataFrame: " + "Finally call the function as you would in pandas - by using a lambda function to map the UDF onto \"rows\" of the DataFrame:" ] }, { "cell_type": "code", "execution_count": 18, + "id": "091e39e1", "metadata": {}, "outputs": [ { @@ -622,6 +656,7 @@ }, { "cell_type": "markdown", + "id": "44e54c31", "metadata": {}, "source": [ "The same function should produce the same result as pandas:" @@ -630,6 +665,7 @@ { "cell_type": "code", "execution_count": 19, + "id": "bd345fab", "metadata": {}, "outputs": [ { @@ -652,6 +688,7 @@ }, { "cell_type": "markdown", + "id": "004fbbba", "metadata": {}, "source": [ "Notice that Pandas returns `object` dtype - see notes on this in the caveats section." @@ -659,6 +696,7 @@ }, { "cell_type": "markdown", + "id": "0b11c172", "metadata": {}, "source": [ "Like `cudf.Series.apply`, these functions support generalized null handling. Here's a function that conditionally returns a different value if a certain input is null:" @@ -667,6 +705,7 @@ { "cell_type": "code", "execution_count": 20, + "id": "b70f4b3b", "metadata": {}, "outputs": [ { @@ -737,6 +776,7 @@ { "cell_type": "code", "execution_count": 21, + "id": "0313c8df", "metadata": {}, "outputs": [ { @@ -759,6 +799,7 @@ }, { "cell_type": "markdown", + "id": "313c77f3", "metadata": {}, "source": [ "`cudf.NA` can also be directly returned from a function resulting in data that has the correct nulls in the end, just as if it were run in Pandas.
For the following data, the last row fulfills the condition that `1 + 3 > 3` and returns `NA` for that row:" @@ -767,6 +808,7 @@ { "cell_type": "code", "execution_count": 22, + "id": "96a7952a", "metadata": {}, "outputs": [ { @@ -845,6 +887,7 @@ { "cell_type": "code", "execution_count": 23, + "id": "e0815f60", "metadata": {}, "outputs": [ { @@ -867,6 +910,7 @@ }, { "cell_type": "markdown", + "id": "b9c674f4", "metadata": {}, "source": [ "Mixed types are allowed, but will return the common type, rather than object as in Pandas. Here's a null aware op between an int and a float column:" @@ -875,6 +919,7 @@ { "cell_type": "code", "execution_count": 24, + "id": "495efd14", "metadata": {}, "outputs": [ { @@ -948,6 +993,7 @@ { "cell_type": "code", "execution_count": 25, + "id": "678b0b5a", "metadata": {}, "outputs": [ { @@ -970,6 +1016,7 @@ }, { "cell_type": "markdown", + "id": "ce0897c0", "metadata": {}, "source": [ "Functions may also return scalar values, however the result will be promoted to a safe type regardless of the data. 
This means even if you have a function like:\n", @@ -991,6 +1038,7 @@ { "cell_type": "code", "execution_count": 26, + "id": "acf48d56", "metadata": {}, "outputs": [ { @@ -1063,6 +1111,7 @@ { "cell_type": "code", "execution_count": 27, + "id": "78a98172", "metadata": {}, "outputs": [ { @@ -1085,6 +1134,7 @@ }, { "cell_type": "markdown", + "id": "2ceaece4", "metadata": {}, "source": [ "Any number of columns and many arithmetic operators are supported, allowing for complex UDFs:" @@ -1093,6 +1143,7 @@ { "cell_type": "code", "execution_count": 28, + "id": "142c30a9", "metadata": {}, "outputs": [ { @@ -1181,6 +1232,7 @@ { "cell_type": "code", "execution_count": 29, + "id": "fee9198a", "metadata": {}, "outputs": [ { @@ -1203,17 +1255,17 @@ }, { "cell_type": "markdown", + "id": "9c587bd2", "metadata": {}, "source": [ - "Numba kernels for DataFrames\n", - "------------------------------------" + "### Numba kernels for DataFrames" ] }, { "cell_type": "markdown", + "id": "adc6a459", "metadata": {}, "source": [ - "\n", "We could apply a UDF on a DataFrame like we did above with `forall`. We'd need to write a kernel that expects multiple inputs, and pass multiple Series as arguments when we execute our kernel. Because this is fairly common and can be difficult to manage, cuDF provides two APIs to streamline this: `apply_rows` and `apply_chunks`. Below, we walk through an example of using `apply_rows`. `apply_chunks` works in a similar way, but also offers more control over low-level kernel behavior.\n", "\n", "Now that we have two numeric columns in our DataFrame, let's write a kernel that uses both of them." @@ -1222,6 +1274,7 @@ { "cell_type": "code", "execution_count": 30, + "id": "90cbcd85", "metadata": {}, "outputs": [], "source": [ @@ -1235,6 +1288,7 @@ }, { "cell_type": "markdown", + "id": "bce045f2", "metadata": {}, "source": [ "Notice that we need to `enumerate` through our `zipped` function arguments (which either match or are mapped to our input column names). 
We can pass this kernel to `apply_rows`. We'll need to specify a few arguments:\n", @@ -1251,6 +1305,7 @@ { "cell_type": "code", "execution_count": 31, + "id": "e782daff", "metadata": {}, "outputs": [ { @@ -1337,6 +1392,7 @@ }, { "cell_type": "markdown", + "id": "6b838b89", "metadata": {}, "source": [ "As expected, we see our conditional addition worked. At this point, we've successfully executed UDFs on the core data structures of cuDF." @@ -1344,9 +1400,10 @@ }, { "cell_type": "markdown", + "id": "fca97003", "metadata": {}, "source": [ - "## Null Handling in `apply_rows` and `apply_chunks`\n", + "### Null Handling in `apply_rows` and `apply_chunks`\n", "\n", "By default, DataFrame methods for applying UDFs like `apply_rows` will handle nulls pessimistically (all rows with a null value will be removed from the output if they are used in the kernel). Exploring how non-pessimistic null handling can lead to undefined behavior is outside the scope of this guide. Suffice it to say, pessimistic null handling is the safe and consistent approach. You can see an example below." ] @@ -1354,6 +1411,7 @@ { "cell_type": "code", "execution_count": 32, + "id": "befd8333", "metadata": {}, "outputs": [ { @@ -1445,6 +1503,7 @@ }, { "cell_type": "markdown", + "id": "c710ce86", "metadata": {}, "source": [ "In the dataframe above, there are three null values. Each column has a null in a different row. When we use our UDF with `apply_rows`, our output should have two nulls due to pessimistic null handling (because we're not using column `c`, the null value there does not matter to us)." @@ -1453,6 +1512,7 @@ { "cell_type": "code", "execution_count": 33, + "id": "d1f3dcaf", "metadata": {}, "outputs": [ { @@ -1546,6 +1606,7 @@ }, { "cell_type": "markdown", + "id": "53b9a2f8", "metadata": {}, "source": [ "As expected, we end up with two nulls in our output. The null values from the columns we used propagated to our output, but the null from the column we ignored did not."
@@ -1553,10 +1614,10 @@ }, { "cell_type": "markdown", + "id": "4bbefa67", "metadata": {}, "source": [ - "Rolling Window UDFs\n", - "-------------------------\n", + "## Rolling Window UDFs\n", "\n", "For time-series data, we may need to operate on a small \\\"window\\\" of our column at a time, processing each portion independently. We could slide (\\\"roll\\\") this window over the entire column to answer questions like \\\"What is the 3-day moving average of a stock price over the past year?\"\n", "\n", @@ -1566,6 +1627,7 @@ { "cell_type": "code", "execution_count": 34, + "id": "6bc6aea3", "metadata": {}, "outputs": [ { @@ -1593,6 +1655,7 @@ { "cell_type": "code", "execution_count": 35, + "id": "a4c31df1", "metadata": {}, "outputs": [ { @@ -1613,6 +1676,7 @@ }, { "cell_type": "markdown", + "id": "ff40d863", "metadata": {}, "source": [ "Next, we'll define a function to use on our rolling windows. We created this one to highlight how you can include things like loops, mathematical functions, and conditionals. Rolling window UDFs do not yet support null values." @@ -1621,6 +1685,7 @@ { "cell_type": "code", "execution_count": 36, + "id": "eb5a081b", "metadata": {}, "outputs": [], "source": [ @@ -1637,6 +1702,7 @@ }, { "cell_type": "markdown", + "id": "df8ba31d", "metadata": {}, "source": [ "We can execute the function by passing it to `apply`. With `window=3`, `min_periods=3`, and `center=False`, our first two values are `null`." @@ -1645,6 +1711,7 @@ { "cell_type": "code", "execution_count": 37, + "id": "ddec3263", "metadata": {}, "outputs": [ { @@ -1670,6 +1737,7 @@ }, { "cell_type": "markdown", + "id": "187478db", "metadata": {}, "source": [ "We can apply this function to every column in a DataFrame, too." 
@@ -1678,6 +1746,7 @@ { "cell_type": "code", "execution_count": 38, + "id": "8b61094a", "metadata": {}, "outputs": [ { @@ -1759,6 +1828,7 @@ { "cell_type": "code", "execution_count": 39, + "id": "bb8c3019", "metadata": {}, "outputs": [ { @@ -1867,10 +1937,10 @@ }, { "cell_type": "markdown", + "id": "d4785060", "metadata": {}, "source": [ - "GroupBy DataFrame UDFs\n", - "-------------------------------\n", + "## GroupBy DataFrame UDFs\n", "\n", "We can also apply UDFs to grouped DataFrames using `apply_grouped`. This example is also drawn and adapted from the RAPIDS [API documentation]().\n", "\n", @@ -1880,6 +1950,7 @@ { "cell_type": "code", "execution_count": 40, + "id": "3dc272ab", "metadata": {}, "outputs": [ { @@ -1971,6 +2042,7 @@ { "cell_type": "code", "execution_count": 41, + "id": "c0578e0a", "metadata": {}, "outputs": [], "source": [ @@ -1979,6 +2051,7 @@ }, { "cell_type": "markdown", + "id": "4808726f", "metadata": {}, "source": [ "Next we'll define a function to apply to each group independently. In this case, we'll take the rolling average of column `e`, and call that new column `rolling_avg_e`." @@ -1987,6 +2060,7 @@ { "cell_type": "code", "execution_count": 42, + "id": "19f0f7fe", "metadata": {}, "outputs": [], "source": [ @@ -2006,6 +2080,7 @@ }, { "cell_type": "markdown", + "id": "7566f359", "metadata": {}, "source": [ "We can execute this with a very similar API to `apply_rows`. This time, though, it's going to execute independently for each group." @@ -2014,6 +2089,7 @@ { "cell_type": "code", "execution_count": 43, + "id": "c43426c3", "metadata": {}, "outputs": [ { @@ -2157,6 +2233,7 @@ }, { "cell_type": "markdown", + "id": "c8511306", "metadata": {}, "source": [ "Notice how, with a window size of three in the kernel, the first two values in each group for our output column are null." 
@@ -2164,10 +2241,10 @@ }, { "cell_type": "markdown", + "id": "0060678c", "metadata": {}, "source": [ - "Numba Kernels on CuPy Arrays\n", - "-------------------------------------\n", + "## Numba Kernels on CuPy Arrays\n", "\n", "We can also execute Numba kernels on CuPy NDArrays, again thanks to the `__cuda_array_interface__`. We can even run the same UDF on the Series and the CuPy array. First, we define a Series and then create a CuPy array from that Series." ] @@ -2175,6 +2252,7 @@ { "cell_type": "code", "execution_count": 44, + "id": "aa6a8509", "metadata": {}, "outputs": [ { @@ -2198,6 +2276,7 @@ }, { "cell_type": "markdown", + "id": "0fed556f", "metadata": {}, "source": [ "Next, we define a UDF and execute it on our Series. We need to allocate a Series of the same size for our output, which we'll call `out`." @@ -2206,6 +2285,7 @@ { "cell_type": "code", "execution_count": 45, + "id": "0bb8bf93", "metadata": {}, "outputs": [ { @@ -2238,6 +2318,7 @@ }, { "cell_type": "markdown", + "id": "a857b169", "metadata": {}, "source": [ "Finally, we execute the same function on our array. We allocate an empty array `out` to store our results." @@ -2246,6 +2327,7 @@ { "cell_type": "code", "execution_count": 46, + "id": "ce60b639", "metadata": {}, "outputs": [ { @@ -2267,14 +2349,15 @@ }, { "cell_type": "markdown", + "id": "b899d51c", "metadata": {}, "source": [ - "Caveats\n", - "---------" + "## Caveats" ] }, { "cell_type": "markdown", + "id": "fe7eb68b", "metadata": {}, "source": [ "- Only numeric, non-decimal scalar types are currently supported; support for strings and structured types is planned. Attempting to use this API with those types will throw a `TypeError`.\n", @@ -2283,10 +2366,10 @@ }, { "cell_type": "markdown", + "id": "c690563b", "metadata": {}, "source": [ - "Summary\n", - "-----------\n", + "## Summary\n", "\n", "This guide has covered a lot of content.
At this point, you should hopefully feel comfortable writing UDFs (with or without null values) that operate on\n", "\n", @@ -2323,5 +2406,5 @@ } }, "nbformat": 4, - "nbformat_minor": 4 + "nbformat_minor": 5 } diff --git a/docs/cudf/source/user_guide/index.md b/docs/cudf/source/user_guide/index.md new file mode 100644 index 00000000000..2750c75790a --- /dev/null +++ b/docs/cudf/source/user_guide/index.md @@ -0,0 +1,16 @@ +# User Guide + +```{toctree} +:maxdepth: 2 + +10min +data-types +io +missing-data +groupby +guide-to-udfs +cupy-interop +dask-cudf +internals +PandasCompat +``` diff --git a/docs/cudf/source/user_guide/index.rst b/docs/cudf/source/user_guide/index.rst deleted file mode 100644 index 1061008eb3c..00000000000 --- a/docs/cudf/source/user_guide/index.rst +++ /dev/null @@ -1,12 +0,0 @@ -========== -User Guide -========== - - -.. toctree:: - :maxdepth: 2 - - 10min.ipynb - 10min-cudf-cupy.ipynb - guide-to-udfs.ipynb - Working-with-missing-data.ipynb diff --git a/docs/cudf/source/user_guide/internals.md b/docs/cudf/source/user_guide/internals.md new file mode 100644 index 00000000000..6ceef3d3492 --- /dev/null +++ b/docs/cudf/source/user_guide/internals.md @@ -0,0 +1,212 @@ +# cuDF internals + +The cuDF API closely matches that of the +[Pandas](https://pandas.pydata.org/) library. Thus, we have the types +`cudf.Series`, `cudf.DataFrame` and `cudf.Index` which look and +feel very much like their Pandas counterparts. + +Under the hood, however, cuDF uses data structures very different from +Pandas. In this document, we describe these internal data structures. + +## Column + +Columns are cuDF's core data structure and they are modeled after the +[Apache Arrow Columnar +Format](https://arrow.apache.org/docs/format/Columnar.html). + +A column represents a sequence of values, any number of which may be +"null". Columns are specialized based on the type of data they contain. +Thus we have `NumericalColumn`, `StringColumn`, `DatetimeColumn`, +etc. 
+ +A column is composed of the following: + +- A **data type**, specifying the type of each element. +- A **data buffer** that may store the data for the column elements. + Some column types do not have a data buffer, instead storing data in + the children columns. +- A **mask buffer** whose bits represent the validity (null or not + null) of each element. Columns whose elements are all "valid" may not + have a mask buffer. Mask buffers are padded to a multiple of 64 bytes. +- A tuple of **children** columns, which enable the representation of + complex types, such as columns with non-fixed-width elements like + strings or lists. +- A **size** indicating the number of elements in the column. +- An integer **offset**: a column may represent a "slice" of another + column, in which case this offset represents the first element of the + slice. The size of the column then gives the extent of the slice. A + column that is not a slice has an offset of 0. + +For example, the `NumericalColumn` backing a Series with 1000 elements +of type 'int32' and containing nulls is composed of: + +1. A data buffer of size 4000 bytes (sizeof(int32) * 1000) +2. A mask buffer of size 128 bytes (1000/8 padded to a multiple of 64 + bytes) +3. No children columns + +As another example, the `StringColumn` backing the Series +`['do', 'you', 'have', 'any', 'cheese?']` is composed of: + +1. No data buffer +2. No mask buffer as there are no nulls in the Series +3. Two children columns: + + > - A column of UTF-8 characters + > `['d', 'o', 'y', 'o', 'u', 'h' ..., '?']` + > - A column of "offsets" to the characters column (in this case, + > `[0, 2, 5, 9, 12, 19]`) + +## Buffer + +The data and mask buffers of a column represent data in GPU memory +(a.k.a. *device memory*), and are objects of type +`cudf.core.buffer.Buffer`. + +Buffers can be constructed from array-like objects that live either on +the host (e.g., numpy arrays) or the device (e.g., cupy arrays). Arrays +must be of `uint8` dtype or viewed as such.
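The `uint8` view is just a reinterpretation of the underlying bytes, which can be illustrated with NumPy alone (no GPU or cuDF required); the host-side Buffer construction shown next uses exactly this kind of expression:

```python
import numpy as np

# Viewing an int64 array as uint8 reinterprets the same memory
# byte-by-byte; no data is copied.
ary = np.array([1, 2, 3], dtype="int64")
u8 = ary.view("uint8")

print(u8.dtype)                 # uint8
print(u8.size)                  # 24 elements: 3 values * 8 bytes each
print(ary.nbytes == u8.nbytes)  # True: the buffer size is unchanged
```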
+ +When constructing a Buffer from a host object such as a numpy array, new +device memory is allocated: + +```python +>>> from cudf.core.buffer import Buffer +>>> buf = Buffer(np.array([1, 2, 3], dtype='int64').view("uint8")) +>>> print(buf.ptr) # address of new device memory allocation +140050901762560 +>>> print(buf.size) +24 +>>> print(buf._owner) + +``` + +cuDF uses the [RMM](https://github.com/rapidsai/rmm) library for +allocating device memory. You can read more about device memory +allocation with RMM +[here](https://github.com/rapidsai/rmm#devicebuffers). + +When constructing a Buffer from a device object such as a CuPy array, no +new device memory is allocated. Instead, the Buffer points to the +existing allocation, keeping a reference to the device array: + +```python +>>> import cupy as cp +>>> c_ary = cp.asarray([1, 2, 3], dtype='int64') +>>> buf = Buffer(c_ary.view("uint8")) +>>> print(c_ary.data.mem.ptr) +140050901762560 +>>> print(buf.ptr) +140050901762560 +>>> print(buf.size) +24 +>>> print(buf._owner is c_ary) +True +``` + +An uninitialized block of device memory can be allocated with +`Buffer.empty`: + +```python +>>> buf = Buffer.empty(10) +>>> print(buf.size) +10 +>>> print(buf._owner) + +``` + +## ColumnAccessor + +cuDF `Series`, `DataFrame` and `Index` are all subclasses of an +internal `Frame` class. The underlying data structure of `Frame` is +an ordered, dictionary-like object known as `ColumnAccessor`, which +can be accessed via the `._data` attribute: + +```python +>>> a = cudf.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']}) +>>> a._data +ColumnAccessor(OrderedColumnDict([('x', ), ('y', )]), multiindex=False, level_names=(None,)) +``` + +ColumnAccessor is an ordered mapping of column labels to columns. In +addition to behaving like an OrderedDict, it supports things like +selecting multiple columns (both by index and label), as well as +hierarchical indexing. 
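As a rough, host-only illustration of these ordered-mapping semantics, here is a toy stand-in (a hypothetical class, not the real `ColumnAccessor`, which also coerces its values to Columns):

```python
class ToyAccessor:
    """Toy ordered label -> values mapping mimicking ColumnAccessor selection."""

    def __init__(self, data):
        self._data = dict(data)  # plain dicts preserve insertion order

    def select_by_index(self, index):
        items = list(self._data.items())
        if isinstance(index, slice):
            picked = items[index]
        elif isinstance(index, (list, tuple)):
            picked = [items[i] for i in index]
        else:
            picked = [items[index]]
        return ToyAccessor(picked)

    def select_by_label(self, labels):
        return ToyAccessor((k, v) for k, v in self._data.items() if k in labels)

    def labels(self):
        return list(self._data)


ca = ToyAccessor({'x': [1, 2, 3], 'y': ['a', 'b', 'c'], 'z': [4, 5, 6]})
print(ca.select_by_index([0, 1]).labels())     # ['x', 'y']
print(ca.select_by_label(['y', 'z']).labels())  # ['y', 'z']
```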
+ +```python +>>> from cudf.core.column_accessor import ColumnAccessor +``` + +The values of a ColumnAccessor are coerced to Columns during +construction: + +```python +>>> ca = ColumnAccessor({'x': [1, 2, 3], 'y': ['a', 'b', 'c']}) +>>> ca['x'] + +>>> ca['y'] + +>>> ca.pop('x') + +>>> ca +ColumnAccessor(OrderedColumnDict([('y', )]), multiindex=False, level_names=(None,)) +``` + +Columns can be inserted at a specified location: + +```python +>>> ca.insert('z', [3, 4, 5], loc=1) +>>> ca +ColumnAccessor(OrderedColumnDict([('x', ), ('z', ), ('y', )]), multiindex=False, level_names=(None,)) +``` + +Selecting columns by index: + +```python +>>> ca = ColumnAccessor({'x': [1, 2, 3], 'y': ['a', 'b', 'c'], 'z': [4, 5, 6]}) +>>> ca.select_by_index(1) +ColumnAccessor(OrderedColumnDict([('y', )]), multiindex=False, level_names=(None,)) +>>> ca.select_by_index([0, 1]) +ColumnAccessor(OrderedColumnDict([('x', ), ('y', )]), multiindex=False, level_names=(None,)) +>>> ca.select_by_index(slice(1, 3)) +ColumnAccessor(OrderedColumnDict([('y', ), ('z', )]), multiindex=False, level_names=(None,)) +``` + +Selecting columns by label: + +```python +>>> ca.select_by_label(['y', 'z']) +ColumnAccessor(OrderedColumnDict([('y', ), ('z', )]), multiindex=False, level_names=(None,)) +>>> ca.select_by_label(slice('x', 'y')) +ColumnAccessor(OrderedColumnDict([('x', ), ('y', )]), multiindex=False, level_names=(None,)) +``` + +A ColumnAccessor with tuple keys (and constructed with +`multiindex=True`) can be hierarchically indexed: + +```python +>>> ca = ColumnAccessor({('a', 'b'): [1, 2, 3], ('a', 'c'): [2, 3, 4], 'b': [4, 5, 6]}, multiindex=True) +>>> ca.select_by_label('a') +ColumnAccessor(OrderedColumnDict([('b', ), ('c', )]), multiindex=False, level_names=(None,)) +>>> ca.select_by_label(('a', 'b')) +ColumnAccessor(OrderedColumnDict([(('a', 'b'), )]), multiindex=False, level_names=(None,)) +``` + +"Wildcard" indexing is also allowed: + +```python +>>> ca = ColumnAccessor({('a', 'b'): [1, 2, 3], 
('a', 'c'): [2, 3, 4], ('d', 'b'): [4, 5, 6]}, multiindex=True) +>>> ca.select_by_label((slice(None), 'b')) +ColumnAccessor(OrderedColumnDict([(('a', 'b'), ), (('d', 'b'), )]), multiindex=True, level_names=(None, None)) +``` + +Finally, ColumnAccessors can convert to Pandas `Index` or +`MultiIndex` objects: + +```python +>>> ca.to_pandas_index() +MultiIndex([('a', 'b'), + ('a', 'c'), + ('d', 'b')], + ) +``` diff --git a/docs/cudf/source/basics/io-supported-types.rst b/docs/cudf/source/user_guide/io.md similarity index 69% rename from docs/cudf/source/basics/io-supported-types.rst rename to docs/cudf/source/user_guide/io.md index 4a7da60fa85..672375eedaf 100644 --- a/docs/cudf/source/basics/io-supported-types.rst +++ b/docs/cudf/source/user_guide/io.md @@ -1,10 +1,17 @@ -I/O Supported dtypes -==================== +# Input / Output -The following table lists are compatible cudf types for each supported IO format. +This page contains Input / Output related APIs in cuDF. -.. rst-class:: io-supported-types-table special-table +## I/O Supported dtypes + +The following table lists the compatible cuDF types for each supported +IO format. + +
+ +```{eval-rst} .. table:: + :class: io-supported-types-table special-table :widths: 15 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 +-----------------------+--------+--------+--------+--------+---------+--------+--------+--------+--------+-------------------+--------+--------+---------+---------+ @@ -64,7 +71,103 @@ The following table lists are compatible cudf types for each supported IO format +-----------------------+--------+--------+--------+--------+---------+--------+--------+--------+--------+---------+---------+--------+--------+---------+---------+ | decimal128 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | +-----------------------+--------+--------+--------+--------+---------+--------+--------+--------+--------+---------+---------+--------+--------+---------+---------+ +``` + +
+ + **Notes:** -* [¹] - Not GPU-accelerated. +- \[¹\] - Not GPU-accelerated. + +## GPUDirect Storage Integration + +Many IO APIs can use the GPUDirect Storage (GDS) library to optimize IO +operations. GDS enables a direct data path for direct memory access +(DMA) transfers between GPU memory and storage, which avoids a bounce +buffer through the CPU. GDS also has a compatibility mode that allows +the library to fall back to copying through a CPU bounce buffer. The +SDK is available for download +[here](https://developer.nvidia.com/gpudirect-storage). GDS is also +included in CUDA Toolkit 11.4 and higher. + +Use of GPUDirect Storage in cuDF is enabled by default, but can be +disabled through the environment variable `LIBCUDF_CUFILE_POLICY`. +This variable also controls the GDS compatibility mode. + +There are four valid values for the environment variable: + +- "GDS": Enable GDS use; GDS compatibility mode is *off*. +- "ALWAYS": Enable GDS use; GDS compatibility mode is *on*. +- "KVIKIO": Enable GDS through [KvikIO](https://github.com/rapidsai/kvikio). +- "OFF": Completely disable GDS use. + +If no value is set, behavior will be the same as the "GDS" option. + +This environment variable also affects how cuDF treats GDS errors: + +- When `LIBCUDF_CUFILE_POLICY` is set to "GDS" and a GDS API call + fails for any reason, cuDF falls back to the internal implementation + with bounce buffers. +- When `LIBCUDF_CUFILE_POLICY` is set to "ALWAYS" and a GDS API call + fails for any reason (unlikely, given that the compatibility mode is + on), cuDF throws an exception to propagate the error to the user. +- When `LIBCUDF_CUFILE_POLICY` is set to "KVIKIO" and a KvikIO API + call fails for any reason (unlikely, given that KvikIO implements + its own compatibility mode), cuDF throws an exception to propagate + the error to the user. 
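A small sketch of how an application might set this variable before cuDF is loaded. The `set_cufile_policy` helper below is hypothetical (not part of cuDF); only the environment variable name and its four documented values come from the text above:

```python
import os

# The four documented values of LIBCUDF_CUFILE_POLICY.
VALID_CUFILE_POLICIES = {"GDS", "ALWAYS", "KVIKIO", "OFF"}


def set_cufile_policy(policy="GDS"):
    """Hypothetical helper: validate and export the GDS policy.

    Must run before cuDF performs I/O, since libcudf reads the
    variable from the environment.
    """
    if policy not in VALID_CUFILE_POLICIES:
        raise ValueError(f"invalid LIBCUDF_CUFILE_POLICY: {policy!r}")
    os.environ["LIBCUDF_CUFILE_POLICY"] = policy


set_cufile_policy("KVIKIO")
print(os.environ["LIBCUDF_CUFILE_POLICY"])  # KVIKIO
```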
+ +For more information about error handling, compatibility mode, and +tuning parameters in KvikIO, see: + +Operations that support the use of GPUDirect Storage: + +- {py:func}`cudf.read_avro` +- {py:func}`cudf.read_parquet` +- {py:func}`cudf.read_orc` +- {py:meth}`cudf.DataFrame.to_csv` +- {py:meth}`cudf.DataFrame.to_parquet` +- {py:meth}`cudf.DataFrame.to_orc` + +Several parameters that can be used to tune the performance of +GDS-enabled I/O are exposed through environment variables: + +- `LIBCUDF_CUFILE_THREAD_COUNT`: Integral value, the maximum number of + parallel reads/writes per file (default 16). +- `LIBCUDF_CUFILE_SLICE_SIZE`: Integral value, the maximum size of each + GDS read/write, in bytes (default 4MB). Larger I/O operations are + split into multiple calls. + +## nvCOMP Integration + +Some types of compression/decompression can be performed using either +the [nvCOMP library](https://github.com/NVIDIA/nvcomp) or the internal +implementation. + +Which implementation is used by default depends on the data format and +the compression type. Behavior can be influenced through the environment +variable `LIBCUDF_NVCOMP_POLICY`. + +There are three valid values for the environment variable: + +- "STABLE": Only enable nvCOMP in places where it has been deemed + stable for production use. +- "ALWAYS": Enable all available uses of nvCOMP, including new, + experimental combinations. +- "OFF": Disable nvCOMP use whenever possible and use the internal + implementations instead. + +If no value is set, behavior will be the same as the "STABLE" option. + +```{eval-rst} +.. 
table:: Current policy for nvCOMP use for different types + :widths: 20 15 15 15 15 15 15 15 15 15 + + +-----------------------+--------+--------+--------+--------+---------+--------+--------+--------+--------+ + | | CSV | Parquet | JSON | ORC | AVRO | + +-----------------------+--------+--------+--------+--------+---------+--------+--------+--------+--------+ + | Compression Type | Writer | Reader | Writer | Reader | Writer¹ | Reader | Writer | Reader | Reader | + +=======================+========+========+========+========+=========+========+========+========+========+ + | snappy | ❌ | ❌ | Stable | Stable | ❌ | ❌ | Stable | Stable | ❌ | + +-----------------------+--------+--------+--------+--------+---------+--------+--------+--------+--------+ +``` diff --git a/docs/cudf/source/user_guide/Working-with-missing-data.ipynb b/docs/cudf/source/user_guide/missing-data.ipynb similarity index 87% rename from docs/cudf/source/user_guide/Working-with-missing-data.ipynb rename to docs/cudf/source/user_guide/missing-data.ipynb index 54fe774060e..ad12c675373 100644 --- a/docs/cudf/source/user_guide/Working-with-missing-data.ipynb +++ b/docs/cudf/source/user_guide/missing-data.ipynb @@ -2,6 +2,7 @@ "cells": [ { "cell_type": "markdown", + "id": "f8ffbea7", "metadata": {}, "source": [ "# Working with missing data" @@ -9,6 +10,7 @@ }, { "cell_type": "markdown", + "id": "7e3ab093", "metadata": {}, "source": [ "In this section, we will discuss missing (also referred to as `NA`) values in cudf. cudf supports having missing values in all dtypes. These missing values are represented by ``. These values are also referenced as \"null values\"." @@ -16,25 +18,7 @@ }, { "cell_type": "markdown", - "metadata": {}, - "source": [ - "1. [How to Detect missing values](#How-to-Detect-missing-values)\n", - "2. [Float dtypes and missing data](#Float-dtypes-and-missing-data)\n", - "3. [Datetimes](#Datetimes)\n", - "4. [Calculations with missing data](#Calculations-with-missing-data)\n", - "5. 
[Sum/product of Null/nans](#Sum/product-of-Null/nans)\n", - "6. [NA values in GroupBy](#NA-values-in-GroupBy)\n", - "7. [Inserting missing data](#Inserting-missing-data)\n", - "8. [Filling missing values: fillna](#Filling-missing-values:-fillna)\n", - "9. [Filling with cudf Object](#Filling-with-cudf-Object)\n", - "10. [Dropping axis labels with missing data: dropna](#Dropping-axis-labels-with-missing-data:-dropna)\n", - "11. [Replacing generic values](#Replacing-generic-values)\n", - "12. [String/regular expression replacement](#String/regular-expression-replacement)\n", - "13. [Numeric replacement](#Numeric-replacement)" - ] - }, - { - "cell_type": "markdown", + "id": "8d657a82", "metadata": {}, "source": [ "## How to Detect missing values" @@ -42,6 +26,7 @@ }, { "cell_type": "markdown", + "id": "9ea9f672", "metadata": {}, "source": [ "To detect missing values, you can use `isna()` and `notna()` functions." @@ -50,6 +35,7 @@ { "cell_type": "code", "execution_count": 1, + "id": "58050adb", "metadata": {}, "outputs": [], "source": [ @@ -60,6 +46,7 @@ { "cell_type": "code", "execution_count": 2, + "id": "416d73da", "metadata": {}, "outputs": [], "source": [ @@ -69,6 +56,7 @@ { "cell_type": "code", "execution_count": 3, + "id": "5dfc6bc3", "metadata": {}, "outputs": [ { @@ -141,6 +129,7 @@ { "cell_type": "code", "execution_count": 4, + "id": "4d7f7a6d", "metadata": {}, "outputs": [ { @@ -213,6 +202,7 @@ { "cell_type": "code", "execution_count": 5, + "id": "40edca67", "metadata": {}, "outputs": [ { @@ -236,6 +226,7 @@ }, { "cell_type": "markdown", + "id": "acdf29d7", "metadata": {}, "source": [ "One has to be mindful that in Python (and NumPy), the nan's don’t compare equal, but None's do. Note that cudf/NumPy uses the fact that `np.nan != np.nan`, and treats `None` like `np.nan`." 
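The comparison rules that the cell above relies on can be checked with the standard library alone (a host-only sketch; cuDF's null semantics build on the same facts):

```python
# NaN never compares equal to itself; None does compare equal to None.
nan = float("nan")

print(nan == nan)    # False
print(nan != nan)    # True
print(None == None)  # True
```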
@@ -244,6 +235,7 @@ { "cell_type": "code", "execution_count": 6, + "id": "c269c1f5", "metadata": {}, "outputs": [ { @@ -264,6 +256,7 @@ { "cell_type": "code", "execution_count": 7, + "id": "99fb083a", "metadata": {}, "outputs": [ { @@ -283,22 +276,23 @@ }, { "cell_type": "markdown", + "id": "4fdb8bc7", "metadata": {}, "source": [ - "So as compared to above, a scalar equality comparison versus a None/np.nan doesn’t provide useful information.\n", - "\n" + "So as compared to above, a scalar equality comparison versus a None/np.nan doesn’t provide useful information." ] }, { "cell_type": "code", "execution_count": 8, + "id": "630ef6bb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", - "1 False\n", + "1 \n", "2 False\n", "3 False\n", "Name: b, dtype: bool" @@ -316,6 +310,7 @@ { "cell_type": "code", "execution_count": 9, + "id": "8162e383", "metadata": {}, "outputs": [], "source": [ @@ -325,6 +320,7 @@ { "cell_type": "code", "execution_count": 10, + "id": "199775b3", "metadata": {}, "outputs": [ { @@ -348,14 +344,15 @@ { "cell_type": "code", "execution_count": 11, + "id": "cd09d80c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "0 False\n", - "1 False\n", - "2 False\n", + "0 \n", + "1 \n", + "2 \n", "dtype: bool" ] }, @@ -371,6 +368,7 @@ { "cell_type": "code", "execution_count": 12, + "id": "6b23bb0c", "metadata": {}, "outputs": [], "source": [ @@ -380,6 +378,7 @@ { "cell_type": "code", "execution_count": 13, + "id": "cafb79ee", "metadata": {}, "outputs": [ { @@ -403,6 +402,7 @@ { "cell_type": "code", "execution_count": 14, + "id": "13363897", "metadata": {}, "outputs": [ { @@ -425,6 +425,7 @@ }, { "cell_type": "markdown", + "id": "208a3776", "metadata": {}, "source": [ "## Float dtypes and missing data" @@ -432,16 +433,18 @@ }, { "cell_type": "markdown", + "id": "2c174b88", "metadata": {}, "source": [ "Because ``NaN`` is a float, a column of integers with even one missing values is cast to floating-point dtype. 
However this doesn't happen by default.\n", "\n", - "By default if a ``NaN`` value is passed to `Series` constructor, it is treated as `` value. " + "By default if a ``NaN`` value is passed to `Series` constructor, it is treated as `` value." ] }, { "cell_type": "code", "execution_count": 15, + "id": "c59c3c54", "metadata": {}, "outputs": [ { @@ -464,6 +467,7 @@ }, { "cell_type": "markdown", + "id": "a9eb2d9c", "metadata": {}, "source": [ "Hence to consider a ``NaN`` as ``NaN`` you will have to pass `nan_as_null=False` parameter into `Series` constructor." @@ -472,6 +476,7 @@ { "cell_type": "code", "execution_count": 16, + "id": "ecc5ae92", "metadata": {}, "outputs": [ { @@ -494,6 +499,7 @@ }, { "cell_type": "markdown", + "id": "d1db7b08", "metadata": {}, "source": [ "## Datetimes" @@ -501,15 +507,16 @@ }, { "cell_type": "markdown", + "id": "548d3734", "metadata": {}, "source": [ - "For `datetime64` types, cudf doesn't support having `NaT` values. Instead these values which are specific to numpy and pandas are considered as null values(``) in cudf. The actual underlying value of `NaT` is `min(int64)` and cudf retains the underlying value when converting a cudf object to pandas object.\n", - "\n" + "For `datetime64` types, cudf doesn't support having `NaT` values. Instead these values which are specific to numpy and pandas are considered as null values(``) in cudf. The actual underlying value of `NaT` is `min(int64)` and cudf retains the underlying value when converting a cudf object to pandas object." 
] }, { "cell_type": "code", "execution_count": 17, + "id": "de70f244", "metadata": {}, "outputs": [ { @@ -535,6 +542,7 @@ { "cell_type": "code", "execution_count": 18, + "id": "8411a914", "metadata": {}, "outputs": [ { @@ -557,6 +565,7 @@ }, { "cell_type": "markdown", + "id": "df664145", "metadata": {}, "source": [ "Any operation on rows having `` values in a `datetime` column will result in a `` value at the same location in the resulting column:" @@ -565,6 +574,7 @@ { "cell_type": "code", "execution_count": 19, + "id": "829c32d0", "metadata": {}, "outputs": [ { @@ -587,6 +597,7 @@ }, { "cell_type": "markdown", + "id": "aa8031ef", "metadata": {}, "source": [ "## Calculations with missing data" @@ -594,6 +605,7 @@ }, { "cell_type": "markdown", + "id": "c587fae2", "metadata": {}, "source": [ "Null values propagate naturally through arithmetic operations between pandas objects." @@ -602,6 +614,7 @@ { "cell_type": "code", "execution_count": 20, + "id": "f8f2aec7", "metadata": {}, "outputs": [], "source": [ @@ -611,6 +624,7 @@ { "cell_type": "code", "execution_count": 21, + "id": "0c8a3011", "metadata": {}, "outputs": [], "source": [ @@ -620,6 +634,7 @@ { "cell_type": "code", "execution_count": 22, + "id": "052f6c2b", "metadata": {}, "outputs": [ { @@ -698,6 +713,7 @@ { "cell_type": "code", "execution_count": 23, + "id": "0fb0a083", "metadata": {}, "outputs": [ { @@ -776,6 +792,7 @@ { "cell_type": "code", "execution_count": 24, + "id": "6f8152c0", "metadata": {}, "outputs": [ { @@ -853,6 +870,7 @@ }, { "cell_type": "markdown", + "id": "11170d49", "metadata": {}, "source": [ "While summing the data along a series, `NA` values will be treated as `0`."
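These reduction semantics can be sketched in plain Python, with `None` standing in for a missing value (a host-only sketch, not cuDF's implementation):

```python
# Reductions skip missing values: for sum they contribute nothing,
# and for mean the divisor is the count of non-null values.
values = [1, None, 2, 3, None]
present = [v for v in values if v is not None]

print(sum(present))                 # 6
print(sum(present) / len(present))  # 2.0
```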
@@ -861,6 +879,7 @@ { "cell_type": "code", "execution_count": 25, + "id": "45081790", "metadata": {}, "outputs": [ { @@ -886,6 +905,7 @@ { "cell_type": "code", "execution_count": 26, + "id": "39922658", "metadata": {}, "outputs": [ { @@ -905,6 +925,7 @@ }, { "cell_type": "markdown", + "id": "6e99afe0", "metadata": {}, "source": [ "For the mean, `NA` values are skipped entirely, so the result is `2` in this case: `(1 + 2 + 3)/3 = 2`" @@ -913,6 +934,7 @@ { "cell_type": "code", "execution_count": 27, + "id": "b2f16ddb", "metadata": {}, "outputs": [ { @@ -932,6 +954,7 @@ }, { "cell_type": "markdown", + "id": "07f2ec5a", "metadata": {}, "source": [ "To preserve `NA` values in the above calculations, `sum` & `mean` support a `skipna` parameter.\n", @@ -942,6 +965,7 @@ { "cell_type": "code", "execution_count": 28, + "id": "d4a463a0", "metadata": {}, "outputs": [ { @@ -962,6 +986,7 @@ { "cell_type": "code", "execution_count": 29, + "id": "a944c42e", "metadata": {}, "outputs": [ { @@ -981,6 +1006,7 @@ }, { "cell_type": "markdown", + "id": "fb8c8f18", "metadata": {}, "source": [ "Cumulative methods like `cumsum` and `cumprod` ignore `NA` values by default." @@ -989,6 +1015,7 @@ { "cell_type": "code", "execution_count": 30, + "id": "4f2a7306", "metadata": {}, "outputs": [ { @@ -1013,6 +1040,7 @@ }, { "cell_type": "markdown", + "id": "c8f6054b", "metadata": {}, "source": [ "To preserve `NA` values in cumulative methods, provide `skipna=False`." @@ -1021,6 +1049,7 @@ { "cell_type": "code", "execution_count": 31, + "id": "d4c46776", "metadata": {}, "outputs": [ { @@ -1045,6 +1074,7 @@ }, { "cell_type": "markdown", + "id": "67077d65", "metadata": {}, "source": [ "## Sum/product of Null/nans" @@ -1052,6 +1082,7 @@ }, { "cell_type": "markdown", + "id": "ffbb9ca1", "metadata": {}, "source": [ "The sum of an empty or all-NA Series of a DataFrame is 0."
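These defaults follow the identity elements of the reductions, which plain Python exhibits as well:

```python
import math

# Reducing an empty sequence (or an all-null one, after skipping)
# returns the identity element of the operation.
print(sum([]))        # 0: additive identity
print(math.prod([]))  # 1: multiplicative identity
```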
@@ -1060,6 +1091,7 @@ { "cell_type": "code", "execution_count": 32, + "id": "f430c9ce", "metadata": {}, "outputs": [ { @@ -1080,6 +1112,7 @@ { "cell_type": "code", "execution_count": 33, + "id": "7fde514b", "metadata": {}, "outputs": [ { @@ -1100,6 +1133,7 @@ { "cell_type": "code", "execution_count": 34, + "id": "56cedd17", "metadata": {}, "outputs": [ { @@ -1119,6 +1153,7 @@ }, { "cell_type": "markdown", + "id": "cb188adb", "metadata": {}, "source": [ "The product of an empty or all-NA Series of a DataFrame is 1." @@ -1127,6 +1162,7 @@ { "cell_type": "code", "execution_count": 35, + "id": "d20bbbef", "metadata": {}, "outputs": [ { @@ -1147,6 +1183,7 @@ { "cell_type": "code", "execution_count": 36, + "id": "75abbcfa", "metadata": {}, "outputs": [ { @@ -1167,6 +1204,7 @@ { "cell_type": "code", "execution_count": 37, + "id": "becce0cc", "metadata": {}, "outputs": [ { @@ -1186,6 +1224,7 @@ }, { "cell_type": "markdown", + "id": "0e899e03", "metadata": {}, "source": [ "## NA values in GroupBy" @@ -1193,6 +1232,7 @@ }, { "cell_type": "markdown", + "id": "7fb20874", "metadata": {}, "source": [ "`NA` groups in GroupBy are automatically excluded. 
For example:" @@ -1201,6 +1241,7 @@ { "cell_type": "code", "execution_count": 38, + "id": "1379037c", "metadata": {}, "outputs": [ { @@ -1279,6 +1320,7 @@ { "cell_type": "code", "execution_count": 39, + "id": "d6b91e6f", "metadata": {}, "outputs": [ { @@ -1345,6 +1387,7 @@ }, { "cell_type": "markdown", + "id": "cb83fb11", "metadata": {}, "source": [ "It is also possible to include `NA` in groups by passing `dropna=False`" @@ -1353,9 +1396,8 @@ { "cell_type": "code", "execution_count": 40, - "metadata": { - "scrolled": true - }, + "id": "768c3e50", + "metadata": {}, "outputs": [ { "data": { @@ -1426,6 +1468,7 @@ }, { "cell_type": "markdown", + "id": "133816b4", "metadata": {}, "source": [ "## Inserting missing data" @@ -1433,6 +1476,7 @@ }, { "cell_type": "markdown", + "id": "306082ad", "metadata": {}, "source": [ "All dtypes support insertion of missing values by assignment. Any specific location in a series can be made null by assigning it to `None`." @@ -1441,6 +1485,7 @@ { "cell_type": "code", "execution_count": 41, + "id": "7ddde1fe", "metadata": {}, "outputs": [], "source": [ @@ -1450,6 +1495,7 @@ { "cell_type": "code", "execution_count": 42, + "id": "16e54597", "metadata": {}, "outputs": [ { @@ -1474,6 +1520,7 @@ { "cell_type": "code", "execution_count": 43, + "id": "f628f94d", "metadata": {}, "outputs": [], "source": [ @@ -1483,9 +1530,8 @@ { "cell_type": "code", "execution_count": 44, - "metadata": { - "scrolled": true - }, + "id": "b30590b7", + "metadata": {}, "outputs": [ { "data": { @@ -1508,6 +1554,7 @@ }, { "cell_type": "markdown", + "id": "a1b123d0", "metadata": {}, "source": [ "## Filling missing values: fillna" @@ -1515,6 +1562,7 @@ }, { "cell_type": "markdown", + "id": "114aa23a", "metadata": {}, "source": [ "`fillna()` can fill in `NA` & `NaN` values with non-NA data."
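Conceptually, filling nulls replaces each missing slot with a chosen value. A toy, host-only sketch (`fill_nulls` is a hypothetical helper, not cuDF's implementation):

```python
def fill_nulls(values, fill_value):
    """Replace None (standing in for <NA>) with fill_value."""
    return [fill_value if v is None else v for v in values]


print(fill_nulls([1, None, 3, None], 0))  # [1, 0, 3, 0]
```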
@@ -1523,6 +1571,7 @@ { "cell_type": "code", "execution_count": 45, + "id": "59e22668", "metadata": {}, "outputs": [ { @@ -1601,6 +1650,7 @@ { "cell_type": "code", "execution_count": 46, + "id": "05c221ee", "metadata": {}, "outputs": [ { @@ -1625,6 +1675,7 @@ }, { "cell_type": "markdown", + "id": "401f91b2", "metadata": {}, "source": [ "## Filling with cudf Object" @@ -1632,6 +1683,7 @@ }, { "cell_type": "markdown", + "id": "e79346d6", "metadata": {}, "source": [ "You can also fillna using a dict or Series that is alignable. The labels of the dict or index of the Series must match the columns of the frame you wish to fill. The use case of this is to fill a DataFrame with the mean of that column." @@ -1640,6 +1692,7 @@ { "cell_type": "code", "execution_count": 47, + "id": "f52c5d8f", "metadata": {}, "outputs": [], "source": [ @@ -1650,6 +1703,7 @@ { "cell_type": "code", "execution_count": 48, + "id": "6affebe9", "metadata": {}, "outputs": [], "source": [ @@ -1659,6 +1713,7 @@ { "cell_type": "code", "execution_count": 49, + "id": "1ce1b96f", "metadata": {}, "outputs": [], "source": [ @@ -1668,6 +1723,7 @@ { "cell_type": "code", "execution_count": 50, + "id": "90829195", "metadata": {}, "outputs": [], "source": [ @@ -1677,6 +1733,7 @@ { "cell_type": "code", "execution_count": 51, + "id": "c0feac14", "metadata": {}, "outputs": [ { @@ -1708,63 +1765,63 @@ "
\n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", " \n", - " \n", - " \n", + " \n", + " \n", " \n", " \n", " \n", " \n", " \n", - " \n", + " \n", " \n", " \n", " \n", - " \n", + " \n", " \n", " \n", " \n", " \n", " \n", - " \n", - " \n", + " \n", + " \n", " \n", " \n", " \n", " \n", - " \n", - " \n", + " \n", + " \n", " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", "
\n", - " Comm: tcp://127.0.0.1:44033\n", + " Comm: tcp://127.0.0.1:40519\n", " \n", " Total threads: 1\n", @@ -6201,7 +6355,7 @@ "
\n", - " Dashboard: http://127.0.0.1:45225/status\n", + " Dashboard: http://127.0.0.1:40951/status\n", " \n", " Memory: 62.82 GiB\n", @@ -6209,13 +6363,13 @@ "
\n", - " Nanny: tcp://127.0.0.1:46529\n", + " Nanny: tcp://127.0.0.1:39133\n", "
\n", - " Local directory: /home/mmccarty/sandbox/rapids/cudf/docs/cudf/source/user_guide/dask-worker-space/worker-zlsacw8_\n", + " Local directory: /home/ashwin/workspace/rapids/cudf/docs/cudf/source/user_guide/dask-worker-space/worker-3v0c20ux\n", "
00.0000000.00.00.0000000.09.374760.0000000.00.00.00.0000006.2378590.00.00.0000000.00.00.00.000000.0000000.00.00.0000000.00.011.308953
10.0000000.00.00.0000000.00.000000.0000000.00.00.0000000.00.0000000.00.00.065878-5.2412970.00.00.017.584760.0000000.00.012.357050.00.00.000000
23.2327510.00.00.0000000.00.000008.3419150.00.00.0000000.00.0000000.00.00.0000000.00.00.00.000000.0000000.00.00.03.1103620.00.000000
30.0000000.00.00.0000000.00.000000.0000000.00.00.0000000.00.0000000.00.00.0000000.00.00.00.0000010.8692790.00.00.00.00.000000
40.0000000.00.07.7430240.00.000000.0000000.00.05.9870980.0000000.02.5262740.00.00.0000000.00.00.00.000000.0000000.00.00.00.00.000000
00.7712450.0510241.199239-0.408268-0.676643-1.274743
1-1.1680410.702664-0.270806-0.029322-0.873593-1.214105
2-1.467009-0.143080-0.806151-0.8663711.081735-0.226840
3NaN-0.610798-0.2728950.8122781.074973
4NaNNaN1.396784-0.366725
5-0.439343-1.016239NaNNaN
61.093102-0.7647580.6751231.067536NaN
70.003098-0.7226480.2215682.025961NaN
8-0.095899-1.285156-0.300566-0.3172411.0112750.674891
90.1094652.497843-1.199856-0.877041-1.919394-1.029201
\n", @@ -1772,16 +1829,16 @@ ], "text/plain": [ " A B C\n", - "0 0.771245 0.051024 1.199239\n", - "1 -1.168041 0.702664 -0.270806\n", - "2 -1.467009 -0.143080 -0.806151\n", - "3 NaN -0.610798 -0.272895\n", - "4 NaN NaN 1.396784\n", - "5 -0.439343 NaN NaN\n", - "6 1.093102 -0.764758 NaN\n", - "7 0.003098 -0.722648 NaN\n", - "8 -0.095899 -1.285156 -0.300566\n", - "9 0.109465 2.497843 -1.199856" + "0 -0.408268 -0.676643 -1.274743\n", + "1 -0.029322 -0.873593 -1.214105\n", + "2 -0.866371 1.081735 -0.226840\n", + "3 NaN 0.812278 1.074973\n", + "4 NaN NaN -0.366725\n", + "5 -1.016239 NaN NaN\n", + "6 0.675123 1.067536 NaN\n", + "7 0.221568 2.025961 NaN\n", + "8 -0.317241 1.011275 0.674891\n", + "9 -0.877041 -1.919394 -1.029201" ] }, "execution_count": 51, @@ -1796,6 +1853,7 @@ { "cell_type": "code", "execution_count": 52, + "id": "a07c1260", "metadata": {}, "outputs": [ { @@ -1827,63 +1885,63 @@ "
00.7712450.0510241.199239-0.408268-0.676643-1.274743
1-1.1680410.702664-0.270806-0.029322-0.873593-1.214105
2-1.467009-0.143080-0.806151-0.8663711.081735-0.226840
3-0.149173-0.610798-0.272895-0.3272240.8122781.074973
4-0.149173-0.0343641.396784-0.3272240.316145-0.366725
5-0.439343-0.034364-0.036322-1.0162390.316145-0.337393
61.093102-0.764758-0.0363220.6751231.067536-0.337393
70.003098-0.722648-0.0363220.2215682.025961-0.337393
8-0.095899-1.285156-0.300566-0.3172411.0112750.674891
90.1094652.497843-1.199856-0.877041-1.919394-1.029201
\n", @@ -1891,16 +1949,16 @@ ], "text/plain": [ " A B C\n", - "0 0.771245 0.051024 1.199239\n", - "1 -1.168041 0.702664 -0.270806\n", - "2 -1.467009 -0.143080 -0.806151\n", - "3 -0.149173 -0.610798 -0.272895\n", - "4 -0.149173 -0.034364 1.396784\n", - "5 -0.439343 -0.034364 -0.036322\n", - "6 1.093102 -0.764758 -0.036322\n", - "7 0.003098 -0.722648 -0.036322\n", - "8 -0.095899 -1.285156 -0.300566\n", - "9 0.109465 2.497843 -1.199856" + "0 -0.408268 -0.676643 -1.274743\n", + "1 -0.029322 -0.873593 -1.214105\n", + "2 -0.866371 1.081735 -0.226840\n", + "3 -0.327224 0.812278 1.074973\n", + "4 -0.327224 0.316145 -0.366725\n", + "5 -1.016239 0.316145 -0.337393\n", + "6 0.675123 1.067536 -0.337393\n", + "7 0.221568 2.025961 -0.337393\n", + "8 -0.317241 1.011275 0.674891\n", + "9 -0.877041 -1.919394 -1.029201" ] }, "execution_count": 52, @@ -1915,6 +1973,7 @@ { "cell_type": "code", "execution_count": 53, + "id": "9e70d61a", "metadata": {}, "outputs": [ { @@ -1946,63 +2005,63 @@ "
00.7712450.0510241.199239-0.408268-0.676643-1.274743
1-1.1680410.702664-0.270806-0.029322-0.873593-1.214105
2-1.467009-0.143080-0.806151-0.8663711.081735-0.226840
3NaN-0.610798-0.2728950.8122781.074973
4NaN-0.0343641.3967840.316145-0.366725
5-0.439343-0.034364-0.036322-1.0162390.316145-0.337393
61.093102-0.764758-0.0363220.6751231.067536-0.337393
70.003098-0.722648-0.0363220.2215682.025961-0.337393
8-0.095899-1.285156-0.300566-0.3172411.0112750.674891
90.1094652.497843-1.199856-0.877041-1.919394-1.029201
\n", @@ -2010,16 +2069,16 @@ ], "text/plain": [ " A B C\n", - "0 0.771245 0.051024 1.199239\n", - "1 -1.168041 0.702664 -0.270806\n", - "2 -1.467009 -0.143080 -0.806151\n", - "3 NaN -0.610798 -0.272895\n", - "4 NaN -0.034364 1.396784\n", - "5 -0.439343 -0.034364 -0.036322\n", - "6 1.093102 -0.764758 -0.036322\n", - "7 0.003098 -0.722648 -0.036322\n", - "8 -0.095899 -1.285156 -0.300566\n", - "9 0.109465 2.497843 -1.199856" + "0 -0.408268 -0.676643 -1.274743\n", + "1 -0.029322 -0.873593 -1.214105\n", + "2 -0.866371 1.081735 -0.226840\n", + "3 NaN 0.812278 1.074973\n", + "4 NaN 0.316145 -0.366725\n", + "5 -1.016239 0.316145 -0.337393\n", + "6 0.675123 1.067536 -0.337393\n", + "7 0.221568 2.025961 -0.337393\n", + "8 -0.317241 1.011275 0.674891\n", + "9 -0.877041 -1.919394 -1.029201" ] }, "execution_count": 53, @@ -2033,6 +2092,7 @@ }, { "cell_type": "markdown", + "id": "0ace728d", "metadata": {}, "source": [ "## Dropping axis labels with missing data: dropna" @@ -2040,15 +2100,16 @@ }, { "cell_type": "markdown", + "id": "2ccd7115", "metadata": {}, "source": [ - "Missing data can be excluded using `dropna()`:\n", - "\n" + "Missing data can be excluded using `dropna()`:" ] }, { "cell_type": "code", "execution_count": 54, + "id": "98c57be7", "metadata": {}, "outputs": [ { @@ -2127,6 +2188,7 @@ { "cell_type": "code", "execution_count": 55, + "id": "bc3f273a", "metadata": {}, "outputs": [ { @@ -2187,6 +2249,7 @@ { "cell_type": "code", "execution_count": 56, + "id": "a48d4de0", "metadata": {}, "outputs": [ { @@ -2249,14 +2312,16 @@ }, { "cell_type": "markdown", + "id": "0b1954f9", "metadata": {}, "source": [ - "An equivalent `dropna()` is available for Series. " + "An equivalent `dropna()` is available for Series." 
] }, { "cell_type": "code", "execution_count": 57, + "id": "2dd8f660", "metadata": {}, "outputs": [ { @@ -2279,6 +2344,7 @@ }, { "cell_type": "markdown", + "id": "121eb6d7", "metadata": {}, "source": [ "## Replacing generic values" @@ -2286,6 +2352,7 @@ }, { "cell_type": "markdown", + "id": "3cc4c5f1", "metadata": {}, "source": [ "Often times we want to replace arbitrary values with other values.\n", @@ -2296,6 +2363,7 @@ { "cell_type": "code", "execution_count": 58, + "id": "e6c14e8a", "metadata": {}, "outputs": [], "source": [ @@ -2305,6 +2373,7 @@ { "cell_type": "code", "execution_count": 59, + "id": "a852f0cb", "metadata": {}, "outputs": [ { @@ -2330,6 +2399,7 @@ { "cell_type": "code", "execution_count": 60, + "id": "f6ac12eb", "metadata": {}, "outputs": [ { @@ -2354,6 +2424,7 @@ }, { "cell_type": "markdown", + "id": "a6e1b6d7", "metadata": {}, "source": [ "We can also replace any value with a `` value." @@ -2362,6 +2433,7 @@ { "cell_type": "code", "execution_count": 61, + "id": "f0156bff", "metadata": {}, "outputs": [ { @@ -2386,6 +2458,7 @@ }, { "cell_type": "markdown", + "id": "6673eefb", "metadata": {}, "source": [ "You can replace a list of values by a list of other values:" @@ -2394,6 +2467,7 @@ { "cell_type": "code", "execution_count": 62, + "id": "f3110f5b", "metadata": {}, "outputs": [ { @@ -2418,6 +2492,7 @@ }, { "cell_type": "markdown", + "id": "61521e8b", "metadata": {}, "source": [ "You can also specify a mapping dict:" @@ -2426,6 +2501,7 @@ { "cell_type": "code", "execution_count": 63, + "id": "45862d05", "metadata": {}, "outputs": [ { @@ -2450,6 +2526,7 @@ }, { "cell_type": "markdown", + "id": "04a34549", "metadata": {}, "source": [ "For a DataFrame, you can specify individual values by column:" @@ -2458,6 +2535,7 @@ { "cell_type": "code", "execution_count": 64, + "id": "348caa64", "metadata": {}, "outputs": [], "source": [ @@ -2467,6 +2545,7 @@ { "cell_type": "code", "execution_count": 65, + "id": "cca41ec4", "metadata": {}, "outputs": [ { @@ 
-2545,6 +2624,7 @@ { "cell_type": "code", "execution_count": 66, + "id": "64334693", "metadata": {}, "outputs": [ { @@ -2622,6 +2702,7 @@ }, { "cell_type": "markdown", + "id": "2f0ceec7", "metadata": {}, "source": [ "## String/regular expression replacement" @@ -2629,6 +2710,7 @@ }, { "cell_type": "markdown", + "id": "c6f44740", "metadata": {}, "source": [ "cudf supports replacing string values using `replace` API:" @@ -2637,6 +2719,7 @@ { "cell_type": "code", "execution_count": 67, + "id": "031d3533", "metadata": {}, "outputs": [], "source": [ @@ -2646,6 +2729,7 @@ { "cell_type": "code", "execution_count": 68, + "id": "12b41efb", "metadata": {}, "outputs": [], "source": [ @@ -2655,6 +2739,7 @@ { "cell_type": "code", "execution_count": 69, + "id": "d450df49", "metadata": {}, "outputs": [ { @@ -2732,6 +2817,7 @@ { "cell_type": "code", "execution_count": 70, + "id": "f823bc46", "metadata": {}, "outputs": [ { @@ -2809,6 +2895,7 @@ { "cell_type": "code", "execution_count": 71, + "id": "bc52f6e9", "metadata": {}, "outputs": [ { @@ -2885,14 +2972,16 @@ }, { "cell_type": "markdown", + "id": "7c1087be", "metadata": {}, "source": [ - "Replace a few different values (list -> list):\n" + "Replace a few different values (list -> list):" ] }, { "cell_type": "code", "execution_count": 72, + "id": "7e23eba9", "metadata": {}, "outputs": [ { @@ -2969,6 +3058,7 @@ }, { "cell_type": "markdown", + "id": "42845a9c", "metadata": {}, "source": [ "Only search in column 'b' (dict -> dict):" @@ -2977,6 +3067,7 @@ { "cell_type": "code", "execution_count": 73, + "id": "d2e79805", "metadata": {}, "outputs": [ { @@ -3053,6 +3144,7 @@ }, { "cell_type": "markdown", + "id": "774b42a6", "metadata": {}, "source": [ "## Numeric replacement" @@ -3060,6 +3152,7 @@ }, { "cell_type": "markdown", + "id": "1c1926ac", "metadata": {}, "source": [ "`replace()` can also be used similar to `fillna()`." 
@@ -3068,6 +3161,7 @@ { "cell_type": "code", "execution_count": 74, + "id": "355a2f0d", "metadata": {}, "outputs": [], "source": [ @@ -3077,6 +3171,7 @@ { "cell_type": "code", "execution_count": 75, + "id": "d9eed372", "metadata": {}, "outputs": [], "source": [ @@ -3086,6 +3181,7 @@ { "cell_type": "code", "execution_count": 76, + "id": "ae944244", "metadata": {}, "outputs": [ { @@ -3116,70 +3212,70 @@ " \n", " \n", " 0\n", - " <NA>\n", - " <NA>\n", + " -0.089358787\n", + " -0.728419386\n", " \n", " \n", " 1\n", - " <NA>\n", - " <NA>\n", + " -2.141612003\n", + " -0.574415182\n", " \n", " \n", " 2\n", - " 0.123160746\n", - " 1.09464783\n", + " <NA>\n", + " <NA>\n", " \n", " \n", " 3\n", - " <NA>\n", - " <NA>\n", + " 0.774643462\n", + " 2.07287721\n", " \n", " \n", " 4\n", - " <NA>\n", - " <NA>\n", + " 0.93799853\n", + " -1.054129436\n", " \n", " \n", " 5\n", - " 0.68137677\n", - " -0.357346253\n", + " <NA>\n", + " <NA>\n", " \n", " \n", " 6\n", - " <NA>\n", - " <NA>\n", + " -0.435293012\n", + " 1.163009584\n", " \n", " \n", " 7\n", - " <NA>\n", - " <NA>\n", + " 1.346623287\n", + " 0.31961371\n", " \n", " \n", " 8\n", - " 1.173285961\n", - " -0.968616065\n", + " <NA>\n", + " <NA>\n", " \n", " \n", " 9\n", - " 0.147922362\n", - " -0.154880098\n", + " <NA>\n", + " <NA>\n", " \n", " \n", "\n", "
" ], "text/plain": [ - " 0 1\n", - "0 <NA> <NA>\n", - "1 <NA> <NA>\n", - "2 0.123160746 1.09464783\n", - "3 <NA> <NA>\n", - "4 <NA> <NA>\n", - "5 0.68137677 -0.357346253\n", - "6 <NA> <NA>\n", - "7 <NA> <NA>\n", - "8 1.173285961 -0.968616065\n", - "9 0.147922362 -0.154880098" + " 0 1\n", + "0 -0.089358787 -0.728419386\n", + "1 -2.141612003 -0.574415182\n", + "2 <NA> <NA>\n", + "3 0.774643462 2.07287721\n", + "4 0.93799853 -1.054129436\n", + "5 <NA> <NA>\n", + "6 -0.435293012 1.163009584\n", + "7 1.346623287 0.31961371\n", + "8 <NA> <NA>\n", + "9 <NA> <NA>" ] }, "execution_count": 76, @@ -3193,15 +3289,16 @@ }, { "cell_type": "markdown", + "id": "0f32607c", "metadata": {}, "source": [ - "Replacing more than one value is possible by passing a list.\n", - "\n" + "Replacing more than one value is possible by passing a list." ] }, { "cell_type": "code", "execution_count": 77, + "id": "59b81c60", "metadata": {}, "outputs": [], "source": [ @@ -3211,6 +3308,7 @@ { "cell_type": "code", "execution_count": 78, + "id": "01a71d4c", "metadata": {}, "outputs": [ { @@ -3241,70 +3339,70 @@ " \n", " \n", " 0\n", - " 5.000000\n", - " 5.000000\n", + " 10.000000\n", + " -0.728419\n", " \n", " \n", " 1\n", - " 5.000000\n", - " 5.000000\n", + " -2.141612\n", + " -0.574415\n", " \n", " \n", " 2\n", - " 0.123161\n", - " 1.094648\n", + " 5.000000\n", + " 5.000000\n", " \n", " \n", " 3\n", - " 5.000000\n", - " 5.000000\n", + " 0.774643\n", + " 2.072877\n", " \n", " \n", " 4\n", - " 5.000000\n", - " 5.000000\n", + " 0.937999\n", + " -1.054129\n", " \n", " \n", " 5\n", - " 0.681377\n", - " -0.357346\n", + " 5.000000\n", + " 5.000000\n", " \n", " \n", " 6\n", - " 5.000000\n", - " 5.000000\n", + " -0.435293\n", + " 1.163010\n", " \n", " \n", " 7\n", - " 5.000000\n", - " 5.000000\n", + " 1.346623\n", + " 0.319614\n", " \n", " \n", " 8\n", - " 1.173286\n", - " -0.968616\n", + " 5.000000\n", + " 5.000000\n", " \n", " \n", " 9\n", - " 0.147922\n", - " -0.154880\n", + " 5.000000\n", + " 5.000000\n", " \n", " \n", "\n", "" ], "text/plain": [ - " 0 1\n", - "0 5.000000 5.000000\n", - "1
5.000000 5.000000\n", - "2 0.123161 1.094648\n", - "3 5.000000 5.000000\n", - "4 5.000000 5.000000\n", - "5 0.681377 -0.357346\n", - "6 5.000000 5.000000\n", - "7 5.000000 5.000000\n", - "8 1.173286 -0.968616\n", - "9 0.147922 -0.154880" + " 0 1\n", + "0 10.000000 -0.728419\n", + "1 -2.141612 -0.574415\n", + "2 5.000000 5.000000\n", + "3 0.774643 2.072877\n", + "4 0.937999 -1.054129\n", + "5 5.000000 5.000000\n", + "6 -0.435293 1.163010\n", + "7 1.346623 0.319614\n", + "8 5.000000 5.000000\n", + "9 5.000000 5.000000" ] }, "execution_count": 78, @@ -3318,15 +3416,16 @@ }, { "cell_type": "markdown", + "id": "1080e97b", "metadata": {}, "source": [ - "You can also operate on the DataFrame in place:\n", - "\n" + "You can also operate on the DataFrame in place:" ] }, { "cell_type": "code", "execution_count": 79, + "id": "5f0859d7", "metadata": {}, "outputs": [], "source": [ @@ -3336,6 +3435,7 @@ { "cell_type": "code", "execution_count": 80, + "id": "5cf28369", "metadata": {}, "outputs": [ { @@ -3366,70 +3466,70 @@ " \n", " \n", " 0\n", - " <NA>\n", - " <NA>\n", + " -0.089358787\n", + " -0.728419386\n", " \n", " \n", " 1\n", - " <NA>\n", - " <NA>\n", + " -2.141612003\n", + " -0.574415182\n", " \n", " \n", " 2\n", - " 0.123160746\n", - " 1.09464783\n", + " <NA>\n", + " <NA>\n", " \n", " \n", " 3\n", - " <NA>\n", - " <NA>\n", + " 0.774643462\n", + " 2.07287721\n", " \n", " \n", " 4\n", - " <NA>\n", - " <NA>\n", + " 0.93799853\n", + " -1.054129436\n", " \n", " \n", " 5\n", - " 0.68137677\n", - " -0.357346253\n", + " <NA>\n", + " <NA>\n", " \n", " \n", " 6\n", - " <NA>\n", - " <NA>\n", + " -0.435293012\n", + " 1.163009584\n", " \n", " \n", " 7\n", - " <NA>\n", - " <NA>\n", + " 1.346623287\n", + " 0.31961371\n", " \n", " \n", " 8\n", - " 1.173285961\n", - " -0.968616065\n", + " <NA>\n", + " <NA>\n", " \n", " \n", " 9\n", - " 0.147922362\n", - " -0.154880098\n", + " <NA>\n", + " <NA>\n", " \n", " \n", "\n", "" ], "text/plain": [ - " 0 1\n", - "0 <NA> <NA>\n", - "1 <NA> <NA>\n", - "2 0.123160746
1.09464783\n", - "3 <NA> <NA>\n", - "4 <NA> <NA>\n", - "5 0.68137677 -0.357346253\n", - "6 <NA> <NA>\n", - "7 <NA> <NA>\n", - "8 1.173285961 -0.968616065\n", - "9 0.147922362 -0.154880098" + " 0 1\n", + "0 -0.089358787 -0.728419386\n", + "1 -2.141612003 -0.574415182\n", + "2 <NA> <NA>\n", + "3 0.774643462 2.07287721\n", + "4 0.93799853 -1.054129436\n", + "5 <NA> <NA>\n", + "6 -0.435293012 1.163009584\n", + "7 1.346623287 0.31961371\n", + "8 <NA> <NA>\n", + "9 <NA> <NA>" ] }, "execution_count": 80, @@ -3444,7 +3544,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -3458,9 +3558,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.9" + "version": "3.8.13" } }, "nbformat": 4, - "nbformat_minor": 4 + "nbformat_minor": 5 }