[Data] ArrowInvalid: offset overflow when calling Dataset.map_groups() #44861

Closed
bveeramani opened this issue Apr 19, 2024 · 0 comments · Fixed by #44862
Labels: bug (Something that is supposed to be working; but isn't), data (Ray Data-related issues), P1 (Issue that should be fixed within a few weeks)

What happened + What you expected to happen

I'm loading large images in map_groups. I expected my program to run, but I got Arrow errors:

Traceback (most recent call last):
  File "/home/ray/default/1.py", line 18, in <module>
    ds.take(1)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/dataset.py", line 2377, in take
    for row in limited_ds.iter_rows():
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/iterator.py", line 241, in _wrapped_iterator
    for batch in batch_iterable:
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/iterator.py", line 162, in _create_iterator
    block_iterator, stats, blocks_owned_by_consumer = self._to_block_iterator()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/iterator/iterator_impl.py", line 33, in _to_block_iterator
    block_iterator, stats, executor = ds._plan.execute_to_iterator()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/exceptions.py", line 86, in handle_trace
    raise e.with_traceback(None) from SystemException()
ray.exceptions.RayTaskError(AssertionError): ray::MapBatches(group_fn)() (pid=67455, ip=10.0.26.25)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/output_buffer.py", line 94, in next
    block_remainder = block.slice(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/arrow_block.py", line 246, in slice
    view = _copy_table(view)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/arrow_block.py", line 685, in _copy_table
    return transform_pyarrow.combine_chunks(table)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/arrow_ops/transform_pyarrow.py", line 295, in combine_chunks
    arr = _concatenate_extension_column(col)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/air/util/transform_pyarrow.py", line 34, in _concatenate_extension_column
    return ArrowTensorArray._concat_same_type(ca.chunks)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/air/util/tensor_extensions/arrow.py", line 551, in _concat_same_type
    storage = pa.concat_arrays([c.storage for c in to_concat])
  File "pyarrow/array.pxi", line 3321, in pyarrow.lib.concat_arrays
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

Versions / Dependencies

fe4dd5d

Reproduction script

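import numpy as np
import ray


def create_large_data(group):
    # Each result is a single row holding a 128 MiB uint8 tensor.
    return {"item": np.zeros((1, 128 * 1024 * 1024), dtype=np.uint8)}


# One input block with 1024 distinct ids yields 1024 groups,
# and each group produces one 128 MiB row.
ds = (
    ray.data.range(1024, override_num_blocks=1)
    .groupby(key="id")
    .map_groups(create_large_data)
)

ds.take(1)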
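
A rough back-of-the-envelope check on why this script can hit the overflow. This is a sketch, not a confirmed root cause: it assumes the tensor column's storage is an Arrow list array with signed 32-bit offsets, which is an inference from the `pa.concat_arrays` call on `c.storage` in the traceback. The helper name `max_rows_before_overflow` is made up for illustration.

```python
# Assumption: offsets are signed 32-bit integers, so the cumulative
# number of elements in one concatenated array cannot exceed 2**31 - 1.
INT32_MAX = 2**31 - 1


def max_rows_before_overflow(elements_per_row: int) -> int:
    """Return how many equally sized rows fit before the end offset exceeds int32."""
    return INT32_MAX // elements_per_row


# One row from create_large_data: 128 * 1024 * 1024 uint8 elements (128 MiB).
elements_per_row = 128 * 1024 * 1024

rows_ok = max_rows_before_overflow(elements_per_row)
print(rows_ok)  # 15
print((rows_ok + 1) * elements_per_row > INT32_MAX)  # True
```

Under that assumption, combining 16 or more of these rows into a single chunk would already exceed the offset range, and the script produces 1024 of them from one block.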

Issue Severity

High: It blocks me from completing my task.

bveeramani added the bug, P1, and data labels on Apr 19, 2024
bveeramani changed the title from "[Data] ArrowInvalid: offset overflow when calling Dataset.map_groups" to "[Data] ArrowInvalid: offset overflow when calling Dataset.map_groups()" on Apr 19, 2024
bveeramani self-assigned this on Apr 19, 2024