
[Core] Function to determine in-object-store size of just-yielded object #44584

Closed
bveeramani opened this issue Apr 8, 2024 · 3 comments · Fixed by #45071
Labels: core (Issues that should be addressed in Ray Core), core-object-store, enhancement (Request for new feature and/or capability), P1 (Issue that should be fixed within a few weeks)

Comments

@bveeramani
Member

bveeramani commented Apr 8, 2024

Description

An API to determine the in-object-store size of an object that we just yielded.

This could be exposed as either a get_size_of_last_output API or a callback hook.
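For illustration, the two shapes might look roughly like the sketch below; the names and signatures are hypothetical, not existing Ray APIs:

from typing import Callable

# Option 1 (hypothetical): a function the generator task calls right after a
# yield; it returns the size, in bytes, that the just-yielded object occupies
# in the object store.
def get_size_of_last_output() -> int: ...

# Option 2 (hypothetical): a callback hook that Ray invokes with the
# in-object-store size once the yielded object has been stored.
def register_output_size_callback(callback: Callable[[int], None]) -> None: ...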

Use case

Ray Data accounts for the size of objects when making scheduling decisions.

Currently, we use pd.DataFrame.memory_usage to estimate the size of data "blocks." However, this estimate can be inaccurate, and as a result Ray Data can make bad scheduling decisions (see #44577).

Another approach is to serialize "blocks" to estimate their size, but this is inefficient since we'd serialize the data twice (once to determine the size, and again when we place the block in the object store).
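For example, a measure-by-serializing workaround might look like the sketch below; ray.cloudpickle is used here as a rough stand-in for the object store's serialization path, so the byte count is only an approximation:

import pandas as pd
from ray import cloudpickle

# Build a block and serialize it once just to measure its size.
block = pd.DataFrame({"text": ["x" * 10_000] * 1_000})
estimated_size = len(cloudpickle.dumps(block))

# Ray then serializes the block again when it is placed in the object store,
# so the serialization work is done twice.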

Having an API as described would enable Ray Data to make informed scheduling decisions with minimal performance overhead.

(Concretely, we'd use this API after line 425. b_out is the "block", and m_out is the associated metadata, such as its size.)

for b_out in map_transformer.apply_transform(iter(blocks), ctx):
    # TODO(Clark): Add input file propagation from input blocks.
    m_out = BlockAccessor.for_block(b_out).get_metadata([], None)
    m_out.exec_stats = stats.build()
    m_out.exec_stats.udf_time_s = map_transformer.udf_time()
    m_out.exec_stats.task_idx = ctx.task_idx
    yield b_out
    yield m_out
    stats = BlockExecStats.builder()
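With such an API, the loop could record the measured in-object-store size instead of the pandas-based estimate, roughly like the sketch below (get_size_of_last_output is the hypothetical API proposed above, shown under an assumed ray.experimental namespace):

for b_out in map_transformer.apply_transform(iter(blocks), ctx):
    m_out = BlockAccessor.for_block(b_out).get_metadata([], None)
    m_out.exec_stats = stats.build()
    yield b_out
    # Hypothetical call: the size, in bytes, that b_out occupies in the object
    # store, available only after Ray has stored the yielded object.
    m_out.size_bytes = ray.experimental.get_size_of_last_output()
    yield m_out
    stats = BlockExecStats.builder()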

@bveeramani bveeramani added enhancement Request for new feature and/or capability P1 Issue that should be fixed within a few weeks core Issues that should be addressed in Ray Core labels Apr 8, 2024
@bveeramani bveeramani changed the title [Core] Add function to determine size of objects in object store [Core] Add function to determine size of object yielded from streaming generator Apr 9, 2024
@bveeramani bveeramani changed the title [Core] Add function to determine size of object yielded from streaming generator [Core] Function to determine in-object-store size of just-yielded object Apr 9, 2024
@anyscalesam
Collaborator

@jjyao let's review at next sprint planning

@rynewang
Contributor

Hi @bveeramani, could you explain why pd.DataFrame.memory_usage is inaccurate? E.g., does the object actually use more memory than the DataFrame reports? It sounds like a bug in its own right.

@bveeramani
Member Author

@rynewang it's because pandas doesn't count memory usage from object dtypes:

The + symbol indicates that the true memory usage could be higher, because pandas does not count the memory used by values in columns with dtype=object.

For example, if you have columns of strings or lists, pandas doesn't count that data at all. This can lead to pandas reporting that a DataFrame is only a few KBs when in reality it's MBs or GBs.
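A quick way to see the gap (assuming memory_usage is called without deep=True, which is what leaves object columns uncounted):

import pandas as pd

# 1,000 rows, each holding a ~10 KB string in an object-dtype column.
df = pd.DataFrame({"text": ["x" * 10_000 for _ in range(1_000)]})

# Without deep=True, only the 8-byte object pointers are counted (~8 KB).
print(df.memory_usage().sum())

# With deep=True, the string payloads are counted too (~10 MB).
print(df.memory_usage(deep=True).sum())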
