[Core] Function to determine in-object-store size of just-yielded object #44584
Labels
core: Issues that should be addressed in Ray Core
core-object-store
enhancement: Request for new feature and/or capability
P1: Issue that should be fixed within a few weeks
Description
An API to determine the in-object-store size of an object that we just yielded.
This could be exposed as either a `get_size_of_last_output` API or a callback hook.

Use case
Ray Data accounts the size of objects to make scheduling decisions.
Currently, we use `pd.DataFrame.memory_usage` to estimate the size of data "blocks." However, this estimate can be inaccurate, and as a result Ray Data can make bad scheduling decisions (see #44577).

Another approach is to serialize "blocks" to measure their size, but this is inefficient since we'd serialize the data twice (once to determine the size, and again when we place it in the object store).
Having an API as described would enable Ray Data to make informed scheduling decisions with minimal performance overhead.
(Concretely, we'd use this API after line 425. `b_out` is the "block", and `m_out` is the associated metadata like size.)

ray/python/ray/data/_internal/execution/operators/map_operator.py, lines 419 to 428 in 9fb9d75
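To make the callback-hook variant of the request concrete, here is a minimal sketch in plain Python. It does not use Ray and does not reflect how Ray's object store works internally: `ToyObjectStore`, `store_yielded`, and the `on_stored` callback are hypothetical names invented for illustration, and `pickle` stands in for whatever serialization the real store performs. The point it shows is that the size can be reported as a by-product of the single serialization that already happens when a yielded object is stored.

```python
import pickle
from typing import Any, Callable, Dict, Iterable, Iterator, Tuple


class ToyObjectStore:
    """Hypothetical stand-in for an object store (not Ray's implementation)."""

    def __init__(self) -> None:
        self._objects: Dict[int, bytes] = {}
        self._next_id = 0

    def put(self, value: Any) -> Tuple[int, int]:
        # Serialize exactly once; the stored payload length is the
        # "in-object-store size" in this toy model.
        payload = pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL)
        object_id = self._next_id
        self._next_id += 1
        self._objects[object_id] = payload
        return object_id, len(payload)


def store_yielded(
    gen: Iterable[Any],
    store: ToyObjectStore,
    on_stored: Callable[[int, int], None],
) -> Iterator[int]:
    """Store each yielded object and report its stored size via a callback."""
    for value in gen:
        object_id, size = store.put(value)
        on_stored(object_id, size)  # hypothetical hook: (object id, size in bytes)
        yield object_id


def make_blocks() -> Iterator[Any]:
    # Placeholder "blocks"; real blocks would be DataFrames or Arrow tables.
    yield list(range(1000))
    yield "x" * 10


store = ToyObjectStore()
sizes = []
refs = list(store_yielded(make_blocks(), store, lambda oid, n: sizes.append(n)))
```

In this model the caller learns each block's stored size without estimating it and without serializing the block a second time, which is the property the use case above asks for.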