Profiling data of Scheduler.update_graph for very large graph #7998

Open
fjetter opened this issue Jul 12, 2023 · 2 comments
Labels: discussion (Discussing a topic with no specific actions yet), performance

fjetter commented Jul 12, 2023

I recently had the pleasure of seeing how the scheduler reacts to a very large graph. Not too well.

I submitted a graph with a couple of million tasks. Locally it looks like 2.5MM tasks, but the scheduler later reports fewer. Anyhow, it's seven digits. update_graph ran for about 5 min, blocking the event loop for that entire time (#7980).

What is eating up the most time:

Function | value
--- | ---
particularly this check for __main__ in dumps result | 5%
stringify | 12%
key_split | 12%
unpack_remotedata | 12%
generate_taskstate | 20%
dask.order | 12%
transitioning all tasks | 10%
other foo (e.g. walking the graph for deps and such) | 17%

It also looks like the TaskState objects and all the foo attached to them are taking up about 65% of the memory, which in this case is about 82 GiB. Assuming we're at 2MM tasks, that's roundabout 40 KB per TaskState. That's quite a lot.
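For context, here is a hedged sketch (hypothetical shapes and chunk sizes, not the actual workload behind this profile) of a dask.array expression whose low-level graph reaches a few million tasks:

# Hypothetical reproduction sketch, not the workload profiled above.
import dask.array as da
from distributed import Client

client = Client()  # assumes a scheduler is already running

# (40_000 / 25) ** 2 = 2_560_000 chunks for the random array alone; the
# transpose, add and reduction push the total task count well past that.
x = da.random.random((40_000, 40_000), chunks=(25, 25))
result = (x + x.T).mean().compute()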

scheduler-profile.zip

Nothing to do here, this is purely informational.

fjetter added the performance and discussion labels Jul 12, 2023
fjetter closed this as not planned Aug 3, 2023

fjetter commented Aug 14, 2023

I did some memory profiling of the scheduler [1] based on 145c13a

I submitted a large array workload with about 1.5MM tasks. The scheduler requires about 6GB of RAM to hold the computation state in memory. The peak is a bit larger since there is some intermediate state required (mostly for dask.order).

[figure: scheduler memory usage over time, plateauing at roughly 6 GB]

Once the plateau of this graph is reached, computation starts and the memory usage breaks down roughly as

5836 MiB Total
├── 290 MiB Raw graph -> (This is only part of the raw deserialized graph. The original one is about 680MiB)
├── 1945 MiB Materialized Graph
│   ├── 584 MiB Stringification of values of dsk [2]
│   ├── 132 MiB Stringification of keys [3]
│   ├── 624 MiB Stringification of dependencies [4]
│   ├── 475 MiB dumps_task (removed on main)
│   └── 130 MiB Other
├── 3379 MiB generate_taskstates
│   ├── 377 MiB key_split (group keys)
│   ├── 80 MiB actual tasks dict
│   └── 2970 MiB TaskState object
│       ├── 347 MiB slots (actual object space)
│       ├── 112 MiB weakref instances on self
│       ├── 110 MiB key_split_group
│       └── 2355 MiB various sets in TaskState (profiler points to waiting_on)
└── 222 MiB Other
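(As an aside, a generic way to get a rough allocation attribution like the above is to run the standard-library tracemalloc on the scheduler via Client.run_on_scheduler. This is only a sketch and not the profiler used for [1]; the array shape below is illustrative.)

# Generic sketch: trace allocations on the scheduler while a large graph is
# submitted, then fetch the top allocation sites.
import dask.array as da
from distributed import Client, wait

client = Client()  # assumes a running scheduler


def start_trace(dask_scheduler=None):
    import tracemalloc

    tracemalloc.start(25)  # keep up to 25 frames per allocation


def top_allocations(dask_scheduler=None):
    import tracemalloc

    snapshot = tracemalloc.take_snapshot()
    return [str(stat) for stat in snapshot.statistics("lineno")[:10]]


client.run_on_scheduler(start_trace)

x = da.random.random((30_000, 30_000), chunks=(50, 50))  # 360_000 chunks
fut = client.compute((x + x.T).sum())
wait(fut)

print("\n".join(client.run_on_scheduler(top_allocations)))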

The two big contributions worth discussing are the TaskState objects, which allocate more than 2 GiB, and the graph materialization.

The tracing for the TaskState object is a little fuzzy (possibly because it is using slots?) but it largely points to the usage of sets in TaskState. Indeed, empty sets allocate a relatively large amount of memory: with 9 sets and 3 dictionaries we're already at a lower bound of 2.09 KiB per TaskState

# Python 3.10.11
import sys

from dask.utils import format_bytes

# Lower bound for the empty containers on a single TaskState: 9 sets + 3 dicts
format_bytes(9 * sys.getsizeof(set()) + 3 * sys.getsizeof(dict()))
# Out: '2.09 kiB'

which adds up to almost 3 GiB for 1.5MM tasks from the empty containers alone. The actual memory use is even lower than this lower-bound calculation suggests (not sure what went wrong here...).

The other large contribution is the stringification of keys. Stringify does not cache/deduplicate str values, nor is the Python interpreter able to intern our keys (afaik, only possible w/ ASCII chars), so every call to stringify effectively allocates new memory.
While the actual stringified keys only take 132 MiB in this example, the lack of deduplication blows this up to much more.
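One direction for the deduplication part (a hedged sketch; plain str() stands in for the real stringify, and stringify_cached is a hypothetical helper) is to memoize the results so that a key referenced by many tasks is retained only once:

# Sketch only: memoize stringified keys so that a dependency referenced by
# thousands of tasks is retained once instead of once per reference.
_interned: dict[str, str] = {}


def stringify_cached(key) -> str:
    s = str(key)  # stand-in for the real stringify()
    # setdefault returns the first str seen for this value; later duplicates
    # become garbage instead of being retained in the graph structures.
    return _interned.setdefault(s, s)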

This suggests that we should either remove or rework stringification and possibly consider a slimmer representation of our TaskState object.
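On the TaskState side, one illustrative direction (these classes are hypothetical, not distributed's TaskState) is to allocate the per-task sets lazily, so that a set which is never touched costs only a pointer in its slot:

import sys


class EagerTaskState:
    # Every instance eagerly owns its (initially empty) sets.
    __slots__ = ("waiting_on", "waiters", "dependents")

    def __init__(self):
        self.waiting_on = set()
        self.waiters = set()
        self.dependents = set()


class LazyTaskState:
    # Sets are created on first access; an untouched slot holds only None.
    __slots__ = ("_waiting_on", "_waiters", "_dependents")

    def __init__(self):
        self._waiting_on = None
        self._waiters = None
        self._dependents = None

    @property
    def waiting_on(self) -> set:
        if self._waiting_on is None:
            self._waiting_on = set()
        return self._waiting_on


def shallow_size(obj) -> int:
    # Object header plus any non-None slot values (rough, shallow estimate).
    return sys.getsizeof(obj) + sum(
        sys.getsizeof(getattr(obj, slot))
        for slot in obj.__slots__
        if getattr(obj, slot) is not None
    )


print(shallow_size(EagerTaskState()))  # several hundred bytes: three empty sets
print(shallow_size(LazyTaskState()))   # just the slotted object itself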


[1] scheduler_memory_profile.html.zip
[2]

new_dsk[new_k] = stringify(v, exclusive=exclusive)

[3]
new_k = stringify(k)

[4]
stringify(k): {stringify(dep) for dep in deps}

fjetter reopened this Aug 14, 2023

fjetter commented Aug 14, 2023

Note that the above graph wasn't using any annotations. Annotations will add one more stringification for annotated keys.
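For illustration, a hedged sketch of an annotated submission (the annotation values are arbitrary):

# Layers created inside dask.annotate() carry annotations; per the comment
# above, each annotated key goes through one extra stringification.
import dask
import dask.array as da

with dask.annotate(priority=10, retries=2):
    x = da.random.random((10_000, 10_000), chunks=(100, 100))

result = (x + 1).sum()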
