
Run on Cubed #908

Open
tomwhite opened this issue Sep 22, 2022 · 5 comments
Labels
dispatching Issues related to how we send method calls to different backends

Comments

@tomwhite
Collaborator

This is an umbrella issue to track the work needed to run sgkit on Cubed.

This is possible because Cubed exposes the Python array API standard as well as common Dask functions and methods like map_blocks and Array.compute. There is also ongoing work to integrate Cubed in xarray, as part of exploring alternative parallel execution frameworks there.

@tomwhite
Collaborator Author

I've managed to get some basic aggregation tests in test_aggregation.py passing with the changes here: tomwhite@83ff400. This is not to be merged; it's just a demonstration at the moment. Most of the changes are needed because the array API is stricter about types, so some explicit casts have to be added.
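To illustrate the kind of cast involved (toy data, not the actual sgkit changes): array-API-strict namespaces don't silently promote small integer dtypes the way legacy NumPy does, so reductions over genotype arrays need the wider dtype requested explicitly.

```python
import numpy as np

# Hypothetical illustration of the stricter array-API type rules: the
# genotype calls are stored as int8, and the reduction result dtype is
# requested explicitly rather than relying on implicit promotion.
calls = np.asarray([[0, 1], [1, 1]], dtype=np.int8)

# Explicit cast before the reduction keeps array-API-strict namespaces
# happy and also avoids int8 overflow on large cohorts:
counts = np.sum(calls.astype(np.int64), axis=0)
print(counts)  # [1 2]
```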

They rely on some changes in xarray too: pydata/xarray#7067.

@tomwhite
Collaborator Author

Also, this example shows that Cubed works with Numba (locally at least), which answers @hammer's question here: https://github.com/pystatgen/sgkit/issues/885#issuecomment-1209288596.

@tomwhite
Collaborator Author

Since I opened this issue almost two years ago, Xarray has added a chunk manager abstraction (https://docs.xarray.dev/en/stable/internals/chunked-arrays.html), which makes it much easier to switch from Dask to Cubed as the backend computation engine without changing the code that expresses the computation. The nice thing about this approach is that we can use Dask, Cubed, or any other distributed array engine that Xarray might support in the future (such as Arkouda).
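The idea behind the chunk manager can be sketched in a few lines of plain Python (names here are illustrative, not Xarray's actual internals): a registry maps an engine name to an object that knows how to apply a function over chunked arrays, so calling code never imports dask or cubed directly.

```python
# Minimal sketch (hypothetical names) of the chunk-manager pattern: user
# code asks the registry for an engine by name instead of hard-coding a
# dask.array or cubed import.
class ChunkManager:
    def __init__(self, name, map_blocks):
        self.name = name
        self.map_blocks = map_blocks  # applies a function to each chunk

_REGISTRY = {}

def register(manager):
    _REGISTRY[manager.name] = manager

def get_manager(name):
    return _REGISTRY[name]

# A trivial "serial" engine standing in for dask or cubed:
register(ChunkManager("serial", lambda fn, chunks: [fn(c) for c in chunks]))

mgr = get_manager("serial")
result = mgr.map_blocks(sum, [[1, 2], [3, 4]])
print(result)  # [3, 7]
```

Swapping engines is then a one-line change of the name passed to the registry, which is exactly what makes a `--use-cubed` style switch cheap.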

I've started to explore what this might look like in https://github.com/tomwhite/sgkit/tree/xarray-apply-ufunc, but the two main ideas are:

  1. move from dask.array.map_blocks to xarray.apply_ufunc for applying functions in parallel,
  2. have a way to run the test suite using either Dask or Cubed so we don't have to do all the changes at once.

The code in the branch does this for count_call_alleles. As you can see in this commit (3833982), another minor benefit of using xarray.apply_ufunc is we can use named dimensions like ploidy and alleles rather than dimension indexes like 2.
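A hedged sketch of that second benefit (toy data and a simplified allele counter, not the actual sgkit implementation): with xarray.apply_ufunc the ploidy dimension is addressed by name via input_core_dims, rather than by a positional axis index like 2.

```python
import numpy as np
import xarray as xr

# Toy genotype calls: 1 variant, 2 samples, ploidy 2 (names illustrative).
calls = xr.DataArray(
    np.array([[[0, 1], [1, 1]]], dtype=np.int8),
    dims=["variants", "samples", "ploidy"],
)

def count_alleles(gt, n_alleles):
    # gt arrives with the "ploidy" core dim moved to the last axis;
    # count occurrences of each allele along it.
    return np.stack(
        [np.sum(gt == a, axis=-1) for a in range(n_alleles)], axis=-1
    )

counts = xr.apply_ufunc(
    count_alleles,
    calls,
    kwargs={"n_alleles": 2},
    input_core_dims=[["ploidy"]],    # consumed dimension, by name
    output_core_dims=[["alleles"]],  # new dimension, by name
)
print(counts.dims)    # ('variants', 'samples', 'alleles')
print(counts.values)  # [[[1 1] [0 2]]]
```

Because the dimensions are named, the same call works unchanged whatever axis position ploidy happens to occupy, and apply_ufunc can dispatch to whichever chunked backend the data is using.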

This commit (da8657e) shows the new pytest command-line option to run on cubed: --use-cubed.
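For readers unfamiliar with pytest options, a conftest.py hook along these lines could back such a flag (this is a hedged sketch with illustrative names, not the code in the commit):

```python
# Hypothetical conftest.py fragment: register a --use-cubed flag and
# expose the chosen backend to tests via a fixture.
import pytest

def pytest_addoption(parser):
    parser.addoption(
        "--use-cubed",
        action="store_true",
        default=False,
        help="run the test suite with Cubed instead of Dask",
    )

@pytest.fixture
def chunked_backend(request):
    # Tests depend on this fixture to pick the backend by name.
    return "cubed" if request.config.getoption("--use-cubed") else "dask"
```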

I would be interested in any thoughts on this direction @jeromekelleher, @hammer, @timothymillar, @benjeffery, @ravwojdyla, @eric-czech.

I'd like to set up a CI workflow that adds --use-cubed and runs just the tests for count_call_alleles to start with, before expanding to cover more of sgkit's aggregation functions.

@tomwhite
Collaborator Author

Here's a successful run for the count_call_alleles tests on Cubed: https://github.com/tomwhite/sgkit/actions/runs/10455603818/job/28950946965

@jeromekelleher
Collaborator

This sounds like an excellent approach +1
