
Include merge dim positions in group keys emitted by split_fragments #521

Merged

Conversation

norlandrhagen
Contributor

Opening up a PR to start working on a bugfix for the error described in Issue #517.

The error from #517 occurs at L185 of rechunking.py, in the combine_fragments function.

As @rabernat pointed out in #517, the error might be upstream in the split_fragments function.

Basically, it splits up the original dataset pieces into fragments,
and then puts them back together. Each merge dim should end up in a separate group.

So if I'm understanding this correctly, split_fragments does not take the MergeDim into account when building group keys, so the PCollections it emits group fragments from multiple variables together. When fed into combine_fragments, this produces the error: Expected a hypercube of shape [1] but got 2 fragments
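To make the mismatch concrete, here is a minimal sketch of the failing size check (the fragment contents are hypothetical stand-ins; the arithmetic mirrors the traceback quoted further down the thread):

import functools
import operator

# Hypothetical: two single-variable fragments land under one concat-only
# group key like (('time', 0),) because the merge dim is missing from it.
fragments = ["ds_foo_chunk0", "ds_bar_chunk0"]  # stand-ins for datasets

shape = [1]  # hypercube shape implied by the concat dims alone
total_size = functools.reduce(operator.mul, shape)  # == 1
assert len(fragments) != total_size  # -> the ValueError reported in #517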

There is the beginning of a test named test_split_fragment_merge_dim in test_rechunking.py.

cc @cisaacstern

@cisaacstern
Member

cisaacstern commented May 25, 2023

@norlandrhagen thanks for getting this started.

Would it be okay with you if I take a shot at finishing this PR?

@norlandrhagen
Contributor Author

@cisaacstern that would be great!

@cisaacstern
Member

As of the last commit, the test added here now replicates the bug reported in #517:

pytest -vx tests/test_rechunking.py -k test_split_and_combine_fragments_with_merge_dim
        ...
        total_size = functools.reduce(operator.mul, shape)
        if len(fragments) != total_size:
            # this error path is currently untested
>           raise ValueError(
                "Cannot combine fragments. "
                f"Expected a hypercube of shape {shape} but got {len(fragments)} fragments."
            )
E           ValueError: Cannot combine fragments. Expected a hypercube of shape [1] but got 2 fragments.

Thanks again to @norlandrhagen for kicking this off, and @rabernat for pairing on this yesterday. 🙏

Now that the test is in place, I'll work on fixing the bug!

@cisaacstern
Member

Summary to date:

  • I believe the simplest way to make combine_fragments capable of merge (as well as concat) is to keep the existing concat logic unchanged, and to simply first merge any fragments which require merging, then pass those merged_fragments downstream to the existing concat logic (a sketch of this approach follows the list). The work here in rechunking.py reflects this approach.
  • I've parametrized the xarray -> zarr end-to-end test with the multivariable file patterns. Before the current changes to rechunking.py were added, this raised the same error as reported in [beam-refactor] StoreToZarr - Cannot Combine Fragments Possible Error #517. Following addition of these changes, all of the multivariable end-to-end tests pass. 🎉
  • All but one of the parametrizations of the unit test also pass. The failing case in the unit tests is for (nt=10, resample="2D", time_chunks=2). It fails on the check of equality between sizes and expected_sizes... I suspect this may have to do with the way I am mocking the IndexedPositions emitted by the IndexItems transform in the unit test, because AFAICT this combination of parameters is also covered (and passes) in the end-to-end testing. Going to keep digging a bit on this.
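A minimal sketch of the merge-first idea from the first bullet, over simplified hypothetical inputs (the real implementation in rechunking.py operates on richer index objects):

import itertools
import xarray as xr

def merge_groups(fragments):
    """fragments: iterable of (concat_key, xr.Dataset) pairs, where
    concat_key is e.g. ('time', 3) and each dataset holds one variable."""
    keyed = sorted(fragments, key=lambda item: item[0])
    # Merge all single-variable datasets sharing a concat position; the
    # merged fragments then flow into the unchanged concat logic.
    return [
        (key, xr.merge([ds for _, ds in group]))
        for key, group in itertools.groupby(keyed, key=lambda item: item[0])
    ]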

@cisaacstern
Member

I suspect this may have to do with the way I am mocking the IndexedPositions emitted by the IndexItems transform in the unit test

Correction... the IndexedPositions look correct, but the subfragments generated by split_fragment in the unit test are a bit suspicious, being of length 14, with group keys:

[(('time', 0),), (('time', 0),), (('time', 0),), (('time', 0),), (('time', 1),), (('time', 1),), (('time', 1),), (('time', 1),), (('time', 1),), (('time', 1),), (('time', 2),), (('time', 2),), (('time', 2),), (('time', 2),)]

Note that (('time', 1),) appears 6 times... which is confusing to me. Getting closer to the issue here I think... 🤔

@norlandrhagen
Contributor Author

Super exciting progress @cisaacstern !

@cisaacstern
Member

Thanks to @rabernat for a thoughtful offline critique of this PR. In brief, Ryan pointed out that we parallelize writes horizontally across merge dimensions; combining merge dimensions in the combine_fragments step, as reflected in recent work here, would therefore result in loss of parallelism and/or OOM errors (the latter in the case of recipes with large merge dimensions). The better solution, based on Ryan's suggestion, is to ensure that split_fragments splits fragments which share a concat dim key, but have distinct merge indexes, into separate groups. I am reworking this PR to achieve that goal now.
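A hedged sketch of what that regrouping implies for the emitted group keys (the helper name is hypothetical, and the attribute/import details are my reading of the codebase, not a verbatim excerpt): each key carries the concat-dim chunk index plus the position of every merge dim, so fragments of different variables can no longer collide.

from pangeo_forge_recipes.patterns import CombineOp

def group_key_for(index, time_chunk_index):
    """Hypothetical helper: build a group key from a fragment's Index."""
    merge_positions = tuple(
        (dim.name, index[dim].value)
        for dim in index
        if dim.operation == CombineOp.MERGE
    )
    # e.g. (('time', 3), ('variable', 1))
    return (("time", time_chunk_index),) + merge_positions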

@cisaacstern
Member

cisaacstern commented Jun 27, 2023

I believe this is now quite close to the intended behavior. The relevant integration tests appear to all be passing, and most of the unit tests are as well. Two unit test cases are still failing; I just need to figure out whether that's a problem with some assumption in the unit test itself, or whether I'm actually catching a meaningful corner case.

@rabernat
Contributor

Great progress Charles! Let me know if I can be helpful here.

@cisaacstern
Member

Thanks so much to @jbusecke for pairing on this, which helped me understand the specific failure mode of the failing test cases (which do appear to be a meaningful bug, and not a testing mistake...):

  • All failing cases involve the parametrization nt_dayparam=(10, "2D"), wherein an initial aggregate dataset of 10 daily time steps and 2 variables is split into a collection of 2-day, single-variable datasets, with a total collection size of

    (10 time steps / 2 steps per input) * (2 variables) = 10 datasets
    
  • Each of the 10 datasets is of length 2 in the time dimension. Therefore, in the case of the target_chunks={"time": 1} parametrization, we would expect split_fragments to divide these 10 datasets into a total of 20 subfragments, each of length 1 in the time dimension. Indeed these subfragment datasets are emitted as expected, and they are grouped correctly according to variable, but their time-dimension group keys are incorrect. As shown in the table below, the time keys ('time', 1), ('time', 2), ('time', 3), and ('time', 4) each have two time steps grouped under them. IIUC, this is incorrect, as there should only be one time step per time key for the chunking scheme target_chunks={"time": 1}. This would seem to be a bug in the splitting logic, which I will now try to track down.

    Code for generating table
    # this code was run from within a debugger console, opened from a breakpoint set
    # within `test_rechunking::test_split_and_combine_fragments_with_merge_dim`, which
    # was run with the `1-nt_dayparam1` parametrization (the same as described above)
    print("nfragment | len(ds.time) | date | vars | groupkey")
    print("--------- | ------------ | ---- | ---- | --------")
    for i, sf in enumerate(subfragments):
        ds = sf.content[1]
        print(
            i, "|",
            len(ds.time), "|", str(ds.time[0].values)[:10], "|",
            [k for k in ds.data_vars.keys()], "|",
            f"`{sf.groupkey}`",
        )
    nfragment | len(ds.time) | date | vars | groupkey
    --------- | ------------ | ---- | ---- | --------
    0 | 1 | 2010-01-01 | ['bar'] | `(('time', 0), ('variable', 0))`
    1 | 1 | 2010-01-02 | ['bar'] | `(('time', 1), ('variable', 0))`
    2 | 1 | 2010-01-03 | ['bar'] | `(('time', 1), ('variable', 0))`
    3 | 1 | 2010-01-04 | ['bar'] | `(('time', 2), ('variable', 0))`
    4 | 1 | 2010-01-05 | ['bar'] | `(('time', 2), ('variable', 0))`
    5 | 1 | 2010-01-06 | ['bar'] | `(('time', 3), ('variable', 0))`
    6 | 1 | 2010-01-07 | ['bar'] | `(('time', 3), ('variable', 0))`
    7 | 1 | 2010-01-08 | ['bar'] | `(('time', 4), ('variable', 0))`
    8 | 1 | 2010-01-09 | ['bar'] | `(('time', 4), ('variable', 0))`
    9 | 1 | 2010-01-10 | ['bar'] | `(('time', 5), ('variable', 0))`
    10 | 1 | 2010-01-01 | ['foo'] | `(('time', 0), ('variable', 1))`
    11 | 1 | 2010-01-02 | ['foo'] | `(('time', 1), ('variable', 1))`
    12 | 1 | 2010-01-03 | ['foo'] | `(('time', 1), ('variable', 1))`
    13 | 1 | 2010-01-04 | ['foo'] | `(('time', 2), ('variable', 1))`
    14 | 1 | 2010-01-05 | ['foo'] | `(('time', 2), ('variable', 1))`
    15 | 1 | 2010-01-06 | ['foo'] | `(('time', 3), ('variable', 1))`
    16 | 1 | 2010-01-07 | ['foo'] | `(('time', 3), ('variable', 1))`
    17 | 1 | 2010-01-08 | ['foo'] | `(('time', 4), ('variable', 1))`
    18 | 1 | 2010-01-09 | ['foo'] | `(('time', 4), ('variable', 1))`
    19 | 1 | 2010-01-10 | ['foo'] | `(('time', 5), ('variable', 1))`
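    As a worked check, assuming the time group key is the subfragment's array offset divided by the target chunk size, the expected keys for this parametrization would be:

    time_chunks = 1
    n_time_steps = 10
    expected_time_keys = [offset // time_chunks for offset in range(n_time_steps)]
    # [0, 1, ..., 9]: one time step per ('time', k) key per variable,
    # not the duplicated keys 1-4 shown in the table above.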

@cisaacstern
Member

@rabernat, the test added in f0d3159 reproduces the behavior documented in #521 (comment) above. To your eye, is this indeed a bug, or have I misunderstood what group keys to expect in this situation?

@rabernat
Contributor

Trying to parse this example scenario, which is indeed pretty complicated. But the crux of it seems to be

  • IIUC, this is incorrect, as there should only be one time step per time key,

Based on my reading, this conclusion seems correct.

@rabernat
Contributor

The bug in question seems purely related to the rechunking logic along the concat dim. So it's suspicious that it only arises once you bring in a merge dim. That seems like an important clue.

Comment on lines 91 to 92
offset = 1
index = Index([(dimension, IndexedPosition(offset, dimsize=nt_total))])
rabernat
Contributor

This doesn't quite seem right to me. If each dataset has two days in it (nt=2), the only possible offsets are even (0, 2, 4, 6, and 8). So it feels like we are creating inconsistent test data here.
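The same point as quick arithmetic (a tiny illustrative check, with nt=2 as in the parametrization under discussion):

nt = 2  # time steps per input dataset
valid_offsets = [i * nt for i in range(5)]  # [0, 2, 4, 6, 8]
assert 1 not in valid_offsets  # offset = 1 cannot occur when nt = 2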

Comment on lines 29 to 31
# replicates indexes created by IndexItems transform.
unique_times = np.unique([ds.time[0].values for ds in dsets])
time_positions = {t: i for i, t in enumerate(unique_times)}
rabernat
Contributor

This logic seems wrong. The index in IndexedPosition needs to be an actual offset from the beginning of the array.
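A hedged sketch of the correction (not necessarily the PR's exact fix), reusing dsets from the test above and assuming every input has a constant length nt along time: scale the ordinal position by nt so the mocked value is a true array offset.

import numpy as np

nt = 2  # time steps per input dataset, per the test parametrization
unique_times = np.unique([ds.time[0].values for ds in dsets])
time_positions = {t: i * nt for i, t in enumerate(unique_times)}
# each value is now an offset from the start of the full time axis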

@cisaacstern
Member

@rabernat thanks so much for the review, your two inline comments were a serious a-ha moment for me. I thought I must've been misunderstanding something, and it turns out I was.

To recap (for my future self and any others reading this), there are actually 3 indexing spaces in play here:

  • FilePattern index space: the arrangement of input files as defined by file pattern CombineOps.
  • Array index space: dataset-level indexing.
  • Chunk index space: zarr chunk indexing.

I had misunderstood IndexedPosition.value to be referencing the FilePattern index space, when it in fact references the array index space. Fixing this misunderstanding, per your comment, in the "possible bug test" b813097 allows the test to pass (and reveals that this is not, in fact, a bug).
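A small illustration of the distinction, reusing the names from the test snippet quoted earlier (dimension is the concat Dimension for "time"; the values are hypothetical, for five 2-step inputs totalling 10 steps):

sequence_position = 2                 # FilePattern index space: the 3rd input
array_offset = sequence_position * 2  # array index space: offset 4
index = Index([(dimension, IndexedPosition(array_offset, dimsize=10))])
# chunk index space then follows as array_offset // target_chunk_size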

I'll now remove the possible bug test, and fix this issue in the unit test. 🚀

@rabernat
Contributor

I had misunderstood the IndexedPosition.value to be referencing an FilePattern index space, when it is in fact referencing array index space.

I am not super happy about how these types look, but I do believe this is documented.

@dataclass(frozen=True, order=True)
class Position:
    """
    :param indexed: If True, this position represents an offset within a dataset.
        If False, it is a position within a sequence.
    """

@cisaacstern cisaacstern changed the title from "Cannot Combine Fragments Error - Issue 517 - Testing" to "Include merge dim positions in group keys emitted by split_fragments" on Jun 29, 2023
@cisaacstern
Member

@rabernat AFAICT this is good to go. Would love your final review, in case there's anything I've overlooked.

Two other gut checks are in progress.

IMHO the testing here is pretty robust, so I don't think we should wait on either of these items to merge, but whenever they both complete, they'll be great further verification of these changes.

@rabernat rabernat left a comment

Nice work!

@rabernat rabernat merged commit e8e6609 into pangeo-forge:beam-refactor Jun 29, 2023
@cisaacstern
Member

🥳

@norlandrhagen
Contributor Author

Incredible @cisaacstern! Super exciting to have this PR merged in.

@jbusecke
Contributor

Awesome @cisaacstern!

@alxmrs
Contributor

alxmrs commented Jun 30, 2023

So great to see the release within reach!!
