Dataset remix #5372

dirkgr · 2021-08-21T01:24:15Z

This depends on #5344 being merged first.

It adds a step that can remix an existing dataset in a flexible way. You give it a Dict[str, str] like this:

{
    "train": "train + dev[:10000]",
    "validation": "dev[10000:]"
}

That means, from the original dataset, take the whole "train" split, combine it with the first 10000 instances of the "dev" split, and call that the new "train" split. Also, take everything after the first 10000 instances of the "dev" split, and call that the "validation" split.
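To make the semantics concrete, here is a toy illustration of what that mapping is meant to produce. It is only a sketch: the dataset is a plain dict of lists, the sizes are scaled down (a cut at 2 instead of 10000), and the slicing is done by hand rather than by the step this PR adds.

```python
# Toy stand-in for the original dataset (assumption: a dict of lists).
original = {
    "train": [f"train-{i}" for i in range(5)],
    "dev": [f"dev-{i}" for i in range(4)],
}

# Hand-written equivalent of:
#   {"train": "train + dev[:2]", "validation": "dev[2:]"}
remixed = {
    "train": original["train"] + original["dev"][:2],
    "validation": original["dev"][2:],
}

print(remixed["train"])
# ['train-0', 'train-1', 'train-2', 'train-3', 'train-4', 'dev-0', 'dev-1']
print(remixed["validation"])
# ['dev-2', 'dev-3']
```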

Also, there are some urgent Tango fixes in here.

AkshitaB · 2021-08-23T22:54:59Z

allennlp/tango/dataset.py

+        self, input: DatasetDict, new_splits: Dict[str, str], keep_old_splits: bool = True
+    ) -> DatasetDict:
+        def get_slice(split_name: str) -> Sequence[Any]:
+            slice_match = re.match(r"(.*)\[([0123456789:]*)]", split_name)


This won't work for something like train[:50000] + dev[:10000]. Is it supposed to?

dirkgr · Aug 23, 2021

Yes, it will work. This function is only called on the parts after .split("+").

AkshitaB · Aug 23, 2021

Ok I see, and train[:50000] in that case is interpreted as it should be?

dirkgr · Aug 24, 2021

Yes. It supports full Python slice syntax on line 98. In condensed form, it does slice(*match.split(":")).
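The flow described in this thread can be sketched roughly as below. This is not the code from allennlp/tango/dataset.py. Only the regular expression is copied from the diff. The `remix_split` helper, the two-argument `get_slice` signature, the dict-of-lists dataset, and the conversion of the slice parts to `int` are assumptions made for illustration.

```python
import re
from typing import Any, Dict, List, Optional, Sequence

# Same pattern as in the diff above: a split name followed by a bracketed slice.
SLICE_PATTERN = re.compile(r"(.*)\[([0123456789:]*)]")


def get_slice(dataset: Dict[str, Sequence[Any]], part: str) -> Sequence[Any]:
    """Resolve one part, such as "dev[:10000]" or "train" (hypothetical helper)."""
    part = part.strip()
    match = SLICE_PATTERN.match(part)
    if match is None:
        # No brackets: take the whole split.
        return dataset[part]
    split_name, slice_text = match.group(1), match.group(2)
    # An empty piece between colons means "no bound", as in ordinary slicing.
    bounds: List[Optional[int]] = [
        int(piece) if piece else None for piece in slice_text.split(":")
    ]
    return dataset[split_name][slice(*bounds)]


def remix_split(dataset: Dict[str, Sequence[Any]], spec: str) -> List[Any]:
    """Concatenate every "+"-separated part of a spec (hypothetical helper)."""
    result: List[Any] = []
    for part in spec.split("+"):
        result.extend(get_slice(dataset, part))
    return result


toy = {"train": list(range(10)), "dev": list(range(100, 110))}
print(remix_split(toy, "train[:3] + dev[:2]"))  # [0, 1, 2, 100, 101]
```

Because each part is resolved only after the split on "+", the regular expression never sees more than one pair of brackets. One consequence of building the slice this way, at least in this sketch: a single number such as dev[3] becomes slice(3), which selects the first three items rather than the item at index 3.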

dirkgr added 26 commits August 6, 2021 17:17

Adds a dataset that can be read and written lazily

10c5479

This does not work yet. I'm still working on supporting classes.

This approach might work better.

36f9b67

Make ShuffledSequence take indices

e74540d

Formatting

0eb53bf

Adds failing test

dcedfd5

Merge remote-tracking branch 'origin/main' into TangoBigData

36948ce

Fix sparse sequence tests

44eccf9

Fixes the Sqlite format

f305de7

Quality-of-life hack

61f8810

Makes an internal string less alarming

989f15c

Save the files to the right place

9c461b7

Merge remote-tracking branch 'origin/main' into TangoBigData

15e0be4

Formatting

ca26abe

Merge remote-tracking branch 'origin/main' into TangoBigData

f2f0a34

Fix for SqliteDatasetFormat

bb572b3

Performance improvement for SqliteSparseSequence

6953d7d

Changelog

3f99be7

Merge branch 'main' into TangoBigData

d69ea38

Global imports

d58a52f

More Sequence classes

104777d

Say DatasetDict when we mean DatasetDict

b6b5f05

Test for the sequences

05c4dd6

Use the step name correctly in the error message

4304a93

Use and consume step_name correctly in Step.from_params()

d6cb8ab

Uncacheable steps don't get cached even if they have a name

fd305a6

Adds a step that can remix a dataset

3ae61eb

dirkgr self-assigned this Aug 21, 2021

dirkgr added 3 commits August 20, 2021 18:29

Improve log message

2004fd2

Fix relative import

b0c3626

Changelog

fcf651f

dirkgr added 3 commits August 23, 2021 13:22

Merge branch 'main' into DatasetRemix

aa82e3d

Adds documentation

ca5cad3

Merge branch 'DatasetRemix' of https://github.com/allenai/allennlp into DatasetRemix

d5f11f4

dirkgr marked this pull request as ready for review August 23, 2021 20:51

dirkgr requested a review from AkshitaB August 23, 2021 20:51

dirkgr mentioned this pull request Aug 23, 2021

IMDB Model allenai/allennlp-models#297

Merged

AkshitaB reviewed Aug 23, 2021

dirkgr added 4 commits August 23, 2021 16:36

Give the option of changing a det_hash simply()

c52b050

Tix fypo

a32c7f2

Adds ability to shuffle datasets

6cccd64

Test for det_hash

765575d

AkshitaB approved these changes Aug 24, 2021

Merge branch 'main' into DatasetRemix

c69df7e

dirkgr enabled auto-merge (squash) August 24, 2021 22:23

dirkgr added 2 commits August 24, 2021 19:06

We don't use relative imports

451e4ee

Merge branch 'DatasetRemix' of https://github.com/allenai/allennlp into DatasetRemix

1d71b69

dirkgr merged commit 27da04c into main Aug 25, 2021

dirkgr deleted the DatasetRemix branch August 25, 2021 02:20
