This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Dataset remix #5372

Merged
merged 39 commits into from
Aug 25, 2021

dirkgr (Member) commented Aug 21, 2021

This depends on #5344 being merged first.

It adds a step that can remix an existing dataset in a flexible way. You give it a Dict[str, str] like this:

{
    "train": "train + dev[:10000]",
    "validation": "dev[10000:]"
}

That means, from the original dataset, take the whole "train" split, combine it with the first 10000 instances of the "dev" split, and call that the new "train" split. Also, take everything after the first 10000 instances of the "dev" split, and call that the "validation" split.
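To make the semantics concrete, here is a minimal sketch of what that spec is described as producing, written with plain Python lists rather than the actual DatasetDict type. The toy data and variable names are assumptions for illustration only; the step itself parses the strings instead of hard-coding the operations.

```python
# Toy stand-in for the original dataset: split name -> list of instances.
# Sizes are arbitrary; "dev" just needs more than 10000 instances.
original = {
    "train": [f"train-{i}" for i in range(5)],
    "dev": [f"dev-{i}" for i in range(10_003)],
}

# What the spec
#   {"train": "train + dev[:10000]", "validation": "dev[10000:]"}
# is described as meaning, spelled out by hand.
remixed = {
    # whole "train" split, followed by the first 10000 "dev" instances
    "train": original["train"] + original["dev"][:10000],
    # everything in "dev" after the first 10000 instances
    "validation": original["dev"][10000:],
}
```

Every "dev" instance ends up in exactly one of the two new splits, so nothing is shared between the new "train" and "validation".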

Also, there are some urgent Tango fixes in here.

dirkgr self-assigned this Aug 21, 2021
dirkgr marked this pull request as ready for review August 23, 2021 20:51
dirkgr requested a review from AkshitaB August 23, 2021 20:51

    self, input: DatasetDict, new_splits: Dict[str, str], keep_old_splits: bool = True
) -> DatasetDict:
    def get_slice(split_name: str) -> Sequence[Any]:
        slice_match = re.match(r"(.*)\[([0123456789:]*)]", split_name)

Contributor

This won't work for something like train[:50000] + dev[:10000]. Is it supposed to?

Member Author

Yes, it will work. This function is only called on the parts after .split("+").

Contributor

Ok I see, and train[:50000] in that case is interpreted as it should be?

Member Author

Yes. It supports full Python slice syntax on line 98. In condensed form, it does slice(*match.split(":")).
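For readers following the thread, the flow described above (split the spec on "+", then parse an optional slice on each part) can be sketched as below. This is an illustrative reconstruction, not the code from the PR: it works on a plain dict of lists instead of DatasetDict, the `remix` helper and its signature are hypothetical, and the int/None conversion of the slice fields is an assumption about what the non-condensed form does, since `slice` needs integers or None rather than strings.

```python
import re
from typing import Any, Dict, List


def get_slice(splits: Dict[str, List[Any]], part: str) -> List[Any]:
    """Resolve one part of a spec, e.g. "train" or "dev[:10000]"."""
    part = part.strip()
    # Same pattern as in the diff: a split name followed by [digits and colons].
    # The character class has no "-", so negative indices are not matched.
    slice_match = re.match(r"(.*)\[([0123456789:]*)]", part)
    if slice_match is None:
        # No bracket suffix: the part names a whole split.
        return splits[part]
    name = slice_match[1]
    # Assumed conversion: empty fields become None, the rest become ints,
    # so ":10000" -> slice(None, 10000) and "10000:" -> slice(10000, None).
    slice_args = [int(a) if a else None for a in slice_match[2].split(":")]
    return splits[name][slice(*slice_args)]


def remix(splits: Dict[str, List[Any]], new_splits: Dict[str, str]) -> Dict[str, List[Any]]:
    """Hypothetical helper: build each new split by concatenating its '+' parts."""
    result: Dict[str, List[Any]] = {}
    for new_name, spec in new_splits.items():
        combined: List[Any] = []
        for part in spec.split("+"):
            combined.extend(get_slice(splits, part))
        result[new_name] = combined
    return result


toy = {"train": list(range(5)), "dev": list(range(100, 110))}
remixed = remix(toy, {"train": "train[:3] + dev[:2]", "validation": "dev[2:]"})
```

Because `get_slice` only ever sees one "+"-separated part at a time, a spec such as `train[:3] + dev[:2]` is handled as two independent slices, which is the case raised in the review question.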

dirkgr enabled auto-merge (squash) August 24, 2021 22:23
dirkgr merged commit 27da04c into main Aug 25, 2021
dirkgr deleted the DatasetRemix branch August 25, 2021 02:20