This does not work yet. I'm still working on supporting classes.
    self, input: DatasetDict, new_splits: Dict[str, str], keep_old_splits: bool = True
) -> DatasetDict:
    def get_slice(split_name: str) -> Sequence[Any]:
        slice_match = re.match(r"(.*)\[([0123456789:]*)]", split_name)
This won't work for something like train[:50000] + dev[:10000]. Is it supposed to?
Yes, it will work. This function is only called on the parts after .split("+").
Ok I see, and train[:50000] in that case is interpreted as it should be?
Yes. It supports full Python slice syntax on line 98. In condensed form, it does slice(*match.split(":")).
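To make the slice handling concrete, here is a minimal sketch of how one part of a split expression could be parsed. It is not the code from this PR: `parse_split` is a hypothetical helper, and the conversion of empty fields to `None` is an assumption about what the condensed `slice(*match.split(":"))` form expands to.

```python
import re
from typing import Tuple


def parse_split(split_name: str) -> Tuple[str, slice]:
    """Parse one part such as 'dev[:10000]' into a split name and a slice.

    Hypothetical helper for illustration only.
    """
    split_name = split_name.strip()
    match = re.fullmatch(r"(.*)\[([0123456789:]*)]", split_name)
    if match is None:
        # No bracket suffix: select the whole split.
        return split_name, slice(None)
    name, spec = match.groups()
    if ":" not in spec:
        raise ValueError(f"expected a slice like '[:10]', got '[{spec}]'")
    # Empty fields become None, mirroring Python's own slice syntax.
    bounds = [int(field) if field else None for field in spec.split(":")]
    if len(bounds) > 3:
        raise ValueError(f"too many ':' in slice '[{spec}]'")
    return name, slice(*bounds)


name, selection = parse_split("train[:50000]")
```

Because this runs on each piece after splitting on "+", a combined expression like train[:50000] + dev[:10000] is never matched against the regex as a whole.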
This depends on #5344 being merged first.
It adds a step that can remix an existing dataset in a flexible way. You give it a Dict[str, str] that maps each new split name to an expression over the original splits. For example, one such mapping means: from the original dataset, take the whole "train" split, combine it with the first 10000 instances of the "dev" split, and call that the new "train" split. Also, take everything after the first 10000 instances of the "dev" split, and call that the "validation" split.
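A mapping matching that description would presumably look like the one below. This is a reconstruction from the prose, not the original example, and `remix` is a hypothetical stand-in that works on a plain dict of lists instead of a DatasetDict.

```python
import re
from typing import Any, Dict, List

# Reconstructed from the description above; the exact original may differ.
new_splits = {
    "train": "train+dev[:10000]",
    "validation": "dev[10000:]",
}


def remix(dataset: Dict[str, List[Any]], new_splits: Dict[str, str]) -> Dict[str, List[Any]]:
    """Build new splits from expressions over the old ones (illustrative sketch)."""
    result: Dict[str, List[Any]] = {}
    for new_name, expression in new_splits.items():
        instances: List[Any] = []
        # Each "+"-separated part is either a split name or a name plus a slice.
        for part in expression.split("+"):
            part = part.strip()
            match = re.fullmatch(r"(.*)\[([0-9:]*)]", part)
            if match is None:
                instances.extend(dataset[part])
            else:
                name, spec = match.groups()
                bounds = [int(b) if b else None for b in spec.split(":")]
                instances.extend(dataset[name][slice(*bounds)])
        result[new_name] = instances
    return result


# Toy data: 3 training instances and 10005 dev instances.
original = {"train": ["a", "b", "c"], "dev": list(range(10005))}
remixed = remix(original, new_splits)
```

With this toy data, the new "train" split holds the 3 original training instances followed by the first 10000 dev instances, and "validation" holds the remaining 5.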
Also, there are some urgent Tango fixes in here.