Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add summarization template #2529

Merged
merged 7 commits into from
Jun 23, 2021

Conversation

lewtun
Copy link
Member

@lewtun lewtun commented Jun 21, 2021

This PR adds a task template for text summarization. As far as I can tell, we do not need to distinguish between "extractive" or "abstractive" summarization - both can be handled with this template.

Usage:

from datasets import load_dataset
from datasets.tasks import Summarization

ds = load_dataset("xsum", split="train")
# Dataset({
#     features: ['document', 'summary', 'id'],
#     num_rows: 204045
# })

summarization = Summarization(text_column="document", summary_column="summary")
ds.prepare_for_task(summarization)
# Dataset({
#     features: ['text', 'summary'],
#     num_rows: 204045
# })

@lewtun lewtun requested review from lhoestq and SBrandeis June 21, 2021 16:24
"dummy": ["123456"],
}
# Test we can load from task name
with Dataset.from_dict(data, info=info) as dset:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that whatever magic @lhoestq did with windows now allows me to avoid using the annoying context manager that we have in many unit tests:

with tempfile.TemporaryDirectory() as tmp_dir, Dataset.from_dict(data) as dset:
    with self._to(in_memory, tmp_dir, dset) as dset:
        # do operations on `dset`

i never really understood why we had to do this so please correct me if it's actually needed 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The with statement are used to properly close open arrow files on windows (it really doesn't like keeping things open).
What you mention though is the call to self._to call that moves a dataset on disk or in memory, which can be useful in some tests. Here you don't need to do this indeed.

Since the in_memory parameter is not used in this test, you can move this test outside of the BaseDatasetTest class and have it as a regular pytest test case.
This way it won't be run twice (once for in_memory=True and once for in_memory=False)

Copy link
Contributor

@SBrandeis SBrandeis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice thanks !
Could you just move the test outside of the BaseDatasetTest class please ? Otherwise it will unnecessarily be run twice.

"dummy": ["123456"],
}
# Test we can load from task name
with Dataset.from_dict(data, info=info) as dset:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The with statement are used to properly close open arrow files on windows (it really doesn't like keeping things open).
What you mention though is the call to self._to call that moves a dataset on disk or in memory, which can be useful in some tests. Here you don't need to do this indeed.

Since the in_memory parameter is not used in this test, you can move this test outside of the BaseDatasetTest class and have it as a regular pytest test case.
This way it won't be run twice (once for in_memory=True and once for in_memory=False)

@lewtun
Copy link
Member Author

lewtun commented Jun 22, 2021

Nice thanks !
Could you just move the test outside of the BaseDatasetTest class please ? Otherwise it will unnecessarily be run twice.

sure, on it! thanks for the explanations about the self._to method :)

@lewtun
Copy link
Member Author

lewtun commented Jun 22, 2021

@lhoestq i've moved all the task template tests outside of BaseDatasetTest and collected them in their dedicated test case. (at some point i'll revisit this so we can just use pytest natively, but the PR is already getting out-of-scope :))

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks all good, thanks :)

@lhoestq lhoestq merged commit 5c62bbe into huggingface:master Jun 23, 2021
@lewtun lewtun deleted the add-summarisation-template branch June 23, 2021 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants