Add summarization template #2529

lewtun · 2021-06-21T16:08:31Z

This PR adds a task template for text summarization. As far as I can tell, we do not need to distinguish between "extractive" and "abstractive" summarization - both can be handled with this template.

Usage:

from datasets import load_dataset
from datasets.tasks import Summarization

ds = load_dataset("xsum", split="train")
# Dataset({
#     features: ['document', 'summary', 'id'],
#     num_rows: 204045
# })

summarization = Summarization(text_column="document", summary_column="summary")
ds.prepare_for_task(summarization)
# Dataset({
#     features: ['text', 'summary'],
#     num_rows: 204045
# })
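
As a rough illustration of what the usage snippet above does to the columns, here is a minimal pure-Python sketch. It assumes the template simply renames the two chosen columns to `text` and `summary` and drops everything else, as the before/after feature lists suggest. `SummarizationSketch` and `apply_template` are hypothetical names used for illustration only; they are not part of `datasets`.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SummarizationSketch:
    """Hypothetical stand-in for the task template (not the library class)."""

    text_column: str = "text"
    summary_column: str = "summary"

    @property
    def column_mapping(self) -> Dict[str, str]:
        # Map the dataset's own column names onto the standardized ones.
        return {self.text_column: "text", self.summary_column: "summary"}


def apply_template(
    columns: Dict[str, List[str]], template: SummarizationSketch
) -> Dict[str, List[str]]:
    """Keep only the mapped columns and rename them (assumed behaviour)."""
    missing = [name for name in template.column_mapping if name not in columns]
    if missing:
        raise ValueError(f"Columns {missing} not found in {sorted(columns)}")
    return {new: columns[old] for old, new in template.column_mapping.items()}


# Toy data shaped like the xsum features listed above.
toy = {"document": ["A long article."], "summary": ["Short."], "id": ["0"]}
prepared = apply_template(
    toy, SummarizationSketch(text_column="document", summary_column="summary")
)
print(list(prepared))  # ['text', 'summary']
```

The `id` column disappears because it has no entry in the mapping, which mirrors the feature list shown after `prepare_for_task`.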

lewtun · 2021-06-21T18:59:49Z

tests/test_arrow_dataset.py

+            "dummy": ["123456"],
+        }
+        # Test we can load from task name
+        with Dataset.from_dict(data, info=info) as dset:


it seems that whatever magic @lhoestq did with windows now allows me to avoid using the annoying context manager that we have in many unit tests:

with tempfile.TemporaryDirectory() as tmp_dir, Dataset.from_dict(data) as dset:
    with self._to(in_memory, tmp_dir, dset) as dset:
        # do operations on `dset`

i never really understood why we had to do this so please correct me if it's actually needed 😄

SBrandeis

LGTM 🚀

lhoestq

Nice thanks !
Could you just move the test outside of the BaseDatasetTest class please ? Otherwise it will unnecessarily be run twice.

lhoestq · 2021-06-22T12:32:59Z

tests/test_arrow_dataset.py

+            "dummy": ["123456"],
+        }
+        # Test we can load from task name
+        with Dataset.from_dict(data, info=info) as dset:


The with statements are used to properly close open Arrow files on Windows (it really doesn't like keeping things open).
What you mention, though, is the call to self._to, which moves a dataset to disk or into memory and can be useful in some tests. You indeed don't need it here.

Since the in_memory parameter is not used in this test, you can move this test outside of the BaseDatasetTest class and have it as a regular pytest test case.
This way it won't be run twice (once for in_memory=True and once for in_memory=False)
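
To make the two points above concrete, here is a minimal sketch. It assumes a toy `FakeDataset` class (hypothetical, standing in for `Dataset`) that holds an open file: the `with` block guarantees the handle is closed on exit, and the test is a plain module-level function that pytest would collect once, rather than a method on a base class parameterized over `in_memory`.

```python
import os
import tempfile


class FakeDataset:
    """Hypothetical stand-in for a dataset backed by an open file."""

    def __init__(self, path: str):
        self.path = path
        self._file = open(path, "rb")

    @property
    def closed(self) -> bool:
        return self._file.closed

    def __enter__(self) -> "FakeDataset":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Release the handle so the file can be deleted, which matters on Windows.
        self._file.close()


def test_dataset_is_closed_after_with_block():
    # Module-level test: no in_memory parameter, so it runs only once.
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "data.bin")
        with open(path, "wb") as f:
            f.write(b"123456")
        with FakeDataset(path) as dset:
            assert not dset.closed
        # The handle is closed before the temporary directory is cleaned up.
        assert dset.closed


if __name__ == "__main__":
    test_dataset_is_closed_after_with_block()
```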

lewtun · 2021-06-22T13:54:27Z

Nice thanks !
Could you just move the test outside of the BaseDatasetTest class please ? Otherwise it will unnecessarily be run twice.

sure, on it! thanks for the explanations about the self._to method :)

lewtun · 2021-06-22T15:29:56Z

@lhoestq i've moved all the task template tests outside of BaseDatasetTest and collected them in their dedicated test case. (at some point i'll revisit this so we can just use pytest natively, but the PR is already getting out-of-scope :))

lhoestq

Looks all good, thanks :)

lewtun added 3 commits June 21, 2021 17:50

Add summarisation template

5ef7635

Add unit test for task template

1218782

Add test for summarisation formatting

a2ad2c8

lewtun requested review from lhoestq and SBrandeis June 21, 2021 16:24

lewtun added 2 commits June 21, 2021 20:24

Remove redundant test

4f4d91e

Fix quality

12fb747

lewtun commented Jun 21, 2021

SBrandeis approved these changes Jun 22, 2021

lhoestq reviewed Jun 22, 2021

lhoestq mentioned this pull request Jun 22, 2021

Add task template for automatic speech recognition #2533

Merged

lewtun added 2 commits June 22, 2021 17:03

Move unit tests for task templates to dedicated test case

f438b4e

Move concatenation tests to task template test case

af36d22

lhoestq approved these changes Jun 23, 2021

lhoestq merged commit 5c62bbe into huggingface:master Jun 23, 2021

lewtun deleted the add-summarisation-template branch June 23, 2021 14:22
