Added glucose dataset (huggingface#1077)

* Added glucose * train/test difference adjustments * Update datasets/glucose/glucose.py Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
katnoria · Dec 4, 2020 · 552365e
1 parent e51242e
commit 552365e
Showing 4 changed files with 392 additions and 1 deletion.
diff --git a/datasets/glucose/README.md b/datasets/glucose/README.md
@@ -0,0 +1,230 @@
+---
+annotations_creators:
+- crowdsourced
+language_creators:
+- crowdsourced
+languages:
+- en
+licenses:
+- cc-by-4.0
+multilinguality:
+- monolingual
+size_categories:
+- 10K<n<100K
+source_datasets:
+- extended|other-ROC-stories
+task_categories:
+- sequence-modeling
+task_ids:
+- sequence-modeling-other-common-sense-inference
+---
+
+# Dataset Card for GLUCOSE
+
+## Table of Contents
+- [Dataset Description](#dataset-description)
+  - [Dataset Summary](#dataset-summary)
+  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
+  - [Languages](#languages)
+- [Dataset Structure](#dataset-structure)
+  - [Data Instances](#data-instances)
+  - [Data Fields](#data-fields)
+  - [Data Splits](#data-splits)
+- [Dataset Creation](#dataset-creation)
+  - [Curation Rationale](#curation-rationale)
+  - [Source Data](#source-data)
+  - [Annotations](#annotations)
+  - [Personal and Sensitive Information](#personal-and-sensitive-information)
+- [Considerations for Using the Data](#considerations-for-using-the-data)
+  - [Social Impact of Dataset](#social-impact-of-dataset)
+  - [Discussion of Biases](#discussion-of-biases)
+  - [Other Known Limitations](#other-known-limitations)
+- [Additional Information](#additional-information)
+  - [Dataset Curators](#dataset-curators)
+  - [Licensing Information](#licensing-information)
+  - [Citation Information](#citation-information)
+
+## Dataset Description
+
+- **[Repository](https://github.com/ElementalCognition/glucose)**
+- **[Paper](https://arxiv.org/abs/2009.07758)**
+- **Point of Contact: [glucose@elementalcognition.com](mailto:glucose@elementalcognition.com)**
+
+### Dataset Summary
+
+GLUCOSE (GeneraLized and COntextualized Story Explanations) is a novel conceptual framework and dataset for commonsense reasoning. Given a short story and a sentence X in the story, GLUCOSE captures ten dimensions of causal explanation related to X. These dimensions, inspired by human cognitive psychology, cover often-implicit causes and effects of X, including events, location, possession, and other attributes.
+
+### Supported Tasks and Leaderboards
+
+Common sense inference of:
+1. Causes
+2. Emotions motivating an event
+3. Locations enabling an event
+4. Possession states enabling an event
+5. Other attributes enabling an event
+6. Consequences
+7. Emotions caused by an event
+8. Changes in location caused by an event
+9. Changes in possession caused by an event
+10. Other attributes that may be changed by an event
+
+### Languages
+
+English, monolingual
+
+## Dataset Structure
+
+### Data Instances
+
+```
+{
+  "experiment_id": "e56c7c3e-4660-40fb-80d0-052d566d676a__4",
+  "story_id": "e56c7c3e-4660-40fb-80d0-052d566d676a",
+  "worker_id": 19,
+  "submission_time_normalized": "20190930",
+  "worker_quality_assessment": 3,
+  "selected_sentence_index": 4,
+  "story": "It was bedtime at our house. Two of the three kids hit the pillow and fall asleep. The third is a trouble maker. For two hours he continues to get out of bed and want to play. Finally he becomes tired and falls asleep.",
+  "selected_sentence": "Finally he becomes tired and falls asleep.",
+  "1_specificNL": "The third kid continues to  get out of bed and wants to play >Causes/Enables> The kid finally becomes tired and falls asleep",
+  "1_specificStructured": "{The third kid}_[subject] {continues}_[verb] {to }_[preposition1] {get out of bed}_[object1] {and wants to play}_[object2] >Causes/Enables> {The kid}_[subject] {finally becomes}_[verb] {tired}_[object1] {and falls asleep}_[object2]",
+  "1_generalNL": "Someone_A doesn't want to  go to sleep >Causes/Enables> Someone_A finally falls asleep",
+  "1_generalStructured": "{Someone_A}_[subject] {doesn't want}_[verb] {to }_[preposition1] {go to sleep}_[object1] >Causes/Enables> {Someone_A}_[subject] {finally falls}_[verb] {asleep}_[object1]",
+  "2_specificNL": "escaped",
+  "2_specificStructured": "escaped",
+  "2_generalNL": "escaped",
+  "2_generalStructured": "escaped",
+  "3_specificNL": "The third kid is in bed >Enables> The kid finally becomes tired and falls asleep",
+  "3_specificStructured": "{The third kid}_[subject] {is}_[verb] {in}_[preposition] {bed}_[object] >Enables> {The kid}_[subject] {finally becomes}_[verb] {tired}_[object1] {and falls asleep}_[object2]",
+  "3_generalNL": "Someone_A is in bed >Enables> Someone_A falls asleep",
+  "3_generalStructured": "{Someone_A}_[subject] {is}_[verb] {in}_[preposition] {bed}_[object] >Enables> {Someone_A}_[subject] {falls}_[verb] {asleep}_[object1]",
+  "4_specificNL": "escaped",
+  "4_specificStructured": "escaped",
+  "4_generalNL": "escaped",
+  "4_generalStructured": "escaped",
+  "5_specificNL": "escaped",
+  "5_specificStructured": "escaped",
+  "5_generalNL": "escaped",
+  "5_generalStructured": "escaped",
+  "6_specificNL": "escaped",
+  "6_specificStructured": "escaped",
+  "6_generalNL": "escaped",
+  "6_generalStructured": "escaped",
+  "7_specificNL": "escaped",
+  "7_specificStructured": "escaped",
+  "7_generalNL": "escaped",
+  "7_generalStructured": "escaped",
+  "8_specificNL": "escaped",
+  "8_specificStructured": "escaped",
+  "8_generalNL": "escaped",
+  "8_generalStructured": "escaped",
+  "9_specificNL": "escaped",
+  "9_specificStructured": "escaped",
+  "9_generalNL": "escaped",
+  "9_generalStructured": "escaped",
+  "10_specificNL": "escaped",
+  "10_specificStructured": "escaped",
+  "10_generalNL": "escaped",
+  "10_generalStructured": "escaped",
+  "number_filled_in": 7
+}
+```
+
+### Data Fields
+
+- __experiment_id__: a randomly generated alphanumeric sequence for a given story with the sentence index appended at the end after two underscores. Example: cbee2b5a-f2f9-4bca-9630-6825b1e36c13__0
+
+- __story_id__: a random alphanumeric identifier for the story. Example: e56c7c3e-4660-40fb-80d0-052d566d676a
+
+- __worker_id__: each worker has a unique identification number. Example: 21
+
+- __submission_time_normalized__: the time of submission in the format YYYYMMDD. Example: 20200115
+
+- __worker_quality_assessment__: rating for the worker on the assignment in the row. Example: 2
+
+- __selected_sentence_index__: the index of a given sentence in a story. Example: 0
+
+- __story__: contains the full text of the ROC story that was used for the HIT. Example: It was bedtime at our house. Two of the three kids hit the pillow and fall asleep. The third is a trouble maker. For two hours he continues to get out of bed and want to play. Finally he becomes tired and falls asleep.
+
+- __selected_sentence__: the sentence from the story that is being annotated. Example: It was bedtime at our house.
+
+- __[1-10]\_[specific/general][NL/Structured]__: the primary data collected. These fields hold the commonsense knowledge about each story, together with the general rules about the world derived from the story-specific statements. Each of the ten dimensions has four columns: the specific columns give the statement grounded in the story, and the general columns give the corresponding generalization. The NL columns are in natural language, whereas the Structured columns mark the slots used to fill in the statement. Example:
+  - __1_specificNL__: "The school has a football team >Causes/Enables> The football game was last weekend" 
+  - __1_specificStructured__: "{The school }\_[subject] {has }\_[verb] {a football team }\_[object1] >Causes/Enables> {The football game }\_[subject] {was last weekend }\_[verb]"
+  - __1_generalNL__: "Somewhere_A (that is a school ) has Something_A (that is a sports team ) >Causes/Enables> The game was last weekend" 
+  - __1_generalStructured__: "{Somewhere_A ||that is a school ||}\_[subject] {has }\_[verb] {Something_A ||that is a sports team ||}\_[object1] >Causes/Enables> {The game }\_[subject] {was last weekend }\_[verb]" 
+
+- __number\_filled\_in__: number of dimensions filled in for the assignment. Example: 4
+
+
+### Data Splits
+
+- Train split: 65,521 examples
+- Test split: 500 examples, without worker ID and rating, number filled in, or structured text
+
+## Dataset Creation
+
+### Curation Rationale
+
+When humans read or listen, they make implicit commonsense inferences that frame their understanding of what happened and why. As a step toward AI systems that can build similar mental models, we introduce GLUCOSE, a large-scale dataset of implicit commonsense causal knowledge, encoded as causal mini-theories about the world, each grounded in a narrative context.
+
+### Source Data
+
+#### Initial Data Collection and Normalization
+
+The initial text comes from ROCStories.
+
+#### Who are the source language producers?
+
+Amazon Mechanical Turk.
+
+### Annotations
+
+#### Annotation process
+
+To enable developing models that can build mental models of narratives, we aimed to crowdsource a large, quality-monitored dataset. Beyond the scalability benefits, using crowd workers (as opposed to a small set of expert annotators) ensures diversity of thought, thus broadening coverage of a common-sense knowledge resource. The annotation task is complex: it requires annotators to understand different causal dimensions in a variety of contexts and to come up with generalized theories beyond the story context. For strict quality control, we designed a three-stage knowledge acquisition pipeline for crowdsourcing the GLUCOSE dataset on the Amazon Mechanical Turk platform. The workers first go through a qualification test where they must score at least 90% on 10 multiple-choice questions on select GLUCOSE dimensions. Next, qualified workers can work on the main GLUCOSE data collection task: given a story S and a story sentence X, they are asked to fill in (allowing for non-applicable) all ten GLUCOSE dimensions, getting step-by-step guidance from the GLUCOSE data acquisition UI. To ensure data consistency, the same workers answer all dimensions for an S, X pair. Finally, the submissions are reviewed by an expert who rates each worker on a scale from 0 to 3, and provides feedback on how to improve. Our final UIs are the result of more than six rounds of pilot studies, iteratively improving the interaction elements, functionality, dimension definitions, instructions, and examples.
+
+#### Who are the annotators?
+
+Amazon Mechanical Turk workers, with feedback from an expert.
+
+### Personal and Sensitive Information
+
+No personal or sensitive information.
+
+## Considerations for Using the Data
+
+### Social Impact of Dataset
+
+[More Information Needed]
+
+### Discussion of Biases
+
+[More Information Needed]
+
+### Other Known Limitations
+
+[More Information Needed]
+
+## Additional Information
+
+### Dataset Curators
+
+Nasrin Mostafazadeh, Aditya Kalyanpur, Lori Moon, David Buchanan, Lauren Berkowitz, Or Biran, Jennifer Chu-Carroll, from Elemental Cognition
+
+### Licensing Information
+
+Creative Commons Attribution-NonCommercial 4.0 International Public License
+
+### Citation Information
+
+```
+@inproceedings{mostafazadeh2020glucose,
+      title={GLUCOSE: GeneraLized and COntextualized Story Explanations}, 
+      author={Nasrin Mostafazadeh and Aditya Kalyanpur and Lori Moon and David Buchanan and Lauren Berkowitz and Or Biran and Jennifer Chu-Carroll},
+      year={2020},
+      booktitle={The Conference on Empirical Methods in Natural Language Processing},
+      publisher={Association for Computational Linguistics}
+}
+```
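Note on the README above: the `[1-10]_[specific/general][NL/Structured]` fields follow a visible pattern in the sample instance, `A >Connective> B` for NL fields and `{text}_[slot]` markup for Structured fields, with `escaped` appearing where a dimension has no rule. The following is a minimal sketch for reading those strings, not part of this commit. The helper names `split_rule` and `strip_slots` are hypothetical, and the regular expressions are inferred only from the examples shown in the README, so other rows may need adjustments (for instance, the `||...||` qualifiers in general rules are left untouched).

```python
import re

# Both patterns are assumptions inferred from the README sample instance.
RULE_RE = re.compile(r"^(?P<cause>.*?)\s*>(?P<connective>[^<>]+)>\s*(?P<effect>.*)$")
SLOT_RE = re.compile(r"\{([^{}]*)\}_\[[^\]]*\]")


def split_rule(rule):
    """Split 'A >Connective> B' into (A, connective, B).

    Returns None when the value has no connective, e.g. the 'escaped'
    placeholder shown in the README sample.
    """
    match = RULE_RE.match(rule)
    if match is None:
        return None
    return (
        match.group("cause").strip(),
        match.group("connective"),
        match.group("effect").strip(),
    )


def strip_slots(structured):
    """Remove the {text}_[slot] markup from a Structured field, keeping the text."""
    text = SLOT_RE.sub(lambda m: m.group(1), structured)
    # Collapse the doubled spaces left behind by slots such as '{to }'.
    return " ".join(text.split())


nl = "Someone_A is in bed >Enables> Someone_A falls asleep"
structured = (
    "{Someone_A}_[subject] {is}_[verb] {in}_[preposition] {bed}_[object] "
    ">Enables> {Someone_A}_[subject] {falls}_[verb] {asleep}_[object1]"
)

print(split_rule(nl))
print(split_rule("escaped"))
print(strip_slots(structured) == nl)
```

On the sample instance, stripping the slot markup from `3_generalStructured` gives back the `3_generalNL` string, which is a quick consistency check between the two column types.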
diff --git a/datasets/glucose/dummy/glucose/0.0.0/dummy_data.zip b/datasets/glucose/dummy/glucose/0.0.0/dummy_data.zip
diff --git a/datasets/glucose/glucose.py b/datasets/glucose/glucose.py
@@ -0,0 +1,161 @@
+# coding=utf-8
+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""GLUCOSE (GeneraLized and COntextualized Story Explanations) is a novel conceptual framework and dataset for commonsense reasoning. Given a short story and a sentence X in the story, GLUCOSE captures ten dimensions of causal explanation related to X. These dimensions, inspired by human cognitive psychology, cover often-implicit causes and effects of X, including events, location, possession, and other attributes."""
+
+from __future__ import absolute_import, division, print_function
+
+import csv
+import os
+
+import datasets
+
+
+# BibTeX citation for the GLUCOSE paper
+_CITATION = """\
+@inproceedings{mostafazadeh2020glucose,
+      title={GLUCOSE: GeneraLized and COntextualized Story Explanations},
+      author={Nasrin Mostafazadeh and Aditya Kalyanpur and Lori Moon and David Buchanan and Lauren Berkowitz and Or Biran and Jennifer Chu-Carroll},
+      year={2020},
+      booktitle={The Conference on Empirical Methods in Natural Language Processing},
+      publisher={Association for Computational Linguistics}
+}
+"""
+
+# Dataset description, taken from the GLUCOSE authors
+_DESCRIPTION = """\
+When humans read or listen, they make implicit commonsense inferences that frame their understanding of what happened and why. As a step toward AI systems that can build similar mental models, we introduce GLUCOSE, a large-scale dataset of implicit commonsense causal knowledge, encoded as causal mini-theories about the world, each grounded in a narrative context.
+"""
+
+_HOMEPAGE = "https://github.com/ElementalCognition/glucose"
+
+_LICENSE = "Creative Commons Attribution-NonCommercial 4.0 International Public License"
+
+_URLs = {
+    "glucose": {
+        "test": "https://raw.githubusercontent.com/ElementalCognition/glucose/master/test/test_set_no_answers.csv",
+        "train": "https://github.com/TevenLeScao/glucose/blob/master/GLUCOSE_training_data.zip?raw=true",
+    }
+}
+
+
+class Glucose(datasets.GeneratorBasedBuilder):
+    """GLUCOSE (GeneraLized and COntextualized Story Explanations): a conceptual framework and dataset for commonsense reasoning."""
+
+    VERSION = datasets.Version("1.1.0")
+    BUILDER_CONFIGS = [
+        datasets.BuilderConfig(name="glucose", description="Main dataset"),
+    ]
+
+    def _info(self):
+        feature_dict = {
+            "experiment_id": datasets.Value("string"),
+            "story_id": datasets.Value("string"),
+            # The train set contains only one ID in numeric form
+            "worker_id": datasets.Value("int64"),
+            # The test set contains several IDs in string form
+            "worker_ids": datasets.Value("string"),
+            "submission_time_normalized": datasets.Value("string"),
+            "worker_quality_assessment": datasets.Value("int64"),
+            "selected_sentence_index": datasets.Value("int64"),
+            "story": datasets.Value("string"),
+            "selected_sentence": datasets.Value("string"),
+            "number_filled_in": datasets.Value("int64"),
+        }
+        for i in range(1, 11):
+            feature_dict[f"{i}_specificNL"] = datasets.Value("string")
+            feature_dict[f"{i}_specificStructured"] = datasets.Value("string")
+            feature_dict[f"{i}_generalNL"] = datasets.Value("string")
+            feature_dict[f"{i}_generalStructured"] = datasets.Value("string")
+        features = datasets.Features(feature_dict)
+        return datasets.DatasetInfo(
+            description=_DESCRIPTION,
+            features=features,
+            supervised_keys=None,
+            homepage=_HOMEPAGE,
+            license=_LICENSE,
+            citation=_CITATION,
+        )
+
+    def _split_generators(self, dl_manager):
+        """Returns SplitGenerators."""
+        train_url = _URLs[self.config.name]["train"]
+        test_url = _URLs[self.config.name]["test"]
+        train_data = dl_manager.download_and_extract(train_url)
+        test_data = dl_manager.download_and_extract(test_url)
+        return [
+            datasets.SplitGenerator(
+                name=datasets.Split.TRAIN,
+                gen_kwargs={
+                    "filepath": os.path.join(train_data, "GLUCOSE_training_data_final.csv"),
+                    "split": "train",
+                },
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.TEST,
+                gen_kwargs={"filepath": test_data, "split": "test"},
+            ),
+        ]
+
+    def _generate_examples(self, filepath, split):
+        with open(filepath, encoding="utf8") as f:
+            data = csv.reader(f)
+            next(data)  # Skip the first row (the CSV header).
+            for id_, row in enumerate(data):
+                if split == "train":
+                    yield id_, train_dict_from_row(row)
+                else:
+                    yield id_, test_dict_from_row(row)
+
+
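+def train_dict_from_row(row):
+    """Map one row of the training CSV to an example dict.
+
+    Columns 0-7 hold metadata, columns 8-47 hold the ten dimensions
+    (four columns each), and column 48 holds the number of dimensions filled in.
+    """
+    return_dict = {
+        "experiment_id": row[0],
+        "story_id": row[1],
+        "worker_id": row[2],
+        # The train set has a single numeric worker ID, so `worker_ids` stays empty.
+        "worker_ids": "",
+        "submission_time_normalized": row[3],
+        "worker_quality_assessment": row[4],
+        "selected_sentence_index": row[5],
+        "story": row[6],
+        "selected_sentence": row[7],
+        "number_filled_in": row[48],
+    }
+    for i in range(1, 11):
+        # Dimension i occupies columns 4 * i + 4 through 4 * i + 7.
+        return_dict[f"{i}_specificNL"] = row[4 * i + 4]
+        return_dict[f"{i}_specificStructured"] = row[4 * i + 5]
+        return_dict[f"{i}_generalNL"] = row[4 * i + 6]
+        return_dict[f"{i}_generalStructured"] = row[4 * i + 7]
+    return return_dict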
+
+
+def test_dict_from_row(row):
+    """Map one row of the test CSV to an example dict.
+
+    Fields that exist only in the train set are filled with placeholders:
+    "" for strings and -1 for integers.
+    """
+    return_dict = {
+        "experiment_id": "",
+        "story_id": row[0],
+        "worker_id": -1,
+        "worker_ids": row[3],
+        "submission_time_normalized": "",
+        "worker_quality_assessment": -1,
+        "selected_sentence_index": -1,
+        "story": row[1],
+        "selected_sentence": row[2],
+        "number_filled_in": -1,
+    }
+    for i in range(1, 11):
+        # Dimension i occupies columns 2 * i + 2 and 2 * i + 3; there is no structured text.
+        return_dict[f"{i}_specificNL"] = row[2 * i + 2]
+        return_dict[f"{i}_generalNL"] = row[2 * i + 3]
+        return_dict[f"{i}_specificStructured"] = ""
+        return_dict[f"{i}_generalStructured"] = ""
+    return return_dict
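
Note on the row mappers above: both functions index CSV columns by position, so the index arithmetic is worth checking in isolation. The sketch below is not part of this commit. `train_dim_columns` and `eval_dim_columns` are hypothetical helpers that only reproduce the expressions used in `train_dict_from_row` and `test_dict_from_row`, and the check confirms that the ten dimensions cover a contiguous, non-overlapping column range in each layout. It says nothing about the actual CSV headers, which are not shown in the diff.

```python
def train_dim_columns(i):
    """Column indices for dimension i (1-10) in a training row."""
    base = 4 * i + 4
    return {
        "specificNL": base,
        "specificStructured": base + 1,
        "generalNL": base + 2,
        "generalStructured": base + 3,
    }


def eval_dim_columns(i):
    """Column indices for dimension i (1-10) in a test row (NL columns only)."""
    base = 2 * i + 2
    return {"specificNL": base, "generalNL": base + 1}


# Collect every column index touched by the ten dimensions.
train_used = sorted(idx for i in range(1, 11) for idx in train_dim_columns(i).values())
eval_used = sorted(idx for i in range(1, 11) for idx in eval_dim_columns(i).values())

# Training rows: columns 8..47, leaving 0-7 for metadata and 48 for number_filled_in.
print(train_used == list(range(8, 48)))
# Test rows: columns 4..23, leaving 0-3 for story_id, story, sentence, and worker_ids.
print(eval_used == list(range(4, 24)))
```

Because the mapping is purely positional, any change in the upstream column order would silently shift fields, so a check of this kind against the downloaded header row may be worth adding.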