Added the Ascent KB (huggingface#2341)
* added files for the ascent_kb dataset

* modified README.md

* changed task category/ids

* dropped copying json data

* (re) added "Supported Tasks and Leaderboards"

* added missing fields

* added an example data row

* Apply suggestions from code review

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
phongnt570 and lhoestq committed May 11, 2021
1 parent 7422f8e commit 2f7b94b
Showing 5 changed files with 375 additions and 0 deletions.
227 changes: 227 additions & 0 deletions datasets/ascent_kb/README.md
@@ -0,0 +1,227 @@
---
annotations_creators:
- found
language_creators:
- found
languages:
- en
licenses:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- conditional-text-generation
task_ids:
- other-structured-to-text
---

# Dataset Card for Ascent KB

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://ascent.mpi-inf.mpg.de/
- **Repository:** https://github.com/phongnt570/ascent
- **Paper:** https://arxiv.org/abs/2011.00905
- **Point of Contact:** http://tuan-phong.com

### Dataset Summary

This dataset contains 8.9M commonsense assertions extracted by the Ascent pipeline developed at the [Max Planck Institute for Informatics](https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/).
The focus of this dataset is on everyday concepts such as *elephant*, *car*, *laptop*, etc.
The current version of Ascent KB (v1.0.0) is approximately **19 times larger than ConceptNet** (note that, in this comparison, non-commonsense knowledge in ConceptNet such as lexical relations is excluded).

For more details, take a look at
[the research paper](https://arxiv.org/abs/2011.00905) and
[the website](https://ascent.mpi-inf.mpg.de).

### Supported Tasks and Leaderboards

The dataset can be used in a wide range of downstream tasks such as commonsense question answering or dialogue systems.
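For instance, in a structured-to-text setting, canonical assertions could be verbalized into plain sentences. A minimal sketch (the relation-to-template mapping below is illustrative and not part of the dataset):

```python
# Hypothetical templates for turning canonical triples into sentences.
# Only a few relations are covered; a real setup would need more.
TEMPLATES = {
    "/r/HasProperty": "{arg1} is {arg2}",
    "/r/ObjectUse": "{arg1} is used for {arg2}",
    "/r/HasAspect": "{arg1} has {arg2}",
}


def verbalize(assertion):
    """Render an <arg1 ; rel ; arg2> triple as a plain sentence."""
    template = TEMPLATES.get(assertion["rel"], "{arg1} {rel} {arg2}")
    return template.format(**assertion)


print(verbalize({"arg1": "elephant", "rel": "/r/HasProperty", "arg2": "intelligent"}))
# -> elephant is intelligent
```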

### Languages

The dataset is in English.

## Dataset Structure

### Data Instances
There are two configurations available for this dataset:
1. `canonical` (default): This configuration contains `<arg1 ; rel ; arg2>`
   assertions whose relations (`rel`) were mapped to
   [ConceptNet relations](https://github.com/commonsense/conceptnet5/wiki/Relations)
   with slight modifications:
   - Two new relations were introduced: `/r/HasSubgroup` and `/r/HasAspect`.
   - All `/r/HasA` relations were replaced with `/r/HasAspect`.
     This is motivated by the [ATOMIC-2020](https://allenai.org/data/atomic-2020)
     schema, although ATOMIC-2020 grouped both `/r/HasA` and
     `/r/HasProperty` into `/r/HasProperty`.
   - The `/r/UsedFor` relation was replaced with the broader `/r/ObjectUse`
     (covering _"used for"_, _"used in"_, _"used as"_, etc.),
     which is also taken from ATOMIC-2020.
2. `open`: This configuration contains open assertions of the form
   `<subject ; predicate ; object>` extracted directly from web
   contents. This is the original form of the `canonical` triples.

In both configurations, each assertion comes with extra
information: a set of semantic `facets`
(e.g., *LOCATION*, *TEMPORAL*), its `support` (i.e., the number of occurrences),
and a list of `source_sentences`.

An example row in the `canonical` configuration:

```json
{
  "arg1": "elephant",
  "rel": "/r/HasProperty",
  "arg2": "intelligent",
  "support": 15,
  "facets": [
    {
      "value": "extremely",
      "type": "DEGREE",
      "support": 11
    }
  ],
  "source_sentences": [
    {
      "text": "Elephants are extremely intelligent animals.",
      "source": "https://www.softschools.com/facts/animals/asian_elephant_facts/2310/"
    },
    {
      "text": "Elephants are extremely intelligent creatures and an elephant's brain can weigh as much as 4-6 kg.",
      "source": "https://www.elephantsforafrica.org/elephant-facts/"
    }
  ]
}
```
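
Such rows can be loaded with the `datasets` library. A minimal loading sketch (note that this downloads the full archive of roughly 700 MB):

```python
from datasets import load_dataset

# Default configuration: triples with ConceptNet-style relations.
canonical = load_dataset("ascent_kb", "canonical", split="train")

# Alternative configuration: raw <subject ; predicate ; object> triples.
open_kb = load_dataset("ascent_kb", "open", split="train")

example = canonical[0]
print(example["arg1"], example["rel"], example["arg2"], example["support"])
```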

### Data Fields

- **For the `canonical` configuration**
  - `arg1`: the first argument of the assertion, e.g., *elephant*
  - `rel`: the canonical relation, e.g., */r/HasProperty*
  - `arg2`: the second argument of the assertion, e.g., *intelligent*
  - `support`: the number of occurrences of the assertion, e.g., *15*
  - `facets`: an array of semantic facets, each containing
    - `value`: the facet value, e.g., *extremely*
    - `type`: the facet type, e.g., *DEGREE*
    - `support`: the number of occurrences of the facet, e.g., *11*
  - `source_sentences`: an array of source sentences from which the assertion
    was extracted, each containing
    - `text`: the raw text of the sentence
    - `source`: the URL of its parent document

- **For the `open` configuration**
  - The fields are the same as in the `canonical` configuration, except that
    the (`arg1`, `rel`, `arg2`) fields are replaced with the
    (`subject`, `predicate`, `object`) fields, which are free-text
    phrases extracted directly from the source sentences
    using an Open Information Extraction (OpenIE) tool.
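
As a sketch of how the nested fields can be consumed (assuming the `canonical` configuration has been loaded as in the snippet above):

```python
# Keep only well-supported assertions, then inspect facets and provenance.
strong = canonical.filter(lambda row: row["support"] >= 10)

for row in strong.select(range(3)):
    print(row["arg1"], row["rel"], row["arg2"])
    for facet in row["facets"]:
        print("  facet:", facet["type"], "=", facet["value"])
    for sentence in row["source_sentences"]:
        print("  source:", sentence["source"])
```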

### Data Splits

There are no separate splits; all data points belong to a single split called `train`.

## Dataset Creation

### Curation Rationale

The commonsense knowledge base was created to assist in the development of robust and reliable AI.

### Source Data

#### Initial Data Collection and Normalization

Texts were collected from the web using the Bing Search API and went through various cleaning steps before being processed by an OpenIE tool to extract open assertions.
The assertions were then grouped into semantically equivalent clusters.
See the research paper for more details.
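
The clustering step is beyond the scope of this card, but the general idea is to group surface variants of the same statement under a normalized key. A toy sketch (the `LEMMAS` table and `normalize` helper below are purely illustrative, not the actual Ascent pipeline, which uses more advanced clustering; see the paper):

```python
from collections import defaultdict

# Toy lemma table for illustration only.
LEMMAS = {"are": "be", "is": "be", "elephants": "elephant"}


def normalize(phrase):
    """Lowercase, drop articles, and map words to toy lemmas."""
    words = [w for w in phrase.lower().split() if w not in {"a", "an", "the"}]
    return " ".join(LEMMAS.get(w, w) for w in words)


open_triples = [
    ("Elephants", "are", "intelligent"),
    ("An elephant", "is", "intelligent"),
]

clusters = defaultdict(list)
for s, p, o in open_triples:
    clusters[(normalize(s), normalize(p), normalize(o))].append((s, p, o))

# Each cluster becomes one assertion; its `support` is the cluster size.
for key, members in clusters.items():
    print(key, "support =", len(members))
```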

#### Who are the source language producers?

Web users.

### Annotations

#### Annotation process

None.

#### Who are the annotators?

None.

### Personal and Sensitive Information

Unknown.

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

The knowledge base has been developed by researchers at the
[Max Planck Institute for Informatics](https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/).

Contact [Tuan-Phong Nguyen](http://tuan-phong.com) in case of questions and comments.

### Licensing Information

[The Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/)

### Citation Information

```
@InProceedings{nguyen2021www,
title={Advanced Semantics for Commonsense Knowledge Extraction},
author={Nguyen, Tuan-Phong and Razniewski, Simon and Weikum, Gerhard},
year={2021},
booktitle={The Web Conference 2021},
}
```

### Contributions

Thanks to [@phongnt570](https://github.com/phongnt570) for adding this dataset.
147 changes: 147 additions & 0 deletions datasets/ascent_kb/ascent_kb.py
@@ -0,0 +1,147 @@
# coding=utf-8
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Ascent KB: A Deep Commonsense Knowledge Base"""

import json

import datasets


_CITATION = """\
@InProceedings{nguyen2021www,
title={Advanced Semantics for Commonsense Knowledge Extraction},
author={Nguyen, Tuan-Phong and Razniewski, Simon and Weikum, Gerhard},
year={2021},
booktitle={The Web Conference 2021},
}
"""

_DESCRIPTION = """\
This dataset contains 8.9M commonsense assertions extracted by the Ascent pipeline (https://ascent.mpi-inf.mpg.de/).
"""

_HOMEPAGE = "https://ascent.mpi-inf.mpg.de/"

_LICENSE = "The Creative Commons Attribution 4.0 International License. https://creativecommons.org/licenses/by/4.0/"

# The HuggingFace datasets library doesn't host the datasets but only points to the original files.
# This can be an arbitrary nested dict/list of URLs (see below in the `_split_generators` method).

_URL = "https://nextcloud.mpi-klsb.mpg.de/index.php/s/dFLdTQHqiFrt3Q3/download"


# The class name matches the script name, with CamelCase instead of snake_case.
class AscentKB(datasets.GeneratorBasedBuilder):
    """Ascent KB: A Deep Commonsense Knowledge Base. Version 1.0.0."""

    VERSION = datasets.Version("1.0.0")

    BUILDER_CONFIGS = [
        datasets.BuilderConfig(
            name="canonical",
            version=VERSION,
            description="This KB contains <arg1 ; rel ; arg2> "
            "assertions where relations are canonicalized based on ConceptNet relations.",
        ),
        datasets.BuilderConfig(
            name="open",
            version=VERSION,
            description="This KB contains open assertions of the form "
            "<subject ; predicate ; object> extracted directly from web contents.",
        ),
    ]

    DEFAULT_CONFIG_NAME = "canonical"

    def _info(self):
        if self.config.name == "canonical":
            features = datasets.Features(
                {
                    "arg1": datasets.Value("string"),
                    "rel": datasets.Value("string"),
                    "arg2": datasets.Value("string"),
                    "support": datasets.Value("int64"),
                    "facets": [
                        {
                            "value": datasets.Value("string"),
                            "type": datasets.Value("string"),
                            "support": datasets.Value("int64"),
                        }
                    ],
                    "source_sentences": [{"text": datasets.Value("string"), "source": datasets.Value("string")}],
                }
            )
        else:  # features for the "open" configuration
            features = datasets.Features(
                {
                    "subject": datasets.Value("string"),
                    "predicate": datasets.Value("string"),
                    "object": datasets.Value("string"),
                    "support": datasets.Value("int64"),
                    "facets": [
                        {
                            "value": datasets.Value("string"),
                            "type": datasets.Value("string"),
                            "support": datasets.Value("int64"),
                        }
                    ],
                    "source_sentences": [{"text": datasets.Value("string"), "source": datasets.Value("string")}],
                }
            )
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
            supervised_keys=None,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        """Returns SplitGenerators."""
        data_file = dl_manager.download_and_extract(_URL)

        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "filepath": data_file,
                    "split": "train",
                },
            ),
        ]

    # Method parameters are unpacked from `gen_kwargs` as given in `_split_generators`.
    def _generate_examples(self, filepath, split):
        """Yields examples as (key, example) tuples."""
        # This method yields (key, example) tuples from the downloaded dataset file.
        # The `key` is here for legacy reasons (tfds) and is not important in itself.
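        # Each JSON line carries both the canonical (arg1/rel/arg2) and the open
        # (subject/predicate/object) fields; the fields that do not belong to the
        # selected configuration are dropped before yielding.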

        with open(filepath, encoding="utf-8") as f:
            for id_, row in enumerate(f):
                data = json.loads(row)
                if self.config.name == "canonical":
                    data.pop("subject")
                    data.pop("predicate")
                    data.pop("object")
                else:  # "open"
                    data.pop("arg1")
                    data.pop("rel")
                    data.pop("arg2")
                yield id_, data
1 change: 1 addition & 0 deletions datasets/ascent_kb/dataset_infos.json
@@ -0,0 +1 @@
{"canonical": {"description": "This dataset contains 8.9M commonsense assertions extracted by the Ascent pipeline (https://ascent.mpi-inf.mpg.de/).\n", "citation": "@InProceedings{nguyen2021www,\n title={Advanced Semantics for Commonsense Knowledge Extraction},\n author={Nguyen, Tuan-Phong and Razniewski, Simon and Weikum, Gerhard},\n year={2021},\n booktitle={The Web Conference 2021},\n}\n", "homepage": "https://ascent.mpi-inf.mpg.de/", "license": "The Creative Commons Attribution 4.0 International License. https://creativecommons.org/licenses/by/4.0/", "features": {"arg1": {"dtype": "string", "id": null, "_type": "Value"}, "rel": {"dtype": "string", "id": null, "_type": "Value"}, "arg2": {"dtype": "string", "id": null, "_type": "Value"}, "support": {"dtype": "int64", "id": null, "_type": "Value"}, "facets": [{"value": {"dtype": "string", "id": null, "_type": "Value"}, "type": {"dtype": "string", "id": null, "_type": "Value"}, "support": {"dtype": "int64", "id": null, "_type": "Value"}}], "source_sentences": [{"text": {"dtype": "string", "id": null, "_type": "Value"}, "source": {"dtype": "string", "id": null, "_type": "Value"}}]}, "post_processed": null, "supervised_keys": null, "builder_name": "ascent_kb", "config_name": "canonical", "version": {"version_str": "1.0.0", "description": null, "major": 1, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 2976697816, "num_examples": 8904060, "dataset_name": "ascent_kb"}}, "download_checksums": {"https://nextcloud.mpi-klsb.mpg.de/index.php/s/dFLdTQHqiFrt3Q3/download": {"num_bytes": 710727536, "checksum": "51fd88a07bca4fa48a9157dd1d93d9bac88ad2b38b5eae662d2cbfad47895016"}}, "download_size": 710727536, "post_processing_size": null, "dataset_size": 2976697816, "size_in_bytes": 3687425352}, "open": {"description": "This dataset contains 8.9M commonsense assertions extracted by the Ascent pipeline (https://ascent.mpi-inf.mpg.de/).\n", "citation": "@InProceedings{nguyen2021www,\n title={Advanced Semantics for Commonsense Knowledge Extraction},\n author={Nguyen, Tuan-Phong and Razniewski, Simon and Weikum, Gerhard},\n year={2021},\n booktitle={The Web Conference 2021},\n}\n", "homepage": "https://ascent.mpi-inf.mpg.de/", "license": "The Creative Commons Attribution 4.0 International License. 
https://creativecommons.org/licenses/by/4.0/", "features": {"subject": {"dtype": "string", "id": null, "_type": "Value"}, "predicate": {"dtype": "string", "id": null, "_type": "Value"}, "object": {"dtype": "string", "id": null, "_type": "Value"}, "support": {"dtype": "int64", "id": null, "_type": "Value"}, "facets": [{"value": {"dtype": "string", "id": null, "_type": "Value"}, "type": {"dtype": "string", "id": null, "_type": "Value"}, "support": {"dtype": "int64", "id": null, "_type": "Value"}}], "source_sentences": [{"text": {"dtype": "string", "id": null, "_type": "Value"}, "source": {"dtype": "string", "id": null, "_type": "Value"}}]}, "post_processed": null, "supervised_keys": null, "builder_name": "ascent_kb", "config_name": "open", "version": {"version_str": "1.0.0", "description": null, "major": 1, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 2882678298, "num_examples": 8904060, "dataset_name": "ascent_kb"}}, "download_checksums": {"https://nextcloud.mpi-klsb.mpg.de/index.php/s/dFLdTQHqiFrt3Q3/download": {"num_bytes": 710727536, "checksum": "51fd88a07bca4fa48a9157dd1d93d9bac88ad2b38b5eae662d2cbfad47895016"}}, "download_size": 710727536, "post_processing_size": null, "dataset_size": 2882678298, "size_in_bytes": 3593405834}}
