forked from huggingface/datasets

Commit 2f7b94b (parent: 7422f8e): Added the Ascent KB (huggingface#2341)

* added files for the ascent_kb dataset
* modified README.md
* changed task category/ids
* dropped copying json data
* (re) added "Supported Tasks and Leaderboards"
* added missing fields
* added an example data row
* Apply suggestions from code review

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

5 changed files with 375 additions and 0 deletions.
---
annotations_creators:
- found
language_creators:
- found
languages:
- en
licenses:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- conditional-text-generation
task_ids:
- other-structured-to-text
---
# Dataset Card for Ascent KB

## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://ascent.mpi-inf.mpg.de/
- **Repository:** https://github.com/phongnt570/ascent
- **Paper:** https://arxiv.org/abs/2011.00905
- **Point of Contact:** http://tuan-phong.com

### Dataset Summary

This dataset contains 8.9M commonsense assertions extracted by the Ascent pipeline developed at the [Max Planck Institute for Informatics](https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/).
It focuses on everyday concepts such as *elephant*, *car*, and *laptop*.
The current version of Ascent KB (v1.0.0) is approximately **19 times larger than ConceptNet** (in this comparison, non-commonsense knowledge in ConceptNet, such as lexical relations, is excluded).

For more details, see
[the research paper](https://arxiv.org/abs/2011.00905) and
[the website](https://ascent.mpi-inf.mpg.de).

### Supported Tasks and Leaderboards

The dataset can be used in a wide range of downstream tasks, such as commonsense question answering and dialogue systems.

### Languages

The dataset is in English.

## Dataset Structure
### Data Instances

There are two configurations available for this dataset:

1. `canonical` (default): contains `<arg1 ; rel ; arg2>` assertions whose relations (`rel`) were mapped to [ConceptNet relations](https://github.com/commonsense/conceptnet5/wiki/Relations) with slight modifications:
   - Two new relations were introduced: `/r/HasSubgroup` and `/r/HasAspect`.
   - All `/r/HasA` relations were replaced with `/r/HasAspect`. This is motivated by the [ATOMIC-2020](https://allenai.org/data/atomic-2020) schema, although ATOMIC-2020 grouped both `/r/HasA` and `/r/HasProperty` into `/r/HasProperty`.
   - The `/r/UsedFor` relation was replaced with the broader `/r/ObjectUse` (covering *"used for"*, *"used in"*, *"used as"*, etc.). This is also taken from ATOMIC-2020.
2. `open`: contains open assertions of the form `<subject ; predicate ; object>` extracted directly from web contents. This is the original form of the `canonical` triples.
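The relation adjustments described for the `canonical` configuration can be pictured as a small post-hoc renaming step. The helper below is purely illustrative (it is not part of the Ascent pipeline) and only covers the two replacements named above:

```python
# Hypothetical helper illustrating the relation renaming described above.
# Only the two replacements named in this card are covered.
RELATION_REMAP = {
    "/r/HasA": "/r/HasAspect",     # ATOMIC-2020-motivated change
    "/r/UsedFor": "/r/ObjectUse",  # broader "used for / in / as" relation
}


def canonicalize_relation(rel: str) -> str:
    """Return the renamed relation, or the input unchanged if no rename applies."""
    return RELATION_REMAP.get(rel, rel)
```

For example, `canonicalize_relation("/r/UsedFor")` yields `"/r/ObjectUse"`, while relations outside the mapping pass through untouched.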

In both configurations, each assertion comes with extra information: a set of semantic `facets` (e.g., *LOCATION*, *TEMPORAL*), its `support` (i.e., its number of occurrences), and a list of `source_sentences`.

An example row in the `canonical` configuration:

```json
{
  "arg1": "elephant",
  "rel": "/r/HasProperty",
  "arg2": "intelligent",
  "support": 15,
  "facets": [
    {
      "value": "extremely",
      "type": "DEGREE",
      "support": 11
    }
  ],
  "source_sentences": [
    {
      "text": "Elephants are extremely intelligent animals.",
      "source": "https://www.softschools.com/facts/animals/asian_elephant_facts/2310/"
    },
    {
      "text": "Elephants are extremely intelligent creatures and an elephant's brain can weigh as much as 4-6 kg.",
      "source": "https://www.elephantsforafrica.org/elephant-facts/"
    }
  ]
}
```
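Since each row is plain JSON, downstream code can work with it directly. A minimal sketch of client-side processing, using a trimmed copy of the example row above (the helper name is illustrative only):

```python
import json

# A trimmed copy of the example row shown above.
row_json = """
{"arg1": "elephant", "rel": "/r/HasProperty", "arg2": "intelligent",
 "support": 15,
 "facets": [{"value": "extremely", "type": "DEGREE", "support": 11}],
 "source_sentences": [{"text": "Elephants are extremely intelligent animals.",
                       "source": "https://www.softschools.com/facts/animals/asian_elephant_facts/2310/"}]}
"""


def strongest_facet(assertion):
    """Return the facet with the highest support, or None if there are no facets."""
    return max(assertion.get("facets", []), key=lambda f: f["support"], default=None)


assertion = json.loads(row_json)
triple = (assertion["arg1"], assertion["rel"], assertion["arg2"])
facet = strongest_facet(assertion)  # the DEGREE facet "extremely"
```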

### Data Fields

- **`canonical` configuration**
  - `arg1`: the first argument of the assertion, e.g., *elephant*
  - `rel`: the canonical relation, e.g., */r/HasProperty*
  - `arg2`: the second argument of the assertion, e.g., *intelligent*
  - `support`: the number of occurrences of the assertion, e.g., *15*
  - `facets`: an array of semantic facets, each containing
    - `value`: the facet value, e.g., *extremely*
    - `type`: the facet type, e.g., *DEGREE*
    - `support`: the number of occurrences of the facet, e.g., *11*
  - `source_sentences`: an array of source sentences from which the assertion was extracted, each containing
    - `text`: the raw text of the sentence
    - `source`: the URL of its parent document

- **`open` configuration**
  - The fields are the same as in the `canonical` configuration, except that the (`arg1`, `rel`, `arg2`) fields are replaced with (`subject`, `predicate`, `object`): free-text phrases extracted directly from the source sentences by an Open Information Extraction (OpenIE) tool.

### Data Splits

There are no splits; all data points belong to a single default split called `train`.

## Dataset Creation

### Curation Rationale

The commonsense knowledge base was created to assist in the development of robust and reliable AI.

### Source Data

#### Initial Data Collection and Normalization

Texts were collected from the web using the Bing Search API and went through various cleaning steps before being processed by an OpenIE tool to obtain open assertions.
The assertions were then grouped into semantically equivalent clusters.
See the research paper for more details.
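Ascent's actual clustering is semantic (see the paper). Purely to illustrate the general idea of grouping equivalent OpenIE triples, here is a naive string-normalization grouping; all names are hypothetical and this is not the method used by the pipeline:

```python
from collections import defaultdict


def normalize(phrase):
    """Very crude normalization: lowercase and collapse whitespace."""
    return " ".join(phrase.lower().split())


def group_triples(triples):
    """Group (subject, predicate, object) triples by their normalized form."""
    clusters = defaultdict(list)
    for triple in triples:
        key = tuple(normalize(part) for part in triple)
        clusters[key].append(triple)
    return clusters


clusters = group_triples([
    ("Elephants", "are", "intelligent"),
    ("elephants", "are", "Intelligent"),   # same cluster as the first triple
    ("elephants", "live in", "herds"),
])
```

A cluster's size corresponds to the kind of `support` count attached to each assertion in the released data.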

#### Who are the source language producers?

Web users.

### Annotations

#### Annotation process

None.

#### Who are the annotators?

None.

### Personal and Sensitive Information

Unknown.

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

The knowledge base has been developed by researchers at the
[Max Planck Institute for Informatics](https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/).

Contact [Tuan-Phong Nguyen](http://tuan-phong.com) in case of questions and comments.

### Licensing Information

[The Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/)

### Citation Information

```
@InProceedings{nguyen2021www,
title={Advanced Semantics for Commonsense Knowledge Extraction},
author={Nguyen, Tuan-Phong and Razniewski, Simon and Weikum, Gerhard},
year={2021},
booktitle={The Web Conference 2021},
}
```

### Contributions

Thanks to [@phongnt570](https://github.com/phongnt570) for adding this dataset.
# coding=utf-8
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Ascent KB: A Deep Commonsense Knowledge Base"""

import json

import datasets


_CITATION = """\
@InProceedings{nguyen2021www,
title={Advanced Semantics for Commonsense Knowledge Extraction},
author={Nguyen, Tuan-Phong and Razniewski, Simon and Weikum, Gerhard},
year={2021},
booktitle={The Web Conference 2021},
}
"""

_DESCRIPTION = """\
This dataset contains 8.9M commonsense assertions extracted by the Ascent pipeline (https://ascent.mpi-inf.mpg.de/).
"""

_HOMEPAGE = "https://ascent.mpi-inf.mpg.de/"

_LICENSE = "The Creative Commons Attribution 4.0 International License. https://creativecommons.org/licenses/by/4.0/"

# The Hugging Face datasets library doesn't host the data; this URL points to the original dump.
_URL = "https://nextcloud.mpi-klsb.mpg.de/index.php/s/dFLdTQHqiFrt3Q3/download"


class AscentKB(datasets.GeneratorBasedBuilder):
    """Ascent KB: A Deep Commonsense Knowledge Base. Version 1.0.0."""

    VERSION = datasets.Version("1.0.0")

    BUILDER_CONFIGS = [
        datasets.BuilderConfig(
            name="canonical",
            version=VERSION,
            description="This KB contains <arg1 ; rel ; arg2> "
            "assertions where relations are canonicalized based on ConceptNet relations.",
        ),
        datasets.BuilderConfig(
            name="open",
            version=VERSION,
            description="This KB contains open assertions of the form "
            "<subject ; predicate ; object> extracted directly from web contents.",
        ),
    ]

    DEFAULT_CONFIG_NAME = "canonical"

    def _info(self):
        if self.config.name == "canonical":
            features = datasets.Features(
                {
                    "arg1": datasets.Value("string"),
                    "rel": datasets.Value("string"),
                    "arg2": datasets.Value("string"),
                    "support": datasets.Value("int64"),
                    "facets": [
                        {
                            "value": datasets.Value("string"),
                            "type": datasets.Value("string"),
                            "support": datasets.Value("int64"),
                        }
                    ],
                    "source_sentences": [{"text": datasets.Value("string"), "source": datasets.Value("string")}],
                }
            )
        else:  # features for the "open" configuration
            features = datasets.Features(
                {
                    "subject": datasets.Value("string"),
                    "predicate": datasets.Value("string"),
                    "object": datasets.Value("string"),
                    "support": datasets.Value("int64"),
                    "facets": [
                        {
                            "value": datasets.Value("string"),
                            "type": datasets.Value("string"),
                            "support": datasets.Value("int64"),
                        }
                    ],
                    "source_sentences": [{"text": datasets.Value("string"), "source": datasets.Value("string")}],
                }
            )
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
            supervised_keys=None,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        """Returns SplitGenerators."""
        data_file = dl_manager.download_and_extract(_URL)

        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "filepath": data_file,
                    "split": "train",
                },
            ),
        ]

    def _generate_examples(self, filepath, split):
        """Yields examples as (key, example) tuples."""
        # Each line of the downloaded dump is a JSON object carrying both the
        # canonical fields (arg1/rel/arg2) and the open fields
        # (subject/predicate/object); drop the ones that don't belong to the
        # selected configuration. The key is only kept for legacy (tfds) reasons.
        with open(filepath, encoding="utf-8") as f:
            for id_, row in enumerate(f):
                data = json.loads(row)
                if self.config.name == "canonical":
                    data.pop("subject")
                    data.pop("predicate")
                    data.pop("object")
                else:  # "open"
                    data.pop("arg1")
                    data.pop("rel")
                    data.pop("arg2")
                yield id_, data
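The per-row filtering in `_generate_examples` can be sanity-checked in isolation. The sketch below mirrors that logic on a made-up sample line (real dump rows also carry `support`, `facets`, and `source_sentences`):

```python
import json

# Illustrative sample line; real rows carry more fields.
sample_line = json.dumps({
    "arg1": "elephant", "rel": "/r/HasProperty", "arg2": "intelligent",
    "subject": "elephants", "predicate": "are", "object": "intelligent",
    "support": 15,
})


def project(line, config_name):
    """Keep only the fields of the requested configuration, mirroring _generate_examples."""
    data = json.loads(line)
    drop = ("subject", "predicate", "object") if config_name == "canonical" else ("arg1", "rel", "arg2")
    for key in drop:
        data.pop(key)
    return data


canonical_row = project(sample_line, "canonical")  # keeps arg1 / rel / arg2
open_row = project(sample_line, "open")            # keeps subject / predicate / object
```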
```json
{"canonical": {"description": "This dataset contains 8.9M commonsense assertions extracted by the Ascent pipeline (https://ascent.mpi-inf.mpg.de/).\n", "citation": "@InProceedings{nguyen2021www,\n title={Advanced Semantics for Commonsense Knowledge Extraction},\n author={Nguyen, Tuan-Phong and Razniewski, Simon and Weikum, Gerhard},\n year={2021},\n booktitle={The Web Conference 2021},\n}\n", "homepage": "https://ascent.mpi-inf.mpg.de/", "license": "The Creative Commons Attribution 4.0 International License. https://creativecommons.org/licenses/by/4.0/", "features": {"arg1": {"dtype": "string", "id": null, "_type": "Value"}, "rel": {"dtype": "string", "id": null, "_type": "Value"}, "arg2": {"dtype": "string", "id": null, "_type": "Value"}, "support": {"dtype": "int64", "id": null, "_type": "Value"}, "facets": [{"value": {"dtype": "string", "id": null, "_type": "Value"}, "type": {"dtype": "string", "id": null, "_type": "Value"}, "support": {"dtype": "int64", "id": null, "_type": "Value"}}], "source_sentences": [{"text": {"dtype": "string", "id": null, "_type": "Value"}, "source": {"dtype": "string", "id": null, "_type": "Value"}}]}, "post_processed": null, "supervised_keys": null, "builder_name": "ascent_kb", "config_name": "canonical", "version": {"version_str": "1.0.0", "description": null, "major": 1, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 2976697816, "num_examples": 8904060, "dataset_name": "ascent_kb"}}, "download_checksums": {"https://nextcloud.mpi-klsb.mpg.de/index.php/s/dFLdTQHqiFrt3Q3/download": {"num_bytes": 710727536, "checksum": "51fd88a07bca4fa48a9157dd1d93d9bac88ad2b38b5eae662d2cbfad47895016"}}, "download_size": 710727536, "post_processing_size": null, "dataset_size": 2976697816, "size_in_bytes": 3687425352}, "open": {"description": "This dataset contains 8.9M commonsense assertions extracted by the Ascent pipeline (https://ascent.mpi-inf.mpg.de/).\n", "citation": "@InProceedings{nguyen2021www,\n title={Advanced Semantics for Commonsense Knowledge Extraction},\n author={Nguyen, Tuan-Phong and Razniewski, Simon and Weikum, Gerhard},\n year={2021},\n booktitle={The Web Conference 2021},\n}\n", "homepage": "https://ascent.mpi-inf.mpg.de/", "license": "The Creative Commons Attribution 4.0 International License. https://creativecommons.org/licenses/by/4.0/", "features": {"subject": {"dtype": "string", "id": null, "_type": "Value"}, "predicate": {"dtype": "string", "id": null, "_type": "Value"}, "object": {"dtype": "string", "id": null, "_type": "Value"}, "support": {"dtype": "int64", "id": null, "_type": "Value"}, "facets": [{"value": {"dtype": "string", "id": null, "_type": "Value"}, "type": {"dtype": "string", "id": null, "_type": "Value"}, "support": {"dtype": "int64", "id": null, "_type": "Value"}}], "source_sentences": [{"text": {"dtype": "string", "id": null, "_type": "Value"}, "source": {"dtype": "string", "id": null, "_type": "Value"}}]}, "post_processed": null, "supervised_keys": null, "builder_name": "ascent_kb", "config_name": "open", "version": {"version_str": "1.0.0", "description": null, "major": 1, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 2882678298, "num_examples": 8904060, "dataset_name": "ascent_kb"}}, "download_checksums": {"https://nextcloud.mpi-klsb.mpg.de/index.php/s/dFLdTQHqiFrt3Q3/download": {"num_bytes": 710727536, "checksum": "51fd88a07bca4fa48a9157dd1d93d9bac88ad2b38b5eae662d2cbfad47895016"}}, "download_size": 710727536, "post_processing_size": null, "dataset_size": 2882678298, "size_in_bytes": 3593405834}}
```
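The infos file records split sizes, example counts, and download checksums per configuration. A small sketch of pulling out the headline numbers (the dict values below are copied from the JSON above; the helper is hypothetical, not part of the datasets library):

```python
# Headline numbers copied from the dataset_infos.json above.
infos = {
    "canonical": {"num_examples": 8904060, "dataset_size": 2976697816, "download_size": 710727536},
    "open": {"num_examples": 8904060, "dataset_size": 2882678298, "download_size": 710727536},
}


def to_gb(num_bytes):
    """Convert a byte count to gigabytes (decimal), rounded to two places."""
    return round(num_bytes / 1e9, 2)


# Both configurations cover the same 8.9M assertions; only the on-disk size differs.
summary = {name: (cfg["num_examples"], to_gb(cfg["dataset_size"])) for name, cfg in infos.items()}
```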
Two binary files not shown.