Added the Ascent KB (huggingface#2341)
* added files for the ascent_kb dataset

* modified README.md

* changed task category/ids

* dropped copying json data

* (re) added "Supported Tasks and Leaderboards"

* added missing fields

* added an example data row

* Apply suggestions from code review

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
phongnt570 and lhoestq committed May 11, 2021
1 parent 7422f8e commit 2f7b94b
Showing 5 changed files with 375 additions and 0 deletions.
227 changes: 227 additions & 0 deletions datasets/ascent_kb/README.md
@@ -0,0 +1,227 @@
---
annotations_creators:
- found
language_creators:
- found
languages:
- en
licenses:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- conditional-text-generation
task_ids:
- other-structured-to-text
---

# Dataset Card for Ascent KB

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://ascent.mpi-inf.mpg.de/
- **Repository:** https://github.com/phongnt570/ascent
- **Paper:** https://arxiv.org/abs/2011.00905
- **Point of Contact:** http://tuan-phong.com

### Dataset Summary

This dataset contains 8.9M commonsense assertions extracted by the Ascent pipeline developed at the [Max Planck Institute for Informatics](https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/).
The focus of this dataset is on everyday concepts such as *elephant*, *car*, *laptop*, etc.
The current version of Ascent KB (v1.0.0) is approximately **19 times larger than ConceptNet** (note that, in this comparison, non-commonsense knowledge in ConceptNet such as lexical relations is excluded).

For more details, take a look at
[the research paper](https://arxiv.org/abs/2011.00905) and
[the website](https://ascent.mpi-inf.mpg.de).

### Supported Tasks and Leaderboards

The dataset can be used in a wide range of downstream tasks such as commonsense question answering or dialogue systems.
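For instance, in a structured-to-text setting, canonical assertions could be verbalized into plain sentences. A minimal sketch (the relation-to-template mapping below is illustrative and not part of the dataset):

```python
# Hypothetical templates for turning canonical triples into sentences.
# Only a few relations are covered; a real setup would need more.
TEMPLATES = {
    "/r/HasProperty": "{arg1} is {arg2}",
    "/r/ObjectUse": "{arg1} is used for {arg2}",
    "/r/HasAspect": "{arg1} has {arg2}",
}


def verbalize(assertion):
    """Render an <arg1 ; rel ; arg2> triple as a plain sentence."""
    template = TEMPLATES.get(assertion["rel"], "{arg1} {rel} {arg2}")
    return template.format(**assertion)


print(verbalize({"arg1": "elephant", "rel": "/r/HasProperty", "arg2": "intelligent"}))
# -> elephant is intelligent
```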

### Languages

The dataset is in English.

## Dataset Structure

### Data Instances
There are two configurations available for this dataset:
1. `canonical` (default): This configuration contains `<arg1 ; rel ; arg2>`
   assertions whose relations (`rel`) were mapped to
   [ConceptNet relations](https://github.com/commonsense/conceptnet5/wiki/Relations)
   with slight modifications:
   - Two new relations were introduced: `/r/HasSubgroup` and `/r/HasAspect`.
   - All `/r/HasA` relations were replaced with `/r/HasAspect`.
     This is motivated by the [ATOMIC-2020](https://allenai.org/data/atomic-2020)
     schema, although ATOMIC-2020 grouped both `/r/HasA` and
     `/r/HasProperty` into `/r/HasProperty`.
   - The `/r/UsedFor` relation was replaced with the broader `/r/ObjectUse`
     (covering _"used for"_, _"used in"_, _"used as"_, etc.),
     which is also taken from ATOMIC-2020.
2. `open`: This configuration contains open assertions of the form
   `<subject ; predicate ; object>` extracted directly from web
   contents. This is the original form of the `canonical` triples.

In both configurations, each assertion comes with extra
information: a set of semantic `facets`
(e.g., *LOCATION*, *TEMPORAL*), its `support` (i.e., the number of occurrences),
and a list of `source_sentences`.

An example row in the `canonical` configuration:

```json
{
  "arg1": "elephant",
  "rel": "/r/HasProperty",
  "arg2": "intelligent",
  "support": 15,
  "facets": [
    {
      "value": "extremely",
      "type": "DEGREE",
      "support": 11
    }
  ],
  "source_sentences": [
    {
      "text": "Elephants are extremely intelligent animals.",
      "source": "https://www.softschools.com/facts/animals/asian_elephant_facts/2310/"
    },
    {
      "text": "Elephants are extremely intelligent creatures and an elephant's brain can weigh as much as 4-6 kg.",
      "source": "https://www.elephantsforafrica.org/elephant-facts/"
    }
  ]
}
```
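
Such rows can be loaded with the `datasets` library. A minimal loading sketch (note that this downloads the full archive of roughly 700 MB):

```python
from datasets import load_dataset

# Default configuration: triples with ConceptNet-style relations.
canonical = load_dataset("ascent_kb", "canonical", split="train")

# Alternative configuration: raw <subject ; predicate ; object> triples.
open_kb = load_dataset("ascent_kb", "open", split="train")

example = canonical[0]
print(example["arg1"], example["rel"], example["arg2"], example["support"])
```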

### Data Fields

- **For the `canonical` configuration**
  - `arg1`: the first argument of the assertion, e.g., *elephant*
  - `rel`: the canonical relation, e.g., */r/HasProperty*
  - `arg2`: the second argument of the assertion, e.g., *intelligent*
  - `support`: the number of occurrences of the assertion, e.g., *15*
  - `facets`: an array of semantic facets, each containing
    - `value`: the facet value, e.g., *extremely*
    - `type`: the facet type, e.g., *DEGREE*
    - `support`: the number of occurrences of the facet, e.g., *11*
  - `source_sentences`: an array of source sentences from which the assertion
    was extracted, each containing
    - `text`: the raw text of the sentence
    - `source`: the URL of its parent document

- **For the `open` configuration**
  - The fields are the same as in the `canonical` configuration, except that
    the (`arg1`, `rel`, `arg2`) fields are replaced with the
    (`subject`, `predicate`, `object`) fields, which are free-text
    phrases extracted directly from the source sentences
    using an Open Information Extraction (OpenIE) tool.
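
As a sketch of how the nested fields can be consumed (assuming the `canonical` configuration has been loaded as in the snippet above):

```python
# Keep only well-supported assertions, then inspect facets and provenance.
strong = canonical.filter(lambda row: row["support"] >= 10)

for row in strong.select(range(3)):
    print(row["arg1"], row["rel"], row["arg2"])
    for facet in row["facets"]:
        print("  facet:", facet["type"], "=", facet["value"])
    for sentence in row["source_sentences"]:
        print("  source:", sentence["source"])
```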

### Data Splits

There are no separate splits; all data points belong to a single split called `train`.

## Dataset Creation

### Curation Rationale

The commonsense knowledge base was created to assist in the development of robust and reliable AI.

### Source Data

#### Initial Data Collection and Normalization

Texts were collected from the web using the Bing Search API and went through various cleaning steps before being processed by an OpenIE tool to extract open assertions.
The assertions were then grouped into semantically equivalent clusters.
See the research paper for more details.
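
The clustering step is beyond the scope of this card, but the general idea is to group surface variants of the same statement under a normalized key. A toy sketch (the `LEMMAS` table and `normalize` helper below are purely illustrative, not the actual Ascent pipeline, which uses more advanced clustering; see the paper):

```python
from collections import defaultdict

# Toy lemma table for illustration only.
LEMMAS = {"are": "be", "is": "be", "elephants": "elephant"}


def normalize(phrase):
    """Lowercase, drop articles, and map words to toy lemmas."""
    words = [w for w in phrase.lower().split() if w not in {"a", "an", "the"}]
    return " ".join(LEMMAS.get(w, w) for w in words)


open_triples = [
    ("Elephants", "are", "intelligent"),
    ("An elephant", "is", "intelligent"),
]

clusters = defaultdict(list)
for s, p, o in open_triples:
    clusters[(normalize(s), normalize(p), normalize(o))].append((s, p, o))

# Each cluster becomes one assertion; its `support` is the cluster size.
for key, members in clusters.items():
    print(key, "support =", len(members))
```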

#### Who are the source language producers?

Web users.

### Annotations

#### Annotation process

None.

#### Who are the annotators?

None.

### Personal and Sensitive Information

Unknown.

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

The knowledge base has been developed by researchers at the
[Max Planck Institute for Informatics](https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/).

Contact [Tuan-Phong Nguyen](http://tuan-phong.com) in case of questions and comments.

### Licensing Information

[The Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/)

### Citation Information

```
@InProceedings{nguyen2021www,
title={Advanced Semantics for Commonsense Knowledge Extraction},
author={Nguyen, Tuan-Phong and Razniewski, Simon and Weikum, Gerhard},
year={2021},
booktitle={The Web Conference 2021},
}
```

### Contributions

Thanks to [@phongnt570](https://github.com/phongnt570) for adding this dataset.
147 changes: 147 additions & 0 deletions datasets/ascent_kb/ascent_kb.py
@@ -0,0 +1,147 @@
# coding=utf-8
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Ascent KB: A Deep Commonsense Knowledge Base"""

import json

import datasets


_CITATION = """\
@InProceedings{nguyen2021www,
title={Advanced Semantics for Commonsense Knowledge Extraction},
author={Nguyen, Tuan-Phong and Razniewski, Simon and Weikum, Gerhard},
year={2021},
booktitle={The Web Conference 2021},
}
"""

_DESCRIPTION = """\
This dataset contains 8.9M commonsense assertions extracted by the Ascent pipeline (https://ascent.mpi-inf.mpg.de/).
"""

_HOMEPAGE = "https://ascent.mpi-inf.mpg.de/"

_LICENSE = "The Creative Commons Attribution 4.0 International License. https://creativecommons.org/licenses/by/4.0/"

# The HuggingFace datasets library doesn't host the datasets but only points to the original files.
# This can be an arbitrary nested dict/list of URLs (see below in the `_split_generators` method).

_URL = "https://nextcloud.mpi-klsb.mpg.de/index.php/s/dFLdTQHqiFrt3Q3/download"


# The class name matches the script name, with CamelCase instead of snake_case.
class AscentKB(datasets.GeneratorBasedBuilder):
    """Ascent KB: A Deep Commonsense Knowledge Base. Version 1.0.0."""

    VERSION = datasets.Version("1.0.0")

    BUILDER_CONFIGS = [
        datasets.BuilderConfig(
            name="canonical",
            version=VERSION,
            description="This KB contains <arg1 ; rel ; arg2> "
            "assertions where relations are canonicalized based on ConceptNet relations.",
        ),
        datasets.BuilderConfig(
            name="open",
            version=VERSION,
            description="This KB contains open assertions of the form "
            "<subject ; predicate ; object> extracted directly from web contents.",
        ),
    ]

    DEFAULT_CONFIG_NAME = "canonical"

    def _info(self):
        if self.config.name == "canonical":
            features = datasets.Features(
                {
                    "arg1": datasets.Value("string"),
                    "rel": datasets.Value("string"),
                    "arg2": datasets.Value("string"),
                    "support": datasets.Value("int64"),
                    "facets": [
                        {
                            "value": datasets.Value("string"),
                            "type": datasets.Value("string"),
                            "support": datasets.Value("int64"),
                        }
                    ],
                    "source_sentences": [{"text": datasets.Value("string"), "source": datasets.Value("string")}],
                }
            )
        else:  # features for the "open" configuration
            features = datasets.Features(
                {
                    "subject": datasets.Value("string"),
                    "predicate": datasets.Value("string"),
                    "object": datasets.Value("string"),
                    "support": datasets.Value("int64"),
                    "facets": [
                        {
                            "value": datasets.Value("string"),
                            "type": datasets.Value("string"),
                            "support": datasets.Value("int64"),
                        }
                    ],
                    "source_sentences": [{"text": datasets.Value("string"), "source": datasets.Value("string")}],
                }
            )
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
            supervised_keys=None,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        """Returns SplitGenerators."""
        data_file = dl_manager.download_and_extract(_URL)

        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "filepath": data_file,
                    "split": "train",
                },
            ),
        ]

    # Method parameters are unpacked from `gen_kwargs` as given in `_split_generators`.
    def _generate_examples(self, filepath, split):
        """Yields examples as (key, example) tuples."""
        # This method yields (key, example) tuples from the downloaded dataset file.
        # The `key` is here for legacy reasons (tfds) and is not important in itself.
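        # Each JSON line carries both the canonical (arg1/rel/arg2) and the open
        # (subject/predicate/object) fields; the fields that do not belong to the
        # selected configuration are dropped before yielding.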

        with open(filepath, encoding="utf-8") as f:
            for id_, row in enumerate(f):
                data = json.loads(row)
                if self.config.name == "canonical":
                    data.pop("subject")
                    data.pop("predicate")
                    data.pop("object")
                else:  # "open"
                    data.pop("arg1")
                    data.pop("rel")
                    data.pop("arg2")
                yield id_, data
1 change: 1 addition & 0 deletions datasets/ascent_kb/dataset_infos.json
@@ -0,0 +1 @@
{"canonical": {"description": "This dataset contains 8.9M commonsense assertions extracted by the Ascent pipeline (https://ascent.mpi-inf.mpg.de/).\n", "citation": "@InProceedings{nguyen2021www,\n title={Advanced Semantics for Commonsense Knowledge Extraction},\n author={Nguyen, Tuan-Phong and Razniewski, Simon and Weikum, Gerhard},\n year={2021},\n booktitle={The Web Conference 2021},\n}\n", "homepage": "https://ascent.mpi-inf.mpg.de/", "license": "The Creative Commons Attribution 4.0 International License. https://creativecommons.org/licenses/by/4.0/", "features": {"arg1": {"dtype": "string", "id": null, "_type": "Value"}, "rel": {"dtype": "string", "id": null, "_type": "Value"}, "arg2": {"dtype": "string", "id": null, "_type": "Value"}, "support": {"dtype": "int64", "id": null, "_type": "Value"}, "facets": [{"value": {"dtype": "string", "id": null, "_type": "Value"}, "type": {"dtype": "string", "id": null, "_type": "Value"}, "support": {"dtype": "int64", "id": null, "_type": "Value"}}], "source_sentences": [{"text": {"dtype": "string", "id": null, "_type": "Value"}, "source": {"dtype": "string", "id": null, "_type": "Value"}}]}, "post_processed": null, "supervised_keys": null, "builder_name": "ascent_kb", "config_name": "canonical", "version": {"version_str": "1.0.0", "description": null, "major": 1, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 2976697816, "num_examples": 8904060, "dataset_name": "ascent_kb"}}, "download_checksums": {"https://nextcloud.mpi-klsb.mpg.de/index.php/s/dFLdTQHqiFrt3Q3/download": {"num_bytes": 710727536, "checksum": "51fd88a07bca4fa48a9157dd1d93d9bac88ad2b38b5eae662d2cbfad47895016"}}, "download_size": 710727536, "post_processing_size": null, "dataset_size": 2976697816, "size_in_bytes": 3687425352}, "open": {"description": "This dataset contains 8.9M commonsense assertions extracted by the Ascent pipeline (https://ascent.mpi-inf.mpg.de/).\n", "citation": "@InProceedings{nguyen2021www,\n title={Advanced Semantics for Commonsense Knowledge Extraction},\n author={Nguyen, Tuan-Phong and Razniewski, Simon and Weikum, Gerhard},\n year={2021},\n booktitle={The Web Conference 2021},\n}\n", "homepage": "https://ascent.mpi-inf.mpg.de/", "license": "The Creative Commons Attribution 4.0 International License. 
https://creativecommons.org/licenses/by/4.0/", "features": {"subject": {"dtype": "string", "id": null, "_type": "Value"}, "predicate": {"dtype": "string", "id": null, "_type": "Value"}, "object": {"dtype": "string", "id": null, "_type": "Value"}, "support": {"dtype": "int64", "id": null, "_type": "Value"}, "facets": [{"value": {"dtype": "string", "id": null, "_type": "Value"}, "type": {"dtype": "string", "id": null, "_type": "Value"}, "support": {"dtype": "int64", "id": null, "_type": "Value"}}], "source_sentences": [{"text": {"dtype": "string", "id": null, "_type": "Value"}, "source": {"dtype": "string", "id": null, "_type": "Value"}}]}, "post_processed": null, "supervised_keys": null, "builder_name": "ascent_kb", "config_name": "open", "version": {"version_str": "1.0.0", "description": null, "major": 1, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 2882678298, "num_examples": 8904060, "dataset_name": "ascent_kb"}}, "download_checksums": {"https://nextcloud.mpi-klsb.mpg.de/index.php/s/dFLdTQHqiFrt3Q3/download": {"num_bytes": 710727536, "checksum": "51fd88a07bca4fa48a9157dd1d93d9bac88ad2b38b5eae662d2cbfad47895016"}}, "download_size": 710727536, "post_processing_size": null, "dataset_size": 2882678298, "size_in_bytes": 3593405834}}
