Fix bugs in NewsQA dataset #3734

Merged
merged 6 commits into from
Feb 17, 2022
Update dataset card
albertvillanova committed Feb 17, 2022
commit c96ea43a430b7c62ef3f92e34add6da122303448
151 changes: 127 additions & 24 deletions datasets/newsqa/README.md
@@ -78,53 +78,156 @@ English
### Data Instances

```
{'storyId': './cnn/stories/42d01e187213e86f5fe617fe32e716ff7fa3afc4.story',
'text': 'NEW DELHI, India (CNN) -- A high court in northern India on Friday acquitted a wealthy businessman facing the death sentence for the killing of a teen in a case dubbed "the house of horrors."\n\n\n\nMoninder Singh Pandher was sentenced to death by a lower court in February.\n\n\n\nThe teen was one of 19 victims -- children and young women -- in one of the most gruesome serial killings in India in recent years.\n\n\n\nThe Allahabad high court has acquitted Moninder Singh Pandher, his lawyer Sikandar B. Kochar told CNN.\n\n\n\nPandher and his domestic employee Surinder Koli were sentenced to death in February by a lower court for the rape and murder of the 14-year-old.\n\n\n\nThe high court upheld Koli\'s death sentence, Kochar said.\n\n\n\nThe two were arrested two years ago after body parts packed in plastic bags were found near their home in Noida, a New Delhi suburb. Their home was later dubbed a "house of horrors" by the Indian media.\n\n\n\nPandher was not named a main suspect by investigators initially, but was summoned as co-accused during the trial, Kochar said.\n\n\n\nKochar said his client was in Australia when the teen was raped and killed.\n\n\n\nPandher faces trial in the remaining 18 killings and could remain in custody, the attorney said.',
'type': 'train',
'questions': {'q': ['What was the amount of children murdered?',
'When was Pandher sentenced to death?',
'The court aquitted Moninder Singh Pandher of what crime?',
'who was acquitted',
'who was sentenced',
'What was Moninder Singh Pandher acquitted for?',
'Who was sentenced to death in February?',
'how many people died',
'How many children and young women were murdered?'],
'isAnswerAbsent': [0, 0, 0, 0, 0, 0, 0, 0, 0],
'isQuestionBad': [0, 0, 0, 0, 0, 0, 0, 0, 0],
'consensus': [{'s': 294, 'e': 297, 'badQuestion': False, 'noAnswer': False},
{'s': 261, 'e': 271, 'badQuestion': False, 'noAnswer': False},
{'s': 624, 'e': 640, 'badQuestion': False, 'noAnswer': False},
{'s': 195, 'e': 218, 'badQuestion': False, 'noAnswer': False},
{'s': 195, 'e': 218, 'badQuestion': False, 'noAnswer': False},
{'s': 129, 'e': 151, 'badQuestion': False, 'noAnswer': False},
{'s': 195, 'e': 218, 'badQuestion': False, 'noAnswer': False},
{'s': 294, 'e': 297, 'badQuestion': False, 'noAnswer': False},
{'s': 294, 'e': 297, 'badQuestion': False, 'noAnswer': False}],
'answers': [{'sourcerAnswers': [{'s': [294],
'e': [297],
'badQuestion': [False],
'noAnswer': [False]},
{'s': [0], 'e': [0], 'badQuestion': [False], 'noAnswer': [True]},
{'s': [0], 'e': [0], 'badQuestion': [False], 'noAnswer': [True]}]},
{'sourcerAnswers': [{'s': [261],
'e': [271],
'badQuestion': [False],
'noAnswer': [False]},
{'s': [258], 'e': [271], 'badQuestion': [False], 'noAnswer': [False]},
{'s': [261], 'e': [271], 'badQuestion': [False], 'noAnswer': [False]}]},
{'sourcerAnswers': [{'s': [26],
'e': [33],
'badQuestion': [False],
'noAnswer': [False]},
{'s': [0], 'e': [0], 'badQuestion': [False], 'noAnswer': [True]},
{'s': [624], 'e': [640], 'badQuestion': [False], 'noAnswer': [False]}]},
{'sourcerAnswers': [{'s': [195],
'e': [218],
'badQuestion': [False],
'noAnswer': [False]},
{'s': [195], 'e': [218], 'badQuestion': [False], 'noAnswer': [False]}]},
{'sourcerAnswers': [{'s': [0],
'e': [0],
'badQuestion': [False],
'noAnswer': [True]},
{'s': [195, 232],
'e': [218, 271],
'badQuestion': [False, False],
'noAnswer': [False, False]},
{'s': [0], 'e': [0], 'badQuestion': [False], 'noAnswer': [True]}]},
{'sourcerAnswers': [{'s': [129],
'e': [192],
'badQuestion': [False],
'noAnswer': [False]},
{'s': [129], 'e': [151], 'badQuestion': [False], 'noAnswer': [False]},
{'s': [133], 'e': [151], 'badQuestion': [False], 'noAnswer': [False]}]},
{'sourcerAnswers': [{'s': [195],
'e': [218],
'badQuestion': [False],
'noAnswer': [False]},
{'s': [195], 'e': [218], 'badQuestion': [False], 'noAnswer': [False]}]},
{'sourcerAnswers': [{'s': [294],
'e': [297],
'badQuestion': [False],
'noAnswer': [False]},
{'s': [294], 'e': [297], 'badQuestion': [False], 'noAnswer': [False]}]},
{'sourcerAnswers': [{'s': [294],
'e': [297],
'badQuestion': [False],
'noAnswer': [False]},
{'s': [294], 'e': [297], 'badQuestion': [False], 'noAnswer': [False]}]}],
'validated_answers': [{'s': [0, 294],
'e': [0, 297],
'badQuestion': [False, False],
'noAnswer': [True, False],
'count': [1, 2]},
{'s': [], 'e': [], 'badQuestion': [], 'noAnswer': [], 'count': []},
{'s': [624],
'e': [640],
'badQuestion': [False],
'noAnswer': [False],
'count': [2]},
{'s': [], 'e': [], 'badQuestion': [], 'noAnswer': [], 'count': []},
{'s': [195],
'e': [218],
'badQuestion': [False],
'noAnswer': [False],
'count': [2]},
{'s': [129],
'e': [151],
'badQuestion': [False],
'noAnswer': [False],
'count': [2]},
{'s': [], 'e': [], 'badQuestion': [], 'noAnswer': [], 'count': []},
{'s': [], 'e': [], 'badQuestion': [], 'noAnswer': [], 'count': []},
{'s': [], 'e': [], 'badQuestion': [], 'noAnswer': [], 'count': []}]}}
```
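
As a minimal sketch of how the offsets in the instance above can be read, the snippet below slices a shortened copy of the story text with three of the consensus spans. The helper name `consensus_answers` and the variable names are illustrative, not part of the dataset or the loading script, and the text is cut off after the third paragraph, so only spans that fall inside the excerpt are used.

```python
# Shortened copy of the 'text' field from the instance above.
text = (
    'NEW DELHI, India (CNN) -- A high court in northern India on Friday '
    'acquitted a wealthy businessman facing the death sentence for the '
    'killing of a teen in a case dubbed "the house of horrors."\n\n\n\n'
    'Moninder Singh Pandher was sentenced to death by a lower court in '
    'February.\n\n\n\n'
    'The teen was one of 19 victims -- children and young women -- in one '
    'of the most gruesome serial killings in India in recent years.'
)

# Three questions from the instance, with their consensus entries.
questions = {
    'q': [
        'Who was sentenced to death in February?',
        'What was Moninder Singh Pandher acquitted for?',
        'How many children and young women were murdered?',
    ],
    'consensus': [
        {'s': 195, 'e': 218, 'badQuestion': False, 'noAnswer': False},
        {'s': 129, 'e': 151, 'badQuestion': False, 'noAnswer': False},
        {'s': 294, 'e': 297, 'badQuestion': False, 'noAnswer': False},
    ],
}


def consensus_answers(text, questions):
    """Pair each question with its consensus answer text (None if no span)."""
    pairs = []
    for question, span in zip(questions['q'], questions['consensus']):
        if span['badQuestion'] or span['noAnswer']:
            pairs.append((question, None))
        else:
            # 's' is inclusive and 'e' is exclusive; the end may cover
            # trailing whitespace, so strip it for display.
            pairs.append((question, text[span['s']:span['e']].strip()))
    return pairs


answers = consensus_answers(text, questions)
for question, answer in answers:
    print(f'{question} -> {answer!r}')
```

Because the end offset is exclusive, `text[s:e]` is the whole span; stripping is only needed because the end may include whitespace after the last token.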

### Data Fields


Configuration: combined-csv
- 'story_id': An identifier of the story.
- 'story_text': Text of the story.
- 'question': A question about the story.
- 'answer_char_ranges': The raw, character-based index ranges of the answers in story_text, as collected from the crowdsourcers. E.g. `196:228|196:202,217:228|None`. Answers from different crowdsourcers are separated by `|`; within those, multiple selections from the same crowdsourcer are separated by `,`. `None` means the crowdsourcer thought there was no answer to the question in the story. The start is inclusive and the end is exclusive. The end may point to whitespace after a token.
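
The `answer_char_ranges` format described above can be unpacked with a few string splits. The following is a minimal sketch; `parse_char_ranges` is a hypothetical helper written for this illustration and is not provided by the dataset.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start inclusive, end exclusive)


def parse_char_ranges(raw: str) -> List[Optional[List[Span]]]:
    """Split a raw 'answer_char_ranges' string into per-crowdsourcer spans.

    Each list element corresponds to one crowdsourcer: either a list of
    (start, end) selections, or None if they found no answer.
    """
    per_sourcer: List[Optional[List[Span]]] = []
    for part in raw.split('|'):          # one part per crowdsourcer
        if part == 'None':
            per_sourcer.append(None)
            continue
        spans = []
        for selection in part.split(','):  # several selections per crowdsourcer
            start, end = selection.split(':')
            spans.append((int(start), int(end)))
        per_sourcer.append(spans)
    return per_sourcer


parsed = parse_char_ranges('196:228|196:202,217:228|None')
print(parsed)
```

Each `(start, end)` pair can then be applied directly as `story_text[start:end]`, since the end index is exclusive.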

Configuration: combined-json
- 'storyId': An identifier of the story.
- 'text': Text of the story.
- 'type': Split type. Will be "train", "validation" or "test".
- 'questions': A list containing the following:
  - 'q': A question about the story.
  - 'isAnswerAbsent': Proportion of crowdsourcers that said there was no answer to the question in the story.
  - 'isQuestionBad': Proportion of crowdsourcers that said the question does not make sense.
  - 'consensus': The consensus answer. Use this field to pick the best continuous answer span from the text. If you want to know whether a question has multiple answers in the text, use the more detailed "answers" and "validated_answers" fields. The object can have start and end positions like in the example above or can be {"badQuestion": true} or {"noAnswer": true}. Note that there is only one consensus answer since it's based on the majority agreement of the crowdsourcers.
    - 's': Start of the answer. The index of the first character of the answer in "text" (inclusive).
    - 'e': End of the answer. The index just past the last character of the answer in "text" (exclusive).
    - 'badQuestion': The validator said that the question did not make sense.
    - 'noAnswer': The crowdsourcer said that there was no answer to the question in the text.
  - 'answers': The answers from various crowdsourcers.
    - 'sourcerAnswers': The answer provided from one crowdsourcer.
      - 's': Start of the answer. The index of the first character of the answer in "text" (inclusive).
      - 'e': End of the answer. The index just past the last character of the answer in "text" (exclusive).
      - 'badQuestion': The crowdsourcer said that the question did not make sense.
      - 'noAnswer': The crowdsourcer said that there was no answer to the question in the text.
  - 'validated_answers': The answers from the validators.
    - 's': Start of the answer. The index of the first character of the answer in "text" (inclusive).
    - 'e': End of the answer. The index just past the last character of the answer in "text" (exclusive).
    - 'badQuestion': The validator said that the question did not make sense.
    - 'noAnswer': The validator said that there was no answer to the question in the text.
    - 'count': The number of validators that agreed with this answer.
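
To illustrate how the parallel lists in 'validated_answers' line up, here is a minimal sketch that picks the span most validators agreed on. The helper `best_validated_span` and its tie-breaking rule (first span wins) are assumptions made for this example; the two sample entries are copied from the data instance shown earlier.

```python
def best_validated_span(validated):
    """Return (s, e, count) for the most-agreed answer span, or None.

    Entries flagged as 'badQuestion' or 'noAnswer' carry no usable span
    and are skipped. Ties keep the earliest span (an arbitrary choice).
    """
    best = None
    rows = zip(
        validated['s'],
        validated['e'],
        validated['badQuestion'],
        validated['noAnswer'],
        validated['count'],
    )
    for start, end, bad_question, no_answer, count in rows:
        if bad_question or no_answer:
            continue
        if best is None or count > best[2]:
            best = (start, end, count)
    return best


# First two 'validated_answers' entries from the example instance.
with_votes = {
    's': [0, 294],
    'e': [0, 297],
    'badQuestion': [False, False],
    'noAnswer': [True, False],
    'count': [1, 2],
}
without_votes = {'s': [], 'e': [], 'badQuestion': [], 'noAnswer': [], 'count': []}

print(best_validated_span(with_votes))     # span chosen by two validators
print(best_validated_span(without_votes))  # question was never validated
```

An empty entry, as in the second sample, appears to mean that no validation answers were recorded for that question, in which case the 'consensus' field is the one to fall back on.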

Configuration: split
- 'story_id': An identifier of the story.
- 'story_text': Text of the story.
- 'question': A question about the story.
- 'answer_token_ranges': Word-based indices to answers in story_text. E.g. 196:202,217:228. Multiple selections from the same answer are separated by `,`. The start is inclusive and the end is exclusive. The end may point to whitespace after a token.
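
A rough sketch of reading 'answer_token_ranges' follows. It assumes that tokens are obtained by splitting the story on whitespace, which this card does not state, so check the tokenization against real rows before relying on it. The helper `token_spans` and the one-line story are made up for the illustration.

```python
def token_spans(story_text, answer_token_ranges):
    """Return the answer text for each word-index selection.

    Assumption: tokens are whitespace-separated words of story_text.
    """
    tokens = story_text.split()
    answers = []
    for selection in answer_token_ranges.split(','):
        start, end = (int(index) for index in selection.split(':'))
        # start is inclusive, end is exclusive, matching Python slicing
        answers.append(' '.join(tokens[start:end]))
    return answers


# Toy story, not taken from the dataset.
story = 'The court acquitted the businessman on Friday in New Delhi'
spans = token_spans(story, '1:3,8:10')
print(spans)
```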

### Data Splits

| name | train | validation | test |
|---------------|-----------:|-----------:|--------:|
| combined-csv | 119633 | | |
| combined-json | 12744 | | |
| split | 92549 | 5166 | 5126 |

## Dataset Creation
