SOCC

SFU Opinion and Comments Corpus

The SFU Opinion and Comments Corpus (SOCC) is a corpus for the analysis of online news comments. Our corpus contains comments and the articles from which the comments originated. The articles are all opinion articles, not hard news articles. The corpus is larger than any other currently available comment corpus, and has been collected with attention to preserving reply structures and other metadata. In addition to the raw corpus, we also present annotations for four different phenomena: constructiveness, toxicity, negation and its scope, and appraisal.

For more information about this work, please see our papers.

Kolhatkar, V. and M. Taboada (2017) Using New York Times Picks to identify constructive comments. Proceedings of the Workshop Natural Language Processing Meets Journalism, Conference on Empirical Methods in Natural Language Processing. Copenhagen. September 2017.
Kolhatkar, V. and M. Taboada (2017) Constructiveness in news comments. Proceedings of the 1st Abusive Language Online Workshop, 55th Annual Meeting of the Association for Computational Linguistics. Vancouver. August 2017, pp. 11-17.

The data is divided into two main parts. The annotated part is in turn divided into three corpora (constructiveness and toxicity, negation, and Appraisal):

Raw data
Annotated data

Raw data

The corpus contains 10,339 opinion articles (editorials, columns, and op-eds) together with their 663,173 comments from 304,099 comment threads, from the main Canadian daily in English, The Globe and Mail, for a five-year period (from January 2012 to December 2016). We organize our corpus into three sub-corpora, each stored as a CSV file: the articles corpus (gnm_articles.csv), the comments corpus (gnm_comments.csv), and the comment-threads corpus (gnm_comment_threads.csv).

gnm_articles.csv

This CSV contains information about The Globe and Mail articles in our dataset. Below we describe fields in this CSV.

article_id
A unique identifier for the article. We use this identifier in the comments CSV. You'll also see this identifier in the article url. (E.g., 26691065)

title
The title or the headline of The Globe and Mail opinion article. (E.g., Fifty years in Canada, and now I feel like a second-class citizen)

article_url
The Globe and Mail url for the article. (E.g., http://www.theglobeandmail.com/opinion/fifty-years-in-canada-and-now-i-feel-like-a-second-class-citizen/article26691065/)

author
The author of the opinion article.

published_date
The date when the article was published. (E.g., 2015-10-16 EDT)

ncomments
The number of comments in the comments corpus for this article.

ntop_level_comments
The number of top-level comments in the comments corpus for this article.

article_text
The article text. We have preserved the paragraph structure in the text with paragraph tags.
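As a rough illustration of how these fields fit together, the sketch below reads article rows with Python's standard csv module, checks that the article_id appears at the end of article_url, and splits article_text into paragraphs. It assumes that the paragraph tags are literal `<p>`...`</p>` pairs and that the CSV has a header row with the field names listed above. Neither detail is confirmed here, and the helper names are hypothetical.

```python
import csv
import io
import re

# Assumption: paragraphs are wrapped in literal <p>...</p> tags.
PARAGRAPH_RE = re.compile(r"<p>(.*?)</p>", re.DOTALL)


def split_paragraphs(article_text):
    """Return the paragraphs of an article as a list of stripped strings."""
    paragraphs = [p.strip() for p in PARAGRAPH_RE.findall(article_text)]
    # Fall back to the whole text if no paragraph tags are found.
    return paragraphs or [article_text.strip()]


def url_matches_id(article_id, article_url):
    """Check that the URL ends in 'article<article_id>/', as in the example URL."""
    match = re.search(r"article(\d+)/?$", article_url)
    return match is not None and match.group(1) == str(article_id)


def read_articles(handle):
    """Yield one dict per article from an open gnm_articles.csv-style file."""
    for row in csv.DictReader(handle):
        row["paragraphs"] = split_paragraphs(row["article_text"])
        yield row


# Tiny in-memory stand-in for gnm_articles.csv.
# The id and URL are the examples from this README; the article text is made up.
sample = io.StringIO(
    "article_id,article_url,article_text\n"
    "26691065,"
    "http://www.theglobeandmail.com/opinion/"
    "fifty-years-in-canada-and-now-i-feel-like-a-second-class-citizen/"
    "article26691065/,"
    "<p>First paragraph.</p><p>Second paragraph.</p>\n"
)
articles = list(read_articles(sample))
```

To run this against the real file, pass an open file handle instead of the in-memory sample, for example `open("gnm_articles.csv", newline="", encoding="utf-8")`; the encoding is also an assumption.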

gnm_comments.csv

This CSV contains all unique comments (663,173 comments) in response to the articles in gnm_articles.csv, after removing duplicates and comments with large overlap. The corpus is useful for studying individual comments, i.e., without considering their location in the comment thread structure. Below we describe the fields in this CSV.

article_id
A unique identifier for the article. It links each comment to its article in gnm_articles.csv. You'll also see this identifier in the article url. (E.g., 26691065)

comment_counter
The comment counter which encodes the position and depth of a comment in a comment thread. Below are some examples.

First top-level comment: source1_article-id_0
First child of the top-level comment: source1_article-id_0_0
Second child of the top-level comment: source1_article-id_0_1
Grandchildren: source1_article-id_0_0_0, source1_article-id_0_0_1
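The counter examples above can be decoded mechanically. The following is a minimal sketch, assuming every counter has the form `<source>_<article_id>_<index>[_<index>...]` and that neither the source name nor the article id contains an underscore. The names `CommentPosition`, `parse_comment_counter`, and `parent_counter` are hypothetical helpers, not part of the corpus.

```python
from typing import NamedTuple, Tuple


class CommentPosition(NamedTuple):
    source: str            # e.g. "source1"
    article_id: str        # e.g. "26691065"
    path: Tuple[int, ...]  # child indices from the top-level comment down

    @property
    def depth(self):
        """0 for a top-level comment, 1 for its replies, and so on."""
        return len(self.path) - 1


def parse_comment_counter(counter):
    """Split a comment_counter such as 'source1_26691065_0_1' into its parts."""
    source, article_id, *indices = counter.split("_")
    if not indices:
        raise ValueError(f"no position indices in {counter!r}")
    return CommentPosition(source, article_id, tuple(int(i) for i in indices))


def parent_counter(counter):
    """Return the counter of the parent comment, or None for a top-level comment."""
    if parse_comment_counter(counter).depth == 0:
        return None
    # Dropping the last index gives the parent's counter.
    return counter.rsplit("_", 1)[0]
```

For example, `parse_comment_counter("source1_26691065_0_1")` yields the path `(0, 1)` at depth 1, i.e. the second child of the first top-level comment.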

comment_author
The username of the author of the comment.

timestamp
The timestamp indicating the posting time of the comment. Comments from source1 have this field.

post_time
The posting time of the comment. Comments from source2 have this field.

comment_text
The comment text. The text is minimally preprocessed. We have cleaned the HTML tags and have done preliminary word segmentation to fix missing spaces after punctuation.

TotalVotes
The total votes (positive votes + negative votes)

posVotes
The positive votes received by the comment.

negVotes
The negative votes received by the comment.

vote
Not sure. A Field from the scraped comments JSON.

reactions
A list of reactions of other commenters to this comment. The comments from source2 occasionally have reactions. Here is an example:

{u'reaction_list': [{u'reaction_user': u'areukiddingme', u'reaction': u'disagree', u'reaction_time': u'Dec 13, 2016'}, {u'reaction_user': u'Mark Shore', u'reaction': u'like', u'reaction_time': u'Dec 13, 2016'}], u'reaction_counts': [u'All 2']}
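The example above looks like the printed form of a Python dictionary (note the `u'...'` string prefixes) rather than JSON, so a JSON parser would reject it. A minimal sketch for reading it, assuming all non-empty values of this field follow the same shape; `parse_reactions` and `reaction_counts` are hypothetical helper names:

```python
import ast
from collections import Counter


def parse_reactions(raw):
    """Parse the reactions field into a dict; return None if empty or malformed."""
    if not raw or not raw.strip():
        return None
    try:
        # literal_eval only evaluates literals, so it is safe on untrusted text.
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return None
    return value if isinstance(value, dict) else None


def reaction_counts(raw):
    """Count reactions by type for one comment."""
    parsed = parse_reactions(raw)
    if parsed is None:
        return Counter()
    return Counter(
        r["reaction"] for r in parsed.get("reaction_list", []) if "reaction" in r
    )


example = (
    "{u'reaction_list': [{u'reaction_user': u'areukiddingme', "
    "u'reaction': u'disagree', u'reaction_time': u'Dec 13, 2016'}, "
    "{u'reaction_user': u'Mark Shore', u'reaction': u'like', "
    "u'reaction_time': u'Dec 13, 2016'}], u'reaction_counts': [u'All 2']}"
)
counts = reaction_counts(example)
```

For the example value, `counts` holds one `disagree` and one `like`.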

replies
A flag indicating whether the comment has replies or not.

comment_id
The comment identifier from the scraped comments JSON

parentID
The parent's identifier from the scraped comments JSON

threadID
The thread identifier from the scraped comments JSON

streamId
The stream identifier from the scraped comments JSON

edited
A Field from the scraped comments JSON. Guess: Whether the comment is edited or not.

isModerator
A Field from the scraped comments JSON. Guess: Whether the commenter is a moderator. The value is usually False.

highlightGroups
Not sure. A Field from the scraped comments JSON.

moderatorEdit
Not sure. A Field from the scraped comments JSON. Guess: Whether the comment is edited by the moderator or not.

descendantsCount
Not sure. A Field from the scraped comments JSON. Guess: The number of descendants in the thread structure.

threadTimestamp
The thread time stamp from the scraped JSON.

flagCount
Not sure. A Field from the scraped comments JSON.

sender_isSelf
Not sure. A Field from the scraped comments JSON.

sender_loginProvider
The login provider (e.g., Facebook, GooglePlus, LinkedIn, Twitter, Google)

data_type
A Field from the scraped comments JSON, usually marked as 'comment'.

is_empty
Not sure. A Field from the scraped comments JSON. Guess: Whether the comment is empty or not.

status
The status of the comment (e.g., published, rejected, deleted)

gnm_comment_threads.csv

This CSV contains all 304,099 unique comment threads in response to the articles in gnm_articles.csv. This CSV can be used to study online conversations.

The fields in this CSV are the same as those of gnm_comments.csv.
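Because comment_counter encodes thread position, one way to study conversations is to group rows by their top-level comment and order the replies. The sketch below assumes counters of the form `<source>_<article_id>_<index>[_<index>...]`; the function names and the sample rows (including the comment texts) are made up for illustration.

```python
from collections import defaultdict


def thread_key(counter):
    """Return the counter of the top-level comment that starts the thread."""
    # Assumption: the first three '_'-separated parts identify the top-level comment.
    return "_".join(counter.split("_")[:3])


def position(counter):
    """Return the index path below the article, e.g. (0, 1) for '..._0_1'."""
    return tuple(int(i) for i in counter.split("_")[2:])


def group_threads(rows):
    """Map each thread key to its comments, with parents sorted before children."""
    threads = defaultdict(list)
    for row in rows:
        threads[thread_key(row["comment_counter"])].append(row)
    for comments in threads.values():
        # Tuple ordering gives a depth-first traversal of the reply tree.
        comments.sort(key=lambda r: position(r["comment_counter"]))
    return dict(threads)


rows = [
    {"comment_counter": "source1_26691065_0_1", "comment_text": "second reply"},
    {"comment_counter": "source1_26691065_1", "comment_text": "another thread"},
    {"comment_counter": "source1_26691065_0", "comment_text": "top-level"},
    {"comment_counter": "source1_26691065_0_0_0", "comment_text": "grandchild"},
    {"comment_counter": "source1_26691065_0_0", "comment_text": "first reply"},
]
threads = group_threads(rows)
```

Sorting by the index path places each reply directly after its parent and that parent's earlier descendants, which is convenient for printing a thread in reading order.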

Annotated data

SFU constructiveness and toxicity corpus

We annotated a subset of SOCC for constructiveness and toxicity. The annotated corpus is organized as a CSV and contains 1,043 annotated comments in response to 10 different articles covering a variety of subjects: technology, immigration, terrorism, politics, budget, social issues, religion, property, and refugees. For half of the articles, we included only top-level comments. For the other half, we included both top-level comments and responses. We used CrowdFlower (https://www.crowdflower.com/) as our crowdsourcing annotation platform and annotated the comments for constructiveness. We asked the annotators to first read the articles, and then to tell us whether the displayed comment was constructive or not.

For toxicity, we asked annotators a multiple-choice question, How toxic is the comment? Four answers were possible:

Very toxic
Toxic
Mildly toxic
Not toxic

More information on the annotation, and the instructions to annotators, is available in the CrowdFlower_instructions file.

SFU_constructiveness_toxicity_corpus.csv

article_id
A unique identifier for the article. This identifier can be used to link the comment to the appropriate article from gnm_articles.csv in the raw corpus.

comment_counter
The comment counter which encodes the position and depth of a comment in a comment thread. The comment counter can be used to link the comment to the raw corpus.

title
The title of The Globe and Mail opinion article.

globe_url
The URL of the article on The Globe and Mail.

comment_text
The comment text.

is_constructive
Crowd's annotation on constructiveness (yes, no, not sure)

is_constructive:confidence
Crowd's confidence (between 0 and 1.0) about the answer. In CrowdFlower terminology, each annotator has a trust level based on how they perform on the gold examples, and each answer has a confidence, which is a normalized score of the summation of the trusts associated with annotators.
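The confidence description above can be read as a trust-weighted agreement score. The sketch below assumes that the confidence of an answer is the summed trust of the annotators who chose it, divided by the summed trust of all annotators who judged the comment. This is one reading of the sentence above, not a verified CrowdFlower formula, and the trust values in the example are invented.

```python
from collections import defaultdict


def aggregate(judgments):
    """Return (winning_answer, confidence) for a list of (answer, trust) pairs."""
    trust_by_answer = defaultdict(float)
    for answer, trust in judgments:
        trust_by_answer[answer] += trust
    total = sum(trust_by_answer.values())
    if total <= 0:
        raise ValueError("need at least one judgment with positive trust")
    # The answer with the largest summed trust wins.
    best = max(trust_by_answer, key=trust_by_answer.get)
    return best, trust_by_answer[best] / total


# Three hypothetical annotators with made-up trust levels.
answer, confidence = aggregate([("yes", 0.9), ("yes", 0.8), ("no", 0.7)])
```

Under this reading, a confidence of 1.0 means every annotator who judged the comment gave the same answer.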

toxic_level
Crowd's annotation on the toxicity level of the comment

toxic_level:confidence
Crowd's confidence (between 0 and 1.0) about the answer.

did_you_read_the_article
Whether the annotator has read the article or not.

did_you_read_the_article:confidence
Crowd's confidence (between 0 and 1.0) about the answer.

annotator_comments
Free text comments from the annotators.

expert_is_constructive
Expert's judgement on constructiveness of the comment.

expert_toxicity_level
Expert's judgement on the toxicity level of the comment.

expert_comments
Expert's free text comments on crowd's annotations.
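Since article_id and comment_counter link each annotated comment back to the raw corpus, the annotated rows can be extended with raw metadata such as votes. The following is a minimal sketch using only the standard library. It assumes that the pair (article_id, comment_counter) identifies a single row in gnm_comments.csv; the function names and the in-memory sample rows are hypothetical.

```python
import csv
import io


def index_by_key(handle):
    """Index CSV rows by (article_id, comment_counter)."""
    return {
        (row["article_id"], row["comment_counter"]): row
        for row in csv.DictReader(handle)
    }


def attach_raw_fields(annotated_handle, raw_handle, fields):
    """Yield annotated rows extended with the chosen fields from the raw comments."""
    raw = index_by_key(raw_handle)
    for row in csv.DictReader(annotated_handle):
        match = raw.get((row["article_id"], row["comment_counter"]))
        for field in fields:
            # None marks annotated comments that have no match in the raw file.
            row[field] = match[field] if match else None
        yield row


# In-memory stand-ins for gnm_comments.csv and the annotated CSV (made-up rows).
raw_csv = io.StringIO(
    "article_id,comment_counter,comment_author,posVotes\n"
    "26691065,source1_26691065_0,someone,3\n"
)
annotated_csv = io.StringIO(
    "article_id,comment_counter,is_constructive\n"
    "26691065,source1_26691065_0,yes\n"
    "26691065,source1_26691065_9,no\n"
)
merged = list(attach_raw_fields(annotated_csv, raw_csv, ["comment_author", "posVotes"]))
```

The raw comments file is large, so building the index in memory may be costly; filtering it to the 10 annotated article ids first would be a reasonable refinement.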

SFU negation corpus

The negation annotations were performed using WebAnno. You can see WebAnno server installation instructions on our GitHub page.

The guidelines directory contains a full description of the annotation guidelines. The annotations are made available as a project in .tsv files from WebAnno.

The WebAnno directory is structured in folders. Each folder is named with a comment ID (the same as in the raw corpus), and inside is a .tsv file with the annotations. The annotations were exported from WebAnno in CoNLL format. The annotations can be viewed from the .tsv files using a document viewer, and can also be imported back into WebAnno, a process which we detail in the WebAnno instructions (see link above).

SFU Appraisal corpus

The Appraisal annotations were performed using WebAnno. The structure of the corpus is identical to that of the negation corpus. Guidelines for Appraisal annotation are available in the guidelines directory.
