Skip to content

Wikipedia page conflict analyzer project developed for Negapedia

License

Notifications You must be signed in to change notification settings

negapedia/wikitfidf

Repository files navigation

Wikipedia TFIDF Analyzer

Go Report Card GoDoc Bugs Coverage Lines of Code Maintainability Rating Reliability Rating Security Rating Vulnerabilities Build Status
Negapedia TFIDF Analyzer analyze Wikipedia's dumps and makes statistical analysis on reverts text.
The data produced in output can be used to clarify the theme of the contrast inside a Wikipedia page.

Handled languages

english, arabic, danish, dutch, finnish, french, german, greek, hungarian, indonesian, italian, kazakh, nepali, portuguese, romanian, russian, spanish, swedish, turkish, armenian, azerbaijani, basque, bengali, bulgarian, catalan, chinese, croatian, czech, galician, hebrew, hindi, irish, japanese, korean, latvian, lithuanian, marathi, persian, polish, slovak, thai, ukrainian, urdu, simple-english
This kind of data come from Negapedia/nltk

Badwords handled languages

english, arabic, danish, dutch, finnish, french, german, hungarian, italian, portuguese, spanish, swedish, chinese, czech, hindi, japanese, korean, persian, polish, thai, simple-english
This kind of data come from Negapedia/badwords

Outuput files

  • GlobalPagesTFIDF.json: contains for every page the list of words associated with their absolute frequency and tf-idf value;
  • GlobalPagesTFIDF_topNwords.json: as GlobalPagesTFIDF.json, but are reported only the most important N words (in term of tf-idf value);
  • GlobalWords.json: contains all the analyzed wiki's words associated with their absolute frequency;
  • GlobalTopic.json: contains all the words in every topic (using Negapedia topics);
  • BadWordsReport.json: contains for every page which has them, a list of badwords associated with their absolute frequency.

Minimum and Recommended Requirements

The minimum requirements which are needed for executing the project in reasonable times are:

  • At least 4 cores-8 threads CPU;
  • At least 16GB of RAM (required);
  • At least 300GB of disk space.

However the recommended requirements are:

  • 32GB of RAM or more (highly recommended).

Usage

Building docker image

docker build -t <image_name> .
from the root of repository directory.

Running docker image

docker run -d -v <path_on_fs_where_to_save_results>:<container_results_path> <image_name>
example:
docker run -d -v /path/2/out/dir:/data my_image

Executions flags

  • -lang: wiki language;
  • -d: container result dir;
  • -s: revert starting date to consider;
  • -e: revert ending date to consider;
  • -specialList: special page list to consider;
  • -rev: number of revert to consider;
  • -topPages: number of top words per page to consider;
  • -topWords: number of top words of global words to consider;
  • -topTopic: number of top words per topic to consider;
  • -delete: if true, after compressing results directory will be deleted (default: true);
  • -test: if true, logs are shown and is processed a single dump.


example:
docker run -v /path/2/out/dir:/data wikitfidf dothething -lang it

Installation

Go packages can be installed by:
go get github.com/negapedia/wikitfidf
and docker image can be downloaded by:
docker pull negapedia/wikitfidf