OCTIS/preprocessed_datasets at master · dash-uvic/OCTIS

Name	Source	# Docs	# Words	# Labels
20NewsGroup	20Newsgroup	16309	1612	20
BBC_News	BBC-News	2225	2949	5
DBLP	DBLP	54595	1513	4
M10	M10	8355	1696	10

Load one of the already preprocessed datasets as follows:

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.fetch_dataset("20NewsGroup")

Use one of the dataset names listed in the table above. Note that the names are case-sensitive.
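
Because the names are case-sensitive, it can help to check a name before fetching. The sketch below is only an illustration: resolve_dataset_name is a hypothetical helper, not part of OCTIS, and the tuple of names is copied from the table above.

```python
# Dataset names copied from the table above; matching is case-sensitive.
AVAILABLE_DATASETS = ("20NewsGroup", "BBC_News", "DBLP", "M10")


def resolve_dataset_name(name):
    """Return `name` if it matches a listed dataset exactly, else raise ValueError.

    Hypothetical helper for illustration; it is not part of OCTIS.
    """
    if name in AVAILABLE_DATASETS:
        return name
    # Look for a listed name that differs only in letter case, to give a hint.
    by_lowercase = {known.lower(): known for known in AVAILABLE_DATASETS}
    suggestion = by_lowercase.get(name.lower())
    if suggestion is not None:
        raise ValueError(f"Unknown dataset {name!r}; did you mean {suggestion!r}?")
    raise ValueError(f"Unknown dataset {name!r}; choose one of {AVAILABLE_DATASETS}")
```

The checked name could then be passed on, for example dataset.fetch_dataset(resolve_dataset_name("20NewsGroup")).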

Otherwise, you can load a custom preprocessed dataset in the following way:

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder("../path/to/the/dataset/folder")

Make sure that the dataset is in the following format:

corpus file: a .tsv file (tab-separated) that contains up to three columns, i.e. the document, the partition, and the label associated with the document (optional).
vocabulary: a .txt file where each line represents a word of the vocabulary

The partition can be "training", "test" or "validation". An example dataset can be found here: sample_dataset_.
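
As a rough illustration of the format described above, the following sketch writes a small dataset folder using only the Python standard library. The function write_dataset_folder is hypothetical, and the file names corpus.tsv and vocabulary.txt are assumptions: the description above only specifies the file extensions, so check the sample dataset for the names the loader expects.

```python
import csv
from pathlib import Path

# Partition values allowed by the format described above.
VALID_PARTITIONS = {"training", "test", "validation"}


def write_dataset_folder(folder, rows, vocabulary):
    """Write (document, partition, label) rows and a vocabulary to `folder`.

    Hypothetical helper. The file names corpus.tsv and vocabulary.txt are
    assumptions, not confirmed by the text above. A label of None leaves
    out the optional third column for that row.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)

    with open(folder / "corpus.tsv", "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for document, partition, label in rows:
            if partition not in VALID_PARTITIONS:
                raise ValueError(f"Invalid partition: {partition!r}")
            row = [document, partition]
            if label is not None:
                row.append(label)
            writer.writerow(row)

    # One vocabulary word per line.
    with open(folder / "vocabulary.txt", "w", encoding="utf-8") as handle:
        for word in vocabulary:
            handle.write(word + "\n")
```

If the assumed file names match what the loader expects, the resulting folder path is what would be passed to dataset.load_custom_dataset_from_folder.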