Source: http://brandonrose.org/clustering (modified by: kirra)
- Kompas online collection. This corpus contains Kompas online news articles from 2001-2002. See here for more info and citations.
- Tempo online collection. This corpus contains Tempo online news articles from 2000-2002. See here for more info and citations.
- tokenizing and stemming each article (Bahasa Indonesia)
- transforming the corpus into vector space using tf-idf
- calculating cosine distance between each document as a measure of similarity
- clustering the documents using the k-means algorithm
- using multidimensional scaling to reduce dimensionality within the corpus
- plotting the clustering output using matplotlib and mpld3
- conducting a hierarchical clustering on the corpus using Ward clustering
- plotting a Ward dendrogram
- topic modeling using Latent Dirichlet Allocation (LDA)
- download the news corpora (Kompas and Tempo) and extract them to the folder "data"
- create a Python virtual environment >>> $ virtualenv env
- activate the virtual environment >>> $ source env/bin/activate
- install all dependencies >>> $ pip install -r requirements.txt
- run Jupyter >>> $ jupyter notebook
- open file "Clustering.ipynb"