ToPMine

I'm hosting the files originially provided by Ahmed El-Kishky on github with minimal changes.

From El-Kishky

Illimine Copyright

University of Illinois at Urbana-Champaign, 2014

illimine.cs.illinois.edu

Additional Copyright

This package contains the source code and the dataset used in the following paper:

@inproceedings{conf/vldb/ElkishkySWVH13, author = {Ahmed El-Kishky and Yanglei Song and Chi Wang and Clare R. Voss and Jiawei Han}, title = {Scalable Topical Phrase Mining from Text Corpora}, booktitle = {VLDB}, year = {2015}, }

If you use any contents in this package, please cite the above paper as your reference.

Code explanation

Place the input corpus in rawFiles with each document on a line Edit the parameters in run.sh(linux)/win_run.bat(for windows) for minimum support and topic modeling parameters run run.sh (for linux)/ win_run.bat (for windows, or just double click)

Parameter explanation: 1.) minsup = minumum support (minimum times a phrase candidate should appear in the corpus to be significant) 2.) maxpattern = max size you would like a phrase to be (if you don't want too long of phrases that occasionally occur) 3.) numTopics = number of topics - same as LDA 4.) Gibbs Sampling iterations = number of iterations done for inference (learning the parameters. Usually 500 is good, may do more if you like) 5.) thresh (significance) = the significance of a phrase. Equivalent to a z-score. I usually use 3 to 5. The higher it is, the fewer phrases will be found, but they will be of very high quality. 6.) Topic Model, two variants of PhraseLDA are used (choose 1 or 2). 2 is the default topic model.

The output should be in output, you will have corpus and topics

Corpus is the corpus partitioned into phrases, stopwords are removed except those in phrases each phrase is delimited by a comma

The topics file is a one-to-one and onto mapping between documents and topics for each phrase

Alert from Justin: This is not always true.

e.g

line 1 in corpus is document 1 line 1 in topics is the topics for phrases in document 1

example:

Line 1 corpus: Cow Dog Flower Line 1 topics: 2 2 1

Cow is topic 2, dog is topic 2, flower is topic 1

there are also output files for the top phrases (ranked by their frequency)

For More Questions

Please contact illimine.cs.illinois.edu or Ahmed El-Kishky (elkishk2@illinois.edu) or Yanglei Song (ysong44@illinois.edu)

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
TopicalPhrases		TopicalPhrases
external		external
output		output
.gitignore		.gitignore
LICENSE.html		LICENSE.html
README.md		README.md
cleanTmp.sh		cleanTmp.sh
commasToPhrases.py		commasToPhrases.py
makefile		makefile
run.sh		run.sh
setEnv.sh		setEnv.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ToPMine

From El-Kishky

Illimine Copyright

Additional Copyright

Code explanation

For More Questions

About

Releases

Packages

Languages

License

JSybrandt/ToPMine

Folders and files

Latest commit

History

Repository files navigation

ToPMine

From El-Kishky

Illimine Copyright

Additional Copyright

Code explanation

For More Questions

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages