The starting code repository was taken from the CS 410 Text Information Systems course at UIUC in Fall 2021.
The Probabilistic Latent Semantic Analysis (PLSA) algorithm is a statistical model used for analyzing and discovering hidden semantic structures within a large collection of documents, often referred to as a corpus. PLSA is commonly used in natural language processing and information retrieval tasks, such as document clustering, topic modeling, and recommendation systems.
PLSA is a generative model that aims to represent the observed co-occurrence patterns of words and documents in terms of latent (hidden) topics. It assumes that each document is a mixture of topics, and each topic is a mixture of words. The underlying idea is that the generation of a word within a document is governed by a combination of the topics present in that document.
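A compact way to state this generative assumption (standard PLSA notation, not taken from the original course material) is that the probability of observing word w in document d is

```math
p(w \mid d) = \sum_{j=1}^{k} p(z = j \mid d)\, p(w \mid z = j)
```

where p(z = j | d) is the coverage of topic j in document d and p(w | z = j) is the probability of word w under topic j. These two sets of parameters correspond to the D and T matrices described below.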
The `test.txt` data set contains 1000 lines, where each line is a document. The first 100 documents are labeled with the topic the document belongs to, either 0 ("Seattle") or 1 ("Chicago"). Each of the first 100 documents is generated by sampling entirely from its labeled topic (i.e., generated from one topic only). The remaining 900 documents are generated from a mixture model of the topics "Seattle" and "Chicago".

The `DBLP.txt` data set is made up of research paper abstracts from DBLP. Each line is a document. We first test the code on the simpler data set `test.txt` before running it on `DBLP.txt`.
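As a minimal sketch of how a corpus in this line-per-document format can be read (the function name `read_corpus` and the whitespace tokenization are assumptions for illustration, not part of the provided skeleton):

```python
def read_corpus(path):
    """Read a corpus file in which each line is one document.

    Returns a list of documents, each represented as a list of word
    tokens (assumes simple whitespace tokenization).
    """
    documents = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.strip().split()
            if tokens:
                documents.append(tokens)
    return documents


docs = read_corpus("test.txt")
print(len(docs))  # 1000 documents for test.txt
```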
I started with the provided code skeleton `plsa.py` for my implementation.
The main data structures involved in the implementation of this EM algorithm are three matrices:
- T (topics by words): this is the set of parameters characterizing topic content, which we denote by the θ_i's. Each element is the probability of a particular word in a particular topic.
- D (documents by topics): this is the set of parameters modeling the coverage of topics in each document, which we denote by the p_ij's. Each element is the probability that a particular topic is covered in a particular document.
- Z (hidden variables): for every document, we need one Z that represents the probability that each word in the document has been generated from a particular topic, so for any document this is a "word-by-topic" matrix, encoding p(Z|w) for that particular document. Z is the matrix that we compute in the E-step (based on matrices T and D, which represent our parameters). Note that we need to compute a different Z for each document, so we allocate a matrix Z for every document. If we do so, the M-step simply uses all these Z matrices together with the word counts in each document to re-estimate all the parameters, i.e., to update matrices T and D based on Z. Thus, at a high level, this is what happens in the algorithm:
- T and D are initialized.
- E-step computes all Z's based on T and D.
- M-step uses all Z's to update T and D.
- We iterate until the likelihood doesn't change much, at which point we use T and D as our output. Note that the Z matrices are also very useful in their own right.
- Using matrices is strongly encouraged (writing nested for-loops for the calculations is painful); a vectorized sketch of one EM iteration follows this list.
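Here is a minimal vectorized sketch of one EM iteration under this setup, using NumPy. The array names and shapes follow the T/D/Z description above, but the exact variable names, the stacking of all per-document Z matrices into one 3-D array, and the absence of any smoothing or background model are assumptions for illustration, not necessarily how `plsa.py` organizes the code:

```python
import numpy as np

def em_step(counts, T, D, eps=1e-12):
    """One EM iteration of PLSA (no background model).

    counts : (N, V) array of word counts, counts[d, w] = c(w, d)
    T      : (k, V) array, T[j, w] = p(w | topic j), rows sum to 1
    D      : (N, k) array, D[d, j] = p(topic j | doc d), rows sum to 1

    Returns updated (T, D) and the hidden-variable array Z of shape
    (N, V, k), where Z[d, w, j] = p(z = j | d, w).
    """
    # E-step: p(z = j | d, w) is proportional to p(z = j | d) * p(w | z = j)
    Z = D[:, None, :] * T.T[None, :, :]        # shape (N, V, k)
    Z /= Z.sum(axis=2, keepdims=True) + eps    # normalize over topics

    # M-step: re-estimate parameters from expected counts c(w, d) * Z
    weighted = counts[:, :, None] * Z          # shape (N, V, k)

    D = weighted.sum(axis=1)                   # sum over words -> (N, k)
    D /= D.sum(axis=1, keepdims=True) + eps

    T = weighted.sum(axis=0).T                 # sum over documents -> (k, V)
    T /= T.sum(axis=1, keepdims=True) + eps

    return T, D, Z
```

The outer loop would repeat `em_step` and monitor the log-likelihood, e.g. `np.sum(counts * np.log(D @ T + eps))`, stopping once the change between iterations falls below a small threshold. Keeping all Z matrices in a single (N, V, k) array is memory-hungry for large vocabularies, so a real implementation may process documents in batches.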
This PLSA implementation does not use a background model. I adjust the traditional formulas by assuming zero probability that the background model is chosen, i.e., I set λB to zero. With this choice, the E-step essentially only computes a distribution over the k topics for each word, i.e., p(z = i | w) for z = 1, ..., k. The M-step is also simpler, because p(Z = B | w) is zero for all words (due to the fact that λB = 0).
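To make the simplification explicit, in the usual mixture-with-background formulation the E-step for the background indicator looks like the following (I write π_{d,j} for document d's coverage of topic j and θ_j for topic j's word distribution; the exact symbols are my choice for illustration):

```math
p(z = B \mid d, w) = \frac{\lambda_B \, p(w \mid \theta_B)}{\lambda_B \, p(w \mid \theta_B) + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j} \, p(w \mid \theta_j)}
```

Setting λB = 0 makes the numerator vanish, so p(z = B | d, w) = 0 and the per-word topic distribution reduces to

```math
p(z = i \mid d, w) = \frac{\pi_{d,i} \, p(w \mid \theta_i)}{\sum_{j=1}^{k} \pi_{d,j} \, p(w \mid \theta_j)}
```

which is exactly the E-step computed in the code sketch above.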