The starting code repository was taken from the CS 410 Text Information Systems course at UIUC in Fall 2021.
The Probabilistic Latent Semantic Analysis (PLSA) algorithm is a statistical model used for analyzing and discovering hidden semantic structures within a large collection of documents, often referred to as a corpus. PLSA is commonly used in natural language processing and information retrieval tasks, such as document clustering, topic modeling, and recommendation systems.
PLSA is a generative model that aims to represent the observed co-occurrence patterns of words and documents in terms of latent (hidden) topics. It assumes that each document is a mixture of topics, and each topic is a mixture of words. The underlying idea is that the generation of a word within a document is governed by a combination of the topics present in that document.
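A compact way to state this generative assumption (standard PLSA notation, not taken from the original course material) is that the probability of observing word w in document d is

```math
p(w \mid d) = \sum_{j=1}^{k} p(z = j \mid d)\, p(w \mid z = j)
```

where p(z = j | d) is the coverage of topic j in document d and p(w | z = j) is the probability of word w under topic j. These two sets of parameters correspond to the D and T matrices described below.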
The `test.txt` data set contains 1000 lines, where each line is a document. The first 100 documents are labeled with the topic the document belongs to, either 0 ("Seattle") or 1 ("Chicago"). Each of the first 100 documents is generated by sampling entirely from its labeled topic (i.e., generated from one topic only). The remaining 900 documents are generated from a mixture model of the topics "Seattle" and "Chicago".

The `DBLP.txt` data set is made up of research paper abstracts from DBLP. Each line is a document. We first test the code on the simpler data set `test.txt` before running it on `DBLP.txt`.
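As a minimal sketch of how a corpus in this line-per-document format can be read (the function name `read_corpus` and the whitespace tokenization are assumptions for illustration, not part of the provided skeleton):

```python
def read_corpus(path):
    """Read a corpus file in which each line is one document.

    Returns a list of documents, each represented as a list of word
    tokens (assumes simple whitespace tokenization).
    """
    documents = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.strip().split()
            if tokens:
                documents.append(tokens)
    return documents


docs = read_corpus("test.txt")
print(len(docs))  # 1000 documents for test.txt
```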
I started with the provided code skeleton `plsa.py` for my implementation.
The main data structures involved in the implementation of this EM algorithm are three matrices:
- T (topics by words): this is the set of parameters characterizing topic content, which we denote by the θ_i's. Each element is the probability of a particular word in a particular topic.
- D (documents by topics): this is the set of parameters modeling the coverage of topics in each document, which we denote by the p_ij's. Each element is the probability that a particular topic is covered in a particular document.
- Z (hidden variables): for every document, we need one Z that represents the probability that each word in the document has been generated from a particular topic, so for any document this is a "word-by-topic" matrix, encoding p(Z|w) for that particular document. Z is the matrix that we compute in the E-step (based on matrices T and D, which represent our parameters). Note that we need to compute a different Z for each document, so we allocate a matrix Z for every document. If we do so, the M-step simply uses all these Z matrices together with the word counts in each document to re-estimate all the parameters, i.e., to update matrices T and D based on Z. Thus, at a high level, this is what happens in the algorithm:
- T and D are initialized.
- E-step computes all Z's based on T and D.
- M-step uses all Z's to update T and D.
- We iterate until the likelihood doesn't change much, at which point we use T and D as our output. Note that the Z matrices are also very useful in their own right.
- Using matrices is strongly encouraged (writing nested for-loops for the calculations is painful); a vectorized sketch of one EM iteration follows this list.
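Here is a minimal vectorized sketch of one EM iteration under this setup, using NumPy. The array names and shapes follow the T/D/Z description above, but the exact variable names, the stacking of all per-document Z matrices into one 3-D array, and the absence of any smoothing or background model are assumptions for illustration, not necessarily how `plsa.py` organizes the code:

```python
import numpy as np

def em_step(counts, T, D, eps=1e-12):
    """One EM iteration of PLSA (no background model).

    counts : (N, V) array of word counts, counts[d, w] = c(w, d)
    T      : (k, V) array, T[j, w] = p(w | topic j), rows sum to 1
    D      : (N, k) array, D[d, j] = p(topic j | doc d), rows sum to 1

    Returns updated (T, D) and the hidden-variable array Z of shape
    (N, V, k), where Z[d, w, j] = p(z = j | d, w).
    """
    # E-step: p(z = j | d, w) is proportional to p(z = j | d) * p(w | z = j)
    Z = D[:, None, :] * T.T[None, :, :]        # shape (N, V, k)
    Z /= Z.sum(axis=2, keepdims=True) + eps    # normalize over topics

    # M-step: re-estimate parameters from expected counts c(w, d) * Z
    weighted = counts[:, :, None] * Z          # shape (N, V, k)

    D = weighted.sum(axis=1)                   # sum over words -> (N, k)
    D /= D.sum(axis=1, keepdims=True) + eps

    T = weighted.sum(axis=0).T                 # sum over documents -> (k, V)
    T /= T.sum(axis=1, keepdims=True) + eps

    return T, D, Z
```

The outer loop would repeat `em_step` and monitor the log-likelihood, e.g. `np.sum(counts * np.log(D @ T + eps))`, stopping once the change between iterations falls below a small threshold. Keeping all Z matrices in a single (N, V, k) array is memory-hungry for large vocabularies, so a real implementation may process documents in batches.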
This PLSA implementation does not use a background model. I adjust the traditional formulas by assuming zero probability that the background model is chosen, i.e., I set λB to zero. With this choice, the E-step essentially only computes a distribution over the k topics for each word, i.e., p(z = i | w) for z = 1, ..., k. The M-step is also simpler, because p(Z = B | w) is zero for all words (due to the fact that λB = 0).
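To make the simplification explicit, in the usual mixture-with-background formulation the E-step for the background indicator looks like the following (I write π_{d,j} for document d's coverage of topic j and θ_j for topic j's word distribution; the exact symbols are my choice for illustration):

```math
p(z = B \mid d, w) = \frac{\lambda_B \, p(w \mid \theta_B)}{\lambda_B \, p(w \mid \theta_B) + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j} \, p(w \mid \theta_j)}
```

Setting λB = 0 makes the numerator vanish, so p(z = B | d, w) = 0 and the per-word topic distribution reduces to

```math
p(z = i \mid d, w) = \frac{\pi_{d,i} \, p(w \mid \theta_i)}{\sum_{j=1}^{k} \pi_{d,j} \, p(w \mid \theta_j)}
```

which is exactly the E-step computed in the code sketch above.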