Topic-Modeling-with-PLSA


The starting code for this repository was taken from the CS 410 Text Information Systems course at UIUC in Fall 2021.

Introduction

The Probabilistic Latent Semantic Analysis (PLSA) algorithm is a statistical model used for analyzing and discovering hidden semantic structures within a large collection of documents, often referred to as a corpus. PLSA is commonly used in natural language processing and information retrieval tasks, such as document clustering, topic modeling, and recommendation systems.

PLSA is a generative model that aims to represent the observed co-occurrence patterns of words and documents in terms of latent (hidden) topics. It assumes that each document is a mixture of topics, and each topic is a mixture of words. The underlying idea is that the generation of a word within a document is governed by a combination of the topics present in that document.
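
Concretely, ignoring any background model (which the implementation below also omits), the probability of observing a word w in document d is a sum over the k topics:

$$
p(w \mid d) = \sum_{i=1}^{k} p(\theta_i \mid d)\, p(w \mid \theta_i)
$$

where p(θi | d) is the coverage of topic θi in document d and p(w | θi) is the probability of word w under topic θi.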

Implementing the PLSA Algorithm

The test.txt data set contains 1000 lines; each line is a document. The first 100 documents are labeled with the topic they belong to, either 0 (“Seattle”) or 1 (“Chicago”). Each of the first 100 documents is generated by sampling entirely from its labeled topic (i.e., from one topic only). The remaining 900 documents are generated from a mixture of the “Seattle” and “Chicago” topics.

The DBLP.txt data set is made up of research paper abstracts from DBLP. Each line is a document. We first test our code on the simpler data set test.txt before running it on DBLP.txt.

I started with the provided code skeleton plsa.py for my implementation.
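
Before running EM, the documents have to be read in and counted. As a rough illustration (not the actual skeleton code), the preprocessing amounts to reading one whitespace-tokenized document per line, building a vocabulary, and filling a document-by-word count matrix; the file path and helper name below are just placeholders:

```python
import numpy as np

def build_corpus(path):
    """Read one whitespace-tokenized document per line and build a
    document-by-word count matrix (hypothetical helper, not part of plsa.py)."""
    docs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.strip().split()
            if tokens:
                docs.append(tokens)

    # Map each distinct word to a column index.
    vocab = {w: j for j, w in enumerate(sorted({w for d in docs for w in d}))}

    # counts[d, w] = number of times word w occurs in document d.
    counts = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for w in doc:
            counts[i, vocab[w]] += 1
    return counts, vocab

# counts, vocab = build_corpus("data/test.txt")  # path is an assumption
```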

The main data structures involved in the implementation of this EM algorithm are three matrices:

  1. T (topics by words): this is the set of parameters characterizing topic content, which we denote by the θi's. Each element is the probability of a particular word under a particular topic.

  2. D (documents by topics): this is the set of parameters modeling the coverage of topics in each document, which we denote by the pij's. Each element is the probability that a particular topic is covered in a particular document.

  3. Z (hidden variables): for every document, we need one Z representing the probability that each word in the document was generated by a particular topic. For any given document, this is a "word-by-topic" matrix encoding p(z|w) for that document. Z is what we compute in the E-step (based on matrices T and D, which hold our parameters). Note that we need a different Z for each document, so we allocate one Z matrix per document. The M-step then simply uses all of these Z matrices, together with the word counts in each document, to re-estimate all the parameters, i.e., to update matrices T and D based on Z. Thus, at a high level, this is what happens in the algorithm:

    • T and D are initialized.
    • E-step computes all Z's based on T and D.
    • M-step uses all Z's to update T and D.
    • We iterate until the likelihood stops changing much, at which point we use T and D as our output (a minimal sketch of this loop is shown below). Note that the Z's are also useful in their own right, since they tell us which topic most likely generated each word in each document.
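
To make this loop concrete, here is a minimal numpy sketch of one possible implementation, not the code from plsa.py itself: the variable names, random initialization, and convergence threshold are my own choices. It assumes `counts` is the document-by-word count matrix (as built above), `T` is topics-by-words, `D` is documents-by-topics, and there is no background model (λB = 0):

```python
import numpy as np

def plsa_em(counts, k, iters=100, tol=1e-4, seed=0):
    """Minimal PLSA EM sketch without a background model.

    counts: (n_docs, n_words) word-count matrix.
    Returns T (k x n_words, p(w|topic)) and D (n_docs x k, p(topic|doc)).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape

    # Random initialization, normalized so each row is a probability distribution.
    T = rng.random((k, n_words))
    T /= T.sum(axis=1, keepdims=True)          # p(w | topic)
    D = rng.random((n_docs, k))
    D /= D.sum(axis=1, keepdims=True)          # p(topic | doc)

    prev_ll = -np.inf
    for _ in range(iters):
        # E-step: Z[d, w, i] = p(z = i | w, d), one word-by-topic matrix per document.
        mix = D @ T                             # mix[d, w] = p(w | d)
        Z = D[:, None, :] * T.T[None, :, :]     # shape (n_docs, n_words, k)
        Z /= np.maximum(mix[:, :, None], 1e-12)

        # M-step: re-estimate D and T from the expected counts c(w, d) * p(z | w, d).
        expected = counts[:, :, None] * Z       # shape (n_docs, n_words, k)
        D = expected.sum(axis=1)                # sum over words -> (n_docs, k)
        D /= np.maximum(D.sum(axis=1, keepdims=True), 1e-12)
        T = expected.sum(axis=0).T              # sum over docs -> (k, n_words)
        T /= np.maximum(T.sum(axis=1, keepdims=True), 1e-12)

        # Stop when the log-likelihood of the corpus barely changes.
        ll = np.sum(counts * np.log(np.maximum(D @ T, 1e-12)))
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return T, D
```

With k = 2 on test.txt, the two recovered topics should roughly correspond to “Seattle” and “Chicago”. Note that this sketch materializes one full documents-by-words-by-topics tensor for Z; for a larger corpus like DBLP.txt, one would typically loop over documents (one Z per document, as described above) to keep memory manageable.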

Important Notes

  1. Using matrix operations is strongly encouraged (writing nested for-loops for these calculations is painful and slow).

  2. This PLSA implementation does not use a background model. I adjust the traditional formulas by assuming a zero probability that the background model is chosen, i.e., I set λB to zero. With this change, the E-step only computes a distribution over the k topics, z = 1, ..., k, for each word, i.e., p(z=i|w). The M-step also becomes simpler, since p(z=B|w) is zero for all words (because λB = 0), as shown below.
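
For reference, with a background word distribution p(w | B) the E-step would also estimate, for each word w in document d,

$$
p(z_{d,w} = B) = \frac{\lambda_B\, p(w \mid B)}{\lambda_B\, p(w \mid B) + (1-\lambda_B)\sum_{j=1}^{k} p(\theta_j \mid d)\, p(w \mid \theta_j)},
$$

which is identically zero when λB = 0, so all that remains is the usual topic posterior

$$
p(z_{d,w} = i) = \frac{p(\theta_i \mid d)\, p(w \mid \theta_i)}{\sum_{j=1}^{k} p(\theta_j \mid d)\, p(w \mid \theta_j)}.
$$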

Resources:

[1] Cheng’s note on the EM algorithm.
[2] Chase Geigle’s note on the EM algorithm, which includes a derivation of the EM algorithm (see Section 4).
[3] Qiaozhu Mei’s note on the EM algorithm for PLSA, which includes a different derivation of the EM algorithm.
