There are 3 sections to this work.
- Part 1: Data Extraction
- Part 2: Creating LDA models
- Part 3: Mapping data from the LDA model to the corpus
In this section, the metadata is extracted from the input corpus and saved in a file called metadata.csv.
The metadata is then used in Part 3 for mapping purposes.
Run the following command to extract the metadata:
$ python Data_Extraction/ExtractMetadata.py inputcorpusfolder/
This command runs the ExtractMetadata.py script from the Data_Extraction folder.
inputcorpusfolder: refers to the folder which should contain PubMed XML data files.
For experimental purposes, I have added a folder called testdatafolder to this repository. It contains some articles from the PMC Open Access Subset.
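As a rough illustration of this step, the sketch below pulls a few fields out of one JATS-style article (the XML layout used by PMC) and writes them to CSV. The column set (pmid, year, title) and the names `extract_metadata` and `write_metadata_csv` are assumptions made for this example. They are not taken from ExtractMetadata.py, and the real metadata.csv may contain different columns.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical column set; the real metadata.csv may contain other fields.
FIELDS = ["pmid", "year", "title"]


def extract_metadata(xml_text):
    """Return a dict of metadata fields from one JATS-style article string."""
    root = ET.fromstring(xml_text)
    pmid = root.findtext(".//article-id[@pub-id-type='pmid']", default="")
    year = root.findtext(".//pub-date/year", default="")
    title = root.findtext(".//article-title", default="")
    return {"pmid": pmid.strip(), "year": year.strip(), "title": title.strip()}


def write_metadata_csv(records, handle):
    """Write metadata records as CSV rows to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


# Made-up sample article, used only to exercise the functions above.
SAMPLE = """<article><front><article-meta>
<article-id pub-id-type="pmid">12345</article-id>
<title-group><article-title>A sample title</article-title></title-group>
<pub-date><year>2015</year></pub-date>
</article-meta></front></article>"""

buffer = io.StringIO()
write_metadata_csv([extract_metadata(SAMPLE)], buffer)
print(buffer.getvalue())
```

In a real run, the in-memory buffer would be replaced by an open file, and the function would be called once per XML file in the input folder.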
In this section, the article text from the input XML files is extracted and pre-processed. The text undergoes sentence segmentation, tokenization, and lemmatization. Then the tokens are filtered based on multiple criteria.
After the aforementioned pre-processing and filtering, a new corpus is created that will be used to create the LDA model. Moreover, this corpus will be used in Part 3, for calculating word frequencies.
Run the following command to pre-process the text and create a new corpus:
$ python Data_Extraction/ExtractData.py inputcorpusfolder yes
This command runs the ExtractData.py script from the Data_Extraction folder. The script takes two command-line arguments:
- inputcorpusfolder : path to the input corpus folder
- POS-tagging option : valid inputs are yes and no. The script uses this argument to decide whether POS tagging and filtering are performed during the data extraction process.
The output from this process will be saved as individual JSON files in a folder called corpus.
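The pre-processing pipeline described above can be sketched with the standard library alone. This is a simplified stand-in, not the code in ExtractData.py: the regex-based segmentation, the tiny stop-word list, the minimum-length filter, and the `"tokens"` JSON key are all assumptions, and `lemmatize` is a placeholder where a real lemmatizer would be plugged in.

```python
import json
import re

# Tiny illustrative stop-word list; a real run would use a fuller list.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "was", "were", "to"}


def split_sentences(text):
    """Naive sentence segmentation on ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def lemmatize(token):
    """Placeholder: returns the token unchanged. Swap in a real lemmatizer."""
    return token


def preprocess(text, min_length=3):
    """Segment, tokenize, lemmatize, and filter a text into a token list."""
    tokens = []
    for sentence in split_sentences(text):
        for token in re.findall(r"[a-z]+", sentence.lower()):
            token = lemmatize(token)
            if len(token) >= min_length and token not in STOP_WORDS:
                tokens.append(token)
    return tokens


text = "The cells were cultured in vitro. Gene expression was measured!"
document = {"tokens": preprocess(text)}  # hypothetical JSON layout
print(json.dumps(document))
```

Writing one such JSON object per article mirrors the one-file-per-document layout of the corpus folder.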
The dictionary for the LDA model is created here. Run the following command to create the LDA dictionary :
$ python LDA/LDA_1_make_dictionary.py corpus
This command runs the LDA_1_make_dictionary.py script from the LDA folder.
corpus : This is the corpus folder that was created during Corpus creation.
The script uses the data in this folder to generate the dictionary. Learn more about the dictionary format used by Gensim here. The output of this script is saved in the folder LDA_modeldata, which the script creates if it does not already exist.
The dictionary file will be named as follows :
original_dict_<DATE>.dictionary
original_dict_: Denotes that this is the original dictionary created from the corpus.
.dictionary : Denotes that this is a dictionary file.
<DATE> : This is the date on which the dictionary file is created.
Example: original_dict_2017-06-30.dictionary
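To make the idea of the LDA dictionary concrete: a Gensim dictionary maps every distinct token to an integer id. The plain-Python sketch below mimics that mapping and also records in how many documents each token occurs. It is a conceptual stand-in, not Gensim's implementation and not the code in LDA_1_make_dictionary.py; the function name `build_dictionary` is hypothetical.

```python
from collections import Counter


def build_dictionary(documents):
    """Map each distinct token to an integer id and count document frequency."""
    token2id = {}
    doc_freq = Counter()
    for tokens in documents:
        for token in tokens:
            if token not in token2id:
                token2id[token] = len(token2id)
        # Count each token once per document it appears in.
        doc_freq.update(set(tokens))
    return token2id, doc_freq


docs = [["gene", "cell", "gene"], ["cell", "protein"]]
token2id, doc_freq = build_dictionary(docs)
print(token2id)
print(doc_freq["cell"])
```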
The dictionary for the LDA model is modified here. Run the following command to create a modified LDA dictionary :
$ python LDA/LDA_2_reduce_dictionary.py <LDA_model_dictionary> <num_mostfrequent_words>
This command runs the LDA_2_reduce_dictionary.py script from the LDA folder. The script takes the previously generated dictionary and applies certain functions to it to improve the quality of the dictionary.
<LDA_model_dictionary> : This is the LDA dictionary that was created during Create LDA dictionary.
<num_mostfrequent_words> : Refers to the number of most frequent words that one wishes to remove from the dictionary.
The output of this script can also be found in the folder LDA_modeldata.
The new dictionary file will be named as follows :
mod_dict_<DATE>.dictionary
mod_dict_: Denotes that this is the modified dictionary created from the original dictionary.
.dictionary : Denotes that this is a dictionary file.
<DATE> : This is the date on which the dictionary file is created.
Example: mod_dict_2017-06-30.dictionary
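The effect of <num_mostfrequent_words> can be illustrated without Gensim. The sketch below drops the n most frequent tokens from a frequency table and returns what remains. The function name and the counts are invented for the example, and the other functions that LDA_2_reduce_dictionary.py applies are not shown.

```python
from collections import Counter


def remove_most_frequent(doc_freq, n):
    """Return the tokens that remain after dropping the n most frequent ones."""
    most_frequent = {token for token, _ in doc_freq.most_common(n)}
    return sorted(token for token in doc_freq if token not in most_frequent)


# Invented document frequencies, for illustration only.
doc_freq = Counter({"cell": 90, "patient": 75, "kinase": 12, "apoptosis": 7})
print(remove_most_frequent(doc_freq, 2))
```

Very frequent words tend to appear in every topic, so removing them is a common way to make topics easier to tell apart.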
The corpus for the LDA model is created here. Learn more about the corpus created by Gensim here. Run the following command to create the LDA corpus :
$ python LDA/LDA_3_make_corpus.py corpus LDA_modeldata/<dictionary file>
This command runs the LDA_3_make_corpus.py script from the LDA folder.
corpus : This is the corpus folder that was created during Corpus creation.
LDA_modeldata/<dictionary file>: This is the .dictionary file (in the LDA_modeldata folder) that one wishes to use to create the corpus.
The output of this script can also be found in the folder LDA_modeldata.
The new corpus file will be named as follows :
corp_<DATE>.corpus
corp_: Denotes that this is the corpus.
.corpus : Denotes that this is a corpus file.
<DATE> : This is the date on which the corpus file is created.
Example: corp_2017-06-30.corpus
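In Gensim's bag-of-words representation, each document becomes a list of (token id, count) pairs. The following plain-Python sketch shows that conversion for a single document, assuming a token-to-id mapping is already available. It is a conceptual illustration with a hypothetical function name, not the code in LDA_3_make_corpus.py.

```python
from collections import Counter


def to_bag_of_words(tokens, token2id):
    """Convert a token list to sorted (token_id, count) pairs.

    Tokens missing from the dictionary are skipped.
    """
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


token2id = {"gene": 0, "cell": 1, "protein": 2}
print(to_bag_of_words(["cell", "gene", "cell", "unknown"], token2id))
```

Tokens that were removed from the dictionary in the previous step simply drop out of the corpus here, which is why the choice of dictionary file matters.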
The LDA model is created here.
Run the following command to create the LDA model :
$ python LDA/LDA_4_create_LDA_Model.py <workers> <chunksize> <passes> <numtopics> <version> <LDA_model_dictionary> <LDA_model_corpus>
This command runs the LDA_4_create_LDA_Model.py script from the LDA folder.
Learn more about the parameters used by Gensim to create the LDA model here.
<workers> : Denotes the number of processes to use for parallelization.
<chunksize> : Denotes the number of documents used for each step of online training.
<passes> : Denotes the number of times the model goes through the entire corpus.
<numtopics> : Denotes the number of latent topics to be extracted from the corpus.
<version> : Denotes the version of the model. Can be used to distinguish multiple models. Valid input: any string.
<LDA_model_dictionary> : This is the .dictionary file (in the LDA_modeldata folder) that one wishes to use to generate the model.
<LDA_model_corpus> : This is the .corpus file (in the LDA_modeldata folder) that one wishes to use to generate the model.
The new LDA model file will be named as follows :
model_<DATE>__<numtopics>_<workers>_<chunksize>_<passes>_v-<version>.LDA
model_ : Denotes that this is the LDA model.
<DATE> : This is the date on which the LDA model file is created.
.LDA : Denotes that this is a model file; the .LDA extension marks the file as the main model file.
See above for the explanation of the other parameters.
Example : model_2017-06-30__10_2_20_50_v-1.LDA
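The naming pattern above can be reproduced with a small helper, which also shows how the command-line arguments map onto the file name. The function `model_filename` is hypothetical, and the ISO date format is inferred from the example file name rather than from the script itself.

```python
from datetime import date


def model_filename(numtopics, workers, chunksize, passes, version, created=None):
    """Build a model file name following the pattern described above."""
    created = created or date.today()
    return (f"model_{created.isoformat()}__{numtopics}_{workers}_"
            f"{chunksize}_{passes}_v-{version}.LDA")


print(model_filename(10, 2, 20, 50, "1", created=date(2017, 6, 30)))
```

Read against the example name, this corresponds to 10 topics, 2 workers, a chunk size of 20, 50 passes, and version 1.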
Generates the document topic distribution of all files from the corpus using the LDA model, LDA corpus and the metadata.
Run the following command to get the document topic distribution :
$ python Mapping/Mapping_1_get_document_topic_distribution.py <LDA_model_corpus> <LDA_model> metadata.csv
This command runs the Mapping_1_get_document_topic_distribution.py script from the Mapping folder.
<LDA_model_corpus> : This is the .corpus file (in the LDA_modeldata folder) that was used to generate the model.
<LDA_model> : This is the .LDA file (in the LDA_modeldata folder) which is the LDA model file.
metadata.csv: This is the file that contains all the metadata information about the corpus, created during Metadata extraction.
The new output file will be named as follows :
M1_topic_distr_df_<DATE>.csv
M1_topic_distr_df_ : Denotes that this is the output of the first mapping procedure (M1), and that it is a dataframe containing the topic distribution data.
<DATE> : This is the date on which this file is created.
.csv : Denotes that this is a CSV file.
Example : M1_topic_distr_df_2017-06-30.csv
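Gensim reports a document's topics as sparse (topic id, probability) pairs, leaving out topics with very low probability. One way to turn such pairs into a fixed-width table row, joined with a year from the metadata, is sketched below. The row layout, the function name, and the document id are assumptions for illustration; the actual columns of the M1 output file may differ.

```python
def dense_topic_row(doc_id, year, topic_pairs, num_topics):
    """Expand sparse (topic_id, probability) pairs into one dense table row."""
    weights = [0.0] * num_topics
    for topic_id, probability in topic_pairs:
        weights[topic_id] = probability
    return [doc_id, year] + weights


# Made-up document id and topic pairs, for illustration only.
print(dense_topic_row("PMC0000001", 2015, [(0, 0.7), (2, 0.3)], num_topics=3))
```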
Generates a CSV file that shows the average yearly topic distribution of the LDA model using the topic distribution of the topics from Get document topic distribution.
Run the following command to get the average yearly topic distribution :
$ python Mapping/Mapping_1_1_yearly_doc_top.dist.py M1_topic_distr_df_<DATE>.csv <LDA_model>
This command runs the Mapping_1_1_yearly_doc_top.dist.py script from the Mapping folder.
M1_topic_distr_df_<DATE>.csv : This is the output (CSV file) of Get document topic distribution.
<LDA_model> : This is the .LDA file (in the LDA_modeldata folder) which is the LDA model file.
The new output file will be named as follows :
M1_1_yearly_average_topic_distrbution_<DATE>.csv
M1_1_yearly_average_topic_distrbution_ : Denotes that this is the output of the first mapping procedure (M1), and that it is a dataframe containing the average yearly topic distribution of the LDA model.
<DATE> : This is the date on which this file is created.
.csv : Denotes that this is a CSV file.
Example : M1_1_yearly_average_topic_distrbution_2017-06-30.csv
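The yearly averaging can be sketched as a group-by-year mean over per-document topic weights. The input layout below (a year paired with a list of topic weights) and the function name are assumptions for the example, and the numbers are invented; the script itself works on the M1 CSV file.

```python
from collections import defaultdict


def yearly_average(rows, num_topics):
    """Average per-document topic weights within each publication year."""
    sums = defaultdict(lambda: [0.0] * num_topics)
    counts = defaultdict(int)
    for year, weights in rows:
        counts[year] += 1
        for topic, weight in enumerate(weights):
            sums[year][topic] += weight
    return {year: [total / counts[year] for total in sums[year]]
            for year in sorted(sums)}


# Invented topic weights for three documents.
rows = [
    (2015, [0.5, 0.5]),
    (2015, [1.0, 0.0]),
    (2016, [0.25, 0.75]),
]
print(yearly_average(rows, num_topics=2))
```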
Generates the topic word distribution of all files from the corpus using the corpus, the topics from the LDA model, and the metadata.
Run the following command to get the topic word distribution for each topic word :
$ python Mapping/Mapping_2_get_topic_word_distribution.py <LDA_model> <inputcorpusfolder> metadata.csv
This command runs the Mapping_2_get_topic_word_distribution.py script from the Mapping folder.
<LDA_model> : This is the .LDA file (in the LDA_modeldata folder) which is the LDA model file.
inputcorpusfolder: refers to the folder which should contain PubMed XML data files.
metadata.csv: This is the file that contains all the metadata information about the corpus, created during Metadata extraction.
The new output file will be named as follows :
M2_topic_word_dist_df_<DATE>.csv
M2_topic_word_dist_df_ : Denotes that this is the output of the second mapping procedure (M2), and that it is a dataframe containing the topic word distribution of all the documents in the LDA model.
<DATE> : This is the date on which this file is created.
.csv : Denotes that this is a CSV file.
Example : M2_topic_word_dist_df_2017-06-30.csv
Generates a CSV file that shows the relative yearly topic word distribution of the LDA model using the topic word distribution of the topics from Topic word distribution.
Run the following command to get the relative yearly topic word distribution for each topic word of a given topic:
$ python Mapping/Mapping_2_1_get_yearly_topic_word_distribution.py <LDA_model> <inputcorpusfolder> M2_topic_word_dist_df_<DATE>.csv <topicnumber>
This command runs the Mapping_2_1_get_yearly_topic_word_distribution.py script from the Mapping folder.
<LDA_model> : This is the .LDA file (in the LDA_modeldata folder) which is the LDA model file.
inputcorpusfolder: refers to the folder which should contain PubMed XML data files.
M2_topic_word_dist_df_<DATE>.csv: This is the output (CSV file) of Topic word distribution.
<topicnumber> : Denotes the topic number for which the CSV file will be generated.
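One possible reading of "relative yearly topic word distribution" is sketched below: for each year, the count of each topic word divided by the total number of tokens published in that year. This definition is an assumption made for the example, as is the function name; the script may normalize differently.

```python
from collections import Counter, defaultdict


def relative_yearly_frequency(documents, words):
    """For each year, return count(word) / total tokens for the given words."""
    word_counts = defaultdict(Counter)
    totals = Counter()
    for year, tokens in documents:
        word_counts[year].update(tokens)
        totals[year] += len(tokens)
    return {year: {w: word_counts[year][w] / totals[year] for w in words}
            for year in sorted(totals) if totals[year]}


# Invented mini-corpus: (publication year, token list) per document.
documents = [
    (2015, ["gene", "cell", "gene", "protein"]),
    (2016, ["cell", "cell"]),
]
print(relative_yearly_frequency(documents, ["gene", "cell"]))
```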
In this section, for each topic in the LDA model, the script divides the corpus into topic subcorpora. For each subcorpus, the script calculates the top n most common words. A word is counted based on whether it occurs in the documents that make up the subcorpus. The script uses the data from Get document topic distribution.
Run the following command to get the distribution of the top n most popular words for each topic:
$ python Mapping/Mapping_3_get_TOP_word_distribution.py M1_topic_distr_df_<DATE>.csv <LDA_model> <number of top words>
This command runs the Mapping_3_get_TOP_word_distribution.py script from the Mapping folder.
M1_topic_distr_df_<DATE>.csv : This is the output (CSV file) of Get document topic distribution, a dataframe containing the topic distribution data.
<LDA_model> : This is the .LDA file (in the LDA_modeldata folder) which is the LDA model file.
<number of top words> : Denotes n, the number of most popular words to report.
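Counting words by document occurrence, as described above, can be sketched as follows: each word counts once per document of the subcorpus, no matter how often it repeats within that document. How the script assigns documents to a topic subcorpus is not shown here, and the function name and the alphabetical tie-break are assumptions for the example.

```python
from collections import Counter


def top_words_by_document_occurrence(subcorpus, n):
    """Return the n words occurring in the most documents of a subcorpus."""
    occurrence = Counter()
    for tokens in subcorpus:
        # A word counts once per document, however often it repeats there.
        occurrence.update(set(tokens))
    # Sort by count (descending), then alphabetically for a stable order.
    ranked = sorted(occurrence.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


subcorpus = [["gene", "cell", "gene"], ["cell", "protein"], ["cell", "gene"]]
print(top_words_by_document_occurrence(subcorpus, 2))
```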
Run the following command to get the yearly relative frequency of top n most popular words for a given topic:
$ python Mapping/Mapping_3_1_get_TOP_word_distribution_giventopic.py <LDA_model> <topicnumber>
This command runs the Mapping_3_1_get_TOP_word_distribution_giventopic.py script from the Mapping folder.
<LDA_model> : This is the .LDA file (in the LDA_modeldata folder) which is the LDA model file.
<topicnumber> : Denotes the topic number for which the CSV file will be generated.
This repository contains the scripts that were used to generate the results for my MA thesis.
Link to slides.