Quora-Question-Pair-Similarity

The Quora Question Pair Similarity project aims to identify duplicate questions using natural language processing. Leveraging machine learning algorithms such as Logistic Regression, SGD Classifier, and XGBoost, the system achieves accurate classification, enhancing user experience by reducing redundancy in question content on the Quora platform.

image

Table of Contents

  1. Problem Statement
  2. Dataset Description
  3. Project Architecture
  4. File Structure
  5. Data Details
  6. Performance Metric
  7. Load the Data and Perform Data Analysis
  8. Top 10 Most Asked Questions on Quora
  9. Distribution of Question Lengths
  10. Feature Engineering
  11. Splitting into Train and Test Data
  12. Distribution of Output Variable in Train and Test Data
  13. Results

Problem Statement

Where else but Quora can a physicist help a chef with a math problem and get cooking tips in return? Quora is a place to gain and share knowledge—about anything. It’s a platform to ask questions and connect with people who contribute unique insights and quality answers. This empowers people to learn from each other and to better understand the world.

Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.

Currently, Quora uses a Random Forest model to identify duplicate questions. In this competition, Kagglers are challenged to tackle this natural language processing problem by applying advanced techniques to classify whether question pairs are duplicates or not. Doing so will make it easier to find high-quality answers to questions, resulting in an improved experience for Quora writers, seekers, and readers.

The data is taken from the 2017 Kaggle competition: https://kaggle.com/competitions/quora-question-pairs

Dataset Description

The goal of this competition is to predict which of the provided pairs of questions contain two questions with the same meaning. The ground truth is the set of labels that have been supplied by human experts. The ground truth labels are inherently subjective, as the true meaning of sentences can never be known with certainty. Human labeling is also a 'noisy' process, and reasonable people will disagree. As a result, the ground truth labels on this dataset should be taken to be 'informed' but not 100% accurate, and may include incorrect labeling. We believe the labels, on the whole, to represent a reasonable consensus, but this may often not be true on a case by case basis for individual items in the dataset.

Please note: as an anti-cheating measure, Kaggle has supplemented the test set with computer-generated question pairs. Those rows do not come from Quora, and are not counted in the scoring. All of the questions in the training set are genuine examples from Quora.

Project Architecture

Block_diagram_2

File Structure

file_structure

Data Details

  • id - the id of a training set question pair
  • qid1, qid2 - unique ids of each question (only available in train.csv)
  • question1, question2 - the full text of each question
  • is_duplicate - the target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise.

Business Constraints and Objectives

  • Misclassification should be minimized.
  • There are no strict low-latency requirements.
  • The probability threshold for classification should be adjustable.

How is this an ML problem?

  • The objective is to classify whether two given questions have the same intent, which is a typical binary classification problem.
  • The goal is an application that takes two questions as input and returns whether or not they have the same meaning.

Performance Metric

Metric(s):

  • Log-Loss
  • Binary Confusion Matrix
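
As a quick illustration of the metric, log loss heavily penalizes confident wrong predictions. A minimal sketch using scikit-learn, with hypothetical labels and predicted probabilities:

```python
from sklearn.metrics import log_loss, confusion_matrix

# Hypothetical ground-truth labels and predicted probabilities of class 1 (duplicate)
y_true = [0, 1, 1, 0, 1]
y_prob = [0.10, 0.85, 0.60, 0.30, 0.95]

# Log loss: lower is better; confident mistakes are penalized heavily
print("Log loss:", log_loss(y_true, y_prob))

# The binary confusion matrix needs hard labels, here thresholded at 0.5
y_pred = [int(p >= 0.5) for p in y_prob]
print(confusion_matrix(y_true, y_pred))
```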

Importing Needed Libraries and Accessing Other .py Files (Feature Extraction)

The project begins by importing the essential Python libraries for feature extraction, data visualization, and machine learning, and pulls in specific functionality from the 'feature_extraction' and 'ml_algorithms' modules for later use.

Load the Data and Perform Data Analysis

The CSV file is read into a pandas DataFrame; the first five rows are displayed along with summary information about the dataset. Missing values are identified, visualized with a bar plot, and the rows containing nulls are dropped, removing three rows. The dataset initially has 404,290 entries and 404,287 after cleaning: question2 contains two null values and question1 contains one.

image

image
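
A minimal sketch of this loading and cleaning step, assuming the Kaggle train.csv path and the dataset's column names:

```python
import pandas as pd

# Load the Quora question-pairs training data (file path is an assumption)
df = pd.read_csv("train.csv")

print(df.shape)           # (404290, 6) before cleaning
print(df.head())
print(df.isnull().sum())  # per-column null counts (question1: 1, question2: 2)

# Drop the three rows that contain null questions
df = df.dropna(subset=["question1", "question2"])
print(df.shape)           # (404287, 6) after cleaning
```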

Distribution of data points among output classes (Similar and Non-Similar Questions)

  • Distribution of Duplicate and Non-duplicate Questions: The bar plot illustrates the percentage distribution of questions categorized as duplicate and non-duplicate, checking for balance in the 'is_duplicate' column.

    image

  • Number of Unique and Repeated Questions: Analyzing the dataset reveals 537,929 unique questions. About 20.78% of questions appear more than once, with the maximum repetition being 157 times.

    image

  • Checking for Duplicates: No rows are found where 'qid1' and 'qid2' are the same or interchanged, indicating no duplicate question pairs in the dataset.

  • Number of Occurrences of Each Question: The histogram shows the log-scale distribution of the number of occurrences for each question, highlighting the maximum occurrence with a red dashed line.

    image The distribution is close to a power law, though not exactly.

Top 10 Most Asked Questions on Quora (counts in parentheses):

  • What are the best ways to lose weight? (161)
  • How can you look at someone's private Instagram account without following them? (120)
  • How can I lose weight quickly? (111)
  • What's the easiest way to make money online? (88)
  • Can you see who views your Instagram? (79)
  • What are some things new employees should know going into their first day at AT&T? (77)
  • What do you think of the decision by the Indian Government to demonetize 500 and 1000 rupee notes? (68)
  • Which is the best digital marketing course? (66)
  • How can you increase your height? (63)
  • How do l see who viewed my videos on Instagram? (61)

Distribution of Question Lengths:

A helper function determines the number of words in a sentence and is applied to both the 'question1' and 'question2' columns of the DataFrame. The resulting word counts are visualized with a histogram for each column, allowing the two length distributions to be compared.

image
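
A sketch of how these word counts and histograms could be produced (the plotting details are assumptions; df is the DataFrame loaded earlier):

```python
import matplotlib.pyplot as plt

def num_words(sentence) -> int:
    # Count words by splitting on whitespace
    return len(str(sentence).split())

# df: the DataFrame loaded in the earlier sketch
df["q1_n_words"] = df["question1"].apply(num_words)
df["q2_n_words"] = df["question2"].apply(num_words)

# Compare the two length distributions side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
axes[0].hist(df["q1_n_words"], bins=50)
axes[0].set_title("question1 word counts")
axes[1].hist(df["q2_n_words"], bins=50)
axes[1].set_title("question2 word counts")
plt.show()
```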

References for feature extraction:

Feature Engineering

Feature Extraction

  • freq_qid1: Frequency of qid1's
  • freq_qid2: Frequency of qid2's
  • q1len: Length of q1
  • q2len: Length of q2
  • q1_n_words: Number of words in Question 1
  • q2_n_words: Number of words in Question 2
  • word_Common: Number of common unique words in Question 1 and Question 2
  • word_Total: Total number of words in Question 1 + Total number of words in Question 2
  • word_share: word_Common / word_Total
  • freq_q1+freq_q2: Sum total of frequency of qid1 and qid2
  • freq_q1-freq_q2: Absolute difference of frequency of qid1 and qid2
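
A sketch of how several of these features could be computed (the implementation details are my assumptions, not the repository's exact code; df is the cleaned DataFrame from above):

```python
def basic_features(row) -> dict:
    q1, q2 = str(row["question1"]), str(row["question2"])
    q1_words, q2_words = q1.split(), q2.split()

    common = set(q1_words) & set(q2_words)   # common unique words
    word_total = len(q1_words) + len(q2_words)

    return {
        "q1len": len(q1),
        "q2len": len(q2),
        "q1_n_words": len(q1_words),
        "q2_n_words": len(q2_words),
        "word_Common": len(common),
        "word_Total": word_total,
        "word_share": len(common) / word_total if word_total else 0.0,
    }

# Frequency features: how often each question id appears anywhere in the dataset
freq = df["qid1"].value_counts().add(df["qid2"].value_counts(), fill_value=0)
df["freq_qid1"] = df["qid1"].map(freq)
df["freq_qid2"] = df["qid2"].map(freq)
df["freq_q1+freq_q2"] = df["freq_qid1"] + df["freq_qid2"]
df["freq_q1-freq_q2"] = (df["freq_qid1"] - df["freq_qid2"]).abs()

features = df.apply(basic_features, axis=1, result_type="expand")
df[features.columns] = features
```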

Feature Extraction after pre-processing.

Featurization (NLP and Fuzzy Features) Definitions:

  • Token: obtained by splitting a sentence on spaces
  • Stop_Word: a stop word as per NLTK
  • Word: a token that is not a stop word

Features:

  • cwc_min: Ratio of common_word_count to min length of word count of Q1 and Q2

    • cwc_min = common_word_count / (min(len(q1_words), len(q2_words)))
  • cwc_max: Ratio of common_word_count to max length of word count of Q1 and Q2

    • cwc_max = common_word_count / (max(len(q1_words), len(q2_words)))
  • csc_min: Ratio of common_stop_count to min length of stop count of Q1 and Q2

    • csc_min = common_stop_count / (min(len(q1_stops), len(q2_stops)))
  • csc_max: Ratio of common_stop_count to max length of stop count of Q1 and Q2

    • csc_max = common_stop_count / (max(len(q1_stops), len(q2_stops)))
  • ctc_min: Ratio of common_token_count to min length of token count of Q1 and Q2

    • ctc_min = common_token_count / (min(len(q1_tokens), len(q2_tokens)))
  • ctc_max: Ratio of common_token_count to max length of token count of Q1 and Q2

    • ctc_max = common_token_count / (max(len(q1_tokens), len(q2_tokens)))
  • last_word_eq: Check if the last word of both questions is equal or not

    • last_word_eq = int(q1_tokens[-1] == q2_tokens[-1])
  • first_word_eq: Check if the first word of both questions is equal or not

    • first_word_eq = int(q1_tokens[0] == q2_tokens[0])
  • abs_len_diff: Absolute length difference

    • abs_len_diff = abs(len(q1_tokens) - len(q2_tokens))
  • mean_len: Average Token Length of both Questions

    • mean_len = (len(q1_tokens) + len(q2_tokens))/2
  • fuzz_ratio: Fuzzy Ratio

  • fuzz_partial_ratio: Fuzzy Partial Ratio

  • token_sort_ratio: Token Sort Ratio

  • token_set_ratio: Token Set Ratio

  • longest_substr_ratio: Ratio of length of the longest common substring to min length of token count of Q1 and Q2

    • longest_substr_ratio = len(longest common substring) / (min(len(q1_tokens), len(q2_tokens)))
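
A sketch of a few of these token and fuzzy features, using NLTK stopwords and the fuzzywuzzy package (the epsilon guard against empty questions is my assumption):

```python
from difflib import SequenceMatcher
from fuzzywuzzy import fuzz
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))
EPS = 1e-6  # guard against division by zero for empty questions (assumption)

def token_features(q1: str, q2: str) -> dict:
    q1_tokens, q2_tokens = q1.split(), q2.split()
    if not q1_tokens or not q2_tokens:
        return {}

    q1_words = {t for t in q1_tokens if t not in STOP_WORDS}
    q2_words = {t for t in q2_tokens if t not in STOP_WORDS}
    q1_stops = {t for t in q1_tokens if t in STOP_WORDS}
    q2_stops = {t for t in q2_tokens if t in STOP_WORDS}

    common_word_count = len(q1_words & q2_words)
    common_stop_count = len(q1_stops & q2_stops)
    common_token_count = len(set(q1_tokens) & set(q2_tokens))
    # Longest common substring length (character level)
    lcs = SequenceMatcher(None, q1, q2).find_longest_match(0, len(q1), 0, len(q2)).size

    return {
        "cwc_min": common_word_count / (min(len(q1_words), len(q2_words)) + EPS),
        "cwc_max": common_word_count / (max(len(q1_words), len(q2_words)) + EPS),
        "csc_min": common_stop_count / (min(len(q1_stops), len(q2_stops)) + EPS),
        "csc_max": common_stop_count / (max(len(q1_stops), len(q2_stops)) + EPS),
        "ctc_min": common_token_count / (min(len(q1_tokens), len(q2_tokens)) + EPS),
        "ctc_max": common_token_count / (max(len(q1_tokens), len(q2_tokens)) + EPS),
        "last_word_eq": int(q1_tokens[-1] == q2_tokens[-1]),
        "first_word_eq": int(q1_tokens[0] == q2_tokens[0]),
        "abs_len_diff": abs(len(q1_tokens) - len(q2_tokens)),
        "mean_len": (len(q1_tokens) + len(q2_tokens)) / 2,
        "fuzz_ratio": fuzz.ratio(q1, q2),
        "fuzz_partial_ratio": fuzz.partial_ratio(q1, q2),
        "token_sort_ratio": fuzz.token_sort_ratio(q1, q2),
        "token_set_ratio": fuzz.token_set_ratio(q1, q2),
        "longest_substr_ratio": lcs / (min(len(q1_tokens), len(q2_tokens)) + EPS),
    }
```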

Some additional features [Added by Me]

  • ratio_q_lengths: Calculates the ratio of the lengths of the two questions.
  • common_prefix: Computes the length of the common prefix (the initial common sequence of characters) between the two questions.
  • common_suffix: Calculates the length of the common suffix (the final common sequence of characters) between the two questions.
  • diff_words: Calculates the absolute difference in the number of words between the two questions.
  • diff_chars: Computes the absolute difference in the number of characters between the two questions.
  • jaccard_similarity: Calculates the Jaccard similarity coefficient between the sets of words in the two questions.
  • longest_common_subsequence: Computes the length of the longest common subsequence (LCS) between the two questions.
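
A sketch of how a few of these additional features could be implemented (the helper names mirror the list above and are illustrative):

```python
import os

def jaccard_similarity(q1: str, q2: str) -> float:
    # Jaccard coefficient over the word sets of the two questions
    s1, s2 = set(q1.split()), set(q2.split())
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

def common_prefix(q1: str, q2: str) -> int:
    # Length of the shared leading character sequence
    return len(os.path.commonprefix([q1, q2]))

def common_suffix(q1: str, q2: str) -> int:
    # Length of the shared trailing character sequence (reverse and reuse the prefix helper)
    return len(os.path.commonprefix([q1[::-1], q2[::-1]]))

def longest_common_subsequence(q1: str, q2: str) -> int:
    # Classic dynamic-programming LCS over characters
    m, n = len(q1), len(q2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if q1[i - 1] == q2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]
```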

Processing and Extracting Features

The script sets the file path for a CSV file named "data_with_features.csv" and specifies the number of rows used for training via the variable rows_to_train, set to 100,000. This number can be adjusted based on specific needs or dataset sizes.

Pre-processing of Text

Preprocessing:

  • Removing HTML tags
  • Removing punctuation
  • Performing stemming
  • Removing stopwords
  • Expanding contractions, etc.
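
A sketch of such a preprocessing step, using BeautifulSoup for HTML stripping and NLTK for stopwords and stemming (the contraction map shown is a truncated, illustrative example):

```python
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}  # truncated example map

def preprocess(text: str) -> str:
    text = str(text).lower()
    text = BeautifulSoup(text, "html.parser").get_text()        # remove HTML tags
    for contraction, expanded in CONTRACTIONS.items():           # expand contractions
        text = text.replace(contraction, expanded)
    text = re.sub(r"[^\w\s]", " ", text)                         # remove punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]    # remove stopwords
    return " ".join(STEMMER.stem(t) for t in tokens)             # stemming
```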

Extracting Features

The feature-generation function runs over the specified number of rows (rows_to_train) and saves the data to the file. The resulting DataFrame is displayed, showing the first five rows with the additional features extracted from the original dataset, including frequency, length, and similarity-ratio characteristics of the question pairs.

image

Check for questions with two words or fewer

Question pairs are filtered from the DataFrame when either 'q1' or 'q2' has two words or fewer, and the result is stored in a new DataFrame called filtered_data. Details for the first 10 filtered pairs and the total number of pairs meeting the criterion are then printed. This helps inspect the characteristics of question pairs with very low word counts.

This also provides insight into the distribution of question lengths, highlighting the minimum length and the number of questions of that length in both 'question1' and 'question2'.

image
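
A sketch of that filtering step (the word-count columns are assumed to come from the earlier feature step):

```python
# Keep only pairs where either question has two words or fewer
mask = (df["q1_n_words"] <= 2) | (df["q2_n_words"] <= 2)
filtered_data = df[mask]

print(filtered_data[["question1", "question2"]].head(10))
print("Total short-question pairs:", len(filtered_data))
```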

  • Kullback-Leibler (KL) divergence is used to analyze the discriminatory power of the 33 features in distinguishing between duplicate and non-duplicate pairs.

  • This visualization allows us to compare the distribution of each feature for duplicate and non-duplicate pairs, providing insights into the characteristics that might differentiate between the two categories.

  • Violin plots show the distribution shape, while Density plots provide a smooth estimate of the probability density function for each class.

    image

    image

    image

    image

This visualization helps identify features with high inverted KL Divergence, highlighting those that exhibit significant differences between duplicate and non-duplicate pairs. Higher values indicate features that are more discriminative in distinguishing between the two classes.

image
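
A sketch of how a per-feature KL divergence between the two classes could be estimated (the histogram binning and the small epsilon are assumptions):

```python
import numpy as np
from scipy.stats import entropy

def feature_kl_divergence(df, feature: str, bins: int = 50) -> float:
    # Estimate the two class-conditional distributions over a shared set of bins
    dup = df.loc[df["is_duplicate"] == 1, feature].dropna()
    non_dup = df.loc[df["is_duplicate"] == 0, feature].dropna()
    edges = np.histogram_bin_edges(df[feature].dropna(), bins=bins)

    p, _ = np.histogram(dup, bins=edges, density=True)
    q, _ = np.histogram(non_dup, bins=edges, density=True)
    p, q = p + 1e-10, q + 1e-10  # avoid zero-probability bins

    # scipy normalizes p and q, then returns sum(p * log(p / q))
    return entropy(p, q)

kl_scores = {f: feature_kl_divergence(df, f) for f in ["word_share", "ctc_min", "fuzz_ratio"]}
print(sorted(kl_scores.items(), key=lambda kv: kv[1], reverse=True))
```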

The five least discriminative features, based on their calculated KL divergence, are identified, and a pair plot is created for them along with the target variable 'is_duplicate'.

image

image

Important features in differentiating Duplicate(Similar) and Non-Duplicate(Dissimilar) Questions.

  • The distributions of q1len, q2len, q1_n_words, q2_n_words, word_Total, and word_share for duplicate and non-duplicate questions overlap, but not completely, which makes each of them a useful feature.

Visualizing in a Lower Dimension Using t-SNE (3D)

3D_plot
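
A sketch of such a 3D t-SNE projection on a sample of the engineered features (the sample size, feature subset, and perplexity are assumptions; t-SNE is too expensive to run on the full dataset):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler

# Work on a small sample: t-SNE scales poorly to hundreds of thousands of rows
sample = df.sample(5000, random_state=42)
X = MinMaxScaler().fit_transform(
    sample[["word_share", "ctc_min", "fuzz_ratio", "token_sort_ratio"]]
)

tsne_3d = TSNE(n_components=3, perplexity=30, random_state=42).fit_transform(X)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(tsne_3d[:, 0], tsne_3d[:, 1], tsne_3d[:, 2], c=sample["is_duplicate"], s=3)
plt.show()
```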

Featurizing Text Data with TF-IDF Weighted Word Vectors

  • Extracts features for each question in the dataset using spaCy, considering the semantic meaning of words and their TF-IDF weights. These features are then added to the DataFrame for further analysis.

  • Loads processed features, drops unnecessary columns, extracts features for Question 1 and Question 2, and displays information about the features in separate DataFrames.

  • Consolidates the features from different DataFrames into a single DataFrame and saves it to the specified CSV file for further use.

  • The code replaces non-numeric values in the DataFrame with NaN, checks for the presence of NaN values, and prints the count of NaN values in each column after replacement.

  • Converts all features to numeric format, handling any errors by coercing non-numeric values to NaN.
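
A sketch of IDF-weighted averaging of spaCy word vectors (the en_core_web_lg model and the exact weighting scheme are assumptions, not the repository's verified setup):

```python
import numpy as np
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_lg")  # a spaCy model that ships with word vectors (assumption)

# Fit TF-IDF on all question text and keep a word -> IDF lookup
questions = list(df["question1"].astype(str)) + list(df["question2"].astype(str))
tfidf = TfidfVectorizer(lowercase=True)
tfidf.fit(questions)
word2idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def weighted_vector(text: str) -> np.ndarray:
    # Average the spaCy vectors of the words, weighted by their IDF scores
    doc = nlp(str(text))
    vec = np.zeros(nlp.vocab.vectors_length)
    weight_sum = 0.0
    for token in doc:
        idf = word2idf.get(token.text.lower(), 0.0)
        vec += token.vector * idf
        weight_sum += idf
    return vec / weight_sum if weight_sum else vec

q1_vecs = np.vstack([weighted_vector(q) for q in df["question1"]])
q2_vecs = np.vstack([weighted_vector(q) for q in df["question2"]])
```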

Due to limited computational power, the models are trained on 100,000 rows.

  • Checks if there are any NA (missing) values in the DataFrame after converting features to numeric format. If present, it prints "NA Values Present"; otherwise, it prints "No NA Values Present." It then displays the number of NaN values in each column after the conversion. Additionally, it converts the target variable y_true to a list of integers and shows the first few rows of the DataFrame.

image

Splitting into Train and Test Data

Train data: 70%, Test data: 30%
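
A sketch of that split (stratifying on the label to preserve class balance is an assumption):

```python
from sklearn.model_selection import train_test_split

# X: the engineered feature matrix, y_true: the duplicate labels (from the steps above)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_true, test_size=0.30, stratify=y_true, random_state=42
)
print(X_train.shape, X_test.shape)
```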

Distribution of Output Variable in Train and Test Data

The left subplot shows the distribution in the training data, while the right subplot shows the distribution in the testing data. This helps to understand the balance or imbalance in the classes of the output variable.

image

Results

  • Random Model :

    • Log Loss for Training Data: 4.27141
    • Log Loss for Test Data: 3.95542
  • Logistic Regression :

    • Train Log Loss: 0.46723
    • Test Log Loss: 0.47019

    image

  • SGDClassifier :

    • Train Log Loss: 0.44927
    • Test Log Loss: 0.45210

    image

  • NaiveBayesClassifier :

    • Train Log Loss: 11.47686
    • Test Log Loss: 11.49861

    image

  • XGBoost :

    • Train Log Loss: 0.23361
    • Test Log Loss: 0.35239

    image

Log loss reveals the relative model performance. The random baseline shows high log loss (4.27 train, 3.96 test). Logistic Regression (0.47 test) and SGDClassifier (0.45 test) perform well, while the Naive Bayes classifier performs poorly (about 11.5 on both sets). XGBoost achieves the lowest test log loss (0.23 train, 0.35 test).
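
As an illustration of how two of these models could be trained and evaluated against log loss (hyperparameters are illustrative, not the repository's tuned values):

```python
import xgboost as xgb
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

# Linear model trained with log loss (named "log" in older scikit-learn);
# calibration yields well-behaved probabilities
sgd = SGDClassifier(loss="log_loss", alpha=1e-4, penalty="l2")
clf = CalibratedClassifierCV(sgd, method="sigmoid")
clf.fit(X_train, y_train)
print("SGD test log loss:", log_loss(y_test, clf.predict_proba(X_test)))

# Gradient-boosted trees
model = xgb.XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1, eval_metric="logloss")
model.fit(X_train, y_train)
print("XGBoost test log loss:", log_loss(y_test, model.predict_proba(X_test)))
```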
