Quora-Question-Pair-Similarity

The Quora Question Pair Similarity project aims to identify duplicate questions using natural language processing. Leveraging machine learning algorithms such as Logistic Regression, SGD Classifier, and XGBoost, the system achieves accurate classification, enhancing user experience by reducing redundancy in question content on the Quora platform.

image

Table of Contents

  1. Problem Statement
  2. Dataset Description
  3. Project Architecture
  4. File Structure
  5. Data Details
  6. Performance Metric
  7. Load the Data and Perform Data Analysis
  8. Top 10 Most Asked Questions on Quora
  9. Distribution of Question Lengths
  10. Feature Engineering
  11. Splitting into Train and Test Data
  12. Distribution of Output Variable in Train and Test Data
  13. Results

Problem Statement

Where else but Quora can a physicist help a chef with a math problem and get cooking tips in return? Quora is a place to gain and share knowledge—about anything. It’s a platform to ask questions and connect with people who contribute unique insights and quality answers. This empowers people to learn from each other and to better understand the world.

Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.

Currently, Quora uses a Random Forest model to identify duplicate questions. In this competition, Kagglers are challenged to tackle this natural language processing problem by applying advanced techniques to classify whether question pairs are duplicates or not. Doing so will make it easier to find high-quality answers to questions, resulting in an improved experience for Quora writers, seekers, and readers.

The data is taken from the 2017 Kaggle competition: https://kaggle.com/competitions/quora-question-pairs

Dataset Description

The goal of this competition is to predict which of the provided pairs of questions contain two questions with the same meaning. The ground truth is the set of labels that have been supplied by human experts. The ground truth labels are inherently subjective, as the true meaning of sentences can never be known with certainty. Human labeling is also a 'noisy' process, and reasonable people will disagree. As a result, the ground truth labels on this dataset should be taken to be 'informed' but not 100% accurate, and may include incorrect labeling. We believe the labels, on the whole, to represent a reasonable consensus, but this may often not be true on a case by case basis for individual items in the dataset.

Please note: as an anti-cheating measure, Kaggle has supplemented the test set with computer-generated question pairs. Those rows do not come from Quora, and are not counted in the scoring. All of the questions in the training set are genuine examples from Quora.

Project Architecture

Block_diagram_2

File Structure

file_structure

Data Details

  • id - the id of a training set question pair
  • qid1, qid2 - unique ids of each question (only available in train.csv)
  • question1, question2 - the full text of each question
  • is_duplicate - the target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise.

Business Constraints and Objectives

  • Misclassification should be minimized.
  • There are no strict low-latency requirements.
  • The probability threshold for classification should be adjustable.

How is this an ML problem?

  • The objective is to classify whether two given questions have the same intent, which is a typical binary classification problem.
  • The goal is an application that takes two questions as input and returns whether or not they have the same meaning.

Performance Metric

Metric(s):

  • Log-Loss
  • Binary Confusion Matrix
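
As a quick illustration of the metric, log loss heavily penalizes confident wrong predictions. A minimal sketch using scikit-learn, with hypothetical labels and predicted probabilities:

```python
from sklearn.metrics import log_loss, confusion_matrix

# Hypothetical ground-truth labels and predicted probabilities of class 1 (duplicate)
y_true = [0, 1, 1, 0, 1]
y_prob = [0.10, 0.85, 0.60, 0.30, 0.95]

# Log loss: lower is better; confident mistakes are penalized heavily
print("Log loss:", log_loss(y_true, y_prob))

# The binary confusion matrix needs hard labels, here thresholded at 0.5
y_pred = [int(p >= 0.5) for p in y_prob]
print(confusion_matrix(y_true, y_pred))
```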

Importing Needed Libraries and Accessing Other .py Files (Feature Extraction)

The project begins by importing the essential Python libraries for feature extraction, data visualization, and machine learning, and pulls in specific functionality from the 'feature_extraction' and 'ml_algorithms' modules for later use.

Load the Data and Perform Data Analysis

The CSV file is read into a pandas DataFrame; the first five rows are displayed along with summary information about the dataset. Missing values are identified, visualized with a bar plot, and the rows containing nulls are dropped, removing three rows. The dataset initially has 404,290 entries and 404,287 after cleaning: question2 contains two null values and question1 contains one.

image

image
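
A minimal sketch of this loading and cleaning step, assuming the Kaggle train.csv path and the dataset's column names:

```python
import pandas as pd

# Load the Quora question-pairs training data (file path is an assumption)
df = pd.read_csv("train.csv")

print(df.shape)           # (404290, 6) before cleaning
print(df.head())
print(df.isnull().sum())  # per-column null counts (question1: 1, question2: 2)

# Drop the three rows that contain null questions
df = df.dropna(subset=["question1", "question2"])
print(df.shape)           # (404287, 6) after cleaning
```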

Distribution of data points among output classes (Similar and Non-Similar Questions)

  • Distribution of Duplicate and Non-duplicate Questions: The bar plot illustrates the percentage distribution of questions categorized as duplicate and non-duplicate, checking for balance in the 'is_duplicate' column.

    image

  • Number of Unique and Repeated Questions: Analyzing the dataset reveals 537,929 unique questions. About 20.78% of questions appear more than once, with the maximum repetition being 157 times.

    image

  • Checking for Duplicates: No rows are found where 'qid1' and 'qid2' are the same or interchanged, indicating no duplicate question pairs in the dataset.

  • Number of Occurrences of Each Question: The histogram shows the log-scale distribution of the number of occurrences for each question, highlighting the maximum occurrence with a red dashed line.

    image The distribution is close to a power law, though not exactly.

Top 10 Most Asked Questions on Quora (counts in parentheses):

  • What are the best ways to lose weight? (161)
  • How can you look at someone's private Instagram account without following them? (120)
  • How can I lose weight quickly? (111)
  • What's the easiest way to make money online? (88)
  • Can you see who views your Instagram? (79)
  • What are some things new employees should know going into their first day at AT&T? (77)
  • What do you think of the decision by the Indian Government to demonetize 500 and 1000 rupee notes? (68)
  • Which is the best digital marketing course? (66)
  • How can you increase your height? (63)
  • How do l see who viewed my videos on Instagram? (61)

Distribution of Question Lengths:

A helper function determines the number of words in a sentence and is applied to both the 'question1' and 'question2' columns of the DataFrame. The resulting word counts are visualized with a histogram for each column, allowing the two length distributions to be compared.

image
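
A sketch of how these word counts and histograms could be produced (the plotting details are assumptions; df is the DataFrame loaded earlier):

```python
import matplotlib.pyplot as plt

def num_words(sentence) -> int:
    # Count words by splitting on whitespace
    return len(str(sentence).split())

# df: the DataFrame loaded in the earlier sketch
df["q1_n_words"] = df["question1"].apply(num_words)
df["q2_n_words"] = df["question2"].apply(num_words)

# Compare the two length distributions side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
axes[0].hist(df["q1_n_words"], bins=50)
axes[0].set_title("question1 word counts")
axes[1].hist(df["q2_n_words"], bins=50)
axes[1].set_title("question2 word counts")
plt.show()
```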

References for feature extraction:

Feature Engineering

Feature Extraction

  • freq_qid1: Frequency of qid1's
  • freq_qid2: Frequency of qid2's
  • q1len: Length of q1
  • q2len: Length of q2
  • q1_n_words: Number of words in Question 1
  • q2_n_words: Number of words in Question 2
  • word_Common: Number of common unique words in Question 1 and Question 2
  • word_Total: Total number of words in Question 1 + Total number of words in Question 2
  • word_share: word_Common / word_Total
  • freq_q1+freq_q2: Sum total of frequency of qid1 and qid2
  • freq_q1-freq_q2: Absolute difference of frequency of qid1 and qid2
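
A sketch of how several of these features could be computed (the implementation details are my assumptions, not the repository's exact code; df is the cleaned DataFrame from above):

```python
def basic_features(row) -> dict:
    q1, q2 = str(row["question1"]), str(row["question2"])
    q1_words, q2_words = q1.split(), q2.split()

    common = set(q1_words) & set(q2_words)   # common unique words
    word_total = len(q1_words) + len(q2_words)

    return {
        "q1len": len(q1),
        "q2len": len(q2),
        "q1_n_words": len(q1_words),
        "q2_n_words": len(q2_words),
        "word_Common": len(common),
        "word_Total": word_total,
        "word_share": len(common) / word_total if word_total else 0.0,
    }

# Frequency features: how often each question id appears anywhere in the dataset
freq = df["qid1"].value_counts().add(df["qid2"].value_counts(), fill_value=0)
df["freq_qid1"] = df["qid1"].map(freq)
df["freq_qid2"] = df["qid2"].map(freq)
df["freq_q1+freq_q2"] = df["freq_qid1"] + df["freq_qid2"]
df["freq_q1-freq_q2"] = (df["freq_qid1"] - df["freq_qid2"]).abs()

features = df.apply(basic_features, axis=1, result_type="expand")
df[features.columns] = features
```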

Feature Extraction after pre-processing.

Featurization (NLP and Fuzzy Features) Definitions:

  • Token: obtained by splitting a sentence on spaces
  • Stop_Word: a stop word as per NLTK
  • Word: a token that is not a stop word

Features:

  • cwc_min: Ratio of common_word_count to min length of word count of Q1 and Q2

    • cwc_min = common_word_count / (min(len(q1_words), len(q2_words)))
  • cwc_max: Ratio of common_word_count to max length of word count of Q1 and Q2

    • cwc_max = common_word_count / (max(len(q1_words), len(q2_words)))
  • csc_min: Ratio of common_stop_count to min length of stop count of Q1 and Q2

    • csc_min = common_stop_count / (min(len(q1_stops), len(q2_stops)))
  • csc_max: Ratio of common_stop_count to max length of stop count of Q1 and Q2

    • csc_max = common_stop_count / (max(len(q1_stops), len(q2_stops)))
  • ctc_min: Ratio of common_token_count to min length of token count of Q1 and Q2

    • ctc_min = common_token_count / (min(len(q1_tokens), len(q2_tokens)))
  • ctc_max: Ratio of common_token_count to max length of token count of Q1 and Q2

    • ctc_max = common_token_count / (max(len(q1_tokens), len(q2_tokens)))
  • last_word_eq: Check if the last word of both questions is equal or not

    • last_word_eq = int(q1_tokens[-1] == q2_tokens[-1])
  • first_word_eq: Check if the first word of both questions is equal or not

    • first_word_eq = int(q1_tokens[0] == q2_tokens[0])
  • abs_len_diff: Absolute length difference

    • abs_len_diff = abs(len(q1_tokens) - len(q2_tokens))
  • mean_len: Average Token Length of both Questions

    • mean_len = (len(q1_tokens) + len(q2_tokens))/2
  • fuzz_ratio: Fuzzy Ratio

  • fuzz_partial_ratio: Fuzzy Partial Ratio

  • token_sort_ratio: Token Sort Ratio

  • token_set_ratio: Token Set Ratio

  • longest_substr_ratio: Ratio of length of the longest common substring to min length of token count of Q1 and Q2

    • longest_substr_ratio = len(longest common substring) / (min(len(q1_tokens), len(q2_tokens)))
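
A sketch of a few of these token and fuzzy features, using NLTK stopwords and the fuzzywuzzy package (the epsilon guard against empty questions is my assumption):

```python
from difflib import SequenceMatcher
from fuzzywuzzy import fuzz
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))
EPS = 1e-6  # guard against division by zero for empty questions (assumption)

def token_features(q1: str, q2: str) -> dict:
    q1_tokens, q2_tokens = q1.split(), q2.split()
    if not q1_tokens or not q2_tokens:
        return {}

    q1_words = {t for t in q1_tokens if t not in STOP_WORDS}
    q2_words = {t for t in q2_tokens if t not in STOP_WORDS}
    q1_stops = {t for t in q1_tokens if t in STOP_WORDS}
    q2_stops = {t for t in q2_tokens if t in STOP_WORDS}

    common_word_count = len(q1_words & q2_words)
    common_stop_count = len(q1_stops & q2_stops)
    common_token_count = len(set(q1_tokens) & set(q2_tokens))
    # Longest common substring length (character level)
    lcs = SequenceMatcher(None, q1, q2).find_longest_match(0, len(q1), 0, len(q2)).size

    return {
        "cwc_min": common_word_count / (min(len(q1_words), len(q2_words)) + EPS),
        "cwc_max": common_word_count / (max(len(q1_words), len(q2_words)) + EPS),
        "csc_min": common_stop_count / (min(len(q1_stops), len(q2_stops)) + EPS),
        "csc_max": common_stop_count / (max(len(q1_stops), len(q2_stops)) + EPS),
        "ctc_min": common_token_count / (min(len(q1_tokens), len(q2_tokens)) + EPS),
        "ctc_max": common_token_count / (max(len(q1_tokens), len(q2_tokens)) + EPS),
        "last_word_eq": int(q1_tokens[-1] == q2_tokens[-1]),
        "first_word_eq": int(q1_tokens[0] == q2_tokens[0]),
        "abs_len_diff": abs(len(q1_tokens) - len(q2_tokens)),
        "mean_len": (len(q1_tokens) + len(q2_tokens)) / 2,
        "fuzz_ratio": fuzz.ratio(q1, q2),
        "fuzz_partial_ratio": fuzz.partial_ratio(q1, q2),
        "token_sort_ratio": fuzz.token_sort_ratio(q1, q2),
        "token_set_ratio": fuzz.token_set_ratio(q1, q2),
        "longest_substr_ratio": lcs / (min(len(q1_tokens), len(q2_tokens)) + EPS),
    }
```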

Some additional features [Added by Me]

  • ratio_q_lengths: Calculates the ratio of the lengths of the two questions.
  • common_prefix: Computes the length of the common prefix (the initial common sequence of characters) between the two questions.
  • common_suffix: Calculates the length of the common suffix (the final common sequence of characters) between the two questions.
  • diff_words: Calculates the absolute difference in the number of words between the two questions.
  • diff_chars: Computes the absolute difference in the number of characters between the two questions.
  • jaccard_similarity: Calculates the Jaccard similarity coefficient between the sets of words in the two questions.
  • longest_common_subsequence: Computes the length of the longest common subsequence (LCS) between the two questions.
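
A sketch of how a few of these additional features could be implemented (the helper names mirror the list above and are illustrative):

```python
import os

def jaccard_similarity(q1: str, q2: str) -> float:
    # Jaccard coefficient over the word sets of the two questions
    s1, s2 = set(q1.split()), set(q2.split())
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

def common_prefix(q1: str, q2: str) -> int:
    # Length of the shared leading character sequence
    return len(os.path.commonprefix([q1, q2]))

def common_suffix(q1: str, q2: str) -> int:
    # Length of the shared trailing character sequence (reverse and reuse the prefix helper)
    return len(os.path.commonprefix([q1[::-1], q2[::-1]]))

def longest_common_subsequence(q1: str, q2: str) -> int:
    # Classic dynamic-programming LCS over characters
    m, n = len(q1), len(q2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if q1[i - 1] == q2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]
```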

Processing and Extracting Features

The script sets the file path for a CSV file named "data_with_features.csv" and specifies the number of rows used for training via the variable rows_to_train, set to 100,000. This number can be adjusted based on specific needs or dataset sizes.

Pre-processing of Text

Preprocessing:

  • Removing HTML tags
  • Removing punctuation
  • Performing stemming
  • Removing stopwords
  • Expanding contractions, etc.
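
A sketch of such a preprocessing step, using BeautifulSoup for HTML stripping and NLTK for stopwords and stemming (the contraction map shown is a truncated, illustrative example):

```python
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}  # truncated example map

def preprocess(text: str) -> str:
    text = str(text).lower()
    text = BeautifulSoup(text, "html.parser").get_text()        # remove HTML tags
    for contraction, expanded in CONTRACTIONS.items():           # expand contractions
        text = text.replace(contraction, expanded)
    text = re.sub(r"[^\w\s]", " ", text)                         # remove punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]    # remove stopwords
    return " ".join(STEMMER.stem(t) for t in tokens)             # stemming
```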

Extracting Features

The feature-generation function runs over the specified number of rows (rows_to_train) and saves the data to the file. The resulting DataFrame is displayed, showing the first five rows with the additional features extracted from the original dataset, including frequency, length, and similarity-ratio characteristics of the question pairs.

image

Check for questions with two words or fewer

Question pairs are filtered from the DataFrame when either 'q1' or 'q2' has two words or fewer, and the result is stored in a new DataFrame called filtered_data. Details for the first 10 filtered pairs and the total number of pairs meeting the criterion are then printed. This helps inspect the characteristics of question pairs with very low word counts.

This also provides insight into the distribution of question lengths, highlighting the minimum length and the number of questions of that length in both 'question1' and 'question2'.

image
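
A sketch of that filtering step (the word-count columns are assumed to come from the earlier feature step):

```python
# Keep only pairs where either question has two words or fewer
mask = (df["q1_n_words"] <= 2) | (df["q2_n_words"] <= 2)
filtered_data = df[mask]

print(filtered_data[["question1", "question2"]].head(10))
print("Total short-question pairs:", len(filtered_data))
```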

  • Kullback-Leibler (KL) divergence is used to analyze the discriminatory power of the 33 features in distinguishing between duplicate and non-duplicate pairs.

  • This visualization allows us to compare the distribution of each feature for duplicate and non-duplicate pairs, providing insights into the characteristics that might differentiate between the two categories.

  • Violin plots show the distribution shape, while Density plots provide a smooth estimate of the probability density function for each class.

    image

    image

    image

    image

This visualization helps identify features with high inverted KL Divergence, highlighting those that exhibit significant differences between duplicate and non-duplicate pairs. Higher values indicate features that are more discriminative in distinguishing between the two classes.

image
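
A sketch of how a per-feature KL divergence between the two classes could be estimated (the histogram binning and the small epsilon are assumptions):

```python
import numpy as np
from scipy.stats import entropy

def feature_kl_divergence(df, feature: str, bins: int = 50) -> float:
    # Estimate the two class-conditional distributions over a shared set of bins
    dup = df.loc[df["is_duplicate"] == 1, feature].dropna()
    non_dup = df.loc[df["is_duplicate"] == 0, feature].dropna()
    edges = np.histogram_bin_edges(df[feature].dropna(), bins=bins)

    p, _ = np.histogram(dup, bins=edges, density=True)
    q, _ = np.histogram(non_dup, bins=edges, density=True)
    p, q = p + 1e-10, q + 1e-10  # avoid zero-probability bins

    # scipy normalizes p and q, then returns sum(p * log(p / q))
    return entropy(p, q)

kl_scores = {f: feature_kl_divergence(df, f) for f in ["word_share", "ctc_min", "fuzz_ratio"]}
print(sorted(kl_scores.items(), key=lambda kv: kv[1], reverse=True))
```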

The five least discriminative features, based on their calculated KL divergence, are identified, and a pair plot is created for them along with the target variable 'is_duplicate'.

image

image

Important features in differentiating Duplicate(Similar) and Non-Duplicate(Dissimilar) Questions.

  • The distributions of q1len, q2len, q1_n_words, q2_n_words, word_Total, and word_share for duplicate and non-duplicate questions overlap, but not completely, which makes each of them a useful feature.

Visualizing in a Lower Dimension Using t-SNE (3D)

3D_plot
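
A sketch of such a 3D t-SNE projection on a sample of the engineered features (the sample size, feature subset, and perplexity are assumptions; t-SNE is too expensive to run on the full dataset):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler

# Work on a small sample: t-SNE scales poorly to hundreds of thousands of rows
sample = df.sample(5000, random_state=42)
X = MinMaxScaler().fit_transform(
    sample[["word_share", "ctc_min", "fuzz_ratio", "token_sort_ratio"]]
)

tsne_3d = TSNE(n_components=3, perplexity=30, random_state=42).fit_transform(X)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(tsne_3d[:, 0], tsne_3d[:, 1], tsne_3d[:, 2], c=sample["is_duplicate"], s=3)
plt.show()
```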

Featurizing Text Data with TF-IDF Weighted Word Vectors

  • Extracts features for each question in the dataset using spaCy, considering the semantic meaning of words and their TF-IDF weights. These features are then added to the DataFrame for further analysis.

  • Loads processed features, drops unnecessary columns, extracts features for Question 1 and Question 2, and displays information about the features in separate DataFrames.

  • Consolidates the features from different DataFrames into a single DataFrame and saves it to the specified CSV file for further use.

  • The code replaces non-numeric values in the DataFrame with NaN, checks for the presence of NaN values, and prints the count of NaN values in each column after replacement.

  • Converts all features to numeric format, handling any errors by coercing non-numeric values to NaN.
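
A sketch of IDF-weighted averaging of spaCy word vectors (the en_core_web_lg model and the exact weighting scheme are assumptions, not the repository's verified setup):

```python
import numpy as np
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_lg")  # a spaCy model that ships with word vectors (assumption)

# Fit TF-IDF on all question text and keep a word -> IDF lookup
questions = list(df["question1"].astype(str)) + list(df["question2"].astype(str))
tfidf = TfidfVectorizer(lowercase=True)
tfidf.fit(questions)
word2idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def weighted_vector(text: str) -> np.ndarray:
    # Average the spaCy vectors of the words, weighted by their IDF scores
    doc = nlp(str(text))
    vec = np.zeros(nlp.vocab.vectors_length)
    weight_sum = 0.0
    for token in doc:
        idf = word2idf.get(token.text.lower(), 0.0)
        vec += token.vector * idf
        weight_sum += idf
    return vec / weight_sum if weight_sum else vec

q1_vecs = np.vstack([weighted_vector(q) for q in df["question1"]])
q2_vecs = np.vstack([weighted_vector(q) for q in df["question2"]])
```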

Due to limited computational power, the models are trained on 100,000 rows.

  • Checks if there are any NA (missing) values in the DataFrame after converting features to numeric format. If present, it prints "NA Values Present"; otherwise, it prints "No NA Values Present." It then displays the number of NaN values in each column after the conversion. Additionally, it converts the target variable y_true to a list of integers and shows the first few rows of the DataFrame.

image

Splitting into Train and Test Data

Train data: 70%, Test data: 30%
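
A sketch of that split (stratifying on the label to preserve class balance is an assumption):

```python
from sklearn.model_selection import train_test_split

# X: the engineered feature matrix, y_true: the duplicate labels (from the steps above)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_true, test_size=0.30, stratify=y_true, random_state=42
)
print(X_train.shape, X_test.shape)
```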

Distribution of Output Variable in Train and Test Data

The left subplot shows the distribution in the training data, while the right subplot shows the distribution in the testing data. This helps to understand the balance or imbalance in the classes of the output variable.

image

Results

  • Random Model :

    • Log Loss for Training Data: 4.27141
    • Log Loss for Test Data: 3.95542
  • Logistic Regression :

    • Train Log Loss: 0.46723
    • Test Log Loss: 0.47019

    image

  • SGDClassifier :

    • Train Log Loss: 0.44927
    • Test Log Loss: 0.45210

    image

  • NaiveBayesClassifier :

    • Train Log Loss: 11.47686
    • Test Log Loss: 11.49861

    image

  • XGBoost :

    • Train Log Loss: 0.23361
    • Test Log Loss: 0.35239

    image

Log loss reveals the relative model performance. The random baseline shows high log loss (4.27 train, 3.96 test). Logistic Regression (0.47 test) and SGDClassifier (0.45 test) perform well, while the Naive Bayes classifier performs poorly (about 11.5 on both sets). XGBoost achieves the lowest test log loss (0.23 train, 0.35 test).
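
As an illustration of how two of these models could be trained and evaluated against log loss (hyperparameters are illustrative, not the repository's tuned values):

```python
import xgboost as xgb
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

# Linear model trained with log loss (named "log" in older scikit-learn);
# calibration yields well-behaved probabilities
sgd = SGDClassifier(loss="log_loss", alpha=1e-4, penalty="l2")
clf = CalibratedClassifierCV(sgd, method="sigmoid")
clf.fit(X_train, y_train)
print("SGD test log loss:", log_loss(y_test, clf.predict_proba(X_test)))

# Gradient-boosted trees
model = xgb.XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1, eval_metric="logloss")
model.fit(X_train, y_train)
print("XGBoost test log loss:", log_loss(y_test, model.predict_proba(X_test)))
```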
