Query Refinement Backtranslation #27

Open
DelaramRajaei opened this issue Jun 20, 2023 · 16 comments
@DelaramRajaei
Member

This is the issue where I report my progress on the project.

@DelaramRajaei DelaramRajaei added enhancement New feature or request experiment labels Jun 20, 2023
@DelaramRajaei DelaramRajaei self-assigned this Jun 20, 2023
@DelaramRajaei
Member Author

DelaramRajaei commented Jun 20, 2023

@hosseinfani
Initially, I addressed two bugs related to reading and storing the CSV file in the project. To resolve the first, I replaced the deprecated DataFrame.append call with pandas.concat. For reading the file, I examined the format and observed that all entries start with a <top> tag. Consequently, I implemented the following code to handle this situation:

# a '<top>' tag marks a TREC-style tagged topics file
if '<top>' in line and not is_tag_file:
    is_tag_file = True

If the file is a tagged .txt file, the flag is activated and the tagged-format reader is used.
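For illustration, the append-to-concat change looks like this (the column names and row values here are just examples, not the project's actual data):

import pandas as pd

df = pd.DataFrame({'qid': ['301'], 'query': ['international organized crime']})
new_row = {'qid': '302', 'query': 'poliomyelitis and post polio'}

# deprecated since pandas 1.4 and removed in 2.0:
# df = df.append(new_row, ignore_index=True)

# replacement: wrap the new row in a one-row DataFrame and concatenate
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)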

I also modified the .../qe/main.py file, replacing the .format() calls with f-strings.
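For illustration, the kind of change involved (the strings and variables are hypothetical, not the actual lines in main.py):

expander, corpus, score = 'backtranslation', 'robust04', 0.23
# before: str.format
print('mAP for {} on {}: {}'.format(expander, corpus, score))
# after: f-string
print(f'mAP for {expander} on {corpus}: {score}')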

Subsequently, I incorporated the backtranslation expander exclusively for the French language. You can find the relevant code snippet in the "../qe/expander/backtranslation.py" file. The settings of the backtranslation model and languages are in the "../qe/cmn/param.py" file.
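As a rough sketch of what the expander does, assuming an NLLB-style HuggingFace model (the 'fra_Latn' code reported below suggests NLLB, but the actual model is whatever param.py configures):

from transformers import pipeline

MODEL = 'facebook/nllb-200-distilled-600M'  # assumed model choice
to_fr = pipeline('translation', model=MODEL, src_lang='eng_Latn', tgt_lang='fra_Latn')
to_en = pipeline('translation', model=MODEL, src_lang='fra_Latn', tgt_lang='eng_Latn')

def backtranslate(query: str) -> str:
    # English -> French -> English; the round trip paraphrases the query
    french = to_fr(query)[0]['translation_text']
    return to_en(french)[0]['translation_text']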

To facilitate result comparison, I have developed the "toy-compare.py" Python script, which can be found in the toy directory. However, I plan to relocate this file to the "../qe/eval" directory.

There are three available functions that can be used to compare the results (a sketch of the first two follows the list):

  1. compare_mAP_each_row(): This function compares the mAP row by row, specifically comparing the mAP of the original query with the mAP of the selected column(s) (this can be a list of columns, e.g., the different languages in the backtranslation expander). The results are written to a CSV file.

  2. compare_mAP_all_row(): This function calculates the mean of the mAP values for each column and writes the results to a .txt file.

  3. plot_result(): Although this function is still under development, it is intended to generate plots of the results for a selected dataset, displaying both the original and backtranslation results.
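As noted above, here is a minimal sketch of the first two comparisons (the 'qid', 'original', and per-language column names are my illustration; the actual CSV layout may differ):

import pandas as pd

def compare_mAP_each_row(df, columns, outfile='compare.rows.csv'):
    # per query: how does each expanded variant do against the original?
    result = df[['qid', 'original']].copy()
    for col in columns:  # e.g., one column per backtranslation language
        result[f'{col}.delta'] = df[col] - df['original']
    result.to_csv(outfile, index=False)

def compare_mAP_all_row(df, columns, outfile='compare.mean.txt'):
    # aggregate: mean mAP per column
    with open(outfile, 'w') as f:
        for col in ['original'] + columns:
            f.write(f'{col}: {df[col].mean():.4f}\n')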

Next, I made further updates to the code to enable it to handle multiple languages and generate queries accordingly. Subsequently, I compared the results using the "toy-compare.py" script.

However, there are still a few remaining bugs in the project:

  1. When running the code for multiple languages, it raises an error for the Dutch language; I have not yet identified the source.

  2. After generating the "topics.robust04.bm25.map.all.csv" file, the first query consistently returns a NaN value, and I am currently investigating the cause of this issue.

By next Friday, I have outlined the following tasks to be completed:

  1. My priority is to address and resolve the existing bugs.
  2. I aim to finalize and complete the plot function.
  3. I will conduct a thorough comparison of the results obtained and subsequently prepare a comprehensive report on the findings.

@hosseinfani
Member

@DelaramRajaei
Thank you for the detailed report.

  • Please keep this issue page updated with your progress every 2-3 days.
  • Please read our paper on backtranslation for review analysis. It gives you some idea about the roadmap.

@DelaramRajaei
Member Author

@hosseinfani

The reported bugs have been successfully resolved.

Queries 301 and 672 were returning NaN values. The first was caused by a bug in the code, which has been rectified. The second occurred because the topic lacks a qrel in ir_datasets for "robust04"; the absence of this qrel is mentioned in the accompanying paper.
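For reference, one way to guard against such topics is to keep only the queries that have at least one qrel (the ir_datasets ID below is my assumption for robust04):

import ir_datasets

dataset = ir_datasets.load('disks45/nocr/trec-robust-2004')  # assumed ID
judged = {qrel.query_id for qrel in dataset.qrels_iter()}

# topics without relevance judgments would evaluate to NaN, so skip them
topics = [q for q in dataset.queries_iter() if q.query_id in judged]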

Additionally, some modifications were made to improve the code. Specifically, the run() function in the main.py file was restructured, and duplicate lines were removed.

Another bug related to the backtranslation feature was identified and resolved. The issue stemmed from the model name being stored in lowercase in the df dataframe in main.py's build function. The model name contains the target-language code, such as 'fra_Latn', but it was lowercased, which broke the match.
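A small illustration of the failure mode (the names here are illustrative):

name = 'backtranslation.fra_Latn'
stored = name.lower()   # 'backtranslation.fra_latn'
# the case-sensitive language code no longer matches, so lookups on it fail
assert stored != name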

Both bugs are fixed and a pull request has been sent.

@hosseinfani
Member

@DelaramRajaei
thank you.
pls put a quick comment in the code about the query with no qrels.
Next step will be the help/hurt chart, right?

@DelaramRajaei
Member Author

@hosseinfani
I added the comment to the code and pushed it to my repository. Should I create a new pull request for this comment?

Yes, in the next step I am completing the plot and will report my findings about it.

@hosseinfani
Member

i don't think so. it automatically accumulates.

@DelaramRajaei
Member Author

I have updated the code and pushed the new changes.

I fixed the bug with the antique dataset. I changed main.py and abstractqexpander.py; there were some problems in reading and writing new queries in .txt files.

Here are two logs of running the code with backtranslation expander on 5 different languages for robust04 and antique datasets.
log_file_antique.txt
log_file_robust04.txt

@DelaramRajaei
Member Author

@hosseinfani
I need the results of other expanders for overall comparison.

@hosseinfani
Member

@DelaramRajaei
I'm uploading them in the Query Refinement channel at ReQue >> v2.0 >> qe >> output.

@hosseinfani
Member

@DelaramRajaei
done!

@DelaramRajaei
Member Author

@hosseinfani

I have run the program for the datasets below, and here are the logs of running the code with the backtranslation expander on 5 different languages for these datasets.

Unfortunately, I was unable to download the indexes for the ClueWeb datasets due to their large size.
Could you please share the indexes with me?

I'm currently in the process of drafting the paper and analyzing the results to identify any trends.

@hosseinfani
Member

@DelaramRajaei
For the record :D
we ran into problems when downloading the files from MS Teams. So, I gave Delaram the key to my office and asked her to open the computer casing and bring the hard disk (first the SSD, but then the other one). We internally attached the correct hard disk.

@DelaramRajaei
Member Author

I have run the program for clueweb09b and this is the log.

log_file_clueweb09b.txt

Unfortunately, I encountered a problem with the zip files for clueweb12b13, as they were found to be corrupted. I am currently exploring potential solutions to fix this issue.

In addition to that, I have been focusing on plotting the results and comparing the mean Average Precision (mAP) of the original queries with that of the backtranslated queries.
So far, I have not achieved promising results; overall, the performance did not improve with backtranslation. However, I am investigating ways to enhance it and trying to figure out which languages or datasets might yield better outcomes.
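A minimal sketch of the comparison plot I am building (the file name, column layout, and language codes are assumptions):

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('topics.robust04.bm25.map.all.csv')  # one mAP column per method (assumed)
languages = ['fra_Latn', 'deu_Latn', 'rus_Cyrl', 'zho_Hans', 'tam_Taml']  # assumed languages
means = [df['original'].mean()] + [df[lang].mean() for lang in languages]

plt.bar(['original'] + languages, means)
plt.ylabel('mean average precision (mAP)')
plt.title('robust04: original vs. backtranslated queries')
plt.savefig('robust04.map.png')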

@DelaramRajaei
Member Author

@hosseinfani

After analyzing the results of these five datasets in five distinct languages, here are the findings:
analyze.xlsx

Overall, it can be observed that the datasets "dbpedia" and "robust04" tend to yield superior results compared to the other datasets.

Additionally, Isaac compiled a list of new datasets related to law, medicine, and finance. I can process and analyze these new datasets. Also, I can change the translation model and see if that makes the results better.

@DelaramRajaei
Member Author

DelaramRajaei commented Aug 12, 2023

Hey @hosseinfani,

I'd like to fill you in on my activities this week. I've been working on adding tct-colbert as a dense retrieval method. I went through the RePair project and pyserini's documentation on dense retrieval. It seems that I need to modify the format of my stored files within the write_expanded_query function in the abstractqexpander.py file.

To maintain the integrity of the original code, I introduced new functions: read_queries and write_queries. The read_queries() function takes a file name as input and reads the file, handling various formats such as tagged or CSV. It's similar to the old read_expanded_queries function, but with a minor adjustment. I also introduced a new variable for each expander called query_set, which holds the outcomes of the expanded queries specific to that expander.
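A sketch of how such a reader might dispatch on format (the actual read_queries in abstractqexpander.py may differ in details):

import pandas as pd

def read_queries(infile: str) -> pd.DataFrame:
    if infile.endswith('.csv'):
        return pd.read_csv(infile)
    # otherwise assume a TREC-style tagged topics file
    rows, qid = [], None
    with open(infile, encoding='utf-8') as f:
        for line in f:
            if line.startswith('<num>'):
                qid = line.replace('<num>', '').replace('Number:', '').strip()
            elif line.startswith('<title>'):
                rows.append({'qid': qid, 'query': line.replace('<title>', '').strip()})
    return pd.DataFrame(rows)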

There was a problem with the write_expanded_queries function: it read each query line and immediately expanded and wrote it to a new file in the same format. Unfortunately, this posed a challenge when attempting to add batching to the system and to use pyserini and colbert.

So, I decided to restructure the approach a bit. Let me give you an overview of how things are unfolding within the generate function.

[Diagram: generate function process]

In this sequence, we begin by providing the filename of the original queries in any format. Once the file is read, it produces a dataframe as output. The preprocess_expanded_function receives a query as input, and a loop is executed over the generated dataframe. Within the function, it initially expands the query based on the specific expander and method, and subsequently cleans it (preprocesses it). Each expanded query that is generated is then stored in the query_set variable.
Here, we can also think about adding batches (although this might need quite a bit of changing based on other expanders). Additionally, we could consider integrating a message queue and broker like Celery and Redis to set up multiple instances of this function.

Afterward, by specifying the file name, the query_set will be saved in a more user-friendly CSV format.
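Putting the pieces together, the flow is roughly the following (the function names follow the description above; the bodies are schematic, and read_queries is the sketch from earlier):

import pandas as pd

def generate(expander, infile: str, outfile: str):
    df = read_queries(infile)                     # any supported format -> DataFrame
    query_set = []
    for row in df.itertuples():
        expanded = expander.expand(row.query)     # expander/method-specific expansion
        cleaned = expander.preprocess(expanded)   # then cleaning/preprocessing
        query_set.append({'qid': row.qid, 'query': cleaned})
    # batching, or a Celery/Redis queue, could parallelize this loop
    pd.DataFrame(query_set).to_csv(outfile, index=False)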

This architecture now supports using only Pyserini, which let me remove Anserini from the code in the search and evaluate functions. To modify the evaluation function, I referred to the documentation provided by Pyserini.
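For context, the Pyserini-only search path looks roughly like this (the prebuilt index name and run-file details are assumptions):

from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index('robust04')  # assumed prebuilt index name
hits = searcher.search('international organized crime', k=1000)
with open('run.robust04.txt', 'w') as f:
    for rank, hit in enumerate(hits, start=1):
        f.write(f'301 Q0 {hit.docid} {rank} {hit.score} bm25\n')

# evaluation can then go through Pyserini's bundled trec_eval, e.g.:
# python -m pyserini.eval.trec_eval -m map qrels.robust04.txt run.robust04.txt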

After encountering several bugs and errors, I'm pleased to share that I've managed to address all of them today, resulting in the project running seamlessly now.

Subsequently, I attempted to add tct_colbert to the project and succeeded. However, I'm currently facing an indexing issue with the datasets: I need to encode them and obtain the dense index.
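For the record, dense retrieval with TCT-ColBERT in Pyserini follows this pattern; the prebuilt MS MARCO index below is just an example, since our corpora are exactly what still needs to be encoded:

from pyserini.search.faiss import FaissSearcher, TctColBertQueryEncoder

encoder = TctColBertQueryEncoder('castorini/tct_colbert-msmarco')
searcher = FaissSearcher.from_prebuilt_index('msmarco-passage-tct_colbert-hnsw', encoder)
hits = searcher.search('what is backtranslation', k=10)
for hit in hits:
    print(hit.docid, hit.score)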

In the meantime, I've gone ahead and made updates to both the environment.yaml and requirement.txt files. I've changed library versions and introduced some new ones as well. Isaac reviewed these changes and confirmed their smooth functionality.

I've made updates to the Excel task sheet. Regarding my upcoming tasks, here's the list:

  • My main focus is to run the project using colbert and obtain the dense indexes.
  • I aim to implement multiprocessing in the project using a message queue and broker, such as Celery and Redis (a bare-bones sketch follows this list).
  • Additionally, I plan to dedicate time to reading research papers and working on my own paper, which also includes a section on the datasets.
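On the second item, a bare-bones sketch of what the Celery/Redis setup could look like (the broker URL and task granularity are assumptions):

from celery import Celery

app = Celery('qe', broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/1')

@app.task
def expand_query(qid: str, query: str, lang: str) -> dict:
    # each worker expands one query independently; Redis collects the results
    expanded = backtranslate(query)  # e.g., the backtranslation sketch above
    return {'qid': qid, 'query': expanded, 'lang': lang}

# fan out one task per topic, e.g. expand_query.delay(qid, query, 'fra_Latn'),
# then gather the results and write the CSV once all tasks finish.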

@hosseinfani
Member

@DelaramRajaei
Thank you very much for the detailed report.
We need a code review together so I can fully understand the changes.
About the todo list, not sure I understood the Celery and Redis for multiprocessing. We'll talk.
