feat: add ability to select columns from csv to use as metadata #660

Open · wants to merge 10 commits into main

Conversation

@khairulhaaziq commented Apr 7, 2023

Feat: adds the ability for the CSVLoader to set selected columns as Document metadata.

I added the docs and implemented the feature in the CSVLoader class. The usage below shows how it works.

Usage: extracting a single column as page content, with other columns as metadata

Example CSV file:

hadith_id,chapter_no,hadith_no,chapter,text_ar,text_en,source
91,3,91,Knowledge - كتاب العلم,"حدثنا عبد الله بن محمد... ثم أدها إليه ".","Narrated Zaid bin Khalid Al-Juhani:... for the wolf.",Sahih Bukhari
92,3,92,Knowledge - كتاب العلم,"حدثنا محمد بن العلاء... إلى الله عز وجل.","Narrated Abu Musa:... (Our offending you).",Sahih Bukhari
93,3,93,Knowledge - كتاب العلم,"حدثنا أبو اليمان... وبمحمد صلى الله عليه وسلم نبيا، فسكت.","Narrated Anas bin Malik:... the Prophet became silent.",Sahih Bukhari

Example code:

import { CSVLoader } from "langchain/document_loaders";
const loader = new CSVLoader(
  "all_hadiths_clean.csv",
  "text_ar",
  ["text_en", "source", "hadith_id", "chapter_no", "hadith_no", "chapter"]
);
const docs = await loader.load();
/*
[
  Document {
    pageContent: 'حدثنا عبد الله بن محمد... ثم أدها إليه ".',
    metadata: {
      text_en: ' Narrated Zaid bin Khalid Al-Juhani:... for the wolf."',
      source: 'Sahih Bukhari',
      hadith_id: '91',
      chapter_no: '3',
      hadith_no: ' 91 ',
      chapter: 'Knowledge - كتاب العلم',
      line: 91
    }
  },
  Document {
    pageContent: 'حدثنا محمد بن العلاء... إلى الله عز وجل.',
    metadata: {
      text_en: ' Narrated Abu Musa:... (Our offending you).',
      source: 'Sahih Bukhari',
      hadith_id: '92',
      chapter_no: '3',
      hadith_no: ' 92 ',
      chapter: 'Knowledge - كتاب العلم',
      line: 92
    }
  },
  Document {
    pageContent: 'حدثنا أبو اليمان... وبمحمد صلى الله عليه وسلم نبيا، فسكت.',
    metadata: {
      text_en: ' Narrated Anas bin Malik:... the Prophet became silent.',
      source: 'Sahih Bukhari',
      hadith_id: '93',
      chapter_no: '3',
      hadith_no: ' 93 ',
      chapter: 'Knowledge - كتاب العلم',
      line: 93
    }
  }
]
*/

Tested on my local machine:
[Screenshot of local test output, 2023-04-07]


@hwchase17 (Collaborator) left a comment

This functionality seems great to me.

@khairulhaaziq (Author)

Thanks! Next I'm figuring out how to handle the case where the text is too large to be embedded and the rows need to be split. I think this is a problem across all document loaders/types; I wonder if there's already a solution for this.

@hwchase17 (Collaborator)

So the pipeline is generally:

  • load documents
  • split documents (with the text splitters)
  • embed text

So I think it's more the responsibility of the text splitter to split the documents when needed. Does that make sense?
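
For reference, a minimal sketch of that pipeline. The RecursiveCharacterTextSplitter and OpenAIEmbeddings choices below are illustrative assumptions, not part of this PR, and the import paths reflect langchain.js as of this PR:

import { CSVLoader } from "langchain/document_loaders";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { OpenAIEmbeddings } from "langchain/embeddings";

// 1. load documents
const loader = new CSVLoader("all_hadiths_clean.csv", "text_ar");
const docs = await loader.load();

// 2. split documents (with the text splitters)
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000, chunkOverlap: 100 });
const chunks = await splitter.splitDocuments(docs);

// 3. embed text
const embeddings = new OpenAIEmbeddings();
const vectors = await embeddings.embedDocuments(chunks.map((c) => c.pageContent));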

@khairulhaaziq (Author)

Sorry, I'm still going through the codebase and don't fully grasp it yet. Do you mean the current implementation should split the rows into multiple documents, or is that the goal? By documents I mean the Document class, because I think you meant document in the general sense. Also, should I fix anything in my current code?

@khairulhaaziq (Author)

Got it now: instead of using loader.load(), I should use loader.loadAndSplit(). Now it works perfectly.
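
For anyone following along, this is roughly the change, reusing the constructor arguments from the PR description; loadAndSplit() is the base loader's helper that runs a default text splitter over the loaded rows:

import { CSVLoader } from "langchain/document_loaders";

const loader = new CSVLoader(
  "all_hadiths_clean.csv",
  "text_ar",
  ["text_en", "source", "hadith_id", "chapter_no", "hadith_no", "chapter"]
);

// Same loader as before, but rows that are too long to embed come back
// as multiple Documents after the default splitter has run.
const docs = await loader.loadAndSplit();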

@khairulhaaziq (Author)

For CSV in particular, should we add metadata indicating that a row has been split, including a chunk number? E.g. if a row is split into 3 chunks, there would be metadata of chunk: 1/3, chunk: 2/3 and chunk: 3/3.

Also, there is currently no option to customize chunkSize and chunkOverlap through the loadAndSplit function. Should I override loadAndSplit in the CSVLoader class?
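
One possible answer, assuming the base loadAndSplit accepts an optional text splitter (as the base document loader in langchain.js did around this time): pass a pre-configured splitter instead of overriding the method. A sketch:

import { CSVLoader } from "langchain/document_loaders";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const loader = new CSVLoader("all_hadiths_clean.csv", "text_ar");

// Configure chunking here rather than in an overridden loadAndSplit.
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,
  chunkOverlap: 50,
});
const docs = await loader.loadAndSplit(splitter);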

@khairulhaaziq (Author)

@nfcampos I would like a review; I used this for my own use cases and it worked for me! If you approve, I will try to find a way to add the same feature to Python and the other document loaders.

@nfcampos (Collaborator) left a comment

One comment, which I'll address:

The diff under review changes CSVLoader's base class from TextLoader to BaseDocumentLoader (excerpt):

  // before
  export class CSVLoader extends TextLoader {
    constructor(filePathOrBlob: string | Blob, public column?: string) {
      super(filePathOrBlob);

  // after
  export class CSVLoader extends BaseDocumentLoader {

@nfcampos (Collaborator)

We don't actually want to change the base class here, instead we want to update TextLoader to let subclasses specify metadata. I'll do that and then will merge this. Thanks!
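
A rough sketch of that direction; the class and method names here are hypothetical and not the implementation that was eventually merged:

import { Document } from "langchain/document";

// Hypothetical base loader: it owns Document construction, and subclasses such
// as CSVLoader only supply the page contents and per-chunk metadata.
abstract class SketchTextLoader {
  constructor(public filePathOrBlob: string | Blob) {}

  // CSVLoader would return one entry per row, taken from the selected column.
  protected abstract parse(raw: string): Promise<string[]>;

  // Metadata hook: the base class adds only the line number;
  // CSVLoader would merge in the selected metadata columns for row i.
  protected metadataFor(_raw: string, i: number): Record<string, unknown> {
    return { line: i + 1 };
  }

  async load(): Promise<Document[]> {
    const raw = ""; // reading the file or Blob is elided in this sketch
    const contents = await this.parse(raw);
    return contents.map(
      (pageContent, i) =>
        new Document({ pageContent, metadata: this.metadataFor(raw, i) })
    );
  }
}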

@nfcampos self-assigned this Apr 13, 2023