feat: add ability to select columns from csv to use as metadata #660

Open · wants to merge 10 commits into main

Conversation

@khairulhaaziq commented Apr 7, 2023

Feat: adds the ability for the CSVLoader to set selected columns as Document metadata.

I added the docs and implemented the feature in the CSVLoader class. The usage below shows how it works.

Usage: extracting a single column as page content, with other columns as metadata

Example CSV file:

hadith_id,chapter_no,hadith_no,chapter,text_ar,text_en,source
91,3,91,Knowledge - كتاب العلم,"حدثنا عبد الله بن محمد... ثم أدها إليه ".","Narrated Zaid bin Khalid Al-Juhani:... for the wolf.",Sahih Bukhari
92,3,92,Knowledge - كتاب العلم,"حدثنا محمد بن العلاء... إلى الله عز وجل.","Narrated Abu Musa:... (Our offending you).",Sahih Bukhari
93,3,93,Knowledge - كتاب العلم,"حدثنا أبو اليمان... وبمحمد صلى الله عليه وسلم نبيا، فسكت.","Narrated Anas bin Malik:... the Prophet became silent.",Sahih Bukhari

Example code:

import { CSVLoader } from "langchain/document_loaders";
const loader = new CSVLoader(
  "all_hadiths_clean.csv",
  "text_ar",
  ["text_en", "source", "hadith_id", "chapter_no", "hadith_no", "chapter"]
);
const docs = await loader.load();
/*
[
  Document {
    pageContent: 'حدثنا عبد الله بن محمد... ثم أدها إليه ".',
    metadata: {
      text_en: ' Narrated Zaid bin Khalid Al-Juhani:... for the wolf."',
      source: 'Sahih Bukhari',
      hadith_id: '91',
      chapter_no: '3',
      hadith_no: ' 91 ',
      chapter: 'Knowledge - كتاب العلم',
      line: 91
    }
  },
  Document {
    pageContent: 'حدثنا محمد بن العلاء... إلى الله عز وجل.',
    metadata: {
      text_en: ' Narrated Abu Musa:... (Our offending you).',
      source: 'Sahih Bukhari',
      hadith_id: '92',
      chapter_no: '3',
      hadith_no: ' 92 ',
      chapter: 'Knowledge - كتاب العلم',
      line: 92
    }
  },
  Document {
    pageContent: 'حدثنا أبو اليمان... وبمحمد صلى الله عليه وسلم نبيا، فسكت.',
    metadata: {
      text_en: ' Narrated Anas bin Malik:... the Prophet became silent.',
      source: 'Sahih Bukhari',
      hadith_id: '93',
      chapter_no: '3',
      hadith_no: ' 93 ',
      chapter: 'Knowledge - كتاب العلم',
      line: 93
    }
  }
]
*/

Tested on my local machine:
[Screenshot of local test output, 2023-04-07]


@hwchase17 (Collaborator) left a comment

This functionality seems great to me.

@khairulhaaziq (Author)

Thanks! Next I'm figuring out how to handle the case where the text is too large to be embedded and the rows need to be split. I think this is a problem across all document loaders/types; I wonder if there's already a solution for this.

@hwchase17 (Collaborator)

So the pipeline is generally:

  • load documents
  • split documents (with the text splitters)
  • embed text

So I think it's more the responsibility of the text splitter to split the documents when needed. Does that make sense?
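
For reference, a minimal sketch of that pipeline. The RecursiveCharacterTextSplitter and OpenAIEmbeddings choices below are illustrative assumptions, not part of this PR, and the import paths reflect langchain.js as of this PR:

import { CSVLoader } from "langchain/document_loaders";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { OpenAIEmbeddings } from "langchain/embeddings";

// 1. load documents
const loader = new CSVLoader("all_hadiths_clean.csv", "text_ar");
const docs = await loader.load();

// 2. split documents (with the text splitters)
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000, chunkOverlap: 100 });
const chunks = await splitter.splitDocuments(docs);

// 3. embed text
const embeddings = new OpenAIEmbeddings();
const vectors = await embeddings.embedDocuments(chunks.map((c) => c.pageContent));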

@khairulhaaziq (Author)

Sorry, I'm still going through the codebase and don't fully grasp it yet. Do you mean the current implementation should split the rows into multiple documents, or is that the goal? By documents I mean the Document class, because I think you meant document in the general sense. Also, should I fix anything in my current code?

@khairulhaaziq (Author)

Got it now: instead of using loader.load(), I should use loader.loadAndSplit(). Now it works perfectly.
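
For anyone following along, this is roughly the change, reusing the constructor arguments from the PR description; loadAndSplit() is the base loader's helper that runs a default text splitter over the loaded rows:

import { CSVLoader } from "langchain/document_loaders";

const loader = new CSVLoader(
  "all_hadiths_clean.csv",
  "text_ar",
  ["text_en", "source", "hadith_id", "chapter_no", "hadith_no", "chapter"]
);

// Same loader as before, but rows that are too long to embed come back
// as multiple Documents after the default splitter has run.
const docs = await loader.loadAndSplit();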

@khairulhaaziq (Author)

For CSV in particular, should we add metadata indicating that a row has been split, including a chunk number? E.g. if a row is split into 3 chunks, there would be metadata of chunk: 1/3, chunk: 2/3 and chunk: 3/3.

Also, there is currently no option to customize chunkSize and chunkOverlap through the loadAndSplit function. Should I override loadAndSplit in the CSVLoader class?
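
One possible answer, assuming the base loadAndSplit accepts an optional text splitter (as the base document loader in langchain.js did around this time): pass a pre-configured splitter instead of overriding the method. A sketch:

import { CSVLoader } from "langchain/document_loaders";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const loader = new CSVLoader("all_hadiths_clean.csv", "text_ar");

// Configure chunking here rather than in an overridden loadAndSplit.
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,
  chunkOverlap: 50,
});
const docs = await loader.loadAndSplit(splitter);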

@khairulhaaziq (Author)

@nfcampos I would like a review; I used this for my own use cases and it worked for me! If you approve, I will try to find a way to add the same feature to Python and the other document loaders.

@nfcampos (Collaborator) left a comment

One comment, which I'll address:

The diff under review changes CSVLoader's base class from TextLoader to BaseDocumentLoader (excerpt):

  // before
  export class CSVLoader extends TextLoader {
    constructor(filePathOrBlob: string | Blob, public column?: string) {
      super(filePathOrBlob);

  // after
  export class CSVLoader extends BaseDocumentLoader {

@nfcampos (Collaborator)

We don't actually want to change the base class here, instead we want to update TextLoader to let subclasses specify metadata. I'll do that and then will merge this. Thanks!
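
A rough sketch of that direction; the class and method names here are hypothetical and not the implementation that was eventually merged:

import { Document } from "langchain/document";

// Hypothetical base loader: it owns Document construction, and subclasses such
// as CSVLoader only supply the page contents and per-chunk metadata.
abstract class SketchTextLoader {
  constructor(public filePathOrBlob: string | Blob) {}

  // CSVLoader would return one entry per row, taken from the selected column.
  protected abstract parse(raw: string): Promise<string[]>;

  // Metadata hook: the base class adds only the line number;
  // CSVLoader would merge in the selected metadata columns for row i.
  protected metadataFor(_raw: string, i: number): Record<string, unknown> {
    return { line: i + 1 };
  }

  async load(): Promise<Document[]> {
    const raw = ""; // reading the file or Blob is elided in this sketch
    const contents = await this.parse(raw);
    return contents.map(
      (pageContent, i) =>
        new Document({ pageContent, metadata: this.metadataFor(raw, i) })
    );
  }
}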

@nfcampos self-assigned this Apr 13, 2023