feat: add ability to select columns from csv to use as metadata #660
fix: overwrite prefilled metadata if selected columns have same name
docs: add usage for metadata columns
this functionality seems great to me
Thanks! Next I am figuring out how to handle the case where the text is too large to be embedded and the rows need to be split. I think this is a problem across all document loaders/types. I wonder if there's already a solution for this.
so the pipeline is generally:
So I think it's more the responsibility of the text splitter to split documents if needed. Does that make sense?
Sorry, I am still going through the codebase and don't fully grasp it. Do you mean the current implementation should split the rows into multiple documents, or that this is the goal? By documents I mean the Document class, because I think you meant document in the general sense. Also, should I fix anything in my current code?
I got it now. Instead of using loader.load(), I should use loader.loadAndSplit(). Now it works perfectly.
For CSV in particular: should we add metadata signifying that a row has been split, with a chunk number? E.g. if a row is split into 3, the chunks get metadata of chunk: 1/3, chunk: 2/3 and chunk: 3/3. Also, there is currently no option to customize chunkSize and chunkOverlap in the loadAndSplit function. Should I override the loadAndSplit function in the CSVLoader class?
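To make the chunk-numbering idea concrete, here is a minimal sketch of it. Everything in it is an assumption for illustration: `Doc` is a stand-in for the Document class, and `splitText` and `splitRow` are hypothetical helpers with a naive fixed-size splitter, not the library's text splitter or anything implemented in this PR.

```typescript
// Hypothetical stand-in for the Document class (assumption, not the real API).
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Naive fixed-size splitter with overlap, for illustration only.
function splitText(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("require chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current chunk already reaches the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// Split one row-level document and tag each piece with "i/n" under `chunk`.
function splitRow(doc: Doc, chunkSize: number, chunkOverlap: number): Doc[] {
  const parts = splitText(doc.pageContent, chunkSize, chunkOverlap);
  return parts.map((part, i) => ({
    pageContent: part,
    metadata: { ...doc.metadata, chunk: `${i + 1}/${parts.length}` },
  }));
}
```

The original row metadata is copied onto every chunk, so the row each piece came from stays identifiable after splitting.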
@nfcampos I would like a review. I used this for my own use cases and it worked for me! If you approve, I will try to find a way to add the same feature for Python and other document loaders!
One comment, I'll address
export class CSVLoader extends TextLoader {
constructor(filePathOrBlob: string | Blob, public column?: string) {
super(filePathOrBlob);
export class CSVLoader extends BaseDocumentLoader {
We don't actually want to change the base class here, instead we want to update TextLoader to let subclasses specify metadata. I'll do that and then will merge this. Thanks!
Feat: add the ability to set CSV columns as metadata in the CSVLoader.
I added the docs and implemented the feature in the CSVLoader class. The usage below explains how it works.
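As a rough illustration of the behaviour described here, and not the PR's actual code, the sketch below maps already-parsed CSV rows to documents. It takes the text from one column and copies the selected columns into the metadata. The names `Doc`, `Row`, `rowsToDocs` and `metadataColumns` are hypothetical, and the prefilled `source` and `line` fields are an assumption about what the loader sets by default. A selected column with the same name as a prefilled field overwrites it, matching the fix commit in this PR.

```typescript
// Hypothetical stand-in for the Document class (assumption, not the real API).
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// One parsed CSV row: header name -> cell value.
type Row = Record<string, string>;

function rowsToDocs(
  rows: Row[],
  source: string,
  column?: string,
  metadataColumns: string[] = []
): Doc[] {
  return rows.map((row, i) => {
    // Use the single requested column, or fall back to "key: value" lines.
    const pageContent =
      column !== undefined
        ? row[column] ?? ""
        : Object.entries(row)
            .map(([key, value]) => `${key}: ${value}`)
            .join("\n");

    // Prefilled metadata (assumed defaults), then the selected columns.
    const metadata: Record<string, unknown> = { source, line: i + 1 };
    for (const name of metadataColumns) {
      // A selected column overwrites a prefilled field of the same name.
      if (name in row) metadata[name] = row[name];
    }
    return { pageContent, metadata };
  });
}

// Usage: "text" becomes the content, "id" and "source" become metadata.
const docs = rowsToDocs(
  [{ id: "1", text: "Hello", source: "web" }],
  "data.csv",
  "text",
  ["id", "source"]
);
```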
Usage, extracting a single column with metadata
Example CSV file:
Example code:
Tested on my local machine: