Using the chroma library and the bge-large-zh-v1.5 model, when recalling certain words, completely irrelevant slices are recalled. #1646

tanghaichen · 2024-06-19T09:48:45Z

Search before asking

I searched the issues and found no similar ones.

Operating system information

Linux

Python version information

3.10

DB-GPT version

main

Related scenes

Installation Information

Device information

GPU 96G

Models information

bge-large-zh-v1.5

What happened

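I am using the bge-large-zh-v1.5 model with the Chroma vector store. When I search for certain terms, the recalled chunks have high scores but are completely unrelated to the term. This only happens for one particular term; recall for the vast majority of other terms is fairly accurate.
The knowledge base currently contains PDF, CSV, and Word documents, about 6,000 chunks in total.
Example:
Term: "水资源" (water resources)
There are 20 documents, 900 chunks, in which the term 水资源 appears verbatim. None of the other documents contain these three characters.
However, when I query 水资源, every recalled chunk is unrelated to it.
So far I have not found any other term with this problem.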
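To turn the symptom above into something measurable, one option is to check what share of the top-k recalled chunks actually contain the query term. The following is a minimal sketch, not DB-GPT or Chroma code: `keyword_hit_rate` is a hypothetical helper, and the sample texts are invented for illustration rather than taken from the real knowledge base.

```python
def keyword_hit_rate(retrieved_texts, term, k=5):
    """Return the fraction of the top-k retrieved texts that contain `term`."""
    top = retrieved_texts[:k]
    if not top:
        return 0.0
    return sum(term in text for text in top) / len(top)


# Hypothetical recall result for the query "水资源" (water resources).
# Only the second text contains the term.
retrieved = [
    "年度财务报表摘要",      # annual financial report summary
    "水资源管理条例第三章",  # water resources regulation, chapter 3
    "员工考勤制度",          # staff attendance policy
    "设备采购清单",          # equipment purchase list
]

rate = keyword_hit_rate(retrieved, "水资源", k=4)
print(f"share of top-4 chunks containing the term: {rate}")
```

A rate near zero for one term, alongside high rates for other terms, would help narrow the problem down to that term's embedding rather than the index as a whole.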

What you expected to happen

Recall should reasonably come from the chunks that actually contain the term.
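The expected behaviour can be sketched as a keyword pre-filter applied before vector ranking. This is an illustrative assumption, not how DB-GPT retrieves today: `search_with_keyword_filter` and `cosine` are hypothetical helpers, and the two-dimensional vectors are toy stand-ins for real bge-large-zh-v1.5 embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_with_keyword_filter(query_vec, term, chunks, top_k=3):
    """Rank only the chunks whose text contains `term`.

    `chunks` is a list of (text, vector) pairs. If no chunk contains the
    term, fall back to ranking every chunk by similarity alone.
    """
    candidates = [c for c in chunks if term in c[0]] or chunks
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


# Toy data: the unrelated chunk has the highest raw similarity to the query.
chunks = [
    ("水资源保护规划", [0.2, 0.9]),  # contains the term
    ("无关的高分切片", [1.0, 0.1]),  # unrelated, but scores highest
    ("水资源利用统计", [0.6, 0.5]),  # contains the term
]
query_vec = [1.0, 0.0]

results = search_with_keyword_filter(query_vec, "水资源", chunks, top_k=2)
print(results)
```

With the filter in place, the unrelated chunk is excluded even though its similarity score is the highest. Chroma's `where_document` filter with `$contains` offers a comparable restriction at query time; whether DB-GPT exposes that option is not confirmed here.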

How to reproduce

No known reproduction steps.

Additional context

No response

Are you willing to submit PR?

Yes I am willing to submit a PR!

Aries-ckt · 2024-06-19T14:10:54Z

What types of documents are you using, and could you share some bad cases with us?

tanghaichen added the bug (Something isn't working) and Waiting for reply labels · Jun 19, 2024

Aries-ckt changed the title ~~使用chroma库和bge-large-zh-v1.5模型，对某些词语召回时，召回的却是完全不相关的切片~~ Using the chroma library and the bge-large-zh-v1.5 model, when recalling certain words, completely irrelevant slices are recalled. Jun 19, 2024
