Feat rdb summary wide table #2035
Merged
Member
@Aries-ckt Please review it.
@Aries-ckt Is there any progress? When can this be merged? I am currently running into the same problem.
Contributor
Author
Try this branch for now.
Collaborator
Hi, the functionality in this PR looks fine. My main concern is that the table schema vector data of existing users may no longer be found.
added 5 commits
December 12, 2024 14:29
Contributor
Author
To address this issue, I have modified the code so that users' existing vector data can still be retrieved normally.
Collaborator
@FOkvj Hi, why did you remove field_vector_connector?
Contributor
Author
@Aries-ckt Here, I made it created by default, so I think that code is no longer necessary.
Collaborator
Contributor
Author
@Aries-ckt fixed
Collaborator
Collaborator
@FOkvj Are you in our WeChat group? And what's your WeChat alias?
Contributor
Author
@Aries-ckt Yep, I go by Wooop! 😊




Description
When a relational database table is wide, it has a large number of fields. Many of those fields are redundant for a given query, and the full field list may also exceed the maximum sequence length the embedding model accepts when the summary is embedded. As a result, the generated embedding cannot accurately reflect the semantic information of the summary.
Therefore, for wide tables, I split the fields from the basic information of the table. If the table has too many fields, the fields are divided into multiple chunks during summarization, and no chunk exceeds the maximum sequence length of the embedding model. If the table is not wide, the summary is the same as before: the table name, the table description, and the fields are all in the same chunk.
During retrieval, the table is retrieved first. The table name (id) is then used as a filter, the query is used for vector retrieval over that table's field chunks, and finally the table name and table description are assembled with the retrieved fields to form the final result.
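The splitting and two-stage retrieval described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Chunk`, `summarize_table`, `retrieve`, and `score` are hypothetical names, character count stands in for the embedding model's token limit, and a toy word-overlap score stands in for real vector similarity.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    content: str
    metadata: dict = field(default_factory=dict)


def summarize_table(name, comment, fields, max_len=60):
    """Return (table_chunk, field_chunks); field_chunks is empty for narrow tables.

    max_len is measured in characters here (an assumption); a real
    implementation would count tokens with the embedding model's tokenizer.
    """
    header = f"table_name: {name}, comment: {comment}"
    full = header + ", fields: " + ", ".join(fields)
    if len(full) <= max_len:
        # Narrow table: name, description and fields stay in one chunk.
        return Chunk(full, {"table_name": name, "part": "table"}), []
    # Wide table: keep the basic info separate and pack fields greedily.
    table_chunk = Chunk(header, {"table_name": name, "part": "table"})
    field_chunks, current = [], ""
    for f in fields:
        candidate = f if not current else current + ", " + f
        if current and len(candidate) > max_len:
            field_chunks.append(Chunk(current, {"table_name": name, "part": "field"}))
            current = f  # a single over-long field still gets its own chunk
        else:
            current = candidate
    if current:
        field_chunks.append(Chunk(current, {"table_name": name, "part": "field"}))
    return table_chunk, field_chunks


def score(query, text):
    """Toy similarity: number of shared words (stand-in for vector search)."""
    q = set(query.lower().split())
    t = set(text.lower().replace(",", " ").replace(":", " ").split())
    return len(q & t)


def retrieve(query, table_chunks, field_chunks, top_k=1):
    """Stage 1: pick tables. Stage 2: search only that table's field chunks."""
    tables = sorted(table_chunks, key=lambda c: score(query, c.content), reverse=True)[:top_k]
    results = []
    for table in tables:
        name = table.metadata["table_name"]
        # The table name acts as the metadata filter for the field search.
        candidates = [c for c in field_chunks if c.metadata["table_name"] == name]
        ranked = sorted(candidates, key=lambda c: score(query, c.content), reverse=True)[:top_k]
        if ranked:
            results.append(table.content + ", fields: " + ", ".join(c.content for c in ranked))
        else:
            results.append(table.content)
    return results
```

In this sketch a narrow table produces a single chunk and never reaches the second stage, which mirrors the "same as before" behaviour for tables that are not wide.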
How Has This Been Tested?
Tested summarization and retrieval of wide tables in dbgpt/rag/assembler/tests/test_db_struct_assembler.py and dbgpt/rag/assembler/tests/test_embedding_assembler.py, respectively.
Snapshots:
Checklist: