Let document retrieval be more flexible #206

@lgabs

Currently, the LCEL retriever in dialog-lib builds each document's content by joining the question and the content together:

https://github.com/talkdai/dialog-lib/blob/4e8de796be1a21c877eb393066a78235e6a193ac/dialog_lib/embeddings/retrievers.py#L31-L39

However, the user already chooses which fields to embed in `load_csv.py`, so the retriever should respect that choice with a simple return such as

        return [
            Document(
                page_content=content.content,
                metadata={
                    "title": content.question,
                    "category": content.category,
                    "subcategory": content.subcategory,
                    "dataset": content.dataset,
                    "link": content.link,
                },
            )
            for content in relevant_contents
        ]

Moreover, langchain's CSVLoader by default already embeds each field name prefixed to its value, e.g. category: cat1\nsubcategory: subcat1\ncontent: content1 (see this test), so it achieves the same effect as the current implementation, but in a generic way.
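To make the "field name: field value" format concrete, here is a minimal sketch that mimics it with the standard library only. It does not call langchain; `row_to_page_content` is a hypothetical helper written for illustration, and the CSV data is made up.

```python
import csv
import io


def row_to_page_content(row: dict) -> str:
    # Hypothetical helper: join every column as "name: value", one per line,
    # mirroring the format described above (not langchain's actual code).
    return "\n".join(f"{key.strip()}: {value.strip()}" for key, value in row.items())


# Made-up CSV with the same columns as the example in the text.
csv_text = "category,subcategory,content\ncat1,subcat1,content1\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
page_contents = [row_to_page_content(row) for row in rows]

print(page_contents[0])
```

With this format, the question can simply be one of the embedded columns, so the retriever does not need to re-join it into the content.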

This proposal works as-is with the default project chains, while giving flexibility to users who implement their own prompt design. For example, the project's default RAG chain has this format_docs:

def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])

and users can customize it however they like. Later, when we implement saving metadata to the vectorstore, we could even return other metadata dynamically as well.
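As an illustration of that kind of customization, a user could write a variant of format_docs that reads the title from the metadata proposed above. This is only a sketch: the `Document` dataclass below is a stand-in for langchain's `Document` so the example runs on its own, and `format_docs_with_titles` is a hypothetical name, not part of the project.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    # Stand-in for langchain's Document, with the same two attributes.
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs_with_titles(docs):
    # Hypothetical custom formatter: prefix each document with its title
    # when the metadata has one, otherwise fall back to the content alone.
    blocks = []
    for d in docs:
        title = d.metadata.get("title")
        blocks.append(f"Title: {title}\n{d.page_content}" if title else d.page_content)
    return "\n\n".join(blocks)


docs = [
    Document("content: content1", {"title": "question1"}),
    Document("content: content2"),
]
formatted = format_docs_with_titles(docs)
print(formatted)
```

Because the retriever would return the embedded content untouched, this decision stays in the prompt layer rather than in dialog-lib.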
