Let document retrieval be more flexible

Currently, the LCEL retriever in dialog-lib hard-codes the document content by joining the question and the content together:

https://github.com/talkdai/dialog-lib/blob/4e8de796be1a21c877eb393066a78235e6a193ac/dialog_lib/embeddings/retrievers.py#L31-L39

However, the user already defines which fields should be embedded in [`load_csv.py`](https://github.com/talkdai/dialog/blob/main/src/load_csv.py), so the retriever should respect that choice with a simple return like:
```
        return [
            Document(
                page_content=content.content,
                metadata={
                    "title": content.question,
                    "category": content.category,
                    "subcategory": content.subcategory,
                    "dataset": content.dataset,
                    "link": content.link,
                },
            )
            for content in relevant_contents
        ]
```

Moreover, langchain's CSVLoader by default already embeds each field name prefixed to its value, e.g. `category: cat1\nsubcategory: subcat1\ncontent: content1` (see this test), so it already achieves what the current implementation does, but in a generic way.
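To make the layout concrete, here is a minimal stdlib-only sketch that mimics the `field: value` format quoted above. It does not call CSVLoader itself; `row_to_page_content` and the sample CSV are hypothetical names and data used only for illustration.

```python
import csv
import io


def row_to_page_content(row: dict) -> str:
    # Hypothetical helper: join each column as "name: value", one per line,
    # mirroring the layout quoted above (not CSVLoader's actual code).
    return "\n".join(f"{key}: {value}" for key, value in row.items())


# Made-up sample data with the same columns as the example in the text.
sample_csv = "category,subcategory,content\ncat1,subcat1,content1\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
page_contents = [row_to_page_content(row) for row in rows]
```

Each entry in `page_contents` already carries its own field labels, so the retriever would not need to re-join question and content by hand.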

This proposal works fine with the default project chains, while giving flexibility to users who implement their own prompt design. For example, the project's default RAG chain has this `format_docs`:
https://github.com/talkdai/dialog/blob/fbb13af3b3ee70d36b8ece499828ec74ab593f36/src/dialog/llm/agents/lcel.py#L60-L61 

and users can customize it as they wish. Later, once we implement saving metadata to the vectorstore, we could even return other metadata dynamically as well.
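As a rough sketch of such a customization, assuming the retriever returns the question under `metadata["title"]` as proposed above: `format_docs_with_title` is a hypothetical name, and the `Document` dataclass is only a stand-in for langchain's class so the snippet runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for langchain's Document, so the sketch runs without langchain."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs_with_title(docs):
    # Hypothetical variant of format_docs: prefix each chunk with its
    # title when the metadata provides one, otherwise keep the content as-is.
    parts = []
    for d in docs:
        title = d.metadata.get("title")
        parts.append(f"{title}\n{d.page_content}" if title else d.page_content)
    return "\n\n".join(parts)


docs = [Document("content1", {"title": "question1"}), Document("content2")]
formatted = format_docs_with_title(docs)
```

With the question kept in metadata rather than baked into `page_content`, a prompt author can choose whether to include it at all.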

Let document retrieval be more flexible #206

The `format_docs` referenced above:
```
def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])
```
