Skip to content

Conversation

@GTimothee
Copy link
Contributor

@GTimothee GTimothee commented Apr 18, 2025

Description

  • Adds save and load methods to KnowledgeBase
  • Adds save and load methods to KnowledgeBase using hugginface hub
  • Adds card template for the newly created dataset on the hub
  • Adds a _config property to KnowledgeBase
  • Adds a list of file names to be saved to the KnowledgeBase class

Related Issue

#2145

Type of Change

  • 📚 Examples / docs / tutorials / dependencies update
  • 🔧 Bug fix (non-breaking change which fixes an issue)
  • 🥂 Improvement (non-breaking change which improves an existing feature)
  • 🚀 New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to change)
  • 🔐 Security fix

It is an improvement as it avoids embeddings to be recomputed.

It is a new feature because we can now save/load to/from disk and hugginface hub.

Checklist

  • I've read the CODE_OF_CONDUCT.md document.
  • I've read the CONTRIBUTING.md guide.
  • I've written tests for all new methods and classes that I created.
  • I've written the docstring in Google format for all the methods and classes that I used.
  • I've updated the pdm.lock running pdm update-lock (only applicable when pyproject.toml has been
    modified)

@GTimothee
Copy link
Contributor Author

GTimothee commented Apr 18, 2025

I still need to write the tests, but I am sharing the PR already to get feedbacks on the implementation.

Little note: You will see that in the load methods you can select another llm or embedding client than the one that was used to generate the KnowledgeBase. I added this purely to add flexibility and improve developer experience, but it was not necessary nor asked in the issue.

@GTimothee
Copy link
Contributor Author

GTimothee commented Apr 18, 2025

TODO

  • fix test push to hub
  • fix test load from hub
  • add documentation to docs/
  • make pre commit checks pass

Copy link
Member

@davidberenstein1957 davidberenstein1957 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @GTimothee, this already looks nice. I left some comments to improve some things..

Hugging Face token for authentication. If None, will use local token.
private : bool
Whether to make the repo private or public.
"""

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also pass kwargs for the push to hub/upload file etc?

Comment on lines 401 to 409
kb = cls(
data=data,
columns=config.get("columns"),
llm_client=llm_client,
embedding_model=embedding_model,
chunk_size=config.get("chunk_size", 2048),
seed=config.get("seed"),
min_topic_size=config.get("min_topic_size"),
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this already create the embeddings during initialisation?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants