-
Notifications
You must be signed in to change notification settings - Fork 1.9k
[ENH] Add BM25 embedding function to JS #5756
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
|
Introduce JS‐side BM25 sparse embedding function package A new TypeScript implementation of BM25 is added and wired into the JS client build. The PR creates a standalone package Key Changes• Created new package Affected Areas• This summary was automatically generated by @propel-code-bot |
fda682e to
048d5a9
Compare
| "yourselves", | ||
| ] as const; | ||
|
|
||
| export const DEFAULT_CHROMA_BM25_STOPWORDS = [...DEFAULT_ENGLISH_STOPWORDS]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
This is a great addition! The code is well-structured and the tests are comprehensive.
One minor suggestion to improve the public API of this new package is to export DEFAULT_CHROMA_BM25_STOPWORDS as a readonly array. This prevents consumers of the package from accidentally modifying the default list, which is generally safer for exported constants.
| export const DEFAULT_CHROMA_BM25_STOPWORDS = [...DEFAULT_ENGLISH_STOPWORDS]; | |
| export const DEFAULT_CHROMA_BM25_STOPWORDS: readonly string[] = DEFAULT_ENGLISH_STOPWORDS; |
This follows TypeScript best practices for immutable exports. The readonly modifier ensures that consumers cannot use mutating methods like push(), pop(), or direct index assignment, preventing accidental modification of shared constants across different parts of an application.
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
Context for Agents
[**BestPractice**]
This is a great addition! The code is well-structured and the tests are comprehensive.
One minor suggestion to improve the public API of this new package is to export `DEFAULT_CHROMA_BM25_STOPWORDS` as a `readonly` array. This prevents consumers of the package from accidentally modifying the default list, which is generally safer for exported constants.
```suggestion
export const DEFAULT_CHROMA_BM25_STOPWORDS: readonly string[] = DEFAULT_ENGLISH_STOPWORDS;
```
This follows TypeScript best practices for immutable exports. The `readonly` modifier ensures that consumers cannot use mutating methods like `push()`, `pop()`, or direct index assignment, preventing accidental modification of shared constants across different parts of an application.
⚡ **Committable suggestion**
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
File: clients/new-js/packages/ai-embeddings/chroma-bm25/src/index.ts
Line: 198048d5a9 to
ffc5524
Compare
| case 3: | ||
| k1 ^= (key.charCodeAt(i + 2) & 0xff) << 16; | ||
| case 2: | ||
| k1 ^= (key.charCodeAt(i + 1) & 0xff) << 8; | ||
| case 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
This switch statement uses intentional fall-throughs, which is correct for the Murmur3 algorithm. To improve readability and prevent future maintainers from mistakenly "fixing" this, it's good practice to make the fall-throughs explicit with comments.
Suggested Change
| case 3: | |
| k1 ^= (key.charCodeAt(i + 2) & 0xff) << 16; | |
| case 2: | |
| k1 ^= (key.charCodeAt(i + 1) & 0xff) << 8; | |
| case 1: | |
| case 3: | |
| k1 ^= (key.charCodeAt(i + 2) & 0xff) << 16; | |
| // falls through | |
| case 2: | |
| k1 ^= (key.charCodeAt(i + 1) & 0xff) << 8; | |
| // falls through | |
| case 1: |
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
Context for Agents
[**BestPractice**]
This `switch` statement uses intentional fall-throughs, which is correct for the Murmur3 algorithm. To improve readability and prevent future maintainers from mistakenly "fixing" this, it's good practice to make the fall-throughs explicit with comments.
<details>
<summary>Suggested Change</summary>
```suggestion
case 3:
k1 ^= (key.charCodeAt(i + 2) & 0xff) << 16;
// falls through
case 2:
k1 ^= (key.charCodeAt(i + 1) & 0xff) << 8;
// falls through
case 1:
```
⚡ **Committable suggestion**
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
</details>
File: clients/new-js/packages/ai-embeddings/chroma-bm25/src/index.ts
Line: 77
Description of changes
Summarize the changes made by this PR.
Test plan
How are these changes tested?
tested manually + added unit tests
pytestfor python,yarn testfor js,cargo testfor rustMigration plan
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?
Observability plan
What is the plan to instrument and monitor this change?
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?