Conversation

@cyyeh cyyeh commented Apr 2, 2025

Summary by CodeRabbit

  • New Features

    • Enhanced table retrieval now includes a scoring metric, providing more detailed insights.
    • Ask responses have been updated to display tables with accompanying scores for improved clarity.
  • Refactor

    • Streamlined data processing across the retrieval and ask workflows to support the enhanced table information.

coderabbitai bot commented Apr 2, 2025

Walkthrough

The changes update the data structures within the retrieval pipeline and ask service. In the retrieval module, function signatures are revised so that dbschema_retrieval now returns a dictionary containing both documents and table_scores instead of a simple list. This necessitates updates in functions that process these outputs. Similarly, the ask service modifications change the _AskResultResponse and ask method to handle a list of dictionaries (with table names and scores) rather than a list of strings.

Changes

File(s) — Change Summary

  • wren-ai-service/src/pipelines/retrieval/retrieval.py — Modified function signatures:
    - dbschema_retrieval now returns a dict with keys documents and table_scores instead of a list of Document objects.
    - construct_db_schemas, check_using_db_schemas_without_pruning, and construct_retrieval_results updated to handle the new dict structure.
  • wren-ai-service/src/web/v1/services/ask.py — Updated _AskResultResponse class and ask method:
    - Changed retrieved_tables from an optional list of strings to an optional list of dictionaries containing table names and associated scores.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant AskService
    participant RetrievalPipeline

    Client->>AskService: Submit query
    AskService->>RetrievalPipeline: Call dbschema_retrieval()
    RetrievalPipeline-->>AskService: Return {documents, table_scores}
    AskService->>RetrievalPipeline: Call construct_db_schemas() & check_using_db_schemas_without_pruning()
    RetrievalPipeline-->>AskService: Return constructed schemas with scores
    AskService->>Client: Return AskResultResponse with retrieved_tables [{name, score}, ...]
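The inter-stage contract described above can be sketched as a minimal Python illustration (the function name mirrors the PR's description, but this is a simplified stand-in; the real pipeline functions take different parameters):

```python
from typing import Any, Dict, List


def dbschema_retrieval_shape(tables: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Sketch of the new return shape: retrieved documents plus per-table scores."""
    documents = [t["document"] for t in tables]
    # map each table name to its retrieval relevancy score
    table_scores = {t["name"]: t["score"] for t in tables}
    return {"documents": documents, "table_scores": table_scores}


result = dbschema_retrieval_shape(
    [{"name": "orders", "score": 0.92, "document": "CREATE TABLE orders ..."}]
)
```

Downstream stages then read `result["documents"]` for schema construction and `result["table_scores"]` to attach a score to each table in the final response.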

Poem

In the garden of code, I hop with glee,
Tables now score with structured decree,
Functions reform to a rhythm so neat,
Keys and dicts dance in every beat,
A rabbit’s delight in changes so sweet! 🐰✨

@cyyeh cyyeh added module/ai-service ai-service related ci/ai-service ai-service related labels Apr 2, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (9)
wren-ai-service/src/web/v1/services/ask.py (2)

97-97: Consider using more specific type annotations for retrieved_tables.

The change from Optional[List[str]] to Optional[List[dict]] is good, but using a more specific type annotation like Optional[List[Dict[str, Union[str, float]]]] or a TypedDict would better document the expected structure and improve IDE completion.

-    retrieved_tables: Optional[List[dict]] = None
+    retrieved_tables: Optional[List[Dict[str, Union[str, float]]]] = None
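A TypedDict spelling of this suggestion might look like the following sketch (RetrievedTable is a hypothetical name, not a type the PR introduces):

```python
from typing import List, Optional, TypedDict


class RetrievedTable(TypedDict):
    """Hypothetical structured type for one retrieved table."""

    name: str
    score: float


# Hypothetical field declaration mirroring _AskResultResponse.retrieved_tables
retrieved_tables: Optional[List[RetrievedTable]] = [
    {"name": "orders", "score": 0.92}
]
```

Static checkers and IDEs can then flag missing or misspelled keys, which a bare List[dict] annotation cannot.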

368-374: Add defensive coding to handle potential missing fields.

The dictionary construction assumes that both 'table_name' and 'table_score' exist in each document. Consider using .get() with default values to handle cases where these fields might be missing.

retrieved_tables = [
    {
-        "name": document.get("table_name"),
-        "score": document.get("table_score"),
+        "name": document.get("table_name", "unknown"),
+        "score": document.get("table_score", 0.0),
    }
    for document in documents
]
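With the suggested defaults in place, a document missing its score degrades gracefully instead of raising. A small standalone check (the sample data is illustrative, not taken from the PR):

```python
documents = [
    {"table_name": "orders", "table_score": 0.92},
    {"table_name": "customers"},  # table_score missing
]

# .get() with defaults keeps the comprehension total even for partial records
retrieved_tables = [
    {
        "name": document.get("table_name", "unknown"),
        "score": document.get("table_score", 0.0),
    }
    for document in documents
]
```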
wren-ai-service/src/pipelines/retrieval/retrieval.py (7)

168-169: Improve return type annotation with more specific types.

Using a generic dict return type loses information about the expected structure. Consider using a more specific type hint like Dict[str, Union[List[Document], Dict[str, float]]] to document the expected return structure.

-) -> dict:
+) -> Dict[str, Union[List[Document], Dict[str, float]]]:

171-176: Add error handling for table name retrieval.

The code assumes that 'name' is a key in the content dictionary. Consider adding error handling to gracefully handle cases where this key might be missing.

# assign score to each table
table_scores = {}
for table in tables:
    content = ast.literal_eval(table.content)
-    table_names.append(content["name"])
-    table_scores[content["name"]] = table.score
+    table_name = content.get("name")
+    if table_name:
+        table_names.append(table_name)
+        table_scores[table_name] = table.score

229-232: Maintain consistent key naming between functions.

While dbschema_retrieval returns a dict with "documents" key, construct_db_schemas returns a dict with "db_schemas" key. Consider standardizing the naming convention across functions for better consistency.

return {
-    "db_schemas": list(db_schemas.values()),
+    "documents": list(db_schemas.values()),  # Or rename the key in dbschema_retrieval to "db_schemas"
    "table_scores": dbschema_retrieval["table_scores"],
}

253-256: Prevent potential KeyError with defensive dictionary access.

Direct dictionary access with construct_db_schemas["table_scores"][table_schema["name"]] will throw a KeyError if the key doesn't exist. Consider using .get() with a default value.

-                    "table_score": construct_db_schemas["table_scores"][
-                        table_schema["name"]
-                    ],
+                    "table_score": construct_db_schemas["table_scores"].get(
+                        table_schema["name"], 0.0
+                    ),

268-271: Apply consistent defensive dictionary access pattern.

Similar to the previous comment, use .get() with a default value to prevent KeyErrors when accessing table scores.

Apply this pattern to both instances:

-                    "table_score": construct_db_schemas["table_scores"][
-                        content["name"]
-                    ],
+                    "table_score": construct_db_schemas["table_scores"].get(
+                        content["name"], 0.0
+                    ),

Also applies to: 279-282


375-380: Missing table_score in retrieval_results dictionaries.

When building the retrieval_results list in the column filtering section, you're not including the table_score field which is included elsewhere. Consider adding this for consistency.

retrieval_results.append(
    {
        "table_name": table_schema["name"],
        "table_ddl": ddl,
+        "table_score": dbschema_retrieval["table_scores"].get(table_schema["name"], 0.0),
    }
)

391-394: Use consistent defensive coding pattern for table scores.

Apply the same defensive dictionary access pattern for table scores here as suggested earlier.

-                            "table_score": dbschema_retrieval["table_scores"][
-                                content["name"]
-                            ],
+                            "table_score": dbschema_retrieval["table_scores"].get(
+                                content["name"], 0.0
+                            ),

Also applies to: 402-405

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 00eb9d4 and b82537b.

📒 Files selected for processing (2)
  • wren-ai-service/src/pipelines/retrieval/retrieval.py (11 hunks)
  • wren-ai-service/src/web/v1/services/ask.py (9 hunks)
🔇 Additional comments (1)
wren-ai-service/src/pipelines/retrieval/retrieval.py (1)

198-201: LGTM! The return structure clearly separates documents and scores.

The enhanced return structure provides better organization by separating the documents list from the table scores dictionary.

@cyyeh cyyeh force-pushed the chore/ai-service/improve-retrieval-tables branch from b82537b to 8d2aa83 Compare April 7, 2025 05:30

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b82537b and 8d2aa83.

📒 Files selected for processing (2)
  • wren-ai-service/src/pipelines/retrieval/retrieval.py (11 hunks)
  • wren-ai-service/src/web/v1/services/ask.py (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • wren-ai-service/src/web/v1/services/ask.py
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: pytest
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (11)
wren-ai-service/src/pipelines/retrieval/retrieval.py (11)

167-177: Scores now captured for each table - good addition!

You've updated dbschema_retrieval to track table relevancy scores in a dictionary, which is a sensible enhancement. This allows consumers of this function to know not just which tables were retrieved, but also how relevant each one is according to the retrieval algorithm.


198-201: Return structure change looks good

The function now returns both the documents and the table scores in a single dictionary, a better-structured approach than the previous bare list. Callers that previously consumed the list directly now access it under the documents key.


205-208: Type signature and access pattern correctly updated

The updated function signature reflects the new dictionary return type from dbschema_retrieval, and the document access now uses the correct key in the input dictionary.


229-232: Good return structure improvement

The function now returns both the DB schemas and the table scores in a well-structured dictionary, properly passing along the relevancy information for downstream use.


236-239: Function signature properly updated

The parameter types have been correctly updated to accept dictionaries instead of lists, aligning with the upstream changes.


246-256: Table scores now included in retrieval results

You've correctly included the table scores in the retrieval results, accessing them from the table_scores dictionary using the table name as the key. This ensures that score information is preserved in the pipeline.


260-282: Table scores added to metrics and views

Similar to regular tables, you've properly included table scores for metrics and views in the retrieval results. The implementation is consistent across all table types.


308-320: Prompt function adjusted for new data structure

The prompt function has been correctly updated to work with the new dictionary-based DB schema structure, maintaining functionality while supporting the new data model.


343-348: Function signature properly updated in construct_retrieval_results

The parameter types have been correctly updated to match the changes in upstream functions.


365-365: DB schemas access pattern correctly updated

The function now accesses the database schemas through the db_schemas key in the input dictionary, correctly adapting to the new structure.


382-405: Table scores added to retrieval results for filtered tables

You've properly included table scores in the retrieval results when filtering by columns, using a consistent implementation approach. This ensures score information is preserved even when table filtering occurs.

Comment on lines +391 to +393
"table_score": dbschema_retrieval["table_scores"][
content["name"]
],

💡 Verification agent

🧩 Analysis chain

Verify potential KeyError issue

The code accesses table scores using content["name"] as the key. Ensure that all table names in these records exist in the table_scores dictionary to avoid potential KeyError exceptions.

Also applies to: 402-404


🏁 Script executed:

#!/bin/bash

# Check for other places where table_scores is accessed to see if there's error handling
rg "table_scores\[" -A 2 -B 2

Length of output: 473


Action Required: Add Key Existence Check for table_scores Access

At both lines 391–393 and 402–404 of wren-ai-service/src/pipelines/retrieval/retrieval.py, the code accesses table_scores with content["name"] directly, without checking that the key exists. Because entries are inserted into the dictionary based solely on content["name"] elsewhere in the file, a KeyError will be raised if any table name is absent. Consider handling missing keys explicitly, e.g. by using .get() with a fallback or wrapping the access in a try/except block, to ensure robustness.

@cyyeh cyyeh closed this Apr 9, 2025