-
Notifications
You must be signed in to change notification settings - Fork 1.5k
[DON'T MERGE]: chore(wren-ai-service): add name, score to retrieved tables #1510
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThe changes update the data structures within the retrieval pipeline and ask service. In the retrieval module, function signatures are revised so that Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant AskService
participant RetrievalPipeline
Client->>AskService: Submit query
AskService->>RetrievalPipeline: Call dbschema_retrieval()
RetrievalPipeline-->>AskService: Return {documents, table_scores}
AskService->>RetrievalPipeline: Call construct_db_schemas() & check_using_db_schemas_without_pruning()
RetrievalPipeline-->>AskService: Return constructed schemas with scores
AskService->>Client: Return AskResultResponse with retrieved_tables [{name, score}, ...]
Poem
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (9)
wren-ai-service/src/web/v1/services/ask.py (2)
97-97: Consider using more specific type annotations for retrieved_tables.The change from
Optional[List[str]]toOptional[List[dict]]is good, but using a more specific type annotation likeOptional[List[Dict[str, Union[str, float]]]]or a TypedDict would better document the expected structure and improve IDE completion.- retrieved_tables: Optional[List[dict]] = None + retrieved_tables: Optional[List[Dict[str, Union[str, float]]]] = None
368-374: Add defensive coding to handle potential missing fields.The dictionary construction assumes that both 'table_name' and 'table_score' exist in each document. Consider using .get() with default values to handle cases where these fields might be missing.
retrieved_tables = [ { - "name": document.get("table_name"), - "score": document.get("table_score"), + "name": document.get("table_name", "unknown"), + "score": document.get("table_score", 0.0), } for document in documents ]wren-ai-service/src/pipelines/retrieval/retrieval.py (7)
168-169: Improve return type annotation with more specific types.Using a generic
dictreturn type loses information about the expected structure. Consider using a more specific type hint likeDict[str, Union[List[Document], Dict[str, float]]]to document the expected return structure.-) -> dict: +) -> Dict[str, Union[List[Document], Dict[str, float]]]:
171-176: Add error handling for table name retrieval.The code assumes that 'name' is a key in the content dictionary. Consider adding error handling to gracefully handle cases where this key might be missing.
# assign score to each table table_scores = {} for table in tables: content = ast.literal_eval(table.content) - table_names.append(content["name"]) - table_scores[content["name"]] = table.score + table_name = content.get("name") + if table_name: + table_names.append(table_name) + table_scores[table_name] = table.score
229-232: Maintain consistent key naming between functions.While
dbschema_retrievalreturns a dict with "documents" key,construct_db_schemasreturns a dict with "db_schemas" key. Consider standardizing the naming convention across functions for better consistency.return { - "db_schemas": list(db_schemas.values()), + "documents": list(db_schemas.values()), # Or rename the key in dbschema_retrieval to "db_schemas" "table_scores": dbschema_retrieval["table_scores"], }
253-256: Prevent potential KeyError with defensive dictionary access.Direct dictionary access with
construct_db_schemas["table_scores"][table_schema["name"]]will throw a KeyError if the key doesn't exist. Consider using .get() with a default value.- "table_score": construct_db_schemas["table_scores"][ - table_schema["name"] - ], + "table_score": construct_db_schemas["table_scores"].get( + table_schema["name"], 0.0 + ),
268-271: Apply consistent defensive dictionary access pattern.Similar to the previous comment, use .get() with a default value to prevent KeyErrors when accessing table scores.
Apply this pattern to both instances:
- "table_score": construct_db_schemas["table_scores"][ - content["name"] - ], + "table_score": construct_db_schemas["table_scores"].get( + content["name"], 0.0 + ),Also applies to: 279-282
375-380: Missing table_score in retrieval_results dictionaries.When building the retrieval_results list in the column filtering section, you're not including the table_score field which is included elsewhere. Consider adding this for consistency.
retrieval_results.append( { "table_name": table_schema["name"], "table_ddl": ddl, + "table_score": dbschema_retrieval["table_scores"].get(table_schema["name"], 0.0), } )
391-394: Use consistent defensive coding pattern for table scores.Apply the same defensive dictionary access pattern for table scores here as suggested earlier.
- "table_score": dbschema_retrieval["table_scores"][ - content["name"] - ], + "table_score": dbschema_retrieval["table_scores"].get( + content["name"], 0.0 + ),Also applies to: 402-405
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
wren-ai-service/src/pipelines/retrieval/retrieval.py(11 hunks)wren-ai-service/src/web/v1/services/ask.py(9 hunks)
🔇 Additional comments (1)
wren-ai-service/src/pipelines/retrieval/retrieval.py (1)
198-201: LGTM! The return structure clearly separates documents and scores.The enhanced return structure provides better organization by separating the documents list from the table scores dictionary.
b82537b to
8d2aa83
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
wren-ai-service/src/pipelines/retrieval/retrieval.py(11 hunks)wren-ai-service/src/web/v1/services/ask.py(9 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- wren-ai-service/src/web/v1/services/ask.py
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: pytest
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (go)
🔇 Additional comments (11)
wren-ai-service/src/pipelines/retrieval/retrieval.py (11)
167-177: Scores now captured for each table - good addition!You've updated
dbschema_retrievalto track table relevancy scores in a dictionary, which is a sensible enhancement. This allows consumers of this function to know not just which tables were retrieved, but also how relevant each one is according to the retrieval algorithm.
198-201: Return structure change looks goodThe function now returns both the documents and the table scores in a dictionary, which is a better structured approach than the previous return type. This change correctly supports the new functionality while maintaining backward compatibility by preserving the document access pattern.
205-208: Type signature and access pattern correctly updatedThe updated function signature reflects the new dictionary return type from
dbschema_retrieval, and the document access now uses the correct key in the input dictionary.
229-232: Good return structure improvementThe function now returns both the DB schemas and the table scores in a well-structured dictionary, properly passing along the relevancy information for downstream use.
236-239: Function signature properly updatedThe parameter types have been correctly updated to accept dictionaries instead of lists, aligning with the upstream changes.
246-256: Table scores now included in retrieval resultsYou've correctly included the table scores in the retrieval results, accessing them from the
table_scoresdictionary using the table name as the key. This ensures that score information is preserved in the pipeline.
260-282: Table scores added to metrics and viewsSimilar to regular tables, you've properly included table scores for metrics and views in the retrieval results. The implementation is consistent across all table types.
308-320: Prompt function adjusted for new data structureThe prompt function has been correctly updated to work with the new dictionary-based DB schema structure, maintaining functionality while supporting the new data model.
343-348: Function signature properly updated in construct_retrieval_resultsThe parameter types have been correctly updated to match the changes in upstream functions.
365-365: DB schemas access pattern correctly updatedThe function now accesses the database schemas through the
db_schemaskey in the input dictionary, correctly adapting to the new structure.
382-405: Table scores added to retrieval results for filtered tablesYou've properly included table scores in the retrieval results when filtering by columns, using a consistent implementation approach. This ensures score information is preserved even when table filtering occurs.
| "table_score": dbschema_retrieval["table_scores"][ | ||
| content["name"] | ||
| ], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Verify potential KeyError issue
The code accesses table scores using content["name"] as the key. Ensure that all table names in these records exist in the table_scores dictionary to avoid potential KeyError exceptions.
Also applies to: 402-404
🏁 Script executed:
#!/bin/bash
# Check for other places where table_scores is accessed to see if there's error handling
rg "table_scores\[" -A 2 -B 2Length of output: 473
Action Required: Add Key Existence Check for table_scores Access
It appears that both at lines 391–393 and 402–404 of wren-ai-service/src/pipelines/retrieval/retrieval.py, the code directly accesses table_scores using content["name"] without any check. Since we found that the key is inserted based solely on content["name"] in another part of the file, there's a risk of a KeyError if any table name is missing from this dictionary. Please consider adding an explicit check or handling for missing keys (e.g., using .get() with a fallback or wrapping the access in a try/except block) to ensure robustness.
Summary by CodeRabbit
New Features
Refactor