-
Notifications
You must be signed in to change notification settings - Fork 1.5k
chore(wren-ai-service): improve text2sql process #1070
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThe pull request modifies the Changes
Sequence DiagramsequenceDiagram
participant User
participant AskService
participant HistoricalQuestionProcessor
participant IntentClassifier
User->>AskService: Submit query
AskService->>HistoricalQuestionProcessor: Check historical questions
alt Historical questions found
HistoricalQuestionProcessor-->>AskService: Generate SQL results
else No historical questions
AskService->>IntentClassifier: Perform intent classification
IntentClassifier-->>AskService: Classify intent and generate results
end
AskService-->>User: Return query response
Possibly related PRs
Suggested reviewers
Poem
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (6)
wren-ai-service/src/web/v1/services/ask.py (6)
145-145: Consider clarifying the purpose ofapi_results.
While initializingapi_resultsto an empty list is straightforward, it might be beneficial to rename it to something more specific (e.g.,historical_or_generated_results) to reflect how it may hold data from different pipelines.
160-163: Clarify slicing to a single document.
Slicing the first result ([:1]) works fine, but if only one document is needed, you could consider returning either a single item orNonefor clarity.
206-225: Consider capturing results fromdata_assistance.
Currently,data_assistanceis run in a separate task, and its response is never stored. If real-time or post-processing usage is required, you might want to capture the results or recheck the pipeline outcome.
273-283: DRY up the followup vs. direct SQL generation.
The logic branches into two pipelines,"followup_sql_generation"vs."sql_generation". If they share much of the same core logic, consider introducing a helper function to reduce duplication.
335-346: Finalize whenapi_resultsis present.
This completion logic properly updates the ask status and returns the final results. For readability, consider centralizing success/failure outcome setting in a single helper method.
347-360:logger.exceptionused for a non-exception scenario.
Raising a structured warning or info log (e.g.,logger.warning) might be more appropriate when indicating no relevant SQL rather than including a stack trace.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
wren-ai-service/src/web/v1/services/ask.py(3 hunks)
🔇 Additional comments (12)
wren-ai-service/src/web/v1/services/ask.py (12)
155-157: Validate pipeline availability.
Consider verifying that "historical_question" is a valid key in self._pipelines before attempting to run it, to avoid potential KeyError.
165-176: Check for missing keys in pipeline results.
The code assumes "statement" and "viewId" exist in each result. Consider handling potential missing keys or invalid structures from upstream pipeline data.
177-188: Account for partially populated classification results.
If intent_classification_result lacks expected fields like "intent", "rephrased_question", or "reasoning", this code could cause unexpected behaviors. Ensure you handle missing or malformed data gracefully.
190-193: Edge case with empty rephrased_question.
if not rephrased_question: also evaluates to False if rephrased_question is an empty string. Verify that this logic covers all intended scenarios.
196-205: Misleading query logic is clear.
The flow for handling a "MISLEADING_QUERY" intent looks consistent, sets the proper status, and returns.
227-233: Transition to TEXT_TO_SQL is handled well.
Setting up the response status and type for TEXT_TO_SQL is logically consistent.
265-265: Status progression to generating.
The code correctly updates the status to generating only if no results have been populated yet. No issues spotted.
284-292: Similar branching logic.
This block largely mirrors the logic in lines 273-283, so centralizing the pipeline decision in one place could make the code more maintainable.
294-305: Valid generation results successfully integrated.
Assigning valid SQL results to api_results is straightforward and well-structured.
306-310: Fallback to SQL correction looks good.
Switching to SQL correction if we detect failed dry-run results is a logical flow to ensure query accuracy.
311-320: Tracking status as correcting.
Setting the status to correcting prior to invoking the correction pipeline allows clear progress monitoring.
322-332: Integrating valid generation results post-correction.
Capturing the first valid corrected SQL and storing it in api_results aligns well with the pipeline design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
wren-ai-service/src/web/v1/services/ask.py (5)
165-176: BuildingAskResultfrom historical question results
The list comprehension is neat. Ensuring the fieldsstatementandviewIdexist inresultis crucial; otherwise, consider default values or error handling if they might be absent.
190-193: Consider inlininguser_query
The inline conditional is readable. As a minor nitpick, you could inlineuser_queryusage wherever needed, but this approach is perfectly fine if you want to keep logic clear.
218-233: Fallback for unexpected intents
Thiselsebranch covers the "TEXT_TO_SQL" flow. If the pipeline returns an unknown or unsupported intent, the code also falls here. You might consider explicitly handling unexpected intents or adding a defaultlogging.warningin case the pipeline yields unexpected values.
265-320: Consider refactoring to reduce complexity
The logic here (generating SQL or handling invalid generation through corrections) is comprehensive but significantly increases method length. Consider extracting the text-to-SQL generation and correction logic into helper functions to improve readability and reusability.
322-360: Uselogger.errorinstead oflogger.exceptionfor logical errors
Line 347, for example, useslogger.exceptionoutside of an actualexceptblock. If the situation is not triggered by an exception, uselogger.errororlogger.warning. Reservelogger.exceptionfor cases within an actual exception handler to preserve stack trace usage.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
wren-ai-service/src/web/v1/services/ask.py(3 hunks)
🔇 Additional comments (5)
wren-ai-service/src/web/v1/services/ask.py (5)
145-145: Initialize api_results earlier for clarity
This line introduces the api_results list. It's a straightforward initialization, but it's good practice to declare such tracking lists at the start of the method for clarity and maintainability.
155-157: Verify the pipeline key "historical_question"
Ensure that "historical_question" is a valid key in the self._pipelines dictionary and handle the possibility of missing or misnamed pipelines (e.g., with a try-except, fallback, or logging).
160-163: Slicing the historical question results
Slicing to the top result [:1] is a clear way to limit the returned documents. This approach is concise and maintains the code’s readability.
177-188: Intent classification result usage
This block correctly retrieves the post-processed classification data. The .get("post_process", {}) usage safely handles missing keys. Ensure that any unexpected or missing data triggers appropriate logging or error handling if needed.
196-217: Double-check concurrency and background task handling
Creating a background task via asyncio.create_task detaches its execution from the main flow. Make sure exceptions in the data_assistance pipeline won't silently fail. Consider adding a callback or logging to handle potential background errors.
Summary by CodeRabbit
New Features
Bug Fixes