-
Notifications
You must be signed in to change notification settings - Fork 1.4k
chore(wren-ai-service): improve displayName for indexing #1931
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughIntroduces clean_display_name in indexing, replacing raw displayName usage with sanitized alias generation across indexing and generation paths. Updates embedded schema comments to reference alias. Adds comprehensive unit tests for the sanitizer. Public indexing API expanded to export additional symbols. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Source as Source Metadata
participant Indexing as Indexing Utils
participant Clean as clean_display_name
participant Store as Schema/Comments
participant Gen as Generation Pipeline
Source->>Indexing: properties.displayName
Indexing->>Clean: sanitize(displayName)
Clean-->>Indexing: alias
Indexing->>Store: write alias (sanitized)
Gen->>Store: read model/column metadata
Store-->>Gen: alias
Note right of Gen: Use alias in semantics and SQL descriptions
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (10)
wren-ai-service/src/pipelines/generation/utils/sql.py (1)
424-431: Fix sample table name mismatch (“users” vs “user”).In the example, the DDL shows table "users" but the SELECT uses "user". Align to "users" for consistency.
Apply this diff in the sample snippet:
- `SELECT LAX_STRING(JSON_QUERY(u.address, '$.city')) FROM user as u` + `SELECT LAX_STRING(JSON_QUERY(u.address, '$.city')) FROM users AS u`wren-ai-service/src/pipelines/indexing/__init__.py (3)
91-237: Sanitizer: also normalize whitespace to underscores.Currently spaces/tabs/newlines pass through and can produce invalid SQL aliases. Normalize all whitespace to “_” before underscore collapsing.
Apply this diff near the end of the function:
- cleaned = "".join(result) - - # Collapse multiple consecutive underscores - cleaned = re.sub(r"_+", "_", cleaned) + cleaned = "".join(result) + # Replace any whitespace (space, tab, newline, etc.) with underscores + cleaned = re.sub(r"\s+", "_", cleaned) + # Collapse multiple consecutive underscores + cleaned = re.sub(r"_+", "_", cleaned)
248-255: Export clean_display_name in all (public API clarity).Other modules import it from the package; adding to all makes the intent explicit and matches the PR summary.
Apply this diff:
__all__ = [ "DBSchema", "TableDescription", "HistoricalQuestion", "SqlPairs", "Instructions", "ProjectMeta", + "clean_display_name", ]
91-237: Minor: handle None defensively (optional).If ever called with None, this returns None despite the str return type. Safe-guard to return "" instead.
Apply this diff at the beginning:
-def clean_display_name(display_name: str) -> str: - if not display_name: - return display_name +def clean_display_name(display_name: str) -> str: + if not display_name: + return "" if display_name is None else display_namewren-ai-service/src/pipelines/indexing/db_schema.py (1)
136-139: Emit JSON for table comment to match column comments.Columns use JSON via orjson; models use Python dict repr. Standardize to JSON for easier parsing.
Apply this diff:
- comment = f"\n/* {str(model_properties)} */\n" + comment = f"\n/* {orjson.dumps(model_properties).decode('utf-8')} */\n"And add the missing import at the top of this file:
import orjsonwren-ai-service/tests/pytest/test_utils.py (5)
90-92: Comment mentions None but no None case is asserted.Either add a test for
Noneor update the comment to only mention the empty string. Given prod code passesprops.get("displayName", ""), testingNonemight be unnecessary—your call.Apply this tiny cleanup if you prefer to adjust the comment:
- # Test empty and None cases + # Test empty case
127-153: Add missing middle-char coverage for '@' and '$'.
@and$are listed in middle-invalid but not exercised here. Add two quick assertions.Apply:
assert clean_display_name("na?me") == "na_me" + assert clean_display_name("na@me") == "na_me" + assert clean_display_name("na$me") == "na_me" assert clean_display_name("na[me") == "na_me"
199-204: Remove duplicate assertion.
"!@#$%^&*()"is asserted twice for the same expectation.Apply:
- # Test underscore collapsing in complex scenarios - result = clean_display_name("!@#$%^&*()") - assert result == "_" # All get replaced, then collapsed
205-212: Decide and test whitespace and non-ASCII digit behavior.
- Whitespace: Should spaces be preserved, underscored, or trimmed?
- Non-ASCII digits (e.g., Arabic-Indic): Should leading digits of any locale trigger the leading underscore?
If aligning with “letters/underscore first, then letters/digits/underscores”:
+ # Whitespace handling (choose desired behavior) + # assert clean_display_name(" user name ") == "user name" # preserve spaces + # assert clean_display_name(" user name ") == "user_name" # convert to underscore + # assert clean_display_name(" user name ") == "user name".strip() # trim only + + # Non-ASCII digits at prefix (locale-agnostic digit handling) + assert clean_display_name("١٢٣name") == "_١٢٣name"I can follow up with a sanitizer tweak if you confirm the desired semantics.
89-212: Consider parameterizing to reduce verbosity and ease maintenance.Turn the many literals into a table-driven test for quicker additions and clearer failures.
Here’s a compact pattern you can drop in (in addition to or replacing the current function):
+@pytest.mark.parametrize( + "raw,expected", + [ + ("", ""), + ("valid_name", "valid_name"), + ("ValidName", "ValidName"), + ("valid123", "valid123"), + ("123name", "_123name"), + ("-name", "_name"), + ("na-me", "na_me"), + ("na..me", "na_me"), + ("name-", "name_"), + ("1", "_"), + (".", "_"), + ("a", "a"), + ("123-test.name@", "_123_test_name_"), + (".table.name.", "_table_name_"), + ("!@#$%^&*()", "_"), + ("user.email", "user_email"), + ("order-total", "order_total"), + ("2023_sales", "_2023_sales"), + ("product_name!", "product_name_"), + ], +) +def test_clean_display_name_param(raw, expected): + assert clean_display_name(raw) == expected
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
wren-ai-service/src/pipelines/generation/semantics_description.py(3 hunks)wren-ai-service/src/pipelines/generation/utils/sql.py(3 hunks)wren-ai-service/src/pipelines/indexing/__init__.py(2 hunks)wren-ai-service/src/pipelines/indexing/db_schema.py(2 hunks)wren-ai-service/src/pipelines/indexing/utils/helper.py(2 hunks)wren-ai-service/tests/pytest/test_utils.py(2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-20T02:37:21.292Z
Learnt from: cyyeh
PR: Canner/WrenAI#1763
File: wren-ai-service/src/pipelines/generation/semantics_description.py:31-33
Timestamp: 2025-06-20T02:37:21.292Z
Learning: In the wren-ai-service codebase, when adding new fields like "alias" to the output of functions that use Pydantic models for validation, the user prefers not to update the corresponding Pydantic model definitions to include these new fields.
Applied to files:
wren-ai-service/src/pipelines/generation/semantics_description.pywren-ai-service/src/pipelines/generation/utils/sql.py
🧬 Code graph analysis (4)
wren-ai-service/src/pipelines/indexing/db_schema.py (1)
wren-ai-service/src/pipelines/indexing/__init__.py (1)
clean_display_name(91-237)
wren-ai-service/src/pipelines/generation/semantics_description.py (1)
wren-ai-service/src/pipelines/indexing/__init__.py (1)
clean_display_name(91-237)
wren-ai-service/src/pipelines/indexing/utils/helper.py (1)
wren-ai-service/src/pipelines/indexing/__init__.py (1)
clean_display_name(91-237)
wren-ai-service/tests/pytest/test_utils.py (1)
wren-ai-service/src/pipelines/indexing/__init__.py (1)
clean_display_name(91-237)
🔇 Additional comments (12)
wren-ai-service/src/pipelines/generation/utils/sql.py (2)
234-242: Examples updated to alias-based notation look good.The alias demonstration aligns with the new sanitizer and the “no dot in alias” rule.
438-445: JSON array example aligns with alias usage.The json_fields now reference aliases consistently. No functional issues spotted.
wren-ai-service/src/pipelines/indexing/__init__.py (1)
248-255: PR summary vs. code mismatch: all not updated.The summary claims the public API was expanded; the code doesn’t include clean_display_name in all. The above patch resolves it.
wren-ai-service/src/pipelines/indexing/utils/helper.py (2)
10-10: Importing clean_display_name here is correct.Keeps alias logic centralized and avoids duplication.
34-36: Column alias now sanitized — good.This makes column comment metadata consistent with model-level aliasing.
wren-ai-service/src/pipelines/indexing/db_schema.py (2)
18-23: Consolidated import incl. clean_display_name — good.Clearer and prepares for consistent aliasing.
16-24: All rawdisplayNameusages are sanitized
Verified that everydisplayNamereference inpipelines/indexingandpipelines/generationis either passed throughclean_display_nameor removed; no further changes required.wren-ai-service/src/pipelines/generation/semantics_description.py (2)
15-16: Importing clean_display_name here is appropriate.Keeps alias derivation consistent with indexing.
106-109: Aliasing added to prompt payload only—safe with existing Pydantic models. No downstream schema changes required; alias fields are present as intended in the prompt payload.wren-ai-service/tests/pytest/test_utils.py (3)
11-11: Import looks correct and aligned with the new public API.
clean_display_nameis imported fromsrc.pipelines.indexingas intended.
87-88: No-op change.Blank line change; nothing to review.
188-195: Clarify underscore collapsing intent for user-authored underscores.The implementation collapses all consecutive underscores, including user-provided ones. If that's desired, add a test like below; if not, we should tweak the sanitizer.
Option A (collapse all):
assert ( clean_display_name("na..me") == "na_me" ) # dots become underscores, then collapsed + assert clean_display_name("name__test") == "name_test"Option B (preserve user underscores): keep tests as-is and we’ll adjust the sanitizer accordingly on the prod side.
Summary by CodeRabbit