Conversation

@cyyeh
Member

@cyyeh cyyeh commented Sep 2, 2025

Summary by CodeRabbit

  • New Features
    • Automatic sanitization of display names into valid aliases across generated schemas and descriptions, ensuring consistent, safe identifiers (handles invalid characters, numeric starts, collapsing repeats).
  • Documentation
    • Updated schema examples to use alias instead of displayName, clarifying expected metadata format.
  • Tests
    • Added comprehensive tests covering edge cases for alias sanitization to ensure stability and predictable transformations.

@coderabbitai
Contributor

coderabbitai bot commented Sep 2, 2025

Walkthrough

Introduces clean_display_name in indexing, replacing raw displayName usage with sanitized alias generation across indexing and generation paths. Updates embedded schema comments to reference alias. Adds comprehensive unit tests for the sanitizer. Public indexing API expanded to export additional symbols.
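The sanitization described in the walkthrough can be sketched roughly as follows. This is a minimal illustration only, not the PR's `clean_display_name`: the function name `sketch_alias` is hypothetical, and the real implementation may treat edge cases (for example single-character inputs or whitespace) differently.

```python
import re


def sketch_alias(display_name: str) -> str:
    """Illustrative sanitizer (hypothetical); NOT the PR's clean_display_name."""
    if not display_name:
        return display_name
    # Replace every character outside ASCII letters, digits, and underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", display_name)
    # An alias should not start with a digit, so prefix an underscore.
    if cleaned[0].isdigit():
        cleaned = "_" + cleaned
    # Collapse runs of consecutive underscores into a single one.
    return re.sub(r"_+", "_", cleaned)


for raw in ["order-total", "user.email", "2023_sales", "na..me"]:
    print(f"{raw!r} -> {sketch_alias(raw)!r}")
```

The three steps mirror the behaviors named in the summary: invalid characters, numeric starts, and collapsing repeats.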

Changes

| Cohort / File(s) | Summary |
|---|---|
| Alias sanitization integration: wren-ai-service/src/pipelines/generation/semantics_description.py, wren-ai-service/src/pipelines/indexing/db_schema.py, wren-ai-service/src/pipelines/indexing/utils/helper.py | Replace direct displayName usage with clean_display_name(...) for alias derivation in generation and indexing flows. |
| Indexing package API: wren-ai-service/src/pipelines/indexing/__init__.py | Add clean_display_name function; export SqlPairs, Instructions, ProjectMeta, TableDescription; update __all__; minor regex-based underscore collapsing. |
| Docs/examples in SQL utils: wren-ai-service/src/pipelines/generation/utils/sql.py | Update inline schema comments: switch displayName -> alias in examples, including nested json_fields. No code logic changes. |
| Tests: wren-ai-service/tests/pytest/test_utils.py | Add test_clean_display_name with extensive edge cases validating sanitization behavior. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Source as Source Metadata
  participant Indexing as Indexing Utils
  participant Clean as clean_display_name
  participant Store as Schema/Comments
  participant Gen as Generation Pipeline

  Source->>Indexing: properties.displayName
  Indexing->>Clean: sanitize(displayName)
  Clean-->>Indexing: alias
  Indexing->>Store: write alias (sanitized)
  Gen->>Store: read model/column metadata
  Store-->>Gen: alias
  Note right of Gen: Use alias in semantics and SQL descriptions

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • yichieh-lu

Poem

Hop hop! I sweep the names so clean,
From noisy symbols to aliases keen.
Underscores line up in tidy rows,
As tables smile with well-groomed bows.
Code burrows neat, tests nibble through—
Sanitized fields, fresh as morning dew. 🐇✨


@cyyeh cyyeh added the module/ai-service and ci/ai-service labels Sep 2, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (10)
wren-ai-service/src/pipelines/generation/utils/sql.py (1)

424-431: Fix sample table name mismatch (“users” vs “user”).

In the example, the DDL shows table "users" but the SELECT uses "user". Align to "users" for consistency.

Apply this diff in the sample snippet:

- `SELECT LAX_STRING(JSON_QUERY(u.address, '$.city')) FROM user as u`
+ `SELECT LAX_STRING(JSON_QUERY(u.address, '$.city')) FROM users AS u`
wren-ai-service/src/pipelines/indexing/__init__.py (3)

91-237: Sanitizer: also normalize whitespace to underscores.

Currently spaces/tabs/newlines pass through and can produce invalid SQL aliases. Normalize all whitespace to “_” before underscore collapsing.

Apply this diff near the end of the function:

-    cleaned = "".join(result)
-
-    # Collapse multiple consecutive underscores
-    cleaned = re.sub(r"_+", "_", cleaned)
+    cleaned = "".join(result)
+    # Replace any whitespace (space, tab, newline, etc.) with underscores
+    cleaned = re.sub(r"\s+", "_", cleaned)
+    # Collapse multiple consecutive underscores
+    cleaned = re.sub(r"_+", "_", cleaned)
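To see why the whitespace step belongs before the underscore collapse, here is a small sketch that isolates just those two substitutions. The helper names are hypothetical and the rest of the sanitizer is omitted.

```python
import re


def whitespace_then_collapse(cleaned: str) -> str:
    # Order suggested in the diff: whitespace first, then collapse.
    cleaned = re.sub(r"\s+", "_", cleaned)
    return re.sub(r"_+", "_", cleaned)


def collapse_then_whitespace(cleaned: str) -> str:
    # Reversed order, shown only for contrast.
    cleaned = re.sub(r"_+", "_", cleaned)
    return re.sub(r"\s+", "_", cleaned)


sample = "user _ name"
print(whitespace_then_collapse(sample))   # user_name
print(collapse_then_whitespace(sample))   # user___name
```

With the reversed order, underscores created from whitespace are never collapsed, so the alias can still contain runs of underscores.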

248-255: Export clean_display_name in __all__ (public API clarity).

Other modules import it from the package; adding it to __all__ makes the intent explicit and matches the PR summary.

Apply this diff:

 __all__ = [
     "DBSchema",
     "TableDescription",
     "HistoricalQuestion",
     "SqlPairs",
     "Instructions",
     "ProjectMeta",
+    "clean_display_name",
 ]
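For context on what this change affects: `__all__` only controls what `from package import *` exposes; explicit imports by name work either way. The sketch below demonstrates this with a throwaway in-memory module. The module name `demo_indexing` and its contents are hypothetical stand-ins, not the real package.

```python
import sys
import types

# A stand-in module whose __all__ omits clean_display_name.
source = '''
__all__ = ["DBSchema"]

class DBSchema:
    pass

def clean_display_name(name):
    return name
'''
demo = types.ModuleType("demo_indexing")
exec(source, demo.__dict__)
sys.modules["demo_indexing"] = demo

# Star-import honours __all__ and skips the unlisted function.
star_ns = {}
exec("from demo_indexing import *", star_ns)

# An explicit import by name is unaffected by __all__.
explicit_ns = {}
exec("from demo_indexing import clean_display_name", explicit_ns)

print("DBSchema" in star_ns, "clean_display_name" in star_ns)
print("clean_display_name" in explicit_ns)
```

So the suggested diff is about documenting intent and supporting star-imports and tooling, rather than fixing a broken import.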

91-237: Minor: handle None defensively (optional).

If the function is ever called with None, it returns None despite the str return type. Guard against this by returning "" instead.

Apply this diff at the beginning:

-def clean_display_name(display_name: str) -> str:
-    if not display_name:
-        return display_name
+def clean_display_name(display_name: str) -> str:
+    if not display_name:
+        return "" if display_name is None else display_name
wren-ai-service/src/pipelines/indexing/db_schema.py (1)

136-139: Emit JSON for table comment to match column comments.

Columns use JSON via orjson; models use Python dict repr. Standardize to JSON for easier parsing.

Apply this diff:

-            comment = f"\n/* {str(model_properties)} */\n"
+            comment = f"\n/* {orjson.dumps(model_properties).decode('utf-8')} */\n"

And add the missing import at the top of this file:

import orjson
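A short sketch of why the dict repr is harder to consume than JSON. It uses the standard library `json` module as a stand-in for `orjson` so that it runs without extra dependencies (note that `orjson.dumps` returns bytes, hence the `.decode('utf-8')` in the diff above). The property values and the `parse_comment` helper are hypothetical.

```python
import json

# Hypothetical model properties, for illustration only.
model_properties = {"alias": "order_total", "description": "Total per order"}

repr_comment = f"/* {str(model_properties)} */"
json_comment = f"/* {json.dumps(model_properties)} */"


def parse_comment(comment: str) -> dict:
    # Strip the leading "/* " and trailing " */" before parsing.
    return json.loads(comment[3:-3])


print(parse_comment(json_comment)["alias"])

try:
    parse_comment(repr_comment)
except json.JSONDecodeError:
    # The dict repr uses single quotes, which is not valid JSON.
    print("dict repr is not valid JSON")
```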
wren-ai-service/tests/pytest/test_utils.py (5)

90-92: Comment mentions None but no None case is asserted.

Either add a test for None or update the comment to only mention the empty string. Given prod code passes props.get("displayName", ""), testing None might be unnecessary—your call.

Apply this tiny cleanup if you prefer to adjust the comment:

-    # Test empty and None cases
+    # Test empty case

127-153: Add missing middle-char coverage for '@' and '$'.

@ and $ are listed in middle-invalid but not exercised here. Add two quick assertions.

Apply:

     assert clean_display_name("na?me") == "na_me"
+    assert clean_display_name("na@me") == "na_me"
+    assert clean_display_name("na$me") == "na_me"
     assert clean_display_name("na[me") == "na_me"

199-204: Remove duplicate assertion.

"!@#$%^&*()" is asserted twice for the same expectation.

Apply:

-    # Test underscore collapsing in complex scenarios
-    result = clean_display_name("!@#$%^&*()")
-    assert result == "_"  # All get replaced, then collapsed

205-212: Decide and test whitespace and non-ASCII digit behavior.

  • Whitespace: Should spaces be preserved, underscored, or trimmed?
  • Non-ASCII digits (e.g., Arabic-Indic): Should leading digits of any locale trigger the leading underscore?

If aligning with “letters/underscore first, then letters/digits/underscores”:

+    # Whitespace handling (choose desired behavior)
+    # assert clean_display_name(" user name ") == "user name"         # preserve spaces
+    # assert clean_display_name(" user name ") == "user_name"         # convert to underscore
+    # assert clean_display_name(" user name ") == "user name".strip() # trim only
+
+    # Non-ASCII digits at prefix (locale-agnostic digit handling)
+    assert clean_display_name("١٢٣name") == "_١٢٣name"

I can follow up with a sanitizer tweak if you confirm the desired semantics.
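The non-ASCII digit question matters because Python's built-in digit checks are Unicode-aware by default. The sketch below shows three ways a sanitizer might test the first character and how they disagree on Arabic-Indic digits. The helper names are hypothetical, and which check the PR's sanitizer uses is not shown in this review.

```python
import re

ASCII_DIGITS = "0123456789"


def starts_with_digit_str(name: str) -> bool:
    # str.isdigit() is true for Unicode digits, not just 0-9.
    return name[:1].isdigit()


def starts_with_digit_regex(name: str, ascii_only: bool = False) -> bool:
    # \d matches any Unicode decimal digit unless re.ASCII is set.
    flags = re.ASCII if ascii_only else 0
    return re.match(r"\d", name, flags) is not None


def starts_with_digit_ascii(name: str) -> bool:
    # Explicit membership test: only 0-9 count.
    return name[:1] != "" and name[0] in ASCII_DIGITS


sample = "١٢٣name"
print(starts_with_digit_str(sample))
print(starts_with_digit_regex(sample))
print(starts_with_digit_regex(sample, ascii_only=True))
print(starts_with_digit_ascii(sample))
```

Whichever semantics are chosen, pinning them down with a test like the one above avoids depending on the default silently.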


89-212: Consider parameterizing to reduce verbosity and ease maintenance.

Turn the many literals into a table-driven test for quicker additions and clearer failures.

Here’s a compact pattern you can drop in (in addition to or replacing the current function):

+@pytest.mark.parametrize(
+    "raw,expected",
+    [
+        ("", ""),
+        ("valid_name", "valid_name"),
+        ("ValidName", "ValidName"),
+        ("valid123", "valid123"),
+        ("123name", "_123name"),
+        ("-name", "_name"),
+        ("na-me", "na_me"),
+        ("na..me", "na_me"),
+        ("name-", "name_"),
+        ("1", "_"),
+        (".", "_"),
+        ("a", "a"),
+        ("123-test.name@", "_123_test_name_"),
+        (".table.name.", "_table_name_"),
+        ("!@#$%^&*()", "_"),
+        ("user.email", "user_email"),
+        ("order-total", "order_total"),
+        ("2023_sales", "_2023_sales"),
+        ("product_name!", "product_name_"),
+    ],
+)
+def test_clean_display_name_param(raw, expected):
+    assert clean_display_name(raw) == expected
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 2aa6f8a and bb83f2b.

📒 Files selected for processing (6)
  • wren-ai-service/src/pipelines/generation/semantics_description.py (3 hunks)
  • wren-ai-service/src/pipelines/generation/utils/sql.py (3 hunks)
  • wren-ai-service/src/pipelines/indexing/__init__.py (2 hunks)
  • wren-ai-service/src/pipelines/indexing/db_schema.py (2 hunks)
  • wren-ai-service/src/pipelines/indexing/utils/helper.py (2 hunks)
  • wren-ai-service/tests/pytest/test_utils.py (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-20T02:37:21.292Z
Learnt from: cyyeh
PR: Canner/WrenAI#1763
File: wren-ai-service/src/pipelines/generation/semantics_description.py:31-33
Timestamp: 2025-06-20T02:37:21.292Z
Learning: In the wren-ai-service codebase, when adding new fields like "alias" to the output of functions that use Pydantic models for validation, the user prefers not to update the corresponding Pydantic model definitions to include these new fields.

Applied to files:

  • wren-ai-service/src/pipelines/generation/semantics_description.py
  • wren-ai-service/src/pipelines/generation/utils/sql.py
🧬 Code graph analysis (4)
wren-ai-service/src/pipelines/indexing/db_schema.py (1)
wren-ai-service/src/pipelines/indexing/__init__.py (1)
  • clean_display_name (91-237)
wren-ai-service/src/pipelines/generation/semantics_description.py (1)
wren-ai-service/src/pipelines/indexing/__init__.py (1)
  • clean_display_name (91-237)
wren-ai-service/src/pipelines/indexing/utils/helper.py (1)
wren-ai-service/src/pipelines/indexing/__init__.py (1)
  • clean_display_name (91-237)
wren-ai-service/tests/pytest/test_utils.py (1)
wren-ai-service/src/pipelines/indexing/__init__.py (1)
  • clean_display_name (91-237)
🔇 Additional comments (12)
wren-ai-service/src/pipelines/generation/utils/sql.py (2)

234-242: Examples updated to alias-based notation look good.

The alias demonstration aligns with the new sanitizer and the “no dot in alias” rule.


438-445: JSON array example aligns with alias usage.

The json_fields now reference aliases consistently. No functional issues spotted.

wren-ai-service/src/pipelines/indexing/__init__.py (1)

248-255: PR summary vs. code mismatch: __all__ not updated.

The summary claims the public API was expanded, but the code doesn’t include clean_display_name in __all__. The patch suggested above resolves it.

wren-ai-service/src/pipelines/indexing/utils/helper.py (2)

10-10: Importing clean_display_name here is correct.

Keeps alias logic centralized and avoids duplication.


34-36: Column alias now sanitized — good.

This makes column comment metadata consistent with model-level aliasing.

wren-ai-service/src/pipelines/indexing/db_schema.py (2)

18-23: Consolidated import incl. clean_display_name — good.

Clearer and prepares for consistent aliasing.


16-24: All raw displayName usages are sanitized.

Verified that every displayName reference in pipelines/indexing and pipelines/generation is either passed through clean_display_name or removed; no further changes required.

wren-ai-service/src/pipelines/generation/semantics_description.py (2)

15-16: Importing clean_display_name here is appropriate.

Keeps alias derivation consistent with indexing.


106-109: Aliasing added to prompt payload only; safe with existing Pydantic models.

No downstream schema changes required; alias fields are present as intended in the prompt payload.

wren-ai-service/tests/pytest/test_utils.py (3)

11-11: Import looks correct and aligned with the new public API.

clean_display_name is imported from src.pipelines.indexing as intended.


87-88: No-op change.

Blank line change; nothing to review.


188-195: Clarify underscore collapsing intent for user-authored underscores.

The implementation collapses all consecutive underscores, including user-provided ones. If that's desired, add a test like below; if not, we should tweak the sanitizer.

Option A (collapse all):

     assert (
         clean_display_name("na..me") == "na_me"
     )  # dots become underscores, then collapsed
+    assert clean_display_name("name__test") == "name_test"

Option B (preserve user underscores): keep tests as-is and we’ll adjust the sanitizer accordingly on the prod side.
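The two options can be contrasted with a minimal sketch. Both helpers are hypothetical and cover only the replace-and-collapse step, not the full sanitizer: Option A collapses every run of underscores, while one possible reading of Option B replaces each run of invalid characters with a single underscore and leaves user-typed underscores alone.

```python
import re

INVALID = r"[^A-Za-z0-9_]"


def collapse_all(name: str) -> str:
    # Option A: replace invalid characters, then collapse every underscore run.
    return re.sub(r"_+", "_", re.sub(INVALID, "_", name))


def preserve_user_underscores(name: str) -> str:
    # Option B (one possible approach): each run of invalid characters becomes
    # a single underscore; underscores typed by the user are left untouched.
    return re.sub(INVALID + "+", "_", name)


for raw in ["na..me", "name__test"]:
    print(raw, collapse_all(raw), preserve_user_underscores(raw))
```

One caveat with this Option B sketch: an invalid character next to a user underscore (for example "a-_b") still yields a double underscore, so the desired behavior there would need its own test.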

@cyyeh cyyeh merged commit 1dd8ad2 into main Sep 2, 2025
12 checks passed
@cyyeh cyyeh deleted the chore/ai-service/improve-displayName branch September 2, 2025 06:32

Labels

ci/ai-service, module/ai-service, wren-ai-service

3 participants