
UN-2807 [MISC] Changed user_data to custom_data in variable replacement #1548

Merged
jaseemjaskp merged 12 commits into main from feature/user-data-variable-support
Sep 22, 2025

Conversation

@jaags-dev
Contributor

@jaags-dev jaags-dev commented Sep 22, 2025

What

  • Add custom_data variable support
  • Implement custom_data field validation and processing for API deployments
  • Support nested JSON object access in variable replacement (e.g., {{custom_data.name}}, {{custom_data.address.city}})
  • Enable custom_data variables in both direct API deployments and exported Prompt Studio tools
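For illustration, a request body exercising these variables might look like the sketch below. The endpoint and surrounding request fields are hypothetical; only the custom_data key and the {{custom_data.*}} placeholder syntax come from this PR.

```python
import json

# Hypothetical execution request body; only "custom_data" and the
# {{custom_data.*}} placeholders below are from this PR.
request_body = {
    "custom_data": {
        "name": "Acme Corp",
        "address": {"city": "Berlin"},
    }
}

# A prompt template could then reference nested values via dot notation:
template = (
    "Summarize the contract for {{custom_data.name}}, "
    "based in {{custom_data.address.city}}."
)

print(json.dumps(request_body["custom_data"], sort_keys=True))
```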

Why

  • Users need ability to pass dynamic JSON data to prompts during API deployment execution
  • Current variable system only supports static and dynamic variables, missing support for user-provided JSON data
  • Prompt Studio tools exported as containers need access to the same custom_data functionality as direct API calls
  • Enhances flexibility for users to create more dynamic and data-driven prompt templates

How

  • Enhanced ExecutionRequestSerializer with custom_data JSONField including JSON validation
  • Added CUSTOM_DATA variable type and regex pattern matching in prompt service constants
  • Implemented dot notation parsing for nested JSON traversal in variable replacement engine
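The dot-notation step above can be sketched as follows. This is a minimal illustration, not the service code: the real pattern lives in VariableConstants.CUSTOM_DATA_VARIABLE_REGEX in the prompt service, and this sketch assumes {{...}}-delimited placeholders.

```python
import re

# Assumed placeholder syntax: {{custom_data.<dot.separated.path>}}
CUSTOM_DATA_VARIABLE_REGEX = r"\{\{custom_data\.([a-zA-Z0-9_.]+)\}\}"

def resolve_path(data: dict, dot_path: str):
    """Walk a nested dict with a dot-separated path; raises KeyError if a key is missing."""
    value = data
    for key in dot_path.split("."):
        value = value[key]
    return value

def replace_custom_data_variables(prompt: str, custom_data: dict) -> str:
    """Substitute each {{custom_data.<path>}} placeholder with its resolved value."""
    def _sub(match: re.Match) -> str:
        return str(resolve_path(custom_data, match.group(1)))
    return re.sub(CUSTOM_DATA_VARIABLE_REGEX, _sub, prompt)

data = {"name": "Acme", "address": {"city": "Berlin"}}
text = "Customer {{custom_data.name}} is based in {{custom_data.address.city}}."
print(replace_custom_data_variables(text, data))
# Customer Acme is based in Berlin.
```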

Can this PR break any existing features? If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)

No, this PR should not break any existing features because:

  • custom_data field is optional (required=False, allow_null=True) in API serializer
  • Added to EXECUTION_EXCLUDED_PARAMS to prevent passing to incompatible methods
  • Variable replacement maintains backward compatibility with existing static/dynamic variables
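The optional-field behavior described above can be sketched in plain Python. The actual check lives in ExecutionRequestSerializer (a DRF JSONField with required=False, allow_null=True); the helper name here is illustrative.

```python
import json

def validate_custom_data(raw):
    """Accept None (field omitted/null) or a JSON object; reject anything else.

    Mirrors the serializer-level behavior sketched above: the field is
    optional, so None passes through untouched.
    """
    if raw is None:
        return None
    if isinstance(raw, str):
        raw = json.loads(raw)  # tolerate a JSON-encoded string payload
    if not isinstance(raw, dict):
        raise ValueError("custom_data must be a JSON object")
    return raw

print(validate_custom_data(None))          # None: omitted field stays optional
print(validate_custom_data('{"k": "v"}'))  # {'k': 'v'}
```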

Database Migrations

  • No database schema changes required
  • All changes are at the application/service layer

Env Config

  • No new environment variables required
  • Uses existing workflow execution and metadata infrastructure

Relevant Docs

Related Issues or PRs

Zipstack/unstract-sdk#202

Dependencies Versions

Notes on Testing

  • Tested custom_data JSON validation in API serializer
  • Verified dot notation variable replacement for nested objects (custom_data.address.city)
  • Confirmed exported Prompt Studio tools receive custom_data through metadata
  • Validated backward compatibility with existing static/dynamic variables
  • Tested error handling for invalid JSON and missing object keys
  • Verified workflow execution pipeline passes custom_data correctly

Screenshots

N/A - Backend API feature

Checklist

I have read and understood the Contribution Guidelines.

@coderabbitai
Contributor

coderabbitai bot commented Sep 22, 2025

Summary by CodeRabbit

  • Refactor

    • Renamed the “user_data” field to “custom_data” across APIs, workflows, metadata, and prompt services.
    • Updated request/response payloads, serializer fields, method parameters, and variable replacement to use custom_data (e.g., prompt variables: custom_data.path.to.value).
    • Execution exclusions and metadata keys now reference custom_data.
  • Chores

    • Bumped Structure Tool version to 0.0.88.
  • Note

    • This is a breaking change: update client requests, templates, and integrations to use the custom_data key.

Walkthrough

A project-wide rename changes the user-provided payload key from user_data to custom_data. This propagates through API constants, serializers, helpers, workflow orchestration, file metadata handling, prompt-service variable replacement, and tool settings. Environment/tool versions are bumped from 0.0.87 to 0.0.88. One serializer file contains merge-conflict markers.

Changes

Cohort / File(s) Summary of changes
API v2 surface
backend/api_v2/constants.py, backend/api_v2/deployment_helper.py, backend/api_v2/api_deployment_views.py
Rename USER_DATA → CUSTOM_DATA; function parameters and calls now use custom_data; serializer access key switched accordingly.
API v2 serializers
backend/api_v2/serializers.py
ExecutionRequestSerializer field user_data → custom_data; validator renamed and messages updated; note: merge-conflict markers present in docstring/field area.
Workflow manager orchestration
backend/workflow_manager/workflow_v2/workflow_helper.py, backend/workflow_manager/workflow_v2/file_execution_tasks.py, backend/workflow_manager/endpoint_v2/source.py, backend/workflow_manager/workflow_v2/dto.py
Public signatures updated to custom_data; EXECUTION_EXCLUDED_PARAMS filters custom_data; FileData now exposes custom_data; calls to add_file_to_volume pass custom_data; minor formatting-only hunks elsewhere.
Workflow execution layer
unstract/workflow-execution/src/unstract/workflow_execution/constants.py, unstract/workflow-execution/src/unstract/workflow_execution/execution_file_handler.py
MetaDataKey.USER_DATA → CUSTOM_DATA; ExecutionFileHandler.add_metadata_to_volume parameter renamed to custom_data; metadata writes use CUSTOM_DATA key.
Prompt service constants and variable replacement
prompt-service/src/unstract/prompt_service/constants.py, prompt-service/src/unstract/prompt_service/helpers/variable_replacement.py, prompt-service/src/unstract/prompt_service/services/variable_replacement.py, prompt-service/src/unstract/prompt_service/controllers/answer_prompt.py
Public constants and enum updated to CUSTOM_DATA; regex targets custom_data; replacement helper renamed replace_custom_data_variable; service/controller accept and propagate custom_data instead of user_data.
Structure tool
tools/structure/src/constants.py, tools/structure/src/main.py, tools/structure/src/config/properties.json
SettingsKeys.USER_DATA → CUSTOM_DATA; payload now writes custom_data; toolVersion bumped 0.0.87 → 0.0.88.
Env version bump
backend/sample.env
STRUCTURE_TOOL_IMAGE_URL and STRUCTURE_TOOL_IMAGE_TAG updated to 0.0.88.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant C as Client
    participant API as API v2
    participant DH as DeploymentHelper
    participant WH as WorkflowHelper
    participant Q as Task Queue
    participant EFH as ExecutionFileHandler
    participant SRC as SourceConnector
    participant VOL as Volume/Metadata

    note over C,API: Request contains custom_data
    C->>API: POST /execute (custom_data)
    API->>DH: execute_workflow(custom_data)
    DH->>WH: execute_workflow_async(custom_data)
    WH->>Q: Enqueue task (custom_data)
    Q-->>WH: Task started
    WH->>SRC: add_file_to_volume(..., custom_data)
    SRC->>EFH: add_metadata_to_volume(..., custom_data)
    EFH->>VOL: Write metadata { ..., custom_data }
    VOL-->>EFH: OK
    EFH-->>SRC: OK
    SRC-->>WH: File prepared
    WH-->>DH: Execution progressing
    DH-->>API: Ack
    API-->>C: 202 Accepted
    note right of VOL: Metadata field key: "custom_data"

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 57.14%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Title Check — ✅ Passed: the title references a real part of the changeset (renaming user_data → custom_data in variable replacement) but understates the scope, since this PR performs a cross-cutting rename and API/serializer changes across multiple modules (constants, serializers, workflow, metadata, prompt service, etc.). As written, the title is only partially representative of the main change.
  • Description Check — ✅ Passed: the description follows the repository template, fills the required What/Why/How sections, includes a "Can this PR break..." justification, and documents testing and related PRs. Some non-critical sections (Relevant Docs, Dependencies Versions) are blank, and the description does not call out the unresolved merge-conflict markers observed in serializers.py or give migration guidance for consumers still sending "user_data".
✨ Finishing touches
  • 📝 Generate Docstrings
  • 🧪 Generate unit tests
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch feature/user-data-variable-support


@jaags-dev changed the title from "Feature/user data variable support" to "UN-2807 [FEAT] Add custom_data variable support for Prompt Studio" on Sep 22, 2025
@jaags-dev changed the title from "UN-2807 [FEAT] Add custom_data variable support for Prompt Studio" to "UN-2807 [FEAT] Changed user_data to custom_data in variable replacement" on Sep 22, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
backend/workflow_manager/workflow_v2/file_execution_tasks.py (3)

555-563: Crash risk: _build_final_result called with workflow_file_execution=None.

This path passes workflow_file_execution=None, but _build_final_result dereferences .id, causing AttributeError. Fix in _build_final_result to handle None.

See proposed fix in the comment for Lines 1128-1192.


564-584: UnboundLocalError risk: destination/workflow_log may be undefined in except.

If the exception occurs before these vars are assigned, referencing them will raise. Initialize both before the try and guard their use.

     def _process_file(
         cls,
         current_file_idx: int,
         total_files: int,
         file_data: FileData,
         file_hash: FileHash,
         workflow_execution: WorkflowExecution,
         workflow_file_execution: WorkflowFileExecution | None = None,
     ) -> FileExecutionResult:
@@
-        try:
+        # Ensure variables exist for exception paths
+        destination: DestinationConnector | None = None
+        workflow_log: WorkflowLog | None = None
+        try:
@@
-        except Exception as error:
+        except Exception as error:
             if isinstance(error, UnsupportedMimeTypeError):
                 error_msg = str(error)
             else:
                 error_msg = f"File execution failed: {error}"
-                workflow_log.log_error(
-                    logger=logger, message=error_msg, exc_info=True, stack_info=True
-                )
+                if workflow_log:
+                    workflow_log.log_error(
+                        logger=logger, message=error_msg, exc_info=True, stack_info=True
+                    )
+                else:
+                    logger.error(error_msg, exc_info=True, stack_info=True)
             workflow_file_execution.update_status(
                 status=ExecutionStatus.ERROR, execution_error=error_msg[:500]
             )
             result = FinalOutputResult(output=None, metadata=None, error=error_msg)
             return cls._build_final_result(
                 workflow_execution=workflow_execution,
                 file_hash=file_hash,
                 result=result,
                 workflow_file_execution=workflow_file_execution,
                 error=error_msg,
-                is_api=destination.is_api if destination else False,
+                is_api=destination.is_api if destination else False,
                 destination=destination,
             )

1128-1192: Null dereference: workflow_file_execution may be None in _build_final_result.

Guard all usages and skip tracker/usage updates when not available.

     def _build_final_result(
         cls,
         workflow_execution: WorkflowExecution,
         file_hash: FileHash,
         result: FinalOutputResult,
-        workflow_file_execution: WorkflowFileExecution | None = None,
+        workflow_file_execution: WorkflowFileExecution | None = None,
         error: str | None = None,
         is_api: bool = False,
         destination: DestinationConnector | None = None,
     ) -> FileExecutionResult:
         """Construct and cache the final execution result."""
-        final_result = FileExecutionResult(
+        file_execution_id = (
+            str(workflow_file_execution.id) if workflow_file_execution else ""
+        )
+        final_result = FileExecutionResult(
             file=file_hash.file_name,
-            file_execution_id=str(workflow_file_execution.id),
+            file_execution_id=file_execution_id,
             error=error,
             result=result.output,
             metadata=result.metadata,
         )
 
-        if is_api:
+        if is_api and workflow_file_execution:
             # Update cache with final result
             ResultCacheUtils.update_api_results(
                 workflow_id=workflow_execution.workflow.id,
                 execution_id=str(workflow_execution.id),
                 api_result=final_result,
             )
@@
-                APIHubUsageUtil.track_api_hub_usage(
-                    workflow_execution_id=str(workflow_execution.id),
-                    workflow_file_execution_id=str(workflow_file_execution.id),
-                    organization_id=organization_id,
-                )
+                if workflow_file_execution:
+                    APIHubUsageUtil.track_api_hub_usage(
+                        workflow_execution_id=str(workflow_execution.id),
+                        workflow_file_execution_id=str(workflow_file_execution.id),
+                        organization_id=organization_id,
+                    )
@@
-        cls._update_file_execution_tracker(
-            execution_id=str(workflow_execution.id),
-            file_execution_id=str(workflow_file_execution.id),
-            stage=FileExecutionStage.COMPLETED,
-            status=status,
-            error=error,
-        )
-        cls.delete_tool_execution_tracker(
-            execution_id=str(workflow_execution.id),
-            file_execution_id=str(workflow_file_execution.id),
-        )
+        if workflow_file_execution:
+            cls._update_file_execution_tracker(
+                execution_id=str(workflow_execution.id),
+                file_execution_id=str(workflow_file_execution.id),
+                stage=FileExecutionStage.COMPLETED,
+                status=status,
+                error=error,
+            )
+            cls.delete_tool_execution_tracker(
+                execution_id=str(workflow_execution.id),
+                file_execution_id=str(workflow_file_execution.id),
+            )
 
         return final_result
unstract/workflow-execution/src/unstract/workflow_execution/execution_file_handler.py (1)

98-106: Public API break: rename user_data→custom_data without alias.

This method is likely imported by other modules. Provide a backward‑compatible alias for user_data (deprecated).

-    def add_metadata_to_volume(
+    def add_metadata_to_volume(
         self,
         input_file_path: str,
         file_execution_id: str,
         source_hash: str,
         tags: list[str],
         llm_profile_id: str | None = None,
-        custom_data: dict[str, Any] | None = None,
+        custom_data: dict[str, Any] | None = None,
+        **kwargs,
     ) -> None:
@@
-        # Add custom_data to metadata if provided
-        if custom_data:
+        # Back-compat: allow legacy 'user_data' kwarg
+        if custom_data is None and "user_data" in kwargs:
+            custom_data = kwargs.get("user_data")
+        # Add custom_data to metadata if provided
+        if custom_data:
             content[MetaDataKey.CUSTOM_DATA] = custom_data
backend/api_v2/constants.py (1)

14-15: Remove merge conflict markers in serializers and add user_data fallback in views

  • In backend/api_v2/serializers.py (around lines 214–219), remove the leftover conflict markers (=======, >>>>>>>) so only the custom_data doc remains.
  • In backend/api_v2/api_deployment_views.py (line 73), fall back to the legacy key:
custom_data = (
    serializer.validated_data.get(ApiExecution.CUSTOM_DATA)
    or serializer.validated_data.get('user_data')
)

This ensures clients posting user_data continue working during the deprecation window.

prompt-service/src/unstract/prompt_service/services/variable_replacement.py (1)

83-97: Stop logging the fully rendered prompt (PII/secret leakage).

Full prompts may contain sensitive CUSTOM_DATA; emitting them to logs violates least‑data and can breach compliance.

Apply this diff to redact:

-            app.logger.info(
-                f"[{tool_id}] Prompt after variable replacement: {prompt_text}"
-            )
+            app.logger.info(f"[{tool_id}] Prompt after variable replacement: [redacted]")
             publish_log(
                 log_events_id,
                 {
                     "tool_id": tool_id,
                     "prompt_key": prompt_name,
                     "doc_name": doc_name,
                 },
                 LogLevel.DEBUG,
                 RunLevel.RUN,
-                f"Prompt after variable replacement:{prompt_text} ",
+                "Prompt after variable replacement: [redacted]",
             )

If you must debug content, gate with a secure feature flag and redact values. I can provide a minimal redactor if needed.

🧹 Nitpick comments (17)
backend/workflow_manager/workflow_v2/file_execution_tasks.py (1)

821-828: Bug: provider_file_uuid mismatch logged without comparison.

This warns even when IDs are equal. Compare before warning.

-        if file_history.provider_file_uuid and file_hash.provider_file_uuid:
-            logger.warning(
-                f"Provider file UUID mismatch for file '{file_hash.file_name}' in workflow '{workflow}'"
-            )
+        if (
+            file_history.provider_file_uuid
+            and file_hash.provider_file_uuid
+            and file_history.provider_file_uuid != file_hash.provider_file_uuid
+        ):
+            logger.warning(
+                f"Provider file UUID mismatch for file '{file_hash.file_name}' in workflow '{workflow}'"
+            )
unstract/workflow-execution/src/unstract/workflow_execution/execution_file_handler.py (2)

112-121: Docstring missing param for custom_data.

Add param details for custom_data.

         Parameters:
             input_file_path (str): The path of the input file.
             file_execution_id (str): Unique execution id for the file.
             source_hash (str): The hash value of the source/input file.
             tags (list[str]): Tag names associated with the workflow execution.
             llm_profile_id (str, optional): LLM profile ID for overriding tool settings.
+            custom_data (dict[str, Any], optional): Arbitrary user-provided metadata to persist with the file's METADATA.json.

153-155: Tiny log grammar nit.

Consider: “metadata for … is added into execution directory.”

prompt-service/src/unstract/prompt_service/constants.py (1)

175-176: Regex covers dot-paths; consider BC alias if templates still use user_data.

If you need a deprecation window, support both patterns temporarily at the extractor.

prompt-service/src/unstract/prompt_service/controllers/answer_prompt.py (1)

56-56: BC: accept legacy 'user_data' from payload if CUSTOM_DATA absent.

Prevents breaking existing API clients.

-    custom_data: dict[str, Any] = payload.get(PSKeys.CUSTOM_DATA, {})
+    custom_data: dict[str, Any] = payload.get(PSKeys.CUSTOM_DATA) or payload.get("user_data", {})
+    if not isinstance(custom_data, dict):
+        custom_data = {}
tools/structure/src/main.py (1)

223-225: BC: accept legacy 'user_data' if CUSTOM_DATA absent and validate type.

tools/structure/src/constants.py defines CUSTOM_DATA (line 83); tools/structure/src/main.py (lines 223–225) currently falls back to {} and will ignore legacy "user_data" — add a fallback to "user_data" and ensure custom_data is a dict.

-        custom_data = self.get_exec_metadata.get(SettingsKeys.CUSTOM_DATA, {})
-        payload["custom_data"] = custom_data
+        custom_data = self.get_exec_metadata.get(SettingsKeys.CUSTOM_DATA)
+        # Back-compat: fallback to legacy key if present in exec metadata
+        if custom_data is None:
+            custom_data = self.get_exec_metadata.get("user_data")
+        if not isinstance(custom_data, dict):
+            self.stream_log("Ignoring non-dict custom_data in exec metadata")
+            custom_data = {}
+        payload["custom_data"] = custom_data
tools/structure/src/constants.py (1)

83-83: Add a transitional alias for backward compatibility (optional).

If any external tools/configs still send "user_data", consider a short-lived alias to de-risk the rollout.

Apply this diff:

@@
-    CUSTOM_DATA = "custom_data"
+    CUSTOM_DATA = "custom_data"
+    # TODO: remove after one minor release
+    USER_DATA = "custom_data"

Also, minor nit: SettingsKeys contains duplicate names (e.g., NAME, OUTPUTS, TOOL_ID) earlier in the class—worth consolidating separately.

unstract/workflow-execution/src/unstract/workflow_execution/constants.py (1)

49-49: Provide a migration-friendly alias (optional).

Existing METADATA.json written with "user_data" may still be present in volumes. A temporary alias helps readers tolerate old artifacts.

@@
-    CUSTOM_DATA = "custom_data"
+    CUSTOM_DATA = "custom_data"
+    # TODO: remove after one minor release
+    USER_DATA = "custom_data"

Please verify readers/writers of metadata now use MetaDataKey.CUSTOM_DATA everywhere and gracefully handle old artifacts. If you want, I can script-check the repo for remaining "user_data" metadata usages.

prompt-service/src/unstract/prompt_service/helpers/variable_replacement.py (1)

64-68: Use re.search() instead of re.findall() for presence check.

Small clarity/perf win; avoids building a list when only existence matters.

-        custom_data_pattern = re.compile(VariableConstants.CUSTOM_DATA_VARIABLE_REGEX)
-        if re.findall(custom_data_pattern, variable):
+        custom_data_pattern = re.compile(VariableConstants.CUSTOM_DATA_VARIABLE_REGEX)
+        if re.search(custom_data_pattern, variable):
             variable_type = VariableType.CUSTOM_DATA
backend/api_v2/serializers.py (1)

234-235: Expose custom_data field (OK). Consider accepting legacy user_data for a deprecation window.

To avoid breaking existing clients, optionally accept user_data write-only and map it to custom_data if custom_data is absent.

@@
-    custom_data = JSONField(required=False, allow_null=True)
+    custom_data = JSONField(required=False, allow_null=True)
+    # Backward-compat: accept legacy key, write-only
+    user_data = JSONField(required=False, allow_null=True, write_only=True)

Add mapping in validate (outside the shown hunk):

# Insert at the start of ExecutionRequestSerializer.validate()
legacy = data.pop("user_data", None)
if legacy is not None and data.get("custom_data") is None:
    data["custom_data"] = legacy
elif legacy is not None and data.get("custom_data") is not None:
    raise ValidationError({"custom_data": "Provide either custom_data or user_data, not both."})

If you prefer a hard cutover, skip the alias; otherwise, I can open a follow-up PR with tests and docs for the transition.

prompt-service/src/unstract/prompt_service/services/variable_replacement.py (4)

37-38: Fix implicit Optional typing (RUF013).

Use explicit union for optional types.

Apply this diff:

-        custom_data: dict[str, Any] = None,
+        custom_data: dict[str, Any] | None = None,

101-104: Fix implicit Optional typing (RUF013).

Mirror the public signature fix here.

Apply this diff:

-        prompt_text: str,
-        variable_map: dict[str, Any],
-        custom_data: dict[str, Any] = None,
+        prompt_text: str,
+        variable_map: dict[str, Any],
+        custom_data: dict[str, Any] | None = None,

126-131: Handle missing/empty custom_data when CUSTOM_DATA variables are present.

The `and custom_data` guard skips replacement for empty dicts and silently leaves placeholders in the prompt. Prefer failing fast or explicit no-data behavior.

Option A (fail fast):

-            elif variable_type == VariableType.CUSTOM_DATA and custom_data:
-                prompt_text = VariableReplacementHelper.replace_custom_data_variable(
+            elif variable_type == VariableType.CUSTOM_DATA:
+                if custom_data is None:
+                    raise KeyError(f"Missing custom_data for variable: {variable}")
+                prompt_text = VariableReplacementHelper.replace_custom_data_variable(
                     prompt=prompt_text,
                     variable=variable,
                     custom_data=custom_data,
                 )

Please confirm expected behavior when CUSTOM_DATA variables exist but custom_data is {} or None.


16-24: Docstring arg name mismatch.

Arg doc refers to prompt but function param is prompt_text.

Apply this diff:

-        Args:
-            prompt (str): Prompt to check
+        Args:
+            prompt_text (str): Prompt to check
backend/workflow_manager/workflow_v2/workflow_helper.py (3)

154-156: Signature change LGTM; update docstrings where applicable.

Parameter renamed to custom_data: dict[str, Any] | None = None. Ensure any docstrings/comments reflect this.


445-446: Celery payload risk: size and JSON‑serializability of custom_data.

Large or non‑JSON‑serializable custom_data can break task enqueueing or exceed broker limits.

  • Enforce JSON‑serializable dicts and consider a size cap (e.g., 256–512 KB).
  • Optionally strip/whitelist keys before enqueue.

Example pre‑validation (before send_task):

@@
-            async_execution: AsyncResult = celery_app.send_task(
+            # Ensure custom_data is JSON-serializable and bounded
+            if custom_data is not None:
+                try:
+                    _cd_json = json.dumps(custom_data)
+                    # Optional: cap at 512KB
+                    if len(_cd_json.encode("utf-8")) > 512 * 1024:
+                        raise ValueError("custom_data too large for async payload")
+                except TypeError as e:
+                    raise ValueError(f"custom_data must be JSON-serializable: {e}")
+            async_execution: AsyncResult = celery_app.send_task(

485-486: Consider sanitizing custom_data before passing to Celery.

Pass sanitized object to reduce risk and copy by value.

Apply this diff and helper:

-                    "custom_data": custom_data,
+                    "custom_data": custom_data if custom_data is not None else None,

Optional helper (outside this hunk) to centralize logic:

def _sanitize_custom_data(obj: dict[str, Any] | None, max_bytes: int = 512 * 1024) -> dict[str, Any] | None:
    if obj is None:
        return None
    try:
        data = json.loads(json.dumps(obj))  # ensure JSON-serializable copy
    except TypeError as e:
        raise ValueError(f"custom_data must be JSON-serializable: {e}")
    if len(json.dumps(data).encode("utf-8")) > max_bytes:
        raise ValueError("custom_data too large")
    return data

Then call _sanitize_custom_data(custom_data) before enqueue.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to Reviews > Disable Cache setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between d40ea54 and c953332.

📒 Files selected for processing (18)
  • backend/api_v2/api_deployment_views.py (2 hunks)
  • backend/api_v2/constants.py (1 hunks)
  • backend/api_v2/deployment_helper.py (3 hunks)
  • backend/api_v2/serializers.py (3 hunks)
  • backend/sample.env (1 hunks)
  • backend/workflow_manager/endpoint_v2/source.py (2 hunks)
  • backend/workflow_manager/workflow_v2/dto.py (1 hunks)
  • backend/workflow_manager/workflow_v2/file_execution_tasks.py (3 hunks)
  • backend/workflow_manager/workflow_v2/workflow_helper.py (8 hunks)
  • prompt-service/src/unstract/prompt_service/constants.py (3 hunks)
  • prompt-service/src/unstract/prompt_service/controllers/answer_prompt.py (2 hunks)
  • prompt-service/src/unstract/prompt_service/helpers/variable_replacement.py (2 hunks)
  • prompt-service/src/unstract/prompt_service/services/variable_replacement.py (4 hunks)
  • tools/structure/src/config/properties.json (1 hunks)
  • tools/structure/src/constants.py (1 hunks)
  • tools/structure/src/main.py (1 hunks)
  • unstract/workflow-execution/src/unstract/workflow_execution/constants.py (1 hunks)
  • unstract/workflow-execution/src/unstract/workflow_execution/execution_file_handler.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.13.1)
backend/api_v2/serializers.py

261-261: Avoid specifying long messages outside the exception class

(TRY003)

prompt-service/src/unstract/prompt_service/services/variable_replacement.py

37-37: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


103-103: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (18)
prompt-service/src/unstract/prompt_service/constants.py (2)

151-152: Enum rename confirmed — no VariableType.USER_DATA references remain.
rg output shows only VariableType.CUSTOM_DATA occurrences; no VariableType.USER_DATA found.


70-71: Approve — rename clear; verify no remaining USER_DATA references

File: prompt-service/src/unstract/prompt_service/constants.py (CUSTOM_DATA, TEXT). rg in prompt-service returned no matches; run a repo-wide search for 'USER_DATA' / 'user_data' and confirm all downstream consumers are updated.

tools/structure/src/config/properties.json (1)

5-5: Approve: version bump and image tags aligned to 0.0.88

Verified — tools/structure/src/config/properties.json and backend/sample.env both reference 0.0.88; no 0.0.87 occurrences found.

backend/workflow_manager/workflow_v2/file_execution_tasks.py (2)

922-925: Confirmed — signature matches; no change required.
Definition: def log_total_cost_per_file(self, run_id: str, file_name: str) in backend/workflow_manager/workflow_v2/execution.py:279; call uses run_id and file_name in backend/workflow_manager/workflow_v2/file_execution_tasks.py:922-925.


749-756: Confirm add_file_to_volume signature accepts custom_data (not user_data).

Verify backend/workflow_manager/endpoint_v2/source.py (def add_file_to_volume — ~line 918) accepts the keyword parameter custom_data; update either the callee or the callsites to match to avoid a runtime TypeError.

prompt-service/src/unstract/prompt_service/controllers/answer_prompt.py (1)

90-100: replace_variables_in_prompt accepts custom_data — no action required. The signature in prompt-service/src/unstract/prompt_service/services/variable_replacement.py declares custom_data: dict[str, Any] = None and internal calls pass custom_data.

backend/api_v2/serializers.py (1)

255-263: Validator rename aligned; message clear.

No issues.

prompt-service/src/unstract/prompt_service/helpers/variable_replacement.py (1)

103-156: CUSTOM_DATA_VARIABLE_REGEX exposes the path in group(1) — resolved.
Regex is defined as r"custom_data.([a-zA-Z0-9_.]+)" in prompt-service/src/unstract/prompt_service/constants.py:175, so custom_data_match.group(1) correctly yields the path; no change required.

backend/api_v2/api_deployment_views.py (1)

73-74: LGTM — confirm ApiExecution.USER_DATA is removed and execute_workflow is called with the custom_data keyword.
Sandbox ripgrep produced no output, so this could not be verified here; re-run the two checks locally or paste their output.
Location: backend/api_v2/api_deployment_views.py (≈lines 73–91).

backend/workflow_manager/endpoint_v2/source.py (1)

924-925: Signature rename approved — downstream handler accepts custom_data.
add_metadata_to_volume includes the custom_data parameter and writes it as content[MetaDataKey.CUSTOM_DATA] = custom_data in unstract/workflow-execution/src/unstract/workflow_execution/execution_file_handler.py (def at line 98, write at line 147).

backend/workflow_manager/workflow_v2/dto.py (1)

159-166: Confirm legacy user_data mapping necessity
I couldn’t find any occurrences of "user_data" in the repo; please verify whether any external producers still send this legacy key before adding the backward-compat mapping.

backend/api_v2/deployment_helper.py (1)

158-159: Verified — WorkflowHelper accepts and propagates custom_data.
execute_workflow_async is defined and custom_data is passed through call sites (backend/workflow_manager/workflow_v2/workflow_helper.py — lines ~206, 325, 433, 702), and propagated to downstream callers (file_execution_tasks.py:754; endpoint_v2/source.py:966).

backend/workflow_manager/workflow_v2/workflow_helper.py (5)

274-275: Propagation to run_workflow LGTM.


325-326: Forwarding custom_data into process_input_files LGTM.


703-703: Propagation into execute_workflow LGTM.


206-207: custom_data is supported in DTO and carried in task payloads; DB persistence is explicitly excluded — confirm intent.

  • FileData declares custom_data and its to_dict()/from_dict() include it (backend/workflow_manager/workflow_v2/dto.py).
  • WorkflowHelper builds FileData(custom_data=...) and sends FileBatchData.to_dict() to Celery (backend/workflow_manager/workflow_v2/workflow_helper.py).
  • FileExecutionTasks reconstructs FileBatchData/FileData and uses file_data.custom_data during processing (backend/workflow_manager/workflow_v2/file_execution_tasks.py).
  • create_workflow_execution explicitly excludes "custom_data" via EXECUTION_EXCLUDED_PARAMS, so custom_data is not persisted to the workflow execution DB and no cache-persist behavior was found — confirm whether custom_data should be stored or intentional omission is desired.

67-68: Excluding "custom_data" is correct — create_workflow_execution has no matching parameter.
Signature at backend/workflow_manager/workflow_v2/execution.py:126 does not include "custom_data", so excluding it prevents silent drops/TypeErrors.

prompt-service/src/unstract/prompt_service/services/variable_replacement.py (1)

69-74: Propagation looks correct — add unit tests for the CUSTOM_DATA path.

replace_custom_data_variable is present (prompt-service/src/unstract/prompt_service/helpers/variable_replacement.py) and is invoked from the service (prompt-service/src/unstract/prompt_service/services/variable_replacement.py, ~lines 126–129); controllers pull CUSTOM_DATA at prompt-service/src/unstract/prompt_service/controllers/answer_prompt.py:56. Add unit tests for:

  • Prompt with CUSTOM_DATA variable + present data (assert replacement).
  • Prompt with CUSTOM_DATA variable + empty/missing data (assert no crash and expected fallback/behavior).

@github-actions
Contributor
| filepath | function | passed | SUBTOTAL |
| --- | --- | --- | --- |
| runner/src/unstract/runner/clients/test_docker.py | test_logs | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_cleanup | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_cleanup_skip | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_client_init | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_image_exists | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_image | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_container_run_config | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_container_run_config_without_mount | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_run_container | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_image_for_sidecar | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_sidecar_container | 1 | 1 |
| **TOTAL** | | **11** | **11** |

@jaags-dev jaags-dev changed the title UN-2807 [FEAT] Changed user_data to custom_data in variable replacement UN-2807 [MISC] Changed user_data to custom_data in variable replacement Sep 22, 2025
@sonarqubecloud

@jaseemjaskp jaseemjaskp merged commit 29b89ff into main Sep 22, 2025
4 checks passed
@jaseemjaskp jaseemjaskp deleted the feature/user-data-variable-support branch September 22, 2025 13:16

4 participants