[Responses API] Structured output + reasoning via structural tag embedding by will-deines · Pull Request #35873 · vllm-project/vllm

will-deines · 2026-03-03T15:12:07Z

Summary

Embed content constraints in structural tags: When a user requests JSON schema enforcement (text.format.type=json_schema) with a GPT-OSS reasoning model, the grammar constraint is now scoped to the <|channel|>final region via xgrammar's TriggeredTagsFormat. Previously, grammar bitmasks were applied from token 0, clobbering reasoning output.
Handle json_object format: text.format.type=json_object was silently ignored in the Responses API. Now produces StructuredOutputsParams(json_object=True), matching chat completions behavior.
Fix streaming + json_schema alias bug: Remove .model_dump() in the streaming path that dropped the schema → schema_ Pydantic alias, causing ResponseCreatedEvent deserialization failures.
Apply reasoning channel tags unconditionally: When a reasoning parser is active but no structured output is requested, reasoning channel tags are still applied (the struct_out is None branch).

Approach

Rather than modifying StructuredOutputsParams to allow multiple simultaneous constraint types (which would require deep changes to validation, backends, and dispatch), we embed the content constraint inside the structural tag's <|channel|>final tag.

xgrammar's TagFormat.content field already accepts a discriminated union of JSONSchemaFormat, GrammarFormat, RegexFormat, etc. (defined in xgrammar/structural_tag.py). The infrastructure to "apply JSON schema grammar only within the <|channel|>final region" already exists — we just wire it up from the Responses API.

This means:

StructuredOutputsParams keeps its existing mutual-exclusivity invariant (one constraint type)
The constraint type used is structural_tag, which internally contains both reasoning channel enforcement AND the content constraint scoped to the final channel
xgrammar handles the compilation natively — no custom grammar composition needed
User-specified options (disable_any_whitespace, disable_fallback, etc.) are preserved via dataclasses.replace()

Decisions We Made That Can Be Debated

1. Embed constraint inside structural tag vs. allow multiple constraint types on `StructuredOutputsParams`

What we chose: When a reasoning parser is active and a content constraint (json_schema, regex, grammar, choice) is present, we convert the content constraint into an xgrammar content format dict, embed it in the <|channel|>final tag within the structural tag, then clear the original constraint fields. The final StructuredOutputsParams has only structural_tag set.

Alternative: Modify StructuredOutputsParams to support multiple simultaneous constraint types (e.g. structural_tag + json). This would avoid the mid-pipeline mutation pattern where we clear fields after embedding them, but requires changes to validation logic, backend dispatch in StructuredOutputManager, and every guided decoding backend's understanding of what "one constraint" means.

Why we chose this: xgrammar's TagFormat.content field already supports this composition natively — the infrastructure exists and is tested. The mutual-exclusivity invariant on StructuredOutputsParams is load-bearing across the entire structured output stack, and relaxing it has a large blast radius.

What reviewers might disagree with: The mid-pipeline mutation (clearing json/regex/etc. after embedding) means StructuredOutputsParams no longer reflects what the user originally requested. If downstream code inspects these fields (e.g., for logging, metrics, or error messages), it will see None instead of the original constraint. An alternative could be to construct a fresh StructuredOutputsParams(structural_tag=...) rather than mutating via dataclasses.replace().

2. Fix `text.format` path rather than redirecting users to `structured_outputs` field

What we chose: We fix the standard OpenAI text.format path so that json_schema, json_object, and streaming all work correctly. Users can use either text.format (OpenAI-compatible) or the vLLM-specific structured_outputs field (#33709).

Alternative: Only support structured output through the vLLM-specific structured_outputs field and treat text.format as a passthrough/echo-only field (the status quo before this PR, where json_object was silently ignored).

Context: This is an area of active debate. In #33709, @yeqcharlotte and @chaunceyjiang questioned why structured output wasn't going through text.format instead of a separate field. In #33381, @chaunceyjiang argued vLLM-specific extensions should go through the OpenResponses extension mechanism. Meanwhile, @alecsolder defended the separate field for cross-provider reusability and separation of concerns. In #19097, vllm_-prefixed types were proposed but the RFC was auto-closed without implementation.

Why we chose this: Users coming from the OpenAI SDK will naturally use text.format.type=json_schema — it should just work. The structured_outputs field is additive for vLLM-specific capabilities (grammar, regex, choice) that text.format can't express. Fixing both paths costs little and prevents user confusion.

3. Remove `.model_dump()` vs. add `by_alias=True` for streaming alias bug

What we chose: Remove the .model_dump() call in the streaming path and pass the ResponsesResponse Pydantic object directly to ResponseCreatedEvent, matching how ResponseCompletedEvent already works. This is the approach from #34611.

Alternative: Keep .model_dump() but add by_alias=True so Pydantic serializes schema_ as "schema". This is the approach from #26356, which has community confirmation of working.

Why we chose this: Removing the unnecessary dict round-trip eliminates the entire class of alias bugs rather than patching one instance. This is consistent with @qandrew's own #26185 which previously removed a .model_dump() call on the ResponseCompletedEvent path for the same category of issue. The by_alias=True approach is fragile — any future alias field would break again if someone forgets the flag.

4. Apply reasoning channel tags even when no structured output is requested

What we chose: When struct_out is None and a reasoning parser is active, we now create a StructuredOutputsParams(structural_tag=...) with just the reasoning channel tags. Previously, the prepare_structured_tag() block was only entered when struct_out was already a StructuredOutputsParams instance.

Alternative: Keep the existing behavior where reasoning channel tags are only applied when the user explicitly requests some form of structured output.

Why we chose this: Without structural tags, GPT-OSS models emit raw Harmony format (<|channel|>analysis<|message|>...) that the reasoning parser must post-hoc parse. With structural tags, xgrammar enforces the channel structure at decode time, which is more robust and enables future optimizations. This also means the reasoning parser's is_reasoning_end state machine (which has had multi-turn bugs per #34454) is supplemented by grammar-level enforcement.

What reviewers might disagree with: This changes default behavior for all GPT-OSS requests that don't request structured output. If a model produces valid output without structural tags but would be over-constrained with them, this could cause regressions. We don't have e2e validation of this path yet.

5. `json_object` mapped to `{"type": "object"}` in structural tag content

What we chose: In _constraint_to_content_format(), json_object=True is converted to {"type": "json_schema", "json_schema": {"type": "object"}} for embedding in the structural tag.

Alternative: Map it to a dedicated json_object content format type if xgrammar supports one, or skip embedding entirely and let the existing json_object handling in the structured output backend handle it outside the structural tag.

Why we chose this: xgrammar's TagFormat.content expects one of its known format types (json_schema, regex, grammar, etc.). {"type": "object"} is the minimal JSON schema that enforces "output must be a JSON object" — semantically equivalent to json_object mode. This ensures the constraint is properly scoped to the <|channel|>final region for reasoning models rather than being applied globally.

6. Adding `final_content_format` parameter to the base class `prepare_structured_tag()`

What we chose: We added final_content_format: dict | None = None as an optional parameter on ReasoningParser.prepare_structured_tag() in the base class, with a default of None that preserves backward compatibility.

Alternative: Only add the parameter on GPTOSSReasoningParser and handle the dispatch in serving.py with a type check or capability flag. Or create a separate method like prepare_structured_tag_with_constraint().

Why we chose this: The base class change is backward-compatible (default None, existing implementations don't need changes). The concept of "scope this content constraint to the model's final output region" is generic — it's not GPT-OSS-specific. Other reasoning models (Qwen3, DeepSeek-R1, future models) with structural tag support would benefit from the same interface. Keeping it on the base class establishes a clean contract.

What reviewers might disagree with: This couples content constraint format knowledge (xgrammar dict format) to the reasoning parser interface. If vLLM ever supports a non-xgrammar structured output backend, this dict format may not apply. A more abstract interface (e.g., passing StructuredOutputsParams directly) might be more future-proof.

Related Issues, PRs, and RFCs

Directly Addressed by This PR

#	Title	Status	How This PR Relates
#34857	Responses API & Tool Calling H1 2026 roadmap	Open	Explicitly lists "guided decode and structured outputs" as focus area. This PR delivers that.
#23120	Structured output not correctly enforced with GPT-OSS	Open	Root cause: grammar bitmasks applied from token 0 without structural tag channel separation. This PR fixes the Responses API path.
#26288	`schema` field becomes `None` in streaming with json_schema	Closed	Root-cause analysis of the `schema_`/`schema` alias bug in streaming. The `.model_dump()` removal in this PR fixes it.
#34611	Fix ResponseCreatedEvent ValidationError for json_schema in streaming	Open	Proposes removing `.model_dump()` in streaming path. We adopt this approach.
#26356	Fix json schema alias serializing when streaming	Open	Alternative fix (add `by_alias=True`). We prefer #34611's approach (pass objects directly).
#26822	Fix crash when `text` type response_format received	Merged	Added validation for `type: "text"` passthrough. Our `json_object` handling follows the same pattern.
#26639	ValueError: No valid structured output parameter found	Closed (by #26822)	The `json_object` gap in the Responses API could produce similar errors. Our Step 1 prevents this.

Foundation This PR Builds On

#	Title	Status	Relevance
#33709	Enable generic `structured_outputs` for responses API	Merged	Added the `structured_outputs` field to ResponsesRequest. Our work builds on this.
#32609	Add sampling parameters to Responses API	Merged	Established `to_sampling_params()` infrastructure on ResponsesRequest.
#32712	Initial Parser for Responses API	Merged	Introduced `Parser`/`ParserManager` and the structural tag preparation block we're extending.
#34454	Fix structured output in multi-turn GPT-OSS	Merged	Fixed premature grammar bitmask activation from previous-turn markers. Our structural tag approach inherently avoids this class of bug by constraining grammar to the `<
#32791	chat.completions returns null for GPT-OSS multi-turn with json_object	Closed (by #34454)	Same root cause as #23120. Our approach prevents this by design.

Related RFCs

#	Title	Status	Design Decision
#19097	RFC: Response format extensions for structured outputs	Closed	Led to the `structured_outputs` field. We reuse `StructuredOutputsParams` rather than creating new types.
#33381	RFC: Align with openresponses spec	Open	Argues vLLM-specific extensions should go through extension mechanism. Decision: We keep the existing `structured_outputs` field (already merged in #33709) and also fix the standard `text.format` path. No new protocol extensions.
#29632	RFC: Force EOS when grammar terminates	Open	When grammar is satisfied, model may not produce EOS immediately. Out of scope for this PR but noted as a follow-up.
#16313	Support structured output + tool call together	Open	Tool calls + JSON schema in one request. Our structural tag approach naturally supports this since tool channels and the final channel are independent tags.
#33249	Add `structured_outputs` as instance field on ResponsesRequest	Open	Promotes `structured_outputs` from local var to field for tool parser mutation. Compatible with our changes; we support both the field path and the `text.format` path.

Changes

File	Change
`vllm/entrypoints/openai/responses/protocol.py`	Add `json_object` handling in `to_sampling_params()`
`vllm/entrypoints/openai/responses/serving.py`	Add `_constraint_to_content_format()` helper; rewire structural tag preparation block; fix streaming `.model_dump()`
`vllm/reasoning/abs_reasoning_parsers.py`	Add `final_content_format` param to `prepare_structured_tag()` base class
`vllm/reasoning/gptoss_reasoning_parser.py`	Implement `final_content_format` — append `<
`tests/entrypoints/openai/responses/test_structured_output.py`	New — unit tests for `_constraint_to_content_format`
`tests/v1/structured_output/test_gptoss_structural_tags.py`	Extend with constraint embedding tests
`tests/entrypoints/openai/responses/test_sampling_params.py`	Extend with `json_object` test
`.gitignore`	Add `local_test/`

Test plan

Unit tests pass: pytest tests/entrypoints/openai/responses/test_structured_output.py tests/v1/structured_output/test_gptoss_structural_tags.py tests/entrypoints/openai/responses/test_sampling_params.py -v
No regressions in responses unit tests
Pre-commit passes on all changed files
E2e with GPT-OSS model: verify json_schema + reasoning produces valid JSON with reasoning properly separated
E2e with Qwen3: verify json_schema, json_object, and streaming all work (non-regression)

Out of Scope (follow-ups)

Chat Completions has the same gap (#23120) — serving_chat.py never calls prepare_structured_tag(). Same fix pattern applies but is a separate PR targeting the chat completions path.
strict field forwarding from ResponseFormatTextJSONSchemaConfig — low priority, vLLM always enforces strictly.
Force EOS when grammar terminates (#29632) — separate design discussion affecting all APIs.
OpenResponses alignment (#33381) — policy decision about whether structured_outputs should go through extension mechanism.
structured_outputs as instance field (#33249) — promotes structured_outputs from local variable to field for tool parser mutation. Compatible with our changes but independent concern.

…dding When a user requests JSON schema enforcement (text.format.type=json_schema) with a reasoning model (GPT-OSS), the grammar constraint was never scoped to the final output channel. This caused grammar bitmasks to be applied from token 0, clobbering reasoning output. Fix by embedding content constraints (json_schema, json_object, regex, grammar, choice) inside the structural tag's <|channel|>final region using xgrammar's native TriggeredTagsFormat support. This ensures grammar enforcement only applies within the final output region, not during reasoning. Also: - Handle text.format.type=json_object (was silently ignored) - Fix streaming + json_schema alias bug (.model_dump() dropped schema alias) - Apply reasoning channel tags even when no structured output is requested

gemini-code-assist

Code Review

This pull request enhances structured output capabilities, especially for reasoning models, by embedding content constraints within structural tags. It also adds support for json_object format and fixes a streaming bug. My review found one high-severity issue where options for structured output generation were being dropped when combining with reasoning. I've suggested a fix to preserve these options.

vllm/entrypoints/openai/responses/serving.py

…nstraints When creating a new StructuredOutputsParams with the structural_tag, use dataclasses.replace() to clear content constraint fields while preserving user-specified options like disable_any_whitespace, disable_fallback, disable_additional_properties, and whitespace_pattern.

mergify bot added frontend gpt-oss Related to GPT-OSS models labels Mar 3, 2026

github-project-automation bot added this to gpt-oss Issues & Enhancements Mar 3, 2026

mergify bot added the structured-output label Mar 3, 2026

github-project-automation bot moved this to To Triage in gpt-oss Issues & Enhancements Mar 3, 2026

mergify bot added the v1 label Mar 3, 2026

github-project-automation bot added this to Structured Output Mar 3, 2026

gemini-code-assist bot reviewed Mar 3, 2026

View reviewed changes

vllm/entrypoints/openai/responses/serving.py Outdated Show resolved Hide resolved

garrio-1 and others added 2 commits March 3, 2026 10:20

Merge branch 'main' into worktree-responses-structured-output

02e8882

will-deines closed this Mar 3, 2026

github-project-automation bot moved this to Done in Structured Output Mar 3, 2026

github-project-automation bot moved this from To Triage to Done in gpt-oss Issues & Enhancements Mar 3, 2026

will-deines mentioned this pull request Mar 3, 2026

[Responses API] Structured output + reasoning via structural tag embedding #35904

Open

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Responses API] Structured output + reasoning via structural tag embedding#35873

[Responses API] Structured output + reasoning via structural tag embedding#35873
will-deines wants to merge 3 commits intovllm-project:mainfrom
will-deines:worktree-responses-structured-output

will-deines commented Mar 3, 2026 •

edited

Loading

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

will-deines commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Approach

Decisions We Made That Can Be Debated

1. Embed constraint inside structural tag vs. allow multiple constraint types on StructuredOutputsParams

2. Fix text.format path rather than redirecting users to structured_outputs field

3. Remove .model_dump() vs. add by_alias=True for streaming alias bug

4. Apply reasoning channel tags even when no structured output is requested

5. json_object mapped to {"type": "object"} in structural tag content

6. Adding final_content_format parameter to the base class prepare_structured_tag()

Related Issues, PRs, and RFCs

Directly Addressed by This PR

Foundation This PR Builds On

Related RFCs

Changes

Test plan

Out of Scope (follow-ups)

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

will-deines commented Mar 3, 2026 •

edited

Loading

1. Embed constraint inside structural tag vs. allow multiple constraint types on `StructuredOutputsParams`

2. Fix `text.format` path rather than redirecting users to `structured_outputs` field

3. Remove `.model_dump()` vs. add `by_alias=True` for streaming alias bug

5. `json_object` mapped to `{"type": "object"}` in structural tag content

6. Adding `final_content_format` parameter to the base class `prepare_structured_tag()`