Fix json parsing #1836
Merged
jxnl merged 6 commits into 567-labs:main on Oct 15, 2025
A string value in the JSON could itself contain a codeblock.
The codeblock-pattern regex does not work for nested codeblocks or codeblocks within the JSON payload itself. Using the general regex pattern first fixes the parsing in such cases.
'paren' -> 'brace'
Contributor
Important
Looks good to me! 👍
Reviewed everything up to 8998850 in 52 seconds.
- Reviewed 106 lines of code in 3 files
- Skipped 0 files when reviewing
- Skipped posting 3 draft comments (listed below)
1. instructor/utils/core.py:57
- Draft comment:
Simplified JSON extraction now uses the first and last brace instead of regex, which improves readability and predictability. Be sure this approach meets all edge cases (e.g., multiple JSON objects) as intended.
- Reason this comment was not posted:
Confidence changes required: 80% <= threshold 85%
2. tests/test_json_extraction_edge_cases.py:90
- Draft comment:
The tests for nested code blocks and JSON values with inner code blocks are now improved. They effectively verify that the new extraction method correctly isolates the JSON content.
- Reason this comment was not posted:
Confidence changes required: 80% <= threshold 85%
3. tests/test_json_extraction_edge_cases.py:212
- Draft comment:
Async extraction tests are currently skipped. Consider enabling them (using pytest-asyncio) to provide full coverage of the async JSON extraction functionality.
- Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.
Workflow ID: wflow_3wd31cHc4gxhiUUn
Contributor
Author
@jxnl Hey Jason, this is a pretty small PR, can you review it please? It fixes two tests and essentially removes some code. I don't think it will take much of your time; you will quickly see whether you want to keep the changes, modify them, or discard them. Note: one of the tests was failing silently, which may be the most important element of the PR. Thanks!
Nested code blocks break JSON parsing
JSON parsing with extract_json_from_codeblock fails when nested code blocks are in the input string.
I have seen at least two kinds of errors, which I outline below.
To solve this issue, I propose simply removing the regex. The code becomes simpler and more predictable. I also doubt that the regex is optimized in any meaningful way here; it could even be slower.
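As a rough illustration of the regex-free approach proposed above, here is a minimal sketch that takes everything between the first '{' and the last '}'. The helper name extract_json_by_braces and the sample inputs are assumptions made for this example; this is not the library's actual implementation of extract_json_from_codeblock.

```python
import json


def extract_json_by_braces(text: str) -> str:
    """Return the span from the first '{' to the last '}' in text.

    Hypothetical helper for illustration only. When no such span
    exists, the input is returned unchanged.
    """
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        return text
    return text[start : end + 1]


# The fence is built programmatically so the sample strings stay readable.
FENCE = "`" * 3

# A JSON code block wrapped in an outer code block.
nested = f'{FENCE}\nOuter start\n{FENCE}json\n{{"name": "test"}}\n{FENCE}\nOuter end\n{FENCE}'

# A JSON string value that itself contains a code block.
in_value = f'{FENCE}json\n{{"name": "{FENCE}code{FENCE}"}}\n{FENCE}'

print(json.loads(extract_json_by_braces(nested)))
print(json.loads(extract_json_by_braces(in_value)))
```

Both inputs parse with this sketch because the fences never enter the decision. The trade-off is that text containing several separate JSON objects would be returned as one span from the first object's opening brace to the last object's closing brace.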
Nested 'outer' code blocks
The _JSON_CODEBLOCK_PATTERN regex used in extract_json_from_codeblock is too greedy, making the function completely miss the JSON payload.
Bad test
The only test covering this 'nested code blocks' case wasn't actually testing anything. 'Hacked' by Claude I guess ... See the tests.test_json_extraction_edge_cases module:
Nested code block
In this case, extract_json_from_codeblock was extracting 'Inner start'.
Code block in the JSON payload itself
In a similar way, if one value in the JSON is a string which happens to contain a code block, the parsing fails.
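To make this failure mode concrete, the sketch below runs a fence-matching regex over a payload whose string value contains a code block. The pattern ASSUMED_PATTERN is a hypothetical stand-in written for this example; the real _JSON_CODEBLOCK_PATTERN may differ, so the exact captured fragment may not match what the library returned.

```python
import json
import re

FENCE = "`" * 3

# Hypothetical fence pattern, assumed for illustration only.
# It captures the text between an opening fence and the next fence it finds.
ASSUMED_PATTERN = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)

# A JSON payload whose "name" value is a string containing a code block.
payload = f'{FENCE}json\n{{"name": "{FENCE}code{FENCE}"}}\n{FENCE}'

match = ASSUMED_PATTERN.search(payload)
captured = match.group(1)
print(repr(captured))

try:
    json.loads(captured)
    parsed_ok = True
except json.JSONDecodeError:
    # The capture stops at the fence inside the string value,
    # so only a truncated fragment of the JSON is left.
    parsed_ok = False

print("parsed:", parsed_ok)
```

With this assumed pattern, the match ends at the first fence inside the string value, so the captured text is cut off partway through the object and cannot be parsed.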
Code block in the JSON itself
In this case, extract_json_from_codeblock was extracting ' {"name": "'
Related issue
I added this new test case in tests.test_json_extraction_edge_cases.
Important
Simplifies JSON extraction in extract_json_from_codeblock by removing regex, fixing nested code block handling, and updating tests.
- extract_json_from_codeblock in core.py now extracts JSON by finding the first '{' and last '}' instead of using regex.
- test_nested_codeblocks in test_json_extraction_edge_cases.py to correctly test nested code blocks.
- test_json_with_codeblock_in_a_value in test_json_extraction_edge_cases.py to test JSON values with code blocks.
- test_json_extraction.py to remove misleading comments about regex.
- core.py.
This description was created by Ellipsis for 8998850.