-
Notifications
You must be signed in to change notification settings - Fork 1.6k
[ty] Fix empty spans following a line terminator and unprintable character spans in diagnostics #19535
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
After looking at this again, I think we can probably do both this check and the unprintable check in |
d0c9194 to
b9d27c6
Compare
..._linter/src/message/snapshots/ruff_linter__message__text__tests__unprintable_characters.snap
Outdated
Show resolved
Hide resolved
the original implementation used the input range in `update_range`, which I didn't notice, so I was double-updating the new ranges and eventually causing a panic from an invalid byte offset
b9d27c6 to
665909e
Compare
|
ruff_dbruff_db
ruff_db| fn empty_span_after_line_terminator() -> anyhow::Result<()> { | ||
| let path = Path::new("pycodestyle").join("E11.py"); | ||
| let settings = LinterSettings::for_rule(Rule::NoIndentedBlock); | ||
| let diagnostics = test_path(path, &settings)?; | ||
| let config = DisplayDiagnosticConfig::default().format(DiagnosticFormat::Full); | ||
| let notebook_indexes = FxHashMap::default(); | ||
| let context = EmitterContext::new(¬ebook_indexes); | ||
| let value = DisplayDiagnostics::new(&context, &config, &diagnostics); | ||
| insta::assert_snapshot!(value.to_string()); | ||
| Ok(()) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could these be tests in render::tests instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not very easily. For the tests in ruff_db we have to synthesize diagnostics:
ruff/crates/ruff_db/src/diagnostic/render.rs
Lines 2597 to 2606 in dc92966
| env.builder("unused-import", Severity::Error, "`os` imported but unused") | |
| .primary("fib.py", "1:7", "1:9", "") | |
| .help("Remove unused import: `os`") | |
| .secondary_code("F401") | |
| .fix(Fix::unsafe_edit(Edit::range_deletion(TextRange::new( | |
| TextSize::from(0), | |
| TextSize::from(10), | |
| )))) | |
| .noqa_offset(TextSize::from(7)) | |
| .build(), |
I started out doing that and then realized it would be much easier to use Ruff's actual infrastructure to get the diagnostics and not accidentally test the wrong thing.
It should be safe to delete these tests once Ruff uses the new renderer in general anyway, since they'll be duplicates of existing snapshots.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think it's worth having dedicated tests for this or we'll loose the assertion if the rule changes or gets removed. I'm aware that it requires building the diagnostics manually but I'd expect that ir isn't too hard. It only requires calling a fewer diagnostic builder methods (and it's definitely easier to debug). But maybe I'm missing something?
(Looking at the test. The builder code is roughly as much code as what you have here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the builder code for one diagnostic, whereas the current test has 6 diagnostics in this case and 10 diagnostics for unprintable_characters. I'll strip down the two interesting cases and move them to ruff_db. I thought it was nice to have the full snapshot for diffing against Ruff, but now that that's resolved we can just keep minimized versions of the tests.
I'm still a bit wary of having to manually input the ranges in the builder, which could also fall out of sync with real diagnostics, but I see the benefits otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, that wasn't so bad: a4f7434. I also applied the patch to main and checked that they failed there as expected.
The stripped down versions are also nice because they fit better as inline snapshots.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still a bit wary of having to manually input the ranges in the builder
What I like to do in those cases is to use str.find to get the offset
MichaReiser
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. I only have a few nit comments. The only thing I would change is to make these unit tests in ruff db. We do have the infrastructure to write those (I think?)
| for (ann, &original_range) in annotations.iter_mut().zip(&original_ranges) { | ||
| if index < usize::from(original_range.start()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I fully understand why we need to collect the ranges first, only to then zip then here again. Can't we call ann.range() inside the loop?
for ann in annotations.iter_mut() {
let original_range = ann.range();
if index < usize::from(original_range.start()) {There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need the initial range from the start of the function. I ran into a double-counting issue yesterday when I was using the current ann.range throughout (I was ending up with a range of 64-65 instead of the expected 62-63, starting from 60-61). I think capturing the initial range should match the other version of this code, which saves the input annotation_range and mutates the range copy:
ruff/crates/ruff_linter/src/message/text.rs
Lines 279 to 283 in dc92966
| fn replace_whitespace_and_unprintable(source: &str, annotation_range: TextRange) -> SourceCode { | |
| let mut result = String::new(); | |
| let mut last_end = 0; | |
| let mut range = annotation_range; | |
| let mut line_width = LineWidthBuilder::new(IndentWidth::default()); |
though it pained me to collect here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realized that we didn't have a test for this in the stripped-down version, so I just added another test case. Without using original_ranges, this input:
��� # rendered as ^H^Z^[ in my editorcauses
thread 'diagnostic::render::full::tests::multiple_unprintable_characters' panicked at crates/ruff_annotate_snippets/src/renderer/display_list.rs:1450:29:
byte index 5 is not a char boundary; it is inside '␛' (bytes 4..7) of `␈␛`
From my earlier debugging, the issue was some kind of double counting in replace_whitespace_and_unprintable. The input annotation range here is 1..1, and the output in this panicking version is 5..5, instead of the correct 3..3, so it gets shifted twice if you compare against the current range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh, I see now. This is a callback function. That's why it matters.
It seems unfortunate that we have to collect all original ranges even if the source code doesn't contain a single tab character (which should be the most common case).
Would it make sense to change the for (index, c) in source.char_indices() { to track the relative offset compared to the source offset. This way, you could call update_ranges with index + relative_offset and remove the original ranges thing entirely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this: I think the relative offset is already known. It's result.text_len() - source.text_len(). That's how many characters where inserted by this function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
result is likely empty until the very end of the function because we try to return a Cow::Borrowed(source) if nothing is modified, so I don't think we can get the offset from there.
It seems we update result everytime before we call update_ranges. Which makes sense to me. We update the source text and that, in return, requires updating the annotations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, we could just do a regex search for tab and unprintable characters (looks like 5 distinct bytes?). If none are found, then you can just quit early with the Cow::Borrowed case. This search won't need to do UTF-8 decoding and regex might even vectorize it, so it could save you from allocating and UTF-8 decoding.
With all that said, even without that search to bail early, the up-front alloc here is almost certainly marginal. We are in the rendering code here, which does all sorts of allocs (including in annotate-snippets).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah maybe I'm misunderstanding the suggestion. In my naive attempt I tried result.text_len() - source.text_len() and kept getting underflow errors because we're building up the result over time, but source is always quite long from the beginning. Did you mean some sub-slice of source? Or maybe I injected this at the wrong point in the function.
What I meant to say above is "result doesn't have a comparable length to source until the very end of the function," not that it's totally empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see. In that case, isn't the relative offset result.text_len() - i?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you're right, thanks! I had just missed accounting for the width of the character being inserted when I tried this initially.
| let mut result = String::new(); | ||
| let mut last_end = 0; | ||
| let original_ranges: Vec<TextRange> = annotations.iter().map(|ann| ann.range).collect(); | ||
| let mut line_width = LineWidthBuilder::default(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Move the declaration of this variable closer to where it's used.
Same for last_end. It helps readability because, as a reader, I have to page less variables into my memory :)
| if matches!(c, '\t') { | ||
| let tab_width = u32::try_from(line_width.get() - old_width) | ||
| .expect("small width because of tab size"); | ||
| result.push_str(&source[last_end..index]); | ||
| for _ in 0..tab_width { | ||
| result.push(' '); | ||
| } | ||
| last_end = index + 1; | ||
| update_ranges(index, tab_width); | ||
| } else if let Some(printable) = unprintable_replacement(c) { | ||
| result.push_str(&source[last_end..index]); | ||
| result.push(printable); | ||
| last_end = index + 1; | ||
|
|
||
| let len = printable.text_len().to_u32(); | ||
| update_ranges(index, len); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we have to match on tabs here anyway, do you think the LineWidthBuilder is worth it or would it simplify the code if it were inlined instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it does look nicer inlined now that I tried it, thanks! You can see the whole looping logic on one screen now.
I don't see a great way to combine the two matches, but I still think it's an improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see how to combine them. Obviously \n and \r won't hit the unprintable branch 🤦
| { | ||
| continue; | ||
| } | ||
| if self.text.as_bytes()[range.start().to_usize() - 1] != b'\n' { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably a pre-existing issue but we should also handle \r here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this would be okay in this context because checking \n as the preceding character should cover both \n and \r\n line endings, but it's easy enough to add \r either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true. But Ruff still supports \r (because python does)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added!
| for (ann, &original_range) in annotations.iter_mut().zip(&original_ranges) { | ||
| if index < usize::from(original_range.start()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, we could just do a regex search for tab and unprintable characters (looks like 5 distinct bytes?). If none are found, then you can just quit early with the Cow::Borrowed case. This search won't need to do UTF-8 decoding and regex might even vectorize it, so it could save you from allocating and UTF-8 decoding.
With all that said, even without that search to bail early, the up-front alloc here is almost certainly marginal. We are in the rendering code here, which does all sorts of allocs (including in annotate-snippets).
We can just track this as the offset between the original source length and the current or `result` length. This offset is given by: ```rust let offset = result.text_len().to_usize() - index; ``` which when added to the original `update_ranges` call: ```rust update_ranges(index + offset, tab_width); ``` simplifies to just `result.text_len()` (`index` is added and then subtracted). The other slight nuance here is that in the original `update_ranges` call locations, we would also need to subtract the length of the new character from the `index` argument, so I instead opted just to move the calls to before we added the new character. This makes `result.text_len()` alone exactly the value we need.
Diagnostic diff on typing conformance testsNo changes detected when running ty on typing conformance tests ✅ |
* main: (24 commits) Add `Checker::context` method, deduplicate Unicode checks (#19609) [`flake8-pyi`] Preserve inline comment in ellipsis removal (`PYI013`) (#19399) [ty] Add flow diagram for import resolution [ty] Add comments to some core resolver functions [ty] Add missing ticks and use consistent quoting [ty] Reflow some long lines [ty] Unexport helper function [ty] Remove offset from `CompletionTargetTokens::Unknown` [`pyupgrade`] Fix `UP030` to avoid modifying double curly braces in format strings (#19378) [ty] fix a typo (#19621) [ty] synthesize `__replace__` for dataclasses (>=3.13) (#19545) [ty] Discard `Definition`s when normalizing `Signature`s (#19615) [ty] Fix empty spans following a line terminator and unprintable character spans in diagnostics (#19535) Add `LinterContext::settings` to avoid passing separate settings (#19608) Support `.pyi` files in ruff analyze graph (#19611) [ty] Sync vendored typeshed stubs (#19607) [ty] Bump docstring-adder pin (#19606) [`perflint`] Ignore rule if target is `global` or `nonlocal` (`PERF401`) (#19539) Add license classifier back to pyproject.toml (#19599) [ty] Add stub mapping support to signature help (#19570) ...
Summary
This was previously the last commit in #19415, split out to make it easier to review. This applies the fixes from c9b99e4, 5021f32, and 2922490 to the new rendering code in
ruff_db. I initially intended only to fix the empty span after a line terminator (as you can see in the branch name), but the two fixes were tied pretty closely together, and my initial fix for the empty spans needed a big change after trying to handle unprintable characters too. I can still split this up if it would help with review. I would just start with the unprintable characters first.The implementation here is essentially copy-pasted from
ruff_linter::message::text.rs, with theSourceCodestruct renamed toEscapedSourceCodesince there's already aSourceCodein scope inrender.rs. It's also updated slightly to account for the multiple annotations for a single snippet. The original implementation used some types from theline_widthmodule fromruff_linter. I copied over heavily stripped-down versions of these instead of trying to import them. We could inline the remaining code entirely, if we want, but I thought it was nice enough to keep.I also moved over
ceil_char_boundary, which is unchanged except to make it a free function taking a&strinstead of aLocatormethod. All of this code could be deleted fromruff_linterif we also move over thegroupedoutput format, which will be the last user after #19415.Test Plan
I added new tests in
ruff_linterthat call into the new rendering code to snapshot the diagnostics for the affected cases. These are copies of existing snapshots in Ruff, so it's helpful to compare them. These are a bit noisy because of the other rendering differences in the header, but all of the^^^indicators should be the same.`empty_span_after_line_terminator` diff
`unprintable_characters` diff