Skip to content

Conversation

@UnboundVariable
Copy link
Collaborator

This PR adds support for "document symbols" and "workspace symbols" language server features. Most of the logic to implement these features is shared.

The "document symbols" feature returns a list of all symbols within a specified source file. Clients can specify whether they want a flat or hierarchical list. Document symbols are typically presented by a client in an "outline" form. Here's what this looks like in VS Code, for example.

image

The "workspace symbols" feature returns a list of all symbols across the entire workspace that match some user-supplied query string. This allows the user to quickly find and navigate to any symbol within their code.

image

@github-actions
Copy link
Contributor

github-actions bot commented Jul 24, 2025

mypy_primer results

No ecosystem changes detected ✅
No memory usage changes detected ✅

@AlexWaygood AlexWaygood removed their request for review July 24, 2025 07:31
@MichaReiser MichaReiser added server Related to the LSP server ty Multi-file analysis & type inference labels Jul 24, 2025
Comment on lines +87 to +89
let mut visitor = SymbolVisitor::new(options);
visitor.visit_body(&module.syntax().body);
visitor.symbols
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you tell me more why this use its own visitor over e.g. using the semantic_index, iterating over each scope and retrieving all names from the place table?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually implemented it both ways. I started with the approach you're suggesting here. There were a couple of issues I ran into.

  1. It required me to expose a bunch of APIs publicly from the ty_python_semantic crate. Alternatively (and this is what I did in my initial implementation) push all of the symbol extraction logic down into the ide_support.rs module.
  2. It produced subtly different results than what you get with a visitor implementation. For example, symbols that are synthesized but don't appear within the sources are "seen" in the place table. Also, the ordering of symbols ends up being different from source order when you use the place table.

I'm sure we could overcome these problems if we were to use the semantic index, but I think the visitor approach will also be much better for performance. We really don't want to do unnecessary semantic indexing just to enumerate the externally-visible symbols in a file when it's possible to do this with a simple AST walk — especially one that is able to skip most of the leaf nodes in the tree. Performance is important for multi-file symbol searches in large projects. It will become even more important once we implement "auto import" for the completion provider because that feature will require us to do symbol enumeration for not only project sources but also libraries and stubs. That potentially translates to tens of thousands of source files. Semantically indexing that many files will consume significant time and memory.

I'll also note that pyright implements this feature using a visitor approach. To the extent that we want to roughly match the behavior of pyright and pylance, a visitor approach here will help us do that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure we could overcome these problems if we were to use the semantic index, but I think the visitor approach will also be much better for performance.

I think that heavily depends. ty has all semantic index hot in cache when a user uses workspace diagnostics. Therefore, traversing the semantic index should be much cheaper as it is a constant time lookup. Now, we'll probably enable LRU for semantic indices in the future, in which ase this would no longer be true.

We really don't want to do unnecessary semantic indexing just to enumerate the externally-visible symbols in a file when it's possible to do this with a simple AST walk

If this is what we do, have you considered using exported_names? It's a lightweight salsa query that only returns the exported names of a file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported_names relies on the semantic index. Running the semantic index on every file in a large project will take a long time and consume significant memory. Initial workspace queries will be quite slow unless (as you mentioned) the user has opted for workspace diagnostics. (It's not clear to me which percentage of users will choose workspace diagnostics versus openFileOnly diagnostics.) Subsequent queries will be faster, assuming that the salsa working set remains in memory.

Running the semantic indexer on every file in the project plus all vendored stubs plus all files in the configured Python environment will almost definitely lead to performance and memory problems, so that approach won't work for the auto-import feature.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think exported_names depends on the semantic index.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported_names doesn't depend on semantic indexing.

It does, however, eagerly follow * imports (with fixpoint iteration to resolve cycles) so as to include * imported names in the symbols list.

In fact, I think that's a more general difference -- this new visitor finds only locally-defined symbols, it doesn't include imported symbols. (It also seems to potentially miss some, though less-common ones -- e.g. names defined by assignment expressions, with statements, for loops, etc -- arguably some of these, while technically available, maybe wouldn't be desirable to include in a document-symbols list anyway?) But this is another difference in behavior that would have to be a flag if they were merged.

I do think that in some cases there could potentially be a lot of third-party files that are never imported by the first-party code, so we would never semantic index them otherwise, but I guess they should also be included in workspace symbols?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported_names doesn't depend on semantic indexing.

Ah, good to know. I misunderstood that in the code. Then I agree that the visitor I implemented in this PR could potentially be unified with exported_names at some point.

Unless you have a strong opinion to the contrary, I think I'll merge this PR and log an issue to track the work involved in investigating that union. As you point out, there will need to be some reconciliation of the functionality between the two, and some "mode flags" might be required to get the desired behaviors in all cases.

I guess they should also be included in workspace symbols?

The "workspace symbols" feature should never include symbols from third-party files. It's limited to just workspace symbols. However, the auto-import feature (part of the "completion suggestions" feature) should include symbols from third-party files. The API that I wrote as part of this PR is intended to be used for both scenarios.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Thanks for the discussion

* main:
  [ty] Added support for "document highlights" language server feature. (astral-sh#19515)
  Add support for specifying minimum dots in detected string imports (astral-sh#19538)
  [ty] Minor: fix incomplete docstring (astral-sh#19534)
  [ty] Move server tests as integration tests (astral-sh#19522)
  [`ruff`] Offer fixes for `RUF039` in more cases (astral-sh#19065)
  [ty] Support `dataclasses.InitVar` (astral-sh#19527)
  [`ruff`] Fix `RUF033` breaking with named default expressions (astral-sh#19115)
  Update pre-commit hook name (astral-sh#19530)
  Bump 0.12.5 (astral-sh#19528)
  [ty] Rename type_api => ty_extensions (astral-sh#19523)
  [ty] Added support for "go to references" in ty playground. (astral-sh#19516)

# Conflicts:
#	crates/ty_server/src/server/api/requests.rs
#	crates/ty_server/tests/e2e/snapshots/e2e__initialize__initialization.snap
#	crates/ty_server/tests/e2e/snapshots/e2e__initialize__initialization_with_workspace.snap
Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An extra visitor seems fine in an initial version, but I'm a bit concerned that we now have exported_modules (only traverses top level), semantic_index, and the visitor here that all roughly do the same (but with slightly differences).

Not using semantic_index also has the downside that we don't leverage any caching because we clear the parsed_module after an initial check. That means, we end up reparsing every single file in the project and then traverse all of them. A second symbol lookup will then retraverse all files again without any form of caching whatsoever.

Did you test this on a large project (e.g. homeassistant)? How responsive is ty for whole project symbols?

@UnboundVariable
Copy link
Collaborator Author

Did you test this on a large project (e.g. homeassistant)? How responsive is ty for whole project symbols?

Yes, for large projects, it's way more responsive than I expected it would be! homeassistant is a good stress test here because it contains over 20K files. On my machine, it takes about 4 seconds to do a workspace symbols query from a cold start. Subsequent workspace queries take slightly less than 1 second. This assumes the query string has several characters. If the query string is a single character like a, it still takes a little over two seconds because there are so many matches returned.

By comparison, pyright takes a little more than 10 seconds on a cold start for this same test. Subsequent searches take 6 seconds. So ty is already significantly faster than pyright here.

Pylance adds a symbol indexer on top of pyright. Once its indexer runs on homeassistant, it's able to perform workspace queries in about 1 second — similar to ty's performance without an index.

@MichaReiser
Copy link
Member

MichaReiser commented Jul 25, 2025

I'd be interested in @carljm's take on the visitor.

Regarding performance. This seems good enough so that it doesn't need to block this PR.

I think the way I would implement this with salsa caching is that the visitor collects all symbols for a single file and that result would be cached by salsa. The matching of the symbol name to the name provided in the query then needs to happen outside the visitor (but that should be extremelly fast). Can you open a follow up issue that we should look into caching workspace symbols (or document symbols)

@carljm
Copy link
Contributor

carljm commented Jul 25, 2025

The question of reusing semantic index seems more complex, I don't have a strong feeling there. I do think that ideally we would unify the existing exported_symbols query and this new visitor -- I don't really see any daylight between them in terms of what they are supposed to do (other than of course whether nested scopes are visited, but that seems like a simple flag).

@UnboundVariable UnboundVariable merged commit a0d8ff5 into astral-sh:main Jul 25, 2025
37 checks passed
@UnboundVariable UnboundVariable deleted the symbols branch July 25, 2025 20:07
dcreager added a commit that referenced this pull request Jul 25, 2025
* main:
  [ty] Added support for "document symbols" and "workspace symbols" (#19521)
  Add `TextEmitter::with_color` and disable colors in `unreadable_files` test (#19562)
  [ty] Added support for document highlights in playground. (#19540)
  [`refurb`] Ignore decorated functions for `FURB118` (#19339)
  [ty] Add workflow to comment diagnostic diff for conformance tests (#19556)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

server Related to the LSP server ty Multi-file analysis & type inference

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants