Commit e78381a

Add documentation to dataframe query workflow and fix api rendering for bindings (#11650)
* Closes RR-2251
* Primary task: first pass through query-data.md to give an overview of how to start and connect to the OSS server and get to a dataframe, with pointers to DataFusion for follow-on work
  * Will file an issue to make this into a snippet/tutorial so it becomes testable once we get a Python object for the server
* Secondary task: render docs for DataframeQueryView and lots of other bindings exposed through the catalog
  * This scope creeped a lot, so I just did the manual class list and will file a follow-on ticket with a proposal to fix this more globally:
    1. Add lint codes so I can separate out my ignores for the pyclass_eq and pyclass_module requirements (the lint.py changes, which aren't great since I'm not an experienced parser writer, but don't seem awful)
    2. Add a pyclass module requirement to specify `rerun_bindings.rerun_bindings` (all the Rust file changes); see #11268 for more context
    3. Fight mkdocs to render bindings properly (the mkdocs-related files); this adds the ability to generate our docs just off of the stubs, without the shared object being present at all. NOTE: the docs actually look better this way, but I added some accommodation to make them look less awful if the shared object is present.

Filed an issue to fix all the other missing classes from our rendered API docs: https://linear.app/rerun/issue/RR-2766/cover-remaining-classes-in-python-api-rendered-docs and an issue with a proposal to avoid this problem moving forward: https://linear.app/rerun/issue/RR-2765/change-python-public-api-surface-pattern-to-support-easier-docs
1 parent e3c279d

22 files changed

Lines changed: 334 additions & 60 deletions

query-data.md

Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
---
title: Analyze data via Open Source Server
order: 4
---

The Rerun Cloud offering builds on the open source core.
To that end, the Open Source Server provides small-scale local analysis through a similar API surface.
This supports a workflow where you develop or debug locally against a single recording, then scale that same workflow up on the cloud for production use.

<!-- TODO(RR-2818) add link to doc -->

# Open source server

## Launching the server

The server needs to run in a separate terminal window.
Launch it using the Rerun CLI:

```console
rerun server
```

For full details, run:

```console
rerun server --help
```

The most commonly useful option is `-d`, which opens a directory of `.rrd` files as a dataset in the server:

```console
rerun server -d <directory_containing_rrds>
```

## Connecting to the server

When launching the server, the CLI will print the host and port it is listening on
(defaulting to `localhost:51234`).

### From the viewer

Either specify the network location with the CLI at launch:

```console
rerun connect localhost:51234
```

or, after the viewer opens, open the command palette, select `open redap server`,
set the scheme to `http`, and enter the hostname and port.

### From the SDK

```python
import rerun as rr

CATALOG_URL = "rerun+http://localhost:51234"
client = rr.catalog.CatalogClient(CATALOG_URL)
```

## Querying the server

Everything below assumes that the server has been launched and a client has been constructed per the instructions above.

### Datasets overview

A dataset is a collection of recordings that can be queried against.
If we have already created a dataset, we can retrieve it:

```python
dataset = client.get_dataset_entry(name="oss_demo")
```

Otherwise, we can create it:

```python
dataset = client.create_dataset(
    name="oss_demo",
)
```

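If you don't know ahead of time whether the dataset exists, the two calls compose into a get-or-create pattern. A minimal sketch, assuming `get_dataset_entry` raises the catalog's `NotFoundError` when no dataset with that name exists:

```python
from rerun.catalog import NotFoundError

try:
    # Reuse the dataset if it already exists...
    dataset = client.get_dataset_entry(name="oss_demo")
except NotFoundError:
    # ...otherwise create it.
    dataset = client.create_dataset(name="oss_demo")
```
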
To add recordings to a dataset, use the `register` API:

```python
import os

# For the OSS server you must register files local to your machine.

# Synchronously register a single recording:
dataset.register(f"file://{os.path.abspath('oss_demo.rrd')}")

# Asynchronously register one or more recordings:
timeout_seconds = 100
tasks = dataset.register_batch([f"file://{os.path.abspath('oss_demo.rrd')}"])
tasks.wait(timeout_seconds)
```

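To register a whole directory of recordings at once, the same batch API can be driven from the standard library. A sketch, assuming `register_batch` accepts any list of `file://` URIs; the `recordings/` directory here is a hypothetical example:

```python
import glob
import os

# Collect every .rrd file in the directory as a file:// URI.
uris = [f"file://{os.path.abspath(path)}" for path in glob.glob("recordings/*.rrd")]

tasks = dataset.register_batch(uris)
tasks.wait(timeout_seconds)
```
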
### Inspecting datasets

Ultimately, we will render the data as a [DataFusion DataFrame](https://datafusion.apache.org/python/user-guide/dataframe/index.html).
However, there is an intermediate step that allows for some optimization: building a `DataframeQueryView`. <!-- TODO(nick) add link to doc -->
The `DataframeQueryView` allows selecting the subset of the dataset that is of interest (the index column and content columns), filtering to specific time ranges, and managing the sparsity of the data (`fill_latest_at`).
All of these operations occur on the server before any subsequent queries are evaluated, and so avoid unnecessary computation.

```python
# record_of_interest, start_of_interest, and end_of_interest are
# placeholders for your own values.
view = (
    dataset
    .dataframe_query_view(index="log_time", contents="/**")
    # Select only a single recording or a subset of recordings
    .filter_partition_id(record_of_interest)
    # Select a subset of the time range
    .filter_range_nanos(start=start_of_interest, end=end_of_interest)
    # Forward-fill for time alignment
    .fill_latest_at()
)
```

After we have identified what data we want, we can get a DataFrame:

```python
df = view.df()
```

[DataFusion](https://datafusion.apache.org/python/) provides a pythonic dataframe interface to your data, as well as [SQL](https://datafusion.apache.org/python/user-guide/sql.html).
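For example, a quick sketch of a few DataFusion operations on the returned dataframe; `log_time` is the index column selected in the view above:

```python
from datafusion import col

# Peek at the first few rows of the index column...
df.select(col("log_time")).limit(5).show()

# ...or count how many rows the view produced.
num_rows = df.count()
```
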
After performing a series of operations, this dataframe can be materialized and returned in common data formats:

```python
pandas_df = df.to_pandas()
polars_df = df.to_polars()
arrow_table = df.to_arrow_table()
```

rerun_py/docs/gen_common_index.py

Lines changed: 27 additions & 2 deletions
@@ -399,9 +399,22 @@ class Section:
     ),
     Section(
         title="Catalog",
-        show_tables=False,
+        show_tables=True,
         mod_path="rerun.catalog",
         show_submodules=True,
+        class_list=[
+            "AlreadyExistsError",
+            "DataframeQueryView",
+            "DatasetEntry",
+            "CatalogClient",
+            "Entry",
+            "EntryId",
+            "EntryKind",
+            "NotFoundError",
+            "TableEntry",
+            "Task",
+            "VectorDistanceMetric",
+        ],
     ),
     Section(
         title="Utilities",
@@ -564,13 +577,25 @@ def make_slug(s: str) -> str:
             mod_tail = section.mod_path.split(".")[1:]
             class_name = ".".join([*mod_tail, class_name])
             cls = rerun_pkg[class_name]
+            bindings_class = False
+            if "rerun_bindings" in cls.canonical_path:
+                bindings_class = True
+                cls = bindings_pkg[cls.canonical_path[len("rerun_bindings.") :]]
+                class_name = cls.canonical_path
             show_class = class_name
             for maybe_strip in ["archetypes.", "components.", "datatypes."]:
                 if class_name.startswith(maybe_strip):
                     stripped = class_name.replace(maybe_strip, "")
                     if stripped in rerun_pkg.classes:
                         show_class = stripped
-            index_file.write(f"[`rerun.{show_class}`][rerun.{class_name}] | {cls.docstring.lines[0]}\n")
+            if bindings_class:
+                show_class = class_name  # don't strip anything for bindings
+            else:
+                show_class = "rerun." + show_class
+                class_name = "rerun." + class_name
+            if cls.docstring is None:
+                raise ValueError(f"No docstring for class {class_name}")
+            index_file.write(f"[`{show_class}`][{class_name}] | {cls.docstring.lines[0]}\n")

         index_file.write("\n")

rerun_py/mkdocs.yml

Lines changed: 4 additions & 0 deletions
@@ -37,6 +37,9 @@ plugins:
           "!^_[^_]", # Hide things starting with a single underscore
           "!as_component_batches", # Inherited from AsComponents
           "!num_instances", # Inherited from AsComponents
+          "!__doc__", # griffe merges extension and stubs :(
+          "!__module__", # griffe merges extension and stubs :(
+          "!__weakref__", # griffe merges extension and stubs :(
         ]
         inherited_members: true
         members_order: source # The order of class members
@@ -48,6 +51,7 @@ plugins:
           - rerun_bindings
         annotations_path: brief
         signature_crossrefs: true
+        find_stubs_package: true
 extensions:
   - griffe_warnings_deprecated

rerun_py/rerun_bindings/rerun_bindings.pyi

Lines changed: 4 additions & 0 deletions
@@ -1313,6 +1313,8 @@ class Entry:
     """

 class DatasetEntry(Entry):
+    """A dataset entry in the catalog."""
+
     @property
     def manifest_url(self) -> str:
         """Return the dataset manifest URL."""
@@ -1556,6 +1558,8 @@ class TableEntry(Entry):
         """Convert this table to a [`pyarrow.RecordBatchReader`][]."""

 class DataframeQueryView:
+    """View into a remote dataset acting as DataFusion table provider."""
+
     def filter_partition_id(self, partition_id: str, *args: Iterable[str]) -> Self:
         """Filter by one or more partition ids. All partition ids are included if not specified."""

rerun_py/src/catalog/catalog_client.rs

Lines changed: 5 additions & 1 deletion
@@ -15,7 +15,11 @@ use crate::catalog::{
 use crate::utils::{get_tokio_runtime, wait_for_future};

 /// Client for a remote Rerun catalog server.
-#[pyclass(name = "CatalogClientInternal")] // NOLINT: skip pyclass_eq, non-trivial implementation
+#[pyclass( // NOLINT: ignore[py-cls-eq] non-trivial implementation
+    name = "CatalogClientInternal",
+    module = "rerun_bindings.rerun_bindings"
+)]
+
 pub struct PyCatalogClientInternal {
     origin: re_uri::Origin,

rerun_py/src/catalog/dataframe_query.rs

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ use crate::catalog::{PyDatasetEntry, to_py_err};
 use crate::utils::{get_tokio_runtime, wait_for_future};

 /// View into a remote dataset acting as DataFusion table provider.
-#[pyclass(name = "DataframeQueryView")] // NOLINT: skip pyclass_eq, non-trivial implementation
+#[pyclass(name = "DataframeQueryView", module = "rerun_bindings.rerun_bindings")] // NOLINT: ignore[py-cls-eq] non-trivial implementation
 pub struct PyDataframeQueryView {
     dataset: Py<PyDatasetEntry>,

rerun_py/src/catalog/dataframe_rendering.rs

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ use pyo3::{Bound, PyAny, PyResult, pyclass, pymethods};

 use re_format_arrow::{RecordBatchFormatOpts, format_record_batch_opts};

-#[pyclass(eq, name = "RerunHtmlTable")]
+#[pyclass(eq, name = "RerunHtmlTable", module = "rerun_bindings.rerun_bindings")]
 #[derive(Clone, PartialEq, Eq)]
 pub struct PyRerunHtmlTable {
     max_width: Option<usize>,

rerun_py/src/catalog/datafusion_catalog.rs

Lines changed: 6 additions & 1 deletion
@@ -10,7 +10,12 @@ use re_redap_client::ConnectionClient;

 use crate::utils::get_tokio_runtime;

-#[pyclass(frozen, eq, name = "DataFusionCatalog")]
+#[pyclass(
+    frozen,
+    eq,
+    name = "DataFusionCatalog",
+    module = "rerun_bindings.rerun_bindings"
+)]
 pub(crate) struct PyDataFusionCatalogProvider {
     pub provider: Arc<RedapCatalogProvider>,
 }

rerun_py/src/catalog/datafusion_table.rs

Lines changed: 6 additions & 1 deletion
@@ -10,7 +10,12 @@ use tracing::instrument;
 use crate::catalog::PyCatalogClientInternal;
 use crate::utils::get_tokio_runtime;

-#[pyclass(frozen, name = "DataFusionTable")] // NOLINT: skip pyclass_eq, non-trivial implementation
+#[pyclass( // NOLINT: ignore[py-cls-eq] non-trivial implementation
+    frozen,
+    name = "DataFusionTable",
+    module = "rerun_bindings.rerun_bindings"
+)]
+
 pub struct PyDataFusionTable {
     pub provider: Arc<dyn TableProvider + Send>,
     pub name: String,

rerun_py/src/catalog/dataset_entry.rs

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ use super::{
 };

 /// A dataset entry in the catalog.
-#[pyclass(name = "DatasetEntry", extends=PyEntry)] // NOLINT: skip pyclass_eq, non-trivial implementation
+#[pyclass(name = "DatasetEntry", extends=PyEntry, module = "rerun_bindings.rerun_bindings")] // NOLINT: ignore[py-cls-eq] non-trivial implementation
 pub struct PyDatasetEntry {
     pub dataset_details: DatasetDetails,
     pub dataset_handle: DatasetHandle,
