Commit e78381a

Add documentation to dataframe query workflow and fix api rendering for bindings (#11650)
* Closes RR-2251
* Primary task: first pass through query-data.md to give an overview of how to start and connect to the OSS server and get to a dataframe, with pointers to DataFusion for follow-on work
  * Will file an issue to make this into a snippet/tutorial so it becomes testable once we get a Python object for the server
* Secondary task: render docs for DataframeQueryView and lots of other bindings exposed through the catalog
  * This scope creeped a lot, so I just did the manual class list and will file a follow-on ticket with a proposal to fix this more globally:
    1. Add lint codes so I can separate out my ignores for the pyclass_eq and pyclass_module requirements (the lint.py changes, which aren't great since I'm not an experienced parser writer, but don't seem awful)
    2. Add a pyclass module requirement to specify `rerun_bindings.rerun_bindings` (all the Rust file changes); see #11268 for more context
    3. Fight mkdocs to render bindings properly (the mkdocs-related files); this adds the ability to generate our docs just off of the stubs, without the shared object being present at all. NOTE: the docs actually look better this way, but I added some accommodation to make them look less awful if the shared object is present.

Filed an issue to fix all the other missing classes from our rendered API docs: https://linear.app/rerun/issue/RR-2766/cover-remaining-classes-in-python-api-rendered-docs and an issue with a proposal to avoid this problem moving forward: https://linear.app/rerun/issue/RR-2765/change-python-public-api-surface-pattern-to-support-easier-docs
1 parent e3c279d

22 files changed

Lines changed: 334 additions & 60 deletions

query-data.md

Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
---
title: Analyze data via Open Source Server
order: 4
---

The Rerun Cloud offering builds on the open source core.
To that end, the Open Source Server provides small-scale local analysis through a similar API surface.
This supports a workflow where you develop or debug locally against a single recording, then scale that same workflow up on the cloud for production use.

<!-- TODO(RR-2818) add link to doc -->

# Open source server

## Launching the server

The server needs to run in a separate terminal window.
Launch it using the Rerun CLI:

```console
rerun server
```

For full details, run:

```console
rerun server --help
```

The most commonly useful option is `-d`, which opens a directory of `.rrd` files as a dataset in the server:

```console
rerun server -d <directory_containing_rrds>
```

## Connecting to the server

When launching the server, the CLI will print the host and port it is listening on
(defaulting to `localhost:51234`).

### From the viewer

Either specify the network location with the CLI at launch:

```console
rerun connect localhost:51234
```

or, after the viewer opens, open the command palette, select `open redap server`,
set the scheme to `http`, and enter the hostname and port.

### From the SDK

```python
import rerun as rr

CATALOG_URL = "rerun+http://localhost:51234"
client = rr.catalog.CatalogClient(CATALOG_URL)
```

## Querying the server

Everything below assumes that the server has been launched and a client has been constructed per the instructions above.

### Datasets overview

A dataset is a collection of recordings that can be queried against.
If we have already created a dataset, we can retrieve it:

```python
dataset = client.get_dataset_entry(name="oss_demo")
```

Otherwise, we can create it:

```python
dataset = client.create_dataset(
    name="oss_demo",
)
```

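If you don't know ahead of time whether the dataset exists, the two calls compose into a get-or-create pattern. A minimal sketch, assuming `get_dataset_entry` raises the catalog's `NotFoundError` when no dataset with that name exists:

```python
from rerun.catalog import NotFoundError

try:
    # Reuse the dataset if it already exists...
    dataset = client.get_dataset_entry(name="oss_demo")
except NotFoundError:
    # ...otherwise create it.
    dataset = client.create_dataset(name="oss_demo")
```
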
To add recordings to a dataset, use the `register` API:

```python
import os

# For the OSS server you must register files local to your machine.

# Synchronously register a single recording:
dataset.register(f"file://{os.path.abspath('oss_demo.rrd')}")

# Asynchronously register one or more recordings:
timeout_seconds = 100
tasks = dataset.register_batch([f"file://{os.path.abspath('oss_demo.rrd')}"])
tasks.wait(timeout_seconds)
```

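To register a whole directory of recordings at once, the same batch API can be driven from the standard library. A sketch, assuming `register_batch` accepts any list of `file://` URIs; the `recordings/` directory here is a hypothetical example:

```python
import glob
import os

# Collect every .rrd file in the directory as a file:// URI.
uris = [f"file://{os.path.abspath(path)}" for path in glob.glob("recordings/*.rrd")]

tasks = dataset.register_batch(uris)
tasks.wait(timeout_seconds)
```
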
### Inspecting datasets

Ultimately, we will render the data as a [DataFusion DataFrame](https://datafusion.apache.org/python/user-guide/dataframe/index.html).
However, there is an intermediate step that allows for some optimization: building a `DataframeQueryView`. <!-- TODO(nick) add link to doc -->
The `DataframeQueryView` allows selecting the subset of the dataset that is of interest (the index column and content columns), filtering to specific time ranges, and managing the sparsity of the data (`fill_latest_at`).
All of these operations occur on the server before any subsequent queries are evaluated, and so avoid unnecessary computation.

```python
# record_of_interest, start_of_interest, and end_of_interest are
# placeholders for your own values.
view = (
    dataset
    .dataframe_query_view(index="log_time", contents="/**")
    # Select only a single recording or a subset of recordings
    .filter_partition_id(record_of_interest)
    # Select a subset of the time range
    .filter_range_nanos(start=start_of_interest, end=end_of_interest)
    # Forward-fill for time alignment
    .fill_latest_at()
)
```

After we have identified what data we want, we can get a DataFrame:

```python
df = view.df()
```

[DataFusion](https://datafusion.apache.org/python/) provides a pythonic dataframe interface to your data, as well as [SQL](https://datafusion.apache.org/python/user-guide/sql.html).
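For example, a quick sketch of a few DataFusion operations on the returned dataframe; `log_time` is the index column selected in the view above:

```python
from datafusion import col

# Peek at the first few rows of the index column...
df.select(col("log_time")).limit(5).show()

# ...or count how many rows the view produced.
num_rows = df.count()
```
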
After performing a series of operations, this dataframe can be materialized and returned in common data formats:

```python
pandas_df = df.to_pandas()
polars_df = df.to_polars()
arrow_table = df.to_arrow_table()
```

rerun_py/docs/gen_common_index.py

Lines changed: 27 additions & 2 deletions
@@ -399,9 +399,22 @@ class Section:
     ),
     Section(
         title="Catalog",
-        show_tables=False,
+        show_tables=True,
         mod_path="rerun.catalog",
         show_submodules=True,
+        class_list=[
+            "AlreadyExistsError",
+            "DataframeQueryView",
+            "DatasetEntry",
+            "CatalogClient",
+            "Entry",
+            "EntryId",
+            "EntryKind",
+            "NotFoundError",
+            "TableEntry",
+            "Task",
+            "VectorDistanceMetric",
+        ],
     ),
     Section(
         title="Utilities",
@@ -564,13 +577,25 @@ def make_slug(s: str) -> str:
             mod_tail = section.mod_path.split(".")[1:]
             class_name = ".".join([*mod_tail, class_name])
             cls = rerun_pkg[class_name]
+            bindings_class = False
+            if "rerun_bindings" in cls.canonical_path:
+                bindings_class = True
+                cls = bindings_pkg[cls.canonical_path[len("rerun_bindings.") :]]
+                class_name = cls.canonical_path
             show_class = class_name
             for maybe_strip in ["archetypes.", "components.", "datatypes."]:
                 if class_name.startswith(maybe_strip):
                     stripped = class_name.replace(maybe_strip, "")
                     if stripped in rerun_pkg.classes:
                         show_class = stripped
-            index_file.write(f"[`rerun.{show_class}`][rerun.{class_name}] | {cls.docstring.lines[0]}\n")
+            if bindings_class:
+                show_class = class_name  # don't strip anything for bindings
+            else:
+                show_class = "rerun." + show_class
+                class_name = "rerun." + class_name
+            if cls.docstring is None:
+                raise ValueError(f"No docstring for class {class_name}")
+            index_file.write(f"[`{show_class}`][{class_name}] | {cls.docstring.lines[0]}\n")

         index_file.write("\n")

rerun_py/mkdocs.yml

Lines changed: 4 additions & 0 deletions
@@ -37,6 +37,9 @@ plugins:
           "!^_[^_]", # Hide things starting with a single underscore
           "!as_component_batches", # Inherited from AsComponents
           "!num_instances", # Inherited from AsComponents
+          "!__doc__", # griffe merges extension and stubs :(
+          "!__module__", # griffe merges extension and stubs :(
+          "!__weakref__", # griffe merges extension and stubs :(
         ]
         inherited_members: true
         members_order: source # The order of class members
@@ -48,6 +51,7 @@ plugins:
           - rerun_bindings
         annotations_path: brief
         signature_crossrefs: true
+        find_stubs_package: true
 extensions:
   - griffe_warnings_deprecated

rerun_py/rerun_bindings/rerun_bindings.pyi

Lines changed: 4 additions & 0 deletions
@@ -1313,6 +1313,8 @@ class Entry:
     """

 class DatasetEntry(Entry):
+    """A dataset entry in the catalog."""
+
     @property
     def manifest_url(self) -> str:
         """Return the dataset manifest URL."""
@@ -1556,6 +1558,8 @@ class TableEntry(Entry):
         """Convert this table to a [`pyarrow.RecordBatchReader`][]."""

 class DataframeQueryView:
+    """View into a remote dataset acting as DataFusion table provider."""
+
     def filter_partition_id(self, partition_id: str, *args: Iterable[str]) -> Self:
         """Filter by one or more partition ids. All partition ids are included if not specified."""

rerun_py/src/catalog/catalog_client.rs

Lines changed: 5 additions & 1 deletion
@@ -15,7 +15,11 @@ use crate::catalog::{
 use crate::utils::{get_tokio_runtime, wait_for_future};

 /// Client for a remote Rerun catalog server.
-#[pyclass(name = "CatalogClientInternal")] // NOLINT: skip pyclass_eq, non-trivial implementation
+#[pyclass( // NOLINT: ignore[py-cls-eq] non-trivial implementation
+    name = "CatalogClientInternal",
+    module = "rerun_bindings.rerun_bindings"
+)]
+
 pub struct PyCatalogClientInternal {
     origin: re_uri::Origin,

rerun_py/src/catalog/dataframe_query.rs

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ use crate::catalog::{PyDatasetEntry, to_py_err};
 use crate::utils::{get_tokio_runtime, wait_for_future};

 /// View into a remote dataset acting as DataFusion table provider.
-#[pyclass(name = "DataframeQueryView")] // NOLINT: skip pyclass_eq, non-trivial implementation
+#[pyclass(name = "DataframeQueryView", module = "rerun_bindings.rerun_bindings")] // NOLINT: ignore[py-cls-eq] non-trivial implementation
 pub struct PyDataframeQueryView {
     dataset: Py<PyDatasetEntry>,

rerun_py/src/catalog/dataframe_rendering.rs

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ use pyo3::{Bound, PyAny, PyResult, pyclass, pymethods};

 use re_format_arrow::{RecordBatchFormatOpts, format_record_batch_opts};

-#[pyclass(eq, name = "RerunHtmlTable")]
+#[pyclass(eq, name = "RerunHtmlTable", module = "rerun_bindings.rerun_bindings")]
 #[derive(Clone, PartialEq, Eq)]
 pub struct PyRerunHtmlTable {
     max_width: Option<usize>,

rerun_py/src/catalog/datafusion_catalog.rs

Lines changed: 6 additions & 1 deletion
@@ -10,7 +10,12 @@ use re_redap_client::ConnectionClient;

 use crate::utils::get_tokio_runtime;

-#[pyclass(frozen, eq, name = "DataFusionCatalog")]
+#[pyclass(
+    frozen,
+    eq,
+    name = "DataFusionCatalog",
+    module = "rerun_bindings.rerun_bindings"
+)]
 pub(crate) struct PyDataFusionCatalogProvider {
     pub provider: Arc<RedapCatalogProvider>,
 }

rerun_py/src/catalog/datafusion_table.rs

Lines changed: 6 additions & 1 deletion
@@ -10,7 +10,12 @@ use tracing::instrument;
 use crate::catalog::PyCatalogClientInternal;
 use crate::utils::get_tokio_runtime;

-#[pyclass(frozen, name = "DataFusionTable")] // NOLINT: skip pyclass_eq, non-trivial implementation
+#[pyclass( // NOLINT: ignore[py-cls-eq] non-trivial implementation
+    frozen,
+    name = "DataFusionTable",
+    module = "rerun_bindings.rerun_bindings"
+)]
+
 pub struct PyDataFusionTable {
     pub provider: Arc<dyn TableProvider + Send>,
     pub name: String,

rerun_py/src/catalog/dataset_entry.rs

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ use super::{
 };

 /// A dataset entry in the catalog.
-#[pyclass(name = "DatasetEntry", extends=PyEntry)] // NOLINT: skip pyclass_eq, non-trivial implementation
+#[pyclass(name = "DatasetEntry", extends=PyEntry, module = "rerun_bindings.rerun_bindings")] // NOLINT: ignore[py-cls-eq] non-trivial implementation
 pub struct PyDatasetEntry {
     pub dataset_details: DatasetDetails,
     pub dataset_handle: DatasetHandle,
