[RFC] Datasets in OpenSearch UI #9791

### Introduction

This RFC proposes introducing a new concept called "Datasets" to OpenSearch UI. Datasets would serve as a unified abstraction layer for data management, enabling users to work with data from various sources—including OpenSearch indices and external data sources—through a consistent interface. This RFC outlines the concept, benefits, and proposed implementation of Datasets, and seeks community feedback on the approach.

### Motivation and Problem Statement

Today, OpenSearch UI relies on index patterns for data access, a concept that is specific to OpenSearch and does not support non-OpenSearch data sources. This approach presents several challenges:

1. **Conceptual complexity:** Users unfamiliar with OpenSearch struggle to understand index patterns and how they relate to data selection.
2. **Limited abstraction:** There's no unified way to combine data from multiple sources (both OpenSearch and external) for comprehensive exploration, visualization, and analytics.
3. **Technical barrier:** The current approach exposes underlying data source complexities to users who may want to focus on data analysis rather than infrastructure details.
4. **Data distribution reality:** In real-world scenarios, data is often distributed across various clusters or storage systems. For example, logs might be spread across different clusters or time periods, with no straightforward way to append or join this data.

We have an opportunity to address these challenges by creating an abstraction layer that simplifies data access and combination, while preserving the power and flexibility of OpenSearch.

### Proposed Solution: Datasets

Datasets would provide a logical representation of data stored in and across various sources. A Dataset is fundamentally a user-friendly way to define, organize, and access data within OpenSearch UI.


**What is a Dataset?**

A Dataset is a new artifact in OpenSearch that defines and manages data from various sources, both OpenSearch and non-OpenSearch. It serves as a logical representation of data stored in one or more data sources, including:

* OpenSearch indices
* SQL databases
* Object stores (like S3)
* APIs and other data sources

Datasets would facilitate:

* **Data Exploration**: Enabling users to explore and analyze data through a unified interface
* **Data Combination**: Supporting both union operations (combining data with identical schemas) and join operations (relating data through common fields)
* **Visualization**: Allowing users to create visualizations and dashboards based on single or combined data sources
* **Collaboration**: Enabling users to share common data definitions across individuals and teams

**What Datasets are NOT**

To be clear, Datasets would NOT be:

* A directly queryable storage entity. Datasets don't store data; they serve as a logical reference that OpenSearch uses to execute queries against the actual underlying data sources.
* A replacement for direct query access to data sources. Advanced users can still query data sources directly when needed.
* An organizational construct like tags or folders. Datasets focus on data access and representation, not artifact organization.

### Core Concepts

#### Dataset Types

1. **Single-Source Datasets**: Reference data from a single source, such as an OpenSearch index, SQL database, or object store.
2. **Composite Datasets**: Combine data from multiple sources using:

    * **Union Operations**: Appending data with identical schemas (e.g., logs spread across different indices)
    * **Join Operations**: Relating data through common fields (e.g., joining application logs with user data)
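As a rough illustration of the two Dataset types, the sketch below models them as plain Python dataclasses. All class names, field names, and source identifiers here are hypothetical and chosen only for this example; the RFC does not define a concrete schema or API.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class SingleSourceDataset:
    """References data from exactly one source (hypothetical model)."""
    name: str
    source: str  # e.g. an index name or a table identifier


@dataclass(frozen=True)
class UnionDataset:
    """Composite: appends rows from members that share an identical schema."""
    name: str
    members: List[SingleSourceDataset]


@dataclass(frozen=True)
class JoinDataset:
    """Composite: relates two members through a common field."""
    name: str
    left: SingleSourceDataset
    right: SingleSourceDataset
    on: str  # field assumed to exist on both sides


Dataset = Union[SingleSourceDataset, UnionDataset, JoinDataset]


def underlying_sources(dataset: Dataset) -> List[str]:
    """Return the physical sources that a dataset definition points at."""
    if isinstance(dataset, SingleSourceDataset):
        return [dataset.source]
    if isinstance(dataset, UnionDataset):
        return [member.source for member in dataset.members]
    return [dataset.left.source, dataset.right.source]


# Illustrative source identifiers, not real naming conventions.
logs_jan = SingleSourceDataset("logs-jan", "opensearch:logs-2024-01")
logs_feb = SingleSourceDataset("logs-feb", "opensearch:logs-2024-02")
all_logs = UnionDataset("all-logs", [logs_jan, logs_feb])
print(underlying_sources(all_logs))
```

The point of the sketch is that a Dataset holds only references: resolving it yields the list of sources to query, never the data itself.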

#### Key Components

1. **Data Source Selection**: Connecting to OpenSearch indices or external data sources
2. **Schema Management**: Defining and customizing fields, including calculated fields
3. **Field Configuration**: Setting field types, formats, and descriptions
4. **Time-Based Configuration**: Configuring date/time fields and time ranges
5. **Permissions**: Inheriting permissions from data sources
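To make schema management, field configuration, and time-based configuration more concrete, here is a minimal sketch under stated assumptions. `FieldConfig`, `DatasetSchema`, and the `compute` callback are hypothetical names introduced for illustration; the RFC does not specify how calculated fields would be expressed.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]


@dataclass
class FieldConfig:
    name: str
    type: str
    description: str = ""
    # Set only for calculated fields: derives the value from the raw row.
    compute: Optional[Callable[[Row], Any]] = None


@dataclass
class DatasetSchema:
    fields: List[FieldConfig]
    time_field: Optional[str] = None  # field used for time-range filtering

    def apply(self, row: Row) -> Row:
        """Project a raw row onto the schema, evaluating calculated fields."""
        projected: Row = {}
        for f in self.fields:
            projected[f.name] = f.compute(row) if f.compute else row.get(f.name)
        return projected


schema = DatasetSchema(
    fields=[
        FieldConfig("timestamp", "date"),
        FieldConfig("response_time_ms", "long", "Server response time"),
        FieldConfig(
            "response_time_s",
            "double",
            "Calculated from response_time_ms",
            compute=lambda r: r["response_time_ms"] / 1000,
        ),
    ],
    time_field="timestamp",
)

raw = {"timestamp": "2024-01-01T00:00:00Z", "response_time_ms": 250, "debug": True}
print(schema.apply(raw))
```

In this sketch, fields that are not declared in the schema (such as `debug`) are dropped from the projected row, which is one possible design choice rather than something the RFC prescribes.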

#### User Workflows and Examples

**Creating a Dataset**

1. Single-Source Dataset:

    * Select a data source (OpenSearch index, SQL database, etc.)
    * Configure schema and field properties
    * Define name and description
    * Save the Dataset

2. Composite Dataset with Union Operation:

    * Select multiple data sources with compatible schemas
    * Configure the union operation
    * Preview and validate the combined dataset
    * Save the Union Dataset

3. Composite Dataset with Join Operation:

    * Select multiple data sources
    * Define join conditions between sources
    * Configure schema for the joined dataset
    * Preview and save the Join Dataset
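The "preview and validate" step for a union could include a schema compatibility check along the following lines. This is a sketch only: it assumes a schema can be represented as a mapping from field name to type, and that "compatible" means identical field names and types. The function name and source names are hypothetical.

```python
from typing import Dict, List

Schema = Dict[str, str]  # field name -> field type


def union_schema_problems(schemas: Dict[str, Schema]) -> List[str]:
    """Compare every source against the first one and list any mismatches."""
    names = list(schemas)
    if len(names) < 2:
        return []
    reference = schemas[names[0]]
    problems: List[str] = []
    for name in names[1:]:
        other = schemas[name]
        for fld in sorted(set(reference) | set(other)):
            if fld not in other:
                problems.append(f"{name}: missing field '{fld}'")
            elif fld not in reference:
                problems.append(f"{name}: extra field '{fld}'")
            elif reference[fld] != other[fld]:
                problems.append(
                    f"{name}: field '{fld}' is {other[fld]}, expected {reference[fld]}"
                )
    return problems


sources = {
    "logs-2024-01": {"timestamp": "date", "message": "text", "status": "integer"},
    "logs-2024-02": {"timestamp": "date", "message": "text", "status": "keyword"},
}
for problem in union_schema_problems(sources):
    print(problem)
```

An empty result would mean the sources can be appended as-is; a real implementation might also offer type coercion instead of rejecting the union outright.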

**Using Datasets**

* In Discover: Select a Dataset for data exploration, searching, and filtering
* In Visualizations: Build charts and dashboards based on Datasets
* In Other Plugins: Use Datasets in any OpenSearch plugin that currently uses index patterns

**Example: Log Analysis with Datasets**

Scenario: You need to analyze application logs along with error details stored in a separate database.

Traditional approach: You would need to query each source separately, then manually combine and correlate results.

With Datasets:

1. Create a Join Dataset that combines application logs and error details
2. Define the join condition on `error_code`
3. Use the Dataset in Discover to explore the correlated data
4. Create visualizations showing error trends with detailed descriptions
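Conceptually, the join condition on `error_code` behaves like the left join sketched below. The rows are made-up sample data and `left_join` is a hypothetical helper; how OpenSearch would actually execute such a join is not specified in this RFC.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def left_join(left: Iterable[Row], right: Iterable[Row], on: str) -> List[Row]:
    """Attach matching right-side fields to each left row.

    Assumes `on` is unique on the right side; left rows without a match
    are kept unchanged.
    """
    index: Dict[Any, Row] = {row[on]: row for row in right}
    joined: List[Row] = []
    for row in left:
        match = index.get(row.get(on), {})
        # Left-side values win if both sides define the same field.
        joined.append({**match, **row})
    return joined


# Made-up sample rows, for illustration only.
app_logs = [
    {"message": "request failed", "error_code": "E42"},
    {"message": "request ok", "error_code": None},
]
error_details = [
    {"error_code": "E42", "description": "Upstream timeout"},
]

for row in left_join(app_logs, error_details, on="error_code"):
    print(row)
```

With a Join Dataset, this correlation would be defined once in the Dataset rather than repeated by each user.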

### Implementation Considerations

**Workspaces Integration**

Datasets would be introduced within the context of workspaces in OpenSearch. This approach provides a clear boundary for Dataset management and avoids disrupting existing index pattern workflows. [Read more about workspaces in OpenSearch](https://docs.opensearch.org/docs/latest/dashboards/workspace/workspace/).


**Permissions Model**

Datasets would inherit permissions from the associated data sources, similar to how index patterns work today in OpenSearch. This approach leverages the existing permissions structure and requires no additional permission management at the Dataset level.
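One possible reading of permission inheritance, assumed here and not confirmed by this RFC, is that a user can read a Dataset only if they can read every data source it references. The sketch below expresses that rule; the function and source names are hypothetical.

```python
from typing import Iterable, Set


def can_read_dataset(readable_sources: Set[str], dataset_sources: Iterable[str]) -> bool:
    """Assumed rule: access requires read permission on every underlying source."""
    return set(dataset_sources) <= readable_sources


# Illustrative source names only.
analyst_sources = {"app_logs", "error_catalog"}
print(can_read_dataset(analyst_sources, ["app_logs", "error_catalog"]))
print(can_read_dataset(analyst_sources, ["app_logs", "user_sessions"]))
```

Under this assumption no Dataset-level permission needs to be stored, since the check is derived entirely from permissions that already exist on the sources.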

**Relationship to Index Patterns**

- Within Workspaces: Index patterns will be fully replaced by Datasets, introducing both the new terminology and expanded capabilities. Users working in workspaces will exclusively use the Datasets interface.
- Outside Workspaces: The underlying implementation will gradually transition to Datasets, but the user-facing terminology will remain "index patterns" for continuity and to minimize disruption for existing users.
- Long-Term: Eventually, only Datasets will exist throughout OpenSearch. The term "index patterns" will be phased out as more users transition to workspace environments.


### Questions for the Community

1. Do you feel Datasets would solve important pain points in your current use of OpenSearch?
2. What types of data sources would you want to combine using Datasets?
3. Should Datasets be able to cross workspace boundaries in the future? What challenges do you see with this approach?
4. Are there specific performance concerns we should consider when implementing Composite Datasets, particularly for Join operations?
5. What additional features would make Datasets more valuable to your specific use cases?
6. Would you prefer Datasets to completely replace index patterns, or to coexist with them?

### Example Use Cases

**Log-Analytics JOIN within a Single OpenSearch Source**

Scenario: Analyzing application logs with user session data and error details

Data Sources:

* app_logs (Fields: timestamp, user_id, error_code, response_time_ms)
* user_sessions (Fields: user_id, session_start, session_end)
* error_catalog (Fields: error_code, severity, description)

With Datasets:

1. Create a Dataset "AppLogInsights" with joins:

    * Join 1: `app_logs.user_id = user_sessions.user_id`
    * Join 2: `app_logs.error_code = error_catalog.error_code`

2. Query in Discover using PPL:

```
source = AppLogInsights
| eval session_duration = TIMESTAMPDIFF(SECOND, session_start, session_end)
| stats count() as total_errors,
        avg(response_time_ms) as avg_resp_ms,
        avg(session_duration) as avg_session_s
        by severity
| sort - total_errors
```


Benefits:

* One-time schema definition: no need to re-express joins in every query
* Cleaner PPL in Discover: focus on metrics, not join logic
* Reusable for any downstream visualizations
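To show what the query over "AppLogInsights" would compute, the sketch below reproduces the two joins and the per-severity aggregation in plain Python. The rows are made-up sample data, inner-join semantics are assumed, and each user is assumed to have a single session; none of this is specified by the RFC.

```python
from collections import defaultdict
from datetime import datetime

# Made-up sample rows, only to exercise the logic.
app_logs = [
    {"user_id": "u1", "error_code": "E1", "response_time_ms": 100},
    {"user_id": "u2", "error_code": "E1", "response_time_ms": 300},
    {"user_id": "u1", "error_code": "E2", "response_time_ms": 500},
]
user_sessions = [
    {"user_id": "u1", "session_start": "2024-01-01T10:00:00", "session_end": "2024-01-01T10:01:00"},
    {"user_id": "u2", "session_start": "2024-01-01T11:00:00", "session_end": "2024-01-01T11:03:00"},
]
error_catalog = [
    {"error_code": "E1", "severity": "high", "description": "Upstream timeout"},
    {"error_code": "E2", "severity": "low", "description": "Validation failed"},
]

# Join 1 and Join 2: index the right-hand sides by their join keys.
sessions_by_user = {s["user_id"]: s for s in user_sessions}
catalog_by_code = {e["error_code"]: e for e in error_catalog}

groups = defaultdict(list)  # severity -> [(response_time_ms, session_duration_s)]
for log in app_logs:
    session = sessions_by_user.get(log["user_id"])
    error = catalog_by_code.get(log["error_code"])
    if session is None or error is None:
        continue  # assumed inner-join semantics: drop unmatched rows
    start = datetime.fromisoformat(session["session_start"])
    end = datetime.fromisoformat(session["session_end"])
    duration_s = (end - start).total_seconds()
    groups[error["severity"]].append((log["response_time_ms"], duration_s))

# Aggregate per severity, then sort by error count in descending order.
summary = sorted(
    (
        {
            "severity": severity,
            "total_errors": len(rows),
            "avg_resp_ms": sum(r for r, _ in rows) / len(rows),
            "avg_session_s": sum(d for _, d in rows) / len(rows),
        }
        for severity, rows in groups.items()
    ),
    key=lambda group: group["total_errors"],
    reverse=True,
)
for group in summary:
    print(group)
```

Everything before the aggregation step is what the Dataset definition would absorb, which is why the query itself can stay focused on metrics.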


We welcome feedback, suggestions, and discussions on this proposal. Your input will help shape the direction and implementation of Datasets in OpenSearch.

