From ce824a1ef368e656304dc63f5b536558ece3d7c1 Mon Sep 17 00:00:00 2001
From: Phil Varner
Date: Mon, 17 Aug 2020 22:51:29 -0400
Subject: [PATCH 1/2] initial commit for aggregation extension

---
 extensions/aggregation/README.md | 232 +++++++++++++++++++++++++++++++
 1 file changed, 232 insertions(+)
 create mode 100644 extensions/aggregation/README.md

diff --git a/extensions/aggregation/README.md b/extensions/aggregation/README.md
new file mode 100644
index 00000000..546776d0
--- /dev/null
+++ b/extensions/aggregation/README.md
@@ -0,0 +1,232 @@
+# Aggregation Extension
+
+The purpose of the Aggregation Extension is to provide an endpoint similar to the Search endpoint (`/search`), but which will provide aggregated information on matching Items rather than the Items themselves. This is highly influenced by the Elasticsearch aggregation endpoint, but with a more regular structure for responses.
+
+## STAC Endpoints
+
+| Endpoint | Returns | Description |
+| ------------ | -------------------------------------------------------------- | ----------- |
+| `/aggregate` | AggregationCollection | Retrieves an aggregation of the group of Items matching the provided predicates |
+
+The `/aggregate` endpoint behaves similarly to the `/search` endpoint, but instead of returning an ItemCollection of Items, it returns aggregated information over the same matching Items in the form of an **AggregationCollection** of **Aggregation** entities.
+
+If the `/aggregate` endpoint is implemented, it is **required** to add a link with the `rel` type set to `aggregate` to the `links` array in the root entity (`/`) that refers to the aggregate endpoint in the `href` property. This link **should** also have a field `aggregations` whose value is a string array advertising the available values for the `aggregations` query parameter.
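To illustrate the link requirement above, here is a minimal Python sketch of how a client might discover the aggregate endpoint and its advertised aggregations from a root document. The sample root document, the `example.com` URLs, and the helper name `find_aggregate_link` are hypothetical and not part of the extension; only the `rel`, `href`, and `aggregations` fields come from the text.

```python
from typing import Optional


def find_aggregate_link(root: dict) -> Optional[dict]:
    """Return the first link whose rel is "aggregate", or None if absent."""
    for link in root.get("links", []):
        if link.get("rel") == "aggregate":
            return link
    return None


# Hypothetical root document; only the shape of the "aggregate" link
# follows the README text, everything else is made up for illustration.
root_document = {
    "links": [
        {"rel": "self", "href": "https://example.com/"},
        {
            "rel": "aggregate",
            "href": "https://example.com/aggregate",
            "aggregations": ["count", "datetime_min", "datetime_max"],
        },
    ]
}

link = find_aggregate_link(root_document)
aggregate_href = link["href"] if link else None
# The "aggregations" field is only recommended ("should"), so fall back
# to an empty list when a server does not advertise it.
advertised = link.get("aggregations", []) if link else []
```

A client could use `advertised` to validate the values it passes in the `aggregations` parameter before sending a request.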
+
+## Filter Parameters and Fields
+
+The filters for `/aggregate` are those `/search` filters that remain semantically meaningful (e.g., `limit` has no meaning when doing aggregations). These filters are passed as query string parameters or JSON
+entity fields. For filters that represent a set of values, query parameters should use comma-separated
+string values and JSON entity attributes should use JSON Arrays.
+
+| Parameter | Type | Description |
+| ----------- | ---------------- | ----------- |
+| datetime | string | Single date+time, or a range ('/' separator), formatted to [RFC 3339, section 5.6](https://tools.ietf.org/html/rfc3339#section-5.6). Use double dots `..` for open date ranges. |
+| bbox | \[number] | Requested bounding box. Represented using either 2D or 3D geometries. The length of the array must be 2*n where n is the number of dimensions. The array contains all axes of the southwesterly most extent followed by all axes of the northeasterly most extent specified in Longitude/Latitude or Longitude/Latitude/Elevation based on [WGS 84](http://www.opengis.net/def/crs/OGC/1.3/CRS84). When using 3D geometries, the elevation of the southwesterly most extent is the minimum elevation in meters and the elevation of the northeasterly most extent is the maximum. |
+| intersects | GeoJSON Geometry | Searches items by performing intersection between their geometry and provided GeoJSON geometry. All GeoJSON geometry types must be supported. |
+| ids | \[string] | Array of Item ids to return. All other filter parameters that further restrict the number of search results (except `next` and `limit`) are ignored |
+| collections | \[string] | Array of Collection IDs to include in the search for items. Only Items in one of the provided Collections will be searched |
+| aggregations | \[string] | A list of aggregations to compute and return |
+
+Only one of either **intersects** or **bbox** should be specified. 
If both are specified, a 400 Bad Request response should be returned.
+
+**aggregations**: There are no named aggregations that must be implemented. All aggregations which are available should be advertised in the root `rel="aggregate"` link.
+
+This is a list of recommended aggregations to implement:
+* count (Single Value of integer)
+* datetime_min (Single Value of datetime)
+* datetime_max (Single Value of datetime)
+* collection (Term Count)
+* cloud_cover (Discrete Range)
+* datetime_auto (Datetime Range, automatic interval detection) -- detect a reasonable interval based on the datetime range and distribution of data. Implementation specific.
+* datetime_yearly (Datetime Range, interval=year)
+* datetime_quarterly (Datetime Range, interval=quarter)
+* datetime_monthly (Datetime Range, interval=month)
+* datetime_weekly (Datetime Range, interval=week)
+* datetime_daily (Datetime Range, interval=day)
+* datetime_hourly (Datetime Range, interval=hour)
+* datetime_minutes (Datetime Range, interval=minute)
+* datetime_seconds (Datetime Range, interval=second)
+
+## AggregationCollection fields
+
+This object describes a STAC AggregationCollection, which is the analog of an ItemCollection for the `/aggregate` operation.
+
+| Field Name | Type | Description |
+| --------------- | -------------- | ----------- |
+| stac_version | string | **REQUIRED** The STAC version the AggregationCollection implements. |
+| type | string | **REQUIRED** Always "AggregationCollection". |
+| aggregations | \[Aggregation] | **REQUIRED** A possibly-empty array of Aggregations. |
+
+**stac_version**: In general, STAC versions can be mixed, but please keep the [recommended best practices](../best-practices.md#mixing-stac-versions) in mind.
+
+## Aggregation fields
+
+| Field Name | Type | Description |
+| --------------- | -------------- | ----------- |
+| key | string | **REQUIRED** The unique identifier of the aggregation. 
|
+| buckets | \[Bucket] | If the aggregation bucketizes Items, the buckets are defined here. |
+| overflow | integer | The count of Items that were not categorized into any of the buckets defined by the `buckets` field |
+| interval | string | For a Datetime Range aggregation, the bucket interval. One of \["year", "quarter", "month", "week", "day", "hour", "minute", "second"] |
+| value | string | For a Single Value aggregation, the string representation of the result value. |
+| value_as_type | string or number or datetime | For a Single Value aggregation, a JSON-type representation of the result value. |
+
+One of either **buckets** or **value** is required.
+
+**key** An identifier for the aggregation result. Should be identical to the value passed to the `aggregations` query parameter.
+
+**buckets** If the aggregation is a Term Count, Datetime Range, or Discrete Range, these are the "buckets" into which each matching Item is categorized.
+
+**overflow** Some implementation data stores may have limitations on the aggregation queries that can be performed on them. For example, Elasticsearch limits the number of buckets for a query to 10,000 for performance reasons. Overflow indicates that there were Items matched by the query that are not accounted for in the count of any of the response buckets.
+
+**interval** Aggregations over datetime typed values that return a Datetime Range have a slightly different format than Discrete Range. For these, only the start datetime for the bucket is set to the `key` field. The interval determines how much time from that starting datetime the bucket represents.
+
+**value** For Single Value aggregations, this is the string value of the result. If the type of the value being aggregated over is a datetime, this is an RFC 3339 datetime, e.g., "2020-08-12T19:06:09Z".
+
+**value_as_type** For Single Value aggregations, this is a representation of the result value as the equivalent JSON type. TBD: what about datetimes? 
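The **interval** description above says that a Datetime Range bucket carries only its start datetime in `key`, with `interval` giving its length. A minimal Python sketch of how a client might derive the end of such a bucket follows. The function name `bucket_end` is hypothetical, and treating the computed end as exclusive is an assumption borrowed from the Discrete Range `to` semantics, not something the text states for datetime buckets.

```python
from datetime import datetime, timedelta

# Intervals with a fixed duration.
_FIXED = {
    "week": timedelta(weeks=1),
    "day": timedelta(days=1),
    "hour": timedelta(hours=1),
    "minute": timedelta(minutes=1),
    "second": timedelta(seconds=1),
}
# Calendar intervals, expressed as a number of months.
_MONTHS = {"year": 12, "quarter": 3, "month": 1}


def bucket_end(key: str, interval: str) -> datetime:
    """Return the (assumed exclusive) end of the bucket starting at `key`."""
    # datetime.fromisoformat() before Python 3.11 rejects a trailing "Z",
    # so rewrite it as an explicit UTC offset first.
    start = datetime.fromisoformat(key.replace("Z", "+00:00"))
    if interval in _FIXED:
        return start + _FIXED[interval]
    if interval in _MONTHS:
        # Advance by whole months. Assumes the key falls on a day that
        # exists in the target month (e.g. the 1st), as bucket starts do.
        total = start.year * 12 + (start.month - 1) + _MONTHS[interval]
        return start.replace(year=total // 12, month=total % 12 + 1)
    raise ValueError(f"unknown interval: {interval!r}")


# A yearly bucket starting at 2000-01-01 ends at 2001-01-01T00:00:00+00:00.
end = bucket_end("2000-01-01T00:00:00.000Z", "year")
```

Calendar intervals are handled separately from fixed-length ones because years, quarters, and months do not have a constant number of days.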
+
+## Bucket fields
+
+| Field Name | Type | Aggregation Types | Description |
+| --------------- | -------------- | ----------------- | ----------- |
+| key | string | all | |
+| key_as_type | ? | all | |
+| value | string | all | |
+| value_as_type | ? | all | |
+| from | ? | all | |
+| to | ? | all | |
+
+## Aggregation Types
+
+### Single Value Aggregation
+
+effectively a single Term Count Bucket lifted up one level
+
+**todo** (diff for String, Numeric, and Datetime)
+
+Example:
+    {
+      "stac_version": "1.0.0",
+      "type": "AggregationCollection",
+      "aggregations": [
+        {
+          "key": "datetime_min",
+          "value": "2000-02-16T00:00:00.000Z",
+          "value_as_type": 9.506592E+11
+        }
+      ]
+    }
+
+### Term Count Aggregation
+
+- enumeration count multi bucket one per unique value
+
+Example:
+    {
+      "stac_version": "1.0.0",
+      "type": "AggregationCollection",
+      "aggregations": [
+        {
+          "key": "collections",
+          "buckets": [
+            {
+              "key": "sentinel2_l1c",
+              "value": "12649072",
+              "value_as_type": 12649072
+            },
+            {
+              "key": "landsat8_l1tp",
+              "value": "1071997",
+              "value_as_type": 1071997
+            }
+          ],
+          "overflow": 23414
+        }
+      ]
+    }
+
+### Discrete Range Aggregation
+
+Fields:
+* key (string)
+* key_as_type ()
+* from (optional, missing indicates an open interval) inclusive
+* to (optional, missing indicates an open interval) exclusive
+* value (integer)
+
+Example:
+    {
+      "stac_version": "1.0.0",
+      "type": "AggregationCollection",
+      "aggregations": [
+        {
+          "key": "cloud_cover",
+          "buckets": [
+            {
+              "key": "*-5.0",
+              "to": 5,
+              "value": "8644819",
+              "value_as_type": 8644819
+            },
+            {
+              "key": "5.0-10.0",
+              "from": 5,
+              "to": 10,
+              "value": "5644819",
+              "value_as_type": 5644819
+            },
+            {
+              "key": "10.0-*",
+              "from": 10,
+              "value": "7644819",
+              "value_as_type": 7644819
+            }
+          ]
+        }
+      ]
+    }
+
+### Datetime Range Aggregation
+
+Fields:
+datetimes are RFC 3339 string values
+
+* key (string)
+* key_as_type (datetime in milliseconds?) 
+* value (integer) (ES: doc_count)
+
+Example:
+    {
+      "stac_version": "1.0.0",
+      "type": "AggregationCollection",
+      "aggregations": [
+        {
+          "key": "datetime_yearly",
+          "buckets": [
+            {
+              "key": "2000-01-01T00:00:00.000Z",
+              "key_as_type": 946684800000,
+              "to": 5,
+              "value": "8644819",
+              "value_as_type": 8644819
+            },
+            {
+              "key": "2001-01-01T00:00:00.000Z",
+              "key_as_type": 978307200000,
+              "from": 5,
+              "to": 10,
+              "value": "5644819",
+              "value_as_type": 5644819
+            },
+            {
+              "key": "2002-01-01T00:00:00.000Z",
+              "key_as_type": 1009843200000,
+              "from": 10,
+              "value": "7644819",
+              "value_as_type": 7644819
+            }
+          ],
+          "interval": "year",
+          "overflow": 98373
+        }
+      ]
+    }
\ No newline at end of file

From ebabaa74f3fa796631e84700ea8c16caecf75275 Mon Sep 17 00:00:00 2001
From: Phil Varner
Date: Mon, 17 Aug 2020 22:53:23 -0400
Subject: [PATCH 2/2] add motivation

---
 extensions/aggregation/README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/extensions/aggregation/README.md b/extensions/aggregation/README.md
index 546776d0..ea7ad265 100644
--- a/extensions/aggregation/README.md
+++ b/extensions/aggregation/README.md
@@ -1,6 +1,8 @@
 # Aggregation Extension
 
-The purpose of the Aggregation Extension is to provide an endpoint similar to the Search endpoint (`/search`), but which will provide aggregated information on matching Items rather than the Items themselves. This is highly influenced by the Elasticsearch aggregation endpoint, but with a more regular structure for responses.
+The purpose of the Aggregation Extension is to provide an endpoint similar to the Search endpoint (`/search`), but which will provide aggregated information on matching Items rather than the Items themselves. This is useful when a dataset is very large, and it is infeasible to examine all results. This is particularly useful in data exploration, whereby a user of the data can change queries to see the "shape" of the results for a given query. 
+
+This is highly influenced by the Elasticsearch aggregation endpoint, but with a more regular structure for responses.
 
 ## STAC Endpoints
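The Filter Parameters section in the first patch states that set-valued filters use comma-separated strings as query parameters and JSON Arrays as JSON entity fields. A minimal Python sketch of that encoding rule is shown below. The helper names `to_query_string` and `to_json_body` and the sample filter values are hypothetical; the sketch does not send a request and makes no claim about which HTTP methods a server accepts.

```python
import json
from urllib.parse import urlencode


def to_query_string(filters: dict) -> str:
    """Encode filters as a query string; list values become comma-separated."""
    params = {}
    for name, value in filters.items():
        if isinstance(value, (list, tuple)):
            params[name] = ",".join(str(item) for item in value)
        else:
            params[name] = str(value)
    # urlencode percent-encodes reserved characters, including the commas.
    return urlencode(params)


def to_json_body(filters: dict) -> str:
    """Encode filters as a JSON entity; list values stay JSON arrays."""
    return json.dumps(filters)


# Illustrative filter values only.
filters = {
    "collections": ["sentinel2_l1c", "landsat8_l1tp"],
    "bbox": [-10, -10, 10, 10],
    "datetime": "2020-01-01T00:00:00Z/..",
    "aggregations": ["count", "datetime_monthly"],
}

query = to_query_string(filters)
body = to_json_body(filters)
```

Keeping both encoders driven by the same `filters` dictionary makes it easier to check that the two request styles express the same query.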