diff --git a/docs/sql-ref-functions-builtin-aggregate.md b/docs/sql-ref-functions-builtin-aggregate.md
index d59543647e02..a2be7577a617 100644
--- a/docs/sql-ref-functions-builtin-aggregate.md
+++ b/docs/sql-ref-functions-builtin-aggregate.md
@@ -19,4 +19,631 @@ license: |
   limitations under the License.
 ---
 
-Aggregate functions
\ No newline at end of file
+### Description
+
+Spark SQL provides built-in aggregate functions defined in the dataset API and SQL interface. Aggregate functions
+operate on a group of rows and return a single aggregated value.
+

| Function | Argument Type(s) | Description |
|---|---|---|
| {any \| some \| bool_or}(expression) | boolean | Returns true if at least one value is true. |
| approx_count_distinct(expression[, relativeSD]) | (bigint[, double]) | Returns the estimated cardinality (number of distinct values) computed with HyperLogLog++. `relativeSD` is the maximum estimation error allowed. |
| {avg \| mean}(expression) | tinyint\|smallint\|int\|bigint\|float\|double\|decimal\|string | Returns the average of values in the input expression. |
| {bool_and \| every}(expression) | boolean | Returns true if all values are true. |
| collect_list(expression) | any | Collects and returns a list of non-unique elements. The function is non-deterministic because the order of the collected results depends on the order of the rows, which may be non-deterministic after a shuffle. |
| collect_set(expression) | any | Collects and returns a set of unique elements. The function is non-deterministic because the order of the collected results depends on the order of the rows, which may be non-deterministic after a shuffle. |
| corr(expression1, expression2) | (double, double) | Returns the Pearson correlation coefficient between a set of number pairs. |
| count([DISTINCT] *) | none | If DISTINCT is specified, returns the number of retrieved rows that are unique and not null; otherwise, returns the total number of retrieved rows, including rows containing null. |
| count([DISTINCT] expression1[, expression2]) | (any[, any]) | If DISTINCT is specified, returns the number of rows for which the supplied expression(s) are unique and not null; otherwise, returns the number of rows for which the supplied expression(s) are all not null. |
| count_if(predicate) | expression that returns a boolean value | Returns the number of rows for which the predicate evaluates to `TRUE`. |
| count_min_sketch(expression, eps, confidence, seed) | (tinyint\|int\|bigint\|smallint\|string\|binary, double, double, int) | `eps` and `confidence` are double values between 0.0 and 1.0; `seed` is a positive integer. Returns a count-min sketch of an expression with the given `eps`, `confidence`, and `seed`. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before use. Count-min sketch is a probabilistic data structure used for frequency estimation in sub-linear space. |
| covar_pop(expression1, expression2) | (double, double) | Returns the population covariance of a set of number pairs. |
| covar_samp(expression1, expression2) | (double, double) | Returns the sample covariance of a set of number pairs. |
| {first \| first_value}(expression[, `isIgnoreNull`]) | (any[, boolean]) | Returns the first value of the expression for a group of rows. If `isIgnoreNull` is true, returns only non-null values; the default is false. This function is non-deterministic. |
| kurtosis(expression) | double | Returns the kurtosis value calculated from values of a group. |
| {last \| last_value}(expression[, `isIgnoreNull`]) | (any[, boolean]) | Returns the last value of the expression for a group of rows. If `isIgnoreNull` is true, returns only non-null values; the default is false. This function is non-deterministic. |
| max(expression) | tinyint\|short\|int\|bigint\|float\|double\|date\|timestamp\|string, or arrays of these types | Returns the maximum value of the expression. |
| max_by(expression1, expression2) | tinyint\|short\|int\|bigint\|float\|double\|date\|timestamp\|string, or arrays of these types | Returns the value of expression1 associated with the maximum value of expression2. |
| min(expression) | tinyint\|short\|int\|bigint\|float\|double\|date\|timestamp\|string, or arrays of these types | Returns the minimum value of the expression. |
| min_by(expression1, expression2) | tinyint\|short\|int\|bigint\|float\|double\|date\|timestamp\|string, or arrays of these types | Returns the value of expression1 associated with the minimum value of expression2. |
| percentile(expression, percentage [, frequency]) | (short\|float\|byte\|decimal\|double\|int\|bigint, double[, int]) | `percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the exact percentile value of the numeric expression at the given percentage. |
| percentile(expression, array(percentage1 [, percentage2]...) [, frequency]) | (short\|float\|byte\|decimal\|double\|int\|bigint, array of double[, int]) | Each value in the percentage array is a number between 0 and 1; `frequency` is a positive integer. Returns an array of exact percentile values of the numeric expression at the given percentage(s). |
| {percentile_approx \| approx_percentile}(expression, percentage [, accuracy]) | (short\|float\|byte\|decimal\|double\|int\|bigint, double[, int]) | `percentage` is a number between 0 and 1; `accuracy` is a positive integer that controls the approximation accuracy. Returns the approximate percentile value of the numeric expression at the given percentage. |
| {percentile_approx \| approx_percentile}(expression, percentage [, accuracy]) | (date\|timestamp, double[, int]) | `percentage` is a number between 0 and 1; `accuracy` is a positive integer that controls the approximation accuracy. Returns the approximate percentile value of the date or timestamp expression at the given percentage. |
| {percentile_approx \| approx_percentile}(expression, array(percentage1 [, percentage2]...) [, accuracy]) | (short\|float\|byte\|decimal\|double\|int\|bigint, array of double[, int]) | Each value in the percentage array is a number between 0 and 1; `accuracy` is a positive integer that controls the approximation accuracy. Returns an array of approximate percentile values of the numeric expression at the given percentage(s). |
| {percentile_approx \| approx_percentile}(expression, array(percentage1 [, percentage2]...) [, accuracy]) | (date\|timestamp, array of double[, int]) | Each value in the percentage array is a number between 0 and 1; `accuracy` is a positive integer that controls the approximation accuracy. Returns an array of approximate percentile values of the date or timestamp expression at the given percentage(s). |
| skewness(expression) | double | Returns the skewness value calculated from values of a group. |
| {stddev_samp \| stddev \| std}(expression) | double | Returns the sample standard deviation calculated from values of a group. |
| stddev_pop(expression) | double | Returns the population standard deviation calculated from values of a group. |
| sum(expression) | tinyint\|smallint\|int\|bigint\|float\|double\|decimal | Returns the sum calculated from values of a group. |
| {variance \| var_samp}(expression) | double | Returns the sample variance calculated from values of a group. |
| var_pop(expression) | double | Returns the population variance calculated from values of a group. |
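
The three `count` forms in the table are easy to confuse, so the sketch below runs them side by side on a tiny table that contains a null. It is not Spark code: it uses SQLite through Python's standard `sqlite3` module as a stand-in, on the assumption that `count(*)`, `count(expression)`, `count(DISTINCT expression)` and `sum` handle nulls the way the table describes. The `sales` table, its columns, and the `run_count_demo` helper are hypothetical names chosen for illustration.

```python
import sqlite3


def run_count_demo():
    """Return per-group aggregates from an in-memory table that contains a NULL.

    SQLite is only a stand-in here; Spark is not invoked.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [
            ("east", 10),
            ("east", 10),    # duplicate value within the group
            ("east", None),  # NULL amount
            ("west", 5),
            ("west", 7),
        ],
    )
    # Each aggregate collapses the rows of one region into a single value.
    rows = conn.execute(
        """
        SELECT region,
               count(*)               AS all_rows,
               count(amount)          AS non_null_rows,
               count(DISTINCT amount) AS distinct_non_null,
               sum(amount)            AS total
        FROM sales
        GROUP BY region
        ORDER BY region
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in run_count_demo():
        print(row)
```

For the `east` group, `count(*)` counts all three rows, `count(amount)` skips the row with the null, and `count(DISTINCT amount)` collapses the two equal values into one.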