@yaooqinn
Member

What changes were proposed in this pull request?

All DataSketches-related expression functions should have their own sketch_funcs group instead of being grouped under misc_funcs.

Move all sketch-related expression functions from misc_funcs to sketch_funcs:

  • HLL sketch functions: hll_sketch_estimate, hll_union
  • Theta sketch functions: theta_sketch_estimate, theta_union, theta_difference, theta_intersection
  • KLL sketch functions: kll_sketch_to_string_*, kll_sketch_get_n_*, kll_sketch_get_rank_*, kll_sketch_get_quantile_*, kll_sketch_get_pmf_*, kll_sketch_get_cdf_*, kll_sketch_merge_*
  • Tuple sketch functions: tuple_sketch_* expression functions
  • ApproxTopK: approx_top_k_estimate

Add sketch_funcs to the groups set in gen-sql-functions-docs.py.

Note: Aggregate functions (like hll_sketch_agg, theta_sketch_agg, kll_sketch_agg_*, etc.) remain in agg_funcs.
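
As a rough illustration of why the group name has to be registered in the docs generator, here is a minimal sketch. It is not the actual contents of gen-sql-functions-docs.py: the `GROUPS` constant, the `FUNCTIONS` list, and the `group_functions` helper are hypothetical names, and the sketch assumes the generator only emits sections for groups it knows about. The function names come from the lists above.

```python
# Hypothetical sketch; NOT the real gen-sql-functions-docs.py.
from collections import defaultdict

# Groups the docs generator is assumed to recognize.
# "sketch_funcs" is the entry this PR adds.
GROUPS = {"agg_funcs", "misc_funcs", "sketch_funcs"}

# Illustrative (function name, group) pairs, using names from this PR.
FUNCTIONS = [
    ("hll_sketch_estimate", "sketch_funcs"),
    ("hll_union", "sketch_funcs"),
    ("theta_union", "sketch_funcs"),
    ("approx_top_k_estimate", "sketch_funcs"),
    ("hll_sketch_agg", "agg_funcs"),    # aggregates stay in agg_funcs
    ("theta_sketch_agg", "agg_funcs"),  # aggregates stay in agg_funcs
]


def group_functions(functions, groups):
    """Bucket function names by group, skipping groups that are not registered."""
    grouped = defaultdict(list)
    for name, group in functions:
        if group in groups:
            grouped[group].append(name)
    return {group: sorted(names) for group, names in grouped.items()}


# With "sketch_funcs" registered, the sketch functions get their own section.
by_group = group_functions(FUNCTIONS, GROUPS)

# Without it, they would be missing from the grouped output in this sketch.
without_new_group = group_functions(FUNCTIONS, GROUPS - {"sketch_funcs"})
```

Under these assumptions, `by_group` contains a `sketch_funcs` entry alongside `agg_funcs`, while `without_new_group` has no `sketch_funcs` entry at all, which is the gap the groups-set change is meant to close.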

Why are the changes needed?

This PR moves 34 DataSketches-related expression functions from misc_funcs to a dedicated sketch_funcs group. These 34 functions account for over 60% of the functions in misc_funcs, which has turned misc_funcs into a catch-all bucket and made the documentation harder to navigate. Creating sketch_funcs brings these functions in line with other specialized function groups (avro_funcs, json_funcs, csv_funcs, xml_funcs, etc.) and makes it easier for users to discover and understand DataSketches functionality in Spark SQL.

Does this PR introduce any user-facing change?

No functional changes. The only difference is in how functions are grouped in documentation.

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

Yes, GitHub Copilot was used to assist with this change.

@github-actions

JIRA Issue Information

=== Improvement SPARK-55279 ===
Summary: [SQL] Add sketch_funcs group for DataSketches SQL functions
Assignee: None
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

@github-actions github-actions bot added the SQL label Jan 29, 2026
Contributor

@allisonwang-db allisonwang-db left a comment

@yaooqinn yaooqinn force-pushed the SPARK-55279-sketch-funcs-group branch from 5a52f4b to 3349dc5 Compare January 30, 2026 03:37
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-55279][SQL] Add sketch_funcs group for DataSketches SQL functions [SPARK-55279][SQL] Add sketch_funcs group for DataSketches SQL functions Jan 30, 2026
Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @yaooqinn and @allisonwang-db.

cc @peter-toth

@yaooqinn yaooqinn force-pushed the SPARK-55279-sketch-funcs-group branch from 3349dc5 to 4c8b0c9 Compare January 30, 2026 07:06
@yaooqinn yaooqinn force-pushed the SPARK-55279-sketch-funcs-group branch from 4c8b0c9 to 8cc53f4 Compare January 30, 2026 14:01
@yaooqinn yaooqinn force-pushed the SPARK-55279-sketch-funcs-group branch from 8cc53f4 to bd8912a Compare January 30, 2026 14:14
@yaooqinn yaooqinn closed this in fbb4019 Jan 30, 2026
@yaooqinn yaooqinn deleted the SPARK-55279-sketch-funcs-group branch January 30, 2026 18:23
@yaooqinn
Member Author

Merged to master, thank you @allisonwang-db @dongjoon-hyun @peter-toth

BTW, the test failure is unrelated to this PR, and I'm trying to fix it in #54072.
