[SPARK-28510][SQL] Implement Spark's own GetFunctionsOperation #25252
Conversation
| val functionPattern = CLIServiceUtils.patternToRegex(functionName)
| if ((null == catalogName || "".equals(catalogName))
|     && (null == schemaName || "".equals(schemaName))) {
|   catalog.listFunctions(catalog.getCurrentDatabase, functionPattern).foreach {
Actually, I'm confused about this code. Maybe it should be:
val functionPattern = CLIServiceUtils.patternToRegex(functionName)
matchingDbs.foreach { schema =>
  catalog.listFunctions(catalog.getCurrentDatabase, functionPattern).foreach {
    case (functionIdentifier, _) =>
      val rowData = Array[AnyRef](
        DEFAULT_HIVE_CATALOG, // FUNCTION_CAT
        schema, // FUNCTION_SCHEM
        functionIdentifier.funcName, // FUNCTION_NAME
        "", // REMARKS
        DatabaseMetaData.functionResultUnknown.asInstanceOf[AnyRef], // FUNCTION_TYPE
        "") // SPECIFIC_NAME
      rowSet.addRow(rowData)
  }
}

But the current code follows Hive's logic: https://github.com/apache/hive/blob/rel/release-3.1.1/service/src/java/org/apache/hive/service/cli/operation/GetFunctionsOperation.java#L101-L119
I think we should cover the case of functions that don't have a schema (null) which is basically what Hive's implementation seems to do, as well as the functions associated with a given schema which is what your code snippet above seems to do. Could you combine both?
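For illustration, a minimal sketch of one way to combine both (hypothetical; `matchingDbs`, `rowSet`, and the row layout are assumed from the snippets above):

```scala
// Schema-less (built-in) functions once, with a null FUNCTION_SCHEM,
// which is roughly what Hive's implementation does.
catalog.listFunctions(catalog.getCurrentDatabase, functionPattern).foreach {
  case (functionIdentifier, _) =>
    rowSet.addRow(Array[AnyRef](
      DEFAULT_HIVE_CATALOG, // FUNCTION_CAT
      null, // FUNCTION_SCHEM
      functionIdentifier.funcName, // FUNCTION_NAME
      "", // REMARKS
      DatabaseMetaData.functionResultUnknown.asInstanceOf[AnyRef], // FUNCTION_TYPE
      "")) // SPECIFIC_NAME
}
// Then the functions registered in each matching schema.
matchingDbs.foreach { schema =>
  catalog.listFunctions(schema, functionPattern).foreach {
    case (functionIdentifier, _) =>
      rowSet.addRow(Array[AnyRef](
        DEFAULT_HIVE_CATALOG, schema, functionIdentifier.funcName, "",
        DatabaseMetaData.functionResultUnknown.asInstanceOf[AnyRef], ""))
  }
}
```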
Maybe we do not need to care about the catalog:
https://github.com/pgjdbc/pgjdbc/blob/17c4bcfb59e846c593093752f2e30dd97bb4b338/pgjdbc/src/main/java/org/postgresql/jdbc/PgDatabaseMetaData.java#L2612-L2649
| null, // FUNCTION_SCHEM
| functionIdentifier.funcName, // FUNCTION_NAME
| "", // REMARKS
| DatabaseMetaData.functionResultUnknown.asInstanceOf[AnyRef], // FUNCTION_TYPE
We do not support FUNCTION_TYPE now. Set it to Unknown:
// java.sql.DatabaseMetaData
/**
* Indicates that it is not known whether the function returns
* a result or a table.
* <P>
* A possible value for column <code>FUNCTION_TYPE</code> in the
* <code>ResultSet</code> object returned by the method
* <code>getFunctions</code>.
* @since 1.6
*/
int functionResultUnknown = 0;
/**
* Indicates that the function does not return a table.
* <P>
* A possible value for column <code>FUNCTION_TYPE</code> in the
* <code>ResultSet</code> object returned by the method
* <code>getFunctions</code>.
* @since 1.6
*/
int functionNoTable = 1;
/**
* Indicates that the function returns a table.
* <P>
* A possible value for column <code>FUNCTION_TYPE</code> in the
* <code>ResultSet</code> object returned by the method
* <code>getFunctions</code>.
* @since 1.6
*/
int functionReturnsTable = 2;
Test build #108152 has finished for PR 25252 at commit
retest this please
Test build #108157 has finished for PR 25252 at commit
cc @bogdanghit
bogdanghit left a comment:
Thanks for working on this, it looks pretty good. A few comments below.
| HiveThriftServer2.listener.onStatementError(
|   statementId, e.getMessage, SparkUtils.exceptionString(e))
| throw e
| }
Shouldn't we handle other exceptions too?
Maybe it should be the same as SparkExecuteStatementOperation:
Lines 261 to 269 in 687dd4e
| // Actually do need to catch Throwable as some failures don't inherit from Exception and
| // HiveServer will silently swallow them.
| case e: Throwable =>
|   val currentState = getStatus().getState()
|   logError(s"Error executing query, currentState $currentState, ", e)
|   setState(OperationState.ERROR)
|   HiveThriftServer2.listener.onStatementError(
|     statementId, e.getMessage, SparkUtils.exceptionString(e))
|   throw new HiveSQLException(e.toString)
Or the same as GetTablesOperation and GetSchemasOperation for the sake of consistency?
+1 for same as GetTablesOperation and GetSchemasOperation.
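For reference, a hedged sketch of that catch block, assuming it mirrors the Throwable handling quoted above (the exact GetTablesOperation/GetSchemasOperation code may differ in details):

```scala
} catch {
  // Catch Throwable, since some failures don't inherit from Exception
  // and HiveServer would otherwise silently swallow them.
  case e: Throwable =>
    logError(s"Error executing get functions operation with $statementId", e)
    setState(OperationState.ERROR)
    HiveThriftServer2.listener.onStatementError(
      statementId, e.getMessage, SparkUtils.exceptionString(e))
    throw new HiveSQLException(e)
}
```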
Test build #108326 has finished for PR 25252 at commit
| statementId,
| parentSession.getUsername)
|
| try {
This looks correct now: retrieve all functions available for all matching schemas.
| functionIdentifier.funcName, // FUNCTION_NAME
| "", // REMARKS
| DatabaseMetaData.functionResultUnknown.asInstanceOf[AnyRef], // FUNCTION_TYPE
| "")
This is the function class name which I think we can get through a catalog.getFunction(funcIdentifier).className call.
Done:
catalog.lookupFunctionInfo(funcIdentifier).getClassName

| null, // FUNCTION_CAT
| db, // FUNCTION_SCHEM
| functionIdentifier.funcName, // FUNCTION_NAME
| "", // REMARKS
I wonder if we can get the function usage somehow from catalog ...
Done.
catalog.lookupFunctionInfo(funcIdentifier).getUsage
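Putting the two replies together, a sketch of the resulting row assembly (variable names taken from the diff hunks above):

```scala
val info = catalog.lookupFunctionInfo(funcIdentifier)
val rowData = Array[AnyRef](
  null, // FUNCTION_CAT
  db, // FUNCTION_SCHEM
  funcIdentifier.funcName, // FUNCTION_NAME
  info.getUsage, // REMARKS, padded with the function usage
  DatabaseMetaData.functionResultUnknown.asInstanceOf[AnyRef], // FUNCTION_TYPE
  info.getClassName) // SPECIFIC_NAME
rowSet.addRow(rowData)
```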
Can you also polish the description of the PR a bit?
Can you add to the description and code what GetFunctionsOperation actually does as well?
Thank you @bogdanghit @tgravescs, I have updated the description.
| checkResult(metaData.getTableTypes, Seq("TABLE", "VIEW"))
| }
| }
Can you add some tests for the usage and the function class name? @wangyum
I have added some tests:
assert(rs.getString("REMARKS").startsWith(s"${functionName(i)}("))
assert(rs.getString("SPECIFIC_NAME").startsWith("org.apache.spark.sql.catalyst"))

Do you think we need to assert more details?
It would be nice to run a DESCRIBE FUNCTION statement and then compare the results.
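A sketch of what such a test could look like (hypothetical; it reuses the `withJdbcStatement` helper from the suite and assumes the REMARKS text appears in the DESCRIBE FUNCTION output):

```scala
withJdbcStatement() { statement =>
  val rs = statement.getConnection.getMetaData.getFunctions(null, "default", "upper")
  assert(rs.next())
  val remarks = rs.getString("REMARKS")
  // Collect the DESCRIBE FUNCTION output lines and compare against REMARKS.
  val describe = statement.executeQuery("DESCRIBE FUNCTION upper")
  val lines = scala.collection.mutable.ArrayBuffer.empty[String]
  while (describe.next()) {
    lines += describe.getString(1)
  }
  assert(lines.exists(_.contains(remarks)))
}
```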
Our implementation pads the REMARKS field with the function usage; Hive returns an empty string. Edit: Also, please explain each unit test in the description. @wangyum
| // simply pass the `extended` as `arguments` and an empty string for `examples`.
| this(className, db, name, usage, extended, "", "", "", "");
| }
Do we really need to move this function and change the getters in ExpressionInfo.java?
I guess this is OK. @juliuszsompolski, what do you think about calling replaceFunctionName directly inside the getters?
I think it's a move in a good direction, as it will make things work better (not return nulls, and return the actual name instead of a placeholder) for anyone using ExpressionInfo directly through the SessionCatalog.lookupFunctionInfo API.
cc @gatorsmile, what do you think?
Created a new PR for this change: #25314
It's merged. Please rebase this PR. Thanks!
Done
Test build #108451 has finished for PR 25252 at commit
| statementId, e.getMessage, SparkUtils.exceptionString(e))
| throw e
| }
| HiveThriftServer2.listener.onStatementFinish(statementId) |
We also need the onStatementClosed handler like in the other ops.
Done
| withJdbcStatement() { statement =>
|   val metaData = statement.getConnection.getMetaData
|   val rs = metaData.getFunctions(null, "default", "upPer")
Thanks for writing all these tests. Was the capital P here intentional?
Yes.
Test build #108514 has finished for PR 25252 at commit
LGTM
LGTM. Thanks! Merged to master.
| matchingDbs.foreach { db =>
|   catalog.listFunctions(db, functionPattern).foreach {
|     case (funcIdentifier, _) =>
|       val info = catalog.lookupFunctionInfo(funcIdentifier)
@dongjoon-hyun @wangyum hmm... it looks like, when run with a wildcard schema pattern, all Spark builtin functions from FunctionRegistry are returned for every schema... This makes it return hundreds of thousands of rows for a big catalog with hundreds of schemas.
Should it return builtin functions only once, and in-schema functions only for UDFs registered in the catalog?
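A hedged sketch of that behavior (assuming the second element of `SessionCatalog.listFunctions`' result tuple distinguishes "SYSTEM" built-ins from "USER" functions, and an `addRow(schema, ident)` helper like the row assembly above):

```scala
// Emit built-in functions only once, under a null FUNCTION_SCHEM.
catalog.listFunctions(catalog.getCurrentDatabase, functionPattern)
  .collect { case (ident, "SYSTEM") => ident }
  .foreach(ident => addRow(null, ident))
// For each matching schema, emit only the UDFs registered in the catalog.
matchingDbs.foreach { db =>
  catalog.listFunctions(db, functionPattern)
    .collect { case (ident, "USER") => ident }
    .foreach(ident => addRow(db, ident))
}
```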
Makes sense.
@juliuszsompolski and @wangyum. Since this is an Apache Spark 3.0 feature, the suggestion sounds like a breaking API change. Is it safe?
Hi @dongjoon-hyun @wangyum. I checked this out with DbVisualizer and it is indeed loading the available function set for every schema in the database. I'm attaching two screenshots. If there are many schemas, it can take a while for a GetFunctions operation to return, and it also creates a poor experience because of the exhaustive and repetitive list of functions returned. This results in hundreds of thousands of rows that slow down the UI. That being said, it doesn't look like a breaking change to me, more like a bug, and fixing it would significantly improve UX. WDYT?
It sounds like you have a different definition of a breaking change. When a function suddenly returns different values, it is considered a breaking change to me.
BTW, I agree with your requirements. You might introduce a new internal configuration to add the behavior you want. The default should be the legacy behavior at least for one release, e.g., Apache Spark 3.3, and we need to add it to the SQL migration guide. WDYT, @bogdanghit?
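A hedged sketch of what such an internal legacy configuration could look like (the name, default, and version are assumptions, following SQLConf conventions):

```scala
val LEGACY_LIST_BUILTIN_FUNCTIONS_PER_SCHEMA =
  buildConf("spark.sql.legacy.thriftServer.listBuiltinFunctionsPerSchema")
    .internal()
    .doc("When true, GetFunctionsOperation returns built-in functions for every " +
      "matching schema (legacy behavior). When false, built-ins are returned only " +
      "once and each schema lists only its registered UDFs.")
    .version("3.3.0")
    .booleanConf
    .createWithDefault(true)
```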
It sounds good to me to have it under a legacy flag.


What changes were proposed in this pull request?
This PR implements Spark's own GetFunctionsOperation, which mitigates the differences between Spark SQL and Hive UDFs. Our implementation differs from Hive's in two ways:
- Our implementation pads the REMARKS field with the function usage; Hive returns an empty string.
- We do not support FUNCTION_TYPE, but Hive does.

How was this patch tested?
Unit tests.
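To illustrate what GetFunctionsOperation serves to clients, a minimal JDBC sketch (the connection URL is an assumption):

```scala
import java.sql.DriverManager

object GetFunctionsExample {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
    try {
      // DatabaseMetaData.getFunctions is answered by SparkGetFunctionsOperation.
      val rs = conn.getMetaData.getFunctions(null, "default", "upper")
      while (rs.next()) {
        println(s"${rs.getString("FUNCTION_NAME")}: " +
          s"remarks=${rs.getString("REMARKS")} " + // the function usage in Spark
          s"specificName=${rs.getString("SPECIFIC_NAME")}") // the function class name
      }
    } finally {
      conn.close()
    }
  }
}
```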