
Conversation

@wangyum (Member) commented Jul 25, 2019

What changes were proposed in this pull request?

This PR implements Spark's own GetFunctionsOperation, which mitigates the differences between Spark SQL and Hive UDFs. Our implementation differs from Hive's in two ways: we fill the REMARKS column with the function usage (Hive returns an empty string), and we do not support the FUNCTION_TYPE column (Hive does).

How was this patch tested?

unit tests
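
For context, GetFunctionsOperation is what answers the standard JDBC metadata call getFunctions on the Thrift server. A minimal sketch of how a client exercises it (the connection URL and function name here are illustrative, not taken from the patch):

    import java.sql.DriverManager

    // Assumes a Spark Thrift server running locally and the Hive JDBC driver
    // on the classpath; the URL is an example only.
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
    try {
      // getFunctions(catalog, schemaPattern, functionNamePattern) is the call
      // served by the GetFunctionsOperation implemented in this PR.
      val rs = conn.getMetaData.getFunctions(null, "default", "upper")
      while (rs.next()) {
        println(rs.getString("FUNCTION_SCHEM") + "." +
          rs.getString("FUNCTION_NAME") + ": " + rs.getString("REMARKS"))
      }
    } finally {
      conn.close()
    }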

val functionPattern = CLIServiceUtils.patternToRegex(functionName)
if ((null == catalogName || "".equals(catalogName))
    && (null == schemaName || "".equals(schemaName))) {
  catalog.listFunctions(catalog.getCurrentDatabase, functionPattern).foreach {
Member Author:

Actually, I'm confused about this code. Maybe it should be:

val functionPattern = CLIServiceUtils.patternToRegex(functionName)
matchingDbs.foreach { schema =>
  catalog.listFunctions(catalog.getCurrentDatabase, functionPattern).foreach {
    case (functionIdentifier, _) =>
      val rowData = Array[AnyRef](
        DEFAULT_HIVE_CATALOG, // FUNCTION_CAT
        schema, // FUNCTION_SCHEM
        functionIdentifier.funcName, // FUNCTION_NAME
        "", // REMARKS
        DatabaseMetaData.functionResultUnknown.asInstanceOf[AnyRef], // FUNCTION_TYPE
        "") // SPECIFIC_NAME
      rowSet.addRow(rowData)
  }
}

But it's the logic of Hive: https://github.com/apache/hive/blob/rel/release-3.1.1/service/src/java/org/apache/hive/service/cli/operation/GetFunctionsOperation.java#L101-L119

Contributor:

I think we should cover the case of functions that don't have a schema (null), which is basically what Hive's implementation seems to do, as well as the functions associated with a given schema, which is what your code snippet above seems to do. Could you combine both?

https://docs.microsoft.com/en-us/sql/connect/jdbc/reference/getfunctions-method-sqlserverdatabasemetadata?view=sql-server-2017
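
A sketch of what combining the two cases could look like, reusing the names from the snippets above (matchingDbs, rowSet, DEFAULT_HIVE_CATALOG); treating an empty database in the FunctionIdentifier as the schema-less built-in case is my assumption about the intent, not code from the patch:

    val functionPattern = CLIServiceUtils.patternToRegex(functionName)
    matchingDbs.foreach { db =>
      catalog.listFunctions(db, functionPattern).foreach {
        case (functionIdentifier, _) =>
          // Built-in functions carry no database, so report a null schema
          // (the Hive-like case); catalog functions report their schema.
          val schema = functionIdentifier.database.orNull
          rowSet.addRow(Array[AnyRef](
            DEFAULT_HIVE_CATALOG, // FUNCTION_CAT
            schema, // FUNCTION_SCHEM
            functionIdentifier.funcName, // FUNCTION_NAME
            "", // REMARKS
            DatabaseMetaData.functionResultUnknown.asInstanceOf[AnyRef], // FUNCTION_TYPE
            "")) // SPECIFIC_NAME
      }
    }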

Member Author:

null, // FUNCTION_SCHEM
functionIdentifier.funcName, // FUNCTION_NAME
"", // REMARKS
DatabaseMetaData.functionResultUnknown.asInstanceOf[AnyRef], // FUNCTION_TYPE
Member Author:

We do not support FUNCTION_TYPE for now, so it is set to Unknown:

    // java.sql.DatabaseMetaData

    /**
     * Indicates that it is not known whether the function returns
     * a result or a table.
     * <P>
     * A possible value for column <code>FUNCTION_TYPE</code> in the
     * <code>ResultSet</code> object returned by the method
     * <code>getFunctions</code>.
     * @since 1.6
     */
    int functionResultUnknown   = 0;

    /**
     * Indicates that the function  does not return a table.
     * <P>
     * A possible value for column <code>FUNCTION_TYPE</code> in the
     * <code>ResultSet</code> object returned by the method
     * <code>getFunctions</code>.
     * @since 1.6
     */
    int functionNoTable         = 1;

    /**
     * Indicates that the function  returns a table.
     * <P>
     * A possible value for column <code>FUNCTION_TYPE</code> in the
     * <code>ResultSet</code> object returned by the method
     * <code>getFunctions</code>.
     * @since 1.6
     */
    int functionReturnsTable    = 2;

@SparkQA commented Jul 25, 2019

Test build #108152 has finished for PR 25252 at commit fd49906.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member Author) commented Jul 25, 2019

retest this please

@SparkQA commented Jul 25, 2019

Test build #108157 has finished for PR 25252 at commit fd49906.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member):

cc @juliuszsompolski

@juliuszsompolski (Contributor):

cc @bogdanghit

@bogdanghit (Contributor) left a comment:

Thanks for working on this, it looks pretty good. A few comments below.

HiveThriftServer2.listener.onStatementError(
statementId, e.getMessage, SparkUtils.exceptionString(e))
throw e
}
Contributor:

Shouldn't we handle other exceptions too?

Member Author:

Maybe it should be the same as SparkExecuteStatementOperation:

// Actually do need to catch Throwable as some failures don't inherit from Exception and
// HiveServer will silently swallow them.
case e: Throwable =>
  val currentState = getStatus().getState()
  logError(s"Error executing query, currentState $currentState, ", e)
  setState(OperationState.ERROR)
  HiveThriftServer2.listener.onStatementError(
    statementId, e.getMessage, SparkUtils.exceptionString(e))
  throw new HiveSQLException(e.toString)

Contributor:

Or the same as GetTablesOperation and GetSchemasOperation for the sake of consistency?

Member Author:

+1 for same as GetTablesOperation and GetSchemasOperation.
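
For reference, a simplified sketch of the shared pattern in GetTablesOperation and GetSchemasOperation that the thread settles on (the lookup body is elided; this is not a verbatim quote):

    try {
      // ... look up metadata and fill rowSet ...
      setState(OperationState.FINISHED)
    } catch {
      case e: HiveSQLException =>
        setState(OperationState.ERROR)
        HiveThriftServer2.listener.onStatementError(
          statementId, e.getMessage, SparkUtils.exceptionString(e))
        throw e
    }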

@SparkQA commented Jul 29, 2019

Test build #108326 has finished for PR 25252 at commit ee85a38.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

statementId,
parentSession.getUsername)

try {
Contributor:

This looks correct now: retrieve all functions available for all matching schemas.

functionIdentifier.funcName, // FUNCTION_NAME
"", // REMARKS
DatabaseMetaData.functionResultUnknown.asInstanceOf[AnyRef], // FUNCTION_TYPE
"")
@bogdanghit (Contributor) Jul 30, 2019:

This is the function class name, which I think we can get through a catalog.getFunction(funcIdentifier).className call.

Member Author:

Done.

catalog.lookupFunctionInfo(funcIdentifier).getClassName

null, // FUNCTION_CAT
db, // FUNCTION_SCHEM
functionIdentifier.funcName, // FUNCTION_NAME
"", // REMARKS
Contributor:

I wonder if we can get the function usage somehow from catalog ...

Member Author:

Done.

catalog.lookupFunctionInfo(funcIdentifier).getUsage
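
Putting the two lookups together, the emitted row presumably ends up along these lines (a sketch assembled from this thread, not a verbatim quote of the final patch):

    val info = catalog.lookupFunctionInfo(funcIdentifier)
    rowSet.addRow(Array[AnyRef](
      DEFAULT_HIVE_CATALOG, // FUNCTION_CAT
      db, // FUNCTION_SCHEM
      funcIdentifier.funcName, // FUNCTION_NAME
      info.getUsage, // REMARKS: the function usage string
      DatabaseMetaData.functionResultUnknown.asInstanceOf[AnyRef], // FUNCTION_TYPE
      info.getClassName)) // SPECIFIC_NAME: the implementing class name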

@bogdanghit (Contributor):

Can you also polish a bit the description of the PR?

  1. This PR implements Spark's own GetFunctionsOperation which mitigates the differences between Spark SQL and Hive UDFs.
  2. A set of unit tests that check the behavior of the operations for exact and partial matches on the schema name.

@tgravescs (Contributor):

Can you also add to the description and the code what GetFunctionsOperation actually does?

@wangyum (Member Author) commented Jul 31, 2019

Thank you @bogdanghit @tgravescs, I have updated the description.

checkResult(metaData.getTableTypes, Seq("TABLE", "VIEW"))
}
}

@bogdanghit (Contributor) Jul 31, 2019:

Can you add some tests for the usage and the function class name? @wangyum

Member Author:

I have added some tests:

assert(rs.getString("REMARKS").startsWith(s"${functionName(i)}("))
assert(rs.getString("SPECIFIC_NAME").startsWith("org.apache.spark.sql.catalyst"))

Do you think we need to assert more details?

Contributor:

Would be nice to run a DESCRIBE function statement and then compare the results.
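
A sketch of such a check inside the suite's withJdbcStatement harness (the row layout of DESCRIBE FUNCTION output assumed here — "Function:", "Class:", "Usage:" lines — is my reading, not verified against the final test):

    withJdbcStatement() { statement =>
      val metaData = statement.getConnection.getMetaData
      val rs = metaData.getFunctions(null, "default", "upper")
      assert(rs.next())

      // DESCRIBE FUNCTION prints lines such as "Class: <className>"; the class
      // name should match the SPECIFIC_NAME column returned by getFunctions.
      val describe = statement.executeQuery("DESCRIBE FUNCTION upper")
      val lines = scala.collection.mutable.ArrayBuffer.empty[String]
      while (describe.next()) {
        lines += describe.getString(1)
      }
      assert(lines.exists(_.contains(rs.getString("SPECIFIC_NAME"))))
    }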


@bogdanghit (Contributor) commented Jul 31, 2019

Our implementation pads the REMARKS field with the function usage - Hive returns an empty string.
Our implementation does not support FUNCTION_TYPE, but Hive does.

Edit: Also, please explain each unit test in the description. @wangyum

// simply pass the `extended` as `arguments` and an empty string for `examples`.
this(className, db, name, usage, extended, "", "", "", "");
}

Contributor:

Do we really need to move this function and change the getters in ExpressionInfo.java?

Contributor:

I guess this is OK. @juliuszsompolski, what do you think about calling replaceFunctionName directly inside the getters?

Contributor:

I think it's a move in a good direction, as it will make things work better (not return nulls, and return the actual name instead of a placeholder) for anyone using ExpressionInfo directly through the SessionCatalog.lookupFunctionInfo API.
cc @gatorsmile, what do you think?

Member Author:

Created a new PR for this change: #25314

Member:

It's merged. Please rebase this PR. Thanks!

Member Author:

Done

@SparkQA commented Jul 31, 2019

Test build #108451 has finished for PR 25252 at commit f37d653.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

statementId, e.getMessage, SparkUtils.exceptionString(e))
throw e
}
HiveThriftServer2.listener.onStatementFinish(statementId)
Contributor:

We also need the onStatementClosed handler like in the other ops.

Member Author:

Done


withJdbcStatement() { statement =>
val metaData = statement.getConnection.getMetaData
val rs = metaData.getFunctions(null, "default", "upPer")
Contributor:

Thanks for writing all these tests. Was the capital P here intentional?

Member Author:

Yes.

@SparkQA commented Aug 1, 2019

Test build #108514 has finished for PR 25252 at commit c95db44.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bogdanghit (Contributor):

LGTM

@bogdanghit (Contributor):

@gatorsmile

@gatorsmile (Member):

LGTM

Thanks! Merged to master.

@gatorsmile gatorsmile closed this in efd9299 Aug 2, 2019
@wangyum wangyum deleted the SPARK-28510 branch August 2, 2019 15:58
Comment on lines +91 to +94
matchingDbs.foreach { db =>
catalog.listFunctions(db, functionPattern).foreach {
case (funcIdentifier, _) =>
val info = catalog.lookupFunctionInfo(funcIdentifier)
@juliuszsompolski (Contributor):

@dongjoon-hyun @wangyum hmm... it looks like if this is run with a wildcard schema pattern, all Spark built-in functions from FunctionRegistry are returned for every schema... That makes it return hundreds of thousands of rows for a big catalog with hundreds of schemas.
Should it return built-in functions only once, and in-schema functions only for UDFs registered in the catalog?

Member Author:

Makes sense.
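
A sketch of the suggested split (it assumes listFunctions tags results as "SYSTEM" or "USER", and addFunctionRow is a hypothetical helper that builds one result row; the split itself is the proposal here, not merged code):

    // Built-in functions: not tied to any schema, so emit them once, with a
    // null FUNCTION_SCHEM.
    catalog.listFunctions(catalog.getCurrentDatabase, functionPattern)
      .collect { case (funcIdentifier, "SYSTEM") => funcIdentifier }
      .foreach(f => addFunctionRow(null, f))

    // User-defined functions: emit only for the schema(s) they are registered in.
    matchingDbs.foreach { db =>
      catalog.listFunctions(db, functionPattern)
        .collect { case (funcIdentifier, "USER") => funcIdentifier }
        .foreach(f => addFunctionRow(db, f))
    }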

@dongjoon-hyun (Member) Oct 29, 2021:

@juliuszsompolski and @wangyum. Since this is an Apache Spark 3.0 feature, the suggestion sounds like a breaking API change. Is it safe?

@bogdanghit (Contributor) commented Dec 14, 2021

Hi @dongjoon-hyun @wangyum. I checked this out with DbVisualizer and it is indeed loading the available function set for every schema in the database. I'm attaching two screenshots:
Top of the list:
[screenshot]

After scrolling down:
[screenshot]

If there are many schemas, a GetFunctions operation can take a while to return, and the exhaustive, repetitive list of functions makes for a poor experience.

This results in hundreds of thousands of rows that slow down the UI. That said, it doesn't look like a breaking change to me, more like a bug, and fixing it would significantly improve UX.

WDYT?

@dongjoon-hyun (Member) commented Dec 14, 2021

Quoting @bogdanghit: "This results in hundreds of thousands of rows that slow down the UI. That said, it doesn't look like a breaking change to me."

It sounds like you have a different meaning of a breaking change: when a function suddenly returns different values, that is a breaking change to me.

BTW, I agree with your requirements. You might introduce a new internal configuration to add the behavior you want. The default should be the legacy behavior for at least one release, e.g., Apache Spark 3.3, and we need to add it to the SQL migration guide.

WDYT, @bogdanghit ?

@juliuszsompolski (Contributor):

It sounds good to me to have it under a legacy flag.
cc @AngersZhuuuu who raised https://github.com/apache/spark/pull/34453/files for it.
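
For illustration, such a legacy flag would presumably look like the following SQLConf entry (the config name, default, and doc string here are hypothetical, not taken from #34453):

    // Hypothetical config sketch following the usual SQLConf builder pattern.
    val LEGACY_LIST_BUILTIN_FUNCTIONS_PER_SCHEMA =
      buildConf("spark.sql.legacy.thriftServer.listBuiltinFunctionsPerSchema")
        .internal()
        .doc("When true, GetFunctionsOperation returns built-in functions for every " +
          "matching schema (the legacy behavior). When false, built-in functions are " +
          "returned only once, and schema-qualified rows are emitted only for UDFs.")
        .version("3.3.0")
        .booleanConf
        .createWithDefault(true)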
