Skip to content

Conversation

@bogdanrdc
Copy link
Contributor

What changes were proposed in this pull request?

SQL hint syntax:

  • support expressions such as strings, numbers, etc. instead of only identifiers as it is currently.
  • support multiple hints, which was missing compared to the DataFrame syntax.

DataFrame API:

  • support any parameters in DataFrame.hint instead of just strings

How was this patch tested?

Existing tests. New tests in PlanParserSuite. New suite DataFrameHintSuite.

@SparkQA
Copy link

SparkQA commented May 24, 2017

Test build #77300 has finished for PR 18086 at commit 5439468.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UnresolvedHint(name: String, parameters: Seq[Any], child: LogicalPlan)

@rxin
Copy link
Contributor

rxin commented May 25, 2017


/**
* Add a [[UnresolvedHint]] to a logical plan.
* Add a [[UnresolvedHint]]s to a logical plan.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove a

query: LogicalPlan): LogicalPlan = withOrigin(ctx) {
val stmt = ctx.hintStatement
UnresolvedHint(stmt.hintName.getText, stmt.parameters.asScala.map(_.getText), query)
var plan = query
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using foldLeft instead of having a var?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I think foldLeft is almost always a bad idea ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used foldRight somewhere too. Why is it a bad idea?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i always find a loop simpler to reason about ...

* A pair of (name, parameters).
*/
case class UnresolvedHint(name: String, parameters: Seq[String], child: LogicalPlan)
case class UnresolvedHint(name: String, parameters: Seq[Any], child: LogicalPlan)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we use Expression as type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use Expression then either:

  • Dataset.hint parameters should be Expression too, in which case you can't do df.hint("hint", 1, 2, "c") you'd have to do df.hint("hint", Literal(1), Literal(2), Literal("c")) or a shortcut if there is
  • Dataset.hint accepts Any but then has to convert Any to Expressions. One problem here is that Seq(1,2,3) can't be converted to Literal. So you have to use df.hint("hint", Array(1,2,3))

The disadvantage of have Any in UnresolvedHint is that to resolve the hint you have to check both for String and Literal(String) but the API is easier to use.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can keep Any in the API(df.hint(xxx)), but use Expression in UnresolvedHint, what do you think?

Copy link
Contributor Author

@bogdanrdc bogdanrdc May 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One useful hint parameter is a list of columns.
Something like df.hint("hint", $"table", Seq($"col1", $"col2", $"col3"))

In this case UnresolvedHint could be called like this:
UnresolvedHint(name: String, parameters: Seq(Expression, Seq[Expression]), child)

But if UnresolvedHint.parameters is Seq[Expression] then it's not possible to have this kind of hint.

@SparkQA
Copy link

SparkQA commented May 26, 2017

Test build #77421 has finished for PR 18086 at commit d386cdf.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HasMinSupport(Params):
  • class HasNumPartitions(Params):
  • class HasMinConfidence(Params):
  • case class AnalysisBarrier(child: LogicalPlan) extends LeafNode
  • case class ResolvedHint(child: LogicalPlan, hints: HintInfo = HintInfo())
  • case class HintInfo(

@SparkQA
Copy link

SparkQA commented May 26, 2017

Test build #77424 has finished for PR 18086 at commit 6e40301.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

fromClause?
(WHERE where=booleanExpression)?)
| ((kind=SELECT hint? setQuantifier? namedExpressionSeq fromClause?
| ((kind=SELECT (hints+=hint)* setQuantifier? namedExpressionSeq fromClause?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Hive and Oracle, multiple hints are put in the same /*+ */.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This patch supports both.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile does hive support multiple /*+ */?

Copy link
Member

@gatorsmile gatorsmile May 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. Hive does not support multiple /*+ */

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not hurt anything if we support more hint styles, as long as they are user-friendly.


hint
: '/*+' hintStatement '*/'
: '/*+' hintStatements+=hintStatement (hintStatements+=hintStatement)* '*/'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the same block /*+ */, multiple hints are separated by commas in Hive. However, in Oracle, it is separated by spaces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added support for optional comma

/**
* A general hint for the child that is not yet resolved. This node is generated by the parser and
* should be removed This node will be eliminated post analysis.
* A pair of (name, parameters).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs an update.

case tableName: String => tableName
case tableId: UnresolvedAttribute => tableId.name
case unsupported => throw new AnalysisException("Broadcast hint parameter should be " +
s" identifier or string but was $unsupported (${unsupported.getClass}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: s" identifier or string -> s"an identifier or string

* limitations under the License.
*/

package org.apache.spark.sql
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally, we move such a test suite to org.apache.spark.sql.catalyst. We just need to add hint into org.apache.spark.sql.catalyst.dsl.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new test for dsl. I also want a test that calls df.hint

parsePlan("SELECT /*+ HINT1(a, 1) hint2(b, 2) */ * from t"),
UnresolvedHint("hint2", Seq($"b", Literal(2)),
UnresolvedHint("HINT1", Seq($"a", Literal(1)),
table("t").select(star())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Indent

* A pair of (name, parameters).
*/
case class UnresolvedHint(name: String, parameters: Seq[String], child: LogicalPlan)
case class UnresolvedHint(name: String, parameters: Seq[Any], child: LogicalPlan)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To support multiple parameters in hint, does it make sense to do it like df.hint("hint", "1, 2, c")? We can use our Parser to parse this parameter string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that could be something extra. The DF API should accept scala expressions too: function calls (df.hint("hint", getInterestingValues()))

@SparkQA
Copy link

SparkQA commented May 30, 2017

Test build #77531 has finished for PR 18086 at commit 8daa05e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

why rename DataFrameSuite?

@gatorsmile
Copy link
Member

LGTM pending Jenkins

@SparkQA
Copy link

SparkQA commented May 30, 2017

Test build #77536 has finished for PR 18086 at commit 09635a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DataFrameSuite extends QueryTest with SharedSQLContext

r1.hint("hint1"),
UnresolvedHint("hint1", Seq(),
r1
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we collapse it to the previous line?

r1.hint("hint1", 1, $"a"),
UnresolvedHint("hint1", Seq(1, $"a"),
r1
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

r1.hint("hint1", Seq(1, 2, 3), Seq($"a", $"b", $"c")),
UnresolvedHint("hint1", Seq(Seq(1, 2, 3), Seq($"a", $"b", $"c")),
r1
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto


// Hint
withWindow.optionalMap(hint)(withHints)
hints.asScala.foldRight(withWindow)(withHints)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we construct the hint from right to left?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so that select /*+ hint1() /* /*+ hint2() */produces Hint1(Hint2(plan)) and not Hint2(Hint1(plan)). withHints adds a Hint on top so the last one folded is the top most.


private def check(df: Dataset[_], expected: LogicalPlan) = {
comparePlans(
EliminateBarriers(df.queryExecution.logical),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that PR has been reverted, can you rebase?

@SparkQA
Copy link

SparkQA commented Jun 1, 2017

Test build #77636 has finished for PR 18086 at commit 7776ae6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Uuid() extends LeafExpression

@asfgit asfgit closed this in 2134196 Jun 1, 2017
@cloud-fan
Copy link
Contributor

thanks, merging to master/2.2!

asfgit pushed a commit that referenced this pull request Jun 1, 2017
SQL hint syntax:
* support expressions such as strings, numbers, etc. instead of only identifiers as it is currently.
* support multiple hints, which was missing compared to the DataFrame syntax.

DataFrame API:
* support any parameters in DataFrame.hint instead of just strings

Existing tests. New tests in PlanParserSuite. New suite DataFrameHintSuite.

Author: Bogdan Raducanu <[email protected]>

Closes #18086 from bogdanrdc/SPARK-20854.

(cherry picked from commit 2134196)
Signed-off-by: Wenchen Fan <[email protected]>
)

comparePlans(
parsePlan("SELECT /*+ HINT1(a, array(1, 2, 3)) */ * from t"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test case redundant?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, @bogdanrdc can you send a follow-up PR to clean it up?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants