
Conversation

@jzhuge (Member) commented Feb 21, 2019

What changes were proposed in this pull request?

  • Support N-part identifier in SQL
  • N-part identifier extractor in Analyzer

How was this patch tested?

  • A new unit test suite ResolveMultipartRelationSuite
  • CatalogLoadingSuite
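For context, the N-part identifier support above can be sketched roughly as follows. This is a simplified, illustrative version only; Spark's real parser is an ANTLR rule in SqlBase.g4 that also handles backquoted parts, which this toy version does not.

```scala
// Toy sketch of splitting an N-part identifier such as
// `catalog.db.table` into its parts. Illustrative only; the real
// implementation lives in the SQL grammar and AstBuilder.
def parseMultipartIdentifier(name: String): Seq[String] =
  name.split('.').toSeq
```

For example, `parseMultipartIdentifier("prod.db.events")` yields `Seq("prod", "db", "events")`, which the analyzer can then resolve against a catalog.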

@rdblue @cloud-fan @mccheah

@dongjoon-hyun (Member) commented Feb 22, 2019

ok to test. Thank you, @jzhuge !


@jzhuge jzhuge changed the title [SPARK-26946][SQL][WIP] Identifiers for multi-catalog [SPARK-26946][SQL] Identifiers for multi-catalog Mar 8, 2019

@jzhuge (Member, Author) commented Mar 8, 2019

Looking at the build failure


Contributor:

This is a top-level parser entry used in ParserInterface. I don't think we need it now for catalog identifier.

Member (Author):

True, only my test case uses it to parse a table name into a sequence. I will remove it.

Contributor:

Won't we need this eventually for parsing names passed to saveAsTable? Why not add it now?

Member (Author):

When I start to convert the SELECT, INSERT, and DROP code paths to support multi-catalog, this parse function is needed, e.g.:

  override def visitTable(ctx: TableContext): LogicalPlan = withOrigin(ctx) {
    UnresolvedIdentifier(visitMultiPartIdentifier(ctx.multiPartIdentifier))
  }

  override def visitTableName(ctx: TableNameContext): LogicalPlan = withOrigin(ctx) {
    val tableId = visitMultiPartIdentifier(ctx.multiPartIdentifier())
    val table = mayApplyAliasPlan(ctx.tableAlias, UnresolvedIdentifier(tableId))
    table.optionalMap(ctx.sample)(withSample)
  }

@mccheah (Contributor) commented Mar 16, 2019

Sorry for breaking up my review into individual comments. I think this looks ok short of some style changes.

Contributor:

This uses a lot of temporary classes to simulate future rules that match multi-part identifiers. I think I would rather include an update that adds new UnresolvedRelation nodes and uses them instead of test plan nodes, but I'd be interested to hear whether @cloud-fan agrees.

Member (Author):

OK either way. I have already converted SELECT/INSERT/DROP code paths to support multi-catalog in my private 2.3 branch. Pretty straightforward. Converting CREATE would be a lot easier with Ryan's PR 24029.


@rdblue (Contributor) commented Mar 18, 2019

@jzhuge, this looks really close to being ready to me!


@dilipbiswal (Contributor) commented

retest this please


Contributor:

Minor: this looks like a Scaladoc convention used in Javadoc. It should be {@link Identifier}.

@rdblue (Contributor) commented Mar 19, 2019

+1

This looks good to me. @cloud-fan, do you have any more review comments?


Contributor:

Shall we use a class directly? I don't see much value of using an interface here, as it has only one implementation.

Contributor:

This allows us more flexibility than a single concrete class. Changing a class to an interface is not a binary compatible change, so using an interface is the right thing to do.

Contributor:

Then I suggest we move the impl class to a private package like org.apache.spark.sql.catalyst. Also the static method should be moved to the impl class as well, as we only create it inside Spark.

Contributor:

The implementation class is package-private. If we were to move it to a different package, we would need to make it public for the factory method, which would increase its visibility, not decrease it.
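The interface-plus-factory pattern under discussion can be sketched like this. The shapes are illustrative, not the exact Spark API; the names mirror the PR's Identifier/IdentifierImpl:

```scala
// Public interface with a factory method; callers never depend on the
// concrete class (in the real code the impl is package-private).
trait Identifier {
  def namespace: Seq[String]
  def name: String
}

object Identifier {
  // Only this factory constructs the implementation, so the impl
  // class can evolve without breaking API consumers.
  def of(namespace: Seq[String], name: String): Identifier =
    IdentifierImpl(namespace, name)
}

case class IdentifierImpl(namespace: Seq[String], name: String)
  extends Identifier
```

Because callers only ever see the `Identifier` trait, the implementation can later change shape without a binary-incompatible API change.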

Contributor:

Why is it a trait?

My understanding is that this PR adds the catalog object identifier class and the related parser support. I don't think we have a detailed design of how the analyzer looks up catalogs yet.

Contributor:

This trait provides extractors, similar to a trait like PredicateHelper. These implement the resolution rules from the SPIP using a generic catalog lookup provided by the implementation.

This decouples the resolution rules from how the analyzer looks up catalogs and provides convenient extractors that implement those rules.
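A rough sketch of that extractor pattern, assuming a lookupCatalog function supplied by the implementor (the real LookupCatalog trait differs in detail):

```scala
// Illustrative extractor trait: mixing it in gives the analyzer a
// pattern that splits a multi-part name into (catalog, rest) when the
// first part resolves to a catalog, independently of how the catalog
// lookup itself is performed.
trait LookupCatalogSketch {
  // supplied by the implementation, e.g. the analyzer
  def lookupCatalog(name: String): Option[String]

  object CatalogAndIdentifier {
    def unapply(parts: Seq[String]): Option[(String, Seq[String])] =
      parts match {
        case head +: rest if lookupCatalog(head).isDefined =>
          Some((head, rest))
        case _ => None
      }
  }
}
```

A resolution rule can then match `case CatalogAndIdentifier(catalog, ident) => ...` without knowing where the catalog registry lives.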

Contributor:

Then this should be an internal trait under a private package like org.apache.spark.sql.catalyst.

jzhuge added 10 commits March 20, 2019 18:29
  • Create org.apache.spark.sql.catalog.v2.Identifier and IdentifierImpl.
  • Inherit CatalogIdentifier from v2.Identifier.
  • Encapsulate lookupCatalog and extractor into trait LookupCatalog.
  • SqlBase.g4: Replace MultiPart with Multipart.
  • Rename and simplify the unit test ResolveMultipartIdentifierSuite.
  • Add extractor LookupCatalog.AsTableIdentifier and a unit test.
  • Remove CatalogIdentifier.
  • Add comment for AsTableIdentifier to emphasize legacy support only.
@SparkQA commented Mar 21, 2019

Test build #103750 has finished for PR 23848 at commit 3bb4485.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mccheah (Contributor) left a comment

@cloud-fan have we addressed all your comments, or did you have any other feedback you wanted to give? Would like to merge this soon to unblock other V2 work, particularly table catalogs.

Possibly try to merge before EOD Pacific time today, at the very latest before end of week?

For everyone else following, please feel free to leave any feedback we would like to address before this goes in.

Review comment on:

  this(catalog, conf, conf.optimizerMaxIterations)
  }

  def this(lookupCatalog: Option[(String) => CatalogPlugin], catalog: SessionCatalog,
Contributor:

Who will call this constructor? I feel we are adding too much code for future use only. Can we add them when they are needed? It would be good if this PR only added the identifier interface and impl class, plus the related parser rules, which is pretty easy to justify.

Contributor:

@cloud-fan, I think this commit is reasonably self-contained. Nit-picking about whether a constructor is added in this commit or the next isn't adding much value.

Keep in mind that we make commits self-contained to decrease conflicts and increase the rate at which we can review and accept patches. Is putting this in the next commit really worth the time it takes to change and test that change, if it means that this work is delayed another day?

@cloud-fan (Contributor) commented
The parser part and identifier interface/impl class LGTM. The catalog lookup part looks reasonable, but I'm not very confident without seeing the actual use case. To move things forward, I'm merging this. I may refactor this part after the table catalog gets in.

@cloud-fan (Contributor) commented

thanks, merging to master!

@jzhuge (Member, Author) commented Mar 22, 2019

Thanks @cloud-fan !

@cloud-fan cloud-fan closed this in 80565ce Mar 22, 2019
@rdblue (Contributor) commented Mar 22, 2019

Thanks for merging, @cloud-fan, and thanks for working on this, @jzhuge!

mccheah pushed a commit to palantir/spark that referenced this pull request May 15, 2019
Closes apache#23848 from jzhuge/SPARK-26946.

Authored-by: John Zhuge <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
rdblue pushed a commit to rdblue/spark that referenced this pull request May 19, 2019
jzhuge added a commit to jzhuge/spark that referenced this pull request Oct 15, 2019