[SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 #28027
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,58 @@ | ||
| package org.apache.spark.sql.connector.catalog; | ||
|
|
||
| import org.apache.spark.annotation.Evolving; | ||
| import org.apache.spark.sql.connector.expressions.Transform; | ||
| import org.apache.spark.sql.types.DataType; | ||
|
|
||
| /** | ||
| * Interface for a metadata column. | ||
| * <p> | ||
| * A metadata column can expose additional metadata about a row. For example, rows from Kafka can | ||
| * use metadata columns to expose a message's topic, partition number, and offset. | ||
| * <p> | ||
| * A metadata column could also be the result of a transform applied to a value in the row. For | ||
| * example, a partition value produced by bucket(id, 16) could be exposed by a metadata column. In | ||
| * this case, {@link #transform()} should return a non-null {@link Transform} that produced the | ||
| * metadata column's values. | ||
| */ | ||
| @Evolving | ||
| public interface MetadataColumn { | ||
| /** | ||
| * The name of this metadata column. | ||
| * | ||
| * @return a String name | ||
| */ | ||
| String name(); | ||
|
|
||
| /** | ||
| * The data type of values in this metadata column. | ||
| * | ||
| * @return a {@link DataType} | ||
| */ | ||
| DataType dataType(); | ||
|
|
||
| /** | ||
| * @return whether values produced by this metadata column may be null | ||
| */ | ||
| default boolean isNullable() { | ||
| return true; | ||
| } | ||
|
|
||
| /** | ||
| * Documentation for this metadata column, or null. | ||
| * | ||
| * @return a documentation String | ||
| */ | ||
| default String comment() { | ||
| return null; | ||
| } | ||
|
|
||
| /** | ||
| * The {@link Transform} used to produce this metadata column from data rows, or null. | ||
| * | ||
| * @return a {@link Transform} used to produce the column's values, or null if there isn't one | ||
| */ | ||
| default Transform transform() { | ||
| return null; | ||
| } | ||
| } | ||
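
To make the interface above concrete, here is a minimal sketch of two implementations: a Kafka-style offset column and a column derived from a bucket transform. It is not code from this PR. The nested `MetadataColumn` is a stand-in that mirrors the interface in the diff so the example compiles without Spark; representing `dataType()` and `transform()` as plain strings, the column names `_offset` and `_bucket`, and the `describe` helper are all assumptions made for illustration.

```java
public class MetadataColumnSketch {
  // Stand-in for org.apache.spark.sql.connector.catalog.MetadataColumn.
  // Assumption: the real interface returns a Spark DataType and Transform;
  // plain strings are used here so the sketch runs without Spark.
  interface MetadataColumn {
    String name();
    String dataType();
    default boolean isNullable() { return true; }
    default String comment() { return null; }
    default String transform() { return null; }
  }

  // Hypothetical column exposing a Kafka message's offset.
  static final class OffsetColumn implements MetadataColumn {
    @Override public String name() { return "_offset"; }
    @Override public String dataType() { return "bigint"; }
    @Override public boolean isNullable() { return false; }
    @Override public String comment() { return "Offset of the message in its partition"; }
  }

  // Hypothetical column exposing a partition value produced by a transform.
  // It overrides transform() to return a non-null value, as the Javadoc asks.
  static final class BucketColumn implements MetadataColumn {
    @Override public String name() { return "_bucket"; }
    @Override public String dataType() { return "int"; }
    @Override public String transform() { return "bucket(id, 16)"; }
  }

  // Hypothetical helper that renders a column, relying on the interface defaults.
  static String describe(MetadataColumn column) {
    StringBuilder sb = new StringBuilder(column.name()).append(' ').append(column.dataType());
    if (!column.isNullable()) {
      sb.append(" NOT NULL");
    }
    if (column.transform() != null) {
      sb.append(" via ").append(column.transform());
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(describe(new OffsetColumn()));
    System.out.println(describe(new BucketColumn()));
  }
}
```

Only `name()` and `dataType()` have to be implemented; the defaults cover nullable columns with no comment and no transform, which keeps simple sources short.
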
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| package org.apache.spark.sql.connector.catalog; | ||
|
|
||
| import org.apache.spark.annotation.Evolving; | ||
| import org.apache.spark.sql.connector.read.SupportsPushDownRequiredColumns; | ||
| import org.apache.spark.sql.types.StructField; | ||
| import org.apache.spark.sql.types.StructType; | ||
|
|
||
| /** | ||
| * An interface for exposing data columns for a table that are not in the table schema. For example, | ||
| * a file source could expose a "file" column that contains the path of the file that contained each | ||
| * row. | ||
| * <p> | ||
|
Comment on lines +11 to +12

Contributor: Should we talk about the behavior of reserving names for metadata columns or the behavior that will happen during name collisions here (data columns will be selected over metadata)?

Contributor (Author): Good idea. I'll add that and rebase to fix the conflicts. Thanks!

Contributor (Author): I added this:
|
||
| * The columns returned by {@link #metadataColumns()} may be passed as {@link StructField} in | ||
| * requested projections. Sources that implement this interface and column projection using | ||
| * {@link SupportsPushDownRequiredColumns} must accept metadata fields passed to | ||
| * {@link SupportsPushDownRequiredColumns#pruneColumns(StructType)}. | ||
| * <p> | ||
| * If a table column and a metadata column have the same name, the metadata column will never be | ||
| * requested. It is recommended that Table implementations reject data column names that conflict | ||
| * with metadata column names. | ||
| */ | ||
| @Evolving | ||
| public interface SupportsMetadataColumns extends Table { | ||
| /** | ||
| * Metadata columns that are supported by this {@link Table}. | ||
| * <p> | ||
| * The columns returned by this method may be passed as {@link StructField} in requested | ||
| * projections using {@link SupportsPushDownRequiredColumns#pruneColumns(StructType)}. | ||
| * <p> | ||
| * If a table column and a metadata column have the same name, the metadata column will never be | ||
| * requested and is ignored. It is recommended that Table implementations reject data column names | ||
| * that conflict with metadata column names. | ||
| * | ||
| * @return an array of {@link MetadataColumn} | ||
| */ | ||
| MetadataColumn[] metadataColumns(); | ||
| } | ||
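
The name-collision rule in the Javadoc above (a data column hides a metadata column of the same name) can be sketched as follows. This is an illustration under stated assumptions, not Spark's resolution code: the class `MetadataNameResolution`, its `resolve` and `rejectConflicts` helpers, and the exact-match (case-sensitive) name comparison are all hypothetical.

```java
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

public class MetadataNameResolution {
  enum Source { DATA, METADATA }

  // Hypothetical lookup: data columns are checked first, so a metadata column
  // with the same name as a data column is never selected.
  static Optional<Source> resolve(
      String name, Set<String> dataColumns, Set<String> metadataColumns) {
    if (dataColumns.contains(name)) {
      return Optional.of(Source.DATA);
    }
    if (metadataColumns.contains(name)) {
      return Optional.of(Source.METADATA);
    }
    return Optional.empty();
  }

  // Hypothetical validation a Table implementation could run when a schema is
  // created or altered, following the recommendation to reject conflicting names.
  static void rejectConflicts(Set<String> dataColumns, Set<String> metadataColumns) {
    Set<String> conflicts = new TreeSet<>(dataColumns);
    conflicts.retainAll(metadataColumns);
    if (!conflicts.isEmpty()) {
      throw new IllegalArgumentException(
          "Data column names conflict with metadata columns: " + conflicts);
    }
  }
}
```

Rejecting conflicts up front means a metadata column can never be hidden by a data column of the same name, which would otherwise happen without any error.
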
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -886,6 +886,12 @@ case class SubqueryAlias( | |
| val qualifierList = identifier.qualifier :+ alias | ||
| child.output.map(_.withQualifier(qualifierList)) | ||
| } | ||
|
|
||
| override def metadataOutput: Seq[Attribute] = { | ||
| val qualifierList = identifier.qualifier :+ alias | ||
| child.metadataOutput.map(_.withQualifier(qualifierList)) | ||
| } | ||
|
Comment on lines +890 to +893

Contributor: why is this differentiation needed? Won't the metadata columns be part of

Contributor (Author): They are eventually part of the output, but they can't be at first because Instead, we add the metadata columns to this and then update column resolution to look up columns here. The result is that we can resolve everything just like normal, including
|
||
|
|
||
| override def doCanonicalize(): LogicalPlan = child.canonicalized | ||
| } | ||
|
|
||
|
|
||
hmm @brkyvz @rdblue don't we require a license header for new files?

Hmm, that's weird that RAT missed this.

I'll open a new PR to add the license headers.

Fixed in #30415

It was due to an incorrect exclusion rule. I made a PR.