Skip to content

Conversation

@gatorsmile
Copy link
Member

What changes were proposed in this pull request?

The statistics-related table properties should be skipped by SHOW CREATE TABLE, since it could be incorrect in the newly created table. See the Hive JIRA: https://issues.apache.org/jira/browse/HIVE-13792

CREATE TABLE t1 (
  c1 INT COMMENT 'bla',
  c2 STRING
)
LOCATION '$dir'
TBLPROPERTIES (
  'prop1' = 'value1',
  'prop2' = 'value2'
)

The output of SHOW CREATE TABLE t1 is

CREATE EXTERNAL TABLE `t1`(`c1` int COMMENT 'bla', `c2` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-ee317538-0f8c-42d0-b08c-cf077d94fe75'
TBLPROPERTIES (
  'rawDataSize' = '-1',
  'numFiles' = '0',
  'transient_lastDdlTime' = '1472424052',
  'totalSize' = '0',
  'prop1' = 'value1',
  'prop2' = 'value2',
  'COLUMN_STATS_ACCURATE' = 'false',
  'numRows' = '-1'
)

After the fix, the output becomes

CREATE EXTERNAL TABLE `t1`(`c1` int COMMENT 'bla', `c2` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-74058a6d-db8b-41c1-9bda-bd449f1a78ed'
TBLPROPERTIES (
  'transient_lastDdlTime' = '1472423603',
  'prop1' = 'value1',
  'prop2' = 'value2'
)

How was this patch tested?

Updated the existing test cases.

case (key, _) => key == "EXTERNAL" && metadata.tableType == EXTERNAL
// Skips all the stats info (See the JIRA: HIVE-13792)
case (key, _) =>
key == "numFiles" || key == "numRows" || key == "totalSize" || key == "numPartitions" ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that the list of hidden properties will grow in future? If so, can we not add them with || here? A separated list like the excludedTableProperties below seems good. And we can check if the key is in the list.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let me change it. Thanks!

@SparkQA
Copy link

SparkQA commented Aug 29, 2016

Test build #64560 has finished for PR 14855 at commit ce8e8b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member Author

cc @hvanhovell @cloud-fan @liancheng @yhuai Thanks!

val filteredProps = metadata.properties.filterNot {
// Skips "EXTERNAL" property for external tables
case (key, _) => key == "EXTERNAL" && metadata.tableType == EXTERNAL
// Skips all the stats info (See the JIRA: HIVE-13792)
Copy link
Contributor

@hvanhovell hvanhovell Aug 29, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we fix this in the HiveExternalCatalog and just drop those properties there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure whether I got your point. Are you saying getTableOption in HiveExternalCatalog?

Initially, I had the same design like you. Later, I realized we still use/display them in the other DDL statements, for example, DESCRIBE EXTENDED TABLE.

Let me know if my understanding is wrong. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean in general: we should not leak Hive specific implementation (i.e. hacks) into sql/core.

At first I thought that we might be able to hide/filter them but you have a point. The good news is that this becomes a non-issue when we have statistics as a part of the CatalogTable. Then these properties can become the problem of a yet-to-be written translation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agree on this general rule.

When we support translations, we need to be very careful about including these statistics info in the SHOW CREATE TABLE DDL. Hive does not include them in SHOW CREATE TABLE, as shown in their JIRA: https://issues.apache.org/jira/browse/HIVE-13792. If we allow users to provide the statistics info when creating the tables, we might need to mark them as inaccurate, like what Hive does now?

BTW, should we merge this in 2.1 before we support the translation? So far, Spark 2.0 has the bug. Let me know what I should do next.

Thanks!

@cloud-fan
Copy link
Contributor

cloud-fan commented Aug 30, 2016

Where do we need the hive table statistics properties(does DESCRIBE EXTENDED TABLE really need to show them)? If nowhere, I think it's better to filter them out at the very beginning, i.e. HiveExternalCatalog

@gatorsmile
Copy link
Member Author

Another example is ALTER TABLE. Basically, we retrieve the original table properties and then modify the existing CatalogTable and then put it back.

def alterTable(tableDefinition: CatalogTable): Unit = {
val db = formatDatabaseName(tableDefinition.identifier.database.getOrElse(getCurrentDatabase))
val table = formatTableName(tableDefinition.identifier.table)
val tableIdentifier = TableIdentifier(table, Some(db))
val newTableDefinition = tableDefinition.copy(identifier = tableIdentifier)
requireDbExists(db)
requireTableExists(tableIdentifier)
externalCatalog.alterTable(newTableDefinition)
}

How about hiding them in HiveClientImp? Like what we did for EXTERNAL and comment in the table properties? The advantage is to leave the others unaware of the existence of these Hive-specific attributes.

@cloud-fan
Copy link
Contributor

sgtm, do we need to retrieve the statistics properties in HiveClientImp before alter a table? Can hive metastore automatically fill them?

@gatorsmile
Copy link
Member Author

Will try it.

My current solution is to first retrieve the table properties and fill these Hive-specific properties (if not set by the callers) before issuing the command to Hive. Will do it tonight. Thanks!

@gatorsmile
Copy link
Member Author

This PR is part of another PR #14971. Close it now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants