@gatorsmile
Member

What changes were proposed in this pull request?

The recent statistics-related work has focused on how to generate and store statistics. After an ANALYZE TABLE command, the statistics do not change unless users run the command again. Hive behaves differently: for example, ALTER TABLE SET LOCATION invalidates the statistics, including numRows and rawDataSize.

hive> describe formatted t2;
...
Location:               hdfs://6b68a24121f4:9000/user/hive/warehouse/t2  
Table Type:             MANAGED_TABLE            
Table Parameters:        
    COLUMN_STATS_ACCURATE   true                
    numFiles                4                   
    numRows                 2                   
    rawDataSize             2                   
    totalSize               4                   
    transient_lastDdlTime   1464590855          
...
hive> alter table t2 set location 'hdfs://6b68a24121f4:9000/user/hive/warehouse/t1';
OK
Time taken: 0.113 seconds
hive> describe formatted t2;
...                  
Location:               hdfs://6b68a24121f4:9000/user/hive/warehouse/t1  
Table Type:             MANAGED_TABLE            
Table Parameters:        
    COLUMN_STATS_ACCURATE   false               
    last_modified_by        root                
    last_modified_time      1474178025          
    numFiles                4                   
    numRows                 -1                  
    rawDataSize             -1                  
    totalSize               4                   
...

This PR proposes invalidating the statistics after such ALTER TABLE commands.
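
As a rough illustration of the Hive behavior in the transcript above, the following sketch resets the row-level statistics while leaving the file-level ones untouched. It is not Spark or Hive code: the object name `StatsInvalidationSketch`, the `invalidate` helper, and the plain `Map` of table parameters are all hypothetical, and the bookkeeping keys Hive also writes (`last_modified_by`, `last_modified_time`) are not modeled.

```scala
// Hypothetical model of the table-parameter change shown in the Hive transcript.
// Nothing here is taken from the Spark or Hive code base.
object StatsInvalidationSketch {
  // Row-level statistics that the transcript shows being reset to -1.
  val rowLevelStats: Set[String] = Set("numRows", "rawDataSize")

  /** Returns the parameters as they look after ALTER TABLE ... SET LOCATION in the transcript. */
  def invalidate(params: Map[String, String]): Map[String, String] =
    params.map {
      case (key, _) if rowLevelStats.contains(key) => key -> "-1"
      case ("COLUMN_STATS_ACCURATE", _)            => "COLUMN_STATS_ACCURATE" -> "false"
      case other                                   => other // numFiles and totalSize are kept as-is
    }

  def main(args: Array[String]): Unit = {
    // Values copied from the first DESCRIBE FORMATTED output above.
    val before = Map(
      "COLUMN_STATS_ACCURATE" -> "true",
      "numFiles" -> "4",
      "numRows" -> "2",
      "rawDataSize" -> "2",
      "totalSize" -> "4")
    val after = invalidate(before)
    after.toSeq.sortBy(_._1).foreach { case (k, v) => println(s"$k\t$v") }
  }
}
```

Keeping `numFiles` and `totalSize` mirrors the transcript, where only `numRows`, `rawDataSize`, and `COLUMN_STATS_ACCURATE` change.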

How was this patch tested?

Added test cases.

checkStats(
  textTable,
  isDataSourceTable = false,
  hasSizeInBytes = false,
Member Author

After PR #14971, we can reuse the Hive-generated statistics. Thus, this value will be true.

@gatorsmile changed the title from "[SPARK-17581] [SQL] Invalidate Statistics After Some ALTER TABLE Commands" to "[SPARK-17581] [SQL] Invalidate Statistics After Some ALTER TABLE Commands [WIP]" on Sep 18, 2016
@gatorsmile
Member Author

@hvanhovell @cloud-fan Do you think automatic invalidation makes sense after some ALTER TABLE commands?

@cloud-fan
Contributor

Is SET LOCATION the only command that can invalidate the statistics?

@SparkQA

SparkQA commented Sep 18, 2016

Test build #65554 has finished for PR 15136 at commit 54df392.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

Will do more investigation on this.

@gatorsmile
Member Author

The behavior of Hive 2.1 is different from Hive 1.6. Hive 2.1 does not invalidate the statistics; it only marks them as inaccurate. Thus, this PR can be closed. Thanks!

@gatorsmile closed this on Nov 27, 2016