Skip to content

Conversation

@dvryaboy
Copy link
Contributor

@dvryaboy dvryaboy commented Jul 3, 2014

This will intern the following strings columns:

SchemaElement.name;
KeyValue.key
KeyValue.value
ColumnMetadata.(list) path_in_schema
ColumnChunk.file_path
FileMetaData.created_by

All of these are likely to be highly repeated.

Heap analysis before applying this patch (some silly obfuscation applied):

Object header size: 12 bytes
Total size of all object headers: 214,501K (33.6%)

 #instances    (Average) object   Total header     Class name
                    size              size                   
-------------------------------------------------------------
   4,405,394         32         51,625K (8.1%)     String
   4,403,224         44         51,600K (8.1%)     char[]
   1,894,774         24         22,204K (3.5%)     long[]
   1,893,181         33         22,185K (3.5%)     Object[]
   1,892,932         24         22,182K (3.5%)     j.u.BitSet
   1,892,650         24         22,179K (3.5%)     j.u.ArrayList
     944,858         32         11,072K (1.7%)     parquet.format.ColumnChunk
     944,858         88         11,072K (1.7%)     parquet.format.ColumnMetaData
DUPLICATE STRING STATS

Total strings: 4,405,394  Unique strings: 6,837  Duplicate values: 1,139  Overhead: 328,790K (51.5%)

Top duplicate strings:
    Ovhd         Num char[]s   Num objs   Value

 42,831K (6.7%)   783213      783213      "user"
 12,628K (2.0%)   161647      161647      "user_extra"
  7,446K (1.2%)   105907      105907      "another_column"
  6,751K (1.1%)    86424       86424      "column_a"
  6,270K (1.0%)   100333      100333      "column_b"
  6,098K (1.0%)    97581       97581      "column_c"
  5,748K (0.9%)    91972       91972      "column_d"
  5,486K (0.9%)    58528       58528      "column_e"
  5,029K (0.8%)    58528       58528      "column_f"
..... and so on.

After applying this patch, the strings disappear from the profile (next we need to deal with duplicate arrays...):

Object header size: 12 bytes
Total size of all object headers: 230,459K (36.1%)

 #instances    (Average) object   Total header     Class name
                    size              size                   
-------------------------------------------------------------
   3,926,516         24         46,013K (7.2%)     long[]
   3,924,923         33         45,995K (7.2%)     Object[]
   3,924,674         24         45,992K (7.2%)     j.u.BitSet
   3,924,392         24         45,988K (7.2%)     j.u.ArrayList
   1,959,233         88         22,959K (3.6%)     parquet.format.ColumnMetaData
   1,959,233         32         22,959K (3.6%)     parquet.format.ColumnChunk

 DUPLICATE STRING STATS

Total strings: 7,004  Unique strings: 5,808  Duplicate values: 621  Overhead: 103K (0.0%)

Top duplicate strings:
    Ovhd         Num char[]s   Num objs   Value


===================================================

(that's right, they no longer show up!)

@dvryaboy
Copy link
Contributor Author

dvryaboy commented Jul 7, 2014

@julienledem just bumping this for you. Other reviewers welcome to look too, of course :)

@isnotinvain
Copy link
Contributor

So this will avoid creation of short-lived strings, which should decrease GC pressure.
But, is this the root problem? Or is the root problem that we are just holding on to some references that we shouldn't, which prevents a bunch of things from getting GCed? EG, are we pulling all the footers into memory at once instead of streaming through them? Even thought it's better to intern these things, assuming we don't hold references we shouldn't, shouldn't these Strings be so short lived that they get GCed cheaply because they never leave the new gen?

@dvryaboy
Copy link
Contributor Author

This stuff doesn't get to where we can have a gc -- the strings are part of the giant thrift footer object, and can't be partially GCed.

Changing the mode of working with the metadata from having it in memory to streaming through it is a significant change -- perhaps one we could contemplate, but a pretty deep change that we shouldn't block on.

The unfortunate thing here is that consolidated footers are a single thrift object, not a collection of objects, which is why we OOM inside the thrift protocol, not afterwards, where the metadata thrift is handed off to parquet code and we can do our own memory avoidance. So we have to avoid unnecessary allocations inside the thrift protocol; streaming or not streaming won't help in this case.

@isnotinvain
Copy link
Contributor

Got it thanks.

@dvryaboy
Copy link
Contributor Author

So.. can I haz a +1? :)

@isnotinvain
Copy link
Contributor

FYI I've filed a ticket for Footers causing OOMs here: https://issues.apache.org/jira/browse/PARQUET-28

@isnotinvain
Copy link
Contributor

What do you think about changing the class or package name so that it can't collide with the real one if that ever lands on the classpath? You could call it NotFinalTCompactProtocol or something like that

*/
private static class TCompactInterningProtocol extends TCompactProtocol {
public TCompactInterningProtocol(TTransport transport) {
super(transport);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can wrap the TCompactProtocol instead of extending, that will avoid duplicating it.

TCompactInterningProtocol extends TProtocol {
  private TCompactProtocol delegate;
  public TCompactInterningProtocol(TTransport transport) {
     super(transport);
     this.delegate = new TCompactProtocol(transport);
  }
...
@Override
 public byte readByte() throws TException {
     return delegate.readByte();
   }
// all the public methods delegating to delegate
...
// the method you want to change
@Override
 public String readString() throws TException {
     return delegate.readString().intern();
 }
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 good call

@julienledem
Copy link
Member

I'd rather have the Interning protocol delegate to the compact protocol and not duplicate it there.

asfgit referenced this pull request Aug 22, 2014
based on https://github.com/apache/incubator-parquet-format/pull/2

Author: julien <[email protected]>
Author: Dmitriy Ryaboy <[email protected]>

Closes #7 from julienledem/reduce_metadata_memory and squashes the following commits:

96ff408 [julien] Merge branch 'master' into reduce_metadata_memory
1c382cc [julien] implement delegate instead
7664919 [Dmitriy Ryaboy] intern parquet metadata strings when reading them
@julienledem
Copy link
Member

this was merged in #7
@dvryaboy could you close this one?

@dvryaboy dvryaboy closed this Aug 23, 2014
@dvryaboy dvryaboy deleted the 11_reduce_metadata_memory branch August 23, 2014 01:11
dvryaboy pushed a commit to dvryaboy/incubator-parquet-format that referenced this pull request Aug 24, 2014
Reopening Parquet/parquet-mr#403 against the new Apache repository.

Author: Matthieu Martin <[email protected]>

Closes apache#2 from matt-martin/master and squashes the following commits:

99bb5a3 [Matthieu Martin] Minor javadoc and whitespace changes. Also added the FileStatusWrapper class to ParquetInputFormat to make sure that the debugging log statements print out meaningful paths.
250a398 [Matthieu Martin] Be less aggressive about checking whether the underlying file has been appended to/overwritten/deleted in order to minimize the number of namenode interactions.
d946445 [Matthieu Martin] Add javadocs to parquet.hadoop.LruCache.  Rename cache "entries" as cache "values" to avoid confusion with java.util.Map.Entry (which contains key value pairs whereas our old "entries" really only refer to the values).
a363622 [Matthieu Martin] Use LRU caching for footers in ParquetInputFormat.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants