PARQUET-11: Reduce memory pressure when reading footers #2

dvryaboy · 2014-07-03T06:54:07Z

This will intern the following strings columns:

SchemaElement.name;
KeyValue.key
KeyValue.value
ColumnMetadata.(list) path_in_schema
ColumnChunk.file_path
FileMetaData.created_by

All of these are likely to be highly repeated.

Heap analysis before applying this patch (some silly obfuscation applied):

Object header size: 12 bytes
Total size of all object headers: 214,501K (33.6%)

 #instances    (Average) object   Total header     Class name
                    size              size                   
-------------------------------------------------------------
   4,405,394         32         51,625K (8.1%)     String
   4,403,224         44         51,600K (8.1%)     char[]
   1,894,774         24         22,204K (3.5%)     long[]
   1,893,181         33         22,185K (3.5%)     Object[]
   1,892,932         24         22,182K (3.5%)     j.u.BitSet
   1,892,650         24         22,179K (3.5%)     j.u.ArrayList
     944,858         32         11,072K (1.7%)     parquet.format.ColumnChunk
     944,858         88         11,072K (1.7%)     parquet.format.ColumnMetaData

DUPLICATE STRING STATS

Total strings: 4,405,394  Unique strings: 6,837  Duplicate values: 1,139  Overhead: 328,790K (51.5%)

Top duplicate strings:
    Ovhd         Num char[]s   Num objs   Value

 42,831K (6.7%)   783213      783213      "user"
 12,628K (2.0%)   161647      161647      "user_extra"
  7,446K (1.2%)   105907      105907      "another_column"
  6,751K (1.1%)    86424       86424      "column_a"
  6,270K (1.0%)   100333      100333      "column_b"
  6,098K (1.0%)    97581       97581      "column_c"
  5,748K (0.9%)    91972       91972      "column_d"
  5,486K (0.9%)    58528       58528      "column_e"
  5,029K (0.8%)    58528       58528      "column_f"
..... and so on.

After applying this patch, the strings disappear from the profile (next we need to deal with duplicate arrays...):

Object header size: 12 bytes
Total size of all object headers: 230,459K (36.1%)

 #instances    (Average) object   Total header     Class name
                    size              size                   
-------------------------------------------------------------
   3,926,516         24         46,013K (7.2%)     long[]
   3,924,923         33         45,995K (7.2%)     Object[]
   3,924,674         24         45,992K (7.2%)     j.u.BitSet
   3,924,392         24         45,988K (7.2%)     j.u.ArrayList
   1,959,233         88         22,959K (3.6%)     parquet.format.ColumnMetaData
   1,959,233         32         22,959K (3.6%)     parquet.format.ColumnChunk

 DUPLICATE STRING STATS

Total strings: 7,004  Unique strings: 5,808  Duplicate values: 621  Overhead: 103K (0.0%)

Top duplicate strings:
    Ovhd         Num char[]s   Num objs   Value


===================================================

(that's right, they no longer show up!)

dvryaboy · 2014-07-07T18:49:45Z

@julienledem just bumping this for you. Other reviewers welcome to look too, of course :)

isnotinvain · 2014-07-23T00:36:51Z

So this will avoid creation of short-lived strings, which should decrease GC pressure.
But, is this the root problem? Or is the root problem that we are just holding on to some references that we shouldn't, which prevents a bunch of things from getting GCed? EG, are we pulling all the footers into memory at once instead of streaming through them? Even thought it's better to intern these things, assuming we don't hold references we shouldn't, shouldn't these Strings be so short lived that they get GCed cheaply because they never leave the new gen?

dvryaboy · 2014-07-23T00:48:14Z

This stuff doesn't get to where we can have a gc -- the strings are part of the giant thrift footer object, and can't be partially GCed.

Changing the mode of working with the metadata from having it in memory to streaming through it is a significant change -- perhaps one we could contemplate, but a pretty deep change that we shouldn't block on.

The unfortunate thing here is that consolidated footers are a single thrift object, not a collection of objects, which is why we OOM inside the thrift protocol, not afterwards, where the metadata thrift is handed off to parquet code and we can do our own memory avoidance. So we have to avoid unnecessary allocations inside the thrift protocol; streaming or not streaming won't help in this case.

isnotinvain · 2014-07-23T03:21:10Z

Got it thanks.

dvryaboy · 2014-07-23T16:52:31Z

So.. can I haz a +1? :)

isnotinvain · 2014-07-24T06:35:00Z

FYI I've filed a ticket for Footers causing OOMs here: https://issues.apache.org/jira/browse/PARQUET-28

isnotinvain · 2014-07-24T06:37:11Z

What do you think about changing the class or package name so that it can't collide with the real one if that ever lands on the classpath? You could call it NotFinalTCompactProtocol or something like that

julienledem · 2014-07-24T16:55:40Z

src/main/java/parquet/format/Util.java

+   */
+  private static class TCompactInterningProtocol extends TCompactProtocol {
+    public TCompactInterningProtocol(TTransport transport) {
+      super(transport);


We can wrap the TCompactProtocol instead of extending, that will avoid duplicating it.

TCompactInterningProtocol extends TProtocol { private TCompactProtocol delegate; public TCompactInterningProtocol(TTransport transport) { super(transport); this.delegate = new TCompactProtocol(transport); } ... @Override public byte readByte() throws TException { return delegate.readByte(); } // all the public methods delegating to delegate ... // the method you want to change @Override public String readString() throws TException { return delegate.readString().intern(); } }

+1 good call

julienledem · 2014-07-24T16:57:04Z

I'd rather have the Interning protocol delegate to the compact protocol and not duplicate it there.

based on https://github.com/apache/incubator-parquet-format/pull/2 Author: julien <[email protected]> Author: Dmitriy Ryaboy <[email protected]> Closes #7 from julienledem/reduce_metadata_memory and squashes the following commits: 96ff408 [julien] Merge branch 'master' into reduce_metadata_memory 1c382cc [julien] implement delegate instead 7664919 [Dmitriy Ryaboy] intern parquet metadata strings when reading them

julienledem · 2014-08-23T00:14:56Z

this was merged in #7
@dvryaboy could you close this one?

Reopening Parquet/parquet-mr#403 against the new Apache repository. Author: Matthieu Martin <[email protected]> Closes apache#2 from matt-martin/master and squashes the following commits: 99bb5a3 [Matthieu Martin] Minor javadoc and whitespace changes. Also added the FileStatusWrapper class to ParquetInputFormat to make sure that the debugging log statements print out meaningful paths. 250a398 [Matthieu Martin] Be less aggressive about checking whether the underlying file has been appended to/overwritten/deleted in order to minimize the number of namenode interactions. d946445 [Matthieu Martin] Add javadocs to parquet.hadoop.LruCache. Rename cache "entries" as cache "values" to avoid confusion with java.util.Map.Entry (which contains key value pairs whereas our old "entries" really only refer to the values). a363622 [Matthieu Martin] Use LRU caching for footers in ParquetInputFormat.

intern parquet metadata strings when reading them

7664919

julienledem reviewed Jul 24, 2014
View reviewed changes

julienledem mentioned this pull request Aug 20, 2014

PARQUET-11: Reduce memory pressure when reading footers #7

Closed

dvryaboy closed this Aug 23, 2014

dvryaboy deleted the 11_reduce_metadata_memory branch August 23, 2014 01:11

emkornfield mentioned this pull request May 30, 2024

DRAFT: Alternative Strawman proposal for a new V3 footer format in Parquet #250

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

PARQUET-11: Reduce memory pressure when reading footers #2

PARQUET-11: Reduce memory pressure when reading footers #2

Uh oh!

dvryaboy commented Jul 3, 2014

Uh oh!

dvryaboy commented Jul 7, 2014

Uh oh!

isnotinvain commented Jul 23, 2014

Uh oh!

dvryaboy commented Jul 23, 2014

Uh oh!

isnotinvain commented Jul 23, 2014

Uh oh!

dvryaboy commented Jul 23, 2014

Uh oh!

isnotinvain commented Jul 24, 2014

Uh oh!

isnotinvain commented Jul 24, 2014

Uh oh!

julienledem Jul 24, 2014

Uh oh!

isnotinvain Jul 24, 2014

Uh oh!

julienledem commented Jul 24, 2014

Uh oh!

julienledem commented Aug 23, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

PARQUET-11: Reduce memory pressure when reading footers #2

PARQUET-11: Reduce memory pressure when reading footers #2

Uh oh!

Conversation

dvryaboy commented Jul 3, 2014

Uh oh!

dvryaboy commented Jul 7, 2014

Uh oh!

isnotinvain commented Jul 23, 2014

Uh oh!

dvryaboy commented Jul 23, 2014

Uh oh!

isnotinvain commented Jul 23, 2014

Uh oh!

dvryaboy commented Jul 23, 2014

Uh oh!

isnotinvain commented Jul 24, 2014

Uh oh!

isnotinvain commented Jul 24, 2014

Uh oh!

julienledem Jul 24, 2014

Choose a reason for hiding this comment

Uh oh!

isnotinvain Jul 24, 2014

Choose a reason for hiding this comment

Uh oh!

julienledem commented Jul 24, 2014

Uh oh!

julienledem commented Aug 23, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants