PARQUET-11: Reduce memory pressure when reading footers #2
@julienledem just bumping this for you. Other reviewers welcome to look too, of course :)
So this will avoid creation of short-lived strings, which should decrease GC pressure.
This never reaches a point where GC can help: the strings are part of the giant thrift footer object and can't be partially GCed. Changing how we work with the metadata, from holding it in memory to streaming through it, is a significant change. It is perhaps one we could contemplate, but it is a pretty deep change that we shouldn't block on. The unfortunate thing here is that consolidated footers are a single thrift object, not a collection of objects, which is why we OOM inside the thrift protocol rather than afterwards, where the metadata thrift is handed off to parquet code and we can do our own memory avoidance. So we have to avoid unnecessary allocations inside the thrift protocol; streaming or not streaming won't help in this case.
Got it, thanks.
So.. can I haz a +1? :)
FYI I've filed a ticket for Footers causing OOMs here: https://issues.apache.org/jira/browse/PARQUET-28
What do you think about changing the class or package name so that it can't collide with the real one if that ever lands on the classpath? You could call it NotFinalTCompactProtocol or something like that.
 */
private static class TCompactInterningProtocol extends TCompactProtocol {
  public TCompactInterningProtocol(TTransport transport) {
    super(transport);
We can wrap the TCompactProtocol instead of extending it; that will avoid duplicating it.
class TCompactInterningProtocol extends TProtocol {
  private final TCompactProtocol delegate;

  public TCompactInterningProtocol(TTransport transport) {
    super(transport);
    this.delegate = new TCompactProtocol(transport);
  }
  ...
  @Override
  public byte readByte() throws TException {
    return delegate.readByte();
  }
  // all the public methods delegating to delegate
  ...
  // the method you want to change
  @Override
  public String readString() throws TException {
    return delegate.readString().intern();
  }
}
+1 good call
I'd rather have the Interning protocol delegate to the compact protocol and not duplicate it there.
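For readers following the delegate-versus-extend discussion, here is a minimal, self-contained sketch of the wrapping approach. It does not use Thrift: `FieldReader`, `PlainReader`, and `InterningReader` are hypothetical stand-ins for `TProtocol`, `TCompactProtocol`, and the interning wrapper, introduced only so the example compiles on its own.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the part of TProtocol used here; not the Thrift API.
interface FieldReader {
  byte readByte();
  String readString();
}

// Hypothetical stand-in for TCompactProtocol: every readString() allocates a new String.
final class PlainReader implements FieldReader {
  private final byte[][] values;
  private int next = 0;

  PlainReader(byte[][] values) {
    this.values = values;
  }

  @Override
  public byte readByte() {
    // Number of values left; present only so there is a second method to delegate.
    return (byte) (values.length - next);
  }

  @Override
  public String readString() {
    return new String(values[next++], StandardCharsets.UTF_8);
  }
}

// The wrapper: forwards every call unchanged, except that strings are interned.
final class InterningReader implements FieldReader {
  private final FieldReader delegate;

  InterningReader(FieldReader delegate) {
    this.delegate = delegate;
  }

  @Override
  public byte readByte() {
    return delegate.readByte();
  }

  @Override
  public String readString() {
    return delegate.readString().intern();
  }
}

class DelegateDemo {
  public static void main(String[] args) {
    byte[] name = "created_by".getBytes(StandardCharsets.UTF_8);

    FieldReader plain = new PlainReader(new byte[][] {name, name});
    System.out.println(plain.readString() == plain.readString()); // false

    FieldReader interning = new InterningReader(new PlainReader(new byte[][] {name, name}));
    System.out.println(interning.readString() == interning.readString()); // true
  }
}
```

The wrapper holds no copy of the decoding logic, so it cannot drift from or collide with the wrapped implementation; the cost is one forwarding method per protocol method.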
based on https://github.com/apache/incubator-parquet-format/pull/2
Author: julien <[email protected]>
Author: Dmitriy Ryaboy <[email protected]>
Closes #7 from julienledem/reduce_metadata_memory and squashes the following commits:
96ff408 [julien] Merge branch 'master' into reduce_metadata_memory
1c382cc [julien] implement delegate instead
7664919 [Dmitriy Ryaboy] intern parquet metadata strings when reading them
Reopening Parquet/parquet-mr#403 against the new Apache repository.
Author: Matthieu Martin <[email protected]>
Closes apache#2 from matt-martin/master and squashes the following commits:
99bb5a3 [Matthieu Martin] Minor javadoc and whitespace changes. Also added the FileStatusWrapper class to ParquetInputFormat to make sure that the debugging log statements print out meaningful paths.
250a398 [Matthieu Martin] Be less aggressive about checking whether the underlying file has been appended to/overwritten/deleted in order to minimize the number of namenode interactions.
d946445 [Matthieu Martin] Add javadocs to parquet.hadoop.LruCache. Rename cache "entries" as cache "values" to avoid confusion with java.util.Map.Entry (which contains key value pairs whereas our old "entries" really only refer to the values).
a363622 [Matthieu Martin] Use LRU caching for footers in ParquetInputFormat.
This will intern the following string fields:
SchemaElement.name
KeyValue.key
KeyValue.value
ColumnMetaData.path_in_schema (list)
ColumnChunk.file_path
FileMetaData.created_by
All of these are likely to be highly repeated.
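To illustrate why interning helps with highly repeated values, the following sketch decodes the same bytes many times and counts distinct `String` instances with and without `intern()`. The class and method names (`InternDemo`, `decode`, `readNames`, `distinctInstances`) and the sample value `user_id` are hypothetical; this is not the Parquet or Thrift reading path, only a model of it.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

class InternDemo {
  // Models decoding a string field from a footer: each call allocates a new String.
  static String decode(byte[] utf8) {
    return new String(utf8, StandardCharsets.UTF_8);
  }

  // Decodes the same bytes `copies` times, optionally interning each result.
  static List<String> readNames(byte[] utf8, int copies, boolean intern) {
    List<String> names = new ArrayList<>(copies);
    for (int i = 0; i < copies; i++) {
      String s = decode(utf8);
      names.add(intern ? s.intern() : s);
    }
    return names;
  }

  // Counts distinct objects by identity (==), not by equals().
  static int distinctInstances(List<String> strings) {
    Set<String> seen = Collections.newSetFromMap(new IdentityHashMap<String, Boolean>());
    seen.addAll(strings);
    return seen.size();
  }

  public static void main(String[] args) {
    byte[] raw = "user_id".getBytes(StandardCharsets.UTF_8);
    System.out.println(distinctInstances(readNames(raw, 1000, false))); // 1000
    System.out.println(distinctInstances(readNames(raw, 1000, true)));  // 1
  }
}
```

Each decode still allocates a temporary string, but with interning the duplicate is never stored in the metadata object, so only one copy of each distinct value stays reachable.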
Heap analysis before applying this patch (some silly obfuscation applied):
After applying this patch, the strings disappear from the profile (next we need to deal with duplicate arrays...):
(that's right, they no longer show up!)