feat: Add encodedVectorCopy #12588
✅ Deploy Preview for meta-velox canceled.
This pull request was exported from Phabricator. Differential Revision: D70867237
Summary:
Pull Request resolved: facebookincubator#12588

Implement `encodedVectorCopy`, a generic vector copy utility that preserves encodings to save memory.

## Encoding Preservation

There are two main use cases for this new function. One is to merge multiple encoded vectors (`source`s) into one large encoded vector (`target`); the other is to update specific rows (`source`) in a large vector (`target`) while keeping the encodings. Both use cases require us to keep the encoding of `target`, so that is the behavior of this function. There are some exceptions to this rule:

- We merge multiple adjacent layers of dictionary and constant wrappers into one.
- When `target` is constant, we convert it to dictionary to allow different values in `source`.
- When `target` is flat ROW, MAP, or ARRAY, and `source` is constant or dictionary encoded, the result will be dictionary encoded, to avoid flattening the child vectors. Once the target becomes dictionary, it can stay that way and we can keep adding new content to it while keeping the encoding; this is a typical use case for encoding-preserving merging.

## Inner Vector Compaction

Besides encoding, we also take care not to hold on to memory that is no longer needed. This is especially important for the merging use case: as `target` gets updated, the majority of the rows in its inner vectors become unreferenced and are no longer used. There are two cases where we need to take care of this.

The first is dictionary encoding: some rows in the alphabet (base/value) vector are no longer referenced by the indices, so we should recycle them. When we translate the copy ranges on dictionary indices to copy ranges on the alphabet, we overwrite the unused alphabet rows with the new alphabet rows from `source`. This way we reuse the memory in the alphabet efficiently, with neither reallocation nor leaked memory.

The second case is `ARRAY`/`MAP`: the elements/keys/values vector can have rows that are no longer referenced from the parent. This is a little harder to solve than the dictionary case, since the nested rows for one parent row (offset/size pair) need to be contiguous, which means we cannot move them around easily. Our approach is to allow some unused nested rows but keep track of their percentage; once it exceeds a certain threshold (50% in the implementation), we make a new copy of the nested vector and copy over only the used rows. This lets us reuse nested rows to a certain degree while keeping memory usage bounded.

Differential Revision: D70867237
42f9a6a to ad7d712
This pull request has been merged in 7d05ec7.
Conbench analyzed the 1 benchmark run on commit. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
Summary:
Pull Request resolved: facebookincubator#12588

Implement `encodedVectorCopy`, a generic vector copy utility that preserves encodings to save memory.

## Encoding Preservation

There are two main use cases for this new function. One is to merge multiple encoded vectors (`source`s) into one large encoded vector (`target`); the other is to update specific rows (`source`) in a large vector (`target`) while keeping the encodings. Both use cases require us to keep the encoding of `target`, so that is the behavior of this function. There are some exceptions to this rule:

- We merge multiple adjacent layers of dictionary and constant wrappers into one.
- When the value type of a dictionary is no larger than the index type, we flatten the vector to save memory.
- When `target` is constant, we convert it to dictionary to allow different values in `source`.
- When `target` is flat ROW, MAP, or ARRAY, and `source` is constant or dictionary encoded, the result will be dictionary encoded, to avoid flattening the child vectors. Once the target becomes dictionary, it can stay that way and we can keep adding new content to it while keeping the encoding; this is a typical use case for encoding-preserving merging.

## Inner Vector Compaction

Besides encoding, we also take care not to hold on to memory that is no longer needed. This is especially important for the merging use case: as `target` gets updated, the majority of the rows in its inner vectors become unreferenced and are no longer used. There are two cases where we need to take care of this.

The first is dictionary encoding: some rows in the alphabet (base/value) vector are no longer referenced by the indices, so we should recycle them. When we translate the copy ranges on dictionary indices to copy ranges on the alphabet, we overwrite the unused alphabet rows with the new alphabet rows from `source`. This way we reuse the memory in the alphabet efficiently, with neither reallocation nor leaked memory.

The second case is `ARRAY`/`MAP`: the elements/keys/values vector can have rows that are no longer referenced from the parent. This is a little harder to solve than the dictionary case, since the nested rows for one parent row (offset/size pair) need to be contiguous, which means we cannot move them around easily. Our approach is to allow some unused nested rows but keep track of their percentage; once it exceeds a certain threshold (50% by default, configurable), we make a new copy of the nested vector and copy over only the used rows. This lets us reuse nested rows to a certain degree while keeping memory usage bounded.

Reviewed By: mbasmanova

Differential Revision: D70867237

fbshipit-source-id: 0cddd37fd7188d89ea541fb89324aa9a10745415
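The flattening exception in the merged commit message (values no wider than the index type) comes down to a size comparison. The Python sketch below is only a model of that reasoning, not the Velox code: `should_flatten_dictionary` and `flatten` are hypothetical names, and the 4-byte default index width is an assumption.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def should_flatten_dictionary(value_width_bytes: int, index_width_bytes: int = 4) -> bool:
    """Hypothetical helper: a dictionary row costs one index plus a share of the
    alphabet, so when a value is no wider than an index, flat storage is never larger."""
    return value_width_bytes <= index_width_bytes


def flatten(indices: Sequence[int], alphabet: Sequence[T]) -> List[T]:
    """Materialize a dictionary vector (nulls are ignored in this model)."""
    return [alphabet[i] for i in indices]
```

Under these assumptions a 1-byte or 4-byte value type would be flattened, while an 8-byte value type would stay dictionary encoded.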
Summary:

Implement `encodedVectorCopy`, a generic vector copy utility that preserves encodings to save memory.

## Encoding Preservation

There are two main use cases for this new function. One is to merge multiple encoded vectors (`source`s) into one large encoded vector (`target`); the other is to update specific rows (`source`) in a large vector (`target`) while keeping the encodings. Both use cases require us to keep the encoding of `target`, so that is the behavior of this function. There are some exceptions to this rule:

- We merge multiple adjacent layers of dictionary and constant wrappers into one.
- When `target` is constant, we convert it to dictionary to allow different values in `source`.
- When `target` is flat ROW, MAP, or ARRAY, and `source` is constant or dictionary encoded, the result will be dictionary encoded, to avoid flattening the child vectors. Once the target becomes dictionary, it can stay that way and we can keep adding new content to it while keeping the encoding; this is a typical use case for encoding-preserving merging.
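To make the first two exceptions concrete, here is a minimal Python model of wrapper merging. It is a sketch under stated assumptions rather than the Velox implementation: vectors are plain lists, nulls are ignored, and `merge_dictionary_layers`, `constant_to_dictionary` and `decode` are hypothetical names.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def decode(indices: Sequence[int], base: Sequence[T]) -> List[T]:
    """Read a dictionary vector: row r holds base[indices[r]]."""
    return [base[i] for i in indices]


def merge_dictionary_layers(outer: Sequence[int], inner: Sequence[int]) -> List[int]:
    """Collapse dict(outer, dict(inner, base)) into one layer of indices over base."""
    return [inner[i] for i in outer]


def constant_to_dictionary(size: int, value_index: int = 0) -> List[int]:
    """Model a constant vector as a dictionary whose rows all point at one alphabet
    row, so later copies can point individual rows at different values."""
    return [value_index] * size
```

The merged indices decode to the same rows as the two stacked layers, which is why the wrappers can be collapsed without touching the base vector.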
## Inner Vector Compaction

Besides encoding, we also take care not to hold on to memory that is no longer needed. This is especially important for the merging use case: as `target` gets updated, the majority of the rows in its inner vectors become unreferenced and are no longer used. There are two cases where we need to take care of this.

The first is dictionary encoding: some rows in the alphabet (base/value) vector are no longer referenced by the indices, so we should recycle them. When we translate the copy ranges on dictionary indices to copy ranges on the alphabet, we overwrite the unused alphabet rows with the new alphabet rows from `source`. This way we reuse the memory in the alphabet efficiently, with neither reallocation nor leaked memory.
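The alphabet reuse described above can be modeled roughly as follows. This is a simplified Python sketch, not the actual algorithm: `copy_into_dictionary` is a hypothetical name, the source is passed as already-decoded values instead of an encoded vector, values are not deduplicated, and nulls are ignored.

```python
from typing import List, TypeVar

T = TypeVar("T")


def copy_into_dictionary(
    indices: List[int], alphabet: List[T], target_row: int, source_values: List[T]
) -> None:
    """Overwrite rows [target_row, target_row + len(source_values)) in place,
    recycling alphabet rows that no surviving index refers to."""
    assert 0 <= target_row <= len(indices)
    end = target_row + len(source_values)
    # Alphabet rows still referenced by rows outside the overwritten range.
    live = {idx for row, idx in enumerate(indices) if not target_row <= row < end}
    free = [slot for slot in range(len(alphabet)) if slot not in live]
    free.reverse()  # pop() now hands out the lowest free slot first
    if end > len(indices):
        indices.extend([-1] * (end - len(indices)))  # grown rows are filled below
    for offset, value in enumerate(source_values):
        if free:
            slot = free.pop()
            alphabet[slot] = value  # reuse an unreferenced alphabet row
        else:
            slot = len(alphabet)
            alphabet.append(value)  # grow only when nothing can be recycled
        indices[target_row + offset] = slot
```

In this model the alphabet grows only when every existing row is still referenced, which is the property the paragraph above is after.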
The second case is `ARRAY`/`MAP`: the elements/keys/values vector can have rows that are no longer referenced from the parent. This is a little harder to solve than the dictionary case, since the nested rows for one parent row (offset/size pair) need to be contiguous, which means we cannot move them around easily. Our approach is to allow some unused nested rows but keep track of their percentage; once it exceeds a certain threshold (50% in the implementation), we make a new copy of the nested vector and copy over only the used rows. This lets us reuse nested rows to a certain degree while keeping memory usage bounded.
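The threshold-based compaction for `ARRAY`/`MAP` can be sketched as below. This is an illustrative Python model under assumptions, not the code in this PR: `compact_if_needed` and `max_unused_fraction` are hypothetical names, the nested vector is a plain list, and parent rows are assumed not to share nested rows.

```python
from typing import List, Tuple, TypeVar

T = TypeVar("T")


def compact_if_needed(
    offsets: List[int],
    sizes: List[int],
    elements: List[T],
    max_unused_fraction: float = 0.5,
) -> Tuple[List[int], List[T]]:
    """Return (offsets, elements), rebuilt only when too many nested rows are unused."""
    used = sum(sizes)  # valid because parent ranges are assumed not to overlap
    unused = len(elements) - used
    if not elements or unused / len(elements) <= max_unused_fraction:
        return offsets, elements  # tolerate some garbage and keep the buffers
    new_offsets: List[int] = []
    new_elements: List[T] = []
    for offset, size in zip(offsets, sizes):
        new_offsets.append(len(new_elements))
        # Each parent row keeps its nested rows contiguous in the new vector.
        new_elements.extend(elements[offset:offset + size])
    return new_offsets, new_elements
```

Tolerating up to the threshold avoids recopying the nested vector on every update, while the rebuild caps the share of dead rows.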
Differential Revision: D70867237