feat: Add encodedVectorCopy #12588


Closed

Yuhta wants to merge 1 commit into facebookincubator:main from Yuhta:export-D70867237

Contributor

Yuhta commented Mar 10, 2025

Summary:
Implement encodedVectorCopy, a generic vector copy utility that preserves
encodings in order to save memory.

Encoding Preservation

There are two main use cases for this new function. One is to merge multiple
encoded vectors (sources) into one large encoded vector (target); the other
is to update specific rows (source) in a large vector (target) while
keeping the encodings. Both use cases require us to keep the encoding of
target, so that is the default behavior of this function.

There are some exceptions to this rule:

- We merge multiple adjacent layers of dictionary and constant wrappers into
  one.
- When target is constant, we convert it to dictionary so that source can
  contribute different values.
- When target is flat ROW, MAP, or ARRAY, and source is constant or dictionary
  encoded, the result is dictionary encoded, to avoid flattening the child
  vectors. Once the target becomes dictionary, it can stay that way and we can
  keep adding new content to it while keeping the encoding; this is a typical
  use case for encoding-preserving merging.
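The rules above can be read as a small decision table. The following is a minimal sketch of that table only, not the Velox implementation: `Encoding`, `Kind`, and `resultEncoding` are hypothetical names introduced for illustration, and the wrapper-merging rule is not modeled.

```cpp
#include <cassert>

// Hypothetical toy types; these are NOT the Velox enums.
enum class Encoding { kFlat, kConstant, kDictionary };
enum class Kind { kScalar, kRow, kMap, kArray };

// Returns the encoding the result would have under the rules listed in the
// description. The default is to keep the target's encoding.
Encoding resultEncoding(Encoding target, Encoding source, Kind kind) {
  // A constant target cannot hold differing values, so it becomes dictionary.
  if (target == Encoding::kConstant) {
    return Encoding::kDictionary;
  }
  const bool isComplex = kind != Kind::kScalar;
  const bool sourceIsWrapped =
      source == Encoding::kConstant || source == Encoding::kDictionary;
  // Flat ROW/MAP/ARRAY target with a wrapped source: use dictionary to avoid
  // flattening the child vectors.
  if (target == Encoding::kFlat && isComplex && sourceIsWrapped) {
    return Encoding::kDictionary;
  }
  return target;
}
```

For example, under this sketch a flat scalar target stays flat even when the source is dictionary encoded, while a flat ARRAY target with the same source becomes dictionary encoded.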

Inner Vector Compaction

Besides encoding, we also take special care not to hold on to memory that is
no longer needed. This is especially important for the merging use case: as
the target gets updated, the majority of the rows in its inner vectors are
dereferenced and no longer used. There are two cases we need to handle.

The first is dictionary encoding: some rows in the alphabet (base/value)
vector are no longer referenced by the indices, so we should recycle them.
We do this while translating the copy ranges on dictionary indices into copy
ranges on the alphabet: the unused rows in the alphabet are overwritten with
the new alphabet rows from source. This way we reuse the memory in the
alphabet efficiently, with neither reallocation nor leaked memory.
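The slot reuse described above can be illustrated with a toy dictionary. This is a simplified sketch under stated assumptions, not the code in this PR: `ToyDictionary` and `overwriteRows` are hypothetical names, the alphabet holds strings, and unused slots are found by a plain scan rather than by translating copy ranges.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical toy model of a dictionary-encoded vector.
struct ToyDictionary {
  std::vector<int32_t> indices;      // one entry per logical row
  std::vector<std::string> alphabet; // base/value rows
};

// Overwrites logical rows [begin, begin + values.size()) with `values`,
// recycling alphabet rows that no remaining index references and appending
// only when no free slot is left.
void overwriteRows(
    ToyDictionary& target,
    std::size_t begin,
    const std::vector<std::string>& values) {
  assert(begin + values.size() <= target.indices.size());
  const std::size_t end = begin + values.size();

  // Mark alphabet rows still referenced by rows outside the overwritten range.
  std::vector<bool> used(target.alphabet.size(), false);
  for (std::size_t i = 0; i < target.indices.size(); ++i) {
    if (i < begin || i >= end) {
      used[target.indices[i]] = true;
    }
  }

  std::size_t slot = 0;
  for (std::size_t i = 0; i < values.size(); ++i) {
    // Advance to the next unused alphabet row, if any.
    while (slot < used.size() && used[slot]) {
      ++slot;
    }
    if (slot < used.size()) {
      target.alphabet[slot] = values[i]; // reuse a dead slot in place
      used[slot] = true;
    } else {
      target.alphabet.push_back(values[i]); // no free slot: grow
      used.push_back(true);
    }
    target.indices[begin + i] = static_cast<int32_t>(slot);
  }
}
```

In this sketch, overwriting a row whose old alphabet entry is referenced by no other row leaves the alphabet size unchanged, which is the memory-reuse effect the description is after.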

The second case is ARRAY/MAP, where the elements/keys/values vector can have
rows that are no longer referenced from the parent. This is a little harder to
solve than the dictionary case, because the nested rows for one parent row
(offset/size pair) need to be contiguous, which means we cannot move them
around easily. Our approach is to allow some unused nested rows but keep track
of their percentage; once it exceeds a certain threshold (50% in the
implementation), we make a new copy of the nested vector and copy over only
the used rows. This allows us to reuse the nested rows to a certain degree
while keeping a bound on the memory usage.
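The threshold-based compaction can be sketched on a toy array vector. This is an illustration only, assuming non-overlapping offset/size ranges and integer elements; `ToyArrayVector` and `compactIfNeeded` are hypothetical names and do not reflect the actual Velox classes or the bookkeeping used in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical toy model of an ARRAY vector: row i owns
// elements[offsets[i] .. offsets[i] + sizes[i]).
struct ToyArrayVector {
  std::vector<std::size_t> offsets;
  std::vector<std::size_t> sizes;
  std::vector<int> elements;
};

// Rebuilds `elements` with only the referenced rows when the unused fraction
// exceeds `threshold`. Returns true if a compaction happened.
bool compactIfNeeded(ToyArrayVector& v, double threshold = 0.5) {
  assert(v.offsets.size() == v.sizes.size());
  std::size_t used = 0;
  for (std::size_t size : v.sizes) {
    used += size;
  }
  const std::size_t total = v.elements.size();
  assert(used <= total); // holds when parent ranges do not overlap
  const std::size_t unused = total - used;
  if (total == 0 || static_cast<double>(unused) <= threshold * total) {
    return false; // tolerate some dead nested rows
  }

  std::vector<int> compacted;
  compacted.reserve(used);
  for (std::size_t row = 0; row < v.offsets.size(); ++row) {
    const std::size_t newOffset = compacted.size();
    // Each parent row's elements stay contiguous in the new buffer.
    compacted.insert(
        compacted.end(),
        v.elements.begin() + v.offsets[row],
        v.elements.begin() + v.offsets[row] + v.sizes[row]);
    v.offsets[row] = newOffset;
  }
  v.elements = std::move(compacted);
  return true;
}
```

With 6 nested rows of which only 2 are referenced, the unused fraction is about 67%, so this sketch compacts; with 1 unused row out of 5 it leaves the vector alone.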

Differential Revision: D70867237

Yuhta requested review from assignUser and majetideepak as code owners

March 10, 2025 14:36

facebook-github-bot added the CLA Signed label

netlify bot commented Mar 10, 2025

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`81ca05c`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/67dc32b9bef236000817fc20

Yuhta force-pushed the export-D70867237 branch from 51c56ee to fce1c5d Compare

March 10, 2025 14:38

Contributor

facebook-github-bot commented Mar 10, 2025

This pull request was exported from Phabricator. Differential Revision: D70867237

facebook-github-bot added the fb-exported label

Yuhta force-pushed the export-D70867237 branch from fce1c5d to a175645 Compare

March 10, 2025 14:40


Yuhta added a commit to Yuhta/velox that referenced this pull request


          feat: Add encodedVectorCopy (facebookincubator#12588)

42f9a6a


Yuhta force-pushed the export-D70867237 branch 2 times, most recently from 42f9a6a to ad7d712 Compare

March 11, 2025 23:32

Yuhta added a commit to Yuhta/velox that referenced this pull request


          feat: Add encodedVectorCopy (facebookincubator#12588)

ad7d712


Contributor

facebook-github-bot commented Mar 11, 2025

This pull request was exported from Phabricator. Differential Revision: D70867237


          feat: Add encodedVectorCopy (facebookincubator#12588)

81ca05c


Yuhta force-pushed the export-D70867237 branch from ad7d712 to 81ca05c Compare

March 20, 2025 15:22

Contributor

facebook-github-bot commented Mar 20, 2025

This pull request was exported from Phabricator. Differential Revision: D70867237

facebook-github-bot closed this in

7d05ec7

facebook-github-bot added the Merged label

Contributor

facebook-github-bot commented Mar 20, 2025

This pull request has been merged in 7d05ec7.

conbench-facebook bot commented Mar 20, 2025

Conbench analyzed the 1 benchmark run on commit 7d05ec79.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

jinchengchenghh pushed a commit to jinchengchenghh/velox that referenced this pull request


          feat: Add encodedVectorCopy (facebookincubator#12588)

ae0f2fe

Summary:
Pull Request resolved: facebookincubator#12588

Implement `encodedVectorCopy`, a generic vector copy utility that preserves
encodings in order to save memory.

## Encoding Preservation

There are two main use cases for this new function.  One is to merge multiple
encoded vectors (`source`s) into one large encoded vector (`target`); the other
is to update specific rows (`source`) in a large vector (`target`) while
keeping the encodings.  Both use cases require us to keep the encoding of
`target`, so that is the default behavior of this function.

There are some exceptions to this rule:

- We merge multiple adjacent layers of dictionary and constant wrappers into
  one.
- When the size of the value type in a dictionary is no larger than the size
  of the index type, we flatten the vector to save memory.
- When `target` is constant, we convert it to dictionary so that `source` can
  contribute different values.
- When `target` is flat ROW, MAP, or ARRAY, and `source` is constant or
  dictionary encoded, the result is dictionary encoded, to avoid flattening
  the child vectors.  Once the target becomes dictionary, it can stay that way
  and we can keep adding new content to it while keeping the encoding; this is
  a typical use case for encoding-preserving merging.
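The flattening rule for small value types follows from simple byte counting. This is a hedged sketch of that arithmetic only: `ToyIndex`, `shouldFlatten`, `flatBytes`, and `dictionaryBytes` are hypothetical names, a 32-bit index type is assumed, and null buffers and other overhead are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption for this sketch: dictionary indices are 32-bit integers.
using ToyIndex = int32_t;

// A dictionary pays sizeof(ToyIndex) per row for indices alone, so a value
// type that is no larger than the index gains nothing from being wrapped.
bool shouldFlatten(std::size_t valueSize) {
  assert(valueSize > 0);
  return valueSize <= sizeof(ToyIndex);
}

// Bytes for a flat vector of `rows` values.
std::size_t flatBytes(std::size_t rows, std::size_t valueSize) {
  return rows * valueSize;
}

// Bytes for a dictionary over `alphabetRows` distinct values.
std::size_t dictionaryBytes(
    std::size_t rows,
    std::size_t alphabetRows,
    std::size_t valueSize) {
  return rows * sizeof(ToyIndex) + alphabetRows * valueSize;
}
```

Under these assumptions, a 4-byte value type never costs more flat than dictionary encoded, even when the alphabet has a single row, while an 8-byte value type with few distinct values is cheaper as a dictionary.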

## Inner Vector Compaction

Besides encoding, we also take special care not to hold on to memory that is
no longer needed.  This is especially important for the merging use case: as
the `target` gets updated, the majority of the rows in its inner vectors are
dereferenced and no longer used.  There are two cases we need to handle.

The first is dictionary encoding: some rows in the alphabet (base/value)
vector are no longer referenced by the indices, so we should recycle them.
We do this while translating the copy ranges on dictionary indices into copy
ranges on the alphabet: the unused rows in the alphabet are overwritten with
the new alphabet rows from `source`.  This way we reuse the memory in the
alphabet efficiently, with neither reallocation nor leaked memory.

The second case is `ARRAY`/`MAP`, where the elements/keys/values vector can
have rows that are no longer referenced from the parent.  This is a little
harder to solve than the dictionary case, because the nested rows for one
parent row (offset/size pair) need to be contiguous, which means we cannot move
them around easily.  Our approach is to allow some unused nested rows but keep
track of their percentage; once it exceeds a certain threshold (50% by default,
and configurable), we make a new copy of the nested vector and copy over only
the used rows.  This allows us to reuse the nested rows to a certain degree
while keeping a bound on the memory usage.

Reviewed By: mbasmanova

Differential Revision: D70867237

fbshipit-source-id: 0cddd37fd7188d89ea541fb89324aa9a10745415


Labels

CLA Signed fb-exported Merged