@Yuhta Yuhta commented Mar 10, 2025

Summary:
Implement encodedVectorCopy, a generic vector copy utility that preserves
encodings in order to save memory.

Encoding Preservation

There are two main use cases for this new function. One is to merge multiple
encoded vectors (sources) into one large encoded vector (target); the other
is to update specific rows of a large vector (target) from another vector
(source) while keeping the encodings. Both use cases require us to keep the
encoding of target, so that is the default behavior of this function.

There are some exceptions to this rule:

  • We merge multiple adjacent layers of dictionary and constant wrappers into
    one.
  • When target is constant, we convert it to dictionary to allow different
    values in source.
  • When target is flat ROW, MAP, or ARRAY, and source is constant or
    dictionary encoded, the result will be dictionary encoded, to avoid flattening
    the child vectors. Once the target becomes dictionary, it can stay that way
    and we can keep adding new content to it while keeping the encoding. This is a
    typical use case for encoding-preserving merging.
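
The rules above can be sketched roughly as follows. This is a simplified illustration and not the Velox implementation: `Encoding`, `Kind`, `resultEncoding`, and `mergeIndices` are hypothetical names, and plain `std::vector` index arrays stand in for real dictionary wrappers. It covers only the exceptions listed above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins; these are not Velox types.
enum class Encoding { kFlat, kConstant, kDictionary };
enum class Kind { kPrimitive, kRow, kMap, kArray };

// Encoding of the result, following the rules listed above.
Encoding resultEncoding(Encoding target, Encoding source, Kind kind) {
  // A constant target can hold only one value, so it becomes a dictionary.
  if (target == Encoding::kConstant) {
    return Encoding::kDictionary;
  }
  const bool isComplex = kind != Kind::kPrimitive;
  const bool sourceWrapped = source != Encoding::kFlat;
  // Wrapping a flat ROW/MAP/ARRAY target avoids flattening child vectors.
  if (target == Encoding::kFlat && isComplex && sourceWrapped) {
    return Encoding::kDictionary;
  }
  // Default: keep the encoding of the target.
  return target;
}

// Collapse two adjacent dictionary layers into one. Row i of the outer layer
// points at inner row outer[i], which points at alphabet row inner[outer[i]].
std::vector<int32_t> mergeIndices(
    const std::vector<int32_t>& outer,
    const std::vector<int32_t>& inner) {
  std::vector<int32_t> merged;
  merged.reserve(outer.size());
  for (int32_t i : outer) {
    merged.push_back(inner.at(i)); // at() throws on an out-of-range index.
  }
  return merged;
}
```

Once `resultEncoding` has returned `kDictionary` for a target, later calls with that target keep returning `kDictionary`, which matches the "it can stay that way" behavior described in the last bullet.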

Inner Vector Compaction

Besides encoding, we also take care to avoid holding on to memory that is no
longer needed. This is especially important for the merging use case: as the
target gets updated, most rows of its inner vectors become unreferenced and
are no longer used. There are two cases where we need to handle this.

The first case is dictionary encoding: some rows in the alphabet (base/value)
vector are no longer referenced by the indices, so we should recycle them.
We do this when translating the copy ranges on the dictionary indices into
copy ranges on the alphabet: the unused rows in the alphabet are overwritten
with the new alphabet rows from source. This way we reuse the memory in the
alphabet efficiently, with neither reallocation nor leaked memory.
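
A minimal sketch of the slot-recycling idea, under simplifying assumptions: `Dict` and `overwriteRows` are hypothetical names, the source is passed as already-decoded values rather than as an encoded vector, and the range `[begin, begin + values.size())` is assumed to lie within `indices`. It only illustrates overwriting unreferenced alphabet rows before growing the alphabet.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical simplified dictionary vector; not a Velox type.
struct Dict {
  std::vector<int32_t> indices;      // One entry per row.
  std::vector<std::string> alphabet; // Base values referenced by indices.
};

// Overwrite rows [begin, begin + values.size()) of dict with values,
// recycling alphabet rows that no remaining index refers to.
void overwriteRows(
    Dict& dict,
    std::size_t begin,
    const std::vector<std::string>& values) {
  const std::size_t end = begin + values.size();
  // Mark alphabet rows still referenced from outside the overwritten range.
  std::vector<bool> used(dict.alphabet.size(), false);
  for (std::size_t i = 0; i < dict.indices.size(); ++i) {
    if (i < begin || i >= end) {
      used[dict.indices[i]] = true;
    }
  }
  std::size_t slot = 0;
  for (std::size_t j = 0; j < values.size(); ++j) {
    // Advance to the next unreferenced alphabet row, if any is left.
    while (slot < used.size() && used[slot]) {
      ++slot;
    }
    if (slot == dict.alphabet.size()) {
      // No free row left: grow the alphabet.
      dict.alphabet.push_back(values[j]);
      used.push_back(true);
    } else {
      // Reuse the unreferenced row in place.
      dict.alphabet[slot] = values[j];
      used[slot] = true;
    }
    dict.indices[begin + j] = static_cast<int32_t>(slot);
  }
}
```

For example, with indices `{0, 1, 1, 2}` over alphabet `{"a", "b", "c"}`, overwriting rows 1 and 2 leaves alphabet row 1 unreferenced, so the first new value reuses it and only the second one extends the alphabet.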

The second case is ARRAY/MAP: the elements/keys/values vector can have rows
that are no longer referenced from the parent. This is a little harder to
solve than the dictionary case, since the nested rows of one parent row (an
offset/size pair) need to be contiguous, which means we cannot move them
around easily. Our approach is to allow some unused nested rows but keep track
of their percentage; once it exceeds a certain threshold (50% in the
implementation), we make a new copy of the nested vector and copy only the used
rows over. This allows us to reuse the nested rows to a certain degree while
keeping the memory usage bounded.
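
The threshold-based compaction can be sketched as below. This is an illustration only: `ArrayVec`, `unusedFraction`, and `compactIfNeeded` are hypothetical names, elements are plain integers, and the sketch assumes the (offset, size) ranges of different parent rows do not overlap. The 0.5 default mirrors the 50% threshold mentioned above.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical simplified ARRAY vector; not a Velox type.
struct ArrayVec {
  std::vector<int32_t> offsets;  // Start of each parent row in elements.
  std::vector<int32_t> sizes;    // Number of elements of each parent row.
  std::vector<int64_t> elements; // Nested rows, contiguous per parent row.
};

// Fraction of nested rows not referenced by any parent row, assuming the
// (offset, size) ranges do not overlap.
double unusedFraction(const ArrayVec& v) {
  if (v.elements.empty()) {
    return 0.0;
  }
  int64_t used = 0;
  for (int32_t size : v.sizes) {
    used += size;
  }
  return 1.0 - static_cast<double>(used) / v.elements.size();
}

// Rebuild the nested vector with only the referenced rows once the unused
// fraction exceeds the threshold; otherwise leave the vector untouched.
void compactIfNeeded(ArrayVec& v, double threshold = 0.5) {
  if (unusedFraction(v) <= threshold) {
    return;
  }
  std::vector<int64_t> compacted;
  for (std::size_t i = 0; i < v.offsets.size(); ++i) {
    const auto first = v.elements.begin() + v.offsets[i];
    const auto newOffset = static_cast<int32_t>(compacted.size());
    // Each parent row's elements stay contiguous in the new vector.
    compacted.insert(compacted.end(), first, first + v.sizes[i]);
    v.offsets[i] = newOffset;
  }
  v.elements = std::move(compacted);
}
```

Below the threshold the unused rows are tolerated so they can be reused by later writes; above it the copy bounds wasted memory at the cost of one pass over the used rows.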

Differential Revision: D70867237

@facebook-github-bot added the CLA Signed label Mar 10, 2025

netlify bot commented Mar 10, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 81ca05c
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67dc32b9bef236000817fc20

@Yuhta Yuhta force-pushed the export-D70867237 branch from 51c56ee to fce1c5d Compare March 10, 2025 14:38
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D70867237

Yuhta added a commit to Yuhta/velox that referenced this pull request Mar 10, 2025
Summary:
Pull Request resolved: facebookincubator#12588
@Yuhta Yuhta force-pushed the export-D70867237 branch 2 times, most recently from 42f9a6a to ad7d712 Compare March 11, 2025 23:32
Yuhta added a commit to Yuhta/velox that referenced this pull request Mar 11, 2025
Reviewed By: mbasmanova
@Yuhta Yuhta force-pushed the export-D70867237 branch from ad7d712 to 81ca05c Compare March 20, 2025 15:22
@facebook-github-bot

This pull request has been merged in 7d05ec7.

@conbench-facebook

Conbench analyzed the 1 benchmark run on commit 7d05ec79.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

jinchengchenghh pushed a commit to jinchengchenghh/velox that referenced this pull request Apr 4, 2025
Summary:
Pull Request resolved: facebookincubator#12588

Implement `encodedVectorCopy`, a generic vector copy utility that preserves
encodings in order to save memory.

## Encoding Preservation

There are two main use cases for this new function. One is to merge multiple
encoded vectors (`source`s) into one large encoded vector (`target`); the other
is to update specific rows of a large vector (`target`) from another vector
(`source`) while keeping the encodings. Both use cases require us to keep the
encoding of `target`, so that is the default behavior of this function.

There are some exceptions to this rule:

- We merge multiple adjacent layers of dictionary and constant wrappers into
  one.
- When the value type of a dictionary is no larger than the index type,
  we flatten the vector to save memory.
- When `target` is constant, we convert it to dictionary to allow different
  values in `source`.
- When `target` is flat ROW, MAP, or ARRAY, and `source` is constant or
  dictionary encoded, the result will be dictionary encoded, to avoid flattening
  the child vectors. Once the target becomes dictionary, it can stay that way
  and we can keep adding new content to it while keeping the encoding. This is a
  typical use case for encoding-preserving merging.

## Inner Vector Compaction

Besides encoding, we also take care to avoid holding on to memory that is no
longer needed. This is especially important for the merging use case: as the
`target` gets updated, most rows of its inner vectors become unreferenced and
are no longer used. There are two cases where we need to handle this.

The first case is dictionary encoding: some rows in the alphabet (base/value)
vector are no longer referenced by the indices, so we should recycle them.
We do this when translating the copy ranges on the dictionary indices into
copy ranges on the alphabet: the unused rows in the alphabet are overwritten
with the new alphabet rows from source. This way we reuse the memory in the
alphabet efficiently, with neither reallocation nor leaked memory.

The second case is `ARRAY`/`MAP`: the elements/keys/values vector can have rows
that are no longer referenced from the parent. This is a little harder to
solve than the dictionary case, since the nested rows of one parent row (an
offset/size pair) need to be contiguous, which means we cannot move them
around easily. Our approach is to allow some unused nested rows but keep track
of their percentage; once it exceeds a certain threshold (50% by default and
configurable), we make a new copy of the nested vector and copy only the used
rows over. This allows us to reuse the nested rows to a certain degree while
keeping the memory usage bounded.

Reviewed By: mbasmanova

Differential Revision: D70867237

fbshipit-source-id: 0cddd37fd7188d89ea541fb89324aa9a10745415

Labels

CLA Signed, fb-exported, Merged


2 participants