Skip to content

Improve ability of FlightDataEncoder to respect max_flight_data_size for certain data types (strings, dictionaries, etc) #3478

@alamb

Description

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Some implementations of gRPC, such as golang have a default max message size that is "relatively small" (4MB) and the clients will generate errors if they receive larger messages.

The FlightDataEncoder has a mechanism (link) to try and avoid this problem by heuristically slicing RecordBatchs into smaller parts to limit their size. This works well for primitive arrays but does not work well for other cases as we have found upstream in IOx:

  1. DataType::Utf8 (only the offsets are sliced, the underlying string data is not sliced)
  2. Dictionaries (the dictionary itself is not changed, so if the dictionary is large it will be repeated sent) -- will be an issue after Support DictionaryArrays in Arrow Flight #3389

Lists, structs, and other nested types probably suffer from similar issues with maximum message sizes.

Of course, the smallest message possible is a single row, which can always be be significantly larger than whatever the max_flight_data_size limit is for variable length columns (e.g. several large string columns)

Describe the solution you'd like
I would like to improve the situation and handle nested types and more effectively reduce the FlightDataSize

Describe alternatives you've considered

  1. One approach would be to copy the data into a new record batch, "packing" it into a brand new memory space (this is the approach I plan as a workaround in IOx) - this would result in the minimum sized flight data batches
  2. Another way might be to implement a slice / take that resulted in smaller data sizes (e.g. rewrite data offsets for strings)

Additional context
See #3347

Metadata

Metadata

Assignees

Labels

arrow-flightChanges to the arrow-flight crateenhancementAny new improvement worthy of a entry in the changelog

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions