Description
Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-12411
Use case:
While writing tests (both in IOx and in DataFusion) where I need a single RecordBatch, I often find myself doing something like this:
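use std::sync::Arc;
use arrow::array::{ArrayRef, Float64Array, Int64Array};
use arrow::datatypes::{DataType as ArrowDataType, Field as ArrowField, Schema};
use arrow::record_batch::RecordBatch;

// The column types are spelled out once here, in the schema...
let schema = Arc::new(Schema::new(vec![
    ArrowField::new("float_field", ArrowDataType::Float64, true),
    ArrowField::new("time", ArrowDataType::Int64, true),
]));
// ...and a second time here, in the concrete array types.
let float_array: ArrayRef = Arc::new(Float64Array::from(vec![10.1, 20.1, 30.1, 40.1]));
let timestamp_array: ArrayRef = Arc::new(Int64Array::from(vec![1000, 2000, 3000, 4000]));
let batch = RecordBatch::try_new(schema, vec![float_array, timestamp_array])
    .expect("created new record batch");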
This is annoying because the information that float_field is a float is encoded twice: once in the Schema and once in the Float64Array.
I would much rather be able to construct RecordBatches in a builder style, which would avoid the redundancy and reduce the amount of typing:
let float_array: ArrayRef = Arc::new(Float64Array::from(vec![10.1, 20.1, 30.1, 40.1]));
let timestamp_array: ArrayRef = Arc::new(Int64Array::from(vec![1000, 2000, 3000, 4000]));
let batch = RecordBatch::empty()
    .append("float_field", float_array).unwrap()
    .append("time", timestamp_array).unwrap();
The proposal is to add a method to RecordBatch like:
impl RecordBatch {
...
fn append(self, field_name: &str, field_values: ArrayRef) -> Result<Self>
}
That would append a field with the given name to the current schema, returning an error if field_name was already present.
The nullability of the new field would be set based on the actual null count of field_values.
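To make the intended semantics concrete, here is a minimal sketch that uses only the standard library. The types `Column`, `Field`, `Batch` and `AppendError`, and the helper `example_batch`, are hypothetical stand-ins invented for illustration; they are not the arrow crate's API. The sketch assumes that "nullable" means "null count greater than zero", and it adds a column-length check that the proposal does not mention but that a record batch would presumably need.

```rust
// Hypothetical stand-in for an Arrow array: a column of optional values.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Float64(Vec<Option<f64>>),
    Int64(Vec<Option<i64>>),
}

impl Column {
    fn len(&self) -> usize {
        match self {
            Column::Float64(v) => v.len(),
            Column::Int64(v) => v.len(),
        }
    }

    fn null_count(&self) -> usize {
        match self {
            Column::Float64(v) => v.iter().filter(|x| x.is_none()).count(),
            Column::Int64(v) => v.iter().filter(|x| x.is_none()).count(),
        }
    }

    // The type name is derived from the data, so callers never repeat it.
    fn data_type(&self) -> &'static str {
        match self {
            Column::Float64(_) => "Float64",
            Column::Int64(_) => "Int64",
        }
    }
}

// Hypothetical stand-in for a schema field.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: &'static str,
    nullable: bool,
}

#[derive(Debug)]
enum AppendError {
    DuplicateField(String),
    // Assumption: not part of the proposal text, added for completeness.
    LengthMismatch { expected: usize, actual: usize },
}

// Hypothetical stand-in for RecordBatch: parallel lists of fields and columns.
#[derive(Debug, Default)]
struct Batch {
    fields: Vec<Field>,
    columns: Vec<Column>,
}

impl Batch {
    fn empty() -> Self {
        Self::default()
    }

    // Consumes the batch and returns it with one more column, so calls chain.
    fn append(mut self, field_name: &str, field_values: Column) -> Result<Self, AppendError> {
        if self.fields.iter().any(|f| f.name == field_name) {
            return Err(AppendError::DuplicateField(field_name.to_string()));
        }
        if let Some(first) = self.columns.first() {
            if first.len() != field_values.len() {
                return Err(AppendError::LengthMismatch {
                    expected: first.len(),
                    actual: field_values.len(),
                });
            }
        }
        self.fields.push(Field {
            name: field_name.to_string(),
            data_type: field_values.data_type(),
            // Nullability follows the data: nullable only if a null is present.
            nullable: field_values.null_count() > 0,
        });
        self.columns.push(field_values);
        Ok(self)
    }
}

// Builds a two-column batch in the builder style described above.
fn example_batch() -> Result<Batch, AppendError> {
    Batch::empty()
        .append(
            "float_field",
            Column::Float64(vec![Some(10.1), Some(20.1), Some(30.1), Some(40.1)]),
        )?
        .append(
            "time",
            Column::Int64(vec![Some(1000), None, Some(3000), Some(4000)]),
        )
}
```

In this sketch, `example_batch()` yields a non-nullable "float_field" and a nullable "time" (because one timestamp is missing), and appending a second column named "time" returns `AppendError::DuplicateField`. Taking `self` by value is what lets the calls chain; a real implementation on RecordBatch might make different choices about error types and nullability.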