Commit 42af342

cbb330 and claude committed
Task #13: Implement FilterStripes function (apache#33)
Main entry point for stripe filtering with predicate pushdown. Mirrors Parquet's FilterRowGroups pattern.

Implementation:
1. Ensure complete metadata is loaded (file, manifest, statistics cache)
2. Call TestStripes to evaluate predicate against stripe statistics
3. Filter results to include only stripes where:
   - Predicate is satisfiable (not literal(false))
   - Stripe is non-empty (num_rows > 0)
4. Return vector of selected stripe indices

Stripes are skipped if:
- The predicate simplifies to literal(false) given statistics
- The stripe contains zero rows

This function is called by:
- ScanBatchesAsync (for scan optimization)
- Subset (for fragment splitting)
- TryCountRows (for count optimization)

Verified: Mirrors cpp/src/arrow/dataset/file_parquet.cc FilterRowGroups (lines 918-931)

Co-authored-by: Claude Sonnet 4.5 <[email protected]>
1 parent c127f20 commit 42af342
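
The commit message says a stripe is skipped when the predicate "simplifies to literal(false) given statistics". As a rough illustration of that idea only, the sketch below checks two simple predicates against a per-stripe min/max range. `ColumnRange`, `MayContainGreaterThan`, and `MayContainEqual` are hypothetical names invented for this example; they are not part of Arrow or ORC, and the real TestStripes works on `compute::Expression` values rather than hand-written checks.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for one column's statistics within one stripe.
// This is NOT the ORC statistics API, just the min/max pair it would expose.
struct ColumnRange {
  int64_t min;
  int64_t max;
};

// Returns false only when the statistics prove that no row in the stripe
// can satisfy `column > threshold`; true means "may contain matches".
inline bool MayContainGreaterThan(const ColumnRange& stats, int64_t threshold) {
  assert(stats.min <= stats.max);
  return stats.max > threshold;
}

// Returns false only when the statistics prove that no row in the stripe
// can satisfy `column == value`.
inline bool MayContainEqual(const ColumnRange& stats, int64_t value) {
  assert(stats.min <= stats.max);
  return stats.min <= value && value <= stats.max;
}
```

A `false` result here corresponds to the "unsatisfiable" case in the commit message: the stripe can be skipped without reading its data. A `true` result is only a "maybe", so the stripe still has to be scanned and filtered row by row.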

1 file changed: +50 -4 lines changed

cpp/src/arrow/dataset/file_orc.cc

Lines changed: 50 additions & 4 deletions
@@ -757,16 +757,62 @@ Status OrcFileFragment::ClearCachedMetadata() {
   return Status::OK();
 }
 
+/// \brief Filter stripes based on predicate and column statistics
+///
+/// Main entry point for stripe filtering. Ensures metadata is loaded,
+/// calls TestStripes to evaluate statistics, and returns indices of
+/// stripes that may contain matching rows.
+///
+/// Stripes are skipped if:
+/// - The predicate simplifies to literal(false) given the stripe's statistics
+/// - The stripe is empty (num_rows == 0)
+///
+/// This mirrors Parquet's FilterRowGroups pattern.
+///
+/// \param[in] predicate The filter expression to evaluate
+/// \return Vector of stripe indices that should be scanned
 Result<std::vector<int>> OrcFileFragment::FilterStripes(
     compute::Expression predicate) {
-  // TODO: Implement in Task #13
-  // For now, return all stripes
+  // Ensure all required metadata is loaded
   ARROW_RETURN_NOT_OK(EnsureCompleteMetadata());
 
+  // Use TestStripes to evaluate the predicate against stripe statistics
+  ARROW_ASSIGN_OR_RAISE(auto expressions, TestStripes(std::move(predicate)));
+
+  // Build list of selected stripe indices
   std::vector<int> selected_stripes;
+
+  // Determine which stripes we're evaluating
+  // If stripes_ is set, use those indices; otherwise use all stripes
   int64_t num_stripes = metadata_->NumberOfStripes();
-  for (int64_t i = 0; i < num_stripes; ++i) {
-    selected_stripes.push_back(static_cast<int>(i));
+  std::vector<int> stripes_to_check;
+  if (stripes_) {
+    stripes_to_check = *stripes_;
+  } else {
+    stripes_to_check.resize(num_stripes);
+    for (int64_t i = 0; i < num_stripes; ++i) {
+      stripes_to_check[i] = static_cast<int>(i);
+    }
+  }
+
+  // Filter based on statistics and empty stripes
+  DCHECK(expressions.empty() || (expressions.size() == stripes_to_check.size()));
+  for (size_t i = 0; i < expressions.size(); ++i) {
+    int stripe_idx = stripes_to_check[i];
+
+    // Skip if predicate is unsatisfiable for this stripe
+    if (!expressions[i].IsSatisfiable()) {
+      continue;
+    }
+
+    // Skip empty stripes
+    auto stripe_info = metadata_->GetStripeInformation(stripe_idx);
+    if (stripe_info.num_rows == 0) {
+      continue;
+    }
+
+    // This stripe should be scanned
+    selected_stripes.push_back(stripe_idx);
   }
 
   return selected_stripes;
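
To make the selection loop easier to reason about outside of Arrow, here is a minimal standalone sketch of the same control flow. It assumes the per-stripe satisfiability results are already available as a `std::vector<bool>`. `StripeSummary` and `SelectStripes` are hypothetical names used only for illustration, and nothing here calls the real Arrow or ORC APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for what the fragment knows about one stripe.
struct StripeSummary {
  int64_t num_rows;
};

// Sketch of the selection loop. `satisfiable[i]` plays the role of
// expressions[i].IsSatisfiable(); it is indexed by position in the list of
// stripes under consideration, not by absolute stripe index.
std::vector<int> SelectStripes(const std::vector<StripeSummary>& stripes,
                               const std::optional<std::vector<int>>& subset,
                               const std::vector<bool>& satisfiable) {
  // Use the explicit subset when present, otherwise consider every stripe.
  std::vector<int> to_check;
  if (subset) {
    to_check = *subset;
  } else {
    to_check.reserve(stripes.size());
    for (size_t i = 0; i < stripes.size(); ++i) {
      to_check.push_back(static_cast<int>(i));
    }
  }
  // Unlike the DCHECK in the diff, this sketch requires one result per stripe.
  assert(satisfiable.size() == to_check.size());

  std::vector<int> selected;
  for (size_t i = 0; i < to_check.size(); ++i) {
    const int idx = to_check[i];
    if (!satisfiable[i]) continue;                                  // pruned by statistics
    if (stripes[static_cast<size_t>(idx)].num_rows == 0) continue;  // empty stripe
    selected.push_back(idx);
  }
  return selected;
}
```

With four stripes of 100, 0, 250, and 75 rows and satisfiability flags `{true, true, false, true}`, the sketch selects stripes 0 and 3: stripe 1 is dropped because it is empty and stripe 2 because its predicate is unsatisfiable. Restricting the same file to the subset `{1, 2}` with flags `{true, true}` selects only stripe 2.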
