Feature scope
Taps (catalog, state, stream maps, etc.)
Description
There are many use cases which could benefit from a generic, SDK-handled `end_date` config option, to match the `start_date` config option that is specified in the Singer Spec and which most taps already support.
A few use cases:
- Rerunning a backfill (aka a "restatement") covering just a specific time period. (E.g. restating all of "Q3 2021" due to data corruption, or to account for an initial `start_date` value such as `Jan 1 2022` which was not inclusive of all time periods.)
- Running date partition syncs in parallel, for instance: extracting all of 2020 in one batch, all of 2021 in another batch, and YTD 2022 in a final batch. (See the partitioned configs sketched after this list.)
- Especially for new EL pipelines: prioritizing recent dates' data over prior dates. For instance, we may start by extracting "current year YTD" as the highest priority. Then, once the pipeline is up and running, we may keep backfilling one year at a time: 2021, 2020, 2019, etc., with the recent periods having higher priority than prior periods.
- Intentionally skipping over records which have not reached a minimal cool-off period.
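To make the parallel-partition use case concrete, here is a hedged sketch of what the three batch configs might look like. The `end_date` key is the proposed (not yet existing) setting, and the exclusive-end convention discussed later in this proposal is assumed:

```python
# Hypothetical partitioned configs for the parallel use case above, assuming
# the proposed `end_date` setting (exclusive) lands alongside the existing
# Singer `start_date` setting (inclusive).
batch_2020 = {"start_date": "2020-01-01T00:00:00Z", "end_date": "2021-01-01T00:00:00Z"}
batch_2021 = {"start_date": "2021-01-01T00:00:00Z", "end_date": "2022-01-01T00:00:00Z"}
batch_ytd = {"start_date": "2022-01-01T00:00:00Z"}  # no end_date: syncs up to "now"
```

Note that each batch's `end_date` matches the next batch's `start_date`, so the partitions tile the timeline without gaps or overlap.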
Spec
- The SDK would add support for an optional `end_date` config input. When provided, records received from the source would be ignored and/or filtered out if they were later than the provided `end_date`.
- The SDK would never advance the bookmark beyond the `end_date`.
- The SDK would likely treat this identically to the Signposts feature, which already performs the record filtering behavior as well as the bookmark limiting behavior. (In the case of the signpost, the goal is to not mark any bookmarks newer than the `utcnow()` calculation at the start of execution.) A rough sketch of this filtering and bookmark-capping logic follows this list.
- Different APIs would have differing levels of ability for performance optimization:
  - An API that supports both `start` and `end` filters can get exactly the records needed.
  - Likewise, SQL-based taps can filter for exactly the records needed.
  - APIs that do not support an `end` filter can cancel their paginated calls early if the output is known to be sorted and we have already passed the `end_date` limit.
  - APIs that do not support an `end` filter and also do not produce sorted output may be forced to continue paginating through all records until the end of the stream. This means that extraction will be significantly slower: even though the tap will only emit matching records, it still has to paginate through every record to find the ones that match.
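A minimal sketch of the behavior described above, written as standalone Python rather than actual SDK internals (the `updated_at` replication key, the `emit()` helper, and the `records` iterable are all illustrative assumptions):

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp such as '2022-01-01T00:00:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def emit(record: dict) -> None:
    """Stand-in for writing a Singer RECORD message to stdout."""
    print(record)


def sync_stream(records, config: dict, state: dict, is_sorted: bool = False) -> dict:
    """Filter out records at or past `end_date` and never advance the
    bookmark beyond it, mirroring the existing signpost behavior."""
    end_date = parse_ts(config["end_date"]) if "end_date" in config else None
    bookmark = state.get("bookmark")

    for record in records:
        record_ts = parse_ts(record["updated_at"])  # assumed replication key
        if end_date is not None and record_ts >= end_date:
            if is_sorted:
                break  # sorted stream: nothing after this can be in range
            continue   # unsorted stream: must keep scanning to the end
        emit(record)
        # Advance the bookmark; emitted records are always < end_date, so the
        # bookmark can never move past the end_date "signpost".
        if bookmark is None or record_ts > parse_ts(bookmark):
            bookmark = record["updated_at"]

    state["bookmark"] = bookmark
    return state
```

The `break`-vs-`continue` branch captures the performance difference between sorted and unsorted streams noted in the last two sub-bullets above.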
Spec Questions (TBD)
Still TBD is whether the `end_date` should be inclusive or exclusive. Exclusive is probably the correct behavior, so that `Jan 1 2022 0:00:00.000` (for instance) could be used as the `end_date` on one pipeline and the `start_date` on another. If we ask users to provide an inclusive date, users are likely to provide something like `Dec 31 2021 11:59:59`, which (depending on the precision of the source system) is subject to gaps, and therefore to unintentional data loss.
If we go with exclusive logic, and given that `start_date` is inclusive, then the logic would be:

`start_date <= record_date < end_date`
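To illustrate the boundary behavior (a hypothetical helper, assuming timezone-aware timestamps):

```python
from datetime import datetime, timezone


def in_range(record_date: datetime, start_date: datetime, end_date: datetime) -> bool:
    # Inclusive start, exclusive end: start_date <= record_date < end_date
    return start_date <= record_date < end_date


boundary = datetime(2022, 1, 1, tzinfo=timezone.utc)
jan_2021 = datetime(2021, 1, 1, tzinfo=timezone.utc)
jan_2023 = datetime(2023, 1, 1, tzinfo=timezone.utc)

# A record stamped exactly at the boundary is excluded from the earlier
# partition and included in the later one: exactly one partition owns it.
assert not in_range(boundary, jan_2021, boundary)
assert in_range(boundary, boundary, jan_2023)
```

Under this rule, every record lands in exactly one partition, with no gaps and no double-counting.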
Caveats/warnings
For merge-insert operations, especially when runs are executed in parallel, it is important to note that the latest time block should always be loaded last. This is because, over the course of a parallel execution, the same record may appear both in a historic period and in the latest time window. In order to not lose the most recent changes, the final/current time period should be loaded last or (safer yet) the final/current time block should be rerun after the prior periods have been extracted and loaded.
A theoretical example: a load for "2021" and a load for "2022 YTD" are running in parallel. A `customers` table record, "ABC Customer", has not been updated since Dec 2021. It is updated while the load is running, and "moves" into the 2022 YTD bucket after already being picked up in the 2021 bucket. If 2021 loads to the warehouse after the 2022 YTD bucket is loaded, the older change will overwrite the newer one, causing data loss.
The way to resolve this is either to wait until backfills have run before running the most recent period, or to rerun the latest period so that the newer version of the record once again becomes the primary/correct version in the warehouse, as sketched below.
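A hedged sketch of the safer ordering, as hypothetical orchestration code (`run_sync` stands in for invoking the tap/target pair with one config partition):

```python
partitions = [
    {"start_date": "2020-01-01T00:00:00Z", "end_date": "2021-01-01T00:00:00Z"},
    {"start_date": "2021-01-01T00:00:00Z", "end_date": "2022-01-01T00:00:00Z"},
]
latest = {"start_date": "2022-01-01T00:00:00Z"}  # current period, no end_date


def run_sync(config: dict) -> None:
    """Stand-in for invoking the tap/target pair with one config partition."""
    ...


# Historical partitions can run in any order, or in parallel.
for config in partitions:
    run_sync(config)

# The latest period runs (or reruns) last, so the newest version of any
# record is the one that ultimately lands in the warehouse.
run_sync(latest)
```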
Since these challenges are reasonably addressed with some best practice documentation and/or additional intelligence added to the orchestrator that wants to run these in parallel (such as Meltano), there doesn't seem to be a strong reason not to build this feature.
See also
- Proposal for parallel execution capability in Meltano: Improve ELT performance by running multiple tap stream processes in parallel ("Melturbo") meltano#2677