
Feature: Add end_date support in generic tap config #922

@aaronsteers

Feature scope

Taps (catalog, state, stream maps, etc.)

Description

There are many use cases that could benefit from a generic, SDK-handled end_date config option to match the start_date config option, which is specified in the Singer Spec and which most taps already support.

A few use cases:

  1. Rerunning a backfill (aka a "restatement") covering just a specific time period. (E.g. restating all of "Q3 2021" due to data corruption or to account for an initial start_date value such as Jan 1 2022, which was not inclusive of all time periods.)
  2. Running date partition syncs in parallel, for instance: extracting all of 2020 in one batch, all of 2021 in another batch, and YTD 2022 in a final batch (see the config sketch after this list).
  3. Especially for new EL pipelines: prioritizing recent dates' data over prior dates. For instance: we may start with extracting "current year YTD" as highest priority. Then, later, once the data source is running and operational, we may want to keep backfilling one year at a time: 2021, 2020, 2019, etc., with recent periods taking priority over earlier ones.
  4. Intentionally skipping over records which have not reached a minimal cool-off period.
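As a sketch of use case 2, the following hypothetical config values show how three date-partitioned syncs might be defined (partition_configs is an illustrative name; the end_date key is the option proposed here, and the YTD window simply omits it):

    partition_configs = [
        {"start_date": "2020-01-01T00:00:00Z", "end_date": "2021-01-01T00:00:00Z"},
        {"start_date": "2021-01-01T00:00:00Z", "end_date": "2022-01-01T00:00:00Z"},
        {"start_date": "2022-01-01T00:00:00Z"},  # YTD 2022: open-ended, no end_date
    ]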

Spec

  1. The SDK would add support for an optional end_date config input. When provided, records received from the source would be ignored and/or filtered out if they fall later than the provided end_date.
  2. The SDK would never advance the bookmark beyond the end_date.
  3. The SDK would likely treat this identically to the signpost feature, which already performs both the record filtering and the bookmark limiting described above. (In the case of the signpost, the goal is to never mark a bookmark newer than the utcnow() calculated at the start of execution.) A sketch of wiring end_date through the signpost hook follows this list.
  4. Different APIs would have differing levels of ability for performance optimization:
    1. An API that supports both start and end filters can get exactly the records needed.
    2. Likewise, SQL-based taps can filter for exactly the records needed.
    3. APIs that do not support an end filter can stop paginating early if the output is known to be sorted and the end_date limit has already been passed (see the pagination sketch after this list).
    4. APIs that support neither an end filter nor sorted output may be forced to paginate through all records to the end of the stream. Extraction will be significantly slower in this case: even though the tap emits only matching records, it still has to page through every record from the source to find them.
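As a rough illustration of points 1-3, here is a minimal sketch of a stream class that routes an end_date config value through the SDK's existing get_replication_key_signpost hook. MyStream is hypothetical, and treating end_date as the signpost is the proposal here, not current SDK behavior:

    from __future__ import annotations

    import datetime
    import typing as t

    from singer_sdk.streams import Stream


    class MyStream(Stream):
        """Sketch only: a stream honoring a proposed end_date config option."""

        name = "my_stream"
        replication_key = "updated_at"
        schema = {"properties": {"updated_at": {"type": "string", "format": "date-time"}}}

        def get_records(self, context: dict | None) -> t.Iterable[dict]:
            yield from []  # placeholder: fetch records from the source here

        def get_replication_key_signpost(self, context: dict | None):
            end_date = self.config.get("end_date")
            if end_date:
                # Treat end_date exactly like the default utcnow() signpost:
                # records at or past it are filtered out, and the bookmark
                # never advances beyond it. (Parsing simplified here; "Z"
                # suffixes need handling on Python < 3.11.)
                return datetime.datetime.fromisoformat(end_date)
            return super().get_replication_key_signpost(context)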
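And as a generic (non-SDK-specific) sketch of point 4.3, a paginated fetch over sorted output can bail out at the first out-of-range record. The names sorted_records_until and fetch_page are hypothetical:

    import typing as t

    def sorted_records_until(
        fetch_page: t.Callable[[t.Optional[str]], t.Tuple[t.List[dict], t.Optional[str]]],
        end_date: str,  # ISO-8601 UTC strings compare correctly as plain strings
    ) -> t.Iterator[dict]:
        token: t.Optional[str] = None
        while True:
            records, token = fetch_page(token)
            for record in records:
                if record["updated_at"] >= end_date:
                    return  # sorted output: every later record is out of range too
                yield record
            if token is None:
                return  # reached the natural end of the stream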

Spec Questions (TBD)

Still TBD is whether the end_date should be inclusive or exclusive. Exclusive is probably the correct behavior, so that Jan 1 2022 00:00:00.000 (for instance) could be used as the end_date of one pipeline and the start_date of another. If we ask users to provide an inclusive date, they are likely to provide something like Dec 31 2021 23:59:59, which (depending on the precision of the source system) is subject to gaps, and therefore to unintentional data loss.

If we go with an exclusive logic, and given that start_date is inclusive, then the logic would be:

  • start_date <= record_date < end_date
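For example, with the exclusive upper bound, a boundary timestamp belongs to exactly one window. A quick sketch (in_window is a hypothetical helper):

    from datetime import datetime, timezone

    def in_window(record_date: datetime, start_date: datetime, end_date: datetime) -> bool:
        # start_date inclusive, end_date exclusive: adjacent windows share a
        # boundary without overlapping or leaving a gap.
        return start_date <= record_date < end_date

    boundary = datetime(2022, 1, 1, tzinfo=timezone.utc)
    w2021 = (datetime(2021, 1, 1, tzinfo=timezone.utc), boundary)
    w2022 = (boundary, datetime(2023, 1, 1, tzinfo=timezone.utc))
    assert not in_window(boundary, *w2021)  # excluded from the 2021 window...
    assert in_window(boundary, *w2022)      # ...and included in the 2022 window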

Caveats/warnings

For merge-insert operations, especially when operations are run in parallel, it is important to note that the latest time block should always be loaded last. This is because, over the course of a parallel execution, the same record may appear in a historic period and also in the latest time window. To avoid losing the most recent changes, the final/current time period should be loaded last or (safer yet) rerun after the prior periods have been extracted and loaded.

A theoretical example: loads for "2021" and "2022 YTD" are running in parallel. A customers table record, ABC Customer, has not been updated since Dec 2021. It is updated while the load is running and "moves" into the 2022 YTD bucket after already being picked up in the 2021 bucket. If the 2021 bucket loads to the warehouse after the 2022 YTD bucket, the older change will overwrite the newer one, causing data loss.

The way to resolve this is either to wait until backfills have completed before running the most recent period, or (safer yet) to rerun the latest period so that the newer version of the record once again becomes the primary/correct version in the warehouse.
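One way to encode that ordering rule, as a hypothetical orchestration sketch (run_sync stands in for invoking the tap with a window's start_date/end_date config):

    from concurrent.futures import ThreadPoolExecutor

    windows = [
        ("2020-01-01T00:00:00Z", "2021-01-01T00:00:00Z"),
        ("2021-01-01T00:00:00Z", "2022-01-01T00:00:00Z"),
        ("2022-01-01T00:00:00Z", None),  # latest window (open-ended)
    ]

    def run_sync(start_date, end_date):
        ...  # invoke the tap with this window's config

    *historic, latest = windows
    with ThreadPoolExecutor() as pool:
        # Backfill the historic windows in parallel first...
        list(pool.map(lambda w: run_sync(*w), historic))
    # ...then run (or rerun) the latest window last, so the newest record
    # versions win any merge-insert conflicts in the warehouse.
    run_sync(*latest)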

Since these challenges are reasonably addressed with some best practice documentation and/or additional intelligence added to the orchestrator that wants to run these in parallel (such as Meltano), there doesn't seem to be a strong reason not to build this feature.

See also
