Feature scope
Taps (catalog, state, stream maps, etc.)
Description
There are many use cases which could benefit from a generic, SDK-handled `end_date` config option, to match the `start_date` config option that is specified in the Singer Spec and which most taps already support.
A few use cases:
- Rerunning a backfill (aka a "restatement") covering just a specific time period. (E.g. restating all of "Q3 2021" due to data corruption, or to account for an initial `start_date` value such as `Jan 1 2022` which was not inclusive of all time periods.)
- Running date partition syncs in parallel, for instance: extracting all of 2020 in one batch, all of 2021 in another batch, and YTD 2022 in a final batch. (See the partitioned configs sketched after this list.)
- Especially for new EL pipelines: prioritizing recent dates' data over prior dates. For instance, we may start by extracting "current year YTD" as the highest priority. Then, once the pipeline is up and running, we may keep backfilling one year at a time: 2021, 2020, 2019, etc., with the recent periods having higher priority than prior periods.
- Intentionally skipping over records which have not reached a minimal cool-off period.
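To make the parallel-partition use case concrete, here is a hedged sketch of what the three batch configs might look like. The `end_date` key is the proposed (not yet existing) setting, and the exclusive-end convention discussed later in this proposal is assumed:

```python
# Hypothetical partitioned configs for the parallel use case above, assuming
# the proposed `end_date` setting (exclusive) lands alongside the existing
# Singer `start_date` setting (inclusive).
batch_2020 = {"start_date": "2020-01-01T00:00:00Z", "end_date": "2021-01-01T00:00:00Z"}
batch_2021 = {"start_date": "2021-01-01T00:00:00Z", "end_date": "2022-01-01T00:00:00Z"}
batch_ytd = {"start_date": "2022-01-01T00:00:00Z"}  # no end_date: syncs up to "now"
```

Note that each batch's `end_date` matches the next batch's `start_date`, so the partitions tile the timeline without gaps or overlap.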
Spec
- The SDK would add support for an optional `end_date` config input. When provided, records received from the source would be ignored and/or filtered out if they were later than the provided `end_date`.
- The SDK would never advance the bookmark beyond the `end_date`.
- The SDK would likely treat this identically to the Signposts feature, which already performs the record filtering behavior as well as the bookmark limiting behavior. (In the case of the signpost, the goal is to not mark any bookmarks newer than the `utcnow()` calculation at the start of execution.) A rough sketch of this filtering and bookmark-capping logic follows this list.
- Different APIs would have differing levels of ability for performance optimization:
  - An API that supports both `start` and `end` filters can get exactly the records needed.
  - Likewise, SQL-based taps can filter for exactly the records needed.
  - APIs that do not support an `end` filter can cancel their paginated calls early if the output is known to be sorted and we have already passed the `end_date` limit.
  - APIs that do not support an `end` filter and also do not produce sorted output may be forced to continue paginating through all records until the end of the stream. This means that extraction will be significantly slower: even though the tap will only emit matching records, it still has to paginate through every record to find the ones that match.
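A minimal sketch of the behavior described above, written as standalone Python rather than actual SDK internals (the `updated_at` replication key, the `emit()` helper, and the `records` iterable are all illustrative assumptions):

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp such as '2022-01-01T00:00:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def emit(record: dict) -> None:
    """Stand-in for writing a Singer RECORD message to stdout."""
    print(record)


def sync_stream(records, config: dict, state: dict, is_sorted: bool = False) -> dict:
    """Filter out records at or past `end_date` and never advance the
    bookmark beyond it, mirroring the existing signpost behavior."""
    end_date = parse_ts(config["end_date"]) if "end_date" in config else None
    bookmark = state.get("bookmark")

    for record in records:
        record_ts = parse_ts(record["updated_at"])  # assumed replication key
        if end_date is not None and record_ts >= end_date:
            if is_sorted:
                break  # sorted stream: nothing after this can be in range
            continue   # unsorted stream: must keep scanning to the end
        emit(record)
        # Advance the bookmark; emitted records are always < end_date, so the
        # bookmark can never move past the end_date "signpost".
        if bookmark is None or record_ts > parse_ts(bookmark):
            bookmark = record["updated_at"]

    state["bookmark"] = bookmark
    return state
```

The `break`-vs-`continue` branch captures the performance difference between sorted and unsorted streams noted in the last two sub-bullets above.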
Spec Questions (TBD)
Still TBD is whether the `end_date` should be inclusive or exclusive. Exclusive is probably the correct behavior, so that `Jan 1 2022 0:00:00.000` (for instance) could be used as the `end_date` on one pipeline and the `start_date` on another. If we ask users to provide an inclusive date, users are likely to provide something like `Dec 31 2021 11:59:59`, which (depending on the precision of the source system) is subject to gaps, and therefore to unintentional data loss.
If we go with exclusive logic, and given that `start_date` is inclusive, then the logic would be:

`start_date <= record_date < end_date`
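To illustrate the boundary behavior (a hypothetical helper, assuming timezone-aware timestamps):

```python
from datetime import datetime, timezone


def in_range(record_date: datetime, start_date: datetime, end_date: datetime) -> bool:
    # Inclusive start, exclusive end: start_date <= record_date < end_date
    return start_date <= record_date < end_date


boundary = datetime(2022, 1, 1, tzinfo=timezone.utc)
jan_2021 = datetime(2021, 1, 1, tzinfo=timezone.utc)
jan_2023 = datetime(2023, 1, 1, tzinfo=timezone.utc)

# A record stamped exactly at the boundary is excluded from the earlier
# partition and included in the later one: exactly one partition owns it.
assert not in_range(boundary, jan_2021, boundary)
assert in_range(boundary, boundary, jan_2023)
```

Under this rule, every record lands in exactly one partition, with no gaps and no double-counting.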
Caveats/warnings
For merge-insert operations, especially when runs are executed in parallel, it is important to note that the latest time block should always be loaded last. This is because, over the course of a parallel execution, the same record may appear both in a historic period and in the latest time window. In order to not lose the most recent changes, the final/current time period should be loaded last or (safer yet) the final/current time block should be rerun after the prior periods have been extracted and loaded.
A theoretical example: a load for "2021" and a load for "2022 YTD" are running in parallel. A `customers` table record, "ABC Customer", has not been updated since Dec 2021. It is updated while the load is running, and "moves" into the 2022 YTD bucket after already being picked up in the 2021 bucket. If 2021 loads to the warehouse after the 2022 YTD bucket is loaded, the older change will overwrite the newer one, causing data loss.
The way to resolve this is either to wait until backfills have run before running the most recent period, or to rerun the latest period so that the newer version of the record once again becomes the primary/correct version in the warehouse, as sketched below.
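A hedged sketch of the safer ordering, as hypothetical orchestration code (`run_sync` stands in for invoking the tap/target pair with one config partition):

```python
partitions = [
    {"start_date": "2020-01-01T00:00:00Z", "end_date": "2021-01-01T00:00:00Z"},
    {"start_date": "2021-01-01T00:00:00Z", "end_date": "2022-01-01T00:00:00Z"},
]
latest = {"start_date": "2022-01-01T00:00:00Z"}  # current period, no end_date


def run_sync(config: dict) -> None:
    """Stand-in for invoking the tap/target pair with one config partition."""
    ...


# Historical partitions can run in any order, or in parallel.
for config in partitions:
    run_sync(config)

# The latest period runs (or reruns) last, so the newest version of any
# record is the one that ultimately lands in the warehouse.
run_sync(latest)
```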
Since these challenges are reasonably addressed with some best practice documentation and/or additional intelligence added to the orchestrator that wants to run these in parallel (such as Meltano), there doesn't seem to be a strong reason not to build this feature.
See also
- Proposal for parallel execution capability in Meltano: Improve ELT performance by running multiple tap stream processes in parallel ("Melturbo") meltano#2677