Modernize legacy insert by period materialization #410
HorvathDanielMarton wants to merge 29 commits into dbt-labs:main
Conversation
```sql
select
    coalesce(max({{timestamp_field}}), '{{start_date}}')::timestamp as start_timestamp,
    coalesce(
        {{dbt_utils.dateadd('millisecond',
```
nanosecond is more accurate, i.e. timestamp(9); millisecond only covers timestamp(3).
```sql
select
    start_timestamp,
    stop_timestamp,
```
Downstream, `load_result` truncates these timestamp(9) values to timestamp(6), which breaks the timestamp comparisons downstream and inserts duplicates if the incremental run is re-executed after a completed load.
The fix is to cast both `start_timestamp` and `stop_timestamp` here to varchar, so that `load_result` gets a string and not a timestamp.
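To see why the truncation bites, here is a minimal Python sketch of the failure mode (not dbt code: fixed-format timestamp strings compared lexicographically stand in for SQL timestamp comparison, and the values are made up):

```python
# timestamp(9) stop boundary computed by get_period_boundaries (hypothetical value)
stop_ns = "2021-08-30 23:59:59.123456789"

# load_result hands the macro only timestamp(6) precision: last 3 digits lost
stop_us = stop_ns[:26]  # "2021-08-30 23:59:59.123456"

# a row loaded by the first run, sitting between the two boundaries
already_loaded = "2021-08-30 23:59:59.123456700"

# first run: row <= stop_ns, so it was inserted
assert already_loaded <= stop_ns

# rerun with the truncated boundary: row > stop_us, so the filter
# selects it again and inserts it a second time -> duplicate
assert already_loaded > stop_us
```

Casting the boundaries to varchar before they pass through `load_result` keeps all nine fractional digits intact, so the rerun filter excludes rows that are already loaded.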
Whew! Thanks for all your work on this @HorvathDanielMarton. I just spent some time talking about this very macro in the dbt Community Slack as well. In short, we (the dbt Labs Developer Experience team) are very interested in the future of this materialization because it does some important work. What we're not certain about yet is exactly how it fits into the wider dbt ecosystem:
I promise this PR won't languish forever, but please bear with us while we work out where we're headed 🧭
Worth noting: anyone else who comes along in the meantime is of course welcome to make a copy of the code in this PR and apply it to their own project!
@joellabes Thank you! To make it easier for the curious ones, just update your packages:

```yaml
  - git: "https://github.com/sspinc/dbt-utils.git"
    revision: fix-legacy-insert-by-period-macro
```

and execute
```sql
{%- set period_filter -%}
  ("{{timestamp_field}}" > '{{start_timestamp}}'::timestamp + interval '{{offset}} {{period}}' and
   "{{timestamp_field}}" <= '{{start_timestamp}}'::timestamp + interval '{{offset}} {{period}}' + interval '1 {{period}}' and
```
This line will not work properly if period = "month". Example:

```sql
select '2018-12-31 00:00:00'::timestamp + interval '4 month' + interval '1 month',
       '2018-12-31 00:00:00'::timestamp + interval '5 month'
```

returns `2019-05-30 00:00:00` and `2019-05-31 00:00:00`, so one day will not be processed.
Redshift can deal with brackets, like `'{{start_timestamp}}'::timestamp + (interval '{{offset}} {{period}}' + interval '1 {{period}}')`, but Snowflake will not like it.
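The month-length drift described above can be reproduced outside SQL. A small stdlib-only Python sketch (`add_months` is a hypothetical helper that clamps the day of month the way `interval 'n month'` does):

```python
import calendar
import datetime

def add_months(d, n):
    # add n months, clamping the day to the target month's length,
    # mirroring how SQL interval 'n month' arithmetic behaves
    m = d.month - 1 + n
    y = d.year + m // 12
    m = m % 12 + 1
    day = min(d.day, calendar.monthrange(y, m)[1])
    return d.replace(year=y, month=m, day=day)

start = datetime.date(2018, 12, 31)

# chained intervals: Dec 31 -> Apr 30 (clamped) -> May 30
chained = add_months(add_months(start, 4), 1)

# single interval: Dec 31 -> May 31 (no intermediate clamping)
direct = add_months(start, 5)

assert chained == datetime.date(2019, 5, 30)
assert direct == datetime.date(2019, 5, 31)
```

The intermediate clamp at the end of April is what eats the day: once the running total is computed as a single `offset + 1` interval, the two expressions agree.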
Suggestion (that I cannot test right away): calculate `offset_plus_one` as `offset + 1` and use it when setting the filter, like:

```sql
{%- set period_filter -%}
  ("{{timestamp_field}}" > '{{start_timestamp}}'::timestamp + interval '{{offset}} {{period}}' and
   "{{timestamp_field}}" <= '{{start_timestamp}}'::timestamp + interval '{{offset_plus_one}} {{period}}' and
   "{{timestamp_field}}" < '{{stop_timestamp}}'::timestamp)
{%- endset -%}
```
I tested this block and it works ok. Feel free to use it.

```sql
{% macro get_period_sql(target_cols_csv, sql, timestamp_field, period, start_timestamp, stop_timestamp, offset) -%}

  {%- set offset_plus_one = offset + 1 -%}

  {%- set period_filter -%}
    ("{{timestamp_field}}" > '{{start_timestamp}}'::timestamp + interval '{{offset}} {{period}}' and
     "{{timestamp_field}}" <= '{{start_timestamp}}'::timestamp + interval '{{offset_plus_one}} {{period}}' and
     "{{timestamp_field}}" < '{{stop_timestamp}}'::timestamp)
  {%- endset -%}

  {%- set filtered_sql = sql | replace("__PERIOD_FILTER__", period_filter) -%}

  select
    {{target_cols_csv}}
  from (
    {{filtered_sql}}
  )

{%- endmacro %}
```
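For context, the `replace` filter in the macro above works on the compiled model SQL, which must contain a literal `__PERIOD_FILTER__` placeholder. A minimal Python sketch of that substitution (the table, column, and dates are hypothetical):

```python
# the model's compiled SQL carries a __PERIOD_FILTER__ placeholder
sql = "select * from events where __PERIOD_FILTER__"

# a concrete predicate the macro would build for one period
period_filter = (
    "(\"created_at\" > '2021-08-28'::timestamp and "
    "\"created_at\" <= '2021-08-29'::timestamp)"
)

# plain string substitution, one period at a time
filtered_sql = sql.replace("__PERIOD_FILTER__", period_filter)

assert "__PERIOD_FILTER__" not in filtered_sql
```

Each iteration of the materialization re-renders the same model SQL with a different period's predicate, which is why the placeholder must appear in the model's `where` clause.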
Folks on this thread will probably be interested in this Discussion: #487. I'd appreciate your input!
looks like this can be moved to the experimental features repo :-)
Unsurprisingly, I can't transfer a PR from one repo to another, but if you want to reopen the PR over there I'll be happy to merge it 👍
Description & motivation
The goal of this change is to make the `insert_by_period` materialization compatible with Snowflake while using the currently latest dbt (0.21.0b2) and keeping the same functionality.

The `insert_by_period` materialization was introduced in this discourse post, and the intention behind it is to help with “problematic” initial builds of data models. Here, problematic means that some tables are so large and/or complex that a simple `--full-refresh` is not adequate: the build will be incredibly slow, inefficient, and can even get stuck. Unfortunately, the currently available version does not support Snowflake, and it’s advised to refactor it using the refreshed code for the incremental materialization as a good starting point. As a result, this change resolves dbt-labs/dbt-labs-experimental-features#32.

The materialization itself was built entirely from the current incremental materialization, and the modifications to it were inspired by the original `insert_by_period`.

The macros that were originally in `insert_by_period.sql` were moved to a `helpers.sql` file, where Snowflake versions of `get_period_boundaries()` and `get_period_sql()` are introduced. A new `check_for_period_filter()` macro has also been added.

The materialization seems fully functional and so far works as expected; nevertheless, I'm still testing it against production models.
Ideas

- Currently, we can use `day`, `week`, `month`, etc. as a unit. However, we might want to build our models fortnightly (i.e. 2-week chunks). We could introduce an optional configuration that lets us use a configurable number of periods as a chunk. The fortnightly example could look like this:

  ```sql
  {{
    config(
      materialized = "insert_by_period",
      timestamp_field = "created_at",
      period = 'week',
      interval_size = 2,
      start_date = "2021-01-01",
      stop_date = "2021-08-31"
    )
  }}
  ```

- Shifting the `__PERIOD_FILTER__` back by some time (for example a month of lookback time, but it should be configurable). Currently, this materialization doesn't support this, but that is surely a shortcoming for some of our models where `insert_by_period` would come in handy because of their massive size & complexity. Also, it seems like our team is not the only one who bumped into this: Feature: Add optional `lookback_interval` param to `insert_by_period` materialization #394

A weird behavior
The materialization can produce a weird behavior if it’s executed on an already existing model because the “last incremental run” may have a very short overlap with the specified date range. Let’s take the following config as an example:
```sql
{{
  config(
    materialized = "insert_by_period",
    timestamp_field = "created_at",
    period = 'day',
    start_date = "2021-08-28",
    stop_date = "2021-08-31"
  )
}}
```

This is going to build a model with events where `2021-08-28 < created_at < 2021-08-31`, so it covers 3 days and will be built in 3 steps, given it’s a fresh start and the run is uninterrupted. If the run is terminated after building the 1st day, the next run will again have 3 steps. It's not unlikely that the last step will only insert a handful of events (many magnitudes smaller than the previous ones), as in this CLI output:

This may raise some eyebrows, but it makes perfect sense if we look at the query’s where clause - which is basically this:
So the last iteration is only looking for events that happened between `2021-08-30 23:59:59.852` and `2021-08-30 23:59:59.999`, which is far from even being a second. It would be nice to deal with this, so it won't cause suspicion that something is buggy. This is most conspicuous for large models with a lot of events.
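The arithmetic behind the tiny final step can be sketched in plain Python. This is illustrative only: the boundary values come from the example above, and the chunk-count formula is an assumption about how the materialization plans its iterations (resuming from the max loaded timestamp rather than `start_date`):

```python
import datetime
import math

# stop boundary from the example's where clause
stop = datetime.datetime(2021, 8, 30, 23, 59, 59, 999000)

# after the first day loaded and the run was terminated, the next run
# resumes from the max already-loaded timestamp instead of start_date
resumed = datetime.datetime(2021, 8, 28, 23, 59, 59, 852000)

period = datetime.timedelta(days=1)

# still 3 steps: ceil(2 days + 147 ms over 1-day periods)
steps = math.ceil((stop - resumed) / period)

# the final step only covers what is left after two whole periods
last_window = stop - (resumed + 2 * period)  # a 147 ms sliver
```

Because the remaining range is just barely over two whole periods, a third step is scheduled that scans only a sub-second window, which is why it inserts almost nothing.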
Checklist