Fix legacy insert by period materialization #1
HorvathDanielMarton wants to merge 31 commits into master from dev/branch
Conversation
* PR - lowercase `except` values in `star()`, resolves dbt-labs#402
* Update test_star.sql
* Update CHANGELOG.md
* Update test_star.sql

Co-authored-by: jasnonaz <[email protected]>
One more question. I encountered another issue with Postgres, as described in the same comment, dbt-labs/dbt-labs-experimental-features#32. But I couldn't track down in your new code what happened with it. Perhaps you can comment on whether this part follows the same logic as before, and whether this error will reappear. The error I got (after fixing the subquery alias above) was: "cannot create temporary relation in non-temporary schema". I "solved" it by changing line 130 to:

```
{%- set tmp_relation = api.Relation.create(identifier=tmp_identifier,
                                           schema=None, type='table') -%}
```

This potentially breaks other features or supported DB engines, but my setup was fine with that.
I updated the code as a direct result of your question, hopefully for the better! As I started from the incremental materialization, I took a look at how it handles this. Unfortunately, I can't test this on Postgres, but I tested on Snowflake, and the updated code works as expected. I think using a dbt core function is the way to go forward. Thanks for your comment!
Force-pushed from 19dd99f to f82f2ee
All the checks passed for Postgres, so I think we are good there. The rest of the integration checks won’t pass here, in this repository. Thank you again @etoulas for your comments. I opened a PR with this change in dbt-utils too, so if you’d like to follow the conversation, here it is: dbt-labs#410 @rgabo thanks again for your continuous support on this. What do you think, should we 1) wait for some feedback on the other PR or 2) merge this back locally and modify it later if it’s necessary based on the review we receive? Also, do you have any idea why their CI didn't run for our PR?
@HorvathDanielMarton with regards to this PR, we can simply use the functionality from the upstream dbt-utils PR. With that in mind, we can close this PR and keep all relevant discussion in dbt-labs#410. As far as CI goes, it is likely that PRs are not automatically run against CI; we'll likely need an author to sprinkle their magic on the PR to trigger CI 😄
Description & motivation
The goal of this change is to make the `insert_by_period` materialization compatible with Snowflake while using the currently latest dbt (0.21.0b2) and keeping the same functionality.

The `insert_by_period` materialization was introduced in this Discourse post, and the intention behind it is to help with “problematic” initial builds of data models. Here, problematic means that some tables are just so large and/or complex that a simple `--full-refresh` is not adequate, because the build will be incredibly slow, inefficient, and can even get stuck. Unfortunately, the currently available version does not support Snowflake, and it’s advised to refactor it using the refreshed code of the incremental materialization as a good starting point. As a result, this change addresses dbt-labs/dbt-labs-experimental-features#32.
The materialization itself was built entirely from the current incremental materialization, and the modifications made to it were inspired by the original `insert_by_period`. The macros that were originally in `insert_by_period.sql` were moved to a `helpers.sql` file, where Snowflake versions of `get_period_boundaries()` and `get_period_sql()` are introduced. Also, a new `check_for_period_filter()` macro has been added.

The materialization seems fully functional and so far works as expected; nevertheless, I'm still testing it against production models.
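To illustrate the placeholder mechanism this materialization relies on: the model's SQL contains a literal `__PERIOD_FILTER__` token, and each iteration swaps in a predicate bounded to a single period before running the insert. The Python below is only a sketch of that substitution, with a hypothetical `period_filter()` helper; the real work happens inside the `get_period_sql()` macro, and the exact predicate shape (strict vs. inclusive bounds) may differ.

```python
from datetime import datetime

def period_filter(timestamp_field: str, start: datetime, stop: datetime) -> str:
    """Build a predicate for one period's load.

    Illustrative only -- the bound operators here are an assumption,
    not a copy of the actual macro's output.
    """
    return (
        f"{timestamp_field} > '{start.isoformat(sep=' ')}' and "
        f"{timestamp_field} <= '{stop.isoformat(sep=' ')}'"
    )

# A model body as the materialization would see it:
model_sql = "select * from raw.events where __PERIOD_FILTER__"

# One day's chunk of the overall build:
chunk_sql = model_sql.replace(
    "__PERIOD_FILTER__",
    period_filter("created_at", datetime(2021, 8, 28), datetime(2021, 8, 29)),
)
print(chunk_sql)
```

Each iteration produces a fully bounded query, so a failed run only loses the period that was in flight, not the whole build.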
Ideas

- Currently the chunk size is a single `day`, `week`, `month`, etc. unit. However, we might want to build our models fortnightly (i.e. 2-week chunks). We could introduce an optional configuration that lets us use a configurable number of periods as a chunk. The fortnightly example could look like this:

  ```
  {{
    config(
      materialized = "insert_by_period",
      timestamp_field = "created_at",
      period = 'week',
      interval_size = 2,
      start_date = "2021-01-01",
      stop_date = "2021-08-31"
    )
  }}
  ```

- Extending `__PERIOD_FILTER__` backwards by some time (for example a month of lookback, but it should be configurable) would also be useful. Currently, this materialization doesn't support this, but that is surely a shortcoming for some of our models where `insert_by_period` would come in handy because of their massive size & complexity. Also, it seems our team is not the only one who bumped into this: "Feature: Add optional `lookback_interval` param to `insert_by_period` materialization" dbt-labs/dbt-utils#394

A weird behavior
The materialization can exhibit weird behavior if it’s executed on an already existing model, because the “last incremental run” may have a very short overlap with the specified date range. Let’s take the following config as an example:
```
{{
  config(
    materialized = "insert_by_period",
    timestamp_field = "created_at",
    period = 'day',
    start_date = "2021-08-28",
    stop_date = "2021-08-31"
  )
}}
```

This is going to build a model with events `2021-08-28 < created_at < 2021-08-31`, so it covers 3 days and will thus be built in 3 steps, given it’s a fresh start and the run is uninterrupted. If the run is terminated after building the 1st day, the next run will again have 3 steps. It's not unlikely that the last step will only insert a handful of events (many magnitudes smaller than the previous ones), like in this CLI output:

This may raise some eyebrows, but it makes perfect sense if we look at the query’s where clause, which is basically this:
So the last iteration is only looking for events that happened between `2021-08-30 23:59:59.852` and `2021-08-30 23:59:59.999`, which is far from even being a second. It would be nice to deal with this, so it won't cause suspicion that something is buggy. This is most conspicuous for large models with a lot of events.
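The arithmetic behind that tiny final window can be sketched in Python. This is an illustrative reconstruction, not the actual `get_period_boundaries()` macro: it assumes the resumed run starts from the maximum `created_at` already loaded, and that the upper bound is the configured `stop_date` minus a millisecond.

```python
from datetime import datetime, timedelta
import math

def period_boundaries(start: datetime, stop: datetime,
                      period: timedelta = timedelta(days=1)):
    """Yield (lower, upper) bounds per iteration: each window is one
    period long, and the final window is capped at the stop timestamp."""
    n = math.ceil((stop - start) / period)
    for i in range(n):
        lower = start + i * period
        upper = min(start + (i + 1) * period, stop)
        yield lower, upper

# Resuming after one day was built: start is the max(created_at) seen so
# far, stop is stop_date minus a millisecond (illustrative values).
start = datetime(2021, 8, 28, 23, 59, 59, 852000)
stop = datetime(2021, 8, 30, 23, 59, 59, 999000)

windows = list(period_boundaries(start, stop))
for lo, hi in windows:
    print(lo, "->", hi)

# The last window spans only 147 milliseconds:
print(windows[-1][1] - windows[-1][0])
```

Because the 2-day-and-147-millisecond range still rounds up to 3 periods, the run again shows 3 steps, and the last one covers only the sub-second remainder, which is why it inserts almost nothing.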