162 changes: 162 additions & 0 deletions PROJECT_QUESTIONS.md
@@ -0,0 +1,162 @@
### Week 1 Queries

1. How many users do we have?

-- All Users
select
count(distinct u.user_id) as user_counts

from "DEV_DB"."DBT_KSHYAM91YAHOOCOM"."POSTGRES__USERS" u
;
-- Users w/ orders
select
count(distinct o.user_id) as user_counts

from "DEV_DB"."DBT_KSHYAM91YAHOOCOM"."POSTGRES__ORDERS" o
;

2. On average, how many orders do we receive per hour?

-- first, we count the number of orders in each distinct calendar hour
-- (date + hour), so the same hour of day on different dates is not merged
with tmp as (
select
date_trunc(hour, o.created_at) as created_at_hour
, count(o.order_id) as order_counts

from "DEV_DB"."DBT_KSHYAM91YAHOOCOM"."POSTGRES__ORDERS" o

group by 1
)

-- next, we compute the hourly average
select
round(avg(t.order_counts), 2) as avg_orders_per_hour

from tmp t
;

3. On average, how long does an order take from being placed to being delivered?

-- first, we calculate 'time_to_deliver' in days for every order
-- (datediff(day, ...) counts calendar-day boundaries crossed, so the value is in whole days)
with tmp as (
select
o.*
, datediff(day, o.created_at, o.delivered_at) as time_to_deliver


from "DEV_DB"."DBT_KSHYAM91YAHOOCOM"."POSTGRES__ORDERS" o

-- this is ONLY null for non-delivered orders
where o.delivered_at is not null
)

select
round(avg(time_to_deliver),2) as avg_time_to_deliver

from tmp
;

4. How many users have only made one purchase? Two purchases? Three+ purchases?
Note: you should consider a purchase to be a single order. In other words, if a user places one order for 3 products, they are considered to have made 1 purchase.

-- first, we calculate order counts per user
with tmp as (
select
o.user_id
, count(o.order_id) as order_counts

from "DEV_DB"."DBT_KSHYAM91YAHOOCOM"."POSTGRES__ORDERS" o

group by 1
)

-- next, we find count of users with 1,2,3+ orders
select
count(case when t.order_counts = 1 then t.user_id end) as users_with_only_one_order
, count(case when t.order_counts = 2 then t.user_id end) as users_with_two_orders
, count(case when t.order_counts >= 3 then t.user_id end) as users_with_three_plus_orders

from tmp t
;

5. On average, how many unique sessions do we have per hour?

-- first, we count the number of unique sessions in each distinct calendar hour
-- (date + hour), so the same hour of day on different dates is not merged
with tmp as (
select
date_trunc(hour, e.created_at) as created_at_hour
, count(distinct e.session_id) as session_counts

from "DEV_DB"."DBT_KSHYAM91YAHOOCOM"."POSTGRES__EVENTS" e

group by 1
)

-- next, we compute the hourly average
select
round(avg(t.session_counts), 2) as avg_session_per_hour

from tmp t
;

### Week2
PART 1 : MODELS
1. What is our user repeat rate?

Repeat Rate = Users who purchased 2 or more times / users who purchased

![alt text](image-4.png)
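
As a sketch of how the formula above can be expressed in SQL (the screenshot may come from a different query): this assumes the Week 1 orders table is the right source and that every row in it is a purchase.

```sql
-- Sketch only: repeat rate from the raw orders table used in Week 1.
with orders_per_user as (
    select
        o.user_id
        , count(distinct o.order_id) as order_counts
    from "DEV_DB"."DBT_KSHYAM91YAHOOCOM"."POSTGRES__ORDERS" o
    group by 1
)

select
    -- users with 2 or more orders
    count(case when order_counts >= 2 then user_id end) as repeat_users
    -- every user in this CTE has at least one order
    , count(user_id) as purchasing_users
    , round(
        count(case when order_counts >= 2 then user_id end) / count(user_id)
        , 4
      ) as repeat_rate
from orders_per_user
;
```

Users with no orders never appear in the CTE, so the denominator is "users who purchased", matching the definition.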

2. What are good indicators of a user who will likely purchase again? What about indicators of users who are likely NOT to purchase again? If you had more data, what features would you want to look into to answer this question?
- I'd look at the events (page views) data for users who ordered vs. those who did not, to understand how far through the funnel they progressed.
- I'd hypothesise that users who progressed to checkout are highly likely to finish the order.

3. Explain the product mart models you added. Why did you organize the models in the way you did?
- First, I created a marts folder with subfolders (core, marketing, product), though ONLY product has models in it.
- The product folder has 2 subfolders: intermediate & fact. All preliminary transformation models are in the intermediate folder & the final (fact) models are in the fact folder.
- fact_page_views combines page views with the users & products datasets. This helps us understand page views for every product across time & users.
- fact_daily_product_orders is a simple fact table that helps us report on daily product orders.
- We can now use these 2 facts in the reporting layer & filter by a specific product to understand page views or orders over time.

4. Use the dbt docs to visualize your model DAGs to ensure the model layers make sense.
Please see the DAG below:
![alt text](image-1.png)

PART 2 : TESTS
1. I have created 2 .yml files for the 2 facts & added unique/not_null tests on the primary key. Please see test results below:
![alt text](image-2.png)
2. I have added primary key <> foreign key tests to the staging yml/model, so we should be good from an upstream data quality perspective.
3. Ran 'dbt test' on the entire project & all tests are passing!
![alt text](image-3.png)
4. Real-time alerts: we should link our dbt job notifications to Slack so that test failures are reported as they happen. I believe dbt Cloud can also send job notifications by email; with dbt Core we would need to set up alerting ourselves (e.g., in the orchestrator).

PART 3 : SNAPSHOTS
- Which products had their inventory change from week 1 to week 2?
Please see query/output below:
![alt text](image-5.png)
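
For readers who cannot open the screenshot, a query along these lines would answer the question. This is a sketch, not necessarily the query shown above: the snapshot relation name `products_snapshot` and the `name` / `inventory` columns are assumptions, while `dbt_valid_from` is one of the metadata columns dbt adds to every snapshot.

```sql
-- Sketch only: compare each snapshot version of a product with the previous one.
with versions as (
    select
        s.product_id
        , s.name
        , s.inventory
        , s.dbt_valid_from
        , lag(s.inventory) over (
            partition by s.product_id
            order by s.dbt_valid_from
          ) as previous_inventory
    from products_snapshot s   -- hypothetical snapshot relation name
)

select
    v.product_id
    , v.name
    , v.previous_inventory
    , v.inventory as current_inventory
    , v.dbt_valid_from as changed_at
from versions v
-- the first version of a product has no previous row, so it is not a change
where v.previous_inventory is not null
  and v.inventory <> v.previous_inventory
order by v.product_id, v.dbt_valid_from
;
```

To restrict the output to a single week-over-week comparison, add a filter on `changed_at` for the relevant snapshot run.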

### Week3

Part 1
1. What is our overall conversion rate?
![alt text](image-6.png)
2. What is our conversion rate by product?
Not sure what the expected calculation/outcome is here: in the events data, rows where product_id is populated have no corresponding order_id, so I'm not sure how to tell whether a product-related event ended up in an order.
![alt text](image-7.png)
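
One possible way around the missing link is to connect the two at the session level: a session "converts" for a product if it viewed that product and its checkout event's order contains that product. The sketch below assumes this definition, and assumes `postgres__order_items` exposes `order_id` and `product_id`; the `int_events` columns are the ones already used in the fact models.

```sql
-- Sketch only: per-product conversion = purchasing sessions / viewing sessions.
with product_sessions as (
    -- sessions in which each product was viewed
    select distinct
        e.product_id
        , e.session_id
    from {{ ref('int_events') }} e
    where e.event_type = 'page_view'
      and e.product_id is not null
)
, purchase_sessions as (
    -- sessions with a checkout, expanded to the products on that order
    select distinct
        oi.product_id
        , e.session_id
    from {{ ref('int_events') }} e
    join {{ ref('postgres__order_items') }} oi   -- columns assumed
        on e.order_id = oi.order_id
    where e.event_type = 'checkout'
)

select
    ps.product_id
    , count(distinct ps.session_id) as sessions_with_view
    , count(distinct pu.session_id) as sessions_with_purchase
    , round(
        count(distinct pu.session_id) / count(distinct ps.session_id)
        , 4
      ) as conversion_rate
from product_sessions ps
left join purchase_sessions pu
    on ps.product_id = pu.product_id
    and ps.session_id = pu.session_id
group by 1
order by conversion_rate desc
;
```

The left join keeps products that were viewed but never purchased, so they show up with a conversion rate of 0 rather than disappearing.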

Part 2 : Macros
Create a macro to simplify part of a model(s).
Please see the 'distinct_event_counts_per_event_type' macro, which is used in 'fact_sessions'.
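
For reference, given the four event types hard-coded in the macro, the call is expected to render to roughly the column list below. Exact whitespace will differ, and the surrounding select is simplified for illustration: `int_events` stands in for whatever relation `ref('int_events')` resolves to.

```sql
-- Approximate compiled form of {{ distinct_event_counts_per_event_type() }}
-- inside a simplified session-level select.
select
    e.session_id
    , count(distinct case when event_type = 'checkout' then event_id end) as distinct_counts_of_checkout_events
    , count(distinct case when event_type = 'package_shipped' then event_id end) as distinct_counts_of_package_shipped_events
    , count(distinct case when event_type = 'add_to_cart' then event_id end) as distinct_counts_of_add_to_cart_events
    , count(distinct case when event_type = 'page_view' then event_id end) as distinct_counts_of_page_view_events
from int_events e   -- placeholder for the resolved ref
group by 1
;
```

Adding a new event type then only requires extending the list in the macro, not editing every model that needs these counts.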

Part 3 : Post hook / grants
Please check greenery/dbt_project.yml for grants

Part 4 : Packages
Please see 'packages.yml', where I have 2 packages installed: dbt_utils & dbt_expectations. Their usage is as follows:
- dbt_utils.generate_surrogate_key from dbt_utils to generate a unique key in postgres__order_items
- Column additions to events source data : dbt_expectations.expect_table_column_count_to_be_between
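
As an illustration of the first bullet, a staging model can build its key like this. The sketch assumes the key is composed of `order_id` and `product_id`, and the source name, the `quantity` column and the alias `order_item_id` are hypothetical; only `dbt_utils.generate_surrogate_key` itself is taken from the package.

```sql
-- Sketch only: surrogate key for an order-items staging model.
select
    -- hashes the listed columns into a single deterministic key
    {{ dbt_utils.generate_surrogate_key(['order_id', 'product_id']) }} as order_item_id
    , order_id
    , product_id
    , quantity
from {{ source('postgres', 'order_items') }}   -- hypothetical source definition
```

A unique/not_null test on the generated key then guards the assumed grain of one row per order per product.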

Part 5 : Show (using dbt docs and the model DAGs) how you have simplified or improved a DAG using macros and/or dbt packages.
See above comments

Part 6 : Dbt Snapshots
Products that had inventory changes from week 2 to week 3:
![alt text](image-8.png)
4 changes: 4 additions & 0 deletions greenery/.gitignore
@@ -0,0 +1,4 @@

target/
dbt_packages/
logs/
15 changes: 15 additions & 0 deletions greenery/README.md
@@ -0,0 +1,15 @@
Welcome to your new dbt project!

### Using the starter project

Try running the following commands:
- dbt run
- dbt test


### Resources:
- Learn more about dbt [in the docs](https://docs.getdbt.com/docs/introduction)
- Check out [Discourse](https://discourse.getdbt.com/) for commonly asked questions and answers
- Join the [chat](https://community.getdbt.com/) on Slack for live discussions and support
- Find [dbt events](https://events.getdbt.com) near you
- Check out [the blog](https://blog.getdbt.com/) for the latest news on dbt's development and best practices
Empty file added greenery/analyses/.gitkeep
Empty file.
39 changes: 39 additions & 0 deletions greenery/dbt_project.yml
@@ -0,0 +1,39 @@

# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'greenery'
version: '1.0.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: 'greenery'

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

clean-targets: # directories to be removed by `dbt clean`
- "target"
- "dbt_packages"


# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models

# In this example config, we tell dbt to build all models in the example/
# directory as views. These settings can be overridden in the individual model
# files using the `{{ config(...) }}` macro.
models:
  greenery:
    # Config indicated by + and applies to all files under models/example/
    example:
      +materialized: view
    post-hook:
      - "GRANT SELECT ON {{ this }} TO reporting"
Empty file added greenery/macros/.gitkeep
Empty file.
12 changes: 12 additions & 0 deletions greenery/macros/distinct_event_counts_per_event_type.sql
@@ -0,0 +1,12 @@
{%- macro distinct_event_counts_per_event_type() -%}

{% set event_types = ["checkout", "package_shipped", "add_to_cart","page_view"] %}

{% for event_type in event_types %}
count(distinct case when event_type = '{{event_type}}' then event_id end) as distinct_counts_of_{{event_type}}_events
{%- if not loop.last -%}
,
{%- endif -%}
{% endfor %}

{%- endmacro -%}
29 changes: 29 additions & 0 deletions greenery/models/marts/product/fact/fact_daily_product_orders.sql
@@ -0,0 +1,29 @@
{{
config(
materialized = 'table'
, unique_key = 'order_product_date_key'
)
}}

/*
Grain/primary key : One row per product per (order) date
Stakeholders : Product Team (Product Manager X)
Purpose : Report on daily product orders that were delivered
*/

select
o.product_id
, o.product_name
, o.product_price
, date(o.order_created_at) as order_created_date
, concat_ws('-', o.product_id, order_created_date) as order_product_date_key
-- aggregates
, count(distinct o.order_id) as count_of_daily_product_orders
, sum(o.order_product_quantity) as count_of_daily_product_quantity
, sum(o.product_price * o.order_product_quantity) as daily_product_order_value

from {{ ref('int_orders') }} o

where o.order_status = 'delivered'

group by all
@@ -0,0 +1,8 @@
models:
  - name: fact_daily_product_orders
    description: Contains daily product (delivered) orders stats
    columns:
      - name: order_product_date_key
        tests:
          - unique
          - not_null
35 changes: 35 additions & 0 deletions greenery/models/marts/product/fact/fact_page_views.sql
@@ -0,0 +1,35 @@
{{
config(
materialized = 'table'
, unique_key = 'pageview_id'
)
}}

/*
Grain/primary key : One row per event_id
Stakeholders : Product Team (Product Manager X)
Purpose : Understand page view data by collating all related dimensions/facts
ToDo: replace all select * in the model with a jinja list
*/

select
pageviews.event_id as pageview_id
, pageviews.session_id
, pageviews.user_id
, pageviews.event_created_at as pageview_created_at
, pageviews.product_id
, pageviews.event_is_from_weekend
-- users
, users.* exclude user_id
-- products
, products.* exclude product_id

from {{ ref('int_events') }} pageviews

left join {{ ref('int_users') }} users
on pageviews.user_id = users.user_id

left join {{ ref('int_products') }} products
on pageviews.product_id = products.product_id

where pageviews.event_type = 'page_view'
8 changes: 8 additions & 0 deletions greenery/models/marts/product/fact/fact_page_views.yml
@@ -0,0 +1,8 @@
models:
  - name: fact_page_views
    description: Contains page views (from event data) enriched with user & product info
    columns:
      - name: pageview_id
        tests:
          - unique
          - not_null
63 changes: 63 additions & 0 deletions greenery/models/marts/product/fact/fact_sessions.sql
@@ -0,0 +1,63 @@
{{
config(
materialized = 'incremental'
, unique_key = 'session_id'
)
}}

/*
Grain/primary key : One row per session_id
Stakeholders : Product Team (Product Manager X)
Purpose : Understand session data
ToDo: replace all select * in the model with a jinja list
*/

-- roll up event data at the session level
with sessions as (
select
e.session_id
, e.user_id
-- min/max
, min(e.event_created_at) as session_start_at
, max(e.event_created_at) as session_end_at
, max(case when e.order_id is not null then true else false end) as session_has_order
, max(e.event_is_from_weekend) as session_has_weekend_event
, listagg(distinct e.event_type,',') within group (order by e.event_type) as session_event_types
-- counts
, {{ distinct_event_counts_per_event_type () }}
, count(distinct e.event_id) as count_of_events_in_session
-- calcs
, datediff(minute,session_start_at,session_end_at) as session_duration_in_minutes

from {{ ref('int_events') }} e

group by 1,2
)
, product_sessions as (
select
e.session_id
, listagg(distinct p.product_name,',') within group (order by p.product_name) as session_product_types

from {{ ref('int_events') }} e

join {{ ref('int_products') }} p
on e.product_id = p.product_id

group by 1
)

select
-- sessions
s.*
-- users
, users.* exclude user_id
-- product sessions
, ps.session_product_types

from sessions s

left join {{ ref('int_users') }} users
on s.user_id = users.user_id

left join product_sessions ps
on s.session_id = ps.session_id