From 45b8a8b47febcbbc941d86c509c5c7996b102dd0 Mon Sep 17 00:00:00 2001
From: Cody Peterson
Date: Fri, 7 Oct 2022 18:42:32 +0000
Subject: [PATCH 1/2] minor updates for python models

---
 .../building-models/python-models.md | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/website/docs/docs/building-a-dbt-project/building-models/python-models.md b/website/docs/docs/building-a-dbt-project/building-models/python-models.md
index 1499af9a9b3..bc8d8800618 100644
--- a/website/docs/docs/building-a-dbt-project/building-models/python-models.md
+++ b/website/docs/docs/building-a-dbt-project/building-models/python-models.md
@@ -183,7 +183,7 @@ Out of the box, the `dbt` class supports:
- Accessing the database location of the current model: `dbt.this()` (also: `dbt.this.database`, `.schema`, `.identifier`)
- Determining if the current model's run is incremental: `dbt.is_incremental`

-It is possible to extend this context by setting custom configurations on your Python model and then "getting" them via `dbt.config.get()`. This includes inputs such as `var`, `env_var`, and `target`. If you want to use those values to power conditional logic in your model, we recommend setting them through a dedicated `.yml` file config instead:
+It is possible to extend this context by setting custom configurations in the [model's config](/reference/model-configs) and then "getting" them via `dbt.config.get()`. This includes inputs such as `var`, `env_var`, and `target`. If you want to use those values to power conditional logic in your model, you must set them through a dedicated `.yml` file config:

@@ -568,7 +568,7 @@ As a general rule, if there's a transformation you could write equally well in S

## Specific data platforms

-In their initial launch, Python models are supported on three of the most popular data platforms: Snowflake, Databricks, and GCP (BigQuery + Dataproc). Both Databricks and GCP Dataproc use PySpark as the processing framework. Snowflake uses its own framework, Snowpark, which has many similarities to PySpark.
+In their initial launch, Python models are supported on three of the most popular data platforms: Snowflake, Databricks, and BigQuery/GCP (via Dataproc). Both Databricks and GCP's Dataproc use PySpark as the processing framework. Snowflake uses its own framework, Snowpark, which has many similarities to PySpark.

@@ -597,13 +597,11 @@ models:
-**Additional setup:** The `user` field in your [Spark connection profile](spark-profile) (which is usually optional) is required for running Python models. In the current implementation, this should be your email login to your Databricks workspace (`yourname@company.com`).
-
**Submission methods:** Databricks supports a few different mechanisms to submit PySpark code, each with relative advantages. Some are better for supporting iterative development, while others are better for supporting lower-cost production deployments. The options are:
- `all_purpose_cluster` (default): dbt will run your Python model using the cluster ID configured as `cluster` in your connection profile or for this specific model. These clusters are more expensive but also much more responsive. We recommend using an interactive all-purpose cluster for quicker iteration in development.
-  - `create_notebook: True`: dbt will upload your model's compiled PySpark code to a notebook in the namespace `/dbt_python_model/{schema}`, where `{schema}` is the configured schema for the model, and execute that notebook to run using the all-purpose cluster. The appeal of this approach is that you can easily open the notebook in the Databricks UI for debugging or fine-tuning right after running your model. Remember to copy any changes into your dbt `.py` model code before re-running.
+  - `create_notebook: True`: dbt will upload your model's compiled PySpark code to a notebook in the namespace `/Shared/dbt_python_model/{schema}`, where `{schema}` is the configured schema for the model, and execute that notebook to run using the all-purpose cluster. The appeal of this approach is that you can easily open the notebook in the Databricks UI for debugging or fine-tuning right after running your model. Remember to copy any changes into your dbt `.py` model code before re-running.
  - `create_notebook: False` (default): dbt will use the [Command API](https://docs.databricks.com/dev-tools/api/1.2/index.html#run-a-command), which is slightly faster.
-- `job_cluster`: dbt will upload your model's compiled PySpark code to a notebook in the namespace `/dbt_python_model/{schema}`, where `{schema}` is the configured schema for the model, and execute that notebook to run using a short-lived jobs cluster. For each Python model, Databricks will need to spin up the cluster, execute the model's PySpark transformation, and then spin down the cluster. As such, job clusters take longer before and after model execution, but they're also less expensive, so we recommend these for longer-running Python models in production. To use the `job_cluster` submission method, your model must be configured with `job_cluster_config`, which defines key-value properties for `new_cluster`, as defined in the [JobRunsSubmit API](https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsRunsSubmit).
+- `job_cluster`: dbt will upload your model's compiled PySpark code to a notebook in the namespace `/Shared/dbt_python_model/{schema}`, where `{schema}` is the configured schema for the model, and execute that notebook to run using a short-lived jobs cluster. For each Python model, Databricks will need to spin up the cluster, execute the model's PySpark transformation, and then spin down the cluster. As such, job clusters take longer before and after model execution, but they're also less expensive, so we recommend these for longer-running Python models in production.
To use the `job_cluster` submission method, your model must be configured with `job_cluster_config`, which defines key-value properties for `new_cluster`, as defined in the [JobRunsSubmit API](https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsRunsSubmit).

You can configure each model's `submission_method` in all the standard ways you supply configuration:

@@ -612,7 +610,7 @@ def model(dbt, session):
    dbt.config(
        submission_method="all_purpose_cluster",
        create_notebook=True,
-        cluster="abcd-1234-wxyz"
+        cluster_id="abcd-1234-wxyz"
    )
    ...
```
@@ -634,9 +632,10 @@ models:
# set defaults for all .py models defined in this subfolder
+submission_method: all_purpose_cluster
+create_notebook: False
+ +cluster_id: abcd-1234-wxyz
```

-If not configured, dbt will use the built-in defaults: the all-purpose cluster (based on `cluster_id` in your connection profile), without creating a notebook.
+If not configured, `dbt-spark` will use the built-in defaults: the all-purpose cluster (based on `cluster` in your connection profile), without creating a notebook. The `dbt-databricks` adapter will default to the cluster configured in `http_path`. We encourage explicitly configuring the clusters for Python models in Databricks projects.

**Installing packages:** When using all-purpose clusters, we recommend installing packages which you will be using to run your Python models.

@@ -652,9 +651,7 @@ The `dbt-bigquery` adapter uses a service called Dataproc to submit your Python

**Submission methods.** Dataproc supports two submission methods: `serverless` and `cluster`. Dataproc Serverless does not require a ready cluster, which saves on hassle and cost—but it is slower to start up, and much more limited in terms of available configuration. For example, Dataproc Serverless supports only a small set of Python packages, though it does include `pandas`, `numpy`, and `scikit-learn`. (See the full list [here](https://cloud.google.com/dataproc-serverless/docs/guides/custom-containers#example_custom_container_image_build), under "The following packages are installed in the default image"). Whereas, by creating a Dataproc Cluster in advance, you can fine-tune the cluster's configuration, install any PyPI packages you want, and benefit from faster, more responsive runtimes.

-We recommend:
-- Using the `serverless` submission method for simpler Python models running in production.
-- Using the `cluster` submission method in development, and for models that require custom configuration, such as third-party PyPI packages.
+Use the `cluster` submission method with dedicated Dataproc clusters you or your organization manage. Use the `serverless` submission method to avoid managing a Spark cluster. The latter may be quicker for getting started, but both are valid for production.

**Additional setup:**
- Create or use an existing [Cloud Storage bucket](https://cloud.google.com/storage/docs/creating-buckets)

@@ -672,7 +669,7 @@ The following configurations are needed to run Python models on Dataproc. You ca
def model(dbt, session):
    dbt.config(
        submission_method="cluster",
-        cluster_name="my-favorite-cluster"
+        dataproc_cluster_name="my-favorite-cluster"
    )
    ...
```
@@ -698,7 +695,7 @@ storage.objects.delete

**Installing packages:** If you are using a Dataproc Cluster (as opposed to Dataproc Serverless), you can add third-party packages while creating the cluster.
Google recommends installing Python packages on Dataproc clusters via initialization actions:
-- ["How initialization actions are used"](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/README.md#how-initialization-actions-are-used)
+- [How initialization actions are used](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/README.md#how-initialization-actions-are-used)
- [Actions for installing via `pip` or `conda`](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/python)

You can also install packages at cluster creation time by [defining cluster properties](https://cloud.google.com/dataproc/docs/tutorials/python-configuration#image_version_20): `dataproc:pip.packages` or `dataproc:conda.packages`.

From 5faf9fd808830b9dc98d245a99e8c2e5eb2b8828 Mon Sep 17 00:00:00 2001
From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com>
Date: Fri, 7 Oct 2022 14:58:15 -0400
Subject: [PATCH 2/2] Update website/docs/docs/building-a-dbt-project/building-models/python-models.md

---
 .../building-a-dbt-project/building-models/python-models.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/website/docs/docs/building-a-dbt-project/building-models/python-models.md b/website/docs/docs/building-a-dbt-project/building-models/python-models.md
index bc8d8800618..3289a1fbf69 100644
--- a/website/docs/docs/building-a-dbt-project/building-models/python-models.md
+++ b/website/docs/docs/building-a-dbt-project/building-models/python-models.md
@@ -635,7 +635,7 @@ models:
+submission_method: all_purpose_cluster
+create_notebook: False
+cluster_id: abcd-1234-wxyz
```

-If not configured, `dbt-spark` will use the built-in defaults: the all-purpose cluster (based on `cluster` in your connection profile), without creating a notebook. The `dbt-databricks` adapter will default to the cluster configured in `http_path`. We encourage explicitly configuring the clusters for Python models in Databricks projects.
+If not configured, `dbt-spark` will use the built-in defaults: the all-purpose cluster (based on `cluster` in your connection profile) without creating a notebook. The `dbt-databricks` adapter will default to the cluster configured in `http_path`. We encourage explicitly configuring the clusters for Python models in Databricks projects.

**Installing packages:** When using all-purpose clusters, we recommend installing packages which you will be using to run your Python models.
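For reference, the Databricks settings this patch touches can all come together in a single Python model. The sketch below is illustrative only: it assumes the `dbt-databricks` or `dbt-spark` adapter with an existing all-purpose cluster, and the model name, the `row_limit` custom config, and the cluster ID are hypothetical placeholders rather than anything defined in the patch.

```python
# models/my_databricks_model.py (hypothetical sketch; names and IDs are placeholders)

def model(dbt, session):
    # Submission settings discussed above: run on an existing all-purpose cluster
    # via the Command API (create_notebook=False is the default behavior).
    dbt.config(
        materialized="table",
        submission_method="all_purpose_cluster",
        create_notebook=False,
        cluster_id="abcd-1234-wxyz",  # placeholder cluster ID
    )

    # "row_limit" is a hypothetical custom config, assumed to be set in the model's
    # dedicated .yml file config as recommended above; dbt.config.get() reads it here.
    row_limit = dbt.config.get("row_limit")

    df = dbt.ref("upstream_model")  # DataFrame from a hypothetical upstream model

    if row_limit:
        df = df.limit(row_limit)  # conditional logic driven by the custom config

    return df
```

Setting `create_notebook=True` instead would upload the compiled code to a notebook under `/Shared/dbt_python_model/{schema}` for debugging in the Databricks UI, as described above.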
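A similar sketch for BigQuery, assuming the `dbt-bigquery` adapter with the `cluster` submission method and a pre-created Dataproc cluster; the model and cluster names are placeholders borrowed from the examples in the patch, not a prescribed setup.

```python
# models/my_dataproc_model.py (hypothetical sketch; names are placeholders)

def model(dbt, session):
    # Submit this model's PySpark code to a pre-created Dataproc cluster, which
    # allows third-party PyPI packages installed at cluster creation time.
    dbt.config(
        materialized="table",
        submission_method="cluster",
        dataproc_cluster_name="my-favorite-cluster",  # placeholder name from the example above
    )

    df = dbt.ref("upstream_model")  # PySpark DataFrame read from BigQuery

    # ...transformations that may use packages installed on the cluster...

    return df
```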