This repository was archived by the owner on Sep 2, 2025. It is now read-only.

Add spark session connection #279

Merged
jtcohen6 merged 60 commits into dbt-labs:main from JCZuurmond:add-spark-session-connection
Mar 26, 2022

Conversation

Collaborator

@JCZuurmond JCZuurmond commented Jan 28, 2022

resolves #272

Description

Adds a connection method for connecting to a Spark session. See #272 for the full discussion.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-spark next" section.
  • Add new connection method to docs.

@cla-bot cla-bot bot added the cla:yes label Jan 28, 2022
Contributor

@jtcohen6 jtcohen6 left a comment

This is a fascinating PR. Thanks for such clean and well-annotated code. I feel like I'm connecting two wayward dots in my head, between what pyspark does and what database cursors do. To that end, I feel far from qualified to review the specifics; I'd love to spend some time experimenting with one of the engineers on the Core team.

The current "dbtspec" suite/framework for integration testing is quite unfriendly. Please don't spend too much time chiseling away at that particular wall, since we're aiming/planning to replace it over the next few months.
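As an illustration of the link between pyspark and database cursors mentioned in this review, here is a minimal sketch of a DB-API-style cursor wrapped around a session object. It is not the code from this PR. `SessionCursor`, `FakeSession`, and `FakeDataFrame` are hypothetical names; the only pyspark behaviour assumed is that `session.sql(query)` returns a DataFrame whose `collect()` method returns a list of rows. The fake classes stand in for pyspark so the sketch runs without a Spark installation.

```python
from typing import Any, List, Optional, Sequence


class SessionCursor:
    """Hypothetical DB-API-style cursor backed by a session-like object.

    `session` is assumed to behave like a pyspark SparkSession:
    `session.sql(query)` returns an object with a `collect()` method.
    """

    def __init__(self, session: Any) -> None:
        self._session = session
        self._rows: Optional[List[Sequence[Any]]] = None

    def execute(self, sql: str) -> None:
        # Run the statement eagerly and buffer all result rows.
        self._rows = list(self._session.sql(sql).collect())

    def fetchone(self) -> Optional[Sequence[Any]]:
        # Pop the next buffered row, or return None when nothing is left.
        if self._rows:
            return self._rows.pop(0)
        return None

    def fetchall(self) -> List[Sequence[Any]]:
        # Return the remaining rows and leave the buffer empty.
        rows, self._rows = self._rows or [], []
        return rows

    def close(self) -> None:
        # Drop any buffered rows.
        self._rows = None


class FakeDataFrame:
    """Stand-in for a DataFrame; only `collect()` is provided."""

    def __init__(self, rows: List[Sequence[Any]]) -> None:
        self._rows = rows

    def collect(self) -> List[Sequence[Any]]:
        return list(self._rows)


class FakeSession:
    """Stand-in for a SparkSession; returns fixed rows for any query."""

    def sql(self, query: str) -> FakeDataFrame:
        return FakeDataFrame([(1, "a"), (2, "b")])


cursor = SessionCursor(FakeSession())
cursor.execute("select id, name from some_table")
first = cursor.fetchone()   # first buffered row
rest = cursor.fetchall()    # whatever remains after the first row
```

With a real Spark installation, the same wrapper could presumably be handed a session obtained from `SparkSession.builder.getOrCreate()` in place of `FakeSession()`.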

@JCZuurmond JCZuurmond force-pushed the add-spark-session-connection branch 2 times, most recently from 6ec5374 to 5be6715 Compare February 9, 2022 11:31
Collaborator Author

@JCZuurmond JCZuurmond left a comment

@jtcohen6: Functionality-wise, everything is there. I am struggling with the test setup in CircleCI.

schema: "analytics_{{ var('_dbt_random_suffix') }}"
sequences:
test_dbt_empty: empty
# Disabled tests requires hive support
Collaborator Author

I don't know yet how to enable Hive support; this is particularly hard because of the subprocess calls to dbt inside the pytest-dbt-adapter package.

Contributor

Given how difficult the .dbtspec tests are to debug, and the fact that we're moving away from this framework, I'd be happy with an alternative integration test, to validate some truly basic functionality. The setup could be similar to the ones we have in tests/integration currently. I think it would first require adding a new profile marker to the list in conftest.py.
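The suggestion above about adding a profile marker in conftest.py could look roughly like the following sketch. The marker names and the `PROFILE_MARKERS` list are assumptions for illustration, not the contents of this repository's conftest.py; the only pytest API used is `config.addinivalue_line`, called from the standard `pytest_configure` hook.

```python
# Hypothetical conftest.py fragment; the marker names are illustrative only.
PROFILE_MARKERS = [
    "profile_apache_spark",
    "profile_spark_session",  # assumed name for a new session-method marker
]


def pytest_configure(config):
    # Register each marker so pytest does not warn about unknown markers.
    for marker in PROFILE_MARKERS:
        config.addinivalue_line(
            "markers", f"{marker}: run the test against this profile"
        )
```

A test decorated with `@pytest.mark.profile_spark_session` could then be selected with `pytest -m profile_spark_session`.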

Collaborator Author

@JCZuurmond JCZuurmond Mar 10, 2022

Most of the tests are disabled due to missing Hive support, which is not easy to add because of the subprocesses in .dbtspec. @jtcohen6: do you consider the two active tests to be sufficient? (See the thread in tox.ini too.)

Contributor

@jtcohen6 jtcohen6 left a comment

It looks like this is generally working. I'd hate to have the particularities of our current testing setup (.dbtspec + CircleCI) stand in the way of making this capability more broadly available. To that end, I'm open to being more creative about other ways to test this, knowing too that we're soon to revamp all adapter plugin testing.

@JCZuurmond
Collaborator Author

@jtcohen6: Could you provide some help with this PR? The PR is stuck on the testing image: it needs an image that has pyspark plus the standard dbt-spark dependencies. If I move away from .dbtspec, this is still a problem.

@JCZuurmond JCZuurmond force-pushed the add-spark-session-connection branch 4 times, most recently from 18f83ef to 25da4c9 Compare March 15, 2022 20:51
@JCZuurmond JCZuurmond force-pushed the add-spark-session-connection branch from 1c70d74 to 85d23b8 Compare March 24, 2022 08:51
@JCZuurmond
Collaborator Author

@jtcohen6: This PR is ready to merge! Will you review it once more? Could you also review the accompanying docs PR, please?

The PR contains tests for the session connection method. Not all dbtspec tests are enabled. I was not able to set up the connection to a metastore as is done for Thrift, although I think it is technically possible. My expectation is that once that connection is set up, all the tests that run for Thrift should also run for the session connection method. Given that dbtspec is going to be replaced, I would like to wait for that before investing more time in the test suite; I expect it will become easier to enable more tests.

JCZuurmond added a commit to JCZuurmond/docs.getdbt.com that referenced this pull request Mar 24, 2022
Documentation updates that go along with dbt-labs/dbt-spark#279
@JCZuurmond
Collaborator Author

The failing CI steps look unrelated to this PR.

@jtcohen6
Contributor

Thanks @JCZuurmond! Triggering a rerun of the failed CI steps now. Looks like it may have been a connection/timeout issue. We did just migrate to a new dedicated Databricks workspace for sandbox/integration testing.

@JCZuurmond JCZuurmond force-pushed the add-spark-session-connection branch from d765f7a to 37ebe30 Compare March 25, 2022 11:02
Contributor

@jtcohen6 jtcohen6 left a comment

@JCZuurmond Thank you for this very, very cool contribution!

Alongside the dbt-sqlite and dbt-duckdb adapters, this feels like a step in the direction of mocking/testing/validating dbt functionality, without first requiring a connection to a remote database.

I'm totally happy with punting on the test cases that require a metastore to persist objects between sessions. We are finally moving away from .dbtspec, and toward a new testing framework for adapters, which you can check out in #299.

In the meantime, my only change here will be to move your contributor note up in the changelog, since this will be included in v1.1.0rc1.

@jtcohen6 jtcohen6 merged commit 086becb into dbt-labs:main Mar 26, 2022
jtcohen6 pushed a commit to dbt-labs/docs.getdbt.com that referenced this pull request Mar 30, 2022
Documentation updates that go along with dbt-labs/dbt-spark#279
jtcohen6 added a commit to dbt-labs/docs.getdbt.com that referenced this pull request Apr 1, 2022
* Add session method

Documentation updates that go along with dbt-labs/dbt-spark#279

* Add version blocks, tweak language

* Add to v1.1 migration guide

Co-authored-by: Cor <[email protected]>

Development

Successfully merging this pull request may close these issues.

Add connection method for running dbt against a Spark session

2 participants