@krisnaru (Contributor)

Summary

This PR introduces a custom Airflow operator (ChrononSparkSubmitOperator) and a Dagster asset (chronon_spark_job) to model batch workflow execution for Apache Chronon. These implementations allow validation, serialization, and submission of Chronon Thrift objects as Spark jobs.

Airflow: ChrononSparkSubmitOperator extends SparkSubmitOperator to take a Thrift object as input, validate it, serialize it to JSON, and pass it to Spark.
Dagster: the chronon_spark_job asset performs the same workflow using SparkSubmitTaskDefinition.
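The validate-then-serialize step that both integrations share might look roughly like the following plain-Python sketch. The GroupBy dataclass, its fields, and the prepare_spark_payload helper are hypothetical stand-ins; a real implementation would operate on Chronon's generated Thrift classes and a Thrift JSON serializer:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical stand-in for a generated Chronon Thrift object (e.g. GroupBy).
@dataclass
class GroupBy:
    name: str
    team: str
    sources: list

def validate(conf: GroupBy) -> None:
    """Reject configs missing required fields before anything is submitted."""
    if not conf.name:
        raise ValueError("Chronon config must have a name")
    if not conf.sources:
        raise ValueError("Chronon config must declare at least one source")

def prepare_spark_payload(conf: GroupBy) -> str:
    """Validate the config, then serialize it to the JSON string handed to Spark."""
    validate(conf)
    return json.dumps(asdict(conf), sort_keys=True)

payload = prepare_spark_payload(
    GroupBy(name="user_features.v1", team="ml-platform", sources=["events.clicks"])
)
print(payload)
```

The operator and the asset would each call the same helper, so validation failures surface in the orchestrator before a Spark job is ever launched.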

Why / Goal

This PR enables batch processing for Chronon transformations (e.g., GroupBy, Join, Staging). The goals are to:
✅ Standardize job submission across orchestration platforms (Airflow & Dagster).
✅ Ensure Chronon configs are validated before execution.
✅ Automate serialization of Thrift objects into JSON.

Impact

Enables scalable & reproducible data workflows for batch feature generation.
Reduces manual handling of Thrift objects when integrating Chronon with Spark.
Provides a reusable operator for teams working with Apache Chronon.
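A standardized submission path can reduce to a single argument-building helper that both the Airflow operator and the Dagster asset hand to their Spark integration. The sketch below is illustrative only: the entry-point class, jar name, and --conf-json flag are hypothetical, not taken from this PR:

```python
from typing import List

def build_spark_submit_args(conf_json: str, mode: str = "backfill") -> List[str]:
    """Assemble the spark-submit argument list shared by both orchestrators.

    All concrete names below (class, jar, flag) are illustrative placeholders.
    """
    return [
        "--class", "ai.chronon.spark.Driver",  # hypothetical Spark entry point
        "chronon-spark.jar",                   # hypothetical assembly jar
        mode,                                  # e.g. backfill, upload
        "--conf-json", conf_json,              # hypothetical flag carrying the config
    ]

args = build_spark_submit_args('{"name": "user_features.v1"}')
print(args)
```

Keeping the argument construction in one place means a change to the Spark driver interface touches a single function rather than every DAG and asset.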

Test Plan

Unit Tests: Added tests for validation and serialization of Thrift objects.
CI Coverage: Ensured that existing tests pass with the new operators.
Integration Testing:

  • Airflow DAG tested with GroupBy and Join Thrift objects.

  • Dagster pipeline executed successfully with SparkSubmitTaskDefinition.
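A unit test for the validation step could follow this self-contained sketch; the validate_chronon_conf helper and its required-field list are hypothetical, standing in for whatever the operator actually checks:

```python
import unittest

def validate_chronon_conf(conf: dict) -> None:
    """Hypothetical validation helper: reject configs missing required keys."""
    for key in ("name", "sources"):
        if not conf.get(key):
            raise ValueError(f"missing required field: {key}")

class ValidateChrononConfTest(unittest.TestCase):
    def test_accepts_complete_conf(self):
        # A config with all required fields should pass silently.
        validate_chronon_conf({"name": "join.v1", "sources": ["events"]})

    def test_rejects_missing_name(self):
        # A config without a name should be rejected before submission.
        with self.assertRaises(ValueError):
            validate_chronon_conf({"sources": ["events"]})

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(ValidateChrononConfTest)
)
print(result.wasSuccessful())
```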

Checklist

  • Documentation update
