Skip to content

Add Network Broadcast Join #223

@gene-bordegaray

Description

@gene-bordegaray

Add a new network broadcast join operator.

Quick definition and benefit:
"A broadcast join is a type of join operation in which one of the tables is small enough to fit in memory, and is broadcast to all the worker nodes in the cluster. This allows the join operation to be performed locally on each worker node, rather than requiring a shuffle operation to redistribute the data."

Impact:

  • Enable Parallelism: CollectLeft Join is force to run on 1 worker, with broadcast join it will run in parallel across all workers
  • Reduces network traffic: won't have to shuffle large amounts of data across network, rather just replicate the small table in memory of each worker

Some useful links:
How Broadcast joins in Spark work
LinkedIn Post with Visuals

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions