Open
Description
Add a new network broadcast join operator.
Quick definition and benefit:
"A broadcast join is a type of join operation in which one of the tables is small enough to fit in memory, and is broadcast to all the worker nodes in the cluster. This allows the join operation to be performed locally on each worker node, rather than requiring a shuffle operation to redistribute the data."
Impact:
- Enables parallelism: a CollectLeft join is forced to run on a single worker; with a broadcast join it runs in parallel across all workers.
- Reduces network traffic: large amounts of data no longer have to be shuffled across the network; only the small table is replicated into the memory of each worker.
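
To make the idea concrete, here is a minimal, self-contained Python sketch of a broadcast hash join. It is only an illustration of the technique, not the proposed operator: the function names (`build_hash_table`, `join_partition`, `broadcast_join`) are hypothetical, threads stand in for worker nodes, and the "broadcast" is simulated by handing the same in-memory table to every worker.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_hash_table(small_rows, key):
    """Index the small (build) side by join key; this is the table that gets broadcast."""
    table = defaultdict(list)
    for row in small_rows:
        table[row[key]].append(row)
    return table


def join_partition(partition, broadcast_table, key):
    """Inner-join one partition of the large (probe) side against the broadcast table."""
    out = []
    for row in partition:
        for match in broadcast_table.get(row[key], []):
            out.append({**match, **row})
    return out


def broadcast_join(small_rows, large_partitions, key):
    """Join every partition of the large side locally, with no shuffle of the large side."""
    table = build_hash_table(small_rows, key)
    # Assumption: threads stand in for workers and share one table object.
    # In a real cluster each worker would receive its own copy over the network.
    workers = max(1, len(large_partitions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: join_partition(p, table, key), large_partitions)
        # pool.map yields results in partition order.
        return [row for part in results for row in part]
```

The large table's partitions never move: each worker probes only the rows it already holds, which is where both the parallelism and the reduced network traffic come from.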
Some useful links:
How Broadcast joins in Spark work
LinkedIn Post with Visuals
gabotechs