In this post, you will learn various join optimization techniques that can be used in Apache Spark.
Two common join strategies in Spark are:
Shuffle hash join:
- It is a map-reduce style join.
- Both datasets are shuffled (repartitioned) by the join key.
- During the reduce phase, rows with the same join key are joined together.
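A minimal spark-shell sketch of forcing a shuffle-based join (the DataFrame names customerDF and ordersDF are illustrative, not from this post): setting spark.sql.autoBroadcastJoinThreshold to -1 disables broadcast joins, so Spark falls back to a shuffle-based strategy.

```scala
// Disable broadcast joins so Spark must shuffle both sides
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

// Illustrative DataFrames; each side is shuffled by its join key
val joined = customerDF.join(ordersDF,
  customerDF("customer_id") === ordersDF("order_customer_id"))

joined.explain()  // physical plan shows an Exchange (shuffle) on each side
```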
Broadcast hash join:
- Use this when one dataset is small enough to fit in the memory of each executor.
- The small dataset is broadcast to every executor, so the large dataset does not need to be shuffled; the join happens locally on each executor.
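A short sketch of requesting a broadcast hash join explicitly, using the same orderDF and orderItemDF as the snippet below (assuming orderItemDF is the smaller side): the broadcast() function hints Spark to ship that DataFrame to every executor.

```scala
import org.apache.spark.sql.functions.broadcast

// Hint Spark to broadcast the smaller DataFrame to all executors
val broadcastJoinDF = orderDF.join(broadcast(orderItemDF),
  orderDF("order_id") === orderItemDF("order_item_order_id"))

broadcastJoinDF.explain()  // plan should show a BroadcastHashJoin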
The below code snippet uses .join() on orderDF and orderItemDF to produce joinedOrderDataDF.
scala> val joinedOrderDataDF = orderDF.join(orderItemDF, orderDF("order_id") === orderItemDF("order_item_order_id"))
joinedOrderDataDF: org.apache.spark.sql.DataFrame = [order_id: int, order_date: bigint … 8 more fields]
You can check which join execution strategy was used for joinedOrderDataDF using:
scala> joinedOrderDataDF.explain —> prints the physical plan, which shows the join type used.
scala> joinedOrderDataDF.queryExecution.executedPlan —> returns the executed physical plan of the DataFrame.
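The two commands above can be combined in practice; a hedged sketch (the contains check on the plan string is one simple way to test the strategy programmatically, not the only way):

```scala
// explain(true) prints the parsed, analyzed, optimized, and physical plans
joinedOrderDataDF.explain(true)

// Inspect the executed plan as a string to check which join was chosen
val plan = joinedOrderDataDF.queryExecution.executedPlan.toString
val usesBroadcast = plan.contains("BroadcastHashJoin")
val usesSortMerge = plan.contains("SortMergeJoin")
```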