Performing INNER JOIN Operations Using Two Key-Value RDDs
Joining data is an important part of many of our pipelines, and both Spark Core and Spark SQL support the same fundamental types of joins. While joins are very common and powerful, they warrant special performance consideration, as they may require large network transfers or even create data sets beyond our capability to handle.
The “inner” join is both the default and likely what you think of when you think of joining tables. It requires that the key be present in both tables; pairs whose key appears in only one table are dropped from the result.
Creating a new RDD
Using the keyBy function
Constructs two-component tuples (key-value pairs) by applying a function to each data item. The result of the function becomes the key, and the original data item becomes the value of the newly created tuple.
def keyBy[K](f: T => K): RDD[(K, T)]
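The signature above can be sketched as follows. This is a minimal illustration, assuming a local Spark installation; the `Employee` case class, the sample records, and the app name are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

case class Employee(id: Int, name: String)

// Assumed local setup for illustration only
val conf = new SparkConf().setAppName("keyByExample").setMaster("local[*]")
val sc   = new SparkContext(conf)

val employees = sc.parallelize(Seq(Employee(1, "Alice"), Employee(2, "Bob")))

// keyBy applies the function to each element; the function's result
// becomes the key and the whole element becomes the value,
// yielding an RDD[(Int, Employee)]
val byId = employees.keyBy(_.id)

byId.collect().foreach(println)
// prints pairs such as (1,Employee(1,Alice)) and (2,Employee(2,Bob))
```

Note that `keyBy` keeps the entire original element as the value, which is convenient when the key is derived from a field of a larger record.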
Create another RDD
Use the keyBy Transformation
JOIN Using Two Key-Value RDDs
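The steps above can be sketched end to end. This is a minimal example, assuming a local Spark installation; the sample employee and department records are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup for illustration only
val conf = new SparkConf().setAppName("innerJoinExample").setMaster("local[*]")
val sc   = new SparkContext(conf)

// First key-value RDD: employees keyed by department id
val employees = sc.parallelize(Seq((10, "Alice"), (20, "Bob"), (30, "Carol")))

// Second key-value RDD: department names keyed by the same department id
val departments = sc.parallelize(Seq((10, "Sales"), (20, "Engineering")))

// Inner join keeps only keys present in BOTH RDDs, so key 30 (Carol,
// who has no matching department) is dropped from the result.
// The result type is RDD[(Int, (String, String))].
val joined = employees.join(departments)

joined.collect().foreach(println)
// prints pairs such as (10,(Alice,Sales)) and (20,(Bob,Engineering))
```

Because `join` shuffles both RDDs by key across the network, pre-partitioning the larger RDD (or keying both RDDs with the same partitioner) can reduce the transfer cost the introduction warns about.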