Performing INNER JOIN operations using two key-value RDDs

Joining data is an important part of many of our pipelines, and both Spark Core and Spark SQL support the same fundamental types of joins. While joins are very common and powerful, they warrant special performance consideration, as they may require large network transfers or even create data sets beyond our capacity to handle.

The “inner” join is both the default and likely what you think of when you think of joining tables. It requires that the key be present in both tables.

Creating a new RDD

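As a sketch of this step (meant for spark-shell, where `sc` is the predefined SparkContext; the record type and sample values are hypothetical):

```scala
// Hypothetical record type and sample data for the first RDD.
case class Employee(name: String, dept: String)

// sc is the SparkContext that spark-shell provides.
val empRDD = sc.parallelize(Seq(
  Employee("alice", "engineering"),
  Employee("bob", "sales"),
  Employee("carol", "engineering")
))
```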

Using keyBy function

keyBy

Constructs two-component tuples (key-value pairs) by applying a function to each data item. The result of the function becomes the key, and the original data item becomes the value of the newly created tuple.

Listing Variants

def keyBy[K](f: T => K): RDD[(K, T)]

Create another RDD

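A second data set, keyed later by the same field, might look like this (again a spark-shell sketch with hypothetical names and values):

```scala
// Hypothetical second data set sharing the "name" field with the first RDD.
case class Salary(name: String, amount: Int)

val salRDD = sc.parallelize(Seq(
  Salary("alice", 100000),
  Salary("bob", 80000),
  Salary("dave", 90000)  // a name with no match: an inner join will drop it
))
```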

Use keyBy Transformation

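Applying `keyBy` turns a plain RDD into a key-value RDD: the function's result becomes the key and the original record stays as the value. A minimal spark-shell sketch (record type and data are hypothetical):

```scala
case class Employee(name: String, dept: String)
val empRDD = sc.parallelize(Seq(Employee("alice", "engineering")))

// keyBy(_.name) yields RDD[(String, Employee)]:
// the name is the key, the whole record is the value.
val empByName = empRDD.keyBy(_.name)
```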

JOIN using two key-value RDDs
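Putting the steps together, a full inner-join sketch for spark-shell (all record types, names, and values are hypothetical) looks like this. `join` on pair RDDs is an inner join: only keys present in both RDDs appear in the result.

```scala
// Hypothetical record types shared with the earlier steps.
case class Employee(name: String, dept: String)
case class Salary(name: String, amount: Int)

// Two key-value RDDs built with keyBy, keyed on the same field.
val empByName = sc.parallelize(Seq(
  Employee("alice", "engineering"),
  Employee("bob", "sales"),
  Employee("carol", "engineering")
)).keyBy(_.name)

val salByName = sc.parallelize(Seq(
  Salary("alice", 100000),
  Salary("bob", 80000),
  Salary("dave", 90000)
)).keyBy(_.name)

// Inner join: result type is RDD[(String, (Employee, Salary))].
// "carol" (left only) and "dave" (right only) are dropped.
val joined = empByName.join(salByName)
joined.collect().foreach(println)
```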
