sortByKey

This function sorts the input RDD's key-value data by key and stores the result in a new RDD. The output RDD is a shuffled RDD because producing it requires a shuffle: every record is sent to the partition responsible for its key range. The implementation of this function is actually very clever. First, it uses a range partitioner to split the data into contiguous key ranges, one per partition of the shuffled RDD. Then it sorts each of these ranges individually with mapPartitions using standard sort mechanisms. Because the ranges themselves are ordered, reading the partitions in sequence yields a globally sorted result.
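The two steps described above can be sketched with the public RDD API. This is a minimal illustration, not Spark's internal source: the object name `RangeSortSketch`, the helper `rangeSort`, and the sample data are all hypothetical, and the code assumes spark-core is on the classpath and that a local master is acceptable.

```scala
import org.apache.spark.{RangePartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RangeSortSketch {
  // Hypothetical helper: range-partition by key, then sort each partition locally.
  def rangeSort(pairs: RDD[(Int, String)], numPartitions: Int): RDD[(Int, String)] = {
    // Step 1: the range partitioner samples the keys and gives every partition
    // a contiguous key range; partitionBy performs the shuffle.
    val ranged = pairs.partitionBy(new RangePartitioner(numPartitions, pairs))
    // Step 2: sort each partition on its own with an ordinary in-memory sort.
    ranged.mapPartitions(
      iter => iter.toArray.sortBy(_._1).iterator,
      preservesPartitioning = true)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("range-sort-sketch")
    val sc = new SparkContext(conf)
    val pairs = sc.parallelize(Seq(5 -> "e", 1 -> "a", 4 -> "d", 2 -> "b", 3 -> "c"))
    // collect() returns partitions in index order, so the keys come back ascending.
    println(rangeSort(pairs, numPartitions = 2).collect().map(_._1).mkString(", "))
    sc.stop()
  }
}
```

In practice calling `sortByKey` directly is preferable; the sketch only makes the partition-then-sort structure visible.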

Create a New RDD

Parallelize an existing collection

Use the zip function

zip

Joins two RDDs by pairing the i-th element of one with the i-th element of the other. Both RDDs must have the same number of partitions and the same number of elements in each partition. The resulting RDD consists of two-component tuples, which are interpreted as key-value pairs by the methods provided by the PairRDDFunctions extension.

Listing Variants

def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)]
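A small sketch of `zip` in use may help here. The object name `ZipSketch`, the helper `zipped`, and the sample values are illustrative assumptions; the code assumes spark-core on the classpath and a local master. Both inputs are created with the same element count and the same number of slices so that the partitions line up.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ZipSketch {
  // Hypothetical helper: build two aligned RDDs and pair them element by element.
  def zipped(sc: SparkContext): RDD[(Int, String)] = {
    // Four elements in two slices on both sides, so partition sizes match.
    val keys = sc.parallelize(Seq(3, 1, 2, 4), numSlices = 2)
    val values = sc.parallelize(Seq("c", "a", "b", "d"), numSlices = 2)
    keys.zip(values) // i-th key paired with i-th value
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("zip-sketch")
    val sc = new SparkContext(conf)
    // Pairs keep the input order: (3,c), (1,a), (2,b), (4,d)
    zipped(sc).collect().foreach(println)
    sc.stop()
  }
}
```

Because the result is an RDD of tuples, pair-oriented methods such as `sortByKey` become available on it.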

Using sortByKey function
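The steps of this walkthrough (create an RDD from a collection, zip it with a second RDD, then sort by key) can be put together as follows. This is a sketch under assumptions: the object name `SortByKeySketch`, the helper `buildPairs`, and the animal names are made up for illustration, and the code assumes spark-core on the classpath with a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortByKeySketch {
  // Hypothetical helper: parallelize two collections and zip them into pairs.
  def buildPairs(sc: SparkContext): RDD[(String, Int)] = {
    val words = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
    val ids = sc.parallelize(1 to 5, 2) // same length and slice count as words
    words.zip(ids)                      // word is the key, id is the value
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("sort-by-key-sketch")
    val sc = new SparkContext(conf)
    val pairs = buildPairs(sc)

    // Ascending order is the default; passing true makes it explicit.
    pairs.sortByKey(true).collect().foreach(println)
    // Passing false sorts the keys in descending order.
    pairs.sortByKey(false).collect().foreach(println)

    sc.stop()
  }
}
```

The ascending call lists the pairs from `ant` to `owl`, and the descending call reverses that order; the values travel with their keys unchanged.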

