Spark RDD

Resilient Distributed Dataset

Resilient – if data in memory is lost, it can be recreated

Distributed – stored in memory across a cluster

Dataset – initial data can come from a file or be created programmatically

An RDD is the basic unit of data in Spark, on which all operations are performed. Simply put, it is an immutable distributed collection of objects. RDDs are intermediate results stored in memory, partitioned so they can be operated on by multiple nodes in the cluster. RDDs can be created in two ways: by loading an external dataset, or by distributing a collection of objects.

Creating RDDs

The simplest way to create an RDD is to take an existing collection in your program and pass it to SparkContext's parallelize() method.

parallelize() method in Scala

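A minimal sketch of both creation methods in the Scala spark-shell (which provides a SparkContext as `sc`); the collection values and the file path are illustrative:

```scala
// In spark-shell, a SparkContext is already available as `sc`.

// Way 1: distribute a local Scala collection with parallelize()
val data = Seq(1, 2, 3, 4, 5)
val rdd1 = sc.parallelize(data)

// Way 2: load an external dataset (path is a placeholder)
val lines = sc.textFile("input.txt")
```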

RDD Operations
RDDs support two types of operations:

Transformations
Actions

Transformations

A set of operations on an RDD that define how its data should be transformed

As in relational algebra, applying a transformation to an RDD yields a new RDD (because RDDs are immutable)

Transformations are lazily evaluated, which allows optimizations to take place before execution

Each transformation returns a new RDD.

Example:

map(func), flatMap(func), filter(func)

groupByKey()

reduceByKey(func), mapValues(func), distinct(), sortByKey()

join(other), union(other)

sample()
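A short sketch of chaining several of these transformations in Scala (assumes a spark-shell session where `sc` is available; the words are illustrative). Note that nothing executes until an action is called:

```scala
val words = sc.parallelize(Seq("spark", "rdd", "spark", "scala"))

// Transformations: each returns a new RDD; none of them runs yet
// because transformations are lazily evaluated.
val pairs  = words.map(word => (word, 1))                   // map(func)
val counts = pairs.reduceByKey(_ + _)                       // reduceByKey(func)
val kept   = counts.filter { case (w, _) => w != "rdd" }    // filter(func)

// Only when an action such as collect() is called does Spark build
// and run the execution plan:
kept.collect()  // e.g. Array((spark,2), (scala,1))
```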

Actions

Actions are the operations that return a final value to the driver program or write data to an external storage system.

Example:

reduce(func)

collect(), first(), take(n), foreach(func)

count(), countByKey()

saveAsTextFile()
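The actions above can be sketched as follows (again assuming a spark-shell session with `sc`; the output directory name is a placeholder):

```scala
val nums = sc.parallelize(Seq(1, 2, 3, 4))

nums.reduce(_ + _)   // returns 10 to the driver program
nums.count()         // returns 4
nums.take(2)         // returns the first two elements to the driver
nums.collect()       // brings the entire RDD to the driver; use with care on large data
nums.saveAsTextFile("output-dir")  // writes partitions to external storage
```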