Spark RDD
Resilient Distributed Dataset
Resilient – If data in memory is lost, it can be recreated
Distributed – Stored in memory across a cluster
Dataset – Initial data can come from a file or be created programmatically
An RDD is the basic unit of data in Spark, on which all operations are performed.
Simply put, an RDD is an immutable distributed collection of objects.
RDDs are intermediate results stored in memory, partitioned so they can be operated on by multiple nodes in the cluster.
RDDs can be created in two ways: by loading an external dataset, or by distributing a collection of objects.
Creating RDDs
The simplest way to create an RDD is to take an existing collection in your program and pass it to SparkContext's parallelize() method.
parallelize() method in Scala
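A minimal sketch of both creation paths, assuming a Spark application with spark-core on the classpath; the application name, master URL, and file path are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CreateRDDs {
  def main(args: Array[String]): Unit = {
    // Local master URL and app name are illustrative choices
    val conf = new SparkConf().setAppName("CreateRDDs").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // 1. Distribute an existing in-program collection with parallelize()
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // 2. Load an external dataset (the path is a placeholder)
    val lines = sc.textFile("data.txt")

    println(numbers.count()) // an action: triggers evaluation
    sc.stop()
  }
}
```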
RDD Operations
RDDs support two types of operations:
Transformations
Actions
Transformations
A set of operations on an RDD that define how it should be transformed.
As in relational algebra, applying a transformation to an RDD yields a new RDD (because RDDs are immutable).
Transformations are lazily evaluated, which allows optimizations to take place before execution.
Each transformation returns a new RDD.
Examples (see the sketch after this list):
map(func), flatMap(func), filter(func)
groupByKey()
reduceByKey(func), mapValues(func), distinct(), sortByKey()
join(other), union(other)
sample(withReplacement, fraction)
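A minimal sketch of chained transformations, assuming sc is an existing SparkContext; the sample data is illustrative:

```scala
val words = sc.parallelize(Seq("spark", "rdd", "spark", "scala"))

// Each transformation builds a new RDD; nothing runs yet (lazy evaluation)
val pairs  = words.map(word => (word, 1)) // map(func)
val counts = pairs.reduceByKey(_ + _)     // reduceByKey(func)
val sorted = counts.sortByKey()           // sortByKey()

// Only an action such as collect() forces the chain above to execute
println(sorted.collect().mkString(", "))
```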
Actions
Actions are the operations that return a final value to the driver program or write data to an external storage system.
Examples (see the sketch after this list):
reduce(func)
collect(), first(), take(n), foreach(func)
count(), countByKey()
saveAsTextFile(path)
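A minimal sketch of common actions, assuming sc is an existing SparkContext; the data and output directory are placeholders:

```scala
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Actions return a final value to the driver program...
println(nums.reduce(_ + _))          // reduce(func): 15
println(nums.count())                // count(): 5
println(nums.first())                // first(): 1
println(nums.take(3).mkString(",")) // take(n): 1,2,3

// ...or write data to an external storage system
// (writes one text file per partition under the given directory)
nums.saveAsTextFile("output-dir")
```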