Apache Spark:

Apache Spark is a general-purpose, lightning-fast cluster computing system. It provides high-level APIs in Java, Scala, Python and R. It is an engine for running Spark applications, and it can run workloads up to 100 times faster than Hadoop MapReduce when processing data in memory, and about 10 times faster when accessing data from disk.

Necessity of Apache Spark:

In the industry, everyone needed a general-purpose cluster computing tool, but the existing tools each handled only one kind of workload:
MapReduce (limited to batch processing).
Storm (limited to stream processing).
Impala (limited to interactive processing).
Neo4j (limited to graph processing).
Each of these handles a single type of processing only. Apache Spark, in contrast, provides real-time stream processing, interactive processing, graph processing, in-memory processing as well as batch processing, with very high speed, ease of use and a standard interface.

Components of Apache Spark:

Spark Core
Spark SQL
Spark Streaming
MLlib
GraphX

RDDs – Resilient Distributed Datasets:

An RDD is the fundamental unit of data in Spark: a distributed collection of elements spread across the cluster nodes, on which parallel operations can be performed.
RDDs are immutable, but a new RDD can be generated by transforming an existing RDD.
There are two ways to create RDDs (a quick sketch of both follows):
Parallelized Collections:
Created by invoking the parallelize method on an existing collection in the driver program.
External Datasets:
Created by calling the textFile method. This method takes a URI of the file and reads it as a collection of lines.
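
A minimal sketch of both approaches, assuming the Spark shell where sc is the pre-created SparkContext (the file path below is only a placeholder):

// Parallelized collection: distribute a local Scala collection across the cluster
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// External dataset: read a text file as an RDD of lines
// (the path below is only an example)
val lines = sc.textFile("/path/to/some/file.txt")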

Two types of Operations:

Transformation:
Creates a new RDD from an existing one. It passes the dataset through a function and returns a new dataset. Transformations are lazy: they are only evaluated when an action needs a result.
Some of the commonly used transformations are listed below (a short sketch follows the list):
map
flatMap
filter
mapPartitions
mapPartitionsWithIndex
union
distinct
intersection
groupBy
reduceByKey
aggregateByKey
sortByKey
join
coalesce
repartition
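
As a sketch of how a few of these transformations chain together (again assuming the Spark shell with sc available), note that nothing is computed yet, because transformations are lazy:

// Start from a small in-memory collection of words
val words = sc.parallelize(Seq("spark", "hadoop", "spark", "storm"))

// map: pair each word with an initial count of 1
val pairs = words.map(word => (word, 1))

// reduceByKey: sum the counts for each distinct word
val counts = pairs.reduceByKey(_ + _)

// filter: keep only the words that occur more than once
val frequent = counts.filter { case (_, n) => n > 1 }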
Action:
Returns a final result to the driver program, or writes it to an external data store. Calling an action triggers the execution of the accumulated transformations.
Some of the commonly used actions are listed below (a short sketch follows the list):
count()
collect()
reduce()
take(n)
first()
takeSample()
takeOrdered(n, [ordering])
countByKey()
foreach()
saveAsTextFile()
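
Continuing the sketch above, calling an action is what actually triggers the computation:

// count: number of (word, count) pairs in the RDD
val total = counts.count()

// collect: bring the results back to the driver as a local Array
val results = counts.collect()

// saveAsTextFile: write the RDD out to storage (the output path is only a placeholder)
counts.saveAsTextFile("/tmp/word-counts")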

Properties of RDDs:

Immutable
Partitioned
Coarse-grained operations
Actions / Transformations
Cacheable (see the short sketch below)
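
As a brief illustration of the last two properties (a sketch reusing the counts RDD from the sketch above): cache() marks an RDD to be kept in memory once it has been computed, and transformations never modify an RDD in place:

// cache() marks the RDD for in-memory storage; it is materialized on the first action
counts.cache()

// transforming counts produces a brand-new RDD; counts itself is left unchanged
val upperCased = counts.map { case (word, n) => (word.toUpperCase, n) }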

Example for RDDs:

Creation of new RDDs:

scala> case class Customer(name :String,product_name:String,price:Int,country:String,state:String)
defined class Customer

scala> val Customerrdd = sc.textFile("/home/geouser/Documents/Bigdata/Spark/sales.csv")
Customerrdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[17] at textFile at <console>:27

scala> Customerrdd.collect
res5: Array[String] = Array(carolina,Product1,1200,Basildon,England, Betina,Product1,1200,Parkville ,MO, Federica e Andrea,Product1,1200,Astoria ,OR, Gouya,Product1,1200,Echuca,Victoria, Gerd W ,Product2,3600,Cahaba Heights ,AL, carolina,Product1,1200,Mickleton ,NJ, Fleur,Product1,1200,Peoria ,IL, adam,Product1,1200,Martin ,TN, Renee Elisabeth,Product1,1200,Tel Aviv,Tel Aviv, Aidan,Product1,1200,Chatou,Ile-de-France)

scala> Customerrdd.count()
res7: Long = 10
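
The Customer case class defined above has not been used yet; as a minimal sketch (assuming every line of sales.csv is comma-separated and the field order matches the case class), the raw lines can be parsed into structured records:

// Parse each comma-separated line into a Customer record
val customers = Customerrdd.map { line =>
  val fields = line.split(",")
  Customer(fields(0), fields(1), fields(2).trim.toInt, fields(3), fields(4))
}

// For example, count how many rows are for Product1
val product1Count = customers.filter(_.product_name == "Product1").count()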

DataFrames in Apache Spark:

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a DataFrame in R/Python. Along with DataFrames, Spark also introduced the Catalyst optimizer, which leverages advanced programming-language features to build an extensible query optimizer.

DataFrame Features:

Distributed collection of Row objects
Data processing
Optimization using the Catalyst optimizer
Hive compatibility
Tungsten
Programming languages supported

Coding with DataFrames:

Creating a DataFrame from an existing RDD:

scala> val df = Customerrdd.toDF
df: org.apache.spark.sql.DataFrame = [_1: string]

scala> df.collect
res8: Array[org.apache.spark.sql.Row] = Array ([carolina,Product1,1200,Basildon,England], [Betina,Product1,1200,Parkville                   ,MO], [Federica e Andrea,Product1,1200,Astoria                     ,OR], [Gouya,Product1,1200,Echuca,Victoria], [Gerd W ,Product2,3600,Cahaba Heights              ,AL], [carolina,Product1,1200,Mickleton                   ,NJ], [Fleur,Product1,1200,Peoria                      ,IL], [adam,Product1,1200,Martin                      ,TN], [Renee Elisabeth,Product1,1200,Tel Aviv,Tel Aviv], [Aidan,Product1,1200,Chatou,Ile-de-France])

Applying transformation / action operations on a DataFrame:

scala> df.count
res12: Long = 10

scala> val df = Customerrdd.toDF("line")
df: org.apache.spark.sql.DataFrame = [line: string]

scala> df.take(3)
res29: Array[org.apache.spark.sql.Row] = Array([carolina,Product1,1200,Basildon,England], [Betina,Product1,1200,Parkville ,MO], [Federica e Andrea,Product1,1200,Astoria ,OR])

scala> df.collect
res15: Array[org.apache.spark.sql.Row] = Array([carolina,Product1,1200,Basildon,England], [Betina,Product1,1200,Parkville ,MO], [Federica e Andrea,Product1,1200,Astoria ,OR], [Gouya,Product1,1200,Echuca,Victoria], [Gerd W ,Product2,3600,Cahaba Heights ,AL], [carolina,Product1,1200,Mickleton ,NJ], [Fleur,Product1,1200,Peoria ,IL], [adam,Product1,1200,Martin ,TN], [Renee Elisabeth,Product1,1200,Tel Aviv,Tel Aviv], [Aidan,Product1,1200,Chatou,Ile-de-France])

scala> df.filter(col("line").like("%adam%")).count()
res27: Long = 1

scala> df.filter(col("line").like("%adam%")).collect
res28: Array[org.apache.spark.sql.Row] = Array([adam,Product1,1200,Martin ,TN])
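
Instead of a single line column, the parsed customers RDD from the earlier sketch can be turned into a DataFrame whose column names come from the Customer case class fields. A minimal sketch, assuming the Spark shell so the toDF implicits are in scope:

// Bring col into scope if it is not already imported
import org.apache.spark.sql.functions.col

// Convert the RDD of Customer objects into a DataFrame with named columns
val customerDF = customers.toDF()

// Column-based operations can now refer to the named fields directly
customerDF.filter(col("price") > 2000).show()
customerDF.groupBy("product_name").count().show()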