Apache Spark:

Apache Spark is a general-purpose, lightning-fast cluster computing system. It provides high-level APIs in Java, Scala, Python and R. Spark applications can run up to 100 times faster than Hadoop MapReduce when data is processed in memory, and around 10 times faster when data is accessed from disk.

Necessity of Apache Spark:

In the industry, there was a need for a general-purpose cluster computing tool, because each of the existing tools was limited to a single kind of processing:

MapReduce (limited to batch processing)

Storm (limited to stream processing)

Impala (limited to interactive processing)

Neo4j (limited to graph processing)

So each of these tools handles only a single type of processing. Apache Spark, in contrast, provides real-time stream processing, interactive processing, graph processing, in-memory processing as well as batch processing, with very fast speed, ease of use and a standard interface.

Components of Apache Spark:

Spark Core

Spark SQL

Spark Streaming

MLlib

GraphX

RDDs – Resilient Distributed Datasets:

An RDD is the fundamental unit of data in Spark: a distributed collection of elements spread across the nodes of the cluster, on which parallel operations can be performed.

RDDs are immutable, but a new RDD can be generated by transforming an existing one.

There are two ways to create RDDs:

Parallelized Collections:

It is created by invoking the parallelize method on an existing collection in the driver program.
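A minimal sketch, assuming it is run in spark-shell where sc is the pre-built SparkContext; the values and partition count are made up for illustration:

scala> val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5), 2)   // RDD[Int] distributed across 2 partitions

scala> numbers.reduce(_ + _)   // 15 -- sum computed in parallel across the partitions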

External Datasets:

It can be created by calling the textFile method. This method takes a URI of the file and reads it as a collection of lines.
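A brief sketch with a hypothetical local path (the sales.csv example further below uses the same method against a real file):

scala> val lines = sc.textFile("/tmp/sample.txt")   // hypothetical path; each RDD element is one line of the file

scala> lines.count()   // number of lines in the file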

Two types of Operations:

Transformation:

A transformation creates a new RDD from an existing one: it applies a function to the dataset and returns a new dataset.

Some of the commonly used transformations are listed below; a short sketch combining a few of them follows the list:

map

flatMap

filter

mapPartitions

mapPartitionsWithIndex

union

distinct

intersection

groupBy

reduceByKey

aggregateByKey

sortByKey

join

coalesce

repartition
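A minimal word-count sketch chaining a few of these transformations, assuming spark-shell and made-up input; transformations are lazy, so nothing executes until an action such as collect is called:

scala> val lines = sc.parallelize(Seq("spark is fast", "spark is easy"))

scala> val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)   // split into words, pair each with 1, sum per word

scala> wordCounts.collect()   // e.g. Array((spark,2), (is,2), (fast,1), (easy,1)) -- order may vary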

Action:

An action returns the final result to the driver program or writes it to an external data store.

Some of the commonly used actions are listed below; a short sketch follows the list:

count()

collect()

reduce()

take(n)

first()

takeSample()

takeOrdered(n, [ordering])

countByKey()

foreach()

saveAsTextFile()
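A minimal sketch of a few actions, again with made-up data in spark-shell; the commented results are easy to verify by hand:

scala> val numbers = sc.parallelize(Seq(5, 3, 1, 4, 2))

scala> numbers.count()          // 5 -- number of elements

scala> numbers.first()          // 5 -- first element

scala> numbers.take(3)          // Array(5, 3, 1) -- first three elements

scala> numbers.takeOrdered(3)   // Array(1, 2, 3) -- smallest three by natural ordering

scala> numbers.reduce(_ + _)    // 15 -- sum of all elements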

Properties of RDDs:

Immutable

Partitioned

Coarse-grained operations

Actions / Transformations

Cacheable (a short caching sketch follows this list)
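A short sketch of the Cacheable property, assuming spark-shell and a hypothetical input path: cache() only marks the RDD for in-memory storage, and the data is materialized by the first action that touches it.

scala> val lines = sc.textFile("/tmp/sample.txt")   // hypothetical path

scala> lines.cache()    // mark the RDD to be kept in memory; nothing is read yet

scala> lines.count()    // first action: reads the file and populates the cache

scala> lines.count()    // second action: served from the cached partitions, no re-read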

Example for RDDs:

Creation of new RDDs:

scala> case class Customer(name :String,product_name:String,price:Int,country:String,state:String)
defined class Customer

scala> val Customerrdd = sc.textFile("/home/geouser/Documents/Bigdata/Spark/sales.csv")
Customerrdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[17] at textFile at <console>:27

scala> Customerrdd.collect
res5: Array[String] = Array(carolina,Product1,1200,Basildon,England, Betina,Product1,1200,Parkville ,MO, Federica e Andrea,Product1,1200,Astoria ,OR, Gouya,Product1,1200,Echuca,Victoria, Gerd W ,Product2,3600,Cahaba Heights ,AL, carolina,Product1,1200,Mickleton ,NJ, Fleur,Product1,1200,Peoria ,IL, adam,Product1,1200,Martin ,TN, Renee Elisabeth,Product1,1200,Tel Aviv,Tel Aviv, Aidan,Product1,1200,Chatou,Ile-de-France)

scala> Customerrdd.count()
res7: Long = 10
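The Customer case class defined above is not used in this snippet. As a hedged sketch building on it, each line of the same RDD could be split into its five comma-separated fields and mapped to a Customer (assuming every row of sales.csv has exactly the fields shown in the output above):

scala> val customers = Customerrdd.map(line => { val f = line.split(","); Customer(f(0), f(1), f(2).toInt, f(3), f(4)) })

scala> customers.filter(_.price > 2000).count()   // 1 -- only the Product2 sale at 3600 among the ten rows shown above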

DataFrames in Apache Spark:

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or to an R/Python data frame. Along with the DataFrame API, Spark also introduced the Catalyst optimizer, which leverages advanced programming features to build an extensible query optimizer.

DataFrame Features:

Distributed collection of Row objects

Data processing

Optimization using the Catalyst optimizer

Hive compatibility

Tungsten

Supported in multiple programming languages

Coding with DataFrames:

Creating a DataFrame from existing RDDs:

scala> val df = Customerrdd.toDF
df: org.apache.spark.sql.DataFrame = [_1: string]

scala> df.collect
res8: Array[org.apache.spark.sql.Row] = Array ([carolina,Product1,1200,Basildon,England], [Betina,Product1,1200,Parkville                   ,MO], [Federica e Andrea,Product1,1200,Astoria                     ,OR], [Gouya,Product1,1200,Echuca,Victoria], [Gerd W ,Product2,3600,Cahaba Heights              ,AL], [carolina,Product1,1200,Mickleton                   ,NJ], [Fleur,Product1,1200,Peoria                      ,IL], [adam,Product1,1200,Martin                      ,TN], [Renee Elisabeth,Product1,1200,Tel Aviv,Tel Aviv], [Aidan,Product1,1200,Chatou,Ile-de-France])

Applying Transformations / Actions operations in DataFrame:

scala> df.count
res12: Long = 10

scala> val df = Customerrdd.toDF("line")
df: org.apache.spark.sql.DataFrame = [line: string]

scala> df.take(3)
res29: Array[org.apache.spark.sql.Row] = Array([carolina,Product1,1200,Basildon,England], [Betina,Product1,1200,Parkville ,MO], [Federica e Andrea,Product1,1200,Astoria ,OR])

scala> df.collect
res15: Array[org.apache.spark.sql.Row] = Array([carolina,Product1,1200,Basildon,England], [Betina,Product1,1200,Parkville ,MO], [Federica e Andrea,Product1,1200,Astoria ,OR], [Gouya,Product1,1200,Echuca,Victoria], [Gerd W ,Product2,3600,Cahaba Heights ,AL], [carolina,Product1,1200,Mickleton ,NJ], [Fleur,Product1,1200,Peoria ,IL], [adam,Product1,1200,Martin ,TN], [Renee Elisabeth,Product1,1200,Tel Aviv,Tel Aviv], [Aidan,Product1,1200,Chatou,Ile-de-France])

scala> df.filter(col("line").like("%adam%")).count()
res27: Long = 1

scala> df.filter(col("line").like("%adam%")).collect
res28: Array[org.apache.spark.sql.Row] = Array([adam,Product1,1200,Martin ,TN])
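In the calls above, toDF keeps each CSV record in a single string column. As a hedged sketch (not part of the original example), the same RDD can first be mapped to the Customer case class from the RDD section and then converted, giving a DataFrame with named, typed columns that Catalyst can optimize over:

scala> val customerDF = Customerrdd.map(line => { val f = line.split(","); Customer(f(0), f(1), f(2).toInt, f(3), f(4)) }).toDF()

scala> customerDF.printSchema()   // columns: name, product_name, price, country, state

scala> customerDF.filter(customerDF("price") > 2000).count()   // 1 -- same result as before, now expressed over a named column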
