Apache Spark Training in Chennai

Apache Spark Introduction

Apache Spark is a fast, general engine for large-scale data processing on a cluster. Originally developed at the AMPLab at UC Berkeley, it is now an open-source Apache project. Spark is a fast, general-purpose cluster computing system built around in-memory data processing: it enables applications in Hadoop clusters to run up to 100 times faster in memory, and up to 10 times faster even when running on disk.

Spark has several advantages over other big data and MapReduce technologies such as Hadoop and Storm. Spark supports SQL queries, streaming data, machine learning and graph data processing. Developers can use these capabilities on their own or combine them in a single data pipeline.
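As a hedged illustration (not part of the course material), the short spark-shell sketch below mixes the core and SQL capabilities in one pipeline. It assumes a modern spark-shell session where the SparkSession `spark` is predefined; the table name "numbers" is only an example.

    spark.range(1, 101).createOrReplaceTempView("numbers")           // register a simple dataset of 1..100 for SQL
    val evens = spark.sql("SELECT id FROM numbers WHERE id % 2 = 0") // query the same data with SQL
    evens.count()                                                    // both steps run as one job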

Features of Apache Spark

Spark takes MapReduce to the next level with less expensive shuffles during data processing

Optimizes arbitrary operator graphs

Supports lazy evaluation of big data queries, which helps optimize the overall data processing workflow (see the short sketch after this list)

Provides concise and consistent APIs in Scala, Java and Python

Offers an interactive shell for Scala and Python; this is not yet available for Java
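The following is a minimal sketch of lazy evaluation in the spark-shell (where the SparkContext `sc` is predefined); the values are only illustrative. Transformations such as map and filter merely record the operation, and nothing executes until an action such as count() is called:

    val nums    = sc.parallelize(1 to 1000000)   // distribute a range across the cluster
    val doubled = nums.map(_ * 2)                // lazy: nothing runs yet
    val small   = doubled.filter(_ < 10)         // still lazy
    small.count()                                // the whole chain runs now, as one optimized job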

Why Use Apache Spark

High level programming framework

Cluster computing

Distributed storage

Data in memory
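To illustrate the "data in memory" point above, here is a hedged spark-shell sketch (`sc` is the predefined SparkContext, and the HDFS path is purely hypothetical) that caches an RDD so later actions reuse the in-memory copy instead of re-reading storage:

    val lines  = sc.textFile("hdfs:///data/sample.txt")   // hypothetical input path
    val cached = lines.cache()                             // ask Spark to keep this RDD in memory
    cached.count()                                         // first action loads the data and caches it
    cached.filter(_.contains("ERROR")).count()             // served from memory, no second read from disk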

RDD

Resilient Distributed Dataset

Resilient – if data in memory is lost, it can be recreated

Distributed – stored in memory across the cluster

Dataset – the initial data can come from a file or be created programmatically

An RDD is the basic unit of data in Spark, on which all operations are performed. It is simply an immutable, distributed collection of objects. RDDs are intermediate results held in memory, partitioned so they can be operated on across multiple nodes in the cluster. RDDs can be created in two ways: by loading an external dataset, or by distributing a collection of objects (see the sketch below).
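As a short sketch in the spark-shell (`sc` predefined; the file path is only an example), the two creation paths look like this:

    val fromCollection = sc.parallelize(List(1, 2, 3, 4, 5))     // distribute an in-memory collection
    val fromFile       = sc.textFile("hdfs:///data/input.txt")   // load an external dataset (hypothetical path)
    fromCollection.count()                                       // an action that materializes the RDD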

Spark Shell Introduction

We will run Spark's interactive shell. From within the Spark directory: ./bin/spark-shell

Then, from the “scala>” REPL prompt, let's create some data: scala> val data = 1 to 50

Creation of RDD

Create an RDD based on that data: scala> val distData = sc.parallelize(data)

FILTER command

Then use a filter to select values less than 10: scala> distData.filter(_ < 10).collect()
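As a hedged follow-up in the same shell session: filter is a lazy transformation, and collect() is the action that returns the matching elements to the driver, so with data = 1 to 50 the call above returns the values 1 through 9. Transformations can also be chained before a final action, for example:

    scala> distData.filter(_ < 10).map(_ * 2).reduce(_ + _)   // doubles 1..9 and sums them, giving 90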