Apache Spark Introduction
Apache Spark is a fast, general-purpose engine for large-scale data processing on a cluster. Originally developed at the AMPLab at UC Berkeley, it is now an open-source Apache project. Spark processes data in memory and enables applications in Hadoop clusters to run up to 100 times faster in memory and up to 10 times faster when running on disk.
Spark has several advantages over other big data and MapReduce technologies such as Hadoop and Storm. It supports SQL queries, streaming data, machine learning, and graph processing. Developers can use these capabilities standalone or combine them in a single data pipeline.
Features of Apache Spark
Spark takes MapReduce to the next level with less expensive shuffles during data processing. It optimizes arbitrary operator graphs and supports lazy evaluation of big data queries, which helps optimize the overall data processing workflow. Spark provides concise, consistent APIs in Scala, Java, and Python, and offers an interactive shell for Scala and Python (not yet available for Java). A brief sketch of lazy evaluation follows.
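As a rough sketch of lazy evaluation in the Spark shell (the values and variable names below are illustrative, not from the original text): transformations such as filter and map only record the computation; nothing runs until an action such as count is called.

scala> val numbers = sc.parallelize(1 to 1000000)
scala> val evens = numbers.filter(_ % 2 == 0)       // lazy: no work done yet
scala> val squares = evens.map(n => n.toLong * n)   // still lazy
scala> squares.count()                              // action: triggers the actual computation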
Why use Apache Spark
High level programming framework
Cluster computing
Distributed storage
Data in memory
RDD
Resilient Distributed Dataset
Resilient – if data in memory is lost, it can be recreated
Distributed – stored in memory across a cluster
Dataset – the initial data can come from a file or be created programmatically
An RDD is the basic unit of data in Spark, on which all operations are performed. It is simply an immutable distributed collection of objects. RDDs hold intermediate results in memory and are partitioned so they can be operated on across multiple nodes in the cluster. RDDs can be created in two ways: by loading an external dataset, or by distributing a collection of objects, as sketched below.
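A minimal sketch of the two creation paths mentioned above (the file path is hypothetical):

scala> val fromCollection = sc.parallelize(List(1, 2, 3, 4, 5))   // distribute an in-memory collection of objects
scala> val fromFile = sc.textFile("data/sample.txt")              // load an external dataset (hypothetical path)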
Spark Shell Introduction
We will run Spark’s interactive shell. From within the Spark directory: ./bin/spark-shell
Then, from the “scala>” REPL prompt, let’s create some data: scala> val data = 1 to 50
Creation of RDD
Create an RDD based on that data: scala> val distData = sc.parallelize(data)
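parallelize also accepts an optional number of partitions, which controls how the data is split across the cluster. A brief sketch (the partition count of 4 is just an example):

scala> val distData4 = sc.parallelize(data, 4)   // request 4 partitions
scala> distData4.partitions.size                 // inspect how many partitions were created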
FILTER command
Then use a filter to select values less than 10: scala> distData.filter(_ < 10).collect()
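Putting the steps together, a short sketch of the same shell session; the output shown is what one would expect for data = 1 to 50:

scala> val data = 1 to 50
scala> val distData = sc.parallelize(data)
scala> distData.filter(_ < 10).collect()   // expected: Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> distData.filter(_ < 10).count()     // action counting the matching elements: 9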