Spark SQL Aggregate Functions in RDD:
Spark sql:
Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations.
RDD:
Resilient Distributed Datasets. Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
Aggregate Function:
collect, toArray
collectAsMap[Pair]
count
first
fold
foreach
max
mean [Double]
meanApprox [Double]
min
sum, sumApprox [Double]
top
collect, toArray
Converts the RDD into a Scala array and returns it. The second variant takes a partial function (f: T -> U): elements the function is not defined at are dropped, the rest are transformed, and the result is returned as a new RDD rather than an array.
Listing Variants
def collect(): Array[T]
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U]
def toArray(): Array[T]
Example
scala> val c = sc.parallelize(List("Fleur", "adam", "Aidan", "adam", "fleur", "Janet"), 2)
c: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[10] at parallelize at <console>:27
scala> c.collect
res4: Array[String] = Array(Fleur, adam, Aidan, adam, fleur, Janet)
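The partial-function variant of collect is not demonstrated above. As a sketch on the same RDD c, it filters and transforms in one pass (the lowercase-first-letter predicate here is purely illustrative):

```scala
// collect(PartialFunction) keeps only elements the function is
// defined at, maps them, and returns a new RDD (not an array).
val d = c.collect { case s if s.head.isLower => s.toUpperCase }
d.collect
// Array(ADAM, ADAM, FLEUR)
```

This is often clearer than chaining a filter followed by a map when the predicate and the transformation belong together.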
collectAsMap [Pair]
Similar to collect, but works on key-value RDDs and converts them into a Scala map, preserving the key-value structure. Note that duplicate keys are collapsed to a single entry, which is why the five-element RDD below yields a three-entry map.
Listing Variants
def collectAsMap(): Map[K, V]
Example
scala> val a = sc.parallelize(List("Fleur","adam","Aidan","adam","Fleur"),1)
scala> val b = a.zip(a)
scala> b.collectAsMap
res5: scala.collection.Map[String,String] = Map(Fleur -> Fleur, Aidan -> Aidan, adam -> adam)
count
Returns the number of items stored in an RDD.
Listing Variants
def count(): Long
Example
scala> val a = sc.parallelize(List("Fleur","adam","Aidan","adam","Fleur"),2)
scala> a.count
res2: Long = 5
first
Returns the first element of the RDD.
Listing Variants
def first(): T
Example
scala> val a = sc.parallelize(List("Fleur","adam","Aidan","adam","Fleur"),2)
scala> a.first
res3: String = Fleur
fold
Aggregates the values of each partition and then combines the per-partition results. The aggregation variable within each partition, as well as the final merge across partitions, is initialized with zeroValue.
Listing Variants
def fold(zeroValue: T)(op: (T, T) => T): T
Example
val a = sc.parallelize(List(1,2,3), 3)
a.fold(0)(_ + _)
res59: Int = 6
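Because zeroValue seeds each partition and the final merge, a non-zero zeroValue is counted numPartitions + 1 times under addition. A sketch on the same three-partition RDD:

```scala
// Each of the 3 partitions starts its sum at 1, and the merge of
// the three partial results starts at 1 again: 6 + 3*1 + 1 = 10.
a.fold(1)(_ + _)
// res: Int = 10
```

This is why zeroValue should be the identity element of op (0 for addition, 1 for multiplication) unless the extra contributions are intended.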
foreach
Executes a side-effecting function for each data item. On a cluster the function runs on the executors, so any println output goes to the executor logs; the console output below assumes local mode.
Listing Variants
def foreach(f: T => Unit)
Example
scala> val a = sc.parallelize(List("Fleur","adam","Aidan","adam","Fleur"),2)
scala> a.foreach(x => println(x + " is my friend"))
Fleur is my friend
adam is my friend
Aidan is my friend
adam is my friend
Fleur is my friend
max
Returns the largest element in the RDD, as determined by the implicit ordering of T.
Listing Variants
def max()(implicit ord: Ordering[T]): T
Example
scala> val a = sc.parallelize(List((25,"Fleur"),(28,"adam"),(24,"Aidan"),(28,"adam"),(25,"Fleur")))
scala> a.max
res9: (Int, String) = (28,adam)
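max (and min) also accept an explicit Ordering when the default one is not what you want; for (Int, String) pairs the default compares the Int first. A hypothetical variation on the example above ranks by the name instead, case-insensitively:

```scala
// Case-insensitively, "fleur" sorts after "adam" and "aidan",
// so ranking by the lower-cased name picks a different winner
// than the default Int-first tuple ordering.
a.max()(Ordering.by((t: (Int, String)) => t._2.toLowerCase))
// res: (Int, String) = (25,Fleur)
```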
mean [Double], meanApprox [Double]
Calls stats and extracts the mean component. The approximate version of the function can finish somewhat faster in some scenarios. However, it trades accuracy for speed.
Listing Variants
def mean(): Double
def meanApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble]
Example
val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)
a.mean
res0: Double = 5.3
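meanApprox is listed above but not demonstrated. A sketch on the same RDD a; since the result is a confidence interval that depends on how many tasks finish within the timeout, no exact output is shown:

```scala
// Returns within roughly 500 ms with a PartialResult wrapping a
// BoundedDouble: an estimated mean together with a 90% confidence
// interval around the exact value of 5.3.
val approx = a.meanApprox(500, 0.90)
approx.initialValue  // a BoundedDouble, i.e. an interval near 5.3
```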
min
Returns the smallest element in the RDD, as determined by the implicit ordering of T.
Listing Variants
def min()(implicit ord: Ordering[T]): T
Example
scala> val a = sc.parallelize(List((25,"Fleur"),(28,"adam"),(24,"Aidan"),(28,"adam"),(25,"Fleur")))
scala> a.min
res10: (Int, String) = (24,Aidan)
top
Uses the implicit ordering of T to determine the top num items and returns them, sorted in descending order, as an array.
Listing Variants
def top(num: Int)(implicit ord: Ordering[T]): Array[T]
Example
val c = sc.parallelize(Array(6, 9, 4, 7, 5, 8), 2)
c.top(2)
res28: Array[Int] = Array(9, 8)
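top returns the largest elements in descending order; its ascending counterpart is takeOrdered, which returns the smallest num elements. On the same RDD c:

```scala
// takeOrdered(num) returns the num smallest elements, ascending;
// it is the mirror image of top(num).
c.takeOrdered(2)
// res: Array[Int] = Array(4, 5)
```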
sum [Double], sumApprox [Double]
Computes the sum of all values contained in the RDD. The approximate version of the function can finish somewhat faster in some scenarios. However, it trades accuracy for speed.
Listing Variants
def sum(): Double
def sumApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble]
Example
val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.sum
res17: Double = 101.39999999999999
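The trailing ...99999 in the result is ordinary IEEE 754 rounding of decimal fractions such as 19.02, not a Spark artifact; summing the same values with plain Scala doubles lands equally close to, but typically not exactly on, 101.4:

```scala
// Decimal fractions like 19.02 have no exact binary representation,
// so the accumulated double sum drifts slightly from 101.4.
val s = List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0).sum
math.abs(s - 101.4) < 1e-9  // true: off by far less than a billionth
```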