Spark SQL and RDD Aggregate Functions:

Spark SQL:

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations.

RDD:

Resilient Distributed Datasets. Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster.

Aggregate Function:

collect, toArray

collectAsMap[Pair]

count

first

fold

foreach

max

mean [Double]

meanApprox [Double]

min

sum [Double]

sumApprox [Double]

top

collect, toArray

Converts the RDD into a Scala array and returns it. The variant that takes a partial function applies it only to the elements on which it is defined and returns a new RDD of the results, acting as a combined filter and map. toArray is an older alias for collect().
Listing Variants
def collect(): Array[T]
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U]
def toArray(): Array[T]
Example

scala> val c = sc.parallelize(List("Fleur", "adam", "Aidan", "adam", "fleur", "Janet"), 2)
c: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[10] at parallelize at <console>:27
scala> c.collect
res4: Array[String] = Array(Fleur, adam, Aidan, adam, fleur, Janet)
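The partial-function variant listed above is not demonstrated here; it behaves like a combined filter and map, applying the function only to the elements on which it is defined. A sketch using the same c RDD (the trailing collect materializes the resulting RDD), which yields Array(FLEUR, AIDAN, JANET):

scala> c.collect { case s if s.head.isUpper => s.toUpperCase }.collect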

collectAsMap [Pair]

Similar to collect, but works on key-value RDDs and converts them into a Scala map, preserving the key-value structure. Because a map holds only one value per key, duplicate keys are collapsed and later values overwrite earlier ones.
Listing Variants
def collectAsMap(): Map[K, V]
Example

scala> val a = sc.parallelize(List("Fleur","adam","Aidan","adam","Fleur"),1)
scala> val b = a.zip(a)
scala> b.collectAsMap
res5: scala.collection.Map[String,String] = Map(Fleur -> Fleur, Aidan -> Aidan, adam -> adam)
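A sketch with hypothetical data showing how duplicate keys collapse:

scala> val d = sc.parallelize(List(("adam", 1), ("adam", 2), ("Fleur", 3)), 1)
scala> d.collectAsMap

Here the later ("adam", 2) pair overwrites ("adam", 1), so the resulting map contains adam -> 2 and Fleur -> 3.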

count

Returns the number of items stored in an RDD.
Listing Variants
def count(): Long
Example

scala> val a = sc.parallelize(List("Fleur","adam","Aidan","adam","Fleur"),2)
scala> a.count
res2: Long = 5

first

Returns the very first data item of the RDD.
Listing Variants
def first(): T
Example

scala> val a = sc.parallelize(List("Fleur","adam","Aidan","adam","Fleur"),2)
scala> a.first
res3: String = Fleur

fold

Aggregates the values of each partition and then merges the per-partition results. The aggregation variable is initialized with zeroValue in each partition and once more in the final merge, so zeroValue should be the neutral element of op (e.g. 0 for addition, 1 for multiplication).
Listing Variants
def fold(zeroValue: T)(op: (T, T) => T): T
Example

val a = sc.parallelize(List(1,2,3), 3)
a.fold(0)(_ + _)
res59: Int = 6
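Because zeroValue is applied once in every partition and once more when the per-partition results are merged, a non-neutral zeroValue inflates the result. A sketch using the same three-partition a RDD:

a.fold(1)(_ + _)
res60: Int = 10

Each of the three partitions starts from 1, and the final merge starts from 1 as well, giving 6 + 3 + 1 = 10.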

foreach

Executes a side-effecting function for each data item. Note that on a cluster the output of println goes to the executors' standard output, not to the driver; the output below is what you see when running locally.
Listing Variants
def foreach(f: T => Unit)
Example

scala> val a = sc.parallelize(List("Fleur","adam","Aidan","adam","Fleur"),2)
scala> a.foreach(x => println(x + " is my friend"))
Fleur is my friend
adam is my friend
Aidan is my friend
adam is my friend
Fleur is my friend

max

Returns the largest element in the RDD.
Listing Variants
def max()(implicit ord: Ordering[T]): T
Example

scala> val a = sc.parallelize(List((25,"Fleur"),(28,"adam"),(24,"Aidan"),(28,"adam"),(25,"Fleur")))
scala> a.max
res9: (Int, String) = (28,adam)
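A custom Ordering can also be supplied explicitly to compare by something other than the tuples' natural order, for example by name instead of age. Since lowercase letters sort after uppercase in the default String ordering, this returns (28,adam):

scala> a.max()(Ordering.by[(Int, String), String](_._2))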

mean [Double], meanApprox [Double]

Calls stats and extracts the mean component. The approximate version of the function can finish somewhat faster in some scenarios. However, it trades accuracy for speed.
Listing Variants

def mean(): Double
def meanApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble]
Example

val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)
a.mean
res0: Double = 5.3
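meanApprox itself is not shown above; a hedged sketch of calling it on the same a RDD (the timeout is in milliseconds, and getFinalValue blocks until the result is complete):

val pr = a.meanApprox(timeout = 2000L, confidence = 0.9)
pr.getFinalValue()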

min

Returns the smallest element in the RDD.
Listing Variants
def min()(implicit ord: Ordering[T]): T
Example

scala> val a = sc.parallelize(List((25,"Fleur"),(28,"adam"),(24,"Aidan"),(28,"adam"),(25,"Fleur")))
scala> a.min
res10: (Int, String) = (24,Aidan)

top

Utilizes the implicit ordering of T to determine the top num items and returns them as an array.
Listing Variants
def top(num: Int)(implicit ord: Ordering[T]): Array[T]
Example

val c = sc.parallelize(Array(6, 9, 4, 7, 5, 8), 2)
c.top(2)
res28: Array[Int] = Array(9, 8)
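The counterpart of top is takeOrdered, which returns the num smallest elements under the same implicit ordering; a sketch using the same c RDD:

c.takeOrdered(2)
res29: Array[Int] = Array(4, 5)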

sum [Double], sumApprox [Double]

Computes the sum of all values contained in the RDD. The approximate version of the function can finish somewhat faster in some scenarios. However, it trades accuracy for speed.
Listing Variants
def sum(): Double
def sumApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble]
Example

val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.sum
res17: Double = 101.39999999999999
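Like meanApprox, sumApprox returns a PartialResult rather than a plain Double; a hedged sketch on the same x RDD:

val pr = x.sumApprox(timeout = 2000L, confidence = 0.9)
pr.getFinalValue()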