Spark DataFrame using Hive table
A DataFrame is a distributed collection of data organized into named columns. Conceptually, it is equivalent to a table in a relational database, but with richer optimizations under the hood. A DataFrame can be constructed from a wide array of sources, such as Hive tables, structured data files, external databases, or existing RDDs.
Introduced in Spark 1.3.
DataFrame = RDD + schema
DataFrame provides a domain-specific language for structured data manipulation.
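As a rough illustration of that domain-specific language, here is a minimal sketch meant to be pasted into spark-shell. It assumes the shell's predefined SparkContext `sc`; the sample rows and the names `people`, `adults`, and `countsByAge` are made up for this example and do not come from the original walkthrough.

```scala
// Paste into spark-shell, which provides the SparkContext as `sc`.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Hypothetical sample data, for illustration only.
val people = sqlContext
  .createDataFrame(Seq(("Alice", 34), ("Bob", 19), ("Carol", 45)))
  .toDF("name", "age")

// Filter and project with column expressions instead of SQL strings.
val adults = people.filter(people("age") > 21).select("name", "age")

// Group rows by a column and count each group.
val countsByAge = people.groupBy("age").count()

adults.show()
countsByAge.show()
```

The same operations could be written as SQL text, but the DSL form lets the Scala compiler catch misspelled method names before the job runs.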
Spark SQL also supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, they are not included in the default Spark build.
Hive support is exposed in the Spark library through HiveContext, which inherits from SQLContext. Using HiveContext, you can create and find tables in the Hive metastore and query them using HiveQL. Users who do not have an existing Hive deployment can still create a HiveContext.
Insert records into the table:
Retrieve records from the table:
Start the spark-shell:
$ spark-shell
Create SQLContext.
SQLContext is the class used to initialize the functionality of Spark SQL, and HiveContext extends it. A SparkContext object (sc) is required to initialize a HiveContext object.
Create a HiveContext
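A minimal sketch of this step, assuming a Spark 1.x build with Hive support and the `sc` that spark-shell provides. The variable name `hiveContext` is a choice made for this example. On Spark 2.0 and later, HiveContext is deprecated in favor of a SparkSession built with `enableHiveSupport()`.

```scala
// Paste into spark-shell, which provides the SparkContext as `sc`.
import org.apache.spark.sql.hive.HiveContext

// HiveContext extends SQLContext and adds HiveQL and metastore access.
val hiveContext = new HiveContext(sc)

// List the tables the metastore currently knows about.
hiveContext.sql("SHOW TABLES").show()
```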
Write a query that creates a DataFrame in Spark from the data stored in a Hive table:
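The query might look like the sketch below. The table name `employee` and its columns are hypothetical, since the original table definition is not shown here. The `CREATE TABLE IF NOT EXISTS` line is included only so the snippet runs on a fresh metastore; with an existing table it is a no-op.

```scala
// Paste into spark-shell, which provides the SparkContext as `sc`.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Hypothetical table; substitute the Hive table you created earlier.
hiveContext.sql("CREATE TABLE IF NOT EXISTS employee (id INT, name STRING, age INT)")

// sql() runs the HiveQL statement and returns the result as a DataFrame.
val hiveDf = hiveContext.sql("SELECT id, name, age FROM employee")
```

Nothing is read from Hive at this point. The DataFrame is evaluated lazily, so the table is only scanned when an action such as `show()` or `count()` is called.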
To see the contents of the DataFrame hiveDf:
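One way to inspect the DataFrame, sketched under the same assumptions: spark-shell supplies `sc`, and `employee` is a hypothetical stand-in for the Hive table queried in the previous step. The first three statements only re-create `hiveDf` so that the snippet runs on its own.

```scala
// Paste into spark-shell, which provides the SparkContext as `sc`.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.sql("CREATE TABLE IF NOT EXISTS employee (id INT, name STRING, age INT)")
val hiveDf = hiveContext.sql("SELECT * FROM employee")

// Print the column names and types inferred from the Hive table.
hiveDf.printSchema()

// Print the first rows as a text table (up to 20 rows by default).
hiveDf.show()
```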