Spark DataFrame using Hive table

A DataFrame is a distributed collection of data organized into named columns. Conceptually, it is equivalent to a relational table with good optimization techniques. A DataFrame can be constructed from many different sources, such as Hive tables, structured data files, external databases, or existing RDDs.

DataFrames were introduced in Spark 1.3. In short: DataFrame = RDD + schema.
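The "RDD + schema" idea can be sketched in the spark-shell: an RDD of case-class objects becomes a DataFrame once Spark attaches the schema inferred from the case class. The `Employee` class and sample data are illustrative, not from the original post.

```scala
// Spark 1.x spark-shell sketch; sc is the SparkContext the shell provides.
case class Employee(id: Int, name: String)

val rdd = sc.parallelize(Seq(Employee(1, "Alice"), Employee(2, "Bob")))

// toDF() (via the implicits import) pairs the RDD with the schema inferred
// from the case class: DataFrame = RDD + schema.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val df = rdd.toDF()

df.printSchema()  // shows the named columns: id (int), name (string)
```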

DataFrame provides a domain-specific language for structured data manipulation.
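A minimal sketch of that DSL, assuming a small hypothetical `employees` DataFrame: instead of writing SQL strings, you chain typed operations such as `select`, `filter`, and `groupBy`.

```scala
// Hypothetical data used only to illustrate the DataFrame DSL.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val employees = sc.parallelize(
  Seq((1, "Alice", 50000), (2, "Bob", 45000))
).toDF("id", "name", "salary")

// Structured manipulation without SQL strings:
employees.select("name", "salary").show()
employees.filter(employees("salary") > 46000).show()
employees.groupBy("name").count().show()
```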

Spark SQL also supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, those dependencies are not included in the default Spark distribution; when they can be found on the classpath, Spark loads them automatically.

Hive support comes bundled with the Spark library as HiveContext, which inherits from SQLContext. Using a HiveContext, you can create and look up tables in the Hive metastore and write queries against them using HiveQL. Users who do not have an existing Hive deployment can still create a HiveContext.


In this walkthrough we will:

- Start the spark-shell and create SQLContext and HiveContext objects
- Create a Hive table and insert records into it
- Retrieve records from the table into a Spark DataFrame
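The create/insert/retrieve flow can be sketched end to end with HiveQL statements issued through a HiveContext. The `src` table and the `kv1.txt` data file are the sample table and file shipped with the Spark distribution's examples, not names from the original post.

```scala
// Spark 1.x spark-shell sketch; sc is provided by the shell.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Create a Hive table in the metastore.
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")

// Insert records by loading the sample data file from the Spark distribution.
hiveContext.sql(
  "LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Retrieve the records from the table as a DataFrame.
val records = hiveContext.sql("SELECT key, value FROM src")
records.show()
```

Each step is expanded below.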


Start the spark-shell:

$ spark-shell

Create a SQLContext.

SQLContext is a class used for initializing the functionalities of Spark SQL. The SparkContext class object (sc) is required for initializing a SQLContext (or HiveContext) class object.
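In the spark-shell, `sc` already exists, so creating the SQLContext is one line:

```scala
// sc is the SparkContext created automatically by the spark-shell.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
```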

Create a HiveContext
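A HiveContext is constructed the same way, from the existing SparkContext:

```scala
// HiveContext inherits from SQLContext and adds HiveQL and metastore support.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
```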


Write a query to create a DataFrame in Spark that reads data stored in a Hive table.
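Calling `sql` on the HiveContext runs a HiveQL query and returns its result as a DataFrame. The table name `src` is illustrative (the sample table from the Spark examples), not necessarily the table used in the original post.

```scala
// Query a Hive table; the result is a DataFrame, not an executed action yet.
val hiveDf = hiveContext.sql("SELECT * FROM src")
```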


To see the result of the DataFrame hiveDf, call its show() method.
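`show()` is an action, so it triggers the query and prints the first rows to the console:

```scala
hiveDf.printSchema()  // column names and types from the Hive table
hiveDf.show()         // prints up to 20 rows by default
```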




