Frequently asked Hive Interview questions
Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It sits on top of Hadoop to summarize Big Data and makes querying and analysis easy. It is an open-source project under the Apache Software Foundation: a data warehouse software ecosystem in Hadoop that manages vast amounts of structured data using the HQL language, which is similar to SQL.
Hive is a tool in the Hadoop ecosystem that provides an interface to organize and query data in a database-like fashion and to write SQL-like queries. It is suitable for accessing and analyzing data in Hadoop using SQL syntax.
Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as an open-source project under the name Apache Hive.
Hive is commonly used at Facebook for analytical purposes, and Amazon uses it in Amazon Elastic MapReduce, among others.
Hive doesn’t support updates at the record level; to update, it integrates with HBase. Versions above Hive 0.13 support UPDATE and DELETE, but with some limitations:
1. They can only be performed on tables that support ACID.
2. If a table is to be used for ACID writes (INSERT, UPDATE, DELETE), the table property “transactional” must be set on that table.
3. Only the ORC file format is supported.
4. Tables must be bucketed to make use of these features.
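The limitations above can be seen together in a short DDL sketch; the table and column names below are hypothetical, and the transaction-manager settings assume a Hive 0.14+ deployment:

```sql
-- Assumed client settings for ACID support (Hive 0.14+).
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Bucketed, ORC-backed, transactional table (names are examples).
CREATE TABLE employee_acid (
  id INT,
  name STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Record-level writes are now allowed.
UPDATE employee_acid SET name = 'venu' WHERE id = 1;
DELETE FROM employee_acid WHERE id = 2;
```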
Use Hive for data warehouse applications where you are working with static rather than dynamic data, where high latency (response time) is acceptable, where a large data set is maintained and mined for insights and reports, and where queries are used instead of scripting.
Hive doesn’t provide OLTP transaction support, only OLAP. If the application requires OLTP, switch to a NoSQL database. HQL queries have higher latency due to MapReduce.
Hive provides partitions to simplify data processing and buckets to sample the data, sort it quickly, and simplify the MapReduce process. Partitions and buckets can segment large data sets to improve query performance in Hive, so it is highly recommended for structured data.
Hive stores and processes data on top of Hadoop, but it is possible to run it over other data storage systems such as Amazon S3, GPFS (IBM), and the MapR file system.
By default Hive doesn’t support record-level update, insert, and delete, but recent versions (Hive 0.14 and later) support insert, update, and delete operations, so Hive supports ACID transactions.
Hive provides no additional capabilities beyond MapReduce. The programs are executed as MapReduce jobs via the interpreter. The interpreter runs on a client machine and turns HiveQL queries into MapReduce jobs; the framework then submits those jobs to the cluster.
The metastore is a relational database storing the metadata of Hive tables, partitions, Hive databases, etc. It is the central repository of Hive and allows metadata to be stored in an external database. By default Hive stores metadata in a Derby database, but you can store it in MySQL, Oracle, etc., depending on the project.
The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data in a Metastore.
When we run a Hive query, it first creates a local metastore. Before creating it, Hive checks whether a metastore already exists; if one is present an error is shown, otherwise the process goes on. This configuration is set in hive-site.xml like this:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
There are three types of metastore configurations:
1) Embedded metastore
This is the default mode: when Hive runs a query it first enters embedded mode. In the command line, all operations are done in embedded mode, accessing the Hive libraries locally. In the embedded metastore configuration, the Hive driver, the metastore interface, and the database all use the same JVM. It’s good for development and testing.
2) Local metastore
In the local metastore configuration, the metastore stores data in an external database such as MySQL. The Hive driver and the metastore run in the same JVM but communicate remotely with the external database. For better protection, credentials are required in the local metastore.
3) Remote metastore.
In the remote metastore configuration, queries run over a Thrift server. The Hive driver and the metastore interface run in different JVMs, so for better protection the required credentials are isolated from Hive users.
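A remote metastore is typically pointed at from hive-site.xml; a minimal sketch, where the host name is a placeholder and 9083 is the conventional metastore port:

```xml
<!-- Client-side hive-site.xml fragment; host name is hypothetical. -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>
```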
Yes, Hive uses the SerDe interface for I/O operations. Different SerDe implementations can read and write different types of data. Plain text can be processed directly, but for other data formats in Hadoop, Hive uses the appropriate SerDe interface to process the data. Examples:
MetadataTypedColumnsetSerDe : used to read/write CSV-format data.
JsonSerDe : processes JSON data.
RegexSerDe : processes weblog data.
AvroSerDe : processes Avro-format data.
A SerDe is a short name for a Serializer Deserializer. Hive uses SerDe (and FileFormat) to read and write data from tables. An important concept behind Hive is that it DOES NOT own the Hadoop File System (HDFS) format that data is stored in.
Users are able to write files to HDFS with whatever tools/mechanisms they like (“CREATE EXTERNAL TABLE” or “LOAD DATA INPATH”) and use Hive to correctly “parse” that file format in a way that can be used by Hive.
A SerDe is a powerful (and customizable) mechanism that Hive uses to “parse” data stored in HDFS to be used by Hive.
By default Hive uses LazySimpleSerDe; it also provides JsonSerDe and the frequently used RegexSerDe to serialize and deserialize data.
In Hive, SerDe stands for Serialization and Deserialization. Usually when reading data, the user first communicates with the InputFormat, which connects to a RecordReader to read the records. The data in a record is stored in serialized (binary) format.
When reading a row, a custom SerDe uses an ObjectInspector to deserialize the data into fields; the user then sees the data in those fields, which are delivered to the end user.
To insert data into a table, Hive creates a Java object. To transfer Java objects over the network, the data must be serialized. Each field is serialized using the ObjectInspector, and finally the serialized data is stored in the Hive table.
In most cases, users want to write a Deserializer instead of a full SerDe, because they just want to read their own data format rather than write it.
The RegexDeserializer deserializes data using the configuration parameter ‘regex’ and possibly a list of column names. If your SerDe supports DDL (basically, a SerDe with parameterized columns and column types), you probably want to implement a protocol based on DynamicSerDe instead of writing a SerDe from scratch.
The reason is that the framework passes DDL to the SerDe in “thrift DDL” format, and it is non-trivial to write a “thrift DDL” parser.
Depending on the nature of the data, the built-in SerDes may not satisfy its format, so users need to write their own Java code to satisfy their data format requirements.
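As a sketch of how a SerDe is attached to a table, the following uses the built-in RegexSerDe; the table name and regular expression are illustrative:

```sql
-- Parse two-field log lines with RegexSerDe (name/regex are examples).
CREATE TABLE apache_log (
  host    STRING,
  request STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) (.*)"
)
STORED AS TEXTFILE;
```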
The Hive Web Interface (HWI) is an alternative to the command-line interface; it is a simple graphical interface. HWI lets you start at the database level directly: you can see all SerDes, column names, and types, which simplifies the Hive steps. It is a session-based interface, so you can run multiple Hive queries simultaneously. There is no local metastore mode in HWI.
The following classes are used by Hive to read and write HDFS files:
-TextInputFormat/HiveIgnoreKeyTextOutputFormat : these two classes read/write data in plain text file format.
-SequenceFileInputFormat/SequenceFileOutputFormat : these two classes read/write data in Hadoop SequenceFile format.
This component implements the processing framework for converting SQL into a graph of map/reduce jobs, and the execution-time framework to run those jobs in dependency order.
1. In CLI mode (using the command-line interface).
2. Using JDBC or ODBC.
3. By calling the Hive Thrift client, which allows Java, PHP, Python, Ruby, and C++ programs to send commands to Hive.
Typically a database is designed for OLTP transactional operations, whereas a data warehouse is implemented for OLAP (analysis) operations. OLTP can be constrained to a single application; OLAP sits as a layer on top of several databases. OLTP processes current, streaming, dynamic data, whereas OLAP processes retired, historic, static data. A database fully follows the normalization concept; a data warehouse follows de-normalization.
There are two types: managed tables and external tables. In a managed table both the data and the schema are under Hive’s control, but in an external table only the schema is under Hive’s control.
External table: the schema is stored in the metastore, while the actual data stays in its original location outside Hive’s warehouse. If an external table is dropped, only the metadata is lost, not the actual data.
Internal (managed) table: the metastore entry and the actual data are both under Hive’s control. If a managed table is dropped, both the actual data and the metadata are lost.
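The difference shows up at DROP time; a minimal sketch, where the table names and the HDFS path are hypothetical:

```sql
-- Managed table: data lives under Hive's warehouse directory.
CREATE TABLE managed_logs (line STRING);

-- External table: data stays at the given HDFS path.
CREATE EXTERNAL TABLE external_logs (line STRING)
LOCATION '/data/raw/logs';

DROP TABLE managed_logs;    -- removes metadata AND the data files
DROP TABLE external_logs;   -- removes only the metadata; files remain
```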
Generally, after creating a table in SQL, we insert data using the INSERT statement. In Hive, we can insert data using the LOAD DATA statement; when inserting bulk records into Hive, LOAD DATA is the better choice.
There are two ways to load data: from the local file system and from the Hadoop file system.
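Both variants can be sketched as follows; the file paths and table name are hypothetical:

```sql
-- From the local file system (note the LOCAL keyword).
LOAD DATA LOCAL INPATH '/tmp/users.csv' INTO TABLE user_table;

-- From HDFS (the file is moved into the table's location).
LOAD DATA INPATH '/data/users.csv' INTO TABLE user_table;
```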
To see a table definition in Hive, use the command describe <tablename>;
To see more detailed information about the table, use the command describe extended <tablename>;
Another important command, which also describes all the details in a clean layout, is describe formatted <tablename>;
ALTER TABLE changes the attributes of a table, such as renaming the table, renaming columns, adding columns, and deleting or replacing columns.
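A few illustrative ALTER TABLE statements; the table and column names are hypothetical:

```sql
-- Rename the table.
ALTER TABLE user_table RENAME TO customer_table;

-- Add a new column.
ALTER TABLE customer_table ADD COLUMNS (email STRING);

-- Rename a column (old name, new name, type).
ALTER TABLE customer_table CHANGE cell phone STRING;
```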
Using drop table command.
DROP TABLE <TABLE_NAME>;
When you drop a managed table from the Hive metastore, it removes both the table data and the metadata. Dropping an external table removes only the metadata; the underlying files are left in place.
The ORDER BY clause sorts the result set by one or more columns in ascending or descending order.
ORDER BY – uses a single reducer. If the data is too large, it takes a long time to sort.
SORT BY – uses multiple reducers, so it can process the data quickly.
In the first case (ORDER BY), the maps send every value to the single reducer, which sorts them all. In the second case (SORT BY), the maps split the values across many reducers, and each reducer generates its own sorted list, so it can sort quickly.
SELECT name, id, cell FROM user_table ORDER BY id, name;
SELECT name, id, cell FROM user_table DISTRIBUTE BY id SORT BY name;
The GROUP BY clause is used to group all the records in a result set by a particular column. It is used to query a group of records.
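A minimal sketch, using a hypothetical employee table:

```sql
-- Count employees per department (table/column names are examples).
SELECT dept, COUNT(*) AS headcount
FROM employee
GROUP BY dept;
```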
LIKE : finds substrings within a string using simple wildcard patterns such as %.
RLIKE : a special operator that matches a string against a Java regular expression and returns true or false.
Suppose the table name is table and the column is name:
name = VenuKatragadda, venkatesh, venkateswarlu
SELECT * FROM table WHERE name LIKE 'venu%'; // VenuKatragadda
SELECT * FROM table WHERE name RLIKE 'venk.*'; // false, true, true
Hive organizes data in three ways: tables, partitions, and buckets.
-Tables are organized using primitive column types as well as arrays and maps.
-Partitions have one or more partition keys based on project requirements.
-Buckets are used to analyze the data for sampling purposes; it is a good approach to process a pinch of the data in buckets instead of processing all of it.
Hive organizes tables into partitions. It is a way of dividing a table into related parts based on the values of partitioned columns such as date, city, and department. Using partition, it is easy to query a portion of the data.
To analyze a particular set of data, it is not necessary to load the entire data set; reading only the desired partition is a good approach. To achieve this goal, Hive allows the data to be partitioned based on a particular column.
Static partitioning and dynamic partitioning can both optimize Hive performance.
For instance, if information for a particular year is required, partition by year.
By pruning data during the query, partitioning minimizes query time. The partition is created when the data is inserted into the table.
Static partitioning inserts data into a partition named explicitly in the statement, whereas dynamic partitioning derives the partition values from a column of the data. In strict mode, at least one static partition is required before any (static, dynamic) partitions can be created.
If you are partitioning large data sets, in a sort of ETL flow, dynamic partitioning is recommended.
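Both styles can be sketched side by side; the table and column names below are hypothetical:

```sql
-- Assumed settings for dynamic partitioning.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (yr INT);

-- Static: the partition value is fixed in the statement.
INSERT INTO TABLE sales PARTITION (yr = 2023)
SELECT id, amount FROM staging_sales WHERE sale_year = 2023;

-- Dynamic: partition values come from the last selected column.
INSERT INTO TABLE sales PARTITION (yr)
SELECT id, amount, sale_year FROM staging_sales;
```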
Tables or partitions are sub-divided into buckets to provide extra structure to the data, which can be used for more efficient querying. Bucketing works on the value of a hash function of some column of the table.
Processing many chunks of files to analyze a vast amount of data can blow up processing time. Bucketing is a sampling concept: it lets you analyze a subset of the data, selected by a hashing algorithm.
set hive.enforce.bucketing=true; enables bucketing.
The main aim of both partitioning and bucketing is to execute queries more efficiently. When you create a table, the partition slices are fixed.
Bucketing follows a hash algorithm: based on the number of buckets, rows are distributed into buckets, which enables sampling of the data.
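A bucketed table and a sampling query can be sketched as follows; the table name, bucket count, and columns are hypothetical:

```sql
SET hive.enforce.bucketing = true;

-- Rows are hashed on id into 8 buckets.
CREATE TABLE user_bucketed (id INT, name STRING)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC;

-- Sample roughly 1/8th of the data by reading one bucket.
SELECT * FROM user_bucketed TABLESAMPLE (BUCKET 1 OUT OF 8 ON id);
```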
1. Relational Operators – These operators are used to compare two operands.
=, !=, <=, <, >=, >, IS NULL, IS NOT NULL, LIKE, RLIKE, REGEXP
2. Arithmetic Operators – These operators support various common arithmetic operations on the operands. All of them return number types.
+, -, *, /, %, &, |, ^, ~
3. Logical Operators – These operators are logical expressions. All of them return either TRUE or FALSE.
4. Complex Operators – These operators provide an expression to access the elements of Complex Types.
Views are generated based on user requirements. You can save any result-set data as a view. The usage of a view in Hive is the same as that of a view in SQL; it is a standard RDBMS concept. Note, however, that Hive views are read-only: you can query them, but record-level DML against a view is not supported.
An Index is nothing but a pointer on a particular column of a table. Creating an index means creating a pointer on a particular column of a table.
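Both concepts can be sketched in DDL; the view, index, table, and column names below are hypothetical:

```sql
-- A view over a filtered query.
CREATE VIEW active_users AS
SELECT id, name FROM user_table WHERE status = 'active';

-- A compact index on one column (pre-Hive-3.0 feature).
CREATE INDEX idx_user_id ON TABLE user_table (id)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
```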
Hive can store data in a highly efficient manner in the Optimized Row Columnar (ORC) file format, which eases many Hive file-format limitations. Using ORC files can improve performance when reading, writing, and processing data.
Enable this format by creating the table with STORED AS ORC, like this:
CREATE TABLE orc_table (
  name STRING
)
STORED AS ORC;
Vectorization is a query optimization technique. Instead of processing one row at a time, vectorization processes a batch of rows as a unit; consequently it can improve query performance.
The data must be stored in ORC format to use vectorization. It is disabled by default; enable it by running:
set hive.vectorized.execution.enabled = true;
UI: This interface sits between the user and the Driver; it accepts queries from the user and passes them to the Driver.
Two types of interface are available in Hive: the command-line interface and the GUI. Hadoop also provides a Thrift interface and JDBC/ODBC for integrating other applications.
Driver: It manages the life cycle of HiveQL queries. The Driver receives queries from the user interface, or over the ODBC/JDBC interfaces, and processes them.
The Driver creates a separate session to handle each query.
Compiler: The compiler accepts the plan from the Driver and gets the required metadata from the MetaStore in order to compile the plan.
MetaStore: Hive stores metadata in tables here. Information about the data, whether in internal or external tables, is kept in the MetaStore; the Hive compiler gets this metadata from the MetaStore.
Execution Engine: The Driver hands the compiled plan to the execution engine.
Here, the execution engine runs the queries as MapReduce jobs.
JobTracker: Based on the required information, Hive queries run as MapReduce jobs to process the data.
Sometimes a Hadoop developer takes an array as input and needs to convert it into separate table rows. To achieve this, Hive provides explode, which acts as an interpreter to convert complex data types into the desired table format.
SELECT explode (arrayName) AS newCol FROM TableName;
SELECT explode(map) AS newCol1, NewCol2 From TableName;
Hive uses the ObjectInspector to analyze the internal structure of rows and columns, including complex objects. It also gives us ways to access the internal fields inside an object. It processes not only common data types like INT, BIGINT, and STRING, but also complex data types like arrays, maps, structs, and unions.
No. The Hadoop framework can split the data file, but not the job. The chunks of data are stored in blocks, and each split needs a map to process it. A job, on the other hand, is a configurable unit that controls execution of the plan/logic. A job is not a physical data set to be split; it is a logical configuration API that processes those splits.
There are two ways to show the current database: one temporary, in the CLI, and one persistent.
1) In the CLI, just enter this command: set hive.cli.print.current.db=true;
2) In hive-site.xml, add this property:
<property>
  <name>hive.cli.print.current.db</name>
  <value>true</value>
</property>
In the second scenario, the Hive database name is displayed automatically whenever you open a terminal.
Yes, you can override the Hive map and reduce steps in the Hive conf settings; Hive allows Hadoop configuration variables to be overridden.
Thrift is a cross-language RPC framework which generates code and combines a software stack to finally execute the Thrift code on a remote server. The Thrift compiler acts as an interpreter between server and client. The Thrift server allows a remote client to submit requests to Hive using different programming languages such as Python, Ruby, and Scala.
JDBC driver: A JDBC driver is a software component enabling a Java application to interact with a database.
ODBC driver: ODBC accomplishes DBMS independence by using an ODBC driver as a translation layer between the application and the DBMS.
To write a single-line comment we use -- followed by the comment text.
Eg: --It is an important step.
Hive doesn’t support multi-line comments at present.
-Hive allows most SQL queries, but HBase does not allow SQL queries directly.
-Hive doesn’t support record-level update, insert, and delete operations on a table, but HBase can do them.
-Hive is a data warehouse framework, whereas HBase is a NoSQL database.
-Hive runs on top of MapReduce, HBase runs on top of HDFS.
| Hive | Pig |
|---|---|
| 1. Hadoop must be running to use Hive. | 1. Hadoop need not be running; Pig can run in standalone mode or cluster mode, though Hadoop must be installed. |
| 2. Joins: if you have a limited number of joins and filters, go ahead with Hive. | 2. Pig is highly recommended when you have a huge number of joins and filters. |
| 3. Data: Hive supports only structured data, so it is most often used in data warehouses. | 3. Pig can process both structured and unstructured data, so it is best suited to streaming data. |
| 4. UDFs: supports User Defined Functions, but they are much harder to debug. | 4. Very easy to write a UDF to calculate metrics. |
| 5. Tables must be created manually to store intermediate data. | 5. No need to create tables. |
| 6. Hive stores metadata in a database: Derby (by default), MySQL, or Oracle. | 6. Pig has no metadata support. |
| 7. Hive uses a separate query language called HQL that goes beyond standard SQL. | 7. Pig uses its own language, Pig Latin, a relational data-flow language. |
| 8. Best suited to analysts, especially big data analysts familiar with SQL; most often used to generate reports and statistics. | 8. Best suited to programmers and software developers familiar with scripting languages like Python and Java. |
| 9. Hive can operate an optional Thrift-based server and operates on the server side of the cluster. | 9. Pig operates on the client side of the cluster; there is no server-side concept. |
| 10. Hive executes quickly, but does not load quickly. | 10. Pig loads the data effectively and quickly. |
| 11. Hive must be carefully configured for cluster or pseudo mode. | 11. Pig is installed via shell interaction, so no other configuration is required; just extract the tar file. |