Frequently asked Hive Interview questions
Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It sits on top of Hadoop to summarize Big Data and makes querying and analysis easy. It is an open-source project under the Apache Software Foundation: a data warehouse software ecosystem in Hadoop that manages vast amounts of structured data using the HQL language, which is similar to SQL.
Hive is a tool in the Hadoop ecosystem that provides an interface to organize and query data in a database-like fashion and to write SQL-like queries. It is suitable for accessing and analyzing data in Hadoop using SQL syntax.
Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as an open-source project under the name Apache Hive.
Hive is commonly used at Facebook for analytical purposes, and Amazon uses it in Amazon Elastic MapReduce, among others.
Hive doesn’t support updates at the record level; to update, it integrates with HBase. Versions above Hive 0.13 support UPDATE and DELETE, but with some limitations:
1. They can only be performed on tables that support ACID.
2. If a table is to be used for ACID writes (INSERT, UPDATE, DELETE), the table property “transactional” must be set on that table.
3. Only the ORC file format is supported.
4. Tables must be bucketed to make use of these features.
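The limitations above can be seen together in a short DDL sketch; the table and column names below are hypothetical, and the transaction-manager settings assume a Hive 0.14+ deployment:

```sql
-- Assumed client settings for ACID support (Hive 0.14+).
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Bucketed, ORC-backed, transactional table (names are examples).
CREATE TABLE employee_acid (
  id INT,
  name STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Record-level writes are now allowed.
UPDATE employee_acid SET name = 'venu' WHERE id = 1;
DELETE FROM employee_acid WHERE id = 2;
```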
Use Hive for data warehouse applications where you are working with static rather than dynamic data, where high latency (response time) is acceptable, where a large data set is maintained and mined for insights and reports, and where queries are used instead of scripting.
Hive doesn’t provide OLTP transaction support, only OLAP. If the application requires OLTP, switch to a NoSQL database. HQL queries have higher latency due to MapReduce.
Hive provides partitions to simplify data processing and buckets to sample the data, sort it quickly, and simplify the MapReduce process. Partitions and buckets can segment large data sets to improve query performance in Hive, so it is highly recommended for structured data.
Hive stores and processes data on top of Hadoop, but it is possible to run it over other data storage systems such as Amazon S3, GPFS (IBM), and the MapR file system.
By default Hive doesn’t support record-level update, insert, and delete, but recent versions (Hive 0.14 and later) support insert, update, and delete operations, so Hive supports ACID transactions.
Hive provides no additional capabilities beyond MapReduce. The programs are executed as MapReduce jobs via the interpreter. The interpreter runs on a client machine and turns HiveQL queries into MapReduce jobs; the framework then submits those jobs to the cluster.
The metastore is a relational database storing the metadata of Hive tables, partitions, Hive databases, etc. It is the central repository of Hive and allows metadata to be stored in an external database. By default Hive stores metadata in a Derby database, but you can store it in MySQL, Oracle, etc., depending on the project.
The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data in a Metastore.
When we run a Hive query, it first creates a local metastore. Before creating it, Hive checks whether a metastore already exists; if one is present an error is shown, otherwise the process goes on. This configuration is set in hive-site.xml like this:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
There are three types of metastore configurations:
1) Embedded metastore
This is the default mode: when Hive runs a query it first enters embedded mode. In the command line, all operations are done in embedded mode, accessing the Hive libraries locally. In the embedded metastore configuration, the Hive driver, the metastore interface, and the database all use the same JVM. It’s good for development and testing.
2) Local metastore
In the local metastore configuration, the metastore stores data in an external database such as MySQL. The Hive driver and the metastore run in the same JVM but communicate remotely with the external database. For better protection, credentials are required in the local metastore.
3) Remote metastore.
In the remote metastore configuration, queries run over a Thrift server. The Hive driver and the metastore interface run in different JVMs, so for better protection the required credentials are isolated from Hive users.
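A remote metastore is typically pointed at from hive-site.xml; a minimal sketch, where the host name is a placeholder and 9083 is the conventional metastore port:

```xml
<!-- Client-side hive-site.xml fragment; host name is hypothetical. -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>
```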
Yes, Hive uses the SerDe interface for I/O operations. Different SerDe implementations can read and write different types of data. Plain text can be processed directly, but for other data formats in Hadoop, Hive uses the appropriate SerDe interface to process the data. Examples:
MetadataTypedColumnsetSerDe : used to read/write CSV-format data.
JsonSerDe : processes JSON data.
RegexSerDe : processes weblog data.
AvroSerDe : processes Avro-format data.
A SerDe is a short name for a Serializer Deserializer. Hive uses SerDe (and FileFormat) to read and write data from tables. An important concept behind Hive is that it DOES NOT own the Hadoop File System (HDFS) format that data is stored in.
Users are able to write files to HDFS with whatever tools/mechanisms they like (“CREATE EXTERNAL TABLE” or “LOAD DATA INPATH”) and use Hive to correctly “parse” that file format in a way that can be used by Hive.
A SerDe is a powerful (and customizable) mechanism that Hive uses to “parse” data stored in HDFS to be used by Hive.
By default Hive uses LazySimpleSerDe; it also provides JsonSerDe and the frequently used RegexSerDe to serialize and deserialize data.
In Hive, SerDe stands for Serialization and Deserialization. Usually when reading data, the user first communicates with the InputFormat, which connects to a RecordReader to read the records. The data in a record is stored in serialized (binary) format.
When reading a row, a custom SerDe uses an ObjectInspector to deserialize the data into fields; the user then sees the data in those fields, which are delivered to the end user.
To insert data into a table, Hive creates a Java object. To transfer Java objects over the network, the data must be serialized. Each field is serialized using the ObjectInspector, and finally the serialized data is stored in the Hive table.
In most cases, users want to write a Deserializer instead of a full SerDe, because they just want to read their own data format rather than write it.
The RegexDeserializer deserializes data using the configuration parameter ‘regex’ and possibly a list of column names. If your SerDe supports DDL (basically, a SerDe with parameterized columns and column types), you probably want to implement a protocol based on DynamicSerDe instead of writing a SerDe from scratch.
The reason is that the framework passes DDL to the SerDe in “thrift DDL” format, and it is non-trivial to write a “thrift DDL” parser.
Depending on the nature of the data, the built-in SerDes may not satisfy its format, so users need to write their own Java code to satisfy their data format requirements.
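As a sketch of how a SerDe is attached to a table, the following uses the built-in RegexSerDe; the table name and regular expression are illustrative:

```sql
-- Parse two-field log lines with RegexSerDe (name/regex are examples).
CREATE TABLE apache_log (
  host    STRING,
  request STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) (.*)"
)
STORED AS TEXTFILE;
```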
The Hive Web Interface (HWI) is an alternative to the command-line interface; it is a simple graphical interface. HWI lets you start at the database level directly: you can see all SerDes, column names, and types, which simplifies the Hive steps. It is a session-based interface, so you can run multiple Hive queries simultaneously. There is no local metastore mode in HWI.
The following classes are used by Hive to read and write HDFS files:
-TextInputFormat/HiveIgnoreKeyTextOutputFormat : these two classes read/write data in plain text file format.
-SequenceFileInputFormat/SequenceFileOutputFormat : these two classes read/write data in Hadoop SequenceFile format.
This component implements the processing framework for converting SQL into a graph of map/reduce jobs, and the execution-time framework to run those jobs in dependency order.
1. In CLI mode (using the command-line interface).
2. Using JDBC or ODBC.
3. By calling the Hive Thrift client, which allows Java, PHP, Python, Ruby, and C++ programs to send commands to Hive.
Typically a database is designed for OLTP transactional operations, whereas a data warehouse is implemented for OLAP (analysis) operations. OLTP can be constrained to a single application; OLAP sits as a layer on top of several databases. OLTP processes current, streaming, dynamic data, whereas OLAP processes retired, historic, static data. A database fully follows the normalization concept; a data warehouse follows de-normalization.
There are two types: managed tables and external tables. In a managed table both the data and the schema are under Hive’s control, but in an external table only the schema is under Hive’s control.
External table: the schema is stored in the metastore, while the actual data stays in its original location outside Hive’s warehouse. If an external table is dropped, only the metadata is lost, not the actual data.
Internal (managed) table: the metastore entry and the actual data are both under Hive’s control. If a managed table is dropped, both the actual data and the metadata are lost.
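The difference shows up at DROP time; a minimal sketch, where the table names and the HDFS path are hypothetical:

```sql
-- Managed table: data lives under Hive's warehouse directory.
CREATE TABLE managed_logs (line STRING);

-- External table: data stays at the given HDFS path.
CREATE EXTERNAL TABLE external_logs (line STRING)
LOCATION '/data/raw/logs';

DROP TABLE managed_logs;    -- removes metadata AND the data files
DROP TABLE external_logs;   -- removes only the metadata; files remain
```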
Generally, after creating a table in SQL, we insert data using the INSERT statement. In Hive, we can insert data using the LOAD DATA statement; when inserting bulk records into Hive, LOAD DATA is the better choice.
There are two ways to load data: from the local file system and from the Hadoop file system.
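Both variants can be sketched as follows; the file paths and table name are hypothetical:

```sql
-- From the local file system (note the LOCAL keyword).
LOAD DATA LOCAL INPATH '/tmp/users.csv' INTO TABLE user_table;

-- From HDFS (the file is moved into the table's location).
LOAD DATA INPATH '/data/users.csv' INTO TABLE user_table;
```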
To see a table definition in Hive, use the command describe <tablename>;
To see more detailed information about the table, use the command describe extended <tablename>;
Another important command, which also describes all the details in a clean layout, is describe formatted <tablename>;
ALTER TABLE changes the attributes of a table, such as renaming the table, renaming columns, adding columns, and deleting or replacing columns.
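A few illustrative ALTER TABLE statements; the table and column names are hypothetical:

```sql
-- Rename the table.
ALTER TABLE user_table RENAME TO customer_table;

-- Add a new column.
ALTER TABLE customer_table ADD COLUMNS (email STRING);

-- Rename a column (old name, new name, type).
ALTER TABLE customer_table CHANGE cell phone STRING;
```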
Using drop table command.
DROP TABLE <TABLE_NAME>;
When you drop a managed table from the Hive metastore, it removes both the table data and the metadata. Dropping an external table removes only the metadata; the underlying files are left in place.
The ORDER BY clause sorts the result set by one or more columns in ascending or descending order.
ORDER BY – uses a single reducer. If the data is too large, it takes a long time to sort.
SORT BY – uses multiple reducers, so it can process the data quickly.
In the first case (ORDER BY), the maps send every value to the single reducer, which sorts them all. In the second case (SORT BY), the maps split the values across many reducers, and each reducer generates its own sorted list, so it can sort quickly.
SELECT name, id, cell FROM user_table ORDER BY id, name;
SELECT name, id, cell FROM user_table DISTRIBUTE BY id SORT BY name;
The GROUP BY clause is used to group all the records in a result set by a particular column. It is used to query a group of records.
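A minimal sketch, using a hypothetical employee table:

```sql
-- Count employees per department (table/column names are examples).
SELECT dept, COUNT(*) AS headcount
FROM employee
GROUP BY dept;
```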
LIKE : finds substrings within a string using simple wildcard patterns such as %.
RLIKE : a special operator that matches a string against a Java regular expression and returns true or false.
Suppose the table name is table and the column is name:
name = VenuKatragadda, venkatesh, venkateswarlu
SELECT * FROM table WHERE name LIKE 'venu%'; // VenuKatragadda
SELECT * FROM table WHERE name RLIKE 'venk.*'; // false, true, true
Hive organizes data in three ways: tables, partitions, and buckets.
-Tables are organized using primitive column types as well as arrays and maps.
-Partitions have one or more partition keys based on project requirements.
-Buckets are used to analyze the data for sampling purposes; it is a good approach to process a pinch of the data in buckets instead of processing all of it.
Hive organizes tables into partitions. It is a way of dividing a table into related parts based on the values of partitioned columns such as date, city, and department. Using partition, it is easy to query a portion of the data.
To analyze a particular set of data, it is not necessary to load the entire data set; reading only the desired partition is a good approach. To achieve this goal, Hive allows the data to be partitioned based on a particular column.
Static partitioning and dynamic partitioning can both optimize Hive performance.
For instance, if information for a particular year is required, partition by year.
By pruning data during the query, partitioning minimizes query time. The partition is created when the data is inserted into the table.
Static partitioning inserts data into a partition named explicitly in the statement, whereas dynamic partitioning derives the partition values from a column of the data. In strict mode, at least one static partition is required before any (static, dynamic) partitions can be created.
If you are partitioning large data sets, in a sort of ETL flow, dynamic partitioning is recommended.
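Both styles can be sketched side by side; the table and column names below are hypothetical:

```sql
-- Assumed settings for dynamic partitioning.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (yr INT);

-- Static: the partition value is fixed in the statement.
INSERT INTO TABLE sales PARTITION (yr = 2023)
SELECT id, amount FROM staging_sales WHERE sale_year = 2023;

-- Dynamic: partition values come from the last selected column.
INSERT INTO TABLE sales PARTITION (yr)
SELECT id, amount, sale_year FROM staging_sales;
```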
Tables or partitions are sub-divided into buckets to provide extra structure to the data, which can be used for more efficient querying. Bucketing works on the value of a hash function of some column of the table.
Processing many chunks of files to analyze a vast amount of data can blow up processing time. Bucketing is a sampling concept: it lets you analyze a subset of the data, selected by a hashing algorithm.
set hive.enforce.bucketing=true; enables bucketing.
The main aim of both partitioning and bucketing is to execute queries more efficiently. When you create a table, the partition slices are fixed.
Bucketing follows a hash algorithm: based on the number of buckets, rows are distributed into buckets, which enables sampling of the data.
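A bucketed table and a sampling query can be sketched as follows; the table name, bucket count, and columns are hypothetical:

```sql
SET hive.enforce.bucketing = true;

-- Rows are hashed on id into 8 buckets.
CREATE TABLE user_bucketed (id INT, name STRING)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC;

-- Sample roughly 1/8th of the data by reading one bucket.
SELECT * FROM user_bucketed TABLESAMPLE (BUCKET 1 OUT OF 8 ON id);
```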
1. Relational Operators – These operators are used to compare two operands.
=, !=, <=, <, >=, >, IS NULL, IS NOT NULL, LIKE, RLIKE, REGEXP
2. Arithmetic Operators – These operators support various common arithmetic operations on the operands. All of them return number types.
+, -, *, /, %, &, |, ^, ~
3. Logical Operators – These operators are logical expressions. All of them return either TRUE or FALSE.
4. Complex Operators – These operators provide an expression to access the elements of Complex Types.
Views are generated based on user requirements. You can save any result-set data as a view. The usage of a view in Hive is the same as that of a view in SQL; it is a standard RDBMS concept. Note, however, that Hive views are read-only: you can query them, but record-level DML against a view is not supported.
An Index is nothing but a pointer on a particular column of a table. Creating an index means creating a pointer on a particular column of a table.
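Both concepts can be sketched in DDL; the view, index, table, and column names below are hypothetical:

```sql
-- A view over a filtered query.
CREATE VIEW active_users AS
SELECT id, name FROM user_table WHERE status = 'active';

-- A compact index on one column (pre-Hive-3.0 feature).
CREATE INDEX idx_user_id ON TABLE user_table (id)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
```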
Hive can store data in a highly efficient manner in the Optimized Row Columnar (ORC) file format, which eases many Hive file-format limitations. Using ORC files can improve performance when reading, writing, and processing data.
Enable this format by creating the table with STORED AS ORC, like this:
CREATE TABLE orc_table (
  name STRING
)
STORED AS ORC;
Vectorization is a query optimization technique. Instead of processing one row at a time, vectorization processes a batch of rows as a unit; consequently it can improve query performance.
The data must be stored in ORC format to use vectorization. It is disabled by default; enable it by running:
set hive.vectorized.execution.enabled = true;
UI: This interface sits between the user and the Driver; it accepts queries from the user and passes them to the Driver.
Two types of interface are available in Hive: the command-line interface and the GUI. Hadoop also provides a Thrift interface and JDBC/ODBC for integrating other applications.
Driver: It manages the life cycle of HiveQL queries. The Driver receives queries from the user interface, or over the ODBC/JDBC interfaces, and processes them.
The Driver creates a separate session to handle each query.
Compiler: The compiler accepts the plan from the Driver and gets the required metadata from the MetaStore in order to compile the plan.
MetaStore: Hive stores metadata in tables here. Information about the data, whether in internal or external tables, is kept in the MetaStore; the Hive compiler gets this metadata from the MetaStore.
Execution Engine: The Driver hands the compiled plan to the execution engine.
Here, the execution engine runs the queries as MapReduce jobs.
JobTracker: Based on the required information, Hive queries run as MapReduce jobs to process the data.
Sometimes a Hadoop developer takes an array as input and needs to convert it into separate table rows. To achieve this, Hive provides explode, which acts as an interpreter to convert complex data types into the desired table format.
SELECT explode (arrayName) AS newCol FROM TableName;
SELECT explode(map) AS newCol1, NewCol2 From TableName;
Hive uses the ObjectInspector to analyze the internal structure of rows and columns, including complex objects. It also gives us ways to access the internal fields inside an object. It processes not only common data types like INT, BIGINT, and STRING, but also complex data types like arrays, maps, structs, and unions.
No. The Hadoop framework can split the data file, but not the job. The chunks of data are stored in blocks, and each split needs a map to process it. A job, on the other hand, is a configurable unit that controls execution of the plan/logic. A job is not a physical data set to be split; it is a logical configuration API that processes those splits.
There are two ways to show the current database: one temporary, in the CLI, and one persistent.
1) In the CLI, just enter this command: set hive.cli.print.current.db=true;
2) In hive-site.xml, add this property:
<property>
  <name>hive.cli.print.current.db</name>
  <value>true</value>
</property>
In the second scenario, the Hive database name is displayed automatically whenever you open a terminal.
Yes, you can override the Hive map and reduce steps in the Hive conf settings; Hive allows Hadoop configuration variables to be overridden.
Thrift is a cross-language RPC framework which generates code and combines a software stack to finally execute the Thrift code on a remote server. The Thrift compiler acts as an interpreter between server and client. The Thrift server allows a remote client to submit requests to Hive using different programming languages such as Python, Ruby, and Scala.
JDBC driver: A JDBC driver is a software component enabling a Java application to interact with a database.
ODBC driver: ODBC accomplishes DBMS independence by using an ODBC driver as a translation layer between the application and the DBMS.
To write a single-line comment we use -- followed by the comment text.
Eg: --It is an important step.
Hive doesn’t support multi-line comments at present.
-Hive allows most SQL queries, but HBase does not allow SQL queries directly.
-Hive doesn’t support record-level update, insert, and delete operations on a table, but HBase can do them.
-Hive is a data warehouse framework, whereas HBase is a NoSQL database.
-Hive runs on top of MapReduce, HBase runs on top of HDFS.
| Hive | Pig |
|---|---|
| 1. Hadoop must be running to use Hive. | 1. Hadoop need not be running; Pig can run in standalone mode or cluster mode, though Hadoop must be installed. |
| 2. Joins: if you have a limited number of joins and filters, go ahead with Hive. | 2. Pig is highly recommended when you have a huge number of joins and filters. |
| 3. Data: Hive supports only structured data, so it is most often used in data warehouses. | 3. Pig can process both structured and unstructured data, so it is best suited to streaming data. |
| 4. UDFs: supports User Defined Functions, but they are much harder to debug. | 4. Very easy to write a UDF to calculate metrics. |
| 5. Tables must be created manually to store intermediate data. | 5. No need to create tables. |
| 6. Hive stores metadata in a database: Derby (by default), MySQL, or Oracle. | 6. Pig has no metadata support. |
| 7. Hive uses a separate query language called HQL that goes beyond standard SQL. | 7. Pig uses its own language, Pig Latin, a relational data-flow language. |
| 8. Best suited to analysts, especially big data analysts familiar with SQL; most often used to generate reports and statistics. | 8. Best suited to programmers and software developers familiar with scripting languages like Python and Java. |
| 9. Hive can operate an optional Thrift-based server and operates on the server side of the cluster. | 9. Pig operates on the client side of the cluster; there is no server-side concept. |
| 10. Hive executes quickly, but does not load quickly. | 10. Pig loads the data effectively and quickly. |
| 11. Hive must be carefully configured for cluster or pseudo mode. | 11. Pig is installed via shell interaction, so no other configuration is required; just extract the tar file. |