HBase Interview Questions
1. What is NoSQL?
Apache HBase is a type of “NoSQL” database. “NoSQL” is a general term meaning that the database isn’t an RDBMS which supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, HBase is really more a “Data Store” than “Data Base” because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages, etc.
2. Explain what HBase is.
HBase is a column-oriented database management system that runs on top of HDFS (Hadoop Distributed File System). HBase is not a relational data store, and it does not support a structured query language such as SQL.
In HBase, a master node coordinates the cluster, while region servers store portions of the tables and perform the work on the data.
3. Why use HBase?
High-capacity storage system
Distributed design that supports very large tables
Column-oriented store
Horizontally scalable
High performance and availability
HBase is designed for tables with billions of rows, millions of columns, and thousands of versions
Unlike HDFS (Hadoop Distributed File System), it supports random, real-time CRUD operations
4. What is Apache HBase?
Apache HBase began as a sub-project of Apache Hadoop and serves as the Hadoop database: a distributed, scalable big data store. Use Apache HBase when you need random, real-time read/write access to your big data. Its goal is to host very large tables (billions of rows by millions of columns) on clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google’s Bigtable, and it provides Bigtable-like capabilities on top of Hadoop and HDFS.
5. What is the history of HBase?
2006: BigTable paper published by Google. 2006 (end of year): HBase development starts. 2008: HBase becomes Hadoop sub-project. 2010: HBase becomes Apache top-level project.
6. What are the main features of Apache HBase?
Apache HBase supports both linear and modular scaling. HBase tables are distributed across the cluster as regions, and regions are automatically split and redistributed as your data grows (automatic sharding). HBase also supports a block cache and Bloom filters for high-volume query optimization.
7. What are the key components of HBase?
ZooKeeper: Coordinates between clients and the HBase Master
HBase Master: Monitors the RegionServers
RegionServer: Serves and manages the regions assigned to it
Region: Contains an in-memory data store (MemStore) and HFiles
Catalog tables: Consist of -ROOT- and .META.
8. When should we use HBase?
Use HBase when a table has millions or billions of rows and columns; for a table with only thousands of rows and columns, an RDBMS is usually the better choice. An RDBMS typically runs on a single database server, whereas HBase is distributed, scalable, and runs on commodity hardware. Features such as typed columns, secondary indexes, transactions, and advanced query languages are provided by an RDBMS, not by HBase.
9. Is there any difference between the HBase data model and the RDBMS data model?
In HBase, data is stored in tables that have rows and columns, which resembles an RDBMS, but the analogy is not a helpful one. Instead, it can be helpful to think of an HBase table as a multi-dimensional map.
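One way to picture the multi-dimensional map is as nested dictionaries keyed by row key, column family, column qualifier, and timestamp. The sketch below is plain Python, not HBase client code; the table contents and the helper name `latest` are made up for illustration.

```python
# Illustrative model only:
# row key -> column family -> column qualifier -> {timestamp: value}
table = {
    "row1": {
        "info": {
            "name": {1700000002: "Alice B.", 1700000001: "Alice"},
            "city": {1700000001: "Paris"},
        },
    },
    "row2": {
        "info": {"name": {1700000003: "Bob"}},
    },
}


def latest(table, row, family, qualifier):
    """Return the value with the highest timestamp for one cell, or None."""
    versions = table.get(row, {}).get(family, {}).get(qualifier, {})
    if not versions:
        return None
    return versions[max(versions)]


print(latest(table, "row1", "info", "name"))  # Alice B.
print(latest(table, "row2", "info", "city"))  # None
```

Note that rows do not need to share the same set of columns: "row2" simply has no "city" entry, which is how sparse tables fit this model.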
10. What key terms are used when designing an HBase data model?
1) Table (an HBase table consists of multiple rows)
2) Row (a row consists of a row key and one or more columns with values associated with them)
3) Column (a column consists of a column family and a column qualifier, delimited by a : (colon) character)
4) Column family (a set of columns and their values; column families should be chosen carefully during schema design)
5) Column qualifier (added to a column family to provide the index for a given piece of data)
6) Cell (a combination of row, column family, and column qualifier; it contains a value and a timestamp, which represents the value’s version)
7) Timestamp (by default, the time on the RegionServer when the data was written; you can specify a different timestamp value when you put data into the cell)
11. What are the data model operations in HBase?
Get (returns attributes for a specified row; Gets are executed via HTable.get)
Put (either adds new rows to a table, if the key is new, or updates existing rows, if the key already exists; Puts are executed via HTable.put (writeBuffer) or HTable.batch (non-writeBuffer))
Scan (allows iteration over multiple rows for specified attributes)
Delete (removes a row from a table; Deletes are executed via HTable.delete)
HBase does not modify data in place, and so deletes are handled by creating new markers called tombstones. These tombstones, along with the dead values, are cleaned up on major compaction.
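12. How to connect to HBase?
Clients can connect to HBase through the HBase shell, which is a command-line interface built on JRuby, or programmatically through the Java client API. REST and Thrift gateways are also available.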
13. What is the role of the Master server in HBase?
The Master server assigns regions to region servers and handles load balancing in the cluster.
14. What is HMaster?
HMaster is the implementation of the Master server. It is responsible for monitoring all RegionServer instances in the cluster and is the interface for all metadata changes. In a distributed cluster, it typically runs on the NameNode.
15. What is HRegionServer in HBase?
HRegionServer is the RegionServer implementation. It is responsible for serving and managing regions. In a distributed cluster, a RegionServer runs on a DataNode.
16. What is the role of ZooKeeper in HBase?
ZooKeeper maintains configuration information, provides distributed synchronization, and maintains communication between clients and region servers.
17. What is a namespace in HBase?
A namespace is a logical grouping of tables. It is similar to a database in a relational database system.
18. When do we need to disable a table in HBase?
In HBase, a table is disabled so that it can be modified or its settings changed. While a table is disabled, it cannot be accessed through the scan command.
19. Give a command to check if a table is disabled.
hbase> is_disabled 'tablename'
20. What does the following command do?
hbase> disable_all 'p.*'
It disables all tables whose names match the regular expression, that is, all tables whose names start with p.
21. What are catalog tables in HBase?
The catalog tables in HBase maintain the metadata information. They are named -ROOT- and .META.; the -ROOT- table stores the location of the .META. table, and the .META. table holds information about all regions and their locations. Since HBase 0.96, -ROOT- has been removed and .META. has been renamed hbase:meta.
22. Is HBase a scale-out or scale-up system?
HBase runs on top of Hadoop, which is a distributed system. Hadoop scales out as required by adding more machines on the fly, so HBase is a scale-out system.
23. What are the steps when a client writes to HBase?
In HBase, the client does not write directly into an HFile. The write is first recorded in the WAL (Write Ahead Log) and then stored in the MemStore. The MemStore is flushed to an HFile on disk from time to time.
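The order of the write path (WAL first, then MemStore, then an occasional flush to an HFile) can be sketched with a toy model. This is not HBase code: the class name `RegionSketch` is hypothetical, and flushing after a fixed number of entries is a simplification, since real flushes are driven by MemStore size.

```python
class RegionSketch:
    """Toy model of the write path: WAL append, MemStore update, flush to an HFile."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # append-only list of edits (stands in for the WAL)
        self.memstore = {}   # in-memory edits that have not been flushed yet
        self.hfiles = []     # each flush produces one immutable, sorted "file"
        self.flush_threshold = flush_threshold  # assumed entry-count trigger

    def put(self, row, value):
        self.wal.append((row, value))  # 1. log the edit first, for durability
        self.memstore[row] = value     # 2. then apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()               # 3. flush once the MemStore is "full"

    def flush(self):
        # Write the MemStore contents out in row-key order and start a new one.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}


region = RegionSketch(flush_threshold=2)
region.put("b", "1")
region.put("a", "2")    # the second put reaches the threshold and triggers a flush
region.put("c", "3")
print(region.hfiles)    # [[('a', '2'), ('b', '1')]]
print(region.memstore)  # {'c': '3'}
```

Because every edit is appended to `wal` before it reaches `memstore`, the entry for "c" could still be recovered from the log if the in-memory state were lost before the next flush.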
24. What is compaction in HBase?
As more and more data is written to HBase, many HFiles are created. Compaction is the process of merging these HFiles into fewer, larger files; once the merged file has been created successfully, the old files are discarded.
25. What are the different compaction types in HBase?
There are two types of compaction: minor and major. In a minor compaction, a few adjacent small HFiles are merged into a single HFile without removing deleted cells. The files to merge are selected by the compaction policy.
In a major compaction, all the HFiles of a store are merged into a single HFile, and deleted cells are discarded. It can run periodically or be triggered manually.
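The difference between the two compaction types can be illustrated with a small merge routine. This is a simplified sketch, not HBase’s implementation: each "HFile" is a Python list with one column per row, a value of `None` stands in for a delete marker, and the function name `compact` is made up.

```python
import heapq

TOMBSTONE = None  # in this toy model, a value of None marks a delete


def _order(cell):
    row, ts, value = cell
    # Row key ascending, newest timestamp first, tombstone before a value on ties.
    return (row, -ts, value is not TOMBSTONE)


def compact(hfiles, major=False):
    """Merge sorted files into one; a major compaction also drops deleted data.

    Each input file must already be sorted by row key, newest timestamp first.
    """
    merged = list(heapq.merge(*hfiles, key=_order))
    if not major:
        return merged  # minor: tombstones and the cells they mask are kept
    result = []
    deleted_at = {}    # row key -> timestamp of its newest tombstone
    for row, ts, value in merged:
        if value is TOMBSTONE:
            deleted_at.setdefault(row, ts)
        elif ts > deleted_at.get(row, float("-inf")):
            result.append((row, ts, value))
    return result


hfile1 = [("a", 1, "v1"), ("b", 1, "v1")]
hfile2 = [("a", 2, TOMBSTONE), ("c", 2, "v2")]
hfile3 = [("a", 3, "v3")]

minor = compact([hfile1, hfile2])
print(minor)  # [('a', 2, None), ('a', 1, 'v1'), ('b', 1, 'v1'), ('c', 2, 'v2')]
major = compact([minor, hfile3], major=True)
print(major)  # [('a', 3, 'v3'), ('b', 1, 'v1'), ('c', 2, 'v2')]
```

After the minor compaction the deleted cell for row "a" is still on disk next to its marker; only the major compaction removes both.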
26. What is the difference between the commands delete column and delete family?
The delete column command deletes all versions of a column, whereas delete family deletes all columns of a particular family.
27. What is a cell in HBase?
A cell is the smallest unit of an HBase table; it holds a piece of data identified by the tuple {row, column, version}.
28. What is the role of the class HColumnDescriptor in HBase?
This class is used to store information about a column family such as the number of versions, compression settings, etc. It is used as input when creating a table or adding a column.
29. What is the lower bound of versions in HBase?
The lower bound of versions is the minimum number of versions to keep for a column; it is configured per column family and defaults to 0. It is used together with TTL: for example, if the value is set to 3, the three latest versions are kept even after their TTL has expired, while older expired versions are removed.
30. What is TTL (time to live) in HBase?
TTL is a data retention setting that limits how long a version of a cell is kept. Once a version is older than the TTL, it is removed.
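How the version limits and TTL interact can be sketched in a few lines. The function below is an illustration under simplifying assumptions, not HBase code: timestamps are plain integers in seconds, and the parameter names `max_versions`, `ttl`, and `min_versions` are chosen here to mirror the column family settings for maximum versions, time to live, and minimum versions.

```python
def surviving_versions(timestamps, now, max_versions=3, ttl=None, min_versions=0):
    """Return the version timestamps that would be kept, newest first.

    timestamps   -- version timestamps of one cell, in seconds
    ttl          -- time to live in seconds, or None to keep versions forever
    min_versions -- number of newest versions kept even after the TTL expires
    """
    newest_first = sorted(timestamps, reverse=True)[:max_versions]
    if ttl is None:
        return newest_first
    kept = []
    for ts in newest_first:
        expired = now - ts > ttl
        if not expired or len(kept) < min_versions:
            kept.append(ts)
    return kept


versions = [100, 200, 300, 400]
print(surviving_versions(versions, now=500))                           # [400, 300, 200]
print(surviving_versions(versions, now=500, ttl=150))                  # [400]
print(surviving_versions(versions, now=500, ttl=150, min_versions=2))  # [400, 300]
```

The last call shows the effect of a lower bound: the version written at 300 has expired, but it is retained because two versions must always survive.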
31. Does HBase support table joins?
HBase does not support table joins. However, a MapReduce job can be written to join data retrieved from multiple HBase tables.
32. What is a rowkey in HBase?
Each row in HBase is identified by a unique byte array called the row key.
33. What are the two ways in which you can access data from HBase?
The data in HBase can be accessed in two ways:
Using the rowkey, either directly or with a table scan over a range of rowkey values.
Using MapReduce in a batch manner.
34. What are the two types of table design approach in HBase?
They are: (i) Short and Wide (ii) Tall and Thin
35. In which scenario should we consider creating a short and wide HBase table?
The short and wide table design is considered when:
There is a small number of rows
There is a large number of columns
36. In which scenario should we consider a tall and thin table design?
The tall and thin table design is considered when:
There is a large number of rows
There is a small number of columns
37. Give a command to store 4 versions in a table rather than the default 3.
hbase> alter 'tablename', {NAME => 'ColFamily', VERSIONS => 4}
38. What does the following command do?
hbase> alter 'tablename', {NAME => 'colFamily', METHOD => 'delete'}
This command deletes the column family from the table.
39. Give the commands to add a new column family (“newcolfamily”) to a table (“tablename”) which has an existing column family (“oldcolfamily”).
hbase> disable 'tablename'
hbase> alter 'tablename', {NAME => 'newcolfamily'}
hbase> enable 'tablename'
40. What is the HBase shell command to return only 10 records from a table?
hbase> scan 'tablename', {LIMIT => 10, STARTROW => 'start_row', STOPROW => 'stop_row'}
41. What does the following command do?
hbase> major_compact 'tablename'
It runs a major compaction on the table.
42. How does HBase support bulk data loading?
There are two main steps in an HBase bulk load:
Generate HBase data files (StoreFiles) from the data source using a custom MapReduce job. The StoreFiles are created in HBase’s internal format, so they can be loaded efficiently.
Import the prepared files into a running cluster using a tool such as completebulkload. Each file is loaded into one specific region.
43. How does HBase provide high availability?
HBase uses a feature called region replication. With this feature, each region of a table has multiple replicas that are opened on different RegionServers. The load balancer ensures that the replicas of a region are not co-hosted on the same RegionServer.
44. What are the different block caches in HBase?
HBase provides two different BlockCache implementations: the default on-heap LruBlockCache and the BucketCache, which is (usually) off-heap.
45. How does the WAL help when a RegionServer crashes?
The Write Ahead Log (WAL) records all changes to data in HBase to file-based storage. If a RegionServer crashes or becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to the data can be replayed.
46. Why is MultiWAL needed?
With a single WAL per RegionServer, the RegionServer must write to the WAL serially, because HDFS files must be sequential. This causes the WAL to be a performance bottleneck. MultiWAL lets a RegionServer write several WAL streams in parallel, which increases total write throughput.
47. What is log splitting in HBase?
A WAL file holds the edits for all regions hosted by a RegionServer. When a region is opened, the edits in the WAL file that belong to that region need to be replayed. Therefore, edits in the WAL file must be grouped by region so that particular sets can be replayed to regenerate the data in a particular region. The process of grouping the WAL edits by region is called log splitting.
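The grouping step can be illustrated in a few lines of Python. This is a conceptual sketch only: the tuple layout, the region names, and the function name `split_log` are assumptions made for the example and do not reflect the on-disk WAL format.

```python
from collections import defaultdict

# One log holds edits for several regions, in arrival order.
# Tuples are (sequence_id, region_name, row_key, value); all values are made up.
wal_edits = [
    (1, "region-A", "apple", "v1"),
    (2, "region-B", "mango", "v1"),
    (3, "region-A", "avocado", "v1"),
    (4, "region-B", "melon", "v2"),
]


def split_log(edits):
    """Group edits by region, preserving their original order within each region."""
    per_region = defaultdict(list)
    for seq_id, region, row, value in edits:
        per_region[region].append((seq_id, row, value))
    return dict(per_region)


recovered = split_log(wal_edits)
print(recovered["region-A"])  # [(1, 'apple', 'v1'), (3, 'avocado', 'v1')]
print(recovered["region-B"])  # [(2, 'mango', 'v1'), (4, 'melon', 'v2')]
```

Keeping the sequence order inside each group matters, because replaying edits out of order could leave a region with stale values.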
48. How can you disable the WAL? What is the benefit?
The WAL can be disabled to remove it as a performance bottleneck, at the cost of losing unflushed edits if the RegionServer crashes.
This is controlled by the HBase client field Mutation.writeToWAL; use Mutation.setDurability(Durability.SKIP_WAL) to set it.
49. When do we do manual region splitting?
Manual region splitting is done when there is an unexpected hotspot in a table because many clients are querying the same table.
50. What is an HBase Store?
An HBase Store hosts a MemStore and zero or more StoreFiles (HFiles). A Store corresponds to a column family for a table for a given region.
51. Which file in HBase is designed after the SSTable file of Bigtable?
The HFile in HBase, which stores the actual data (not metadata), is designed after the SSTable file of Bigtable.
52. Why do we pre-create empty regions?
Tables in HBase are initially created with one region by default. During a bulk import, all clients therefore write to the same region until it is large enough to split and become distributed across the cluster. Pre-creating empty regions lets the writes be spread across the cluster from the start, which makes the import faster.
53. What is hotspotting in HBase?
Hotspotting is a situation in which a large amount of client traffic is directed at one node, or only a few nodes, of a cluster. This traffic may represent reads, writes, or other operations. This traffic overwhelms the single machine responsible for hosting that region, causing performance degradation and potentially leading to region unavailability.
54. What are the approaches to avoid hotspotting?
Hotspotting can be avoided or minimized by distributing the rowkeys across multiple regions. Two techniques for doing this are salting and hashing.
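Both techniques can be sketched as small key-building helpers. This is an illustration, not HBase client code: the bucket count, the two-digit prefix format, the use of MD5, and the function names are all assumptions chosen for the example.

```python
import hashlib
import random

NUM_BUCKETS = 4  # assumed number of prefixes, e.g. one per pre-split region


def salted_key(row_key, buckets=NUM_BUCKETS, rng=random):
    """Salting: prepend a randomly chosen prefix, so writes spread across buckets.

    A reader that wants one row must then look in every bucket.
    """
    return f"{rng.randrange(buckets):02d}-{row_key}"


def hash_prefixed_key(row_key, buckets=NUM_BUCKETS):
    """Hashing: derive the prefix from the key itself, so it is reproducible.

    The same row key always maps to the same bucket, which keeps point reads cheap.
    """
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return f"{digest[0] % buckets:02d}-{row_key}"


# Monotonically increasing keys (such as timestamps) are a typical hotspot source.
for second in range(3):
    key = f"2024-01-01T00:00:{second:02d}"
    print(salted_key(key), hash_prefixed_key(key))
```

The trade-off in both cases is that range scans over the original key order become more expensive, because neighboring keys no longer sit next to each other.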
55. Why should we try to minimize the row name and column name sizes in HBase?
In HBase, values are always freighted with their coordinates; as a cell value passes through the system, it is accompanied by its row, column name, and timestamp. If the row and column names are large, especially compared to the size of the cell value, then the indices kept on HBase StoreFiles (HFiles) to facilitate random access may end up occupying a large share of the RAM allotted to HBase, because the cell value coordinates are large.
56. What is the scope of a rowkey in HBase?
Rowkeys are scoped to Column Families. The same rowkey could exist in each ColumnFamily that exists in a table without collision.
57. What information is stored in the hbase:meta table?
The hbase:meta table stores details of the regions in the system in the following format:
info:regioninfo (serialized HRegionInfo instance for this region)
info:server (server:port of the RegionServer containing this region)
info:serverstartcode (start time of the RegionServer process containing this region)
58. How do we get the complete list of columns that exist in a column family?
The complete list of columns in a column family can be obtained only by querying all the rows for that column family.
59. When records are fetched from an HBase table, in which order are they sorted?
The records fetched from HBase are always sorted by rowkey, then column family, then column qualifier, then timestamp (newest first).
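The ordering can be reproduced with an ordinary sort key. The sketch below uses made-up cells and plain Python strings; HBase itself compares raw bytes, which gives the same result for the ASCII names used here.

```python
# Cells as (row_key, column_family, qualifier, timestamp, value); data is made up.
cells = [
    ("row2", "info", "name", 5, "Bob"),
    ("row1", "info", "name", 3, "Alice"),
    ("row1", "info", "name", 7, "Alice B."),
    ("row1", "info", "city", 4, "Paris"),
    ("row1", "addr", "zip", 2, "75001"),
]


def sort_key(cell):
    row, family, qualifier, timestamp, _value = cell
    # Negating the timestamp makes newer versions sort first within a column.
    return (row, family, qualifier, -timestamp)


for cell in sorted(cells, key=sort_key):
    print(cell)
# ('row1', 'addr', 'zip', 2, '75001')
# ('row1', 'info', 'city', 4, 'Paris')
# ('row1', 'info', 'name', 7, 'Alice B.')
# ('row1', 'info', 'name', 3, 'Alice')
# ('row2', 'info', 'name', 5, 'Bob')
```

Because the newest version of a column comes first, a reader that wants only the latest value can stop at the first matching cell.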
60. How are filters useful in Apache HBase?
The Filter Language was introduced in Apache HBase 0.92. It allows you to perform server-side filtering when accessing HBase over Thrift or in the HBase shell.
61. How many filters are available in Apache HBase?
In total, 18 filters are supported in the Filter Language.
They include:
ColumnPrefixFilter
TimestampsFilter
PageFilter
MultipleColumnPrefixFilter
PrefixFilter
FamilyFilter
62. What are the WAL and HLog in HBase?
The WAL (Write Ahead Log) is similar to the MySQL binary log; it records all changes made to the data. It is a standard Hadoop sequence file that stores HLogKeys. These keys consist of a sequence number as well as the actual data, and they are used to replay data that has not yet been persisted after a server crash. So, in case of a server failure, the WAL works as a lifeline and recovers the lost data.
63. What is the row key?
The row key is defined by the application. Because the combined key is prefixed by the rowkey, the application can define the desired sort order. The rowkey also allows logical grouping of cells and makes sure that all cells with the same rowkey are co-located on the same server.
64. Explain deletion in HBase. What are the three types of tombstone markers in HBase?
When you delete a cell in HBase, the data is not actually deleted; instead, a tombstone marker is set, making the deleted cells invisible. Deleted cells are actually removed during major compactions.
There are three types of tombstone markers:
Version delete marker: marks a single version of a column for deletion
Column delete marker: marks all versions of a column for deletion
Family delete marker: marks all columns of a column family for deletion
65. How does HBase actually delete a row?
In HBase, whatever you write is stored first in RAM and then on disk, and these disk writes are immutable barring compaction. During deletion, major compactions process delete markers while minor compactions do not. A normal delete results in a tombstone marker; the deleted data it represents is removed during compaction. Also, if you delete data and then add more data with an earlier timestamp than the tombstone’s, further Gets may be masked by the tombstone, and you will not receive the inserted value until after the major compaction.
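The masking effect described above can be reproduced with a toy model of a single column. This is a simplified sketch, not HBase code: the class name `CellSketch` is hypothetical, and only one column-level delete marker is modeled.

```python
class CellSketch:
    """Toy model of one column: its versions plus a column-level delete marker."""

    def __init__(self):
        self.versions = {}        # timestamp -> value
        self.tombstone_ts = None  # timestamp of the delete marker, if any

    def put(self, ts, value):
        self.versions[ts] = value

    def delete(self, ts):
        # Nothing is removed here; a marker is recorded instead.
        if self.tombstone_ts is None or ts > self.tombstone_ts:
            self.tombstone_ts = ts

    def get(self):
        """Return the newest value not masked by the tombstone, or None."""
        visible = [ts for ts in self.versions
                   if self.tombstone_ts is None or ts > self.tombstone_ts]
        return self.versions[max(visible)] if visible else None

    def major_compact(self):
        # Drop the masked versions first, then the marker itself.
        if self.tombstone_ts is not None:
            self.versions = {ts: value for ts, value in self.versions.items()
                             if ts > self.tombstone_ts}
            self.tombstone_ts = None


cell = CellSketch()
cell.put(10, "old")
cell.delete(20)       # masks every version with a timestamp of 20 or earlier
cell.put(15, "late")  # written after the delete, but with an earlier timestamp
print(cell.get())     # None
cell.major_compact()
cell.put(15, "late")  # the same put once the marker is gone
print(cell.get())     # late
```

The first `get` returns nothing even though the put happened after the delete, because visibility is decided by timestamps rather than by the order in which operations arrived.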
66. Explain what happens if you alter the block size of a column family on an already occupied database?
When you alter the block size of the column family, the new data occupies the new block size while the old data remains within the old block size. During data compaction, old data will take the new block size. New files as they are flushed, have a new block size whereas existing data will continue to be read correctly. All data should be transformed to the new block size, after the next major compaction.
67. Mention the difference between HBase and a relational database.
HBase | Relational Database
It is schema-less | It is a schema-based database
It is a column-oriented data store | It is a row-oriented data store
It is used to store de-normalized data | It is used to store normalized data
It contains sparsely populated tables | It contains thin tables
Automated partitioning is done in HBase | There is no such provision or built-in support for partitioning