HIVE TABLE USING PARTITION BUCKETING
Hive Partition Bucketing (using partitioning and bucketing in the same table):
HIVE:
Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, querying, and analysis. Hive is well suited to running queries over large datasets.
Partitioning and bucketing are the two main concepts Hive offers for organizing table data.
Why we use partitioning:
A simple query in Hive reads the entire dataset even when a WHERE clause filters it. This becomes a bottleneck for running MapReduce jobs over a large table. We can overcome this issue by partitioning the table.
Hive makes partitions easy to implement: the partition scheme is declared when the table is created, and Hive manages the partition directories automatically.
Partition:
Hive organizes tables into partitions: a way of dividing a table into related parts based on the values of partition columns such as date, city, or department. With partitions it is easy to query only a portion of the data.
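The payoff is partition pruning: a filter on the partition column lets Hive skip whole directories of data. A hypothetical sketch of that idea in Python (not Hive code; the row values are illustrative):

```python
# Sketch (not Hive itself): each partition value becomes its own directory,
# so a filter on the partition column reads one directory, not the table.

rows = [  # (patient_name, drug) -- hypothetical sample rows
    ("manisha", "para"), ("vijay", "crocin"), ("piyush", "metacin"),
    ("mohammed", "para"),
]

# The directory layout Hive would maintain: one folder per drug value.
partitions = {}
for name, drug in rows:
    partitions.setdefault("drug=" + drug, []).append(name)

# SELECT ... WHERE drug = 'para' only has to scan this one directory.
scanned = partitions["drug=para"]
print(sorted(partitions), scanned)
```

Without the partition, every query would scan all rows; with it, the filter touches one directory out of three.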
Why we use bucketing:
Partitioning gives effective results when:
1. there are a limited number of partitions, and
2. the partitions are of comparatively equal size.
But this is not possible in every scenario. For example, when we partition a table by a geographic column such as country, big countries produce large partitions while small countries produce small ones, so the partitions end up badly skewed. In such cases partitioning alone is not ideal.
To overcome this over-partitioning problem, Hive provides bucketing, another technique for decomposing a table's data into more manageable parts.
Bucketing:
A Hive partition can be further subdivided into clusters, or buckets.
Bucketing is another technique for decomposing data into more manageable, roughly equal parts. Unlike partitions, the number of buckets is fixed: it must be declared at table-creation time and cannot be changed afterwards.
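The bucket a row goes to is determined by hashing the clustering column modulo the number of buckets. The Python sketch below illustrates the idea; the hash function is a stand-in modeled on Java's String.hashCode (which Hive uses for string columns), not Hive's actual implementation:

```python
# Illustration (not Hive itself): a row's bucket is
#   hash(clustering column) mod number-of-buckets,
# so rows with the same key always land in the same bucket.

def bucket_for(value: str, num_buckets: int) -> int:
    # Stand-in hash modeled on Java's String.hashCode (31-based polynomial),
    # masked to a non-negative 32-bit value before the modulo.
    h = 0
    for ch in value:
        h = (h * 31 + ord(ch)) & 0x7FFFFFFF
    return h % num_buckets

# With 4 buckets, every "male" row lands in one bucket and every "female"
# row in another -- which is exactly the grouping used later in this example.
print(bucket_for("male", 4), bucket_for("female", 4))
```

Because the bucket count is baked into the hash, Hive can locate or sample a bucket without scanning the others.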
Example of partitioning and bucketing together
Step-1: Create a Hive table
create table patient1(patient_id int, patient_name string, gender string, total_amount int, drug string) row format delimited fields terminated by ',' stored as textfile;
Step-2: Load data into the Hive table
load data local inpath '/home/geouser/Documents/patient1' into table patient1;
hive> create table patient1(patient_id int, patient_name string, gender string, total_amount int, drug string) row format delimited fields terminated by ',' stored as textfile;
OK
Time taken: 0.972 seconds
hive> load data local inpath '/home/geouser/Documents/patient1' into table patient1;
Loading data to table default.patient1
Table default.patient1 stats: [numFiles=1, totalSize=282]
OK
Time taken: 2.338 seconds
Step-3: Create a Hive table with partitioning and bucketing
create table partition_bucket (patient_id int, patient_name string, gender string, total_amount int) partitioned by (drug string) clustered by (gender) into 4 buckets;
hive> create table partition_bucket (patient_id int, patient_name string, gender string, total_amount int) partitioned by (drug string) clustered by (gender) into 4 buckets;
OK
Time taken: 0.585 seconds
Step-4: Set the properties for partition and bucketing
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.enforce.bucketing=true;
hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.enforce.bucketing=true;
Step-5: Insert values into the table
insert overwrite table partition_bucket partition(drug) select * from patient1;
hive> insert overwrite table partition_bucket partition(drug) select * from patient1;
Query ID = geouser_20160901111450_a2e1829d-2139-41ed-8a0a-7d113ff81cc7
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1472708272048_0001, Tracking URL = http://geouser:8088/proxy/application_1472708272048_0001/
Kill Command = /home/geouser/hadoop-2.7.1/bin/hadoop job -kill job_1472708272048_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2016-09-01 11:15:23,802 Stage-1 map = 0%, reduce = 0%
2016-09-01 11:15:50,072 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.63 sec
2016-09-01 11:16:25,977 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 7.13 sec
2016-09-01 11:16:28,685 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 11.37 sec
2016-09-01 11:16:35,141 Stage-1 map = 100%, reduce = 83%, Cumulative CPU 17.98 sec
2016-09-01 11:16:36,230 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 24.07 sec
MapReduce Total cumulative CPU time: 24 seconds 70 msec
Ended Job = job_1472708272048_0001
Loading data to table default.partition_bucket partition (drug=null)
Time taken for load dynamic partitions : 2309
Loading partition {drug=metacin}
Loading partition {drug=crocin}
Loading partition {drug=para}
Time taken for adding to write entity : 18
Partition default.partition_bucket{drug=crocin} stats: [numFiles=4, numRows=3, totalSize=85, rawDataSize=82]
Partition default.partition_bucket{drug=metacin} stats: [numFiles=4, numRows=2, totalSize=58, rawDataSize=56]
Partition default.partition_bucket{drug=para} stats: [numFiles=4, numRows=3, totalSize=87, rawDataSize=84]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 4 Cumulative CPU: 24.07 sec HDFS Read: 19311 HDFS Write: 580 SUCCESS
Total MapReduce CPU Time Spent: 24 seconds 70 msec
OK
Time taken: 113.403 seconds
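Note in the stats above that every partition shows numFiles=4: with bucketing enforced, each of the 4 reducers writes one file per partition directory, even when its bucket is empty. A hypothetical Python sketch of where each row lands (the hash is a stand-in modeled on Java's String.hashCode, not Hive's actual code; sample rows taken from the transcript):

```python
# Sketch (not Hive): where each row lands after the dynamic-partition insert.
# Directory = partition column (drug); file = bucket of the clustering
# column (gender). Empty buckets still get a zero-byte file, one per reducer.

def bucket_for(value: str, num_buckets: int = 4) -> int:
    # Stand-in hash modeled on Java's String.hashCode, masked non-negative.
    h = 0
    for ch in value:
        h = (h * 31 + ord(ch)) & 0x7FFFFFFF
    return h % num_buckets

rows = [  # (patient_id, patient_name, gender, total_amount, drug)
    (10013404, "manisha", "female", 200, "para"),
    (10013204, "mohammed", "male", 999, "para"),
    (10011204, "vijay", "male", 670, "crocin"),
]

placement = {
    name: "drug=%s/%06d_0" % (drug, bucket_for(gender))
    for _pid, name, gender, _amt, drug in rows
}
for name, path in sorted(placement.items()):
    print(name, "->", path)
```

Under these assumptions the female rows map to file 000000_0 and the male rows to 000001_0 of their drug's directory, which matches the file sizes listed in the next step.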
Step-6: View the contents of the Hive table using Hadoop commands
(i). List the partition directories of the table partition_bucket
hadoop fs -ls /user/hive/warehouse/partition_bucket;
geouser@geouser:~$ hadoop fs -ls /user/hive/warehouse/partition_bucket;
16/09/01 11:27:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Found 3 items
drwxr-xr-x - geouser supergroup 0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=crocin
drwxr-xr-x - geouser supergroup 0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=metacin
drwxr-xr-x - geouser supergroup 0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=para
(ii). List the bucket files inside each partition directory
hadoop fs -ls /user/hive/warehouse/partition_bucket/drug=crocin;
hadoop fs -ls /user/hive/warehouse/partition_bucket/drug=metacin;
hadoop fs -ls /user/hive/warehouse/partition_bucket/drug=para;
geouser@geouser:~$ hadoop fs -ls /user/hive/warehouse/partition_bucket/drug=crocin;
16/09/01 11:27:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Found 4 items
-rwxr-xr-x 1 geouser supergroup 58 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=crocin/000000_0
-rwxr-xr-x 1 geouser supergroup 27 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=crocin/000001_0
-rwxr-xr-x 1 geouser supergroup 0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=crocin/000002_0
-rwxr-xr-x 1 geouser supergroup 0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=crocin/000003_0
geouser@geouser:~$ hadoop fs -ls /user/hive/warehouse/partition_bucket/drug=metacin;
16/09/01 11:28:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Found 4 items
-rwxr-xr-x 1 geouser supergroup 0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=metacin/000000_0
-rwxr-xr-x 1 geouser supergroup 58 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=metacin/000001_0
-rwxr-xr-x 1 geouser supergroup 0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=metacin/000002_0
-rwxr-xr-x 1 geouser supergroup 0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=metacin/000003_0
geouser@geouser:~$ hadoop fs -ls /user/hive/warehouse/partition_bucket/drug=para;
16/09/01 11:28:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Found 4 items
-rwxr-xr-x 1 geouser supergroup 29 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=para/000000_0
-rwxr-xr-x 1 geouser supergroup 58 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=para/000001_0
-rwxr-xr-x 1 geouser supergroup 0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=para/000002_0
-rwxr-xr-x 1 geouser supergroup 0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=para/000003_0
(iii). View the contents of the files using the cat command
geouser@geouser:~$ hadoop fs -cat /user/hive/warehouse/partition_bucket/drug=para/000000_0;
16/09/01 11:30:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
10013404manisha female200
geouser@geouser:~$ hadoop fs -cat /user/hive/warehouse/partition_bucket/drug=para/000001_0;
16/09/01 11:30:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
10013204mohammedmale 999
10013201alaistermale 500
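The fields above look glued together because the partition_bucket table was created without a row format clause, so Hive wrote its default text-format field delimiter, the invisible control character \x01 (Ctrl-A), between columns. A small illustration:

```python
# Illustration: Hive's default text-format field delimiter is \x01 (Ctrl-A),
# which most terminals render as nothing -- so the fields appear run together.

line = "10013404\x01manisha\x01female\x01200"  # what the file actually stores
fields = line.split("\x01")
print(fields)
```

Splitting on \x01 recovers the four original columns cleanly.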
Step-7: View the bucket contents of the table partition_bucket
select * from partition_bucket TABLESAMPLE(BUCKET 1 OUT OF 4 ON gender);
select * from partition_bucket TABLESAMPLE(BUCKET 2 OUT OF 4 ON gender);
hive> select * from partition_bucket TABLESAMPLE(BUCKET 1 OUT OF 4 ON gender);
OK
10013224 marsinga female 570 crocin
10013203 anifa female 600 crocin
10013404 manisha female 200 para
Time taken: 0.277 seconds, Fetched: 3 row(s)
hive> select * from partition_bucket TABLESAMPLE(BUCKET 2 OUT OF 4 ON gender);
OK
10011204 vijay male 670 crocin
10013304 piyush male 400 metacin
10013202 briito male 800 metacin
10013204 mohammed male 999 para
10013201 alaister male 500 para
Time taken: 0.252 seconds, Fetched: 5 row(s)
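TABLESAMPLE(BUCKET x OUT OF y ON col) hashes col and keeps the rows that fall into sample group x; when y equals the table's bucket count, that is exactly one bucket, which is why the two queries above split the table into its female and male rows. A Python sketch of that selection (the hash is a stand-in modeled on Java's String.hashCode, not Hive's actual code):

```python
# Sketch (not Hive): TABLESAMPLE(BUCKET x OUT OF y ON gender) keeps the rows
# whose hash falls into sample group x - 1. With y equal to the table's
# bucket count (4), that is exactly the contents of one bucket file.

def bucket_for(value: str, num_buckets: int = 4) -> int:
    # Stand-in hash modeled on Java's String.hashCode, masked non-negative.
    h = 0
    for ch in value:
        h = (h * 31 + ord(ch)) & 0x7FFFFFFF
    return h % num_buckets

rows = [  # (patient_name, gender) from the transcript above
    ("marsinga", "female"), ("anifa", "female"), ("manisha", "female"),
    ("vijay", "male"), ("piyush", "male"), ("briito", "male"),
    ("mohammed", "male"), ("alaister", "male"),
]

def tablesample(rows, bucket, out_of=4):
    return [name for name, gender in rows
            if bucket_for(gender, out_of) == bucket - 1]

print(len(tablesample(rows, 1)), len(tablesample(rows, 2)))
```

Under these assumptions bucket 1 yields the 3 female rows and bucket 2 the 5 male rows, matching the query output above; buckets 3 and 4 are empty because gender has only two distinct values.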