HIVE TABLE USING PARTITION BUCKETING
Hive Partition Bucketing (using partitioning and bucketing in the same table):
HIVE:
Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, querying, and analysis. Hive is well suited to running queries over large datasets.
Partitioning and bucketing are the two main concepts Hive offers for organizing table data.
Why we use partitioning:
A simple query in Hive reads the entire dataset even when a WHERE clause filters it. This becomes a bottleneck for running MapReduce jobs over a large table. We can overcome this issue by partitioning the table.
Hive makes partitions easy to implement: the partition scheme is declared when the table is created, and Hive manages the partition directories automatically.
Partition:
Hive organizes tables into partitions: a way of dividing a table into related parts based on the values of partition columns such as date, city, or department. With partitions it is easy to query only a portion of the data.
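The payoff is partition pruning: a filter on the partition column lets Hive skip whole directories of data. A hypothetical sketch of that idea in Python (not Hive code; the row values are illustrative):

```python
# Sketch (not Hive itself): each partition value becomes its own directory,
# so a filter on the partition column reads one directory, not the table.

rows = [  # (patient_name, drug) -- hypothetical sample rows
    ("manisha", "para"), ("vijay", "crocin"), ("piyush", "metacin"),
    ("mohammed", "para"),
]

# The directory layout Hive would maintain: one folder per drug value.
partitions = {}
for name, drug in rows:
    partitions.setdefault("drug=" + drug, []).append(name)

# SELECT ... WHERE drug = 'para' only has to scan this one directory.
scanned = partitions["drug=para"]
print(sorted(partitions), scanned)
```

Without the partition, every query would scan all rows; with it, the filter touches one directory out of three.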
Why we use bucketing:
Partitioning gives effective results when:
1. there are a limited number of partitions, and
2. the partitions are of comparatively equal size.
But this is not possible in every scenario. For example, when we partition a table by a geographic column such as country, big countries produce large partitions while small countries produce small ones, so the partitions end up badly skewed. In such cases partitioning alone is not ideal.
To overcome this over-partitioning problem, Hive provides bucketing, another technique for decomposing a table's data into more manageable parts.
Bucketing:
A Hive partition can be further subdivided into clusters, or buckets.
Bucketing is another technique for decomposing data into more manageable, roughly equal parts. Unlike partitions, the number of buckets is fixed: it must be declared at table-creation time and cannot be changed afterwards.
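The bucket a row goes to is determined by hashing the clustering column modulo the number of buckets. The Python sketch below illustrates the idea; the hash function is a stand-in modeled on Java's String.hashCode (which Hive uses for string columns), not Hive's actual implementation:

```python
# Illustration (not Hive itself): a row's bucket is
#   hash(clustering column) mod number-of-buckets,
# so rows with the same key always land in the same bucket.

def bucket_for(value: str, num_buckets: int) -> int:
    # Stand-in hash modeled on Java's String.hashCode (31-based polynomial),
    # masked to a non-negative 32-bit value before the modulo.
    h = 0
    for ch in value:
        h = (h * 31 + ord(ch)) & 0x7FFFFFFF
    return h % num_buckets

# With 4 buckets, every "male" row lands in one bucket and every "female"
# row in another -- which is exactly the grouping used later in this example.
print(bucket_for("male", 4), bucket_for("female", 4))
```

Because the bucket count is baked into the hash, Hive can locate or sample a bucket without scanning the others.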
Example of partitioning and bucketing together
Step-1: Create a Hive table
create table patient1(patient_id int, patient_name string, gender string, total_amount int, drug string) row format delimited fields terminated by ',' stored as textfile;
Step-2: Load data into the Hive table
load data local inpath '/home/geouser/Documents/patient1' into table patient1;
hive> create table patient1(patient_id int, patient_name string, gender string, total_amount int, drug string) row format delimited fields terminated by ',' stored as textfile;
OK
Time taken: 0.972 seconds
hive> load data local inpath '/home/geouser/Documents/patient1' into table patient1;
Loading data to table default.patient1
Table default.patient1 stats: [numFiles=1, totalSize=282]
OK
Time taken: 2.338 seconds
Step-3: Create a Hive table with partitioning and bucketing
create table partition_bucket (patient_id int, patient_name string, gender string, total_amount int) partitioned by (drug string) clustered by (gender) into 4 buckets;
hive> create table partition_bucket (patient_id int, patient_name string, gender string, total_amount int) partitioned by (drug string) clustered by (gender) into 4 buckets;
OK
Time taken: 0.585 seconds
Step-4: Set the properties for partition and bucketing
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.enforce.bucketing=true;
hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.enforce.bucketing=true;
Step-5: Insert values into the table
insert overwrite table partition_bucket partition(drug) select * from patient1;
hive> insert overwrite table partition_bucket partition(drug) select * from patient1;
Query ID = geouser_20160901111450_a2e1829d-2139-41ed-8a0a-7d113ff81cc7
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1472708272048_0001, Tracking URL = http://geouser:8088/proxy/application_1472708272048_0001/
Kill Command = /home/geouser/hadoop-2.7.1/bin/hadoop job -kill job_1472708272048_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2016-09-01 11:15:23,802 Stage-1 map = 0%, reduce = 0%
2016-09-01 11:15:50,072 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.63 sec
2016-09-01 11:16:25,977 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 7.13 sec
2016-09-01 11:16:28,685 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 11.37 sec
2016-09-01 11:16:35,141 Stage-1 map = 100%, reduce = 83%, Cumulative CPU 17.98 sec
2016-09-01 11:16:36,230 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 24.07 sec
MapReduce Total cumulative CPU time: 24 seconds 70 msec
Ended Job = job_1472708272048_0001
Loading data to table default.partition_bucket partition (drug=null)
Time taken for load dynamic partitions : 2309
Loading partition {drug=metacin}
Loading partition {drug=crocin}
Loading partition {drug=para}
Time taken for adding to write entity : 18
Partition default.partition_bucket{drug=crocin} stats: [numFiles=4, numRows=3, totalSize=85, rawDataSize=82]
Partition default.partition_bucket{drug=metacin} stats: [numFiles=4, numRows=2, totalSize=58, rawDataSize=56]
Partition default.partition_bucket{drug=para} stats: [numFiles=4, numRows=3, totalSize=87, rawDataSize=84]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 4 Cumulative CPU: 24.07 sec HDFS Read: 19311 HDFS Write: 580 SUCCESS
Total MapReduce CPU Time Spent: 24 seconds 70 msec
OK
Time taken: 113.403 seconds
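Note in the stats above that every partition shows numFiles=4: with bucketing enforced, each of the 4 reducers writes one file per partition directory, even when its bucket is empty. A hypothetical Python sketch of where each row lands (the hash is a stand-in modeled on Java's String.hashCode, not Hive's actual code; sample rows taken from the transcript):

```python
# Sketch (not Hive): where each row lands after the dynamic-partition insert.
# Directory = partition column (drug); file = bucket of the clustering
# column (gender). Empty buckets still get a zero-byte file, one per reducer.

def bucket_for(value: str, num_buckets: int = 4) -> int:
    # Stand-in hash modeled on Java's String.hashCode, masked non-negative.
    h = 0
    for ch in value:
        h = (h * 31 + ord(ch)) & 0x7FFFFFFF
    return h % num_buckets

rows = [  # (patient_id, patient_name, gender, total_amount, drug)
    (10013404, "manisha", "female", 200, "para"),
    (10013204, "mohammed", "male", 999, "para"),
    (10011204, "vijay", "male", 670, "crocin"),
]

placement = {
    name: "drug=%s/%06d_0" % (drug, bucket_for(gender))
    for _pid, name, gender, _amt, drug in rows
}
for name, path in sorted(placement.items()):
    print(name, "->", path)
```

Under these assumptions the female rows map to file 000000_0 and the male rows to 000001_0 of their drug's directory, which matches the file sizes listed in the next step.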
Step-6: View the contents of the Hive table using Hadoop commands
(i). List the partition directories of the table partition_bucket
hadoop fs -ls /user/hive/warehouse/partition_bucket;
geouser@geouser:~$ hadoop fs -ls /user/hive/warehouse/partition_bucket;
16/09/01 11:27:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Found 3 items
drwxr-xr-x - geouser supergroup 0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=crocin
drwxr-xr-x - geouser supergroup 0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=metacin
drwxr-xr-x - geouser supergroup 0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=para
(ii). List the bucket files inside each partition directory
hadoop fs -ls /user/hive/warehouse/partition_bucket/drug=crocin;
hadoop fs -ls /user/hive/warehouse/partition_bucket/drug=metacin;
hadoop fs -ls /user/hive/warehouse/partition_bucket/drug=para;
geouser@geouser:~$ hadoop fs -ls /user/hive/warehouse/partition_bucket/drug=crocin;
16/09/01 11:27:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Found 4 items
-rwxr-xr-x 1 geouser supergroup 58 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=crocin/000000_0
-rwxr-xr-x 1 geouser supergroup 27 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=crocin/000001_0
-rwxr-xr-x 1 geouser supergroup 0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=crocin/000002_0
-rwxr-xr-x 1 geouser supergroup 0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=crocin/000003_0
geouser@geouser:~$ hadoop fs -ls /user/hive/warehouse/partition_bucket/drug=metacin;
16/09/01 11:28:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Found 4 items
-rwxr-xr-x 1 geouser supergroup 0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=metacin/000000_0
-rwxr-xr-x 1 geouser supergroup 58 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=metacin/000001_0
-rwxr-xr-x 1 geouser supergroup 0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=metacin/000002_0
-rwxr-xr-x 1 geouser supergroup 0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=metacin/000003_0
geouser@geouser:~$ hadoop fs -ls /user/hive/warehouse/partition_bucket/drug=para;
16/09/01 11:28:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Found 4 items
-rwxr-xr-x 1 geouser supergroup 29 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=para/000000_0
-rwxr-xr-x 1 geouser supergroup 58 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=para/000001_0
-rwxr-xr-x 1 geouser supergroup 0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=para/000002_0
-rwxr-xr-x 1 geouser supergroup 0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=para/000003_0
(iii). View the contents of the files using the cat command
geouser@geouser:~$ hadoop fs -cat /user/hive/warehouse/partition_bucket/drug=para/000000_0;
16/09/01 11:30:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
10013404manisha female200
geouser@geouser:~$ hadoop fs -cat /user/hive/warehouse/partition_bucket/drug=para/000001_0;
16/09/01 11:30:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
10013204mohammedmale 999
10013201alaistermale 500
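The fields above look glued together because the partition_bucket table was created without a row format clause, so Hive wrote its default text-format field delimiter, the invisible control character \x01 (Ctrl-A), between columns. A small illustration:

```python
# Illustration: Hive's default text-format field delimiter is \x01 (Ctrl-A),
# which most terminals render as nothing -- so the fields appear run together.

line = "10013404\x01manisha\x01female\x01200"  # what the file actually stores
fields = line.split("\x01")
print(fields)
```

Splitting on \x01 recovers the four original columns cleanly.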
Step-7: View the bucket contents of the table partition_bucket
select * from partition_bucket TABLESAMPLE(BUCKET 1 OUT OF 4 ON gender);
select * from partition_bucket TABLESAMPLE(BUCKET 2 OUT OF 4 ON gender);
hive> select * from partition_bucket TABLESAMPLE(BUCKET 1 OUT OF 4 ON gender);
OK
10013224 marsinga female 570 crocin
10013203 anifa female 600 crocin
10013404 manisha female 200 para
Time taken: 0.277 seconds, Fetched: 3 row(s)
hive> select * from partition_bucket TABLESAMPLE(BUCKET 2 OUT OF 4 ON gender);
OK
10011204 vijay male 670 crocin
10013304 piyush male 400 metacin
10013202 briito male 800 metacin
10013204 mohammed male 999 para
10013201 alaister male 500 para
Time taken: 0.252 seconds, Fetched: 5 row(s)
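TABLESAMPLE(BUCKET x OUT OF y ON col) hashes col and keeps the rows that fall into sample group x; when y equals the table's bucket count, that is exactly one bucket, which is why the two queries above split the table into its female and male rows. A Python sketch of that selection (the hash is a stand-in modeled on Java's String.hashCode, not Hive's actual code):

```python
# Sketch (not Hive): TABLESAMPLE(BUCKET x OUT OF y ON gender) keeps the rows
# whose hash falls into sample group x - 1. With y equal to the table's
# bucket count (4), that is exactly the contents of one bucket file.

def bucket_for(value: str, num_buckets: int = 4) -> int:
    # Stand-in hash modeled on Java's String.hashCode, masked non-negative.
    h = 0
    for ch in value:
        h = (h * 31 + ord(ch)) & 0x7FFFFFFF
    return h % num_buckets

rows = [  # (patient_name, gender) from the transcript above
    ("marsinga", "female"), ("anifa", "female"), ("manisha", "female"),
    ("vijay", "male"), ("piyush", "male"), ("briito", "male"),
    ("mohammed", "male"), ("alaister", "male"),
]

def tablesample(rows, bucket, out_of=4):
    return [name for name, gender in rows
            if bucket_for(gender, out_of) == bucket - 1]

print(len(tablesample(rows, 1)), len(tablesample(rows, 2)))
```

Under these assumptions bucket 1 yields the 3 female rows and bucket 2 the 5 male rows, matching the query output above; buckets 3 and 4 are empty because gender has only two distinct values.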