Hive Partitioning and Bucketing (Using Partitioning and Bucketing in the Same Table):

HIVE:


Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, querying, and analysis. Hive is well suited to running queries over large datasets.

Partitioning and bucketing are two of Hive's core concepts for organizing table data.

 

Why We Use Partitioning:

A simple query in Hive reads the entire dataset, even when a WHERE clause filters out most of it. This becomes a bottleneck when running MapReduce jobs over a large table. We can overcome this by partitioning the table.

Hive makes partitioning easy to implement: the partition scheme is declared when the table is created, and Hive lays out the partition directories automatically.

Partition:

Hive organizes tables into partitions, a way of dividing a table into related parts based on the values of partition columns such as date, city, or department. Partitions make it easy to query just a portion of the data.
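The directory-per-value layout is what enables this: a filter on the partition column means Hive scans only that partition's directory. A rough sketch of the idea (the table, paths, and file names here are hypothetical, not taken from the example below):

```python
# Toy model of partition pruning: each partition value maps to its own
# directory, so filtering on the partition column scans one directory
# instead of all of them. Not Hive code; illustration only.
partitions = {
    "country=IN": ["part-00000", "part-00001"],
    "country=US": ["part-00000"],
    "country=UK": ["part-00000"],
}

def files_to_scan(filter_value=None):
    if filter_value is None:
        # No WHERE on the partition column: full scan of every directory.
        return [f"{d}/{f}" for d, fs in partitions.items() for f in fs]
    # WHERE country = <value>: only one directory is read.
    d = f"country={filter_value}"
    return [f"{d}/{f}" for f in partitions.get(d, [])]

print(len(files_to_scan()))   # all 4 files without pruning
print(files_to_scan("US"))    # only ['country=US/part-00000']
```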

 

Why We Use Bucketing:

Partitioning gives effective results when:
1. there are a limited number of partitions, and
2. the partitions are of comparatively equal size.

But this is not always possible. For example, if we partition a table by a geographic column such as country, large countries produce large partitions while small countries produce tiny ones, so the partitions are heavily skewed. In such cases partitioning alone is not ideal.

To overcome the problem of over-partitioning, Hive provides bucketing, another technique for decomposing table data sets into more manageable parts.

Bucketing:

A Hive partition can be further subdivided into clusters, or buckets.

Bucketing is another technique for decomposing data into more manageable, roughly equal-sized parts. Unlike partitions, whose number grows with the data, the number of buckets is fixed: it must be declared when the table is created. Hive assigns each row to a bucket by hashing the bucketing column and taking the result modulo the number of buckets.
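The hash-then-modulo assignment can be sketched as follows. The Java-style string hash below is an assumption about Hive's hashing (the real implementation varies by version and column type), used here only to show the mechanics deterministically:

```python
NUM_BUCKETS = 4  # fixed at table-creation time; cannot grow with the data

def java_style_hash(s):
    # Stand-in for Hive's internal hash (Java String.hashCode-like),
    # masked to 32 bits for determinism. Assumption, not Hive's exact code.
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h

def bucket_for(value, num_buckets=NUM_BUCKETS):
    # Bucket index = hash(bucketing column) mod number of buckets.
    return (java_style_hash(value) & 0x7FFFFFFF) % num_buckets

for v in ("male", "female"):
    print(v, "-> bucket", bucket_for(v))
```

Because the bucket is a pure function of the column value, every row with the same bucketing-column value always lands in the same bucket file.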


Example of Partitioning and Bucketing

Step-1: Create a Hive table

create table patient1(patient_id int, patient_name string, gender string, total_amount int, drug string) row format delimited fields terminated by ',' stored as textfile; 

Step-2: Load data into the Hive table

load data local inpath '/home/geouser/Documents/patient1' into table patient1; 

hive> create table patient1(patient_id int, patient_name string, gender string, total_amount int, drug string) row format delimited fields terminated by ',' stored as textfile;
OK
Time taken: 0.972 seconds
hive> load data local inpath '/home/geouser/Documents/patient1' into table patient1;
Loading data to table default.patient1
Table default.patient1 stats: [numFiles=1, totalSize=282]
OK
Time taken: 2.338 seconds

 

Step-3: Create a table in Hive with partitioning and bucketing. Note that gender has only two distinct values, so bucketing it into 4 buckets will leave two buckets empty; a higher-cardinality column is usually a better bucketing key, but gender keeps this example easy to follow.

 create table partition_bucket (patient_id int, patient_name string, gender string, total_amount int) partitioned by (drug string) clustered by (gender) into 4 buckets;

hive> create table partition_bucket (patient_id int, patient_name string, gender string, total_amount int) partitioned by (drug string) clustered by (gender) into 4 buckets;
OK
Time taken: 0.585 seconds


Step-4: Set the properties for dynamic partitioning and bucketing. The first two properties let Hive create partitions dynamically from the values returned by the SELECT; the third makes Hive use one reducer per bucket so the data is actually bucketed on insert.

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.enforce.bucketing=true;

hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.enforce.bucketing=true;


Step-5: Insert values into the table

insert overwrite table partition_bucket partition(drug) select * from patient1;

hive> insert overwrite table partition_bucket partition(drug) select * from patient1;
Query ID = geouser_20160901111450_a2e1829d-2139-41ed-8a0a-7d113ff81cc7
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1472708272048_Hive0001, Tracking URL = http://geouser:8088/proxy/application_1472708272048_0001/
Kill Command = /home/geouser/hadoop-2.7.1/bin/hadoop job  -kill job_1472708272048_0001

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2016-09-01 11:15:23,802 Stage-1 map = 0%,  reduce = 0%
2016-09-01 11:15:50,072 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.63 sec
2016-09-01 11:16:25,977 Stage-1 map = 100%,  reduce = 50%, Cumulative CPU 7.13 sec
2016-09-01 11:16:28,685 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 11.37 sec
2016-09-01 11:16:35,141 Stage-1 map = 100%,  reduce = 83%, Cumulative CPU 17.98 sec
2016-09-01 11:16:36,230 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 24.07 sec
MapReduce Total cumulative CPU time: 24 seconds 70 msec
Ended Job = job_1472708272048_0001
Loading data to table default.partition_bucket partition (drug=null)
Time taken for load dynamic partitions : 2309

Loading partition {drug=metacin}
Loading partition {drug=crocin}
Loading partition {drug=para}
Time taken for adding to write entity : 18
Partition default.partition_bucket{drug=crocin} stats: [numFiles=4, numRows=3, totalSize=85, rawDataSize=82]
Partition default.partition_bucket{drug=metacin} stats: [numFiles=4, numRows=2, totalSize=58, rawDataSize=56]
Partition default.partition_bucket{drug=para} stats: [numFiles=4, numRows=3, totalSize=87, rawDataSize=84]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 4   Cumulative CPU: 24.07 sec   HDFS Read: 19311 HDFS Write: 580 SUCCESS
Total MapReduce CPU Time Spent: 24 seconds 70 msec
OK
Time taken: 113.403 seconds


Step-6: View the table's files using Hadoop commands

(i). List the partition directories of the table partition_bucket

hadoop fs -ls /user/hive/warehouse/partition_bucket;

geouser@geouser:~$ hadoop fs -ls /user/hive/warehouse/partition_bucket;
16/09/01 11:27:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Found 3 items
drwxr-xr-x   - geouser supergroup          0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=crocin
drwxr-xr-x   - geouser supergroup          0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=metacin
drwxr-xr-x   - geouser supergroup          0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=para

(ii). List the bucket files inside each partition of the table partition_bucket. Since gender has only two distinct values, only two of the four bucket files in each partition contain data; the other two are empty (size 0).

hadoop fs -ls /user/hive/warehouse/partition_bucket/drug=crocin; hadoop fs -ls /user/hive/warehouse/partition_bucket/drug=metacin; hadoop fs -ls /user/hive/warehouse/partition_bucket/drug=para;

geouser@geouser:~$ hadoop fs -ls /user/hive/warehouse/partition_bucket/drug=crocin;
16/09/01 11:27:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Found 4 items
-rwxr-xr-x   1 geouser supergroup         58 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=crocin/000000_0
-rwxr-xr-x   1 geouser supergroup         27 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=crocin/000001_0
-rwxr-xr-x   1 geouser supergroup          0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=crocin/000002_0
-rwxr-xr-x   1 geouser supergroup          0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=crocin/000003_0

geouser@geouser:~$ hadoop fs -ls /user/hive/warehouse/partition_bucket/drug=metacin;
16/09/01 11:28:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Found 4 items
-rwxr-xr-x   1 geouser supergroup          0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=metacin/000000_0
-rwxr-xr-x   1 geouser supergroup         58 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=metacin/000001_0
-rwxr-xr-x   1 geouser supergroup          0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=metacin/000002_0
-rwxr-xr-x   1 geouser supergroup          0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=metacin/000003_0


geouser@geouser:~$ hadoop fs -ls /user/hive/warehouse/partition_bucket/drug=para;
16/09/01 11:28:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Found 4 items
-rwxr-xr-x   1 geouser supergroup         29 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=para/000000_0
-rwxr-xr-x   1 geouser supergroup         58 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=para/000001_0
-rwxr-xr-x   1 geouser supergroup          0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=para/000002_0
-rwxr-xr-x   1 geouser supergroup          0 2016-09-01 11:16 /user/hive/warehouse/partition_bucket/drug=para/000003_0

 

(iii). View the contents of the files using the cat command. The partition_bucket table was created without a ROW FORMAT clause, so its files use Hive's default field delimiter (Ctrl-A, \001), which is non-printing; that is why the field values appear run together below.

geouser@geouser:~$ hadoop fs -cat /user/hive/warehouse/partition_bucket/drug=para/000000_0;
16/09/01 11:30:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
10013404manisha female200
geouser@geouser:~$ hadoop fs -cat /user/hive/warehouse/partition_bucket/drug=para/000001_0;
16/09/01 11:30:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
10013204mohammedmale  999
10013201alaistermale  500

 

Step-7: View the bucket contents of the table partition_bucket using TABLESAMPLE

select * from partition_bucket TABLESAMPLE(BUCKET 1 OUT OF 4 ON gender);
select * from partition_bucket TABLESAMPLE(BUCKET 2 OUT OF 4 ON gender);

hive> select * from partition_bucket TABLESAMPLE(BUCKET 1 OUT OF 4 ON gender);
OK
10013224    marsinga    female    570    crocin
10013203    anifa       female    600    crocin
10013404    manisha     female    200    para
Time taken: 0.277 seconds, Fetched: 3 row(s)
hive> select * from partition_bucket TABLESAMPLE(BUCKET 2 OUT OF 4 ON gender);
OK
10011204    vijay       male    670    crocin
10013304    piyush      male      400    metacin
10013202    briito      male      800    metacin
10013204    mohammed    male      999    para
10013201    alaister    male      500    para
Time taken: 0.252 seconds, Fetched: 5 row(s)
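Why do all the female rows come back from bucket 1 and all the male rows from bucket 2? When the OUT OF value equals the table's bucket count, TABLESAMPLE(BUCKET x OUT OF y ON col) returns exactly the rows whose hash lands in bucket x-1. A rough sketch of that selection rule (the Java-style string hash is an assumption about Hive's hashing, used only for illustration):

```python
def java_style_hash(s):
    # Java String.hashCode-like hash, masked to 32 bits. Assumption:
    # a stand-in for Hive's real hash, which may differ by version.
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h

def bucket_of(value, num_buckets=4):
    return (java_style_hash(value) & 0x7FFFFFFF) % num_buckets

def in_sample(value, x, y=4):
    # TABLESAMPLE(BUCKET x OUT OF y ON col), with y equal to the table's
    # bucket count: the sample is exactly the rows in bucket x-1.
    return bucket_of(value, y) == x - 1

print(in_sample("female", 1))  # True: female rows come from BUCKET 1
print(in_sample("male", 2))    # True: male rows come from BUCKET 2
```

With only two distinct gender values, buckets 3 and 4 receive no rows, which is consistent with the two zero-byte files seen in each partition in Step-6.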