PARTITIONING AND BUCKETING IN HIVE:

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Hive is well suited to running queries over large datasets.

One of the main concepts in Hive is partitioning.

Why we use Partitioning:

A simple query in Hive reads the entire dataset, even when a WHERE clause filters the rows. This becomes a bottleneck for MapReduce jobs over large tables. We can overcome this issue by partitioning the table. Hive makes partitions easy to implement through the automatic partition scheme declared when the table is created.
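As a minimal sketch of the idea (the table and column names here are hypothetical, not part of the example that follows): when a table is partitioned on a column, a WHERE filter on that column lets Hive scan only the matching partition directory instead of the whole dataset.

```sql
-- Hypothetical sales table partitioned by a date string column dt.
CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (dt STRING);

-- This query reads only the dt=2016-08-24 directory, not the full table.
SELECT SUM(amount) FROM sales WHERE dt = '2016-08-24';
```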

Types of Partition:

[Figure: types of partition in Hive]

Custom Partition:

Static Partition:

1. The values of the partition columns are known at COMPILE TIME (given by the user).

2. Input data files are inserted individually into the partitioned table.

3. Static partitions are usually preferred when loading big files into Hive tables.

4. Static partitioning loads data faster than dynamic partitioning.

5. A WHERE clause is used to limit which rows are loaded into a static partition.

6. We can alter (add, drop, rename) static partitions.

7. Static partitioning can be performed on Hive managed tables or external tables.
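Point 6 above can be sketched as follows; the table name matches the partitioned table created later in this article, but the partition value `aspirin` is purely illustrative:

```sql
-- A sketch of altering static partitions with ALTER TABLE.
-- The partition value 'aspirin' is hypothetical.
ALTER TABLE p_patient1 ADD PARTITION (drug='aspirin');
ALTER TABLE p_patient1 DROP PARTITION (drug='aspirin');
```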

Static partitioning can be done in two ways:

1. Using an existing column,

2. Using a new column.

Create a table in Hive:

Step-1: Create a Hive table
create table patient(patient_id int, patient_name string, drug string, gender string, total_amount int) row format delimited fields terminated by ',' stored as textfile;
Step-2: Load data into the Hive table

load data local inpath '/home/geouser/Documents/patient' into table patient;

Data Set:(Patient)

10013201,alaister,para,male,500
10013202,briito,metacin,male,800
10013203,anifa,crocin,female,600
10013204,mohammed,para,male,999
10013304,piyush,metacin,male,400
10013404,manisha,para,female,200
10011204,vijay,crocin,male,670
10013224,marsinga,crocin,female,570

hive> create table patient(patient_id int, patient_name string, drug string, gender string, total_amount int) row format delimited fields terminated by ',' stored as textfile;
OK
Time taken: 0.541 seconds
hive> load data local inpath '/home/geouser/Documents/patient' into table patient;
Loading data to table default.patient
Table default.patient stats: [numFiles=1, totalSize=232]
OK
Time taken: 1.055 seconds

1. Using an existing column:

In this type of partitioning, the partition column is a column that is already present in the table.

Example:

Step-1: Create a Hive partition table

create table p_patient1(patient_id int, patient_name string, gender string, total_amount int) partitioned by ( drug string);

hive> create table p_patient1(patient_id int, patient_name string, gender string, total_amount int) partitioned by ( drug string);
OK
Time taken: 0.235 seconds

Step-2: Insert value into the Partitioned table

insert overwrite table p_patient1 partition(drug='metacin') select patient_id, patient_name, gender, total_amount from patient where drug='metacin';

hive> insert overwrite table p_patient1 partition(drug='metacin') select patient_id, patient_name, gender, total_amount from patient where drug='metacin';
Query ID = geouser_20160824111554_6603e3bf-9b2d-4e57-8ce7-78bb33068a7b
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there’s no reduce operator
Starting Job = job_1472015672745_0005, Tracking URL = http://geouser:8088/proxy/application_1472015672745_0005/
Kill Command = /home/geouser/hadoop-2.7.1/bin/hadoop job  -kill job_1472015672745_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-08-24 11:16:19,439 Stage-1 map = 0%,  reduce = 0%
2016-08-24 11:16:34,999 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.89 sec
MapReduce Total cumulative CPU time: 5 seconds 890 msec
Ended Job = job_1472015672745_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://localhost:9000/user/hive/warehouse/p_patient1/drug=metacin/.hive-staging_hive_2016-08-24_11-15-54_037_2017514360714144410-1/-ext-10000
Loading data to table default.p_patient1 partition (drug=metacin)
Partition default.p_patient1{drug=metacin} stats: [numFiles=1, numRows=2, totalSize=58, rawDataSize=56]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 5.89 sec   HDFS Read: 4644 HDFS Write: 145 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 890 msec
OK
Time taken: 45.299 seconds

Step-3: View the partitioned data

select * from p_patient1;

hive> select * from p_patient1;
OK
10013202    briito      male      800    metacin
10013304    piyush      male      400    metacin
Time taken: 0.147 seconds, Fetched: 2 row(s)
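To confirm which partition directories Hive has created, SHOW PARTITIONS can be used. Since only the metacin partition was loaded above, it should list a single entry (output shown as a comment is what we would expect under that assumption):

```sql
SHOW PARTITIONS p_patient1;
-- drug=metacin
```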

2. Using a new column:

In this type of partitioning, we use a new partition column that is not present in the source table.

Example:

Step-1: Create a Partitioned Hive table

create table p_patient(patient_id int, patient_name string, drug string, gender string, total_amount int) partitioned by (new string);

hive>  create table p_patient(patient_id int, patient_name string, drug string, gender string, total_amount int) partitioned by (new string);
OK
Time taken: 0.232 seconds

Step-2: Insert the value into the table

insert overwrite table p_patient partition(new='metacin') select * from patient where drug='metacin';

hive> insert overwrite table p_patient partition(new='metacin') select * from patient where drug='metacin';
Query ID = geouser_20160824111010_a72a7634-5ef5-4e15-a3a1-48046ab5e569
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there’s no reduce operator
Starting Job = job_1472015672745_0004, Tracking URL = http://geouser:8088/proxy/application_1472015672745_0004/
Kill Command = /home/geouser/hadoop-2.7.1/bin/hadoop job  -kill job_1472015672745_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-08-24 11:10:32,722 Stage-1 map = 0%,  reduce = 0%
2016-08-24 11:10:47,186 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.0 sec
MapReduce Total cumulative CPU time: 6 seconds 0 msec
Ended Job = job_1472015672745_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://localhost:9000/user/hive/warehouse/p_patient/new=metacin/.hive-staging_hive_2016-08-24_11-10-10_899_701911370209935060-1/-ext-10000
Loading data to table default.p_patient partition (new=metacin)
Partition default.p_patient{new=metacin} stats: [numFiles=1, numRows=2, totalSize=74, rawDataSize=72]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 6.0 sec   HDFS Read: 4671 HDFS Write: 159 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 0 msec
OK
Time taken: 40.037 seconds

Step-3: View the partitioned value

select * from p_patient;

hive> select * from p_patient;
OK
10013202    briito        metacin    male      800    metacin
10013304    piyush      metacin    male      400    metacin
Time taken: 0.298 seconds, Fetched: 2 row(s)

Dynamic Partition:

1. The values of the partition columns are only known at EXECUTION TIME.

2. We use dynamic partitioning when loading from an existing table that is not partitioned.

3. We use dynamic partitioning when the values of the partition columns are not known in advance.

4. Usually, dynamic partitioning loads data from a non-partitioned table.

5. Dynamic partitioning takes more time to load data than static partitioning.

6. No WHERE clause is required to limit the rows; Hive routes each row to its partition automatically.

7. We cannot alter dynamic partitions.

8. Dynamic partitioning can be performed on Hive managed tables or external tables.

With dynamic partitioning, the partition column of the partitioned Hive table must be the last column of the existing (source) table, because Hive takes the partition value from the last column of the SELECT list.
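If the source table's columns were in a different order, the SELECT list can reorder them explicitly so that the partition column comes last. A sketch using the original patient table (whose drug column sits in the middle), assuming the dynamic_partition_patient table and the dynamic-partition properties from the steps below are in place:

```sql
-- Reorder the SELECT list so the partition column (drug) comes last.
INSERT INTO TABLE dynamic_partition_patient PARTITION (drug)
SELECT patient_id, patient_name, gender, total_amount, drug
FROM patient;
```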

Example:

Step-1: Create a hive table

create table patient1(patient_id int, patient_name string, gender string, total_amount int, drug string) row format delimited fields terminated by ',' stored as textfile;

Step-2: Load data into the hive table

load data local inpath '/home/geouser/Documents/patient1' into table patient1;

Data Set: (Patient1)

10013201,alaister,male,500,para
10013202,briito,male,800,metacin
10013203,anifa,female,600,crocin
10013204,mohammed,male,999,para
10013304,piyush,male,400,metacin
10013404,manisha,female,200,para
10011204,vijay,male,670,crocin
10013224,marsinga,female,570,crocin

hive> create table patient1(patient_id int, patient_name string, gender string, total_amount int, drug string) row format delimited fields terminated by ',' stored as textfile;
OK
Time taken: 0.972 seconds
hive> load data local inpath '/home/geouser/Documents/patient1' into table patient1;
Loading data to table default.patient1
Table default.patient1 stats: [numFiles=1, totalSize=282]
OK
Time taken: 2.338 seconds

Step-3: Before loading the partitioned table, set the properties that enable dynamic partitioning. In strict mode (the default), Hive rejects an insert in which every partition column is dynamic, so the mode must be set to nonstrict here:

SET hive.exec.dynamic.partition=true;

SET hive.exec.dynamic.partition.mode=nonstrict;

Step-4: Create a partitioned table in hive

create table dynamic_partition_patient (patient_id int,patient_name string, gender string, total_amount int) partitioned by (drug string);

hive> create table dynamic_partition_patient (patient_id int,patient_name string, gender string, total_amount int) partitioned by (drug string);
OK
Time taken: 0.348 seconds

Step-5: Insert value into the partitioned table

insert into table dynamic_partition_patient PARTITION(drug) select * from patient1;

hive> insert into table dynamic_partition_patient PARTITION(drug) select * from patient1;
Query ID = geouser_20160824121550_09ad8cb7-121d-4df6-aa07-620fa1a04269
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there’s no reduce operator
Starting Job = job_1472015672745_0008, Tracking URL = http://geouser:8088/proxy/application_1472015672745_0008/
Kill Command = /home/geouser/hadoop-2.7.1/bin/hadoop job  -kill job_1472015672745_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-08-24 12:16:32,948 Stage-1 map = 0%,  reduce = 0%
2016-08-24 12:16:59,856 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.56 sec
MapReduce Total cumulative CPU time: 4 seconds 560 msec
Ended Job = job_1472015672745_0008
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://localhost:9000/user/hive/warehouse/dynamic_partition_patient/.hive-staging_hive_2016-08-24_12-15-50_006_735085815761966908-1/-ext-10000
Loading data to table default.dynamic_partition_patient partition (drug=null)
Time taken for load dynamic partitions : 1856
Loading partition {drug=crocin}
Loading partition {drug=metacin}
Loading partition {drug=para}
Time taken for adding to write entity : 36
Partition default.dynamic_partition_patient{drug=crocin} stats: [numFiles=1, numRows=3, totalSize=85, rawDataSize=82]
Partition default.dynamic_partition_patient{drug=metacin} stats: [numFiles=1, numRows=2, totalSize=58, rawDataSize=56]
Partition default.dynamic_partition_patient{drug=para} stats: [numFiles=1, numRows=3, totalSize=87, rawDataSize=84]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 4.56 sec   HDFS Read: 4405 HDFS Write: 458 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 560 msec
OK
Time taken: 77.941 seconds

Step-6: View the data of the partitioned table using Hadoop commands

(i). List the partition directories of the table

hadoop fs -ls /user/hive/warehouse/dynamic_partition_patient

geouser@geouser:~$ hadoop fs -ls /user/hive/warehouse/dynamic_partition_patient
16/08/25 12:16:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Found 3 items
drwxr-xr-x   – geouser supergroup          0 2016-08-24 12:17 /user/hive/warehouse/dynamic_partition_patient/drug=crocin
drwxr-xr-x   – geouser supergroup          0 2016-08-24 12:17 /user/hive/warehouse/dynamic_partition_patient/drug=metacin
drwxr-xr-x   – geouser supergroup          0 2016-08-24 12:17 /user/hive/warehouse/dynamic_partition_patient/drug=para

(ii). View the file contents using the cat command. The fields appear run together because Hive's default field delimiter is the non-printing Ctrl-A character.

hadoop fs -cat /user/hive/warehouse/dynamic_partition_patient/drug=crocin/000000_0

hadoop fs -cat /user/hive/warehouse/dynamic_partition_patient/drug=para/000000_0

hadoop fs -cat /user/hive/warehouse/dynamic_partition_patient/drug=metacin/000000_0

geouser@geouser:~$ hadoop fs -cat /user/hive/warehouse/dynamic_partition_patient/drug=crocin/000000_0
16/08/25 12:22:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
10013203anifa   female600
10011204vijay   male670
10013224marsingafemale570
geouser@geouser:~$ hadoop fs -cat /user/hive/warehouse/dynamic_partition_patient/drug=para/000000_0
16/08/25 12:23:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
10013201alaistermale  500
10013204mohammedmale  999
10013404manisha female200
geouser@geouser:~$ hadoop fs -cat /user/hive/warehouse/dynamic_partition_patient/drug=metacin/000000_0
16/08/25 12:23:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
10013202briito  male  800
10013304piyush  male  400

Default Partition:

The default partition in Hive is called bucketing.

Partitioning in Hive offers a way of segregating table data into multiple files/directories. But partitioning gives effective results only when:

1. there are a limited number of partitions, and

2. the partitions are of comparatively equal size.

This may not be possible in all scenarios. For example, when we partition tables by a geographic column such as country, big countries produce large partitions while small countries produce tiny ones, so partitioning is not ideal in these cases.

To overcome the problem of over-partitioning, Hive provides bucketing, another technique for decomposing table data sets into more manageable parts.

Bucketing:

1. Bucketing assigns each record to hash_function(bucketing column) mod (total number of buckets). The hash_function depends on the type of the bucketing column.

2. Records with the same bucketing column value will always be stored in the same bucket.

3. We use the CLUSTERED BY clause to divide the table into buckets.

4. Physically, each bucket is just a file in the table directory, and bucket numbering (as used by TABLESAMPLE) is 1-based.

5. Bucketing can be done along with partitioning on Hive tables, or without partitioning.

6. Bucketed tables create almost equally distributed data file parts.
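Point 1 above can be previewed directly in HiveQL using the built-in hash() and pmod() functions. This is illustrative only: it assumes the 4-bucket table on drug created below, and treats hash() as matching Hive's internal bucketing hash for this column type, which holds for common primitives but should be taken as an assumption:

```sql
-- Preview which of 4 buckets each distinct drug value would hash into.
SELECT DISTINCT drug, pmod(hash(drug), 4) AS bucket
FROM patient;
```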

Example:

Step-1: Create a Hive table

create table patient(patient_id int, patient_name string, drug string, gender string, total_amount int) row format delimited fields terminated by ',' stored as textfile;

Step-2: Load data into the Hive table

load data local inpath '/home/geouser/Documents/patient' into table patient;

Step-3: Set this property before populating the bucketed table, so that Hive enforces the declared number of buckets

set hive.enforce.bucketing = true;

Step-4: Create a bucketed table in Hive

create table bucket_patient(patient_id int, patient_name string, drug string,gender string, total_amount int) clustered by (drug) into 4 buckets;

hive> create table bucket_patient(patient_id int, patient_name string, drug string, gender string, total_amount int) clustered by (drug) into 4 buckets;
OK
Time taken: 1.283 seconds

Step-5: Insert data into the bucketed table

insert overwrite table bucket_patient select * from patient;

hive> insert overwrite table bucket_patient select * from patient;
Query ID = geouser_20160824112021_33ee4b7d-d424-4662-9061-bcefeebd05d2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1472015672745_0006, Tracking URL = http://geouser:8088/proxy/application_1472015672745_0006/
Kill Command = /home/geouser/hadoop-2.7.1/bin/hadoop job  -kill job_1472015672745_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2016-08-24 11:20:55,084 Stage-1 map = 0%,  reduce = 0%
2016-08-24 11:21:08,616 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.66 sec
2016-08-24 11:21:51,562 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 9.2 sec
2016-08-24 11:22:24,830 Stage-1 map = 100%,  reduce = 89%, Cumulative CPU 18.69 sec
2016-08-24 11:22:29,080 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 26.08 sec
MapReduce Total cumulative CPU time: 26 seconds 80 msec
Ended Job = job_1472015672745_0006
Loading data to table default.bucket_patient
Table default.bucket_patient stats: [numFiles=4, numRows=8, totalSize=294, rawDataSize=286]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 4   Cumulative CPU: 29.67 sec   HDFS Read: 18822 HDFS Write: 547 SUCCESS
Total MapReduce CPU Time Spent: 29 seconds 670 msec
OK
Time taken: 147.019 seconds

Step-6: View the contents of the first bucket in the bucketed table

select * from bucket_patient TABLESAMPLE(BUCKET 1 OUT OF 4 ON drug);

hive> select * from bucket_patient TABLESAMPLE(BUCKET 1 OUT OF 4 ON drug);
OK
10013224    marsinga    crocin     female    570
10011204    vijay       crocin     male    670
10013404    manisha     para       female    200
10013204    mohammed    para       male      999
10013203    anifa       crocin     female    600
10013201    alaister    para       male      500
Time taken: 0.196 seconds, Fetched: 6 row(s)
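TABLESAMPLE(BUCKET x OUT OF y ON col) returns bucket x of every y buckets. With this 4-bucket table, OUT OF 2 would return two of the four buckets, roughly half the data; a sketch:

```sql
-- Sample buckets 1 and 3 of the 4 (every 2nd bucket, starting at bucket 1).
SELECT * FROM bucket_patient TABLESAMPLE(BUCKET 1 OUT OF 2 ON drug);
```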

Step-7: Sample 10 percent of the bucketed table

select * from bucket_patient TABLESAMPLE(10 percent);

hive> select * from bucket_patient TABLESAMPLE(10 percent);
OK
10013224    marsinga    crocin     female    570
Time taken: 0.208 seconds, Fetched: 1 row(s)

Step-8: View the first five rows using LIMIT

select * from bucket_patient LIMIT 5;

hive> select * from bucket_patient LIMIT 5;
OK
10013224    marsinga    crocin     female    570
10011204    vijay       crocin     male    670
10013404    manisha     para       female    200
10013204    mohammed    para       male      999
10013203    anifa       crocin     female    600
Time taken: 0.155 seconds, Fetched: 5 row(s)