Hive File Format Examples
Hive File Formats:
A file format is the way in which information is stored or encoded in a computer file. In Hive it refers to how records are stored inside the file. As we are dealing with structured data, each record has its own structure, and the way records are encoded in a file defines the file format. File formats differ mainly in data encoding, compression ratio, space usage and disk I/O.
The most commonly used file formats are text file, sequence file, RC (Record Columnar) file and ORC (Optimized Row Columnar) file.
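When a CREATE TABLE statement has no STORED AS clause, Hive falls back to the format named in the hive.default.fileformat property. A minimal sketch of checking and changing that default for the current session is shown here. The table name default_format_demo is hypothetical, and on newer Hive versions managed tables may follow a separate property, so treat this as an illustration rather than a guaranteed result.

```sql
-- Print the format Hive uses when no STORED AS clause is given
SET hive.default.fileformat;

-- Assumption: we want ORC as the default for this session only
SET hive.default.fileformat=ORC;

-- Hypothetical table created without a STORED AS clause
CREATE TABLE default_format_demo (id INT, name STRING);

-- The InputFormat / OutputFormat lines show which format was applied
DESCRIBE FORMATTED default_format_demo;
```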
TextFile Format:
TEXTFILE is a widely used input/output format in Hadoop and is Hive's default storage format. If we define a table as TEXTFILE, it can load CSV (Comma Separated Values) data, data delimited by tabs or spaces, and JSON data. This means the fields in each record should be separated by a comma, space or tab, or each record may be a JSON (JavaScript Object Notation) object, which is read through a JSON SerDe instead of a delimiter.
By default, with the TEXTFILE format each line is treated as one record.
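As a sketch of the other input shapes mentioned above, the two definitions below show a tab-delimited table and a JSON table. The table names text_file_tsv and text_file_json are hypothetical. The JSON variant assumes that each line of the file holds one JSON object and that the hive-hcatalog-core jar providing this SerDe class is available to Hive. Some installations ship a different JSON SerDe class.

```sql
-- Tab-delimited text: same columns, different field separator
CREATE TABLE text_file_tsv (
  id INT, name STRING, age INT, department STRING, location STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- JSON text: one JSON object per line, parsed by a SerDe
-- (assumption: the HCatalog JSON SerDe is on the classpath)
CREATE TABLE text_file_json (
  id INT, name STRING, age INT, department STRING, location STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
```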
Create a text file table by specifying STORED AS TEXTFILE at the end of a CREATE TABLE statement.
(e.g.) create table text_file(id int,name string,age int,department string,location string) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;
-- load a text file into the text_file table
load data local inpath '/home/hduser/txt_file' into table text_file;
-- To view the loaded file, go to the HDFS file browser and open the table directory
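The table directory can also be inspected without leaving the Hive shell. The sketch below assumes the Hive CLI, the default warehouse location /user/hive/warehouse, and that the loaded file kept its original name txt_file. The Location line printed by DESCRIBE FORMATTED gives the real path on a given cluster.

```sql
-- Shows the table's Location, InputFormat, OutputFormat and SerDe
DESCRIBE FORMATTED text_file;

-- List the files in the table directory (assumed default warehouse path)
dfs -ls /user/hive/warehouse/text_file;

-- A TEXTFILE table stores plain text, so the file is readable as-is
dfs -cat /user/hive/warehouse/text_file/txt_file;
```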
SequenceFile Format:
Sequence files are flat files consisting of binary key-value pairs. When Hive converts queries to MapReduce jobs, it decides on the appropriate key-value pairs to be used for a given record. Sequence files are a splittable binary format, and their main use is to combine two or more smaller files into one sequence file.
Create a sequence file table by specifying STORED AS SEQUENCEFILE at the end of a CREATE TABLE statement.
(e.g.) create table sequence_file(id int,name string,age int,department string,location string) row format delimited fields terminated by ',' lines terminated by '\n' stored as sequencefile;
-- insert values into the sequence_file table from the text_file table (LOAD DATA only copies files and does not convert a text file into a sequence file)
insert overwrite table sequence_file select * from text_file;
-- To view the loaded file, go to the HDFS file browser and open the table directory
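Sequence files are often written with block compression, which compresses groups of records together while keeping the file splittable. The following is a sketch under assumptions: it uses the Hadoop 2+ property names, and it assumes the Snappy codec is installed on the cluster. Any other available codec class can be substituted.

```sql
-- Compress the output of queries that write table data
SET hive.exec.compress.output=true;

-- BLOCK compresses batches of records instead of one record at a time
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;

-- Assumption: Snappy is available on the cluster
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Rewrite the rows of text_file as a compressed sequence file
INSERT OVERWRITE TABLE sequence_file SELECT * FROM text_file;
```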
RCFile Format:
RCFILE stands for Record Columnar File. It is another binary file format, and it offers a high compression rate by compressing groups of rows column by column. RCFILE is used when we want to perform operations on multiple rows at a time. RCFILEs are flat files consisting of binary key/value pairs, which makes them very similar to SEQUENCEFILEs. RCFILE first splits the rows of a table into row groups and then stores each row group in a columnar manner.
Create an RC file table by specifying STORED AS RCFILE at the end of a CREATE TABLE statement.
(e.g.) create table rc_file(id int,name string,age int,department string,location string) row format delimited fields terminated by ',' lines terminated by '\n' stored as rcfile;
-- insert values into the rc_file table from the text_file table
insert overwrite table rc_file select * from text_file;
-- To view the loaded file, go to the HDFS file browser and open the table directory
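A columnar layout pays off on queries that touch only a few columns, because the values of the other columns in each row group do not have to be read. The query below is a simple illustration using the rc_file table created above. How much faster it runs than the same query on text_file depends on the data volume and the cluster, so no timings are claimed here.

```sql
-- Only the department column is needed to answer this query
SELECT department, COUNT(*) AS employees
FROM rc_file
GROUP BY department;

-- The same query on the row-oriented text table, for comparison
SELECT department, COUNT(*) AS employees
FROM text_file
GROUP BY department;
```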
ORC File Format:
ORC stands for Optimized Row Columnar, which means it stores data in a more optimized way than the other file formats. ORC can reduce the size of the original data by up to 75%. Because less data has to be read from disk, the speed of data processing also increases. ORC generally shows better performance than the Text, Sequence and RC file formats.
Create an ORC file table by specifying STORED AS ORC at the end of a CREATE TABLE statement.
(e.g.) create table orc_file(id int,name string,age int,department string,location string) row format delimited fields terminated by ',' lines terminated by '\n' stored as orc;
-- insert values into the orc_file table from the text_file table
insert overwrite table orc_file select * from text_file;
-- To view the loaded file, go to the HDFS file browser and open the table directory
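To check the size reduction on our own data, we can set the ORC compression codec explicitly and compare table statistics. This is a sketch: the table name orc_file_snappy is hypothetical, and SNAPPY is just one of the values accepted by the orc.compress table property. On very small inputs the ORC table can even be larger than the text table because of its metadata, so the saving is only visible on realistic data volumes.

```sql
-- ORC table with an explicit compression codec
CREATE TABLE orc_file_snappy (
  id INT, name STRING, age INT, department STRING, location STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");

INSERT OVERWRITE TABLE orc_file_snappy SELECT * FROM text_file;

-- Refresh the statistics (numRows, totalSize) of both tables
ANALYZE TABLE text_file COMPUTE STATISTICS;
ANALYZE TABLE orc_file_snappy COMPUTE STATISTICS;

-- Compare the totalSize values under Table Parameters
DESCRIBE FORMATTED text_file;
DESCRIBE FORMATTED orc_file_snappy;
```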