Avro stores both the data definition and the data together in one message or file, making it easy for programs to dynamically interpret the information stored in an Avro file or message.

Avro stores the data definition in JSON format, making it easy to read and interpret, while the data itself is stored in a compact and efficient binary format. Avro files also include markers that can be used to split large data sets into subsets suitable for MapReduce processing.
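As a sketch of what that embedded JSON data definition looks like, here is an illustrative Avro record schema (the field names mirror the product_order_details table created later in this walkthrough; the schema itself is for demonstration, not taken from an actual file):

```python
import json

# An illustrative Avro schema: the JSON "data definition" that Avro
# embeds alongside the binary-encoded records. Field names mirror the
# product_order_details table used later in this walkthrough.
schema = {
    "type": "record",
    "name": "product_order_details",
    "fields": [
        {"name": "user_id", "type": ["null", "int"]},
        {"name": "order_id", "type": ["null", "int"]},
        {"name": "product_id", "type": ["null", "string"]},
        {"name": "order_amount", "type": ["null", "long"]},
    ],
}

# Because the schema is plain JSON, it can be inspected with standard tools.
print(json.dumps(schema, indent=2))
print([f["name"] for f in schema["fields"]])
```

The `["null", "int"]` union types reflect Avro's way of marking a field as nullable, which is how Sqoop typically maps nullable database columns.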


Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases.


Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarize Big Data and makes querying and analysis easy. An open-source project under the Apache Software Foundation, Hive manages vast amounts of structured data using HQL (Hive Query Language), which is similar to SQL.


Create a database in MySQL

create database sqoopdb;


Switch to the newly created database

use sqoopdb;


Create a table in sqoopdb

Create table product_order_details (user_id int, order_id int, product_id varchar(20), order_amount bigint, order_delivery_date varchar(20), user_email_id varchar(20));


Insert values into product_order_details
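The insert statement itself is not shown here; as a sketch, a few sample rows (illustrative values, not from the original) could be inserted like this:

```sql
-- Illustrative sample rows; the values are made up for demonstration.
insert into product_order_details values
  (101, 5001, 'PRD-001', 1500, '2023-01-15', 'a@x.com'),
  (102, 5002, 'PRD-002', 2300, '2023-01-16', 'b@x.com');
```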


Import the table from MySQL to Hive in Avro file format

sqoop import \
--connect jdbc:mysql://localhost/sqoopdb \
--username root \
--password password \
--table product_order_details \
--as-avrodatafile \
--warehouse-dir /user/hive/warehouse/purchase_db \
-m 1



View the content of the Avro file

hdfs dfs -cat /user/hive/warehouse/purchase_db/product_order_details/part-m-00000.avro


Convert the Avro file into an .avsc schema file:


The .avsc file contains the schema of the Avro file. We can create a new Hive table using this schema file.
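Since the .avsc file is just the JSON schema on its own, parsing it shows exactly the columns that a `create table ... stored as avro` statement will pick up. A minimal sketch (the schema text below is an illustrative stand-in for the /tmp/product_order.avsc produced by avro-tools, not its actual contents):

```python
import json

# Illustrative stand-in for the contents of /tmp/product_order.avsc.
avsc_text = """
{
  "type": "record",
  "name": "product_order_details",
  "fields": [
    {"name": "user_id", "type": ["null", "int"]},
    {"name": "order_id", "type": ["null", "int"]},
    {"name": "product_id", "type": ["null", "string"]}
  ]
}
"""

# Parse the schema and list the column names and types Hive would derive.
schema = json.loads(avsc_text)
columns = [(f["name"], f["type"]) for f in schema["fields"]]
for name, ftype in columns:
    print(name, ftype)
```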


Copy the Avro file from HDFS to the local file system

hadoop fs -get /user/hive/warehouse/purchase_db/product_order_details/part-m-00000.avro /tmp/product_order.avro


Extract the schema from the Avro file into an .avsc file

java -jar /usr/lib/avro/avro-tools-1.7.4.jar getschema /tmp/product_order.avro > /tmp/product_order.avsc


Create a new table in Hive using the .avsc schema file:


List the .avsc files in the local /tmp directory

ls -ltr /tmp/*.avsc



Put the .avsc file from the local file system into HDFS

hdfs dfs -put /tmp/product_order.avsc /user/hive/warehouse/



View the content of the AVSC file

hdfs dfs -cat /user/hive/warehouse/product_order.avsc



Create a new Hive table using the .avsc schema file

create external table purchase_details_table stored as avro tblproperties('avro.schema.url'='/user/hive/warehouse/product_order.avsc');


Describe and Select the newly created table

(i) Describe

desc purchase_details_table;

(ii) Select

select * from purchase_details_table;



Create a new Hive table using the .avsc schema file, with its data loaded from the existing Avro files at another location

create external table purchase_details stored as avro location '/user/hive/warehouse/purchase_db/product_order_details' tblproperties('avro.schema.url'='/user/hive/warehouse/product_order.avsc');


Describe and Select the newly created table

(i) Describe

desc purchase_details;

(ii) Select

select * from purchase_details;


Description of the command:

sqoop import -> command for importing

--connect -> option keyword

jdbc:mysql://localhost/sqoopdb -> JDBC connection string for the MySQL database

--username -> option keyword

root -> name of the MySQL user

--password -> option keyword

password -> password for that user

--table -> option keyword

product_order_details -> table to import

--as-avrodatafile -> store the imported data in Avro file format

--warehouse-dir -> option keyword for the parent target directory

/user/hive/warehouse/purchase_db -> HDFS path under which the table is stored; Sqoop creates a product_order_details subdirectory here.

-m 1 -> number of mappers (parallel tasks); 1 produces a single output file