SQOOP

SQOOP=> (SQ)L + HAD(OOP)

Relational database systems often store valuable data in a company. If made available, that data can be managed and processed by Apache Hadoop, which is fast becoming the standard for big data processing. Several relational database vendors championed developing integration with Hadoop within one or more of their products.

Transferring data to and from relational databases is challenging and laborious. Because data transfer requires careful handling, Apache Sqoop, short for “SQL to Hadoop,” was created to perform bidirectional data transfer between Hadoop and almost any external structured datastore. Taking advantage of MapReduce, Hadoop’s execution engine, Sqoop performs the transfers in a parallel manner.

Import = Transferring data from your relational database or warehouse system to the Hadoop ecosystem.

Export = Transferring data from your Hadoop ecosystem to the relational database or warehouse system.

Importing Data:

Here we will use a MySQL database with the user root and password password. We will connect to a database named sqoop_ex.
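
If you want to follow along, a minimal setup for the example database might look like this; the id and name columns are only placeholders for whatever schema your own table uses:

# create the example database and a small table to import
mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS sqoop_ex"
mysql -u root -p sqoop_ex -e "CREATE TABLE mysql_table (id INT PRIMARY KEY, name VARCHAR(50))"
mysql -u root -p sqoop_ex -e "INSERT INTO mysql_table VALUES (1, 'alpha'), (2, 'beta')"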

 Sqoop Command>>

sqoop import \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
--password password \
--table mysql_table \
-m 1

Description:

The first parameter after the sqoop executable is import, which specifies the appropriate tool. The import tool is used when you want to transfer data from the relational database into Hadoop.

Next, --connect contains the JDBC URL to your database. The syntax of the URL is specific to each database, so you need to consult your DB manual for the proper format. The URL is followed by two parameters, --username and --password, which are the credentials that Sqoop should use while connecting to the database.

Then --table contains the name of the MySQL table to transfer, and -m specifies the number of mapper tasks that will be launched.
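
Once the job finishes, the rows land in HDFS as delimited text files. Assuming you run the job as the root user, a plain import like this writes to /user/root/mysql_table by default, and with a single mapper there is one part file you can inspect:

# list the files Sqoop created for the table (default location, adjust for your user)
hdfs dfs -ls /user/root/mysql_table
# print the imported rows (with -m 1 there is a single part file)
hdfs dfs -cat /user/root/mysql_table/part-m-00000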

Importing Data into HIVE

This exercise imports data from MySQL into Hive.

Sqoop Command>>

sqoop import \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
--password password \
--table mysql_table \
--hive-import \
-m 1

Description:

The --hive-import parameter is used for importing data into Hive.
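
By default the Hive table keeps the name of the source table, so a quick way to check the import is:

# query the newly created Hive table (name defaults to the source table name)
hive -e "SELECT * FROM mysql_table LIMIT 10;"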

Importing All Tables into HIVE

This exercise imports all the tables in the selected MySQL database into Hive.

Sqoop Command>>

sqoop import-all-tables \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
--password password \
--hive-import \
-m 1

Description:

In the above command, the import-all-tables tool imports all the tables in the MySQL sqoop_ex database into Hive.

The sqoop_imported table mentioned above is a renamed version of the hive_import table in MySQL; naming Hive tables and Hive databases is discussed further below.
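
To confirm that every table made it across, you can simply list what now exists in Hive's default database:

# list all tables Sqoop created in the default Hive database
hive -e "SHOW TABLES;"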

Importing Only a Subset of Data Using a Where Clause

Sqoop Command>>

sqoop import \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
--password password \
--table batch_info \
--where "batch_code=5" \
--hive-import \
-m 1 \
--target-dir '/user/hive/warehouse/imported'

Description:

The --where parameter is used to specify a SQL condition that the imported data should meet.

The --target-dir parameter is used to specify the directory on HDFS where Sqoop should import your data.
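
To see up front exactly which rows the --where condition selects, you can run the equivalent query against MySQL first (batch_info and batch_code are the example table and column used above):

# preview the subset of rows that Sqoop will import
mysql -u root -p sqoop_ex -e "SELECT * FROM batch_info WHERE batch_code = 5"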

Importing Data with Append

Sqoop Command>>

sqoop import \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
--password password \
--table batch_info \
--where "batch_code=6" \
--append \
--hive-import \
-m 1 \
--target-dir '/user/hive/warehouse/imported'

Description

The --append parameter appends the new data to your existing Hive table instead of overwriting it.
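
Because --append adds new part files next to those already in the target directory rather than replacing them, a listing of the directory used above shows the files from both runs:

# each appended import adds new part files to the existing directory
hdfs dfs -ls /user/hive/warehouse/imported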

Importing with a Free-Form Query

Sqoop Command>>

sqoop import \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
--password password \
--query "SELECT * FROM batch_info WHERE course_code = 'H001' AND \$CONDITIONS" \
--hive-import \
--hive-table hive_import_table \
-m 1 \
--target-dir '/user/hive/warehouse/imported'

Description:

Sqoop allows you to specify any query for importing data. Instead of the --table parameter, use the --query parameter with the entire query for obtaining the data you would like to transfer. The query must contain the literal $CONDITIONS placeholder, which Sqoop replaces at run time with the split condition for each mapper; when the query is wrapped in double quotes, escape it as \$CONDITIONS so the shell does not expand it.

Another parameter, --hive-table, changes the name of the table created in the Hive database.
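
If you want more than one mapper with a free-form query, Sqoop also needs a --split-by column to divide the work. A sketch of the same import spread over four mappers, where the split column and the target directory are only illustrative choices:

sqoop import \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
--password password \
--query "SELECT * FROM batch_info WHERE course_code = 'H001' AND \$CONDITIONS" \
--split-by batch_code \
--target-dir '/user/hive/warehouse/imported_split' \
-m 4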

Importing Data into a Hive Database

Sqoop Command>>

sqoop import \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
--password password \
--query "SELECT * FROM batch_info WHERE course_code = 'B001' AND \$CONDITIONS" \
--hive-database sqoop_import \
--hive-import \
--hive-table hive_import_table \
-m 1 \
--target-dir '/user/hive/warehouse/hive_import_db'

Description

Here the --hive-database parameter is used to specify the Hive database into which your data is imported.
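
You can verify the result by addressing the table through its database name:

hive -e "SELECT * FROM sqoop_import.hive_import_table LIMIT 10;"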

Creating a Sqoop Job

Sqoop Command>>

sqoop job \
--create test1 \
-- import \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
--password password \
--table course_view \
--hive-import \
--incremental append \
--check-column course_code \
--last-value 5 \
-m 1 \
--target-dir '/user/hive/warehouse/hive_import_db'

Executing the Sqoop Job

sqoop job --exec test1

Description

A Sqoop job lets you take advantage of the built-in Sqoop metastore, which saves all of the job's parameters for later reuse.

The --incremental parameter in append mode allows you to transfer only the newly created rows.

--check-column indicates the column that should be checked for newly appended data.

--last-value contains the last value that was successfully imported into Hadoop.
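
Saved jobs live in the metastore, so besides --exec you can list, inspect, and delete them:

sqoop job --list            # show all saved jobs
sqoop job --show test1      # print the saved parameters (including the updated --last-value)
sqoop job --delete test1    # remove the job from the metastore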


Other Sqoop Import Options

List all MySQL tables of a particular database

 Sqoop Command>>

sqoop-list-tables \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
--password password

List all MySQL databases

Sqoop Command>>

sqoop-list-databases \
--connect jdbc:mysql://localhost/ \
--username root \
--password password

View the code generated by Sqoop for its MapReduce jobs

Sqoop Command>>

sqoop-codegen \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
--password password \
--table course_master

Importing only particular columns from MySQL

Sqoop Command>>

sqoop import \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
--password password \
--table course_info \
--columns course_code
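
--columns accepts a comma-separated list, so several columns can be pulled at once; course_name below is only a placeholder for whatever other columns your table has:

sqoop import \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
--password password \
--table course_info \
--columns "course_code,course_name"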

Excluding Tables

To exclude tables when using the import-all-tables tool, use the --exclude-tables parameter:

Sqoop Command>>

sqoop import-all-tables \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
--password password \
--exclude-tables cities,countries

Setting the Warehouse Directory

To specify the parent directory for all your Sqoop imports, use the --warehouse-dir parameter instead of --target-dir.

Sqoop Command>>

sqoop import-all-tables \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
--password password \
--exclude-tables cities,countries \
--warehouse-dir /user/hive/warehouse/imported
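
Unlike --target-dir, which names the import directory itself, --warehouse-dir is only the parent: Sqoop creates one subdirectory per table beneath it, which a listing makes visible:

# each imported table gets its own subdirectory under the warehouse dir
hdfs dfs -ls /user/hive/warehouse/imported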

Getting Password From Standard Input

You have two options besides specifying the password on the command line with the --password parameter. The first option is to use the -P parameter, which instructs Sqoop to read the password from standard input.

Sqoop Command>>

sqoop import-all-tables \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
-P \
--exclude-tables cities,countries \
--warehouse-dir /user/hive/warehouse/imported

Alternatively, you can save your password in a file and specify the path to this file with the --password-file parameter.
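
The file should contain only the password, with no trailing newline, and should be readable only by you. A minimal sketch of preparing it (the relative path matches the one used below, and depending on your configuration Sqoop may resolve it against HDFS rather than the local filesystem):

# write the password without a trailing newline and lock down the permissions
echo -n "password" > my-sqoop-password
chmod 400 my-sqoop-password
# if your cluster resolves the path against HDFS, copy the file there as well
hdfs dfs -put my-sqoop-password my-sqoop-password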

sqoop import \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
--password-file my-sqoop-password \
--table mysql_table \
--warehouse-dir /user/hive/warehouse/imported

Using a File Format

SequenceFile:

SequenceFile is a special Hadoop file format used for storing objects that implement the Writable interface. This format was customized for MapReduce, and thus it expects each record to consist of two parts: a key and a value.

Sqoop Command>>

sqoop import \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
--password password \
--table mysql_table \
-m 1 \
--as-sequencefile

Avro File

Apache Avro is a generic data serialization system. Specifying the --as-avrodatafile parameter instructs Sqoop to use its compact and fast binary encoding format. Avro is a very generic system that can store arbitrary data structures.

Sqoop Command>>

sqoop import \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
--password password \
--table mysql_table \
-m 1 \
--as-avrodatafile
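
To peek inside the imported Avro files you can pull one down and decode it with the avro-tools jar; the file path and jar version below are only what a typical setup might look like:

# copy one of the imported Avro files locally and dump its records as JSON
hdfs dfs -get /user/root/mysql_table/part-m-00000.avro .
java -jar avro-tools-1.8.2.jar tojson part-m-00000.avro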

Compressing Imported Data

Sqoop Command>>

sqoop import \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
--password password \
--table mysql_table \
--compress
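
Description:

With --compress alone the output is gzip-compressed. To choose a different codec, pass --compression-codec with the codec class; a sketch using Hadoop's built-in Snappy codec:

sqoop import \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
--password password \
--table mysql_table \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec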

Speeding Up Transfers

Sqoop Command>>

sqoop import \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
--password password \
--table mysql_table \
--direct
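
Description:

The --direct parameter asks Sqoop to bypass JDBC and use the database's native utilities instead (mysqldump in the case of MySQL), which is usually noticeably faster. Direct mode supports fewer features than the JDBC path, so it is not suitable for every table (large binary columns, for example).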

Sqoop Export 

When you are exporting data into an RDBMS, the target table must already exist: Sqoop does not create it for you, so you must set up the table's metadata (schema) beforehand.
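
For example, the target table could be created up front with a schema matching the exported files; the id and name columns are only placeholders:

# create the export target table in MySQL before running sqoop export
mysql -u root -p sqoop_ex -e "CREATE TABLE hive_import_new (id INT, name VARCHAR(50))"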

 Sqoop Command>>

sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop_ex \
--username root \
--password password \
--table hive_import_new \
--export-dir /user/hive/warehouse/imported

Inserting in Batches

Sqoop Command>>

sqoop export \
--connect jdbc:mysql://localhost/sqoop_ex \
--username root \
--password password \
--table hive_import_new \
--export-dir /user/hive/warehouse/hive_import_new \
--batch
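
Description:

The --batch parameter turns on JDBC batching, so Sqoop queues many rows and sends them to the database in batches instead of issuing one INSERT statement per row, which generally makes the export faster.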