Splitting data in Apache Sqoop

Sqoop supports data import from an RDBMS to HDFS and export from HDFS to an RDBMS. For export, Sqoop uses the text input format and line record reader provided by Hadoop. Import is the more interesting direction: Sqoop has its own splitters and input formats that take care of it.

Sqoop splits the data based on the values in a particular column of the table. The user specifies this column with the --split-by argument on the command line. If it is not given, Sqoop looks for a primary key in the table; if no key is found, Sqoop errors out.
The data type of this column plays an important role in deciding which splitter is initialized.

Calculating splits:
Sqoop fires a query against the RDBMS; for MySQL it looks like:
SELECT MIN(split_by_col), MAX(split_by_col) FROM table;

After retrieving the minimum and maximum values of the column, the split size is calculated:

split_size = (max - min) / no_of_mappers

Note: no_of_mappers is specified on the command line via the "-m" option.
How the split size is calculated depends on the split column's data type; the formula above is valid for numeric values.

From min to max, splits are calculated using the split_size.

For example:
Assume: min = 0, max = 400, no_of_mappers = 4
split_size = (400 - 0) / 4 = 100

So, splits: [0, 100), [100, 200), [200, 300), [300, 400]

For the first split the query generated would be:
SELECT * FROM table WHERE split_by_col >= 0 AND split_by_col < 100;

For the last split, the upper bound in the WHERE clause is inclusive:
SELECT * FROM table WHERE split_by_col >= 300 AND split_by_col <= 400;

The number of splits equals the number of mappers, so each mapper gets its own query and reads its share of the data in parallel for a performance boost!
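To make the arithmetic above concrete, here is a minimal sketch in C (not Sqoop's actual code, which lives in its Java splitter classes) of how the numeric split boundaries and the per-mapper WHERE clauses could be generated; the helper name and the hard-coded values are only for illustration:

#include <stdio.h>

/* Print one WHERE clause per mapper for a numeric split column.
 * Mirrors the example above: min = 0, max = 400, 4 mappers. */
static void print_splits(const char *col, long min, long max, int mappers)
{
    long split_size = (max - min) / mappers;  /* split_size = (max - min) / no_of_mappers */

    for (int i = 0; i < mappers; i++) {
        long lo = min + (long)i * split_size;
        int last = (i == mappers - 1);
        long hi = last ? max : lo + split_size;
        /* Every split uses an exclusive upper bound except the last one,
         * which is inclusive so that 'max' itself gets imported. */
        printf("SELECT * FROM table WHERE %s >= %ld AND %s %s %ld;\n",
               col, lo, col, last ? "<=" : "<", hi);
    }
}

int main(void)
{
    print_splits("split_by_col", 0, 400, 4);
    return 0;
}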

Issues
1. If a text column is used for splitting, Sqoop displays this warning:
"If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records."
2. Text splitting is not supported for Unicode character columns.

Installing and running Apache Sqoop

Installing Apache Sqoop is a cakewalk, as long as it is not Sqoop 2!

Pre-requisites:
1. The machine where Sqoop is to be installed should be able to submit Hadoop jobs; it need not be part of the Hadoop cluster.
2. HBase is required only if HBase imports are desired; otherwise it is optional. Sqoop warns about a missing HBase installation, but the warning can be safely ignored.
3. Basic familiarity with Sqoop; see 'Apache Sqoop in a Nutshell' below.

Installation:
1. Download the release: refer to the 'History' section of 'Apache Sqoop in a Nutshell' below.
2. Extract the archive to some location on your machine. Henceforth, we call this directory $SQOOP_HOME.

That is it from installation point of view!

Run:
Here are some commands you can use to check whether Sqoop is working correctly.

A few things to note:
1. The following commands are written for MySQL; change the JDBC connection string appropriately for other data stores. The connection string is provided with the --connect option, as demonstrated below.
2. All commands are run from the $SQOOP_HOME/bin directory.

Sqoop Import:
Import from a table that has a primary key (no split-by column needed):
$ sqoop import --connect 'jdbc:mysql://hostname:port/database' --table table_name

Import from a table, specifying the split-by column and the target directory manually:
$ sqoop import --connect 'jdbc:mysql://hostname:port/database' --table table_name --split-by column_name --target-dir dir_on_hdfs

Import to a delimited file on HDFS, specifying delimiters:
$ sqoop import --connect 'jdbc:mysql://hostname:port/database' --table table_name --split-by column_name --fields-terminated-by '|' --lines-terminated-by '\t'

Import with a free-form query (the query must contain the $CONDITIONS token and a target directory must be given):
$ sqoop import --connect 'jdbc:mysql://hostname:port/database' --query 'select col1, col2 from table_name where col1 = 1 and $CONDITIONS' --split-by column_name --target-dir dir_on_hdfs

Import to a sequence file on HDFS:
$ sqoop import --connect 'jdbc:mysql://hostname:port/database' --table table_name --split-by column_name --as-sequencefile

Sqoop Export:
Simple Export:
$ sqoop export --connect 'jdbc:mysql://hostname:port/database' --table table_name --export-dir input_dir_on_hdfs

Misc:
Directly clicking these links will not work; replace $SQOOP_VERSION first.
User guide (more commands and info): http://sqoop.apache.org/docs/$SQOOP_VERSION/SqoopUserGuide.html
Troubleshooting: http://sqoop.apache.org/docs/$SQOOP_VERSION/SqoopUserGuide.html#_troubleshooting

where SQOOP_VERSION can be '1.4.1-incubating', '1.4.2' or '1.4.3'.

Apache Sqoop in a Nutshell

It is a tool that facilitates efficient bi-directional bulk data transfer between HDFS and RDBMS.

Features:

  • Internally uses JDBC for importing and exporting the data.
  • For use cases that require fast data transfers, direct mode of Sqoop enables the use of bulk copy utilities.
  • Supports various file formats: text, SequenceFile, Avro.
  • Supports Hive and HBase imports.
  • Provides metastore to save jobs.
  • Supports incremental imports (RDBMS to HDFS).
  • Is easily extensible.

Database Systems Supported by Sqoop:

MySQL, being an open-source database, has always been the main focus of the Apache community. The best connector that Sqoop packages is for MySQL.

In all, Sqoop supports the following databases:

  • MySQL (direct mode supported as well)
  • Oracle
  • SQL Server
  • PostgreSQL (direct mode supported as well)
  • DB2
  • HSQLDB
  • A generic JDBC connector that works for any database (with limited functionality)

Third-party extensions:

One of the strong advantages of Sqoop is that it is extensible. There are a number of third-party companies shipping database-specific connectors:

Third-party Sqoop connectors:

  • Teradata (developed by Cloudera)
  • Netezza (developed by Cloudera)
  • Oracle (developed by Quest)
  • Microsoft SQL Server (developed by Microsoft)
  • Microsoft PDW (developed by Microsoft)
  • Couchbase (developed by Couchbase)
  • VoltDB (developed by VoltDB)

History:

Sqoop was initially developed and maintained by Cloudera. It entered the Apache Incubator on 23 July 2011, and since then the Apache committee has managed the releases. While Sqoop was under incubation, the following versions were released:

Releases during Apache Sqoop incubation:

  • Sqoop 1.4.0-incubating (release manager: Bilung Lee)
  • Sqoop 1.4.1-incubating (release manager: Jarek Jarcec Cecho)

In March 2012, Sqoop graduated to a Top Level Project (TLP) in Apache. Releases after that:

Releases as an Apache TLP:

  • Sqoop 1.4.2 (release manager: Abhijeet Gaikwad, mentored by Jarek Jarcec Cecho)

Excellent information about Sqoop's graduation and versions is provided in a blog post by Arvind Prabhakar.

Sqoop 2:

A few limitations in Sqoop led to the experimental development of an entirely new Sqoop 2. The disadvantages and the new design are described in the Sqoop 2 proposal.

The first release in this branch:

  • Sqoop 1.99.1 (release manager: Jarek Jarcec Cecho)

Jarcec proposed that the version name 1.99.1 is apt because it is far from the current stable 1.4 line and close to 2.0. It is the first release in the 2.0 series and will become 2.0 once it is more stable. The proposal was accepted by all the developers who voted.

Developing an ODBC Driver

I have been developing ODBC drivers for a while now, and I thought I would write down the things to understand before implementing a driver. There is not much help on the web, so here we go…

If you are reading this blog, it means you already know what an ODBC driver is; if not, there is plenty of material available. I also assume familiarity with a few frequently used ODBC APIs such as SQLAllocHandle and SQLConnect.

Writing a driver is not that difficult, one reason being that the driver manager (henceforth DM) takes care of a lot of things: basic error handling, maintaining handle states (which keeps track of the APIs called on a particular handle), connecting ODBC applications to the ODBC driver and, most importantly, finding the driver!

Finding the driver? The DM sits between the ODBC application and the driver. It is a generic piece of code that can load and run DBMS-specific driver code. So how does the DM know which driver to pick? When we create a DSN, we link it to our installed ODBC driver. When the application calls the SQLConnect API, it provides this DSN, and the DM knows which driver is meant. All the APIs called before SQLConnect are queued by the DM, since it cannot know which driver's APIs to call while the application has not yet identified the driver (the connection string or DSN name saves us). Once SQLConnect(dsn_name, ..., username, ..., password, ...) is called, the DM knows which driver to load.

Driver Manager:

For Linux, I would suggest installing the unixODBC DM. Windows ships with a driver manager pre-installed, and as of the date I am writing this blog, the Windows DM is the most up to date with respect to the latest ODBC specifications.

All about the handles:

Environment, Connection, Statement and Descriptor are the handles in question. This part is really confusing: what are these handles, how do I represent them internally, and is there any particular specification laid down by ODBC for them? MSDN only reveals that a particular handle should contain diagnostics, state, attributes and so on.

Here is the answer: go ahead and create a struct if you are writing in C, a class if in C++, or any data structure available to you, and put in it the elements you need. For a connection handle, for example: host, port, user, password (I would not store the password), timeout, and a diagnostics data structure (this one has fixed elements defined by the ODBC spec; MSDN will help you out), plus whatever else you think you will need later in the ODBC flow. In my experience this data structure continuously evolves with each new API you implement. The internals of these data structures are never exposed to the DM or the ODBC application, but pointers to them are.
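As a bare-bones illustration, a connection handle in C might look like the following; every field and type name here is an assumption made for this sketch, not something mandated by the ODBC spec (beyond the general idea of carrying state, attributes and diagnostics):

/* Diagnostic record: roughly the shape the ODBC spec describes
 * (SQLSTATE, native error code, message text); the names are ours. */
typedef struct {
    char sqlstate[6];        /* five-character SQLSTATE plus NUL */
    int  native_error;       /* DBMS-specific error code */
    char message[512];       /* human-readable message text */
} DiagRec;

/* The driver's own connection handle: whatever it needs to talk to the
 * DBMS, plus state and diagnostics. It keeps evolving as APIs are added. */
typedef struct {
    char     host[256];
    int      port;
    char     user[128];
    int      login_timeout;  /* e.g. SQL_ATTR_LOGIN_TIMEOUT */
    int      connected;      /* simple state flag */
    DiagRec  diag;           /* last diagnostic, for SQLGetDiagRec */
    void    *env;            /* back-pointer to the owning environment handle */
} DriverDbc;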

When the ODBC application calls the SQLAllocHandle API, the DM's implementation is called first and internally allocates a structure (yes, the DM maintains its own handle structures as well). The DM then calls the driver's SQLAllocHandle, and the driver allocates its own structure. In a nutshell, the ODBC application holds the DM's handle, which internally contains a pointer to the driver's handle. Since the handle memory areas are exposed to the application, there is a chance they get corrupted; as I said, the DM is the savior here, using checksums or some other validation strategy. Worry about this only if you are bypassing the DM!
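A rough sketch of the driver-side SQLAllocHandle that this flow ends up in; DriverEnv and DriverDbc are abridged versions of the hypothetical structures above, and the DM-side bookkeeping is not shown:

#include <stdlib.h>
#include <sql.h>   /* SQLRETURN, SQLHANDLE, SQL_HANDLE_ENV, SQL_HANDLE_DBC, ... */

typedef struct { int odbc_version; } DriverEnv;          /* abridged */
typedef struct { void *env; int connected; } DriverDbc;  /* abridged */

SQLRETURN SQL_API SQLAllocHandle(SQLSMALLINT HandleType,
                                 SQLHANDLE   InputHandle,
                                 SQLHANDLE  *OutputHandlePtr)
{
    switch (HandleType) {
    case SQL_HANDLE_ENV:
        *OutputHandlePtr = calloc(1, sizeof(DriverEnv));
        break;
    case SQL_HANDLE_DBC: {
        DriverDbc *dbc = calloc(1, sizeof(DriverDbc));
        if (dbc)
            dbc->env = InputHandle;   /* remember the parent environment */
        *OutputHandlePtr = dbc;
        break;
    }
    default:                          /* SQL_HANDLE_STMT / SQL_HANDLE_DESC omitted */
        return SQL_ERROR;
    }
    return (*OutputHandlePtr != NULL) ? SQL_SUCCESS : SQL_ERROR;
}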

Implementing APIs:

MSDN has been a great help; it tells you what each API should do. A lot of behaviour is DBMS dependent. For example, SQLConnect internally tries to connect to your DBMS server engine. How it does so depends on the design: to give an open-source example, the Apache Hive ODBC driver internally uses a Thrift client to connect to the Thrift server embedded in the Hive server.
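As a sketch of what the driver side of SQLConnect might look like; the part that actually talks to the DBMS is stubbed out, since it is entirely DBMS dependent, and the DriverDbc structure and dbms_connect helper are hypothetical:

#include <stdlib.h>     /* atoi */
#include <sql.h>        /* SQLRETURN, SQLHDBC, SQLCHAR, SQLSMALLINT, ... */
#include <odbcinst.h>   /* SQLGetPrivateProfileString (ODBC installer API) */

typedef struct { char host[256]; int port; int connected; } DriverDbc;  /* abridged */

/* Placeholder for the DBMS-specific part: sockets, Thrift, wire protocol, ... */
static int dbms_connect(DriverDbc *dbc, const char *user, const char *pass)
{
    (void)dbc; (void)user; (void)pass;
    return 0;  /* pretend the connection succeeded */
}

SQLRETURN SQL_API SQLConnect(SQLHDBC hdbc,
                             SQLCHAR *dsn,  SQLSMALLINT dsn_len,
                             SQLCHAR *user, SQLSMALLINT user_len,
                             SQLCHAR *auth, SQLSMALLINT auth_len)
{
    DriverDbc *dbc = (DriverDbc *)hdbc;
    char port[16] = "0";
    (void)dsn_len; (void)user_len; (void)auth_len;

    /* Pull the connection attributes for this DSN out of odbc.ini. */
    SQLGetPrivateProfileString((const char *)dsn, "Server", "localhost",
                               dbc->host, sizeof(dbc->host), "odbc.ini");
    SQLGetPrivateProfileString((const char *)dsn, "Port", "0",
                               port, sizeof(port), "odbc.ini");
    dbc->port = atoi(port);

    if (dbms_connect(dbc, (const char *)user, (const char *)auth) != 0)
        return SQL_ERROR;   /* a real driver would also post a diagnostic record */

    dbc->connected = 1;
    return SQL_SUCCESS;
}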

Installation and adding DSN:

This depends on the platform; odbcinst.ini and odbc.ini play an important part here.

On Linux systems, the unixODBC documentation explains where these files live: http://www.unixodbc.org/odbcinst.html.
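For a rough idea, the two files on a unixODBC system might contain entries like the following; the driver name, library path and DSN attributes are made up for illustration. A possible odbcinst.ini entry, which registers the driver itself:

[MyDBMS Driver]
Description = Example ODBC driver for MyDBMS
Driver      = /usr/local/lib/libmydbmsodbc.so

And a matching odbc.ini entry, which defines a DSN pointing at that driver:

[MyDBMS_DSN]
Driver   = MyDBMS Driver
Server   = localhost
Port     = 5433
Database = testdb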

On Windows, installation is UI based: copy the DLL to Program Files and then make entries in the system registry (http://msdn.microsoft.com/en-us/library/windows/desktop/ms714818%28v=vs.85%29.aspx and http://msdn.microsoft.com/en-us/library/windows/desktop/ms712603%28v=vs.85%29.aspx). Implement the ConfigDSN API to add, configure and delete DSNs. The SQLWriteDSNToIni and SQLRemoveDSNFromIni APIs can be used to add and remove DSN entries, and the SQLWritePrivateProfileString and SQLGetPrivateProfileString APIs help with reading and writing DSN attributes.
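And here is a minimal sketch of the ConfigDSN side on Windows; it assumes the hypothetical DSN name used above, whereas a real implementation would parse the DSN name and attributes out of the lpszAttributes key-value list instead of hard-coding them:

#include <windows.h>
#include <odbcinst.h>   /* ConfigDSN request codes and installer APIs */

BOOL INSTAPI ConfigDSN(HWND hwndParent, WORD fRequest,
                       LPCSTR lpszDriver, LPCSTR lpszAttributes)
{
    /* Hypothetical DSN; normally parsed from lpszAttributes ("DSN=...\0..."). */
    const char *dsn = "MyDBMS_DSN";
    (void)hwndParent; (void)lpszAttributes;

    switch (fRequest) {
    case ODBC_ADD_DSN:
        /* Register the DSN under the given driver ... */
        if (!SQLWriteDSNToIni(dsn, lpszDriver))
            return FALSE;
        /* ... and store its attributes. */
        return SQLWritePrivateProfileString(dsn, "Server", "localhost", "odbc.ini");
    case ODBC_CONFIG_DSN:
        return SQLWritePrivateProfileString(dsn, "Server", "localhost", "odbc.ini");
    case ODBC_REMOVE_DSN:
        return SQLRemoveDSNFromIni(dsn);
    default:
        return FALSE;
    }
}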

I hope this helps. If I have used a lot of parentheses or exclamation marks, I can't help it!

Suggestions are welcome if you spot any glitches.
