Splitting data in Apache Sqoop

Sqoop supports data import from RDBMS -> HDFS and export HDFS -> RDBMS. For exporting the data, Sqoop uses Text input format and line record reader provided by Hadoop. More interesting is Import. Sqoop has own splitters and input formats that take care of this.

Sqoop performs splits depending on values in a particular column of the table. Sqoop takes this column from User as value specified with –split-by column argument at command line. If not given on command line, Sqoop checks for Primary Key in the table – if no key found Sqoop errors out.
The data type of this column has an important role in deciding what splitter will be initialized.

Calculating splits:
Sqoop fires a query on RDBMS, for MySql which looks like:
select min(split_by_col), max(split_by_col) from table;

After retrieving the min and max value of the column, split size is calculated:

split_size = (max – min) / no_of_mappers

Note: no_of_mappers is specified at command line via option “-m”.
Calculating split size depends on the split column data type, above is valid for numeric values.

From min to max, splits are calculated using the split_size.

For example:
Assume: min = 0, max = 400, no_of_mappers = 4
split_size = (400 – 0)/4 = 100

So, splits: [0, 100), [100, 200), [200, 300), [300, 400]

For the first split the query generated would be:
SELECT * FROM table WHERE split_by_col >= 0 AND split_by_col < 100;

In the last query, upper bound in WHERE clause is inclusive:
SELECT * FROM table WHERE split_by_col >= 300 AND split_by_col < 400;

No. of splits is the no. of mappers, so each mapper has a query: read in parallel for performance boost!

Issues
1. If using text column for text splitting, this is the warning Sqoop displays:
If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
2. Text splitting is not supported for unicode character columns.

One thought on “Splitting data in Apache Sqoop

Leave a comment