Apache Sqoop in a Nutshell

Apache Sqoop is a tool for efficient, bi-directional bulk data transfer between HDFS and relational databases (RDBMS).

Features:

  • Internally uses JDBC for importing and exporting the data.
  • For use cases that require fast data transfers, Sqoop's direct mode enables the use of the database's native bulk copy utilities.
  • Supports various file formats: text, SequenceFile, and Avro.
  • Supports imports into Hive and HBase.
  • Provides metastore to save jobs.
  • Supports incremental imports (RDBMS to HDFS).
  • Is easily extensible.
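
To make these features concrete, here is a minimal Python sketch that only assembles the command lines Sqoop would be invoked with; it does not run anything. The flags (--connect, --table, --target-dir, --direct, --as-textfile, --as-sequencefile, --as-avrodatafile, --incremental, --check-column, --last-value, and the "sqoop job --create" form) are from the Sqoop 1.4.x command-line tool. The helper names, host, table, and paths are hypothetical and exist only for illustration.

```python
import shlex

# Maps a friendly format name to the corresponding Sqoop import flag.
FORMAT_FLAGS = {
    "text": "--as-textfile",
    "sequence": "--as-sequencefile",
    "avro": "--as-avrodatafile",
}


def sqoop_import_args(connect, table, target_dir, *, username=None,
                      direct=False, file_format="text",
                      incremental=None, check_column=None, last_value=None):
    """Build (but do not run) the argument list for a `sqoop import`."""
    if file_format not in FORMAT_FLAGS:
        raise ValueError("unknown file format: %r" % file_format)
    args = ["sqoop", "import",
            "--connect", connect,        # JDBC connect string
            "--table", table,
            "--target-dir", target_dir,  # destination directory in HDFS
            FORMAT_FLAGS[file_format]]
    if username:
        args += ["--username", username]
    if direct:
        # Direct mode: use the database's bulk copy utility where supported.
        args.append("--direct")
    if incremental:
        # Incremental import: only rows newer than last_value are fetched.
        if incremental not in ("append", "lastmodified"):
            raise ValueError("incremental must be 'append' or 'lastmodified'")
        if check_column is None or last_value is None:
            raise ValueError("incremental imports need check_column and last_value")
        args += ["--incremental", incremental,
                 "--check-column", check_column,
                 "--last-value", str(last_value)]
    return args


def saved_job_args(job_name, tool_args):
    """Wrap an import command as a saved job: sqoop job --create NAME -- import ..."""
    # tool_args[0] is "sqoop"; the saved job stores everything after it.
    return ["sqoop", "job", "--create", job_name, "--"] + tool_args[1:]


# Hypothetical examples: a direct-mode text import and an incremental Avro import.
fast = sqoop_import_args("jdbc:mysql://db.example.com/sales", "orders",
                         "/data/orders", username="etl", direct=True)
delta = sqoop_import_args("jdbc:mysql://db.example.com/sales", "orders",
                          "/data/orders_delta", file_format="avro",
                          incremental="append", check_column="id",
                          last_value=100)
print(shlex.join(fast))
print(shlex.join(saved_job_args("nightly_orders", delta)))
```

Building the argument list separately from running it keeps the sketch testable without a Hadoop cluster; in practice the list could be handed to subprocess.run on a machine where Sqoop is installed.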

Database Systems Supported by Sqoop:

As an open source database, MySQL has always been the main focus of the Apache community, and the best connector that Sqoop packages is the one for MySQL.

In all, Sqoop supports the following databases:

  • MySQL (direct mode support as well)
  • Oracle
  • Microsoft SQL Server
  • PostgreSQL (direct mode support as well)
  • DB2
  • HSQLDB
  • Any other database, through a generic JDBC connector (with limited functionality)
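
As a rough illustration of how a database is selected, Sqoop is pointed at it through the JDBC connect string, and passing an explicit --driver class makes it fall back to the generic JDBC path. The Python sketch below builds those arguments under some assumptions: the URL shapes and ports are the conventional defaults of each vendor's JDBC driver, the Oracle form uses a SID, and the host names, database names, and the driver class in the last example are hypothetical.

```python
# Conventional JDBC URL shapes; the ports below are each vendor's usual default.
JDBC_URL_TEMPLATES = {
    "mysql": "jdbc:mysql://{host}:{port}/{database}",
    "postgresql": "jdbc:postgresql://{host}:{port}/{database}",
    "oracle": "jdbc:oracle:thin:@{host}:{port}:{database}",  # database = SID
    "sqlserver": "jdbc:sqlserver://{host}:{port};databaseName={database}",
    "db2": "jdbc:db2://{host}:{port}/{database}",
}

DEFAULT_PORTS = {
    "mysql": 3306,
    "postgresql": 5432,
    "oracle": 1521,
    "sqlserver": 1433,
    "db2": 50000,
}


def connect_args(kind, host, database, port=None, driver=None):
    """Return the --connect (and optional --driver) arguments for Sqoop."""
    url = JDBC_URL_TEMPLATES[kind].format(
        host=host, port=port or DEFAULT_PORTS[kind], database=database)
    args = ["--connect", url]
    if driver:
        # An explicit driver class selects the generic JDBC code path.
        args += ["--driver", driver]
    return args


# Hypothetical hosts and databases, for illustration only.
print(connect_args("mysql", "db.example.com", "sales"))
print(connect_args("sqlserver", "mssql.example.com", "sales"))
# Hypothetical driver class, standing in for any JDBC-compliant database.
print(connect_args("postgresql", "pg.example.com", "sales",
                   driver="com.example.jdbc.Driver"))
```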

Third-party extensions:

One of Sqoop's major strengths is its extensibility. A number of third-party companies ship database-specific connectors:

Third-party Sqoop Connectors

RDBMS                 Developed by  Link
Teradata              Cloudera      View
Netezza               Cloudera      View
Oracle                Quest         View
Microsoft SQL Server  Microsoft     View
Microsoft PDW         Microsoft
Couchbase             Couchbase     View
VoltDB                VoltDB        Blog

History:

Sqoop was initially developed and maintained by Cloudera. It entered incubation at Apache on 23 July 2011, and since then the Apache committee has managed the releases. While Sqoop was under incubation, the following versions were released:

Releases during Apache Sqoop incubation

Version                 Download          Docs              Release Manager
Sqoop-1.4.0-incubating  1.4.0-incubating  1.4.0-incubating  Bilung Lee
Sqoop-1.4.1-incubating  1.4.1-incubating  1.4.1-incubating  Jarek Jarcec Cecho

In March 2012, Sqoop graduated to a Top Level Project (TLP) in Apache. Releases after that:

Releases during Apache Sqoop as TLP

Version      Download  Docs   Release Manager
Sqoop-1.4.2  1.4.2     1.4.2  Abhijeet Gaikwad (mentored by Jarek Jarcec Cecho)

An excellent overview of Sqoop's graduation and versions is provided in this blog post by Arvind Prabhakar.

Sqoop 2:

A few limitations in Sqoop led to the experimental development of an entirely new Sqoop 2. The limitations and the new design are described here.

The first release in this branch:

Releases during Apache Sqoop as TLP

Version       Download  Docs    Release Manager
Sqoop-1.99.1  1.99.1    1.99.1  Jarek Jarcec Cecho

Jarcec proposed the version name 1.99.1 because it is well away from the current stable 1.4 line and close to 2.0. It is the first release in the 2.0 series, and the version will move to 2.0 once the code is more stable. The proposal was accepted by all developers who voted.
