Tuesday, August 11, 2015

Sqoop: Same Conditions, Different Results & Lessons Learned

Apache developers created a very useful tool for migrating data between RDBS and non relational system, explicitly Hadoop FS. It became an essential part of the Ecosystem because there was a very small number of companies that were primarily using such a NoSQL solution. To be honest, the main part of companies that were missing a tool for handling large amounts of data were skeptical about using the new "Big Data" technologies.

I was a part of a very close group of developers who were migrating large amounts of data from relational system to HDFS. It was some time ago but it was definitely the first time someone was doing such a project here.

I'll briefly sum up what we've learned about Sqoop, one of the top projects of Apache. 

- Using implicit and explicit number of mappers matters. Normally Sqoop uses 4 mappers if you do not specify your own requirements. Sqoop was designed as a tool for parallel processing so it really matters how you decide to use it. It is efficient only when you use parallel processing.

- Of course, when you decide to use 4 mappers, do not write it explicitly. It takes processing time. Instead, just leave the -m option out. The results are tremendously different in the same conditions.

- Use split-by option. Sqoop is unable to split data into chunks when your table has no primary key. Exactly this was the worst problem we had. 

Our tables were without any primary key and big part of data inside was full of null values. Sqoop split-by option works in this manner:

1. SELECT MIN( your_split_by_column) FROM your_table 
2. SELECT MAX(your_split_by_column) FROM your_table 
3. finds out boundaries and this number divides by the number of mappers

Your program fails right in the moment when run into your_split_by_column with null values. Actually, it won't fail. It will just skip those rows. And your solution is not correct.

So how to choose your_split_by_column ?

First, you should try to choose the column which contains values that are distributed uniformly. In case the primary key exists, you are done. In other case there is a chance of existence of surrogate key. We assume that (if you do not try to pull data from Siebel) there is a numerical sequence of numbers. In this does not work, try to find date column or integer columns. If there is no integer, simply take the first columns found. (It could be worse.)

- Regulate fetch size. During import your database can let you down due to limit exceeding error. You can avoid that very easily. This problem was noticed while using MySQL but we hadn't the similar problem with Oracle.