
Import 1 billion records from Oracle to HDFS in record time



The problem:

A large-scale manufacturing organization aggregates data from different sources into a single Oracle table that holds a little over a billion records. A monthly process has to fetch this data from Oracle to HDFS.

The constraint:
Ideally, only each month's delta would be fetched. But there is little to no control over the Oracle data source and no reliable way to identify the delta, so the entire table has to be fetched every time.
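
For context, if the source table had a reliable change-tracking column, Sqoop's incremental import could avoid the full refetch. A minimal sketch, assuming a hypothetical LAST_MODIFIED column that this table does not reliably offer:

# Not applicable here: requires a trustworthy check column on the source table.
$ sqoop import --connect <jdbc_connection_url> --username <user> -P \
    --table <schema>.<table_name> --target-dir <hdfs_target_location> \
    --incremental lastmodified --check-column LAST_MODIFIED \
    --last-value "<last_import_timestamp>"

Since no such column can be trusted here, the full table is imported every month.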

For perspective, exporting the table as a CSV from a SQL client (say, SQL Developer) takes more than 20 hours.

The tool:
Sqoop is the standard tool for importing data from a relational database into HDFS.

The solution:
$ sqoop import -D oracle.row.fetch.size=50000 --fetch-size 15000 --num-mappers 40 --table <schema>.<table_name> --connect <jdbc_connection_url> --username <user> -P --target-dir <hdfs_target_location> --direct --split-by <split_column>

The above command fetched the data in under 5 minutes.
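
As a quick sanity check on the output (a sketch; the path is the same placeholder as above):

# One part file per mapper, so --num-mappers 40 should produce about 40 part files.
$ hdfs dfs -ls <hdfs_target_location>

# Total size of the imported data, human readable.
$ hdfs dfs -du -s -h <hdfs_target_location>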

The explanation:

1. --direct
When --direct is used, Sqoop switches from the standard JDBC driver to OraOop, an Oracle-specific (database-specific) connector, which makes the import significantly faster. This alone brought the time down from around 20 hours to about 4 hours. Note that --direct works only with --table, not with --query.

2. --num-mappers
The default is 4. This controls the degree of parallelism and the output file sizes: increasing it raises the number of concurrent connections to the Oracle database and the number of part files written (each part file correspondingly smaller). Setting it to 40 brought the time down to around 300 seconds. (See the sketch after this list for how the splits are formed.)

3. Fetch size (oracle.row.fetch.size and --fetch-size)
The default for oracle.row.fetch.size is 5000. This dictates how many rows are fetched from Oracle per round trip. Increasing it tenfold gives a further minor performance boost.
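
To tie the three knobs together, here is a minimal sketch (same placeholders as the command above; the split description reflects the generic JDBC path, and the Oracle direct connector may partition differently, for example by ROWID ranges):

# Baseline: plain JDBC path, default of 4 mappers (slow).
$ sqoop import --connect <jdbc_connection_url> --username <user> -P \
    --table <schema>.<table_name> --split-by <split_column> \
    --target-dir <hdfs_target_location>

# Tuned: Oracle direct connector plus more mappers and larger fetch sizes.
# -D is a generic Hadoop argument and must appear right after 'import',
# before the Sqoop-specific options.
$ sqoop import -D oracle.row.fetch.size=50000 --fetch-size 15000 \
    --num-mappers 40 --direct --split-by <split_column> \
    --table <schema>.<table_name> --connect <jdbc_connection_url> \
    --username <user> -P --target-dir <hdfs_target_location>

# How the parallelism works, conceptually, for the generic path:
#   1. Sqoop finds the bounds: SELECT MIN(<split_column>), MAX(<split_column>) FROM <schema>.<table_name>
#   2. It divides that range into --num-mappers slices.
#   3. Each mapper imports one slice and writes its own part file
#      (part-m-00000 ... part-m-00039 for 40 mappers).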

The caveat:
These numbers vary with the kind of workload, so if you adopt these settings, experiment to find the values that work for yours.
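
One way to do that is to time a few runs with different settings and compare. A minimal sketch, assuming you can afford repeated test runs (or point them at a bounded subset of the table):

# Try a few mapper counts and compare wall-clock times.
# -P prompts for the password on every run; --password-file avoids the prompt
# in scripted experiments like this.
$ for m in 10 20 40 80; do
    echo "num-mappers=$m"
    time sqoop import -D oracle.row.fetch.size=50000 --fetch-size 15000 \
      --num-mappers "$m" --direct --split-by <split_column> \
      --table <schema>.<table_name> --connect <jdbc_connection_url> \
      --username <user> -P --target-dir <hdfs_target_location>_mappers_$m
  done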

The intuition:
It may sound technical, but it's actually a commonsense approach. 

An analogy: say you are moving logs from one city to another.
1. Keep all the logs in one place before moving. (--direct supports only a whole table, not a query.)
2. Choose a person from the source city who knows the logs, because they know the best way to carry them. (--direct uses the database-specific connector.)
3. Add more such people so the transfer happens in parallel. (Increasing the number of mappers.)
4. Increase how much each person can carry in one trip. (Increasing the fetch size.)


The conclusion:
Bringing the time down from 20 hours, to 4 hours, to under 5 minutes is a significant improvement to our overall process. If you know of any other Sqoop optimizations in this context, please do share them in the comments.
