Friday, 20 June 2014

Why do we need Staging Area during ETL Load


"We have a simple data warehouse that takes data from a few RDBMS source systems and load the data in dimension and fact tables of the warehouse. I wonder why we have a staging layer in between. Why can’t we process everything on the fly and push them in the data warehouse?"

Really staging area is not a necessity if we can handle it on the fly. But can we? Here are a few reasons why you can’t avoid a staging area:

  1. Source systems are only available for extraction during a specific time slot which is generally lesser than your overall data loading time. It’s a good idea to extract and keep things at your end before you lose the connection to the source systems.
  2. You want to extract data based on some conditions which require you to join two or more different systems together. E.g. you want to only extract those customers who also exist in some other system. You will not be able to perform a SQL query joining two tables from two physically different databases.
  3. Various source systems have different allotted timing for data extraction.
  4. Data warehouse’s data loading frequency does not match with the refresh frequencies of the source systems
  5. Extracted data from the same set of source systems are going to be used in multiple places (data warehouse loading, ODS loading, third-party applications etc.)
  6. ETL process involves complex data transformations that require extra space to temporarily stage the data
  7. There is specific data reconciliation / debugging requirement which warrants the use of staging area for pre, during or post load data validations

Clearly staging area gives lot flexibility during data loading. Shouldn’t we have a separate staging area always then? Is there any impact of having a stage area? Yes there are a few.
  1. Staging area increases latency – that is the time required for a change in the source system to take effect in the data warehouse. In lot of real time / near real time applications, staging area is rather avoided
  2. Data in the staging area occupies extra space
To me, in all practical senses, the benefit of having a staging area outweighs its problems. Hence, in general I will suggest designating a specific staging area in data warehousing projects.

No comments:

Post a Comment