My scrapbook about almost anything I stumble upon in my tech world. If you find anything useful don't forget to give thumbs-up :)

Breaking

Monday, February 6, 2017

Implementing Slowly Changing Dimension in ETL


Dimension is a term in data management and data warehousing that refers to logical groupings of data such as geographical location, customer information, or product information. Slowly Changing Dimensions (SCD) are dimensions that have data that slowly changes.

                    For example, you may have a Dimension in your database that tracks the sales records of your salespersons in different pharmacies. Creating sales reports seems simple enough, until a salesperson is transferred from one pharmacy to another. How do you record such a change in your sales Dimension?

When we need to track change, it is unacceptable to put everything into the fact table or make every dimension time-dependent to deal with these changes. We would quickly talk ourselves back into a full-blown normalized structure with the consequential loss of understand-ability and query performance. Instead, we take advantage of the fact that most dimensions are nearly constant over time. We can preserve the independent dimensional structure with only relatively minor adjustments to contend with the changes. We refer to these nearly constant dimensions as slowly changing dimensions. Since Ralph Kimball first introduced the notion of slowly changing dimensions in 1994, some IT professionals—in a never-ending quest to speak in acronym—have termed them SCDs.

For each attribute in our dimension tables, we must specify a strategy to handle change. In other words, when an attribute value changes in the operational world, how will we respond to the change in our dimensional models? In the following section we'll describe three basic techniques for dealing with attribute changes, along with a couple hybrid approaches. You may decide that you need to employ a combination of these techniques within a single dimension table.

Type 1: Overwrite the Value
With the type 1 response, we merely overwrite the old attribute value in the dimension row, replacing it with the current value. In so doing, the attribute always reflects the most recent assignment.

Type 2: Add a Dimension Row
A type 2 response is the predominant technique for maintaining the historical data when it comes to slowly changing dimensions. A type 2 SCD is a dimension where a new row is created when the value of an attribute changes

Type 3: Add a Dimension Column
While the type 2 response partitions history, it does not allow us to associate the new attribute value with old fact history or vice versa. A type 3 SCD is a dimension where an alternate old column is created when an attribute changes

Type 6: Hybrid Slowly Changing Dimension Techniques
The Type 6 method is one that combines the approaches of types 1, 2 and 3 (1 + 2 + 3 = 6). One possible explanation of the origin of the term was that it was coined by Ralph Kimball during a conversation with Stephen Pace from Kalido but has also been referred to by Tom Haughey. It is not frequently used because it has the potential to complicate end user access, but has some advantages over the other approaches especially when techniques are employed to mitigate the downstream complexity.
The approach is to use a Type 1 slowly changing dimension, but adding an additional pair of date columns to indicate the date range at which a particular row in the dimension applies and a flag to indicate if the record is the current record.

The aim is to move data from the source application system to the Reporting Warehouse for analysis and reporting of a worldwide pharmaceutical major. The code here is responsible for the movement of data between the Operational Data Store (ODS) and the data warehouse.

Implementation Flow:






Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Disclaimer

The postings on this site are my own and don't necessarily represent IBM's or other companies positions, strategies or opinions. All content provided on this blog is for informational purposes and knowledge sharing only.
The owner of this blog makes no representations as to the accuracy or completeness of any information on this site or found by following any link on this site. The owner will not be liable for any errors or omissions in this information nor for the availability of this information. The owner will not be liable for any losses, injuries, or damages from the display or use of his information.