
Monday, 6 February 2017

Implementing Slowly Changing Dimension in ETL


A dimension, in data management and data warehousing, is a logical grouping of data such as geographical location, customer information, or product information. Slowly Changing Dimensions (SCDs) are dimensions whose data changes slowly.

For example, you may have a dimension in your database that tracks the sales records of your salespersons at different pharmacies. Creating sales reports seems simple enough, until a salesperson is transferred from one pharmacy to another. How do you record such a change in your sales dimension?

When we need to track change, it is unacceptable to put everything into the fact table or make every dimension time-dependent to deal with these changes. We would quickly talk ourselves back into a full-blown normalized structure, with the consequential loss of understandability and query performance. Instead, we take advantage of the fact that most dimensions are nearly constant over time. We can preserve the independent dimensional structure with only relatively minor adjustments to contend with the changes. We refer to these nearly constant dimensions as slowly changing dimensions. Since Ralph Kimball first introduced the notion of slowly changing dimensions in 1994, some IT professionals, in a never-ending quest to speak in acronyms, have termed them SCDs.

For each attribute in our dimension tables, we must specify a strategy to handle change. In other words, when an attribute value changes in the operational world, how will we respond to the change in our dimensional models? In the following sections we'll describe three basic techniques for dealing with attribute changes, along with a couple of hybrid approaches. You may decide that you need to employ a combination of these techniques within a single dimension table.

Type 1: Overwrite the Value
With the type 1 response, we merely overwrite the old attribute value in the dimension row, replacing it with the current value. In so doing, the attribute always reflects the most recent assignment.
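As a minimal sketch (plain Python, with a hypothetical customer dimension held in a dictionary rather than a real warehouse table), a type 1 change simply replaces the stored value:

# Type 1 SCD: overwrite the changed attribute in the existing dimension row.
# The dimension layout and keys below are hypothetical, for illustration only.
dim_customer = {
    101: {"customer_id": "C001", "name": "John Doe", "pharmacy": "Pharmacy A"},
}

def scd_type1_update(dim, surrogate_key, attribute, new_value):
    """Overwrite in place; only the latest value survives, history is lost."""
    dim[surrogate_key][attribute] = new_value

scd_type1_update(dim_customer, 101, "pharmacy", "Pharmacy B")
print(dim_customer[101])  # pharmacy now reads "Pharmacy B"; the old value is gone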

Type 2: Add a Dimension Row
A type 2 response is the predominant technique for maintaining historical data when it comes to slowly changing dimensions. A type 2 SCD is a dimension where a new row is created when the value of an attribute changes.
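A minimal sketch of the type 2 pattern in plain Python; the surrogate keys, column names and dates are illustrative assumptions, not the schema of any particular warehouse:

from datetime import date

# Type 2 SCD: expire the current row and insert a new row with a new surrogate key.
dim_salesperson = [
    {"sk": 1, "emp_id": "E10", "pharmacy": "Pharmacy A",
     "effective_from": date(2015, 1, 1), "effective_to": None, "is_current": True},
]

def scd_type2_change(dim, emp_id, attribute, new_value, change_date):
    current = next(r for r in dim if r["emp_id"] == emp_id and r["is_current"])
    current["effective_to"] = change_date   # close out the existing row
    current["is_current"] = False
    new_row = dict(current,                 # copy the row, then override what changes
                   sk=max(r["sk"] for r in dim) + 1,
                   effective_from=change_date, effective_to=None, is_current=True)
    new_row[attribute] = new_value
    dim.append(new_row)

scd_type2_change(dim_salesperson, "E10", "pharmacy", "Pharmacy B", date(2017, 2, 6))
# The dimension now holds two rows for E10: the expired one and the current one.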

Type 3: Add a Dimension Column
While the type 2 response partitions history, it does not allow us to associate the new attribute value with old fact history, or vice versa. A type 3 SCD is a dimension where an additional column is created to hold the old value when an attribute changes.
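A minimal sketch of the type 3 pattern, assuming a hypothetical previous_pharmacy column sitting alongside the current one:

# Type 3 SCD: keep the prior value in an extra column next to the current value.
dim_row = {"emp_id": "E10", "pharmacy": "Pharmacy A", "previous_pharmacy": None}

def scd_type3_change(row, new_value):
    row["previous_pharmacy"] = row["pharmacy"]  # push the current value into the "old" column
    row["pharmacy"] = new_value                 # then overwrite with the new value

scd_type3_change(dim_row, "Pharmacy B")
# Only one level of history survives: the previous and the current assignment.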

Type 6: Hybrid Slowly Changing Dimension Techniques
The type 6 method combines the approaches of types 1, 2 and 3 (1 + 2 + 3 = 6). One possible explanation of the origin of the term is that it was coined by Ralph Kimball during a conversation with Stephen Pace from Kalido; it has also been attributed to Tom Haughey. It is not frequently used because it has the potential to complicate end-user access, but it has some advantages over the other approaches, especially when techniques are employed to mitigate the downstream complexity.
The approach is to use a type 1 slowly changing dimension, but to add a pair of date columns indicating the date range for which a particular row in the dimension applies, along with a flag that marks whether the record is the current one.
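Putting the pieces together, a type 6 row can carry a type 3 style pair of value columns, the type 2 date range and current flag, and a type 1 style update that keeps the "current value" column in sync on every historical row. A minimal sketch under those assumptions (the schema is illustrative):

from datetime import date

dim = [
    {"sk": 1, "emp_id": "E10",
     "current_pharmacy": "Pharmacy A", "historical_pharmacy": "Pharmacy A",
     "effective_from": date(2015, 1, 1), "effective_to": None, "is_current": True},
]

def scd_type6_change(dim, emp_id, new_value, change_date):
    history = [r for r in dim if r["emp_id"] == emp_id]
    current = next(r for r in history if r["is_current"])
    current["effective_to"], current["is_current"] = change_date, False   # type 2: expire
    dim.append({"sk": max(r["sk"] for r in dim) + 1, "emp_id": emp_id,
                "current_pharmacy": new_value, "historical_pharmacy": new_value,
                "effective_from": change_date, "effective_to": None, "is_current": True})
    for row in history:                    # type 1: refresh the "current" column everywhere
        row["current_pharmacy"] = new_value

scd_type6_change(dim, "E10", "Pharmacy B", date(2017, 2, 6))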

The aim is to move data from the source application system to the reporting warehouse of a worldwide pharmaceutical major, for analysis and reporting. The code here is responsible for the movement of data between the Operational Data Store (ODS) and the data warehouse.

Implementation Flow:







Monday, 9 May 2016

The DW Testing Life Cycle



As with any other piece of software, a DW implementation undergoes the natural cycle of unit testing, system testing, regression testing, integration testing and acceptance testing. However, unlike others, there are no off-the-shelf testing products available for a DW.


Unit testing:

Traditionally this has been the task of the developer. This is white-box testing to ensure the module or component is coded as per agreed-upon design specifications. The developer should focus on the following:

a)  All inbound and outbound directory structures are created properly with appropriate permissions and sufficient disk space. All tables used during the ETL are present with the necessary privileges.

b)  The ETL routines give expected results:
All transformation logic works as designed from source to target
Boundary conditions are satisfied, e.g. check date fields with leap-year dates
Surrogate keys have been generated properly
NULL values have been populated where expected
Rejects have occurred where expected and a log for rejects is created with sufficient details
Error recovery methods work as designed

c) That the data loaded into the target is complete:

All source data that is expected to get loaded into the target actually gets loaded: compare counts between source and target and use data profiling tools (a minimal count check is sketched after this list)
All fields are loaded with full contents, i.e. no data field is truncated while transforming
No duplicates are loaded
Aggregations take place in the target properly
Data integrity constraints are properly taken care of
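
As one concrete illustration of the count and duplicate checks above, the sketch below uses sqlite3 purely for convenience; the table names, connection and business key are assumptions, and in a real project the source and target usually sit in different databases:

import sqlite3

conn = sqlite3.connect("warehouse_test.db")   # hypothetical test database

def count_rows(table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

src_count = count_rows("ods_sales")       # source / ODS side
tgt_count = count_rows("dw_fact_sales")   # target fact table
assert src_count == tgt_count, f"Row count mismatch: {src_count} source vs {tgt_count} target"

# No duplicates on the declared business key (column name is illustrative).
dups = conn.execute(
    "SELECT sale_id, COUNT(*) FROM dw_fact_sales GROUP BY sale_id HAVING COUNT(*) > 1"
).fetchall()
assert not dups, f"Duplicate business keys found: {dups}"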


System testing:

Generally the QA team owns this responsibility. For them the design document is the bible, and the entire set of test cases is directly based upon it. Here we test the functionality of the application, and the testing is mostly black-box. The major challenge here is the preparation of test data: an intelligently designed input dataset can bring out the flaws in the application more quickly. Wherever possible, use production-like data; you may also use data generation tools or customized tools of your own to create test data. We must test all possible combinations of input and specifically check errors and exceptions. An unbiased approach is required to ensure maximum efficiency. Knowledge of the business process is an added advantage, since we must be able to interpret the results functionally and not just code-wise.

The QA team must test for:

Data completeness and correctness: match source-to-target counts and validate the data
Data aggregations: match aggregated data against staging tables and/or the ODS (a minimal check is sketched after this list)
Lookups/transformations are applied correctly as per specifications
Granularity of data is as per specifications
Error logs and audit tables are generated and populated properly
Notifications to IT and/or business are generated in the proper format
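
To illustrate the aggregation check, a small sketch along the same lines (the database, table and column names are assumptions):

import sqlite3

conn = sqlite3.connect("warehouse_test.db")   # hypothetical QA database

# Total of the measure at the staging/ODS grain...
staging_total = conn.execute("SELECT SUM(sale_amount) FROM stg_sales").fetchone()[0]
# ...must equal the total carried by the aggregated target table.
target_total = conn.execute("SELECT SUM(sale_amount) FROM dw_agg_sales").fetchone()[0]

assert staging_total == target_total, (
    f"Aggregation mismatch: staging {staging_total} vs target {target_total}")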



Regression testing:

A DW application is not a one-time solution. It is possibly the best example of an incremental design, where requirements are enhanced and refined quite often based on business needs and feedback. In such a situation it is critical to test that the existing functionality of a DW application is not broken whenever an enhancement is made to it. Generally this is done by running all functional tests for existing code whenever a new piece of code is introduced. However, a better strategy is to preserve earlier test input data and result sets and to run them again; the new results can then be compared against the older ones to ensure proper functionality.
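
One lightweight way to automate that comparison is sketched below; the CSV extracts and their names are assumptions, and the idea is simply to diff the preserved baseline result set against the result set produced by the new code:

import csv

def load_result_set(path):
    """Load an extracted result set as a set of rows, ignoring row order."""
    with open(path, newline="") as f:
        return {tuple(row) for row in csv.reader(f)}

baseline = load_result_set("results_previous_release.csv")  # preserved earlier results
current = load_result_set("results_new_release.csv")        # results of the new run

missing = baseline - current       # rows the new code no longer produces
unexpected = current - baseline    # rows the new code produces that it did not before

if missing or unexpected:
    print(f"Possible regression: {len(missing)} missing rows, {len(unexpected)} unexpected rows")
else:
    print("Existing functionality preserved for this test set")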


Integration testing:

This is done to ensure that the application works from an end-to-end perspective. Here we must consider the compatibility of the DW application with upstream and downstream flows, and we need to ensure data integrity across the flow. Our test strategy should include testing for:

Sequence of jobs to be executed with job dependencies and scheduling
Re-startability of jobs in case of failures (a minimal restartable sequence is sketched after this list)
Generation of error logs
Cleanup scripts for the environment including database
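
To make the re-startability point concrete, here is a minimal sketch of a sequencer that records a checkpoint after every successful job so a failed run can resume from the point of failure; the job names, the run_job.sh launcher and the checkpoint file are all assumptions:

import json
import os
import subprocess

JOBS = ["extract_ods", "load_dim_customer", "load_fact_sales"]  # dependency order
CHECKPOINT = "run_checkpoint.json"

def completed_jobs():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return set(json.load(f))
    return set()

def run_sequence():
    done = completed_jobs()
    for job in JOBS:
        if job in done:
            continue                                      # already succeeded in a prior run
        result = subprocess.run(["./run_job.sh", job])    # hypothetical job launcher
        if result.returncode != 0:
            raise SystemExit(f"{job} failed; fix the issue and re-run to resume from here")
        done.add(job)
        with open(CHECKPOINT, "w") as f:
            json.dump(sorted(done), f)                    # persist the checkpoint

run_sequence()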

This activity is a combined responsibility, and participation of experts from all related applications is a must in order to avoid misinterpretation of results.




Acceptance testing:

This is the most critical part because here the actual users validate your output datasets. They are the best judges to ensure that the application works as expected by them. However, business users may not have deep ETL knowledge, so the development and test teams should be ready to answer questions about the ETL process as it relates to data population. The test team must have sufficient business knowledge to translate the results in business terms. Also, the load window and refresh period for the DW and the views created should be signed off by the users.

Performance testing:

In addition to the above tests, a DW must also go through another phase called performance testing. Any DW application is designed to be scalable and robust; therefore, when it goes into the production environment, it should not cause performance problems. Here, we must test the system with huge volumes of data. We must ensure that the load window is met even under such volumes. This phase should involve the DBA team, ETL experts and others who can review and validate your code for optimization.






Monday, 28 March 2016

10 Scenario-based Interview Questions #1 - DataStage


1. Design a job which converts a single source row into three target rows.
2. Design a job which identifies whether a row is a duplicate in the input.
3. Design a job which fetches the header and footer of the input file.
4. Design a job which segregates unique and duplicate records into different files.
5. Design a job which removes the header from the input file.

6. Design a job which removes the footer from the input file.
7. Design a job which sends a mail if the footer is missing from the input file.
8. Design a job which extracts alternate records from the input file.
9. Design a job which extracts the Nth row from the input file.
10. Design a job which extracts data from two input files and loads them alternately into the target.


For more ->  DataStage Scenario

Friday, 18 March 2016

Data Warehouse Glossary #1



Ad Hoc Query:

A database search that is designed to extract specific information from a database.  It is ad hoc if it is designed at the point of execution, as opposed to being a “canned” report.  Most ad hoc query software uses the Structured Query Language (SQL).

Aggregation:

The process of summarizing or combining data.

Catalog:

A component of a data dictionary that describes and organizes the various aspects of a database such as its folders, dimensions, measures, prompts, functions, queries and other database objects.  It is used to create queries, reports, analyses and cubes.

Cross Tab:

A type of multi-dimensional report that displays values or measures in cells created by the intersection of two or more dimensions in a table format.
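
For example, with pandas (purely illustrative data and column names), a cross tab of sales amount by pharmacy and quarter could be produced like this:

import pandas as pd

sales = pd.DataFrame({
    "pharmacy": ["A", "A", "B", "B", "B"],
    "quarter":  ["Q1", "Q2", "Q1", "Q1", "Q2"],
    "amount":   [100, 150, 80, 120, 90],
})

# The two dimensions form the rows and columns; each cell holds the aggregated measure.
report = pd.crosstab(index=sales["pharmacy"], columns=sales["quarter"],
                     values=sales["amount"], aggfunc="sum")
print(report)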

Dashboard:

A data visualization method and workflow management tool that brings together useful information on a series of screens and/or web pages.  Some of the information that may be contained on a dashboard includes reports, web links, calendars, news, tasks, e-mail, etc.  When incorporated into a DSS or EIS, key performance indicators may be represented as graphics that are linked to various hyperlinks, graphs, tables and other reports.  The dashboard draws its information from multiple sources: applications, office products, databases, the Internet, etc.

Cube:

A multi-dimensional matrix of data that has multiple dimensions (independent variables) and measures (dependent variables) and is created by an Online Analytical Processing (OLAP) system.  Each dimension may be organized into a hierarchy with multiple levels.  The intersection of two or more dimensional categories is referred to as a cell.


Data-based Knowledge:

Factual information used in the decision-making process that is derived from data marts or warehouses using business intelligence tools.  Data warehousing organizes information into a format so that it represents an organization's knowledge with respect to a particular subject area, e.g. finance or clinical outcomes.

Data Cleansing:

The process of cleaning or removing errors, redundancies and inconsistencies in the data that is being imported into a data mart or data warehouse.  It is part of the quality assurance process.

Data Mart:

A database that is similar in structure to a data warehouse, but is typically smaller and is focused on a more limited area.  Multiple, integrated data marts are sometimes referred to as an Integrated Data Warehouse.  Data marts may be used in place of a larger data warehouse or in conjunction with it.  They are typically less expensive to develop and faster to deploy and are therefore becoming more popular with smaller organizations.

Data Migration:

The transfer of data from one platform to another.  This may include conversion from one language, file structure and/or operating environment to another.

Data Mining:

The process of researching data marts and data warehouses to detect specific patterns in the data sets.  Data mining may be performed on databases and multi-dimensional data cubes with ad hoc query tools and OLAP software.  The queries and reports are typically designed to answer specific questions to uncover trends or hidden relationships in the data.

Data Scrubbing:

See Data Cleansing


Data Transformation:

The modification of transaction data extracted from one or more data sources before it is loaded into the data mart or warehouse.  The modifications may include data cleansing, translation of data into a common format so that it can be aggregated and compared, summarizing the data, etc.

Data Warehouse:

An integrated, non-volatile database of historical information that is designed around specific content areas and is used to answer questions regarding an organization's operations and environment.

Database Management System:

The software that is used to create data warehouses and data marts.  For the purposes of data warehousing, they typically include relational database management systems and multi-dimensional database management systems.  Both types of database management systems create the database structures, store and retrieve the data and include various administrative functions.

Decision Support System (DSS):

A set of queries, reports, rule-based analyses, tables and charts that are designed to aid management with their decision-making responsibilities.  These functions are typically “wrapped around” a data mart or data warehouse.  The DSS tends to employ more detailed level data than an EIS.






