Tuesday, 10 May 2016

ETL Data Validation



    Organizations typically have “dirty data” that must be cleansed or scrubbed before being loaded into the data warehouse.  In an ideal world, there would not be dirty data.  The data in operational systems would be clean. Unfortunately, this is virtually never the case.  The data in these source systems is the result of poor data quality practices and little can be done about the data that is already there.  While organizations should move toward improving data quality at the source system level, nearly all data warehousing initiatives must cope with dirty data, at least in the short term. There are many reasons for dirty data, including:

Dummy values.  Inappropriate values have been entered into fields.  For example, a customer service representative, in a hurry and not perceiving entering correct data as being particularly important, might enter the store’s ZIP code rather than the customer’s ZIP, or enters 999-99-9999 whenever a SSN is unknown.  The operational system accepts the input, but it is not correct.

Absence of data.  Data was not entered for certain fields. This is not always attributable to lazy data entry habits and the lack of edit checks, but to the fact that different business units may have different needs for certain data values in order to run their operations.  For example, the department that originates mortgage loans may have a federal reporting requirement to capture the sex and ethnicity of a customer, whereas the department that originates consumer loans does not.

Multipurpose fields.  A field is used for multiple purposes; consequently, it does not consistently store the same thing.  This can happen with packaged applications that include fields that are not required to run the application.  Different departments may use the “extra” fields for their own purposes, and as a result, what is stored in the fields is not consistent.

Cryptic data.  It is not clear what data is stored in a field.  The documentation is poor and the attribute name provides little help in understanding the field’s content.  The field may be derived from other fields or the field may have been used for different purposes over the years.

Contradicting data.  The data should be the same but it isn’t.  For example, a customer may have different addresses in different source systems.

Inappropriate use of address lines.  Data has been incorrectly entered into address lines.  Address lines are commonly broken down into, for example, Line 1 for first, middle, and last name, Line 2 for street address, Line 3 for apartment number, and so on.  Data is not always entered into the correct line, which makes it difficult to parse the data for later use.

Violation of business rules.  Some of the values stored in a field are inconsistent with business reality.  For example, a source system may have recorded an adjustable rate mortgage loan where the value of the minimum interest rate is higher than the value of the maximum interest rate. 

Reused primary keys.  A primary key is not unique; it is used with multiple occurrences.  There are many ways that this problem can occur.  For example, assume that a branch bank has a unique identifier (i.e., a primary key).  The branch is closed and the primary key is no longer in use.  But two years later, a new branch is opened, and the old identifier is reused.  The primary key is the same for the old and the new branch.

Non-unique identifiers.  An item of interest, such as a customer, has been assigned multiple identifiers.  For example, in the health care field, it is common for health care providers to assign their own identifier to patients.  This makes it difficult to integrate patient records to provide a comprehensive understanding of a patient’s health care history.

Data integration problems.  The data is difficult or impossible to integrate.  This can be due to non-unique identifiers, or the absence of an appropriate primary key.  To illustrate, for decades customers have been associated with their accounts through a customer name field on the account record.  Integrating multiple customer accounts in this situation can be difficult.  When we examine all the account records that belong to one customer, we find different spellings or abbreviations of the same customer name, sometimes the customer is recorded under an alias or a maiden name, and occasionally two or three customers have a joint account and all of their names are squeezed into one name field.




Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/