
Friday, 8 September 2017

BI Report Testing Trends


Extending my quite old post (ETL Testing - Trends and Challenge - link) by sharing my thoughts on BI report testing issues and solutions. Feel free to add your views and comments.

Business Intelligence
Business Intelligence (BI) and Data Warehouse (DW) systems allow companies to use their data for analysis, scenario modelling and forecasting, business planning, operations optimization, and financial management and compliance. To build data warehouses, Extract, Transform and Load (ETL) technologies collect data from various sources, transform the data according to business rules and needs, and load it into a destination database.

Consolidating data into a single corporate view enables the gathering of intelligence about your business, pulling from different data sources such as physician, hospital, lab and claims systems.


  • BI Report Testing Trends
Once the ETL part is tested, the data shown on the reports holds utmost importance. The QA team should verify the reported data against the source data for consistency and accuracy.
  • Verify Report data from source (DWH tables/views)
QA should verify the report data (at field/column level) against the source by writing the required SQL themselves, based on the different filter criteria available on the report filter page.
  • Creating SQLs 
Create SQL queries to fetch and verify the data from source and target. Sometimes it is not possible to reproduce the complex transformations done in ETL purely in SQL; in such a case the data can be exported to a file and the calculations performed there. A minimal example query is sketched after this list.
  • GUI & Layout
Verifying the report GUI (selection page) and layout (report output layout).
  • Performance verification
Verifying the report's performance (the response time should be within a predefined limit specified by the business need). The report's performance can also be tested for multiple concurrent users (the number of users expected to access the report at the same time, a limit that should also be defined by the business need).
  • Security verification
Verifying that only authorized users can access the report, or specific parts of it (if those parts should not be available to general users).
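
As mentioned under "Creating SQLs", here is a minimal sketch of a source-to-report check. It assumes a hypothetical claims fact table CLAIM_FACT and a report showing claim totals per provider for a chosen month; the table and column names are illustrative only, not from any specific system.

-- Recreate the report's "total paid per provider" figures from the DWH source,
-- applying the same filter values the tester selected on the report filter page.
SELECT   provider_id,
         COUNT(*)         AS claim_count,
         SUM(paid_amount) AS total_paid
FROM     claim_fact
WHERE    service_month = '2017-08'   -- the month chosen on the report filter page
GROUP BY provider_id
ORDER BY provider_id;
-- Compare these rows, field by field, with the values displayed on the report.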






Tuesday, 10 May 2016

ETL Data Validation



    Organizations typically have “dirty data” that must be cleansed or scrubbed before being loaded into the data warehouse.  In an ideal world, there would not be dirty data.  The data in operational systems would be clean. Unfortunately, this is virtually never the case.  The data in these source systems is the result of poor data quality practices and little can be done about the data that is already there.  While organizations should move toward improving data quality at the source system level, nearly all data warehousing initiatives must cope with dirty data, at least in the short term. There are many reasons for dirty data, including:

Dummy values.  Inappropriate values have been entered into fields.  For example, a customer service representative, in a hurry and not perceiving entering correct data as being particularly important, might enter the store’s ZIP code rather than the customer’s ZIP, or enter 999-99-9999 whenever an SSN is unknown.  The operational system accepts the input, but it is not correct.

Absence of data.  Data was not entered for certain fields. This is not always attributable to lazy data entry habits and the lack of edit checks, but to the fact that different business units may have different needs for certain data values in order to run their operations.  For example, the department that originates mortgage loans may have a federal reporting requirement to capture the sex and ethnicity of a customer, whereas the department that originates consumer loans does not.

Multipurpose fields.  A field is used for multiple purposes; consequently, it does not consistently store the same thing.  This can happen with packaged applications that include fields that are not required to run the application.  Different departments may use the “extra” fields for their own purposes, and as a result, what is stored in the fields is not consistent.

Cryptic data.  It is not clear what data is stored in a field.  The documentation is poor and the attribute name provides little help in understanding the field’s content.  The field may be derived from other fields or the field may have been used for different purposes over the years.

Contradicting data.  The data should be the same but it isn’t.  For example, a customer may have different addresses in different source systems.

Inappropriate use of address lines.  Data has been incorrectly entered into address lines.  Address lines are commonly broken down into, for example, Line 1 for first, middle, and last name, Line 2 for street address, Line 3 for apartment number, and so on.  Data is not always entered into the correct line, which makes it difficult to parse the data for later use.

Violation of business rules.  Some of the values stored in a field are inconsistent with business reality.  For example, a source system may have recorded an adjustable rate mortgage loan where the value of the minimum interest rate is higher than the value of the maximum interest rate. 

Reused primary keys.  A primary key is not unique; it is used with multiple occurrences.  There are many ways that this problem can occur.  For example, assume that a branch bank has a unique identifier (i.e., a primary key).  The branch is closed and the primary key is no longer in use.  But two years later, a new branch is opened, and the old identifier is reused.  The primary key is the same for the old and the new branch.

Non-unique identifiers.  An item of interest, such as a customer, has been assigned multiple identifiers.  For example, in the health care field, it is common for health care providers to assign their own identifier to patients.  This makes it difficult to integrate patient records to provide a comprehensive understanding of a patient’s health care history.

Data integration problems.  The data is difficult or impossible to integrate.  This can be due to non-unique identifiers, or the absence of an appropriate primary key.  To illustrate, for decades customers have been associated with their accounts through a customer name field on the account record.  Integrating multiple customer accounts in this situation can be difficult.  When we examine all the account records that belong to one customer, we find different spellings or abbreviations of the same customer name, sometimes the customer is recorded under an alias or a maiden name, and occasionally two or three customers have a joint account and all of their names are squeezed into one name field.
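
Several of these problems can be detected with simple profiling queries before the load. A minimal sketch, assuming a hypothetical source table CUSTOMER_SRC with ssn, zip_code and customer_id columns (the names are illustrative, not from any particular system):

-- Dummy values: SSNs filled with a placeholder instead of real data.
SELECT COUNT(*) AS dummy_ssn_rows
FROM   customer_src
WHERE  ssn = '999-99-9999';

-- Absence of data: key fields left empty.
SELECT COUNT(*) AS missing_zip_rows
FROM   customer_src
WHERE  zip_code IS NULL OR TRIM(zip_code) = '';

-- Reused or non-unique identifiers: the same customer_id appearing more than once.
SELECT customer_id, COUNT(*) AS occurrences
FROM   customer_src
GROUP  BY customer_id
HAVING COUNT(*) > 1;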





Monday, 9 May 2016

The DW Testing Life Cycle



As with any other piece of software, a DW implementation undergoes the natural cycle of unit testing, system testing, regression testing, integration testing and acceptance testing. However, unlike other software, there are no off-the-shelf testing products available for a DW.


Unit testing:

Traditionally this has been the task of the developer. This is white-box testing to ensure the module or component is coded as per the agreed-upon design specifications. The developer should focus on the following:

a)  All inbound and outbound directory structures are created properly with appropriate permissions and sufficient disk space. All tables used during the ETL are present with the necessary privileges.

b)  The ETL routines give expected results:
All transformation logic works as designed from source to target
Boundary conditions are satisfied, e.g. check date fields for leap-year dates
Surrogate keys have been generated properly
NULL values have been populated where expected
Rejects have occurred where expected and the reject log is created with sufficient detail
Error recovery methods work as expected

c) The data loaded into the target is complete:

All source data that is expected to be loaded into the target actually gets loaded; compare counts between source and target and use data profiling tools
All fields are loaded with their full contents, i.e. no data field is truncated during transformation
No duplicates are loaded
Aggregations take place in the target properly
Data integrity constraints are properly taken care of
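
A minimal sketch of the count and duplicate checks above, assuming a hypothetical source table ORDERS_SRC and target table ORDERS_TGT keyed on order_id (all names are illustrative):

-- Completeness: row counts should reconcile between source and target
-- (after allowing for expected rejects).
SELECT (SELECT COUNT(*) FROM orders_src) AS source_count,
       (SELECT COUNT(*) FROM orders_tgt) AS target_count;
-- (on Oracle add "FROM dual", on DB2 "FROM sysibm.sysdummy1")

-- Duplicates: no business key should be loaded more than once.
SELECT order_id, COUNT(*) AS load_count
FROM   orders_tgt
GROUP  BY order_id
HAVING COUNT(*) > 1;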


System testing:

Generally the QA team owns this responsibility. For them the design document is the bible, and the entire set of test cases is directly based upon it. Here we test for the functionality of the application and mostly it is black-box testing. The major challenge here is the preparation of test data. An intelligently designed input dataset can bring out the flaws in the application more quickly. Wherever possible, use production-like data. You may also use data generation tools or customized tools of your own to create test data. We must test for all possible combinations of input and specifically check the error and exception handling. An unbiased approach is required to ensure maximum efficiency. Knowledge of the business process is an added advantage, since we must be able to interpret the results functionally and not just code-wise.

The QA team must test for:

Data completeness and correctness: match source-to-target counts and validate the data
Data aggregations: match aggregated data against staging tables and/or the ODS
Lookups/transformations are applied correctly as per specifications
Granularity of data is as per specifications
Error logs and audit tables are generated and populated properly
Notifications to IT and/or business are generated in the proper format
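
A minimal sketch of the aggregation check, assuming hypothetical staging (SALES_STG) and fact (SALES_FACT) tables aggregated by day (all names are illustrative):

-- Aggregate the staging data the same way the ETL is supposed to,
-- and compare it against what actually landed in the fact table.
SELECT COALESCE(s.sale_date, f.sale_date) AS sale_date,
       s.total_amount                     AS staging_total,
       f.total_amount                     AS fact_total
FROM  (SELECT sale_date, SUM(amount) AS total_amount
       FROM   sales_stg
       GROUP  BY sale_date) s
FULL OUTER JOIN
      (SELECT sale_date, SUM(amount) AS total_amount
       FROM   sales_fact
       GROUP  BY sale_date) f
  ON   s.sale_date = f.sale_date
WHERE  s.total_amount <> f.total_amount
   OR  s.total_amount IS NULL
   OR  f.total_amount IS NULL;
-- Any row returned is an aggregation mismatch to investigate.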



Regression testing:

A DW application is not a one-time solution. It is possibly the best example of an incremental design, where requirements are enhanced and refined quite often based on business needs and feedback. In such a situation it is very critical to test that the existing functionality of a DW application is not broken whenever an enhancement is made to it. Generally this is done by running all functional tests for the existing code whenever a new piece of code is introduced. However, a better strategy is to preserve earlier test input data and result sets and run them again. The new results can then be compared against the older ones to ensure the existing functionality is intact; a minimal comparison is sketched below.
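
A minimal sketch of that comparison, assuming the previous run's output was preserved in a hypothetical table RESULT_BASELINE and the new run loaded RESULT_CURRENT (names are illustrative; Oracle uses MINUS instead of EXCEPT):

-- Rows the old code produced that are missing from the new run.
SELECT * FROM result_baseline
EXCEPT
SELECT * FROM result_current;

-- Rows the new run produced that the old code never did.
SELECT * FROM result_current
EXCEPT
SELECT * FROM result_baseline;
-- Both queries should return zero rows if existing functionality is unchanged.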


Integration testing:

This is done to ensure that the application developed works from an end-to-end perspective. Here we must consider the compatibility of the DW application with upstream and downstream flows, and we need to ensure data integrity across the flow. Our test strategy should include testing for:

Sequence of jobs to be executed with job dependencies and scheduling
Re-startability of jobs in case of failures
Generation of error logs
Cleanup scripts for the environment including database

This activity is a combined responsibility, and participation of experts from all related applications is a must in order to avoid misinterpretation of results.




Acceptance testing:

This is the most critical part because here the actual users validate your output datasets. They are the best judges to ensure that the application works as they expect. However, business users may not have proper ETL knowledge, so the development and test teams should be ready to answer questions about the ETL process as it relates to data population. The test team must have sufficient business knowledge to translate the results in business terms. The load window, the refresh period for the DW and the views created should also be signed off by the users.
Performance testing:

In addition to the above tests, a DW must necessarily go through another phase called performance testing. Any DW application is designed to be scalable and robust; therefore, when it goes into the production environment, it should not cause performance problems. Here we must test the system with huge volumes of data and ensure that the load window is met even under such volumes. This phase should involve the DBA team, ETL experts and others who can review and validate your code for optimization.
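
When production-scale data is not available, one simple way to build a large test volume is to multiply a small seed extract. A minimal sketch, assuming a hypothetical ORDERS_SEED table and an ORDERS_VOLUME_TEST target (names are illustrative; WITH RECURSIVE is PostgreSQL-style syntax and other databases differ, e.g. SQL Server omits the RECURSIVE keyword and caps recursion depth by default):

-- Multiply the seed extract 1,000 times, re-keying each copy so rows stay unique.
WITH RECURSIVE multiplier (n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM multiplier WHERE n < 1000
)
INSERT INTO orders_volume_test (order_id, customer_id, amount)
SELECT o.order_id * 1000 + m.n,   -- simple re-keying scheme for the copies
       o.customer_id,
       o.amount
FROM   orders_seed o
CROSS JOIN multiplier m;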






Monday, 29 February 2016

Building Test data from Live Data in DataStage



DataStage provides the stages below to generate test data from live (real) data. We are well aware of these stages, but not of this kind of use (I know :-), we aren't).

Think about it !



Head stage: selects the first N records from each partition of an input data set and copies the selected records to an output data set.






Tail stage: selects the last N records from each partition of an input data set and copies the selected records to an output data set.






Sample stage: samples an input data set. Operates in two modes: Percent mode, extracts rows, selecting them by means of a random number generator, and writes a given percentage of these to each output data set; Period mode, extracts every Nth row from each partition, where N is the period which you supply.


Filter stage: transfers, unmodified, the records of the input data set which satisfy the specified requirements and filters out all other records.  You can specify different requirements to route rows down different output links.



External Filter stage: allows you to specify a UNIX command that acts as a filter on the data you are processing. An example would be to use the stage to grep a data set for a certain string or pattern, and discard records which do not contain a match.



Sequential File stage: FILTER OPTION - use this to specify that the data is passed through a filter program before being written to a file or files on output or before being placed in a dataset on input.
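
If you prefer to build a similar test extract directly in the database before it ever reaches a DataStage job, the same ideas map onto plain SQL. A minimal sketch against a hypothetical CLAIMS table (names are illustrative; FETCH FIRST and TABLESAMPLE support and syntax vary by database):

-- "Head"-style extract: just the first 1,000 rows.
SELECT *
FROM   claims
FETCH FIRST 1000 ROWS ONLY;

-- "Sample"-style extract: roughly 5 percent of the rows, chosen at random.
SELECT *
FROM   claims TABLESAMPLE BERNOULLI (5);

-- "Filter"-style extract: only rows matching a business condition.
SELECT *
FROM   claims
WHERE  claim_status = 'DENIED';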







