Tuesday, 31 May 2016

Python Points #13 - Comprehensions

Sunday, 29 May 2016

Python Points - Assignments

Monday, 23 May 2016

Tuning SQL Statement #3


Tuning SQL Statement #1.......
Tuning SQL Statement #2.......

Consider whether a UNION ALL will suffice in place of a UNION.
The UNION operator performs the equivalent of a SELECT DISTINCT on the final result set. In other words, UNION forces all rows returned by each portion of the query to be sorted and merged and duplicates to be filtered before the first row is returned. A UNION ALL simply returns all rows, including duplicates, and does not have to perform any sort, merge or filter. If your tables are mutually exclusive (contain no duplicate records), or you don't care whether duplicates are returned, UNION ALL is much more efficient because it skips the implicit SELECT DISTINCT and the unnecessary resources it consumes.

Do’s    
SELECT acct_num, balance_amt
FROM debit_transactions
WHERE tran_date = '1-JAN-06' 
UNION ALL
SELECT acct_num, balance_amt
FROM credit_transactions
WHERE tran_date = '1-JAN-06' ;  

Don’ts
SELECT acct_num, balance_amt
FROM debit_transactions
WHERE tran_date = '1-JAN-06'
UNION
SELECT acct_num, balance_amt
FROM credit_transactions
WHERE tran_date = '1-JAN-06' ;



The DECODE Function
Using the DECODE function as a replacement for multiple set operators is good for performance, but it should be reserved for extreme cases such as the UNION-joined SQL statements shown below.

Do’s  
SELECT stock_id||' '||
      DECODE(SIGN(qtyonhand)
            ,-1,'Out of Stock',0,'Out of Stock'
            ,1,DECODE(SIGN(qtyonhand-min)
                   ,-1,'Under Stocked',0,'Stocked'
                   ,1,DECODE(sign(qtyonhand-max)
                        ,-1,'Stocked',0,'Stocked'
                          ,1,'Over Stocked'
                   )
            )
      ) FROM stock;  
    

Don’ts
SELECT stock_id||' Out of Stock' FROM stock WHERE
   qtyonhand <=0
UNION
SELECT stock_id||' Under Stocked' FROM stock
      WHERE qtyonhand BETWEEN 1 AND min-1
UNION
SELECT stock_id||' Stocked' FROM stock
WHERE qtyonhand BETWEEN min AND max
UNION
SELECT stock_id||' Over Stocked' FROM stock
      WHERE qtyonhand > max;


    

Datatype Conversions
Avoid applying data type conversion functions to any part of an SQL statement that could otherwise match an index, especially when you are relying on that index for performance; convert the literal rather than the indexed column.

Do’s
SELECT ...
FROM customer
WHERE cust_id = TO_NUMBER('11');

SELECT ...
FROM customer
WHERE cust_id = '11';

Don’ts
SELECT ...
FROM customer
WHERE cust_type = 1;

SELECT ...
FROM emp
WHERE TO_NUMBER(cust_type) = 1;







Thursday, 19 May 2016

5 Tips For Better DataStage Design #13



1. Write the database query so that only the required rows are fetched, and do not extract columns that are not needed.

2. In parallel jobs, a Sequential File stage should not be read with Same partitioning.




3. For huge volumes of data, the Sequential File stage is not a good choice, and it should not be used for intermediate storage between jobs; it degrades job performance.

4. Keep the number of Lookup stages in a job design to a minimum; the Join stage is a good alternative to the Lookup stage.

5. For parallel jobs, use a proper partitioning method for better job performance and accurate flow of data.






Wednesday, 18 May 2016

Windows & Linux - Command Comparison



I am a Linux command-line geek, and when I am working on Windows I miss it, as there is a lot of kung-fu that is easier in a Linux terminal than in any GUI tool. So I thought of creating a list of equivalent Windows and Linux commands.

Let me know if there is any everyday command that I missed here. Hoping this will help you guys.




Command's Purpose | MS-DOS | Linux | Basic Linux Example
Copies files | copy | cp | cp thisfile.txt /home/thisdirectory
Moves files | move | mv | mv thisfile.txt /home/thisdirectory
Lists files | dir | ls | ls
Clears screen | cls | clear | clear
Closes shell prompt | exit | exit | exit
Displays or sets date | date | date | date
Deletes files | del | rm | rm thisfile.txt
"Echoes" output to the screen | echo | echo | echo this message
Edits text files | edit | gedit | gedit thisfile.txt
Compares the contents of files | fc | diff | diff file1 file2
Finds a string of text in a file | find | grep | grep "word or phrase" thisfile.txt
Displays command help | command /? | man or info | man command
Creates a directory | mkdir, md | mkdir | mkdir directory
Deletes an existing directory | rmdir, rd | rmdir | rmdir directory
Views contents of a file | more | less | less thisfile.txt
Renames a file | ren | mv | mv thisfile.txt thatfile.txt
Displays your location in the file system | chdir | pwd | pwd
Changes directories with a specified path (absolute path) | cd pathname | cd pathname | cd /directory/directory
Changes directories with a relative path | cd.. | cd .. | cd ..
Displays the time | time | date | date
Shows amount of RAM in use | mem | free | free
Displays the contents of a file | type | cat | cat filename
Reverse IP lookup | nslookup | nslookup |
Checks the network status | netstat | netstat | netstat
Pings a system | ping | ping | ping 127.0.0.1
Lists running processes | tasklist | ps | ps -ef
Gets the login username | net session | who | who
Traces the route to an IP | tracert | traceroute |
Sets file permissions | attrib | chmod | chmod 644 thisfile.txt
Displays OS version | ver | uname | uname -a
Sorts lines in a file | sort | sort | sort filename
Compares two files | fc | diff | diff file1 file2
OS scheduler | at | crontab | crontab -l
FTP | ftp | ftp | ftp user@host





Tuesday, 17 May 2016

Lookup Stage behaviour



Today, I am gonna ask you a question: what value will I get from a lookup when my datatype is integer (Not Null) and there is no match between source and reference data?

Generally we say NULL, as there is no match between source and reference. But that's not true.
So let's see how DataStage and the Lookup stage behave :-)

When Source and Reference are NULLable -
-       If there is no match between source and reference, we will get NULL in the output.

When Source and Reference are Not-NULLable -
-       If there is no match between source and reference, we will get the DataStage default for that datatype,
        such as 0 for integer and an empty string ('') for varchar, when the data goes out from the Lookup stage.

So, be careful when you plan to filter the data outside the Lookup stage based on a referenced column's value: the field in the output is not null, so the Transformer stage does not receive a NULL (it comes through with the default value, e.g. 0) and cannot handle it the way you expect.

Hoping this adds one more pointer to your learning. Let me know your thoughts in the comments section.





Tuesday, 10 May 2016

ETL Data Validation



    Organizations typically have “dirty data” that must be cleansed or scrubbed before being loaded into the data warehouse.  In an ideal world, there would not be dirty data.  The data in operational systems would be clean. Unfortunately, this is virtually never the case.  The data in these source systems is the result of poor data quality practices and little can be done about the data that is already there.  While organizations should move toward improving data quality at the source system level, nearly all data warehousing initiatives must cope with dirty data, at least in the short term. There are many reasons for dirty data, including:

Dummy values.  Inappropriate values have been entered into fields.  For example, a customer service representative, in a hurry and not perceiving entering correct data as being particularly important, might enter the store’s ZIP code rather than the customer’s ZIP, or enter 999-99-9999 whenever an SSN is unknown.  The operational system accepts the input, but it is not correct.

Absence of data.  Data was not entered for certain fields. This is not always attributable to lazy data entry habits and the lack of edit checks, but to the fact that different business units may have different needs for certain data values in order to run their operations.  For example, the department that originates mortgage loans may have a federal reporting requirement to capture the sex and ethnicity of a customer, whereas the department that originates consumer loans does not.

Multipurpose fields.  A field is used for multiple purposes; consequently, it does not consistently store the same thing.  This can happen with packaged applications that include fields that are not required to run the application.  Different departments may use the “extra” fields for their own purposes, and as a result, what is stored in the fields is not consistent.

Cryptic data.  It is not clear what data is stored in a field.  The documentation is poor and the attribute name provides little help in understanding the field’s content.  The field may be derived from other fields or the field may have been used for different purposes over the years.

Contradicting data.  The data should be the same but it isn’t.  For example, a customer may have different addresses in different source systems.

Inappropriate use of address lines.  Data has been incorrectly entered into address lines.  Address lines are commonly broken down into, for example, Line 1 for first, middle, and last name, Line 2 for street address, Line 3 for apartment number, and so on.  Data is not always entered into the correct line, which makes it difficult to parse the data for later use.

Violation of business rules.  Some of the values stored in a field are inconsistent with business reality.  For example, a source system may have recorded an adjustable rate mortgage loan where the value of the minimum interest rate is higher than the value of the maximum interest rate. 

Reused primary keys.  A primary key is not unique; it is used with multiple occurrences.  There are many ways that this problem can occur.  For example, assume that a branch bank has a unique identifier (i.e., a primary key).  The branch is closed and the primary key is no longer in use.  But two years later, a new branch is opened, and the old identifier is reused.  The primary key is the same for the old and the new branch.

Non-unique identifiers.  An item of interest, such as a customer, has been assigned multiple identifiers.  For example, in the health care field, it is common for health care providers to assign their own identifier to patients.  This makes it difficult to integrate patient records to provide a comprehensive understanding of a patient’s health care history.

Data integration problems.  The data is difficult or impossible to integrate.  This can be due to non-unique identifiers, or the absence of an appropriate primary key.  To illustrate, for decades customers have been associated with their accounts through a customer name field on the account record.  Integrating multiple customer accounts in this situation can be difficult.  When we examine all the account records that belong to one customer, we find different spellings or abbreviations of the same customer name, sometimes the customer is recorded under an alias or a maiden name, and occasionally two or three customers have a joint account and all of their names are squeezed into one name field.
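
To make a few of these issues concrete, the queries below sketch some simple profiling checks that could be run against a staging table before the load. The table cust_stg and its columns are purely illustrative assumptions, not part of any particular system.

-- Dummy values: placeholder SSNs entered just to get past the input screen
SELECT cust_id, ssn
FROM   cust_stg
WHERE  ssn IN ('999-99-9999', '000-00-0000');

-- Absence of data: mandatory fields left empty
SELECT COUNT(*) AS missing_gender
FROM   cust_stg
WHERE  gender IS NULL;

-- Reused or non-unique primary keys: the same key on more than one record
SELECT cust_id, COUNT(*) AS occurrences
FROM   cust_stg
GROUP  BY cust_id
HAVING COUNT(*) > 1;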





Monday, 9 May 2016

The DW Testing Life Cycle



As with any other piece of software, a DW implementation undergoes the natural cycle of unit testing, system testing, regression testing, integration testing and acceptance testing. However, unlike other software, there are no off-the-shelf testing products available for a DW.


Unit testing:

Traditionally this has been the task of the developer. This is a white-box testing to ensure the module or component is coded as per agreed upon design specifications. The developer should focus on the following:

a)  All inbound and outbound directory structures are created properly with appropriate permissions and sufficient disk space. All tables used during the ETL are present with the necessary privileges.

b)  The ETL routines give expected results:
All transformation logics work as designed from source till target
Boundary conditions are satisfied – e.g. check for date fields with leap-year dates
Surrogate keys have been generated properly
NULL values have been populated where expected
Rejects have occurred where expected and log for rejects is created with sufficient details
Error recovery methods work as expected

c) The data loaded into the target is complete (see the example checks after this list):

All source data that is expected to be loaded into the target actually gets loaded – compare counts between source and target and use data profiling tools
All fields are loaded with their full contents – i.e. no data field is truncated while transforming
No duplicates are loaded
Aggregations take place in the target properly
Data integrity constraints are properly taken care of
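
As a minimal sketch of the completeness checks above (the table names src_orders and dw_orders and the key order_id are assumptions for illustration only), the count reconciliation and duplicate check could look like this in Oracle SQL:

-- Row-count reconciliation between source and target
SELECT (SELECT COUNT(*) FROM src_orders) AS source_rows,
       (SELECT COUNT(*) FROM dw_orders)  AS target_rows
FROM   dual;

-- Duplicate check on the target's business key
SELECT order_id, COUNT(*) AS occurrences
FROM   dw_orders
GROUP  BY order_id
HAVING COUNT(*) > 1;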


System testing:

 Generally the QA team owns this responsibility. For them the design document is the bible and the entire set of test cases is directly based upon it. Here we test for the functionality of the application and mostly it is black-box. The major challenge here is preparation of test data. An intelligently designed input dataset can bring out the flaws in the application more quickly. Wherever possible use production-like data. You may also use data generation tools or customized tools of your own to create test data. We must test for all possible combinations of input and specifically check out the errors and exceptions. An unbiased approach is required to ensure maximum efficiency. Knowledge of the business process is an added advantage since we must be able to interpret the results functionally and not just code-wise.

The QA team must test for:

Data completeness and correctness – match source-to-target counts and validate the data
Data aggregations – match aggregated data against staging tables and/or the ODS (see the sketch after this list)
Lookups/transformations are applied correctly as per specifications
Granularity of data is as per specifications
Error logs and audit tables are generated and populated properly
Notifications to IT and/or business are generated in the proper format
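
For the aggregation check, one possible sketch (assuming a hypothetical staging table stg_sales and a daily aggregate fact dw_sales_daily; the names and columns are illustrative) is to recompute the aggregate from staging and flag any date where it disagrees with the stored value:

SELECT s.sale_date,
       SUM(s.amount)       AS staging_total,
       MAX(d.total_amount) AS dw_total
FROM   stg_sales s
JOIN   dw_sales_daily d
  ON   d.sale_date = s.sale_date
GROUP  BY s.sale_date
HAVING SUM(s.amount) <> MAX(d.total_amount);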



Regression testing:

A DW application is not a one-time solution. It is possibly the best example of an incremental design, where requirements are enhanced and refined quite often based on business needs and feedback. In such a situation it is very critical to test that the existing functionality of a DW application is not broken whenever an enhancement is made to it. Generally this is done by running all functional tests for the existing code whenever a new piece of code is introduced. However, a better strategy is to preserve earlier test input data and result sets and run them again, then compare the new results against the older ones to ensure proper functionality.
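
One simple way to implement that comparison, assuming the baseline and current runs are loaded into two tables with identical structure (results_baseline and results_current are illustrative names), is a symmetric MINUS check; both queries should return no rows if the new code has not changed the existing behaviour.

-- Rows present in the baseline run but missing from the current run
SELECT * FROM results_baseline
MINUS
SELECT * FROM results_current;

-- Rows produced by the current run that were not in the baseline
SELECT * FROM results_current
MINUS
SELECT * FROM results_baseline;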


Integration testing:

This is done to ensure that the application works from an end-to-end perspective. Here we must consider the compatibility of the DW application with upstream and downstream flows, and we need to ensure data integrity across the flow. Our test strategy should include testing for:

Sequence of jobs to be executed with job dependencies and scheduling
Re-startability of jobs in case of failures
Generation of error logs
Cleanup scripts for the environment including database

This activity is a combined responsibility, and participation of experts from all related applications is a must in order to avoid misinterpretation of results.




Acceptance testing:

This is the most critical part because here the actual users validate your output datasets. They are the best judges to ensure that the application works as they expect. However, business users may not have deep ETL knowledge, so the development and test teams should be ready to answer questions about how the ETL process populates the data. The test team must have sufficient business knowledge to interpret the results in business terms. Also, the load window, the refresh period for the DW and the views created should be signed off by the users.

Performance testing:

In addition to the above tests, a DW must necessarily go through another phase called performance testing. Any DW application is designed to be scalable and robust; therefore, when it goes into the production environment, it should not cause performance problems. Here, we must test the system with huge volumes of data and ensure that the load window is met even under such volumes. This phase should involve the DBA team, ETL experts and others who can review and validate your code for optimization.






Friday, 6 May 2016

DS Fatal Error: Destination "APT_TRinput0Rec0" is already bound


Fatal Error: Destination "APT_TRinput0Rec0" is already bound - Transformer Stage Error


Solutions:
* Check whether the output stage has identical column names
* Check whether RCP (runtime column propagation) is enabled on the input links

If yes,
rename the target output columns accordingly,
or disable RCP.






Tuesday, 3 May 2016

Otherwise Constraint - A Quick DataStage Recipe


Recipe:

How to use "Otherwise" constraint in Transformer Stage



How To:

To use "Otherwise" constraint in Transformer stage, Order of link is important.
Typically link with "Otherwise" constraint should be last in Transformer stage link order



