Wednesday, 27 April 2016

Tuning SQL Statement #2


Tuning SQL Statement #1.......
Tuning SQL Statement #3.......


•    The UNION statement performs the equivalent of a SELECT DISTINCT on the final result set. In other words, the UNION clause forces all rows returned by each portion of the UNION to be sorted and merged and duplicate to be filtered before the first row is returned.  A UNION ALL simply returns all rows including duplicates and does not have to perform any sort, merge or filter.  If your tables are mutually exclusive (include no duplicate records), or you don't care if duplicates are returned, the UNION ALL is much more efficient as it does not perform the SELECT DISTINCT function, which saves a lot of unnecessary resources from being used.

Do's-
SELECT acct_num, balance_amt FROM debit_transactions  WHERE tran_date = '1-JAN-06' 
UNION ALL
SELECT acct_num, balance_amt  FROM credit_transactions  WHERE tran_date = '1-JAN-06' ;  

Don'ts-
SELECT acct_num, balance_amt  FROM debit_transactions  WHERE tran_date = '1-JAN-06'
UNION
SELECT acct_num, balance_amt  FROM credit_transactions  WHERE tran_date = '1-JAN-06' ;



•    Try to avoid using any type of data conversion function in any part of an SQL statement which could potentially match an index, especially if you are trying to assist performance by matching appropriate indexes

Do's-
SELECT ...  FROM customer  

WHERE cust_id = TO_NUMBER('11');  

SELECT ...  FROM customer
WHERE cust_type = 1;    


Don'ts-
SELECT ...  FROM customer
WHERE cust_id = '11';

SELECT ...  FROM emp
WHERE TO_NUMBER (cust_type) = 1;



•    When using functions in SQL statements it is best to keep the functions away from any columns involving index matching unless function-based indexes are available and can be created.

Do's-
SELECT account_name, trans_date, amount  FROM transaction
WHERE account_name LIKE 'CAPITAL%';  

Don'ts-
SELECT account_name, trans_date, amount  FROM transaction
WHERE SUBSTR(account_name,1,7) = 'CAPITAL';



•    If your SELECT statement contains a HAVING clause, write your query so that the WHERE clause does most of the work (removing undesired rows) instead of the HAVING clause do the work of removing undesired rows. Using the WHERE clause appropriately can eliminate unnecessary rows before they get to the GROUP BY and HAVING clause, saving some unnecessary work, and boosting performance

Do's-
SELECT country, AVG (population)  FROM survey
WHERE country != 'INDIA'
AND country != 'CHINA';
GROUP BY country;  

Don'ts-
SELECT country, AVG (population)  FROM survey
GROUP BY country
HAVING country != ‘INDIA'
AND country != 'CHINA’;



•    The ORDER BY clause is always executed after the WHERE clause. Try not to override the WHERE clause with the ORDER BY clause because the Optimizer may choose a less efficient method of execution based on the ORDER BY clause. Try to put your where columns in order by clause if possible.

Do's-
SELECT * FROM ordersline
WHERE order_id < 10
ORDER BY order_id, seq#;  

Don'ts-
SELECT * FROM ordersline
WHERE order_id < 10
ORDER BY seq#;







Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Tuesday, 26 April 2016

Tuning SQL Statement #1



•    Try to use explicit columns and try to read columns in index orders if possible, even to the point of reading indexes and not tables.
•    Use a standard approach to table aliases.  If two identical SQL statements vary because an identical table has two different aliases, then the SQL is different and will not be shared.
•    Use table aliases and prefix all column names by their aliases when more than one table is involved in a query.  This reduces parse time AND prevents future syntax errors if someone adds a column to one of the tables with the same name as a column in another table.
•    Use bind variables.  The values of bind variables do not need to be the same for two statements to be considered identical.  Bind variables are not substituted until a statement has been successfully parsed.

i.e -
Do's     - Select FirstNme, MidName, LastName from atul.employee;
Don'ts  - Select * from atul.employee;



•    Try to match comparison condition column sequences with existing index column sequences, although it is not strictly necessary.
•    Always try to use unique, single-column indexes wherever possible. A single-column unique index is much more likely to produce exact hits. An exact hit is the fastest access method.
•    Try to avoid WHERE clauses that are NEGATIVE. Negative key words are in the WHERE clause, such as "IS NULL", "<>", "!=", "!>", "!<", "NOT", "NOT EXISTS", "NOT IN", "NOT LIKE"

i.e -
Do's     - Select FirstNme from atul.employee where DeptCd Like 'IN%';
Don'ts  - Select FirstNme from atul.employee where SubSTR(DeptCd, 1,2) = 'IN';

Do's     - Select Material from atul.store where quantity > 0;
Don'ts  - Select Material from atul.store where quantity <> 0;

Do's     - Select Material from atul.store where quantity > 3;
Don'ts  - Select Material from atul.store where quantity + 3 > 6;



Tuming SELECT Statement Continues...............




Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Monday, 25 April 2016

Suppress Warnings - A Quick DataStage Recipe



Recipe:

How to suppress job warnings in DS job log

HowTo:

Add "rule" in Message Handler

www.datagenx.net

Method:

From DS Director, from Message Handler, add a rule
Select Warning message as example of Message text
Or
Open the job log, select the message you want to suppress
right click and add in Message Handler





Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Wednesday, 20 April 2016

Shell Script Scenario #8 - Odd and Even No



 Write a shell script to find whether an input integer is even or odd. 








Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Tuesday, 19 April 2016

Best book for Linux Shell Scripting - Intermediate User


Today, I am gonna share my secret of Linux Scripting. The book, shared below, is core responsible for my shell scripting technique and ease. I love the tips n tricks, tweaks or many more or you can say I simply love this book :-)

So here it is -- Hoping you all also get benefitted by this.

For Intermediate Linux Shell Scripting users, this book provide all the tips and tricks which we can include in out script to make it work more efficient way.

Mastering Unix Shell Scripting: Bash, Bourne, and Korn Shell Scripting for Programmers, System Administrators, and UNIX Gurus
by Randal K. Michael
Link: http://www.amazon.in/dp/0470183012


== ==




Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Monday, 18 April 2016

10 Scenario based Interview Questions #2 - DataStage



11. Design a job which insert the data if not exists and update if exists in target
12. Design a job which includes a header and footer in output file
13. Design a job which checks whether Currency data is having 'Rs. 9999.99' format or not.
14. Design a job which checks the date passed validation. Date input in YYYYMMDD format.

For more ->  DataStage Scenario

15. Design a job which check the date difference in Hrs and Mins ( HH:MM)
16. Design a job which delete the data if exists in target first and then load.
17. Design a job which check whether each column is having a value or not
18. Design a job which populate the Date dimention table.
19. Design a job which transpose the columns into row with the help of transformer stage.
20. Design a job which remove duplicate without using Remove Duplicate Stage.


For more ->  DataStage Scenario



Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Thursday, 14 April 2016

ETL Code Review Checklist




Guideline
Design jobs for restartability/ if not designed then what is the reason ?
Do not follow Sequential File stage with "Same" partitioning.
check if the APT_CONFIG_FILE parameter is added. This is required to change the number of nodes  during runtime.
Do not hard-code parameters.
ETL Code Review Checklist
Do not hard-code directory paths.
Do not use fork-joins to generate lookup data sets.
Use "Hash" aggregation for limited distinct key values.  Outputs after all rows are read.
Use "Sort" aggregation for large number of distinct key values.  Data must be pre-sorted.  Outputs after each aggregation group.
Use multiple aggregators to reduce collection time when aggregating all rows.  Define a constant key column using row generator.  First aggregator sums in parallel.  Second aggregator sums sequentially.
Make sure sequences are not too long.  Break up into logical units of work.
Is the error handling done properly? It is prefered to propogate errors from lower jobs to the highest level( ex a sequence)
What is the volume of extract data( is there a where clause in the SQL)
Are the correct scripts to clean up datasets after job complete revoked ?
Is there a reject process in place ?
Can we combine or split so we can reduce number of jobs or complexity respectively?
It is not recommended to have an increase in the number of nodes if there are too many stages in the job( this increases the number of processes spun off)
Volume information and growth information for the Lookup/Join tables?
Check if there is a select * in any of the queries. It is not advised to have select * , instead the required columns have to be added in the statement
Check the paritioning and sorting at each stage
When a sequence is used make sure none of the parameters passed are left blank
Check if there are separate jobs for atleast extract, transform and load 
Check if there is annotation for each stage and the job, the job properties should have the author,date etc filled out
Check for naming convention of the jobs, stages and links
Try avoiding peeks in production jobs, peeks are generally used for debug in the development
Make sure the developer has not suppressed many warnings that are valid
Verify that the jobs conform to the Flat File and Dataset naming specification.  This is especially important for cleaning up files and logging errors appropriately.
Verify that all fields are written to the Reject flat files.  This is necessary for debugging and reconciliation.







Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Tuesday, 12 April 2016

Export DataStage job log from Director Client



These steps are not for exporting log for more than 3-5 jobs as these are manual steps and need to repeat for each job (Yes, each job individually) and NO CUSTOMIZATION available in export.

** If want to customized export, use dsjob command in your script.
1. Open the job log view for one of the jobs in Director Client.
2. Choose Project > Print All entries > Full details > Print to file, and then enter a file name in which you want to save the log.




Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Sunday, 10 April 2016

5 Tips For Better DataStage Design #12



1. Minimum number of sort stages should be use in a datastage job. “Don’t sort if previously sorted” in sort Stage, this option should be set this to “true”, which improves the Sort Stage performance. The same Hash key should be used.  In Transformer Stage “Preserve Sort Order” can be used to maintain sort order option.

2. Minimum number of stages should be used in a job; otherwise it affects the performance of the job.
If a job is having more stages then the job should be decomposed into smaller number of small jobs. The use of container is a best way for better visualize and readability. If the existing active stages occupy almost all the CPU resources, the performance can be improved by running multiple parallel copies of the same stage process. This is done by using a share container.





3. Use of minimum of Stage variables in transformer is a good practice. The performance degrades when more stage variables are used.

4. The use of column propagation should be taken care . Columns, which are not needed in the job flow, should not be propagated from one Stage to another and from one job to the next. The best option is to disable the RCP.

5. When there is a need of renaming columns or addition of new columns, use of copy or modify stage is good practice.





Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Monday, 4 April 2016

DataStage Scenario #16 - Cross duplicate Removal


Need to remove duplicate where source or destination can be switched.


Input:


source   destination   distance
city1 city2 500
city2 city1 500
city3 city4 500
city4 city3 500 
city5 city7 700
city7 city5 700



Output:

source   destination   distance
city1 city2 500
city3 city4 500
city5 city7 700





Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://groups.google.com/forum/#!forum/datagenx

Friday, 1 April 2016

jupyter notebook tip #1



How we can start jupyter notebook from a specific directory?

Below, I am sharing a very tiny batch (and linux shell) script which will kick off the jupyter notebook from a particular directory with just a double click...

I hope, this will give you a little ease while working on IPython :-)





Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/