
Friday, 11 November 2016

DataStage Partitioning #2


Keyless partition
    Round Robin
    Entire
    Same
    Random

Round Robin
The first record goes to the first processing node, the second to the second processing node, and so on. When DataStage reaches the last processing node in the system, it starts over. This method is useful for resizing partitions of an input data set that are not equal in size. The round robin method always creates approximately equal-sized partitions. This method is the one normally used when DataStage initially partitions data.
Example: Assume 8 nodes are allocated to store the records. The 1st record goes to the first node, the 2nd record to the second node, and so on; the 8th record goes to the eighth node, the 9th record wraps around to the first node, and the cycle repeats.
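As a rough illustration (a minimal Python sketch, not DataStage itself), round-robin assignment simply cycles through the partitions:

# Illustrative only: cycle records across partitions in round-robin order.
def round_robin_partition(records, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)  # wrap after the last node
    return partitions

# 9 records over 8 nodes: the 9th record wraps back to the first node.
print(round_robin_partition(list(range(1, 10)), 8))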

Entire
Send all rows down all partitions.
Example: Assume 8 nodes are allocated; every record is sent to all 8 nodes.

Same
Preserve the same partitioning.
Example: A job has two stages, Sort and Data Set. The Sort stage uses Hash partitioning and the Data Set stage uses Same partitioning, so the Data Set stage preserves the hash partitioning produced by the Sort stage.

Random
DataStage uses a random algorithm to choose where each row goes. The result is that you cannot predict which partition a row will end up in.


Application Execution: Parallel jobs can be executed in two ways
> Sequential
> Parallel
The first part of the figure below shows sequential execution; a parallel job can run on either of two processing architectures, SMP and MPP, shown in the second and third parts of the figure.




Ref - www.ibm.com



Thursday, 10 November 2016

DataStage Partitioning #1


The partitioning mechanism divides data into smaller segments, each of which is then processed independently by a node in parallel. This takes advantage of parallel architectures such as SMP, MPP, grid computing, and clusters.

A partition is logical: partitioning divides memory or mass storage into isolated sections. Memory space is split into many partitions to achieve a high degree of parallelism. In DOS systems, for example, you can partition a disk, and each partition behaves like a separate disk drive.


Note:
In hash partitioning, no fixed amount of space is allocated to a partition in memory; partition space is allocated according to the data.



Why Partition?
•    Ability to run multiple operating systems, or multiple versions of an operating system, on the same server
•    Ability to improve workload balancing and distribution by managing processor allocations across applications and users on the server
•    Ability to leverage hardware models such as "Capacity on Demand" and "Pay as You Grow"

Types of partition
  • Hash
  • Modulus
  • DB2
  • Auto
  • Random
  • Range
  • Round Robin
  • Entire
  • Same

Auto

DataStage inserts partitioners as necessary to ensure correct results, and generally chooses Round Robin or Same. Because DataStage has limited awareness of your data and business rules, the best practice is to explicitly specify the partitioning method required whenever processing depends on groups of related records.

Key based partition
  • Hash
  • Modulus
  • DB2
  • Range

Hash
Determines the partition based on key value(s). One or more keys with different data types are supported. DataStage's internal algorithm applied to the key values determines the partition. All key values are converted to characters before the algorithm is applied.
Example: Key is State. All “CA” rows go into one partition; all “MA” rows go into one partition. Two rows of the same state never go into different partitions.
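A minimal Python sketch of the idea (this is not DataStage's internal hash algorithm): the key value is converted to a character string, hashed, and the hash value selects the partition, so identical key values always land in the same partition.

# Illustrative only: not DataStage's internal hash algorithm.
def hash_partition(rows, key, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        p = hash(str(row[key])) % num_partitions  # same key value -> same partition
        partitions[p].append(row)
    return partitions

rows = [{"state": "CA", "id": 1}, {"state": "MA", "id": 2}, {"state": "CA", "id": 3}]
print(hash_partition(rows, "state", 4))  # all "CA" rows end up together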

Modulus

Partition based on the modulus of the key divided by the number of partitions. The key must be an integer type: partition = MOD(key_value, number of partitions).
Example: Key is OrderNumber (Integer type). Rows with the same order number will all go into the same partition.
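The same idea as a short Python sketch (illustrative only; the column name is hypothetical):

# Illustrative only: modulus partitioning on an integer key.
def modulus_partition(rows, key, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[row[key] % num_partitions].append(row)  # MOD(key_value, num_partitions)
    return partitions

orders = [{"order_no": 1001}, {"order_no": 1002}, {"order_no": 1005}]
print(modulus_partition(orders, "order_no", 4))  # 1001 and 1005 share a partition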

DB2
Matches DB2 EEE partitioning; DB2 published its hashing algorithm and DataStage applies the same one.
Example: This partitioning method is used when loading data into a DB2 table. It takes the partitioning key from the target DB2 table so that records are inserted efficiently. If a partitioning key is defined in the DB2 database it is used; otherwise it defaults to the primary key.

Range
The partition is chosen based on a range map, which maps ranges of key values to specified partitions. This is similar to Hash, but the partition mapping is user-determined and the partitions are ordered. Range partitioning requires processing the data twice (once to build the range map and once to partition), which makes it hard to find a reason for using it.
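A minimal Python sketch of a range map (the boundary values below are made up for illustration):

# Illustrative only: a range map assigns each key to the partition whose range contains it.
import bisect

def range_partition(rows, key, boundaries):
    # boundaries [100, 200] define partitions (-inf, 100), [100, 200), [200, +inf)
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        partitions[bisect.bisect_right(boundaries, row[key])].append(row)
    return partitions

rows = [{"amount": 50}, {"amount": 150}, {"amount": 250}]
print(range_partition(rows, "amount", [100, 200]))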

The figure below gives a clear view of key-based partitioning and repartitioning.



DataStage's parallel technology operates by a divide-and-conquer technique, splitting the largest integration jobs into subsets ("partition parallelism") and flowing these subsets concurrently across all available processors ("pipeline parallelism"). This combination of pipeline and partition parallelism delivers true linear scalability (defined as an increase in performance proportional to the number of processors) and makes hardware the only mitigating factor to performance.

However, downstream processes may need data partitioned differently. Consider a transformation that is based on customer last name, but the enriching needs to occur on zip code - for house-holding purposes - with loading into the warehouse based on customer credit card number (more on parallel database interfaces below). With dynamic data re-partitioning, data is re-partitioned on-the-fly between processes - without landing the data to disk - based on the downstream process data partitioning needs.
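As a rough Python illustration (the field names are hypothetical), dynamic re-partitioning just means the rows flowing out of one stage are redistributed by a different key before the next stage, without being written to disk in between:

# Illustrative only: re-partition rows between stages by a different key.
def partition_by(rows, key, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(str(row[key])) % num_partitions].append(row)
    return parts

rows = [{"last_name": "Smith", "zip": "10001", "card": "4111-0001"},
        {"last_name": "Jones", "zip": "10001", "card": "4222-0002"}]

by_name = partition_by(rows, "last_name", 4)                        # transformation stage
flat = [r for part in by_name for r in part]                        # rows stream on, not landed to disk
by_zip = partition_by(flat, "zip", 4)                               # enrichment stage re-partitions on zip
by_card = partition_by([r for p in by_zip for r in p], "card", 4)   # load stage re-partitions on card number
print([len(p) for p in by_card])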


Ref - www.ibm.com




Tuesday, 7 June 2016

Tuning SQL Statement #4



Using Functions in Queries
When using functions in SQL statements it is best to keep the functions away from any columns involving index matching, unless function-based indexes are available and can be created.

Do’s   
SELECT account_name, trans_date, amount
FROM transaction
WHERE account_name LIKE 'CAPITAL%';    


Don’ts
SELECT account_name, trans_date, amount
FROM transaction
WHERE SUBSTR(account_name,1,7) = 'CAPITAL';




Sequences
A sequence is often used to create unique integer identifiers as primary keys for tables. A sequence is a distinct database object and is accessed as sequence.NEXTVAL and sequence.CURRVAL.
Use sequences for unique integer identifiers; they are centralized, more controllable, more easily maintained, and perform better than other methods of counting.

Do’s  
INSERT INTO supplier (supplier_id, name, ticker)
      VALUES (supplier_seq.NEXTVAL,'A new supplier', 'TICK');    


 Don’ts
INSERT INTO supplier (supplier_id, name, ticker)
      VALUES ((SELECT MAX(supplier_id)+1
      FROM supplier), 'A new supplier', 'TICK');


    

Equi, Anti, and Range
Using an equals sign (equi) is the fastest comparison condition if a unique index exists. Any type of anti comparison such as != or NOT is looking for what is not in a table and thus must read the entire table; sometimes full index scans can be used.

Do’s
SELECT account_name, trans_date, amount
FROM transaction
WHERE amount > 0;


Don’ts
SELECT account_name, trans_date, amount
FROM transaction
WHERE amount != 0;

SELECT account_name, trans_date, amount
FROM transaction
WHERE NOT (amount = 0);



LIKE Pattern Matching
In general, since LIKE will match patterns which are in no way related to indexes, LIKE will usually read the entire table.







Monday, 23 May 2016

Tuning SQL Statement #3


Tuning SQL Statement #1.......
Tuning SQL Statement #2.......

Consider whether a UNION ALL will suffice in place of a UNION.
The UNION statement performs the equivalent of a SELECT DISTINCT on the final result set. In other words, the UNION clause forces all rows returned by each portion of the UNION to be sorted and merged, and duplicates to be filtered, before the first row is returned. A UNION ALL simply returns all rows, including duplicates, and does not have to perform any sort, merge, or filter. If your tables are mutually exclusive (include no duplicate records), or you don't care whether duplicates are returned, UNION ALL is much more efficient because it skips the SELECT DISTINCT step and saves a lot of unnecessary resources.

Do’s    
SELECT acct_num, balance_amt
FROM debit_transactions
WHERE tran_date = '1-JAN-06' 
UNION ALL
SELECT acct_num, balance_amt
FROM credit_transactions
WHERE tran_date = '1-JAN-06' ;  

Don’ts
SELECT acct_num, balance_amt
FROM debit_transactions
WHERE tran_date = '1-JAN-06'
UNION
SELECT acct_num, balance_amt
FROM credit_transactions
WHERE tran_date = '1-JAN-06' ;



The DECODE Function
Using the DECODE function as a replacement for multiple set operators is good for performance, but it should only be used in extreme cases such as the UNION-joined SQL statements shown below.

Do’s  
SELECT stock_id||' '||
      DECODE(SIGN(qtyonhand)
            ,-1,'Out of Stock',0,'Out of Stock'
            ,1,DECODE(SIGN(qtyonhand-min)
                   ,-1,'Under Stocked',0,'Stocked'
                   ,1,DECODE(sign(qtyonhand-max)
                        ,-1,'Stocked',0,'Stocked'
                          ,1,'Over Stocked'
                   )
            )
      ) FROM stock;  
    

Don’ts
SELECT stock_id||' Out of Stock' FROM stock WHERE
   qtyonhand <=0
UNION
SELECT stock_id||' Under Stocked' FROM stock
      WHERE qtyonhand BETWEEN 1 AND min-1
UNION
SELECT stock_id||' Stocked' FROM stock
WHERE qtyonhand BETWEEN min AND max
UNION
SELECT stock_id||' Over Stocked' FROM stock
      WHERE qtyonhand > max;


    

Datatype Conversions
Try to avoid using any type of data conversion function in any part of an SQL statement which could potentially match an index, especially if you are trying to assist performance by matching appropriate indexes.

Do’s
SELECT ...
FROM customer
WHERE cust_id = TO_NUMBER('11');

SELECT ...
FROM customer
WHERE cust_type = 1;


Don’ts
SELECT ...
FROM customer
WHERE cust_id = '11';

SELECT ...
FROM emp
WHERE TO_NUMBER (cust_type) = 1;







Wednesday, 27 April 2016

Tuning SQL Statement #2


Tuning SQL Statement #1.......
Tuning SQL Statement #3.......


•    The UNION statement performs the equivalent of a SELECT DISTINCT on the final result set. In other words, the UNION clause forces all rows returned by each portion of the UNION to be sorted and merged, and duplicates to be filtered, before the first row is returned. A UNION ALL simply returns all rows, including duplicates, and does not have to perform any sort, merge, or filter. If your tables are mutually exclusive (include no duplicate records), or you don't care whether duplicates are returned, UNION ALL is much more efficient because it skips the SELECT DISTINCT step and saves a lot of unnecessary resources.

Do's-
SELECT acct_num, balance_amt FROM debit_transactions  WHERE tran_date = '1-JAN-06' 
UNION ALL
SELECT acct_num, balance_amt  FROM credit_transactions  WHERE tran_date = '1-JAN-06' ;  

Don'ts-
SELECT acct_num, balance_amt  FROM debit_transactions  WHERE tran_date = '1-JAN-06'
UNION
SELECT acct_num, balance_amt  FROM credit_transactions  WHERE tran_date = '1-JAN-06' ;



•    Try to avoid using any type of data conversion function in any part of an SQL statement which could potentially match an index, especially if you are trying to assist performance by matching appropriate indexes.

Do's-
SELECT ...  FROM customer  

WHERE cust_id = TO_NUMBER('11');  

SELECT ...  FROM customer
WHERE cust_type = 1;    


Don'ts-
SELECT ...  FROM customer
WHERE cust_id = '11';

SELECT ...  FROM emp
WHERE TO_NUMBER (cust_type) = 1;



•    When using functions in SQL statements it is best to keep the functions away from any columns involving index matching unless function-based indexes are available and can be created.

Do's-
SELECT account_name, trans_date, amount  FROM transaction
WHERE account_name LIKE 'CAPITAL%';  

Don'ts-
SELECT account_name, trans_date, amount  FROM transaction
WHERE SUBSTR(account_name,1,7) = 'CAPITAL';



•    If your SELECT statement contains a HAVING clause, write the query so that the WHERE clause does most of the work of removing undesired rows, rather than the HAVING clause. Using the WHERE clause appropriately eliminates unnecessary rows before they reach the GROUP BY and HAVING clauses, saving unnecessary work and boosting performance.

Do's-
SELECT country, AVG (population)  FROM survey
WHERE country != 'INDIA'
AND country != 'CHINA'
GROUP BY country;

Don'ts-
SELECT country, AVG (population)  FROM survey
GROUP BY country
HAVING country != 'INDIA'
AND country != 'CHINA';



•    The ORDER BY clause is always executed after the WHERE clause. Try not to override the WHERE clause with the ORDER BY clause, because the optimizer may choose a less efficient execution method based on the ORDER BY clause. Where possible, include the WHERE clause columns in the ORDER BY clause.

Do's-
SELECT * FROM ordersline
WHERE order_id < 10
ORDER BY order_id, seq#;  

Don'ts-
SELECT * FROM ordersline
WHERE order_id < 10
ORDER BY seq#;








Tuesday, 26 April 2016

Tuning SQL Statement #1



•    Try to use explicit columns and try to read columns in index orders if possible, even to the point of reading indexes and not tables.
•    Use a standard approach to table aliases.  If two identical SQL statements vary because an identical table has two different aliases, then the SQL is different and will not be shared.
•    Use table aliases and prefix all column names by their aliases when more than one table is involved in a query.  This reduces parse time AND prevents future syntax errors if someone adds a column to one of the tables with the same name as a column in another table.
•    Use bind variables.  The values of bind variables do not need to be the same for two statements to be considered identical.  Bind variables are not substituted until a statement has been successfully parsed.

e.g. -
Do's     - Select FirstNme, MidName, LastName from atul.employee;
Don'ts  - Select * from atul.employee;



•    Try to match comparison condition column sequences with existing index column sequences, although it is not strictly necessary.
•    Always try to use unique, single-column indexes wherever possible. A single-column unique index is much more likely to produce exact hits. An exact hit is the fastest access method.
•    Try to avoid WHERE clauses that are negative. Negative keywords in the WHERE clause include "IS NULL", "<>", "!=", "!>", "!<", "NOT", "NOT EXISTS", "NOT IN", and "NOT LIKE".

e.g. -
Do's     - Select FirstNme from atul.employee where DeptCd Like 'IN%';
Don'ts  - Select FirstNme from atul.employee where SubSTR(DeptCd, 1,2) = 'IN';

Do's     - Select Material from atul.store where quantity > 0;
Don'ts  - Select Material from atul.store where quantity <> 0;

Do's     - Select Material from atul.store where quantity > 3;
Don'ts  - Select Material from atul.store where quantity + 3 > 6;



Tuning SELECT Statement Continues...............





Thursday, 27 August 2015

DB2 Column Compare Tool





            Whenever we are working on Data Warehouse / Database (Oracle / MSSQL / DB2) projects, we often need to compare two tables column-wise. When such a requirement comes up, we might end up in confusion, putting the column lists into an Excel sheet and comparing them column by column.

              Here's a procedure you can create in your local DB and run, feeding it the two table names as input; in one run you get three outputs: matched columns, Table1's unmatched columns, and Table2's unmatched columns. You might need to give the table's schema name if there are two tables in SYSCOLUMNS with the same name.




You can modify or rewrite the same procedure for your respective DBs.
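The stored procedure itself is not reproduced here, but the comparison logic is simple. Below is a minimal standalone sketch of the same idea in Python, assuming the ibm_db driver and a reachable DB2 database; the connection string, schema, and table names are placeholders you would replace.

# Illustrative sketch only -- not the stored procedure from this post.
import ibm_db

def table_columns(conn, schema, table):
    # Column names for schema.table from the SYSCOLUMNS catalog.
    sql = ("SELECT NAME FROM SYSIBM.SYSCOLUMNS "
           "WHERE TBCREATOR = ? AND TBNAME = ?")
    stmt = ibm_db.prepare(conn, sql)
    ibm_db.execute(stmt, (schema.upper(), table.upper()))
    cols, row = set(), ibm_db.fetch_tuple(stmt)
    while row:
        cols.add(row[0])
        row = ibm_db.fetch_tuple(stmt)
    return cols

# Placeholder connection details -- replace with your own.
conn = ibm_db.connect("DATABASE=SAMPLE;HOSTNAME=localhost;PORT=50000;"
                      "PROTOCOL=TCPIP;UID=db2inst1;PWD=secret;", "", "")

t1 = table_columns(conn, "ATUL", "EMPLOYEE")       # hypothetical tables
t2 = table_columns(conn, "ATUL", "EMPLOYEE_HIST")

print("Matched columns         :", sorted(t1 & t2))
print("Table1 unmatched columns:", sorted(t1 - t2))
print("Table2 unmatched columns:", sorted(t2 - t1))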