Showing posts with label Partitioning.

Monday, 14 November 2016

DataStage Partitioning #3



Best allocation of Partitions in DataStage for storage area

| Sr. No | No. of Ways | Volume of Data | Best Way of Partition | Allocation of Configuration File (Nodes) |
|---|---|---|---|---|
| 1 | DB2 EEE extraction in serial | Low | - | 1 |
| 2 | DB2 EEE extraction in parallel | High | Node number = current node (key) | 64 (depends on how many nodes are allocated) |
| 3 | Partition or repartition in the stages of DataStage | Any | Modulus (a single key, which must be an integer) or Hash (any number of keys, of any data type) | 8 (depends on how many nodes are allocated for the job) |
| 4 | Writing into DB2 | Any | DB2 | - |
| 5 | Writing into Dataset | Any | Same | 1, 2, 4, 8, 16, 32, 64, etc. (based on the incoming records it writes) |
| 6 | Writing into Sequential File | Low | - | 1 |

 

Best allocation of Partitions in DataStage for each stage

| S. No | Stage | Best Way of Partition | Important Points |
|---|---|---|---|
| 1 | Join | Left and right links: Hash or Modulus | All input links should be sorted on the joining key and partitioned on the higher-order key. |
| 2 | Lookup | Main link: Hash or Same; Reference link: Entire | The links need not be sorted. |
| 3 | Merge | Master and update links: Hash or Modulus | All input links should be sorted on the merging key and partitioned on the higher-order key. Pre-sorting makes the merge "lightweight" on memory. |
| 4 | Remove Duplicates, Aggregator | Hash or Modulus | Performs better if the input link is sorted on the key. |
| 5 | Sort | Hash or Modulus | Sorting happens after partitioning. |
| 6 | Transformer, Funnel, Copy, Filter | Same | None |
| 7 | Change Capture | Left and right links: Hash or Modulus | Both input links should be sorted on the key and partitioned on the higher-order key. |






Friday, 11 November 2016

DataStage Partitioning #2


Keyless partition
    Round Robin
    Entire
    Same
    Random

Round Robin
The first record goes to the first processing node, the second to the second processing node, and so on. When DataStage reaches the last processing node in the system, it starts over. This method is useful for resizing partitions of an input data set that are not equal in size. The round robin method always creates approximately equal-sized partitions. This method is the one normally used when DataStage initially partitions data.
Example: Assume 8 nodes are allocated to store the records. The 1st record goes to the first node, the 2nd record to the second node, and so on; the 8th record goes to the eighth node, the 9th record goes back to the first node, and so on.
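As an illustration only (plain Python, not DataStage code), round robin assignment reduces to taking the record index modulo the number of partitions:

```python
# Minimal sketch of round robin partitioning: record i goes to partition i mod N.
def round_robin_partition(records, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions

# With 8 partitions, records 1-8 land on nodes 1-8 and record 9 wraps back to node 1,
# so partition sizes never differ by more than one record.
sizes = [len(p) for p in round_robin_partition(list(range(20)), 8)]
print(sizes)  # [3, 3, 3, 3, 2, 2, 2, 2]
```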

Entire
Send all rows down all partitions.
Example: Assume 8 nodes are allocated; every record is sent to all 8 nodes.

Same
Preserve the same partitioning.
Example: A job has two stages, Sort and Dataset. If the Sort stage uses "Hash" partitioning and the Dataset stage uses "Same", the data in the Dataset keeps the hash partitioning.

Random
DataStage uses a random algorithm to choose where each row goes. As a result, you cannot know in advance which partition a row will end up in.


Application Execution: Parallel jobs can be executed in two ways:
> Sequential
> Parallel
The first slot of the figure below shows sequential execution. Parallel execution can run on two types of processing system, SMP and MPP, shown in the second and third slots of the figure.




Ref - www.ibm.com



Thursday, 10 November 2016

DataStage Partitioning #1


The partitioning mechanism divides data into smaller segments, each of which is then processed independently by a node in parallel. It takes advantage of parallel architectures such as SMP, MPP, grid computing and clusters.

A partition is logical: it divides memory or mass storage into isolated sections. Memory space is split into many partitions to achieve high parallelism. In DOS systems, for example, you can partition a disk, and each partition behaves like a separate disk drive.


Note:
In hash partitioning, no fixed amount of space is allocated to a partition in memory; the partition space is allocated according to the data.



Why Partition?
•    Ability to run multiple operating systems, or multiple versions of an operating system, on the same server
•    Ability to improve workload balancing and distribution by managing processor allocations across applications and users on the server
•    Ability to leverage hardware models such as "Capacity on Demand" and "Pay as You Grow"

Types of partition
  • Hash
  • Modulus
  • DB2
  • Auto
  • Random
  • Range
  • Round Robin
  • Entire
  • Same

Auto

DataStage inserts partitioners as necessary to ensure correct results, generally choosing Round Robin or Same. Since DataStage has limited awareness of your data and business rules, the best practice is to explicitly specify partitioning as per requirement whenever processing requires groups of related records.

Key based partition
  • Hash
  • Modulus
  • DB2
  • Range

Hash
Determines the partition based on key value(s). One or more keys with different data types are supported. DataStage's internal algorithm, applied to the key values, determines the partition. All key values are converted to characters before the algorithm is applied.
Example: Key is State. All “CA” rows go into one partition; all “MA” rows go into one partition. Two rows of the same state never go into different partitions.
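A minimal Python sketch of the idea (the hash function here is Python's built-in hash on the stringified key, not DataStage's internal algorithm, and the sample rows are made up):

```python
# Key-based hash partitioning: every row with the same key value maps to the same partition.
def hash_partition(rows, key, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        p = hash(str(row[key])) % num_partitions  # stringify the key, then hash
        partitions[p].append(row)
    return partitions

rows = [{"state": "CA", "id": 1}, {"state": "MA", "id": 2}, {"state": "CA", "id": 3}]
parts = hash_partition(rows, "state", 4)
# All "CA" rows end up in the same partition; which partition that is depends on the
# hash function, and different key values may still share a partition.
```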

Modulus

Partition based on the modulus of the key value divided by the number of partitions. The key must be an Integer type. ( partition = MOD(key_value, number_of_partitions) )
Example: Key is OrderNumber (Integer type). Rows with the same order number will all go into the same partition.
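For example (a minimal sketch, not DataStage code; the order numbers are made up):

```python
# Modulus partitioning on an integer key: partition = key_value mod number_of_partitions.
def modulus_partition(order_number, num_partitions):
    return order_number % num_partitions

print(modulus_partition(1001, 8))  # 1
print(modulus_partition(1002, 8))  # 2
print(modulus_partition(1009, 8))  # 1 -> different order numbers can share a partition,
                                   #      but the same order number always maps to the same one
```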

DB2
Matches DB2 EEE partitioning. DB2 published its hashing algorithm, and DataStage replicates it.
Example: This partitioning is used when loading data into a DB2 table. It takes the partitioning key from the target DB2 table and inserts the records accordingly. If a partitioning key is defined in the DB2 database it is used; otherwise the primary key is used by default.

Range
The partition is chosen based on a range map, which maps ranges of key values to specified partitions. This is similar to Hash, but the partition mapping is user-determined and the partitions are ordered. Range partitioning requires processing the data twice (once to build the range map and once to partition), which makes it hard to justify in most jobs.
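A minimal Python sketch of the idea, using the standard bisect module; the boundary values in the range map below are invented for illustration (in practice the map is built in a separate pass over the data):

```python
import bisect

# Hypothetical range map: three boundaries split the key space into 4 ordered partitions.
boundaries = [100, 500, 1000]

def range_partition(key_value):
    # keys < 100 -> partition 0, 100-499 -> 1, 500-999 -> 2, >= 1000 -> 3
    return bisect.bisect_right(boundaries, key_value)

print(range_partition(42))    # 0
print(range_partition(750))   # 2
print(range_partition(5000))  # 3
```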

This figure gives a clear view of key-based partitioning and repartitioning.



DataStage's parallel technology operates by a divide-and-conquer technique, splitting the largest integration jobs into subsets ("partition parallelism") and flowing these subsets concurrently across all available processors ("pipeline parallelism"). This combination of pipeline and partition parallelism delivers true linear scalability (defined as an increase in performance proportional to the number of processors) and makes hardware the only mitigating factor to performance.

However, downstream processes may need data partitioned differently. Consider a transformation that is based on customer last name, but the enriching needs to occur on zip code - for house-holding purposes - with loading into the warehouse based on customer credit card number (more on parallel database interfaces below). With dynamic data re-partitioning, data is re-partitioned on the fly between processes - without landing the data to disk - based on the partitioning needs of the downstream process.
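As an illustration of the idea (plain Python with made-up column names, not DataStage code): the same rows are first grouped by last name for the transformation, then reassigned to partitions by zip code for the enrichment step, without writing anything to disk in between.

```python
# Dynamic repartitioning between two processing steps, sketched with a simple hash.
def hash_partition(rows, key, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(str(row[key])) % num_partitions].append(row)
    return partitions

rows = [{"last_name": "Smith", "zip_code": "10001"},
        {"last_name": "Jones", "zip_code": "10001"},
        {"last_name": "Smith", "zip_code": "94105"}]

by_name = hash_partition(rows, "last_name", 4)             # partitioned for the transformation
by_zip = hash_partition(sum(by_name, []), "zip_code", 4)   # repartitioned for the enrichment
# After the second call, both "10001" rows sit in the same partition regardless of last name.
```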


Ref - www.ibm.com




Monday, 22 August 2016

5 Tips For Better DataStage Design #15



1. A stage variable does not accept a null value, so no nullable column should be mapped directly to a stage variable without null handling.

2. Use of the SetNull() function in stage variables should be avoided because it causes a compilation error.


3. If the input links are not already partitioned on the join key, they should be hash partitioned on the join key in the Join stage. In case of multiple join keys, it is recommended to partition on one key and sort by the other keys.

4. If there is a need to repartition on an input link, clear the preserve partitioning flag in the previous stage; otherwise it will generate a warning in the job log.

5. If the reference database table has a low volume of data, it is good to use the Lookup stage.

6. It is always advisable to avoid the Transformer stage where possible, because the Transformer stage is not written in DataStage's native language but in C. Every time you compile a job, the C code is embedded with the native code in the executable file, which degrades the performance of the job.






Thursday, 19 May 2016

5 Tips For Better DataStage Design #13



1. The query used against the database should fetch only the required rows. Do not extract columns that are not required.

2. For parallel jobs, a Sequential File should not be read using Same partitioning.




3. For huge amounts of data, using the Sequential File stage is not a good practice. This stage also should not be used for intermediate storage between jobs, as it degrades job performance.

4. The number of lookups in a job design should be kept to a minimum. The Join stage is a good alternative to the Lookup stage.

5. For parallel jobs, a proper partitioning method should be used for better job performance and an accurate flow of data.






Thursday, 14 April 2016

ETL Code Review Checklist




Guidelines
• Design jobs for restartability; if a job is not designed for restart, document the reason.
• Do not follow a Sequential File stage with "Same" partitioning.
• Check that the APT_CONFIG_FILE parameter is added; it is required to change the number of nodes at runtime.
• Do not hard-code parameters.
• Do not hard-code directory paths.
• Do not use fork-joins to generate lookup data sets.
• Use "Hash" aggregation for a limited number of distinct key values; it outputs only after all rows are read.
• Use "Sort" aggregation for a large number of distinct key values; the data must be pre-sorted, and output is produced after each aggregation group. (Both methods are illustrated in the sketch after this list.)
• Use multiple aggregators to reduce collection time when aggregating all rows: define a constant key column using a Row Generator, let the first aggregator sum in parallel and the second aggregator sum sequentially.
• Make sure sequences are not too long; break them up into logical units of work.
• Is error handling done properly? It is preferred to propagate errors from lower-level jobs to the highest level (e.g. a sequence).
• What is the volume of extracted data (is there a WHERE clause in the SQL)?
• Are the correct scripts invoked to clean up datasets after the job completes?
• Is there a reject process in place?
• Can we combine or split jobs to reduce the number of jobs or their complexity, respectively?
• It is not recommended to increase the number of nodes if there are too many stages in the job (this increases the number of processes spun off).
• Is there volume and growth information for the Lookup/Join tables?
• Check whether any query uses SELECT *; it is not advised, and the required columns should be listed explicitly in the statement.
• Check the partitioning and sorting at each stage.
• When a sequence is used, make sure none of the parameters passed are left blank.
• Check that there are separate jobs for at least extract, transform and load.
• Check that there is an annotation for each stage and for the job, and that the job properties have the author, date, etc. filled out.
• Check the naming conventions of the jobs, stages and links.
• Avoid Peek stages in production jobs; peeks are generally used for debugging during development.
• Make sure the developer has not suppressed valid warnings.
• Verify that the jobs conform to the Flat File and Dataset naming specification. This is especially important for cleaning up files and logging errors appropriately.
• Verify that all fields are written to the Reject flat files. This is necessary for debugging and reconciliation.
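The two aggregation methods above behave quite differently. Here is a minimal Python sketch (not DataStage code; the sample rows are made up) contrasting them: hash-style aggregation keeps every group in memory and can only emit results after all rows are read, while sort-style aggregation assumes the input is already sorted on the key and emits each group as soon as its key changes.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

rows = [("CA", 10), ("MA", 5), ("CA", 7), ("MA", 3)]  # (key, amount), made-up sample data

# "Hash"-style aggregation: unsorted input, all groups held in memory,
# results available only after every row has been read.
totals = defaultdict(int)
for key, amount in rows:
    totals[key] += amount
print(dict(totals))  # {'CA': 17, 'MA': 8}

# "Sort"-style aggregation: input must be sorted on the key;
# each group is emitted as soon as its key changes.
for key, group in groupby(sorted(rows, key=itemgetter(0)), key=itemgetter(0)):
    print(key, sum(amount for _, amount in group))  # CA 17, then MA 8
```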








Friday, 8 January 2016

5 Tips For Better DataStage Design #7



#1. If the partition type is to be changed in the next immediate stage, the 'Preserve partitioning' flag should be set to 'Clear' in the current stage.



#2. Make sure that appropriate partitioning and sorting are used in the stages wherever possible; this enhances performance. Make sure you understand the partitioning being used; otherwise, leave it as Auto.

#3. For fixed-width files, the final delimiter should be set to 'none' in the file format properties.

#4. If a processing stage requires a key (like Remove Duplicates, Merge, Join, etc.), the stage keys, sorting keys and partitioning keys should be the same and in the same order.

#5. To improve Funnel performance, all the input links must be hash partitioned on the sort keys.






Wednesday, 9 September 2015

DataStage Terminology


| DataStage Term | Description |
|---|---|
| DataStage Administrator | Client program used to manage DataStage projects. |
| DataStage server | Server engine component that DataStage client programs connect to. |
| DataStage Director | Client program used to run and monitor DataStage jobs. |
| DataStage Designer | Graphical design tool that developers use to design and develop DataStage jobs. |
| DataStage Manager | Program used to manage and view the contents of the repository. Refer to the "DataStage Manager Guide." |
| Stage | Component that represents a data source or a processing step in a DataStage job. |
| Source | The origin from which data is extracted. |
| Target | The destination of the data; for example, the file to be loaded to AML (output by DataStage). |
| Category | Category names used to classify jobs in DataStage. |
| Container | A reusable group of stages and links used to call common processing. |
| Job | Program that defines how to extract, transform, integrate, and load data into a target database. |
| Job templates | Model jobs used as the basis for jobs that perform similar processing. |
| Job parameters | Variables included in the job design (for example, a file name or a password). |
| Job sequence | A control job that starts and runs other jobs. |
| Scratch disk | Disk space used to store temporary data such as virtual record data sets. |
| Table definition | Definition describing the required data, including information about the associated tables and columns. Also referred to as metadata. |
| Partitioning | The DataStage mechanism that splits data so that large volumes can be processed at high speed. |
| Parallel engine | The engine that runs DataStage jobs across multiple nodes. |
| Parallel jobs | The DataStage job type that supports parallel processing. |
| Project | A collection of jobs and the components required to develop and run them; the entry point to DataStage from the client. A DataStage project must be licensed. |
| Metadata | Data about data; for example, a table definition describing the columns from which data is built. |
| Link | Joins the stages of a job together to form a data flow or a reference lookup. |
| Routine | Functions used in common across jobs. |
| Column definition | Defines the columns contained in a data table, including the names and data types of the columns. |
| Environment parameters | Variables included in the job design; for example, a file name or a password. |
| DB2 stage | Stage that reads from and writes to a DB2 database. |
| FileSet stage | A collection of files used to store data. |
| Lookup stage | Performs reference (lookup) processing against a text file, table, or data set used in DataStage. |
| LookupFileSet stage | Stores a lookup table. |
| Sequential stage | Reads and writes text files. |
| Custom stage | Implements, in C, processing that cannot be achieved with the standard DataStage stages. |
| Copy stage | Copies a data set. |
| Generator stage | Generates a dummy data set. |
| DataSet stage | Data file used by the parallel engine. |
| Funnel stage | Combines multiple input data sets into a single data set. |
| Filter stage | Extracts records from the input data set. |
| Merge stage | Joins more than one input record (a master data set with update data sets). |
| LinkCorrector stage | Collects data that was previously divided. |
| RemoveDuplicate stage | Removes duplicate entries from a data set. |
| ChangeCapture stage | Compares two data sets and records the differences between them. |
| RowImport stage | Imports columns from a string or binary column. |
| RowExport stage | Exports columns of other types to a string or binary column. |
| Transformer stage | Edits columns and performs type conversions. |
| Modify stage | Converts columns to specified data types and converts values to or from NULL as specified. |
| XML input stage | Reads XML files and extracts the required elements from the XML data. |
| Sort stage | Sorts data in ascending or descending order. |
| Join stage | Joins more than one input data set. |
| RowGenerator stage | Generates rows of data to produce a data set. |
| ColumnGenerator stage | Adds generated columns to an existing data set. |
| Aggregator stage | Aggregates data. |
| Pivot stage | Converts multiple columns into multiple rows of records. |
| Peek stage | Prints record column values to the job log, typically for debugging. |
| Stream link | Link that represents the flow of data. |
| Reference link | Input link that carries reference data. |
| Reject link | Link that outputs data that does not meet the specified criteria. |
| Integer type | Data type representing integer values. |
| Decimal type | Data type representing numbers that contain a decimal point. |
| NULL value | Special value indicating an unknown value; not the same as 0, a blank, or an empty string. |
| DataSet | A collection of data. |
| SQLStatement | Statement used to manipulate data in a table. |
| TWS | Tivoli Workload Scheduler; the product used to create job nets. |
| Hash | One way of dividing data with DataStage's partitioning function; partitioning is performed using a hash value of the key. |
| Modulus | One way of dividing data with DataStage's partitioning function; partitioning is performed using the remainder of the key value divided by the number of partitions. |
| Same | One way of dividing data with DataStage's partitioning function; the partitions output by the previous stage are used as input without re-dividing. |
| Job Properties | Property sheet used to configure settings for a job. |




