Showing posts with label Stage. Show all posts
Showing posts with label Stage. Show all posts

Wednesday, 5 April 2017

NULL Handling in Sequential File Stage

DataStage has a mechanism for denoting NULL field values. It is slightly different in server and parallel jobs. In the sequential file stage a character or string may be used to represent NULL column values. Here's how represent NULL with the character "~":

Server Job:
1. Create a Sequential file stage and make sure there is an Output link from it.
2. Open the Sequential file stage and click the "Outputs" tab ans Select "Format"
3. On the right enter the "~" next to "Default NULL string:"

Parallel Job:
1. Create a Sequential file stage and make sure there is an Output link from it.
2. Open the Sequential file stage and click the "Outputs" tab ans Select "Format"
3. Right click on "Field defaults" ==> "Add sub-property" and select "Null field value"
4. Enter the "~" in the newly created field.

Like the below page to get update

Wednesday, 18 January 2017

5 Tips For Better DataStage Design #17

**  There is an automap button in some stages,it can maps fields with the same names.

**  When you add a shared container into your job you need to map the columns of the container to your job link. What you might miss is the extra option you get on the Columns tab "Load" button. In addition to the normal column load you get "Load from Container" which is a quick way to load the container metadata into your job.

**  Don't create a job from an empty canvas. Always copy and use an existing job. Don't create shared containers from a blank canvas, always build and test a full job and then turn part of it into a container.

**  If you want to copy and paste settings between jobs,you had better open two Designers,then you can have two property windows open at the same time and copy or compare them more easily.As most property windows in DataStage are modal and you can only have one property window open per Designer session.

**  You can load metadata into a stage by using the "Load" button on the column tab or by dragging and dropping a table definition from the Designer repository window onto a link in your job. For sequential file stages the drag and drop is faster as it loads both the column names and the format values in one go. If you used the load button you would need to load the column names and then the format details separately.

**  Maybe you often meet a Modify stage or stage function working incorrectly, trial and error should be often the only way to work out the syntax of a function. If you do this in a large and complex job, it can be consumed a lot of times to debug it. The better way is have a couple test jobs in your project with a row generator, a modify or transformer stage and a peek stage. Have a column of each type in this test job. Use this throughout your project as a quick way to test a function or conversion. By the way, to correctly running the transformer stage need install the c++ compiler.

Like the below page to get update

Monday, 14 November 2016

DataStage Partitioning #3

Best allocation of Partitions in DataStage for storage area

No of Ways
Volume of Data
Best way of Partition
Allocation of Configuration File (Node)
DB2 EEE  extraction in serial
DB2 EEE extraction in parallel
Node number = current node (key)
64 (Depends on how many nodes are allocated)
Partition or Repartition in the Stages of DataStage
Modulus (It should be single key that to integer)
Hash (Any number of keys with different data type)
8 (Depends on how many nodes are allocated for the job)
Writing into DB2
Writing into Dataset
1,2,4,8,16,32,64 etc… (Based on the incoming records it writes into it.)
Writing into Sequential File


Best allocation of Partitions in DataStage for each stage

S. No
Best way of Partition
Important points
Left and Right link: Hash or Modulus
All the input links should be sorted based on the joining key and partitioned with higher key order.

Main link: Hash or same
Reference link: Entire
Both the links need not be in the sorted order

Master and update link: Hash or Modulus
All the input links should be sorted based on the merging key and partitioned with higher key order. Pre-sort makes merge “lightweight” for memory.

Remove Duplicate, Aggregator
Hash or Modulus
If the input link is in sorted order based on the key it will perform better.

Hash or Modulus
Sorting happens after partitioning

Transformer, Funnel, Copy, Filter
Change Capture
Left and Right link: Hash or Modulus
Both the input links should be in the sorted order based on the key and partitioned with higher key order.

Like the below page to get update

Monday, 7 November 2016

Modify Stage - Drop Columns

Monday, 29 August 2016

Modify Stage - What's been promised

Modify stage, one of the most un-used stage in DataStage but very useful in terms of performance tuning. It is advisable to Developers not to use transformer stage to just Trimming or NULL handling but if and only if in the case when they are aware and comfortable with the syntax and derivations supported by modify stage as there is no drop down or right click options to help us with functions/synatx.

The definition of Modify Stage as IBM documented -

"The Modify stage alters the record schema of its input data set. The modified data set is then output. You
can drop or keep columns from the schema, or change the type of a column.
The Modify stage is a processing stage. It can have a single input link and a single output link."
The operations offered by Modify Stage is -
1. Dropping of Columns
2. Keeping Columns
3. Create Duplicate Columns
4. Rename Columns
5. Change Null-ability of Columns
6. Change Data Type of Columns

Stage is supporting only 1 input stage and 1 output stage.

All these operations are easily done in other stages such as copy, transformer etc. But why Modify stage is required or can say, we should use this?

Answer of this Datastage problem is simple - Performance Tuning of jobs

** Why not to use Transformer -
Cause, Whenever we call the transformer functions, data processed to and through C++ code (transformer implementation) which cause the performance letancy(delay). This delay is negligible for less no of records than higher no. So, Prefer the Modify stage when no of records are high to process.

Keep looking for this place as we are going to learn lot of tips on Modify stage.

Get this Article as PDF - and

Like the below page to get update

Monday, 22 August 2016

5 Tips For Better DataStage Design #15

1. Stage variable does not accept null value. Hence no null able column should be directly mapped to stage variable without null handling.

2. Use of SetNull() function in stage variables should be avoided because it causes compilation error.

3. If input links are not already partitioned on join key then they should be hash partitioned on the join key in join stage. In case of multiple join key it is recommended to partition on one key and sort by the other keys.

4. If there is a need to do the repartition on an input link then we need to clear the preserve partitioning flag in the previous stage. Otherwise it will generate warning in job log.

5. If database table has less volume of data as a reference then it is good to use lookup stage.

6. It is always advisable to avoid Transformation stage. Because the Transformation stage is not written in DataStage native language, instead it is written in c. So every time you compile a job it embeds the c code with the native code in the executable file, which degrades the performance of the job.

Like the below page to get update

Thursday, 16 June 2016

5 Tips For Better DataStage Design #14

1. The use of Lookup stage depends upon the volume of data.Sparse lookup type should be used when primary input data volume is small.If the reference data volume is more, Lookup Stage should be avoided.

2. Use of ORDER BY clause in the database is good as compared to use of sort stage.

3. In Dtatastage Administrator, Tuned the 'Project Tunable' for better performance.

4. For Funnel, the use of this stage reduces the performance of a job. Funnel Stage should be run in continuous mode.

5. If the hash file is used only for lookup then "enable Preload to memory". This will improve the performance.

Like the below page to get update

Tuesday, 17 May 2016

Lookup Stage behaviour

Today, I am gonna ask you a question, What value I will get from lookup when my datatype is integer (Not Null) and there is no match b/w source and reference data???

Generally, we say, NULL as there is no match b/w source and reference. But that's not true.
So let's see how the DataStage and Lookup behave :-)
When Source and Reference are NULLable -
-       If there is no match b/ source and reference, we will get NULL in output 

When Source and Reference are Not-NULLable -
-       If there is no match b/ source and reference, we will get DataStage Defaults for that datatype.
        such as - 0 for integer and empty string or '' for varchar when data is going out from lookup stage.

So, Be careful when you are planning to filter the data outside lookup stage based on referenced columns value as field in output file is not null, transformer stage don't receive a null (because it comes with the default value 0) and can't handle it as you expec.

Hoping, this will add one pointer in your learning. Let me know your thoughts in comment section.

Like the below page to get update

Friday, 6 May 2016

DS Fatal Error: Destination "APT_TRinput0Rec0" is already bound

Fatal Error: Destination "APT_TRinput0Rec0" is already bound - Transformer Stage Error

* Check if the output stage is having identical column names
* Check if RCP is enabled in input links

If yes,
Rename the target output name accordingly
or Disable the RCP>

Like the below page to get update

Sunday, 10 April 2016

5 Tips For Better DataStage Design #12

1. Minimum number of sort stages should be use in a datastage job. “Don’t sort if previously sorted” in sort Stage, this option should be set this to “true”, which improves the Sort Stage performance. The same Hash key should be used.  In Transformer Stage “Preserve Sort Order” can be used to maintain sort order option.

2. Minimum number of stages should be used in a job; otherwise it affects the performance of the job.
If a job is having more stages then the job should be decomposed into smaller number of small jobs. The use of container is a best way for better visualize and readability. If the existing active stages occupy almost all the CPU resources, the performance can be improved by running multiple parallel copies of the same stage process. This is done by using a share container.

3. Use of minimum of Stage variables in transformer is a good practice. The performance degrades when more stage variables are used.

4. The use of column propagation should be taken care . Columns, which are not needed in the job flow, should not be propagated from one Stage to another and from one job to the next. The best option is to disable the RCP.

5. When there is a need of renaming columns or addition of new columns, use of copy or modify stage is good practice.

Like the below page to get update

Tuesday, 22 March 2016

Transformer Stage alternative - A Quick DataStage Recipe

What to use instead of "Transformer" Stage

Copy Stage

Use "Copy" stage instead of "Transformer" Stage for following:
Renaming columns
Dropping columns
Default type conversions
Job design placeholder between stages

Modify Stage

Use "Modify" stage
Non default type conversions
Null handling
Character string trimming

Filter Stage

Use "Filter" Stage
Using constraints on output data

Will add more.......

Like the below page to get update

Sunday, 13 March 2016

Capture Unmatched records from Join Stage - A Quick DataStage Recipe

How to capture unmatched records from "Join" Stage. This stage does not provide reject handling for unmatched records (such as in Inner Join)


Use Outer Join operation. All rows on an outer link are output regardless of match on key values.

How to:

Add Outer join operation for "Join’ stage
When a match does not occur, Join stage inserts NULL values into unmatched columns
Filter stage can be used to test NULL values in unmatched columns

Like the below page to get update

Monday, 1 February 2016

5 Tips For Better DataStage Design #8

#1. Templates have to be created to enhance reusability and enforce coding standard. Jobs should be created using templates.
#2. The template should contain the standard job flow along with proper naming conventions of components, proper Job level annotation and short/long description. Change record section should be kept in log description to keep track.

#3. Don't copy the job design only. copy using 'save as' or create copy option at the job level.
#4. The DataStage connection should be logged off after completion of work to avoid locked jobs.
#5. Creation of common lookup jobs
#6. Some extraction jobs can be created to created reference datasets. The datasets can then be used in different conversion modules

Like the below page to get update!forum/datagenx

Wednesday, 23 December 2015

5 Tips For Better DataStage Design #6

#1. If you are using a copy or a filter stage either immediately after or immediately before a transformer stage, you are reducing the efficiency by using more stages because a transformer does the job of both copy stage as well as a filter stage

#2. Work done by "COPY Stage"
a) Columns order can be altered.
b) And columns can be dropped.
c) We can change the column names.

#3. When you need to run the same sequence of jobs again and again, better create a sequencer with all the jobs that you need to run. Running this sequencer will run all the jobs. You can provide the sequence as per your requirement.

#4. Sort the data as much as possible in DB and reduced the use of DS-Sort for better performance of jobs. Avoid the work done by DataStage which is possible in DB. But it doesn't mean you have to put all the complexity in SQL only, for that we are using datastage.

#5. Ensure that all the character fields are trimmed before any processing. Normally extra spaces in the data may lead to some errors like lookup mismatch which are hard to detect.

Like the below page to get update!forum/datagenx

Friday, 20 November 2015

5 Tips For Better DataStage Design #5

#1. Use the Data Set Management utility, which is available in the Tools menu of the DataStage Designer or the DataStage Manager, to examine the schema, look at row counts, and delete a Parallel Data Set. You can also view the data itself.

#2. Use Sort stages instead of Remove duplicate stages. Sort stage has got more grouping options and sort indicator options.

#3. for quick checking if DS job is running on Server or not, from UNIX
ps -ef | grep 'DSD.RUN'

#4. Make use of Order By clause when a DB stage is being used in join. The intention is to make use of Database power for sorting instead of Data Stage resources. Keep the join partitioning as Auto. Indicate don’t sort option between DB stage and join stage using sort stage when using order by clause.

#5. There are two types of variables - string and encrypted. If you create an encrypted environment variable it will appears as the string "*******" in the Administrator tool.

Like the below page to get update!forum/datagenx