
Thursday, 14 September 2017

Evaluation Sequence in Transformer Stage - A Quick DataStage Recipe


What is the evaluation sequence in the Transformer stage? In other words, in what order are stage variables, loop variables and column derivations evaluated?


1. Transformer Stage
     a. Stage Variables
     b. Loop Variables
     c. Derivations

How To:

Evaluate each stage variable initial value
For each input row to process:
    Evaluate each stage variable derivation value, unless the derivation is empty
    For each output link:
        Evaluate each column derivation value
        Write the output record
    Next output link
Next input row

** The stage variables and the columns within a link are evaluated in the order in which they are displayed in the Transformer editor. Similarly, the output links are also evaluated in the order in which they are displayed.
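To make the ordering concrete, here is a small, purely illustrative sketch of that loop in plain Python (this is not DataStage code; the stage variable names svRunningTotal and svSign and the column derivations are made up for the example) -

# Hypothetical illustration of the Transformer evaluation order described above.
input_rows = [{"AMOUNT": 100}, {"AMOUNT": -5}]

# Stage variable initial values - evaluated once, before any row is processed
svRunningTotal = 0
svSign = ""

output_link1 = []

for row in input_rows:                                  # For each input row to process
    svRunningTotal = svRunningTotal + row["AMOUNT"]     # stage variable 1 (top of the list, evaluated first)
    svSign = "NEG" if row["AMOUNT"] < 0 else "POS"      # stage variable 2 (may refer to stage variable 1)

    # For each output link: evaluate the column derivations in display order,
    # then write the output record
    output_link1.append({
        "AMOUNT": row["AMOUNT"],
        "RUNNING_TOTAL": svRunningTotal,
        "SIGN": svSign,
    })

print(output_link1)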


Saturday, 12 August 2017

Delete all lines in Notepad++ except lines containing a pattern

Who doesn't want to keep only the items they actually need? When working with lots of junk data, we often want to keep just the lines that match a pattern :-)

This can be done very easily in NotePad++ .

1. Use Ctrl-F to open the Search dialog and select the "Mark" tab (make sure the "Bookmark line" option is ticked)

2. Put the Pattern in "Find What" box (in my case, I want to keep "Singh" only)

3. Then click on Mark All to mark all the lines which have "Singh" in them.

4. When Marked, it will look like below -

5. After marking, close this dialog and go to Search --> Bookmark --> Remove Unmarked Lines

As soon as you click on Remove Unmarked Lines, it will delete all the lines other than the marked ones.
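If you ever need the same result outside Notepad++, a few lines of Python can do the equivalent filtering. This is just an illustrative sketch - the file name data.txt and the pattern "Singh" are placeholders -

# Keep only the lines containing a pattern (the script equivalent of Mark All + Remove Unmarked Lines).
pattern = "Singh"
with open("data.txt", encoding="utf-8") as src:
    kept = [line for line in src if pattern in line]

with open("data_filtered.txt", "w", encoding="utf-8") as out:
    out.writelines(kept)

print("Kept", len(kept), "matching lines")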


Friday, 14 July 2017

Utility to Get Last Run Log of DataStage Job

Love the new feature of IBM DataStage to fetch the last run log of any job with the help of the "dsjob" command.
In older versions of DataStage it was very tedious to get the last run log, but from v9.1 onward IBM added an additional option to the dsjob command. Let's see how this works -

  • To Fetch Last Run Log:
  • To Fetch Second Last Run Log
  • To Fetch Third Last Run Log
  • To Fetch Nth Last Run Log


Thursday, 13 July 2017

How to Run Python Code in NotePad++

Can you run Python code from Notepad++? The question seems a little odd, but we can tweak our Notepad++ settings and configure it that way. Let's see how -

1. Write a few lines of Python code; you can use the lines below -

print("Today we are going to learn how to use notepad++ to run the python code")
print("As first step, we have to write few python code line")

input("Press Enter to Exit..........")

2. Save this file, in my case, it is saved as ""
3. Check the python executable path in your system. In my case it is - C:\tools\Anaconda3\python.exe  (It can be different as per your python installation)
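If you are not sure where your Python executable lives, you can ask Python itself from any interactive session or script -

import sys
print(sys.executable)   # prints the full path of the running Python interpreter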

4. Now go to Run menu or Press F5. This will open a run window as below -

5. Put the below code in 'The Program to Run' box -
Python_Executable_Path $(FULL_CURRENT_PATH)
C:\tools\Anaconda3\python.exe $(FULL_CURRENT_PATH)

6. Save this Run configuration by clicking on Save button on same window

7. Choose a shortcut key combination (you can use Ctrl + Alt + Shift + <Key>) and click Save.

8. You can see this combination under Run Menu.

9. Now you can run the Python code by pressing that key combination (in my case - F9)

Things to Remember:
1. This tweak is not a replacement for a Python IDE :-) such as PyCharm, Spyder or many others.
2. Always put input("Press Enter to Exit..........") as the very last line of your code, else you will not be able to see the output (the console window closes as soon as the script finishes).


Wednesday, 5 July 2017

Conditionally Aborting Jobs with Transformer Stage

How do you develop a job which will stop processing when FROM_DATE and TO_DATE are equal in the data? Or
how do you abort the job when the reject row count is more than 50?

The above scenarios can be implemented using the Transformer stage, but how? Let's check this out -

  • The Transformer can be used to conditionally abort a job when incoming data matches a specific rule. 
    • In our case 1, it is FROM_DATE  = TO_DATE 
    • In our case 2, it is some reject condition 
  • Create a new output link that will handle rows that match the abort rule. 
  • Within the link constraints dialog box, apply the abort rule to this output link
  • Set the “Abort After Rows” count to the number of rows allowed before the job should be aborted.
    • In case 1, it should be 1, as we want to abort the job as soon as FROM_DATE is equal to TO_DATE
    • In case 2, it should be 50, as we want to abort the job when the reject condition has more than 50 records

But, since the Transformer will abort the entire job flow immediately, it is possible that valid rows will not yet have been flushed from Sequential File (export) buffers, or committed to database tables.
It is important to set the Sequential File buffer flush or database commit parameters accordingly; otherwise we have to manually remove the data which has already been inserted into the sequential file or database.
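To picture the "Abort After Rows" behaviour, here is a small, purely illustrative Python sketch (this is not how DataStage implements it internally; the column names and threshold are just examples) -

# Illustrative only: count the rows that satisfy the abort rule and stop once the
# configured "Abort After Rows" threshold is reached.
def process(rows, abort_rule, abort_after_rows):
    matched = 0
    for row in rows:
        if abort_rule(row):
            matched += 1
            if matched >= abort_after_rows:
                raise RuntimeError("Job aborted: abort rule matched %d row(s)" % matched)
        # ... normal output-link processing would happen here ...

rows = [{"FROM_DATE": "2017-07-01", "TO_DATE": "2017-07-02"},
        {"FROM_DATE": "2017-07-05", "TO_DATE": "2017-07-05"}]

try:
    # Case 1: abort as soon as one row has FROM_DATE equal to TO_DATE
    process(rows, lambda r: r["FROM_DATE"] == r["TO_DATE"], abort_after_rows=1)
except RuntimeError as err:
    print(err)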


Wednesday, 18 January 2017

5 Tips For Better DataStage Design #17

**  There is an auto-map button in some stages; it can map fields with the same names.

**  When you add a shared container into your job you need to map the columns of the container to your job link. What you might miss is the extra option you get on the Columns tab "Load" button. In addition to the normal column load you get "Load from Container" which is a quick way to load the container metadata into your job.

**  Don't create a job from an empty canvas. Always copy and use an existing job. Don't create shared containers from a blank canvas, always build and test a full job and then turn part of it into a container.

**  If you want to copy and paste settings between jobs, it helps to open two Designer sessions; then you can have two property windows open at the same time and copy or compare them more easily. Most property windows in DataStage are modal, and you can only have one property window open per Designer session.

**  You can load metadata into a stage by using the "Load" button on the column tab or by dragging and dropping a table definition from the Designer repository window onto a link in your job. For sequential file stages the drag and drop is faster as it loads both the column names and the format values in one go. If you used the load button you would need to load the column names and then the format details separately.

**  You may often find a Modify stage or a stage function working incorrectly, and trial and error is often the only way to work out the syntax of a function. If you do this in a large and complex job, debugging can consume a lot of time. A better way is to keep a couple of test jobs in your project with a Row Generator, a Modify or Transformer stage, and a Peek stage. Have a column of each type in this test job, and use it throughout your project as a quick way to test a function or conversion. By the way, the Transformer stage needs the C++ compiler installed in order to compile correctly.


Thursday, 8 December 2016

Notepad++ tip - Format JSON file

Notepad++ is a very powerful tool with lots of plugins and functionality which can reduce a lot of our work. Today, we will see how to deal with JSON data in Notepad++.

1. First of all, whenever you are opening any data or code file, always select the respective language style. Here is how you do it -

Open the data/code file --> Go to the Language menu --> Select the respective language setting (in our case, it's J --> JSON)

After doing this, you will see that code/data text is more visible to your eyes.

2. Install some plugins, for JSON, install below ones -
a. JSON Viewer
b. JSToolNpp

Go to Plugins --> Plugin Manager --> Show Plugin Manager --> Available ---> Select & Install

3. For formatting the JSON, select all the content and then use Ctrl+Alt+Shift+M or Ctrl+Alt+J
4. Your JSON file has been formatted :-)
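If Notepad++ or the plugins are not to hand, the same formatting can be done with a couple of lines of Python; input.json and output.json are placeholder file names -

import json

# Read the unformatted JSON and rewrite it pretty-printed with 4-space indentation.
with open("input.json", encoding="utf-8") as src:
    data = json.load(src)

with open("output.json", "w", encoding="utf-8") as out:
    json.dump(data, out, indent=4, sort_keys=True)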

You can download some useful plugins from here ->
If you don't have access, use this -

Place this plugin folder into your Notepad++ installation directory and restart Notepad++.


Monday, 21 November 2016

Reading DSParam - datastage parameter file

I am sharing a utility which can help you read the DSParams file, which holds all the project-level DataStage environment parameters.

Utility to view contents of DSParams file. Useful when trying to see what all the customer has set at the project level.

$ cat DSParams | ./ | more
$ cat DSParams | ./ > outputfile

1. copy script text below to a file ( on a UNIX system
2. Set execute permissions on this file. chmod 777
3. Usually perl is in /usr/bin/perl but you might have to adjust this path if necessary. (hint: "which perl" should tell you which one to use)
4. cat the DSParams file from the project you are concerned with and pipe it to this script. You may have to put the fully qualified path for this file.
5. Capture the output to the screen or to a file. A file may be useful if you need the customer to send the info to you by email.
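The Perl script itself is not reproduced here. As a rough stand-in, a minimal Python sketch that simply groups the DSParams entries under their bracketed section headers (such as [EnvVarDefns] or [EnvVarValues]) could look like this - it makes no attempt to decode the individual fields, so treat it only as a starting point -

import sys

# Very rough DSParams viewer: print each [Section] header followed by its entries.
# Usage: python <path>/DSParams     (or pipe the file in:  cat DSParams | python
def main():
    source = open(sys.argv[1], encoding="utf-8", errors="replace") if len(sys.argv) > 1 else sys.stdin
    for raw in source:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("[") and line.endswith("]"):
            print()            # blank line before each new section
            print(line)
        else:
            print("   " + line)

if __name__ == "__main__":
    main()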


Tuesday, 11 October 2016

5 Tips For Better DataStage Design #16

1. Use 4-node configuration file for unit testing/system testing the job.
2. If there are multiple jobs to be run for the same module, archive the source files in the after-job routine of the last job.
3. Check whether the file exists in the landing directory before moving the sequential file. The ‘mv’ command will move the landing directory if the file is not found.

4. Ensure that the UNIX files created by any DataStage job are created by the same UNIX user who ran the job.
5. Make sure that the Short Job Description is filled using ‘Description Annotation’ and it contains the job name as part of the description. Don’t use Annotation for putting the job description.


Saturday, 8 October 2016

#2 DataStage Solutions to Common Warnings/Error - Null Handling

Warnings/Errors Related to Null Handling -

1.1       When checking operator: When binding output interface field “XXXXX” to field “XXXXX”: Converting a nullable source to a non-nullable result

Cause: This can happen when reading from an Oracle database, or in any processing stage where the input column is defined as nullable but the metadata in DataStage is defined as non-nullable.

Resolution: Convert the nullable field to non-nullable by applying one of the available null-handling functions, either in DataStage or in the query.

1.2       APT_CombinedOperatorController(1),0: Field 'XXXXX' from input dataset '0' is NULL. Record dropped.

Cause: This can happen when there is no null handling on a column and the same column is used in constraints/stage variables.

Resolution: Provide a null-handling function for the column mentioned in the constraint/stage variable.

1.3       Fatal Error: Attempt to setIsNull() on the accessor interfacing to non-nullable field "XXXX".

Cause: This can happen when the column in the source is nullable but in the DB2 stage it is defined as non-nullable.

Resolution: Change the Nullable property for the column to “Yes” instead of “No”.

1.4       Exporting nullable field without null handling properties

Cause: This can happen when the columns are defined as nullable in the Sequential File stage and no representation for null values was specified.

Resolution: Specify a null field value on the Format tab of the Sequential File stage.


Wednesday, 28 September 2016

DS_PXDEBUG - DataStage Parallel Debugging Variable

* Controlled with an environment variable, not exposed on GUI.  DS_PXDEBUG set to activate feature (e.g. DS_PXDEBUG=1)
* Warning logged when job run with this debug feature on.
* Debug collected under a new project-level directory "Debugging" on the server. Subdirectories on a per-job basis, named after the job (created as required). For multi-instance jobs run with a non-empty invocation ID, the directory will be "<jobname>.<invocationID>".
* Internally turns on Osh environment variable APT_MSG_FILELINE so that warnings/errors issued by Osh have source filename & linenumber attached.
* Internally turns on Osh environment variable APT_ENGLISH_MESSAGES so that unlocalised copies of PX-originated error / warning messages are issued in addition to the localised copy (where available).

* Internally turns on the Osh environment variables APT_PM_PLAYER_TIMING, APT_PM_PLAYER_MEMORY and APT_RECORD_COUNTS for more reporting from the players.
* Places the content of the job's RT_SC<jobnum> directory in the debug location (includes the job parameter file, Osh script, parent shell script, and any osh and compile scripts associated with transformers). These will be in the same characterset as the original files.
* Places the content of the job's RT_BP<jobnum>.O directory in the debug location. Includes library file binaries for PX transformers (plus possibly binaries associated with any Server portions of the job).
* Dump of environment variable values at startup (same as in the log) placed in a named file in the debug location.
* Dump of osh command options placed in a named file in the debug location. Note that this is as issued from the Server wrapper code. Particularly in the case of Windows, it may not represent exactly what is received by the Osh command line, due to the action of the OshWrapper program,  and interpretation of quotes and backslash-escapes.
* Copy of received raw osh output messages in a named file in the debug location. These will typically be in the host characterset, even though on an NLS system Orchestrate will be originating them in UTF8.
* Copy of PX configuration file placed in the debug location. This will be in the same characterset as the original file.
* This new feature collects together and enhances a number of debug features already exposed with other environment variables. In order to minimise code impact risk, the original features will not be removed at this stage.
* The exception is the "dump of raw osh output messages"; it was previously placed in the &COMO& directory. If the old and the new debug options are both enabled, the new one will take precedence and there will not be a copy in &COMO&. Again this decision has been taken to minimise code change.

Contributed by Christ Thornton 2/2/2007


Monday, 29 August 2016

Modify Stage - What's been promised

The Modify stage is one of the most under-used stages in DataStage, but it is very useful in terms of performance tuning. Developers are advised not to use a Transformer stage just for trimming or NULL handling - but only if they are aware of and comfortable with the syntax and derivations supported by the Modify stage, as there are no drop-down or right-click options to help with its functions/syntax.

The definition of Modify Stage as IBM documented -

"The Modify stage alters the record schema of its input data set. The modified data set is then output. You
can drop or keep columns from the schema, or change the type of a column.
The Modify stage is a processing stage. It can have a single input link and a single output link."
The operations offered by the Modify stage are -
1. Dropping of Columns
2. Keeping Columns
3. Create Duplicate Columns
4. Rename Columns
5. Change Null-ability of Columns
6. Change Data Type of Columns

The stage supports only one input link and one output link.

All these operations can easily be done in other stages such as Copy, Transformer etc. So why is the Modify stage needed, or rather, when should we use it?

The answer is simple - performance tuning of jobs.

** Why not use a Transformer -
Because whenever we call Transformer functions, data is processed through generated C++ code (the Transformer implementation), which introduces some performance latency (delay). This delay is negligible for a small number of records but grows as the volume increases, so prefer the Modify stage when a high number of records has to be processed.

Keep looking for this place as we are going to learn lot of tips on Modify stage.



Monday, 22 August 2016

5 Tips For Better DataStage Design #15

1. Stage variables do not accept null values, hence no nullable column should be mapped directly to a stage variable without null handling.

2. Use of the SetNull() function in stage variables should be avoided because it causes a compilation error.

3. If the input links are not already partitioned on the join key, they should be hash partitioned on the join key in the Join stage. In the case of multiple join keys it is recommended to partition on one key and sort by the other keys.

4. If there is a need to repartition an input link, we need to clear the preserve-partitioning flag in the previous stage. Otherwise it will generate a warning in the job log.

5. If a database table used as reference data has a small volume of data, it is good to use the Lookup stage.

6. It is advisable to avoid unnecessary Transformer stages. The Transformer stage is not written in DataStage's native language; instead it is generated as C++ code, so every time you compile a job it embeds the compiled C++ code with the native code in the executable, which can degrade the performance of the job.


Wednesday, 3 August 2016

#3 How to Copy DataSet from One Server to Another Server

This post is the third and last part of the How to Copy DataSet from One Server to Another Server series.

We have generated and populated a dataset, and identified the files which we need to move from serverA to serverB.


4. Reading the dataset on another server

This is the most crucial step. Now all 4 files have been moved to serverB, or to a common location which is accessible from serverB.

For my case, common dir is my home - /home/users/atul

A. Change the default.apt file
We need to change the fastname in the default.apt (config file) which we copied from serverA [NOT the default.apt of serverB]

Open the file in any text editor or vi and change as below screen shot -

Temporarily create the "resource disk" and "resource scratchdisk" location if not existing as defined in above config file.
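For reference, a minimal two-node configuration file looks roughly like the sketch below (the host name and directory paths here are illustrative, not taken from the original screenshots). The fastname entries are what need to point at serverB, and the "resource disk" / "resource scratchdisk" paths are the locations mentioned above -

{
    node "node1"
    {
        fastname "serverB"
        pools ""
        resource disk "/opt/IBM/InformationServer/Server/DataSets" {pools ""}
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
    }
    node "node2"
    {
        fastname "serverB"
        pools ""
        resource disk "/opt/IBM/InformationServer/Server/DataSets" {pools ""}
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
    }
}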

B. Copy the dataset data files 

Move the dataset data files from the common directory to the "resource disk" location as defined in the config file.

cp ~/dummy.ds.* /opt/IBM/InformationServer/Server/DataSets/

Now, all files locations are like -

Config file and Dataset descriptor file - my home dir or common dir
Dataset data files - /opt/IBM/InformationServer/Server/DataSets/

Design a job which will read this dataset and populate the data into a sequential file or any other output.

Job Parameters -
APT_CONFIG_FILE = /home/users/atul/default.apt

DataSet Properties
DataSet File - /home/users/atul/dummy.ds

That is all - you can now read the copied dataset on serverB. You can then populate this data into some other output such as a sequential file or a table, so that you can avoid further use of the copied default.apt config file, which does not belong to serverB.

Try it out, let me know if you have any question.


Monday, 1 August 2016

#2 How to Copy DataSet from One Server to Another Server

This post is second part of How to Copy DataSet from One Server to Another Server


After generating the dummy dataset, the next step is to identify the files which we need to copy.

2. Files which we need to move

a. APT_CONFIG_FILE - the configuration file which was used to create the dataset
b. DataSet Descriptor file - the *.ds file, in our case dummy.ds
c. DataSet Data files - the actual data files, which are stored in the RESOURCE DISK location

So let's get all the paths which we need to access -

APT_CONFIG_FILE = /opt/IBM/InformationServer/Server/Configulations/default.apt
RESOURCE DISK = /opt/IBM/InformationServer/Server/DataSets
DATASET LOC = /home/users/atul/dummy.ds

Use commands or any FTP tool to copy these files to a shared location which is accessible from the other server (serverB).

In my case, I have stored all of them in my Linux home directory, which is common to both servers.

So I have executed these commands to copy all the required files into my home directory.

cp  /opt/IBM/InformationServer/Server/Configulations/default.apt ~
cp  /opt/IBM/InformationServer/Server/DataSets/dummy.ds.* ~
cp  /home/users/atul/dummy.ds ~

Now my home directory has these files -

You can copy these 4 files to serverB, wherever you want to move your dataset. I am not doing so, as my home directory is common to both servers.

3. Why we need these files only

The config file was used by DataStage to create the dataset (descriptor file, data files, data file location).
So we need the config file, the dataset descriptor file and the dataset data files.


Saturday, 30 July 2016

#1 How to Copy DataSet from One Server to Another Server

Hi Guys...
I've been asked so many times how we can move/copy a dataset from one server to another, so here is the way I follow.

As a very first step, analyse whether you can avoid this by using some other approach, like creating a sequential file and FTPing it, or loading the data into a temporary table which is accessible from the other server, or (if you are using DataStage packs) via MQ, XML or JSON formats etc. I am suggesting these solutions because they are easy to design and guarantee the data quality at the other end.

If above solutions are not possible, please follow the below steps -

Points I am going to cover here -
1. Generating a dummy dataset
2. Files which we need to move
3. Why we need these files only
4. Reading the dataset on another server


1. Generating a dummy dataset

I have created a dummy job which generates a dataset using the default APT_CONFIG_FILE, which has 2 nodes.

Here, I am generating 10 dummy rows with the help of a Row Generator stage and storing them into a dataset.

a. Config File - I am using the default config file (replaced the server name in "fastname" with serverA)

APT_CONFIG_FILE = /opt/IBM/InformationServer/Server/Configulations/default.apt

Check out the "resource disk" location in the config file; we need it for further processing.

RESOURCE DISK = /opt/IBM/InformationServer/Server/DataSets

b. dataset location - I have created this dataset in my home dir named dummy.ds

DATASET LOC = /home/atul/dummy.ds

Keep looking for next post........


Thursday, 16 June 2016

5 Tips For Better DataStage Design #14

1. The use of the Lookup stage depends upon the volume of data. A sparse lookup should be used when the primary input data volume is small. If the reference data volume is large, the Lookup stage should be avoided.

2. Using an ORDER BY clause in the database is better than using a Sort stage.

3. In DataStage Administrator, tune the 'Project Tunables' for better performance.

4. The Funnel stage reduces the performance of a job; where it is needed, it should be run in continuous mode.

5. If the hash file is used only for lookup, then enable 'Preload to memory'. This will improve the performance.


Tuesday, 7 June 2016

Tuning SQL Statement #4

Using Functions in Queries
When using functions in SQL statements, it is best to keep the functions away from any columns involving index matching, unless function-based indexes are available and can be created.

SELECT account_name, trans_date, amount
FROM transaction
WHERE account_name LIKE 'CAPITAL%';    

SELECT account_name, trans_date, amount
FROM transaction
WHERE SUBSTR(account_name,1,7) = 'CAPITAL';

A sequence is often used to create unique integer identifiers as primary keys for tables. A sequence is a distinct database object and is accessed as sequence.NEXTVAL and sequence.CURRVAL.
Use sequences for unique integer identifiers; they are centralized, more controllable, more easily maintained, and perform better than other methods of counting.

INSERT INTO supplier (supplier_id, name, ticker)
      VALUES (supplier_seq.NEXTVAL,'A new supplier', 'TICK');    

INSERT INTO supplier (supplier_id, name, ticker)
      VALUES ((SELECT MAX(supplier_id)+1
      FROM supplier), 'A new supplier', 'TICK');


Equi, Anti, and Range
Using an equals sign (equi) is the fastest comparison condition if a unique index exists. Any type of anti comparison such as != or NOT is looking for what is not in a table and thus must read the entire table; sometimes full index scans can be used.

SELECT account_name, trans_date, amount
FROM transaction
WHERE amount > 0;

SELECT account_name, trans_date, amount
FROM transaction
WHERE amount != 0;

SELECT account_name, trans_date, amount
FROM transaction
WHERE amount > 0;

SELECT account_name, trans_date, amount
FROM transaction
WHERE NOT (amount = 0);

LIKE Pattern Matching
In general, since LIKE will match patterns which are in no way related to indexes, LIKE will usually read an entire table.


Monday, 23 May 2016

Tuning SQL Statement #3

Tuning SQL Statement #1.......
Tuning SQL Statement #2.......

Consider whether a UNION ALL will suffice in place of a UNION.
The UNION statement performs the equivalent of a SELECT DISTINCT on the final result set. In other words, the UNION clause forces all rows returned by each portion of the UNION to be sorted and merged and duplicates to be filtered before the first row is returned. A UNION ALL simply returns all rows including duplicates and does not have to perform any sort, merge or filter. If your tables are mutually exclusive (include no duplicate records), or you don't care if duplicates are returned, the UNION ALL is much more efficient as it does not perform the SELECT DISTINCT function, which saves a lot of unnecessary resources from being used.

SELECT acct_num, balance_amt
FROM debit_transactions
WHERE tran_date = '1-JAN-06'
UNION ALL
SELECT acct_num, balance_amt
FROM credit_transactions
WHERE tran_date = '1-JAN-06';

SELECT acct_num, balance_amt
FROM debit_transactions
WHERE tran_date = '1-JAN-06'
UNION
SELECT acct_num, balance_amt
FROM credit_transactions
WHERE tran_date = '1-JAN-06';
The DECODE Function
Using the DECODE function as a replacement for multiple query set operators is good for performance, but should only be used in extreme cases, such as the UNION-joined SQL statements shown below.

SELECT stock_id||' '||
   DECODE(SIGN(qtyonhand)
      ,-1,'Out of Stock',0,'Out of Stock'
      ,1,DECODE(SIGN(qtyonhand-min)
         ,-1,'Under Stocked',0,'Stocked'
         ,1,DECODE(SIGN(qtyonhand-max)
            ,-1,'Stocked',0,'Stocked'
            ,1,'Over Stocked')))
FROM stock;

SELECT stock_id||' Out of Stock' FROM stock WHERE
   qtyonhand <= 0
UNION
SELECT stock_id||' Under Stocked' FROM stock
      WHERE qtyonhand BETWEEN 1 AND min-1
UNION
SELECT stock_id||' Stocked' FROM stock
WHERE qtyonhand BETWEEN min AND max
UNION
SELECT stock_id||' Over Stocked' FROM stock
      WHERE qtyonhand > max;


Datatype Conversions
Try to avoid using any type of data conversion function in any part of an SQL statement which could potentially match an index, especially if you are trying to assist performance by matching appropriate indexes.

SELECT ...
FROM customer
WHERE cust_id = TO_NUMBER('11');

SELECT ...
FROM customer
WHERE cust_id = '11';

SELECT ...
FROM customer
WHERE cust_type = 1;

SELECT ...
FROM emp
WHERE TO_NUMBER (cust_type) = 1;
