Monday, 30 November 2015

Monitoring Memory by DataStage Processes #1

Before we get into monitoring memory, let's be clear about why and when we have to monitor memory on the server.

Why & When?

  • Troubleshooting and to identify potential resource bottlenecks
  • Detect memory leaks
  • To check resource usage for better capacity planning
  • More memory, better performance

To monitor DataStage Memory Usage, we have to work on these 3 points -

1. Monitor memory leaks
               Analyzing memory usage can be useful in several scenarios. Some of the most common scenarios include identifying memory leaks. A memory leak is a type of bug that causes a program to keep increasing its memory usage indefinitely.

2. Tune job design
               Comparing the amount of memory different job designs consume can help you tune your designs to be more memory efficient.

3. Tune job scheduling
               The last scenario is to tune job scheduling. Collecting memory usage by processes over a period of time can help you organize job scheduling to prevent peaks of memory consumption.

Monitoring Memory Usage with ps Command -

- Simple command available in all UNIX/Linux platforms
- Basic syntax to monitor memory usage

ps -e -o pid,ppid,user,vsz,etime,args

Where -
pid   - process id
ppid  - parent's process id
user  - user that owns the process
vsz   - amount of virtual memory (in KB)
etime - elapsed time the process has been running
args  - command line that started the process
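For instance, the fields above can be combined with grep and sort to see which osh (parallel engine) processes are using the most virtual memory. This is a sketch; the `[o]sh` pattern is just a trick to keep grep from matching its own process:

```shell
# Top 10 osh processes by virtual memory size (vsz is the 4th field here)
ps -e -o pid,ppid,user,vsz,etime,args | grep '[o]sh' | sort -rn -k4 | head -10
```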

For other ps-based monitoring, see the post "Check Memory Utilization by DataStage processes".

More in the next post: Monitoring Memory by DataStage Processes #2

Like the below page to get updates!forum/datagenx

Wednesday, 25 November 2015

DataStage Scenario #10 - two realtime scenario

Hello guys, hoping you are enjoying solving these DataStage scenarios. Today I am going to ask two real-time scenarios. Try to solve these :-)

We have to design a job which will extract data from table tab1 only when we get some value in a file file1. There is no relation between the table and the file.
Simple, huh? Let's make it a little restricted: you cannot use a sequence job; all the functionality we need must be in a single parallel job.

When you are able to solve the first one, come to this -

We are reading source table Stab, which has 20 columns (Sc1, Sc2, Sc3....). We need to validate each column from Sc1 to Sc10 against the corresponding column Rc1 to Rc10 of another table Rtab (i.e. Sc1 with Rc1, Sc2 with Rc2, and so on). The condition is: if any column is invalid, the whole row is dropped and that column is captured in a single reject report. Design it in such a way that we get two rows in the reject file if two columns are invalid in a single input row.

Wish you luck !!


Monday, 23 November 2015

Python points #3 - Comparison

Friday, 20 November 2015

5 Tips For Better DataStage Design #5

#1. Use the Data Set Management utility, which is available in the Tools menu of the DataStage Designer or the DataStage Manager, to examine the schema, look at row counts, and delete a Parallel Data Set. You can also view the data itself.

#2. Use Sort stages instead of Remove Duplicates stages. The Sort stage has more grouping options and sort-indicator options.

#3. For quickly checking from UNIX whether a DS job is running on the server or not:
ps -ef | grep 'DSD.RUN'

#4. Make use of an ORDER BY clause when a DB stage is used in a join. The intention is to use the database's power for sorting instead of DataStage resources. Keep the join partitioning as Auto, and indicate the "don't sort, previously sorted" option between the DB stage and the Join stage using a Sort stage when relying on the ORDER BY clause.

#5. There are two types of variables - string and encrypted. If you create an encrypted environment variable, it will appear as the string "*******" in the Administrator tool.
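The check in tip #3 can be refined slightly (a sketch): the bracketed pattern stops grep from matching its own process, and wc -l then counts the active jobs:

```shell
# Count running DataStage jobs; [D] prevents grep from matching itself
ps -ef | grep '[D]SD.RUN' | wc -l
```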


Thursday, 19 November 2015

Python points #2 - Data Type & String Manipulations

Wednesday, 18 November 2015

Installing R on CentOS/RedHat Linux

For using RStudio on Linux, we need to set up some config files.

First, create these 2 config files
$ touch /etc/rstudio/rserver.conf /etc/rstudio/rsession.conf

Edit /etc/rstudio/rserver.conf to change the port and the listen address
$ vi /etc/rstudio/rserver.conf

# default port is 8787
www-port=8787
# listen address (defaults to 0.0.0.0)
www-address=0.0.0.0
Note that after editing the /etc/rstudio/rserver.conf file you should always restart the server to apply your changes (and validate that your configuration entries were valid). You can do this by entering the following command:

$ sudo rstudio-server restart

After restarting the RStudio server, you can access the RStudio tool in your browser via the URL below

generic URL -
http://<RStudio home Address>:<Port>


Tuesday, 17 November 2015

Python points #1 - Syntax

Monday, 16 November 2015

List DataStage jobs which used this Parameter

Open the DataStage Administrator Client

Click on the Projects tab and select the project you would like to generate a list for.

Click the Command Button

In the command entry box type:


<VARNAME> should be the name of the parameter or environment variable



Click Execute

If the output spans more than one page, click Next to page through it, and click Close when finished.

In this example, a job type of 3 is a parallel job. Valid job types value are:
0 = Server
1 = Mainframe
2 = Sequence
3 = Parallel


Sunday, 15 November 2015

Shell Script Scenario #7 - Anagram words

Two words are called anagrams when you can rearrange the letters of one to spell the other.
e.g. -
Coat and Taco
Heater and Reheat
Cloud and Could

So, write a script which accepts two words from the user and returns whether they are anagrams or not.
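If you want to check your answer afterwards, one possible approach (a sketch, not the only solution) is to normalize each word, lowercase it and sort its letters, and then compare the results:

```shell
# Sort the letters of a word after lowercasing it
normalize() {
    printf '%s' "$1" | tr 'A-Z' 'a-z' | fold -w1 | sort | tr -d '\n'
}

# Print "Anagram" or "Not anagram" for the two words given
is_anagram() {
    if [ "$(normalize "$1")" = "$(normalize "$2")" ]; then
        echo "Anagram"
    else
        echo "Not anagram"
    fi
}

is_anagram "Coat" "Taco"    # prints "Anagram"
```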


Tuesday, 10 November 2015

Check Memory Utilization by Datastage processes

When we run lots of DataStage jobs on a Linux DataStage server, or different environments share the same server, it can cause a resource crunch at the server side which affects job performance.

It's always preferable to keep an eye on resource utilization while jobs are running. Mostly, DataStage admins set up a cron job with a resource-monitoring script which is invoked every five minutes (or more), checks the resource statistics on the server, and notifies them accordingly.

The following processes run on the DataStage Engine server:

dsapi_slave - server side process for DataStage clients like Designer

osh - Parallel Engine process
DSD.StageRun - Server Engine Stage process
DSD.RUN - DataStage supervisor process used to initialize Parallel Engine and Server Engine jobs. There is one DSD.RUN process for each active job

ps auxw | head -1;ps auxw | grep dsapi_slave | sort -rn -k5  | head -10
atul 38846  0.0  0.0 103308   856 pts/0    S+   07:20   0:00 grep dsapi_slave

The example shown lists the top 10 dsapi_slave processes by memory utilization. We can substitute or add an appropriate grep argument, like osh, DSD.RUN, or even the user name that was used to invoke a DataStage task, to get a list that matches your criteria.
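The cron-based monitoring approach mentioned earlier could be sketched like this (the log path and the set of process patterns are illustrative assumptions, not a standard script):

```shell
#!/bin/sh
# Hypothetical monitoring script to be run from cron every few minutes:
# appends a timestamped snapshot of the top DataStage memory consumers to a log.
LOG="/tmp/ds_mem_$(date +%Y%m%d).log"   # assumed log location
{
    date
    ps auxw | head -1                               # column headers
    ps auxw | grep -E '[d]sapi_slave|[o]sh|[D]SD' \
        | sort -rn -k5 | head -10                   # top 10 by vsz
} >> "$LOG"
```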


Monday, 9 November 2015

DataStage Scenario #9 - Add Header & Trailer

         Design a job to add a Header and Trailer to the input data.

Example Input:
[sample input table not available]

Expected header line  : Employee Name
Expected trailer line : Employee Count : 11

Monday, 2 November 2015

5 Tips For Better DataStage Design #4

1) While using the AsInteger() function in a DataStage transformer, always trim the input column before passing it to the function, because extra spaces or unwanted characters can generate zeros where actual integer values are expected. We should use the APT_STRING_PADCHAR=0x20 (space) environment variable for fixed-field padding.

2) Len(col) will return an incorrect length if the input column contains non-ASCII or double-byte characters. Check your NLS settings for the job to fix this.

3) To remove embedded spaces from decimal data, use StripWhiteSpace(input.field) function to remove all spaces.

4) To get the DataStage job number, open the log view of the job in DataStage Director and double-click any entry of the log. The job number is listed under the field "Job Number:".

5) Set the two parameters APT_NO_PART_INSERTION and APT_NO_SORT_INSERTION to TRUE to stop DataStage from inserting partitioning or sorting operators at compile time, which can improve job performance. This also removes the warning "When checking operator: User inserted sort "<name>" does not fulfill the sort requirements of the downstream operator "<name>"".
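As a sketch, at the UNIX level these variables could simply be exported before invoking the job (in practice they are usually defined as project- or job-level parameters in the Administrator):

```shell
# Disable automatic partition and sort insertion by the parallel engine
export APT_NO_PART_INSERTION=TRUE
export APT_NO_SORT_INSERTION=TRUE
```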
