Sunday, 31 January 2016

Create a Daemon to Trace New Processes


Description

The following code can be used to create a daemon that watches for processes that show up in the "ps -ef" output with certain characteristics. When it identifies such processes, it attaches to them with a trace utility (i.e. strace, truss, or tusc; you must change the code to reflect whichever applies on the platform where this is run). The tool does not follow these processes with a fork flag, since it will itself trace any children that match the same "ps -ef" characteristics. This makes it useful for tracing DS PX programs that contain rsh, since truss's fork flag (i.e. "-f") blocks the rsh from executing.



Usage

The script below should be saved to a file such as /tmp/tracer.sh and given rwx permissions.  The trace utility name that is appropriate for your platform should be altered in the "ps -ef" command and in the "for" loop.  The script is then run using this syntax:
    /tmp/tracer.sh <search string>
As mentioned above, the search string can be any value that would appear in the "ps -ef" output.  Such values might be a user ID, a particular time, a command, or arguments to a command.  Two lines of the script (the fifth and eighth in the original listing) gather the list of all commands to be traced and then attempt to remove commands that should be ignored.  If you find too many processes getting traced, identify why each was selected and then alter these two lines by adding a "grep -v" for the items being ignored.
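
Since the original listing was embedded on the blog and is not reproduced here, below is only a minimal sketch of the idea, assuming strace on Linux (swap in truss or tusc as needed); the file names and sleep interval are assumptions:

#!/bin/sh
# tracer.sh - watch "ps -ef" for processes matching a search string and
# attach a tracer to each (a sketch; replace strace with truss/tusc).
PATTERN="$1"
[ -z "$PATTERN" ] && { echo "Usage: $0 <search string>"; exit 1; }
while :; do
    # Candidate PIDs: matching lines, minus this script, grep, and strace.
    # Add more "grep -v" filters here if too many processes get selected.
    pids=`ps -ef | grep "$PATTERN" | grep -v grep | grep -v strace | grep -v tracer | awk '{print $2}'`
    for pid in $pids; do
        # Skip PIDs we are already tracing (strace -o creates this file).
        [ -f "/tmp/trace.$pid" ] && continue
        strace -p "$pid" -o "/tmp/trace.$pid" &
    done
    sleep 1
done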










Friday, 29 January 2016

Where to get DataStage Scheduled jobs ?


Did you guys ever try to find out how the scheduler works when we schedule jobs with the DataStage scheduler? Interesting, na!!

So, let's have a discussion about how this happens.

DataStage itself doesn't contain any scheduling system or application; it uses the operating system's own job scheduler.



Unix/Linux-
If the DataStage server is on a Linux/Unix/*nix OS, those schedulers are "at" and "cron".

'at -lv' or 'at -l' will show any jobs that are scheduled to run once; 'crontab -l' will show jobs that are scheduled to repeat.
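
For example, from a shell on the engine host (the dsadm user name below is only an assumption; check whichever account the schedules were created under):

$ at -l                 # one-time jobs for the current user
$ crontab -l            # recurring jobs for the current user
# crontab -l -u dsadm   # run as root to list another user's crontab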


Windows-
On Windows, the jobs become entries in the Scheduled Tasks control panel. The GUI view doesn't tell you much, but we can extract and dump this information to see it better.

The 'schtasks /query /v /fo list' command gives you properly formatted information about the scheduled jobs.


Windows has nice output format options and (with the verbose switch) lots of info, including the next run time, last run time and last result, so you can tell whether a job actually ran when it was supposed to.  You may need to use a batch script to filter out the specific entry.







Thursday, 28 January 2016

How to get table list used in DataStage jobs ?


While developing jobs in DataStage, we sometimes face the requirement to get a list of all the tables used by our DataStage jobs, but unfortunately there is no direct way to get that.

DataStage does not provide a command which can give us the table list. But there are some ways by which we can get it. All of the approaches we are going to discuss need a one-time setup or some development.




1) Setting up a universe query -

The DataStage repository can be queried from the Administrator command line, and its repository tables (such as DS_JOBS and DS_JOBOBJECTS) hold the job design metadata. We can tweak such a query to get the table list for all the DataStage jobs.

2) Parsing job export XML -
a) We can parse the tables from the job export XML file: write a shell script that parses the XML to get the table names (see the sketch after this list)
b) Or we can develop a DataStage job which reads this XML and parses out all the tables

Make use of these practices during development - 

3) While developing a DataStage project, make it a practice to maintain a table that records which jobs use which tables. This will help a lot afterward. 

4) Before using any table in any job, import its metadata into the DataStage Repository folder. This will help you do the Usage Analysis afterward.
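
As a starting point for option 2(a), here is a hypothetical sketch; the "TableName" property name is an assumption that varies by stage type, and the -o flag assumes GNU grep:

#!/bin/sh
# tables_from_export.sh - pull table-name properties out of a DataStage
# job export XML (a sketch; adjust the property filter for your stages).
# Usage: ./tables_from_export.sh export.xml
grep -o '<Property Name="TableName"[^>]*>[^<]*</Property>' "$1" \
    | sed -e 's/<[^>]*>//g' \
    | sort -u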




Wednesday, 27 January 2016

Python Points #8 - Dictionary

Sunday, 24 January 2016

ps command #4 - Sorting with sort command



We can also sort the ps command output with the Unix sort command, which is easy to use. Just pass the ps command output to sort with the proper arguments and voila!! You will get the output the way you want.

Let's see how this works [ sort command arguments can differ per your Linux flavour and version ]
 I am using - CentOS 6.3


1. Display the top CPU consuming processes (Column 3 - %CPU)
$ ps aux | head -1; ps aux | sort -k3 -nr |grep -v 'USER'| head
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND 
atul     21210  2.0  0.1 110232  1140 pts/3    R+   00:13   0:00 ps aux
hduser    2671  0.8  4.1 960428 42436 pts/1    Sl+  Aug22   5:29 mongod
root      1447  0.2  0.3 185112  3384 ?        Sl   Aug22   1:36 /usr/sbin/vmtoolsd
atul      2478  0.2  2.1 448120 21876 ?        Sl   Aug22   1:51 /usr/lib/vmware-tools/sbin64/vmtoolsd -n vmusr --blockFd 3
rtkit     2359  0.1  0.1 168448  1204 ?        SNl  Aug22   0:44 /usr/libexec/rtkit-daemon
root         7  0.1  0.0      0     0 ?        S    Aug22   0:53 [events/0]
root      2204  0.1  4.3 147500 43872 tty1     Ss+  Aug22   0:45 /usr/bin/Xorg :0 -nr -verbose -audit 4 -auth /var/run/gdm/auth-for-gdm-wEmBs1/database -nolisten tcp vt1
root       920  0.0  0.0      0     0 ?        S    Aug22   0:00 [bluetooth]
root         9  0.0  0.0      0     0 ?        S    Aug22   0:00 [khelper]
root         8  0.0  0.0      0     0 ?        S    Aug22   0:00 [cgroup] 

For my Linux, the sort command arguments are --
-kn  ==> selects column n; for example, -k4 selects column 4
-n   ==> treat the column as numeric
-r   ==> reverse order

sort -k3 -nr ==> sort the third column numerically in reverse order (largest to smallest)
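
As an aside, on procps-based Linux systems ps can do the sorting itself, which avoids counting columns (a convenience; the sort-based approach above is more portable):

$ ps aux --sort=-%cpu | head    # top CPU consumers
$ ps aux --sort=-%mem | head    # top memory consumers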

2. Display the top 10 memory consuming processes (Column 4 - %MEM)
$ ps aux | head -1; ps aux | sort -k4 -nr |grep -v 'USER'| head
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND 
root      2204  0.1  4.3 147500 43872 tty1     Ss+  Aug22   0:46 /usr/bin/Xorg :0 -nr -verbose -audit 4 -auth /var/run/gdm/auth-for-gdm-wEmBs1/database -nolisten tcp vt1
hduser    2671  0.8  4.1 960428 42436 pts/1    Sl+  Aug22   5:32 mongod
atul      2458  0.0  2.3 943204 23624 ?        S    Aug22   0:16 nautilus
atul      2516  0.0  2.2 275280 22316 ?        Ss   Aug22   0:06 gnome-screensaver
atul      2478  0.2  2.1 448120 21876 ?        Sl   Aug22   1:52 /usr/lib/vmware-tools/sbin64/vmtoolsd -n vmusr --blockFd 3
atul      2507  0.0  1.6 321388 16680 ?        S    Aug22   0:01 python /usr/share/system-config-printer/applet.py
atul      2589  0.0  1.4 292556 14600 ?        Sl   Aug22   0:14 gnome-terminal
atul      2536  0.0  1.3 395832 13372 ?        S    Aug22   0:00 /usr/bin/gnote --panel-applet --oaf-activate-iid=OAFIID:GnoteApplet_Factory --oaf-ior-fd=22
atul      2502  0.0  1.3 474620 13952 ?        Sl   Aug22   0:01 gpk-update-icon
atul      2537  0.0  1.2 459964 12736 ?        S    Aug22   0:10 /usr/libexec/clock-applet --oaf-activate-iid=OAFIID:GNOME_ClockApplet_Factory --oaf-ior-fd=34
 




3. Display the processes by CPU time (Column 4 - TIME)
$ ps vx | head -1; ps vx | sort -k4 -r| grep -v 'PID' | head
PID TTY      STAT   TIME  MAJFL   TRS   DRS   RSS %MEM COMMAND
 2478 ?        Sl     1:52    351   593 447526 21876  2.1 /usr/lib/vmware-tools/sbin64/vmtoolsd -n vmusr --blockFd 3
 2458 ?        S      0:16    228  1763 941440 23624  2.3 nautilus
 2589 ?        Sl     0:15     28   296 292259 14600  1.4 gnome-terminal
 2421 ?        Ssl    0:14     22    34 500541 9676  0.9 /usr/libexec/gnome-settings-daemon
 2479 ?        S      0:13     23   403 310472 11996  1.1 nm-applet --sm-disable
 2537 ?        S      0:10     37   168 459795 12736  1.2 /usr/libexec/clock-applet --oaf-activate-iid=OAFIID:GNOME_ClockApplet_Factory --oaf-ior-fd=34
 2444 ?        Ssl    0:10     25    64 445791 4872  0.4 /usr/bin/pulseaudio --start --log-target=syslog
 2445 ?        S      0:07     51   593 322206 12684  1.2 gnome-panel
 2516 ?        Ss     0:06      4   151 275128 22316  2.2 gnome-screensaver
 2522 ?        Sl     0:05      5    41 231870 1960  0.1 /usr/libexec/gvfs-afc-volume-monitor

 


4. Display the top 10 real memory usage processes (Column 8 - RSS)
$ ps vx | head -1; ps vx | sort -k8 -nr| grep -v 'PID' | head 
PID TTY      STAT   TIME  MAJFL   TRS   DRS   RSS %MEM COMMAND
 2458 ?        S      0:16    228  1763 941440 23624  2.3 nautilus
 2516 ?        Ss     0:06      4   151 275128 22316  2.2 gnome-screensaver
 2478 ?        Sl     1:52    351   593 447526 21876  2.1 /usr/lib/vmware-tools/sbin64/vmtoolsd -n vmusr --blockFd 3
 2507 ?        S      0:01     73     2 321385 16680  1.6 python /usr/share/system-config-printer/applet.py
 2589 ?        Sl     0:15     28   296 292259 14600  1.4 gnome-terminal
 2502 ?        Sl     0:01     29   257 474362 13952  1.3 gpk-update-icon
 2536 ?        S      0:00     92  1607 394224 13372  1.3 /usr/bin/gnote --panel-applet --oaf-activate-iid=OAFIID:GnoteApplet_Factory --oaf-ior-fd=22
 2537 ?        S      0:10     37   168 459795 12744  1.2 /usr/libexec/clock-applet --oaf-activate-iid=OAFIID:GNOME_ClockApplet_Factory --oaf-ior-fd=34
 2445 ?        S      0:07     51   593 322206 12684  1.2 gnome-panel
 2438 ?        Sl     0:03     30   542 433105 12512  1.2 metacity


Like the above examples, you can create many one-liners of your own. But before using any of the above commands, check your ps and sort command behaviour, then use them.
Almost every platform has its own arguments for ps and sort, but the basics are the same. To sort any command's output by a particular column, first understand that output/column and then use the sort command.

I hope you find this helpful. Keep Learning !!






Friday, 22 January 2016

Shell Script for getting lines after and before of search String




Sometimes when we search for a text string in a Unix environment, we also need the lines which come before or after the matched string.
The small shell script below does exactly that: it returns the output in one file, and the user chooses how many lines to print before and after any matched string.





This script takes as input a search string and the number of lines you want to print before and after each matched line, and gives its output in search.txt
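
The original listing was embedded on the blog and is not shown here; a minimal sketch of the same idea follows, assuming a grep that supports the -A/-B context flags (GNU grep does) - the script name and argument order are assumptions:

#!/bin/sh
# around.sh - print N lines before and after each match, into search.txt.
# Usage: ./around.sh <search string> <file> <lines before> <lines after>
STR="$1"; FILE="$2"; BEFORE="$3"; AFTER="$4"
grep -B "$BEFORE" -A "$AFTER" -- "$STR" "$FILE" > search.txt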







Wednesday, 20 January 2016

Using your python library



How do you use your own functions or routines when coding in Python?


Follow the steps below - 
1. Create a folder where you can put all reusable code/function/routine files
2. Let's say it is "routines"
3. Now suppose you have written all your functions, saved them in a file myfunc.py, and put that file in the routines folder
4. Add the routines folder to Python's module search path. Note that the variable Python actually reads is PYTHONPATH; adding the folder to PATH only affects where the shell finds executables, not where Python finds modules.
for linux:
Edit your .bash_profile (typically towards the end) and add the following line
export PYTHONPATH=$PYTHONPATH:'/path/to/routines'
where you put the correct path in the appropriate location




How to use your functions in your code:-
1.  import that module into your Python script/session with a command like
import myfunc

2.  to use a function "my_sqrt" from your library myfunc
x = myfunc.my_sqrt(val)

3.  you can also create an alias for your library
import myfunc as mf
x = mf.my_sqrt(val)

If you want to import a particular piece from the library, use
from myfunc import my_sqrt
x = my_sqrt(val)

this is tedious if you have to import several, so use
from myfunc import *
x = my_sqrt(val)

But remember: if you import from multiple libraries this way and they have functions with the same name, the one imported last silently replaces the other.
In that case, import the modules and call the functions with the module prefix, as in step 2






Tuesday, 19 January 2016

Python Points #7 - Loops

Sunday, 17 January 2016

Count of Jobs - A Quick DataStage Recipe



What to Cook:
How to count the number of jobs in a DS project

Ingredients:
Use the dsjob "-ljobs" command



How to Cook:
Go to the DS Administrator "Projects" tab
Click on the "Command" button
Enter the following command to execute:
SH -c "dsjob -ljobs <Project Name> | wc -l"

<Project Name> - Enter your project name
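
The same count can also be taken from a shell on the engine host; a sketch, assuming a default install where dsjob lives under $DSHOME/bin and "MyProject" stands in for your project name:

$ . $DSHOME/dsenv                              # load the DataStage environment
$ $DSHOME/bin/dsjob -ljobs MyProject | wc -l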







Thursday, 14 January 2016

Data Wrangling Cheatsheet

Wednesday, 13 January 2016

DataStage Scenario #12 - Combine data column wise



Input 1:
Col1 Col2 Col3
1    2    3
4    5    6
7    8    9

Input 2:
Col4 Col5 Col6
a    b    c
d    e    f
g    h    i

Expected Output:
Col1 Col2 Col3 Col4 Col5 Col6
1    2    3    a    b    c
4    5    6    d    e    f
7    8    9    g    h    i





Tuesday, 12 January 2016

Python Points #6 - Strings

Monday, 11 January 2016

Cartesian Join - A Quick DataStage Recipe


What to Cook:
How to do a Cartesian join -> join every row of one table to every row of the other table





Ingredients:
Use "column generator" stage

How to Cook:
Add both tables to the DS job
Add a dummy column to each table using a "Column Generator" stage. Make sure the dummy column values are the same for both tables.
Join both tables using these dummy columns







Sunday, 10 January 2016

A Quick DataStage Recipe


Under this series, I am trying to cook up quick solutions for DataStage problems, issues, and technical implementations of re-usable logic which we face in day-to-day tasks.

Hope you will find them useful and helpful. Keep watching this space.

A Quick DataStage Recipe -> http://www.datagenx.net/search/label/aQDsR?max-results=12






Friday, 8 January 2016

5 Tips For Better DataStage Design #7



#1. If the partition type for the next immediate stage is to be changed, then 'Propagate partition' should be set to 'Clear' in the current stage.

#2. Make sure that appropriate partitioning and sorting are used in the stages wherever possible; this enhances performance. Make sure that you understand the partitioning being used; otherwise, leave it set to Auto.

#3. For fixed-width files, the final delimiter should be set to 'none' in the file format properties.

#4. If a processing stage requires a key (like Remove Duplicates, Merge, Join, etc.), the stage keys, sorting keys and partitioning keys should be the same and in the same order.

#5. To improve Funnel performance, all the input links must be hash-partitioned on the sort keys.






Monday, 4 January 2016

Monitoring Memory by DataStage Processes #2



You can find other parts here -> Monitoring Memory by DataStage Processes #1



Continuously capturing memory usage of all osh processes -


$ while :; do date; ps -e -o pid,ppid,user,vsz,time,args | grep -v grep | grep osh; sleep 3; done

To stop, press Ctrl-C.

osh processes are the processes created by DataStage parallel jobs. This command is used to monitor all osh processes over a period of time. In this example, the grep command filters processes containing the string "osh", but this can be modified if you want to filter processes by something else, such as a user ID or PPID. The loop iterates every 3 seconds; adjust the value after the sleep command to change this. The command will continue to run until you press Ctrl-C.
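
To let the same loop run unattended, it can be backgrounded with its output redirected to a log file (the file name here is only an assumption):

$ while :; do date; ps -e -o pid,ppid,user,vsz,time,args | grep -v grep | grep osh; sleep 3; done >> /tmp/osh_mem.log 2>&1 &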



Monitoring only new processes:

Sometimes it is difficult to find a common string to filter the processes you want to monitor. In those cases, and assuming that you can reproduce the problem or condition you want to analyze, you can use this script to keep track of all new processes.

The script 'ps_new.sh' below helps us monitor the processes created after the script starts. With it, we can specifically watch the processes of a DataStage job that is started after the script begins execution.

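The original ps_new.sh listing was embedded on the blog and is not reproduced here; the following is a minimal sketch of the idea (file names and the sleep interval are assumptions):

#!/bin/sh
# ps_new.sh - log only the processes that appear after this script starts.
OUT=ps_new.out
SEEN=/tmp/ps_seen.$$
NOW=/tmp/ps_now.$$
ps -e -o pid= | sort > "$SEEN"             # baseline: PIDs already running
trap 'rm -f "$SEEN" "$NOW"; exit 0' INT TERM
while :; do
    ps -e -o pid= | sort > "$NOW"
    # PIDs present now but never seen before are new processes
    for pid in `comm -13 "$SEEN" "$NOW"`; do
        { date; ps -p "$pid" -o pid,ppid,user,vsz,time,args; } >> "$OUT" 2>/dev/null
    done
    sort -u "$SEEN" "$NOW" -o "$SEEN"      # remember everything seen so far
    sleep 1
done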

How to Use this script?

1. Run the script ps_new.sh
     ./ps_new.sh
2. Start the datastage job or reproduce the issue
3. Press Ctrl-C to stop the script ps_new.sh
4. Analyse the output file generated by the script





