Monday, 28 December 2015

Python Points #5 - Lists

Thursday, 24 December 2015

Python Points #4 - Conditions

Wednesday, 23 December 2015

5 Tips For Better DataStage Design #6



#1. If you place a Copy or Filter stage immediately before or after a Transformer stage, you reduce efficiency by using more stages than necessary, because a Transformer can do the job of both a Copy stage and a Filter stage.

#2. Work done by the Copy stage:
a) Column order can be altered.
b) Columns can be dropped.
c) Column names can be changed.



#3. When you need to run the same series of jobs again and again, create a sequencer containing all the jobs you need to run. Running this sequencer runs all of them, and you can arrange the jobs in whatever order you require.

#4. Sort the data in the database as much as possible and reduce the use of the DataStage Sort stage for better job performance. In general, avoid doing work in DataStage that can be done in the database. That does not mean you have to push all the complexity into SQL, though; that is what DataStage is for.

#5. Ensure that all character fields are trimmed before any processing. Extra spaces in the data can lead to errors such as lookup mismatches, which are hard to detect.
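
To illustrate tip #5 outside DataStage, here is a minimal Python sketch of how stray spaces cause a lookup mismatch. The lookup table, the sample records and the helper name find_country are all hypothetical and are not part of any DataStage job.

```python
# Hypothetical reference data keyed by a clean code value.
lookup = {"IN": "India", "US": "United States"}

# Incoming records whose character field carries stray spaces.
records = ["IN ", " US", "IN"]


def find_country(code, table):
    """Return the matching country name, or None when the lookup misses."""
    return table.get(code)


# Without trimming, only the value that happens to be clean finds a match.
untrimmed = [find_country(code, lookup) for code in records]

# Trimming before the lookup makes every record match.
trimmed = [find_country(code.strip(), lookup) for code in records]

print(untrimmed)  # [None, None, 'India']
print(trimmed)    # ['India', 'United States', 'India']
```

The untrimmed misses raise no error at all, which is why this kind of mismatch is hard to detect.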





Like the pages below to get updates
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://groups.google.com/forum/#!forum/datagenx

Tuesday, 22 December 2015

Notepad++ tip - Find out the non-ascii characters


Working on some code and, when you try to compile or run it, arrrgh, you get a non-ASCII character error?
Here is how to track it down if you are using Notepad++ as your text editor.



1. Ctrl-F ( Search -> Find )
2. Put [^\x00-\x7F]+ in the search box
3. Select 'Regular expression' as the search mode
4. Voila !!

This will help you track down or replace all the non-ASCII characters in a text file.
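
If Notepad++ is not at hand, the same pattern works in Python. The sketch below is only an illustration; the function name find_non_ascii and the sample text are made up for this example.

```python
import re

# Same pattern as in the Notepad++ search box: one or more characters
# outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def find_non_ascii(text):
    """Return (line_number, column, matched_text) for every non-ASCII run."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in NON_ASCII.finditer(line):
            hits.append((line_no, match.start() + 1, match.group()))
    return hits


# Sample text with an accented letter and a pair of curly quotes.
sample = "print('ok')\nname = 'caf\u00e9'\nquote = \u201chello\u201d\n"

for hit in find_non_ascii(sample):
    print(hit)
```

Each hit gives a 1-based line and column, so you can jump straight to the offending character in any editor.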





Thursday, 17 December 2015

Python, IPython, Jupyter notebook, Graphlab Installation on Windows


In "Python Installation from Source in Linux" and "Data Science Tools Installation in Linux" we saw how to install these tools on Linux. Today we will learn how to set up these tools on Windows -




Python Installation:

1. Download the Python Windows installer from here -> https://www.python.org/downloads/release/python-2711/

2. Install it as you would install any other software on Windows

3. Now, set up the environment variable -
a. If you haven’t played with environment variables before, just follow these instructions; you can set them up through the Windows GUI.
b. Right click on "My Computer", select "Properties" > "Advanced system settings" and click on the "Environment Variables" button
c. In the System Variables box, find the variable called "path" and click on the "Edit…" button
d. In the "Variable value" box, at the end of the entry, add the following text: ;C:\Python27;C:\Python27\Scripts (change the path as per your installation)
e. Click "OK" a couple of times and hey presto, your environment variables are set up.
f. Open cmd and type the command 'python'. If you get the Python prompt, we are good; otherwise check the steps once again.

4. The next step is to set up easy_install. Go to the setuptools page (links to version 0.8) and download the ez_setup.py script. You can download it from here -> https://bitbucket.org/pypa/setuptools/raw/0.8/ez_setup.py and put it in the Python scripts directory (C:\Python27\Scripts)

5. Open a command prompt and type python ez_setup.py install. You’ll see a load of output whizz by, which will hopefully end as follows:

C:\Python27> python ez_setup.py install
Processing dependencies for setuptools==0.8
Finished processing dependencies for setuptools==0.8
C:\Python27>

6. easy_install has now been set up. You can test that it is there by typing easy_install into a command prompt; if it throws an error about no URLs, you know that the tool has been set up successfully.

To use easy_install to get new libraries, just use the following syntax: easy_install <library name>


IPython Installation:

C:\Python27> easy_install ipython

Jupyter notebook Installation:

C:\Python27\Scripts> pip install jupyter

You can run the Jupyter notebook as below -

C:\Python27\Scripts> jupyter notebook

Graphlab Create Installation:

C:\Python27\Scripts> pip install --upgrade --no-cache-dir https://pypi.python.org/packages/source/G/GraphLab-Create/GraphLab-Create-1.7.1.tar.gz#md5=caa4b1f78625a278dd016400d15bc5bd




Wednesday, 16 December 2015

Python Regular Expression quick guide





^ Matches the beginning of a line
$ Matches the end of the line
. Matches any character (except a newline, by default)
\s Matches whitespace
\S Matches any non-whitespace character
* Repeats a character zero or more times
*? Repeats a character zero or more times (non-greedy)
+ Repeats a character one or more times
+? Repeats a character one or more times (non-greedy)
[aeiou] Matches a single character in the listed set
[^XYZ]  Matches a single character not in the listed set
[a-z0-9] The set of characters can include a range
( Indicates where string extraction is to start
) Indicates where string extraction is to end
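
A few of these symbols in action, as a short Python sketch using the standard re module. The sample strings are invented for the example.

```python
import re

line = "From: alice@example.com Sat Jan  5 09:14:16 2008"

# ^ anchors the match at the start of the line, \S+ is one or more
# non-whitespace characters, and the parentheses mark the part to extract.
sender = re.findall(r"^From: (\S+)", line)
print(sender)   # ['alice@example.com']

# [0-9]+ uses a range inside a character set, repeated one or more times.
numbers = re.findall(r"[0-9]+", line)
print(numbers)  # ['5', '09', '14', '16', '2008']

# Greedy (+) versus non-greedy (+?) repetition.
tagged = "<b>bold</b>"
greedy = re.findall(r"<.+>", tagged)
lazy = re.findall(r"<.+?>", tagged)
print(greedy)   # ['<b>bold</b>']
print(lazy)     # ['<b>', '</b>']
```

The greedy form grabs as much text as it can, while the non-greedy form stops at the first place the rest of the pattern can match.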





Tuesday, 15 December 2015

How to use Universe Shell (uvsh) in DataStage?


In DataStage administration, we sometimes have to use the DataStage command line (UniVerse shell) to get information directly from the DataStage UniVerse database.

When accessing it from the command line, what a novice admin often does is -

$ uvsh
This directory is not set up for DataStage.
Would you like to set it up(Y/N)?   
Confused? What to do?

Always answer that question "no"; it means you're in the wrong place.
Always launch "uvsh" or "dssh" from one of two places - $DSHOME or inside a project directory. For the latter you're good to go; for the former you'll need to LOGTO your project name before you issue any SQL.



How to use UVSH?

## Change to $DSHOME
$ cd $DSHOME

## Source the dsenv file
$ . ./dsenv

## Run the uvsh command
$ $DSHOME/bin/uvsh

## At the uvsh prompt (">"), switch to your project
>LOGTO <project_name>

Many DataStage admins prefer to run commands from the DataStage Administrator client, or to use dssh instead of uvsh.

How to use DSSH?

## Change to $DSHOME and source the dsenv file
$ cd $DSHOME
$ . ./dsenv

## Run the dssh command
$ $DSHOME/bin/dssh

## At the dssh prompt (">"), switch to your project
>LOGTO <project_name>




Sunday, 13 December 2015

Swirl Learn R in R




Step 1:  Install R
Step 2:  Install RStudio
Step 3:  Install swirl in R or RStudio
> install.packages("swirl")              
Step 4: Using R
> library("swirl")
> swirl()            

Step 5: Install R courses from R Course Repository






Data Science Tools Installation in Linux

Friday, 11 December 2015

DataStage Scenario #11 - Get numeric or alphabets only



Goal - Extract the numeric part and the alphabetic part from a string, as below

Input:

Source
ATUL1234
SINGH374
I23S
C343LEAR





Output -

Source     Part1   Part2
ATUL1234   ATUL    1234
SINGH374   SINGH   374
I23S       IS      23
C343LEAR   CLEAR   343
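
To make the expected result concrete, here is a small Python sketch that reproduces the output table above. It is not a DataStage solution, only a statement of the goal in code; the function name split_alpha_numeric is hypothetical.

```python
def split_alpha_numeric(source):
    """Return (alphabetic part, numeric part) of a string, keeping order."""
    part1 = "".join(ch for ch in source if ch.isalpha())
    part2 = "".join(ch for ch in source if ch.isdigit())
    return part1, part2


for value in ["ATUL1234", "SINGH374", "I23S", "C343LEAR"]:
    part1, part2 = split_alpha_numeric(value)
    print(value, part1, part2)
```

In a DataStage job the same idea has to be expressed with stage functions, which is the actual exercise of this scenario.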











Wednesday, 9 December 2015

ps command #3 - Sorting



To sort the ps command output, we can use the ps --sort option (this is not the Linux sort command). More details can be found in the ps man page.

--sort spec     specify sorting order. Sorting syntax is [+|-]key[,[+|-]key[,...]] Choose a multi-letter key from the 
                STANDARD FORMAT SPECIFIERS section. The "+" is optional since default direction is increasing numerical
                or lexicographic order. Identical to k. 
                For example: ps jax --sort=uid,-ppid,+pid


ps command output - sorted by memory used ( high to low)

$ ps aux --sort -rss

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
atul     43584  0.0 16.0 633196 162468 ?       Sl   Dec05   1:01 evince /home/atul/Desktop/Learning/book.pdf
atul     17099  0.3 15.7 1244044 159208 ?      Sl   Dec04  10:00 /usr/lib64/firefox/firefox
root      2272  0.2  8.5 223428 86132 tty1     Ss+  Dec03  12:36 /usr/bin/Xorg :0 -br -verbose -audit 4 -auth /var/run/gdm/auth-for-gdm-kda2x4/database -nolisten tcp vt1
atul      2773  0.0  5.3 1199004 53952 ?       Sl   Dec03   1:18 nautilus
atul      2827  0.0  3.5 296192 36036 ?        Ss   Dec03   0:56 gnome-screensaver
atul     43834  0.0  1.2 990904 12892 ?        Sl   Dec05   1:45 /home/atul/Desktop/sublime_text
atul      2799  0.1  1.1 371080 11216 ?        S    Dec03   8:39 /usr/lib/vmware-tools/sbin64/vmtoolsd -n vmusr --blockFd 3
atul     22246  0.0  0.9 300112 10072 ?        Sl   Dec06   0:12 gnome-terminal
atul      2767  0.0  0.7 502416  7464 ?        Sl   Dec03   0:35 gnome-panel
atul     22937  0.0  0.7 305276  7364 ?        S    Dec06   0:00 gedit
atul      2811  0.0  0.6 324292  6332 ?        S    Dec03   0:00 python /usr/share/system-config-printer/applet.py
root     44117  0.0  0.6  50068  6132 ?        Ss   Dec05   0:02 /usr/sbin/restorecond -u
atul      2852  0.0  0.5 548844  5476 ?        S    Dec03   0:13 /usr/libexec/clock-applet --oaf-activate-iid=OAFIID:GNOME_ClockApplet_Factory --oaf-ior-fd=28
atul      2788  0.0  0.4 331480  5032 ?        S    Dec03   0:48 /usr/libexec/wnck-applet --oaf-activate-iid=OAFIID:GNOME_Wncklet_Factory --oaf-ior-fd=18
atul      2760  0.0  0.4 447048  4900 ?        Sl   Dec03   0:26 metacity
atul      2783  0.0  0.4 469076  4464 ?        Sl   Dec03   0:03 gpk-update-icon
atul      2817  0.0  0.3 262056  3608 ?        S    Dec03   0:01 bluetooth-applet



If you want the list from low to high, remove the '-' before the sort key

$ ps aux --sort rss

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         2  0.0  0.0      0     0 ?        S    Dec03   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S    Dec03   0:00 [migration/0]
root         4  0.0  0.0      0     0 ?        S    Dec03   0:05 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S    Dec03   0:00 [stopper/0]
root         6  0.0  0.0      0     0 ?        S    Dec03   0:02 [watchdog/0]
root         7  0.0  0.0      0     0 ?        S    Dec03   4:15 [events/0]
root         8  0.0  0.0      0     0 ?        S    Dec03   0:00 [events/0]
root         9  0.0  0.0      0     0 ?        S    Dec03   0:00 [events_long/0]
root        10  0.0  0.0      0     0 ?        S    Dec03   0:00 [events_power_ef]
root        11  0.0  0.0      0     0 ?        S    Dec03   0:00 [cgroup]
root        12  0.0  0.0      0     0 ?        S    Dec03   0:00 [khelper]
root        13  0.0  0.0      0     0 ?        S    Dec03   0:00 [netns]
root        14  0.0  0.0      0     0 ?        S    Dec03   0:00 [async/mgr]
root        15  0.0  0.0      0     0 ?        S    Dec03   0:00 [pm]
root        16  0.0  0.0      0     0 ?        S    Dec03   0:03 [sync_supers]
root        17  0.0  0.0      0     0 ?        S    Dec03   0:02 [bdi-default]


Sort ps output by pid -


$ ps aux --sort pid       # pid from low to high
$ ps aux --sort -pid      # pid from high to low
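
When the output has already been captured (for example in a log file), the same ordering can be reproduced afterwards. The Python sketch below parses "ps aux" style text and sorts it by a numeric column; the helper names parse_ps_aux and sort_by are hypothetical, and the sample rows are taken from the listings above.

```python
SAMPLE = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
atul      2773  0.0  5.3 1199004 53952 ?       Sl   Dec03   1:18 nautilus
atul     43584  0.0 16.0 633196 162468 ?       Sl   Dec05   1:01 evince /home/atul/Desktop/Learning/book.pdf
root         2  0.0  0.0      0     0 ?        S    Dec03   0:00 [kthreadd]
"""


def parse_ps_aux(text):
    """Turn 'ps aux' style output into a list of dicts keyed by header name."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        # COMMAND may contain spaces, so split at most len(header) - 1 times.
        fields = line.split(None, len(header) - 1)
        rows.append(dict(zip(header, fields)))
    return rows


def sort_by(rows, key, descending=False):
    """Sort rows on a numeric column, like --sort key or --sort -key."""
    return sorted(rows, key=lambda row: int(row[key]), reverse=descending)


rows = parse_ps_aux(SAMPLE)
for row in sort_by(rows, "RSS", descending=True):
    print(row["PID"], row["RSS"], row["COMMAND"])
```

This only handles integer columns such as PID, VSZ and RSS; letting ps do the sorting with --sort remains the simpler choice on a live system.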


GNU sort specifiers - 


STANDARD FORMAT SPECIFIERS

Here are the different keywords that may be used to control the output format (e.g. with option -o) or to sort the
selected processes with the GNU-style --sort option.

For example:  ps -eo pid,user,args --sort user

This version of ps tries to recognize most of the keywords used in other implementations of ps.

The following user-defined format specifiers may contain spaces: args, cmd, comm, command, fname, ucmd, ucomm, lstart,
bsdstart, start.

Some keywords may not be available for sorting.

CODE       HEADER   DESCRIPTION

%cpu       %CPU     cpu utilization of the process in "##.#" format. Currently, it is the CPU time used divided by the
                    time the process has been running (cputime/realtime ratio), expressed as a percentage. It will not
                    add up to 100% unless you are lucky. (alias pcpu).

%mem       %MEM     ratio of the process’s resident set size  to the physical memory on the machine, expressed as a
                    percentage. (alias pmem).

bsdstart   START    time the command started. If the process was started less than 24 hours ago, the output format is
                    " HH:MM", else it is "mmm dd" (where mmm is the three letters of the month).

bsdtime    TIME     accumulated cpu time, user + system. The display format is usually "MMM:SS", but can be shifted to
                    the right if the process used more than 999 minutes of cpu time.

c          C        processor utilization. Currently, this is the integer value of the percent usage over the lifetime
                    of the process. (see %cpu).

comm       COMMAND  command name (only the executable name). Modifications to the command name will not be shown.
                    A process marked <defunct> is partly dead, waiting to be fully destroyed by its parent. The output
                    in this column may contain spaces. (alias ucmd, ucomm). See also the args format keyword, the -f
                    option, and the c option.
                    When specified last, this column will extend to the edge of the display. If ps can not determine
                    display width, as when output is redirected (piped) into a file or another command, the output
                    width is undefined. (it may be 80, unlimited, determined by the TERM variable, and so on) The
                    COLUMNS environment variable or --cols option may be used to exactly determine the width in this
                    case. The w or -w option may be also be used to adjust width.

command    COMMAND  see args. (alias args, cmd).

cp         CP       per-mill (tenths of a percent) CPU usage. (see %cpu).

cputime    TIME     cumulative CPU time, "[dd-]hh:mm:ss" format. (alias time).

egroup     EGROUP   effective group ID of the process. This will be the textual group ID, if it can be obtained and
                    the field width permits, or a decimal representation otherwise. (alias group).

etime      ELAPSED  elapsed time since the process was started, in the form [[dd-]hh:]mm:ss.

euid       EUID     effective user ID. (alias uid).

euser      EUSER    effective user name. This will be the textual user ID, if it can be obtained and the field width
                    permits, or a decimal representation otherwise. The n option can be used to force the decimal
                    representation. (alias uname, user).

gid        GID      see egid. (alias egid).

lstart     STARTED  time the command started.

ni         NI       nice value. This ranges from 19 (nicest) to -20 (not nice to others), see nice(1). (alias nice).

pcpu       %CPU     see %cpu. (alias %cpu).

pgid       PGID     process group ID or, equivalently, the process ID of the process group leader. (alias pgrp).

pid        PID      process ID number of the process.

pmem       %MEM     see %mem. (alias %mem).

ppid       PPID     parent process ID.

rss        RSS      resident set size, the non-swapped physical memory that a task has used (in kiloBytes).
                    (alias rssize, rsz).

ruid       RUID     real user ID.

size       SZ       approximate amount of swap space that would be required if the process were to dirty all writable
                    pages and then be swapped out. This number is very rough!

start      STARTED  time the command started. If the process was started less than 24 hours ago, the output format is
                    "HH:MM:SS", else it is "  mmm dd" (where mmm is a three-letter month name).

sz         SZ       size in physical pages of the core image of the process. This includes text, data, and stack
                    space. Device mappings are currently excluded; this is subject to change. See vsz and rss.

time       TIME     cumulative CPU time, "[dd-]hh:mm:ss" format. (alias cputime).

tname      TTY      controlling tty (terminal). (alias tt, tty).

vsz        VSZ      virtual memory size of the process in KiB (1024-byte units). Device mappings are currently
                    excluded; this is subject to change. (alias vsize).








Tuesday, 8 December 2015

ps command #2 - Advance


ps command #1

Usually, when we monitor processes, we are looking for something that can impact server performance, or for one specific process. To do so, we grep the ps output -

This is how we can list all http processes -
$ ps aux | grep http
atul      7585  0.0  0.0 177676   592 ?        S    Dec06   0:00 /usr/libexec/gvfsd-http --spawner :1.7 /org/gtk/gvfs/exec_spaw/2
root     28848  0.0  0.0   2700   168 pts/0    D+   02:49   0:00 grep http

You can filter the ps command output by any keyword, as above.

There are some ps options which can give you a customized output -

To see every process on the system using standard syntax:
$ ps -e
$ ps -ef
$ ps -eF
$ ps -ely


To see every process on the system using BSD syntax:
$ ps ax
$ ps axu


To print a process tree:
$ ps -ejH
$ ps axjf


To get info about threads:
$ ps -eLf
$ ps axms


To get security info:
$ ps -eo euser,ruser,suser,fuser,f,comm,label
$ ps axZ
$ ps -eM

To see every process running as root (real & effective ID) in user format:
$ ps -U root -u root u


To see every process with a user-defined format:
$ ps -eo pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:14,comm
$ ps axo stat,euid,ruid,tty,tpgid,sess,pgrp,ppid,pid,pcpu,comm
$ ps -eopid,tt,user,fname,tmout,f,wchan   

Print only the process IDs of process syslogd:
$ ps -C syslogd -o pid=
 #ps -C <process_name> -o pid=


Print only the name of PID 42:
$ ps -p 42 -o comm=
  #ps -p <process_id> -o comm=







Monday, 7 December 2015

ps command #1 - Basic



ps is a Linux command for monitoring the processes that are consuming resources on the server. If you work on a Linux system, it is good to have at least a basic understanding of this command.

When run, the ps command takes a snapshot of the processes running at that moment and displays it on the terminal. The snapshot can be used to analyse system performance or to identify a problematic process that could put the system at risk.

ps Command:


When we run the plain ps command, it displays very basic information -
$ ps
  PID TTY          TIME CMD
22396 pts/0    00:00:00 su
22402 pts/0    00:00:00 bash
22417 pts/0    00:00:00 su
22420 pts/0    00:00:00 bash
23332 pts/0    00:00:00 ps

PID - process id
TTY - terminal in which process is running
TIME - total CPU time used so far
CMD - command


Let's try with one argument -f (full)
$ ps -f
UID        PID  PPID  C STIME TTY          TIME CMD
root     22396 22377  0 09:49 pts/0    00:00:00 su
root     22402 22396  0 09:49 pts/0    00:00:00 bash
root     22417 22402  0 09:50 pts/0    00:00:00 su
root     22420 22417  0 09:50 pts/0    00:00:00 bash
root     23337 22420  0 11:02 pts/0    00:00:00 ps -f

This output displays some more information -
UID - process owner user id 
PPID - parent process id
STIME - process start time


Let's play with some more arguments and see what the output looks like -

$ ps -ef
atul      7585     1  0 18:29 ?        00:00:00 /usr/libexec/gvfsd-http --spawner :1.7 /org/gtk/gvfs/exec_spaw/2
root     16991     1  0 Dec04 ?        00:00:00 /usr/sbin/bluetoothd --udev
atul     17099     1  0 Dec04 ?        00:09:26 /usr/lib64/firefox/firefox
atul     22246     1  0 19:28 ?        00:00:05 gnome-terminal
atul     22248 22246  0 19:28 ?        00:00:00 gnome-pty-helper
atul     22377 22246  0 19:35 pts/0    00:00:00 bash
root     22396 22377  0 19:35 pts/0    00:00:00 su
root     22402 22396  0 19:35 pts/0    00:00:00 bash
root     22417 22402  0 19:35 pts/0    00:00:00 su
root     22420 22417  0 19:35 pts/0    00:00:00 bash
atul     22937     1  0 20:06 ?        00:00:00 gedit
root     23282  1899  0 20:47 ?        00:00:00 /usr/libexec/hald-addon-rfkill-killswitch
root     24348  1810  0 22:07 ?        00:00:00 /sbin/dhclient -d -4 -sf /usr/libexec/nm-dhcp-client.action -pf /var/run/dhclient-eth1.pid -lf /var/lib/dhclient/dhclient-e738be73-e337-4f64-865e-aa936ac77c14-eth1.lease -cf /var/run/nm-dhclient-eth1.conf eth1
atul     27098  1236  2 22:23 ?        00:00:00 /usr/lib/rstudio-server/bin/rsession -u atul
root     27112     1  0 22:23 ?        00:00:00 /usr/libexec/fprintd


All processes
To see all processes on the system (along with the command line arguments used to start each process) you could use:

$ ps aux


Processes for User
To see all processes for a particular user (along with the command line arguments for each process) you could use:

$ ps U <username> u

$ ps U atul u
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
atul      2695  0.0  0.0 229128   672 ?        Sl   Dec03   0:00 /usr/bin/gnome-keyring-daemon --daemonize --login
atul      2705  0.0  0.1 253264  1888 ?        Ssl  Dec03   0:01 gnome-session
atul      2713  0.0  0.0  20040   128 ?        S    Dec03   0:00 dbus-launch --sh-syntax --exit-with-session
atul      2714  0.0  0.1  32476  1356 ?        Ssl  Dec03   0:01 /bin/dbus-daemon --fork --print-pid 5 --print-address 7 --session
atul      2732  0.0  0.3 133360  3636 ?        S    Dec03   0:06 /usr/libexec/gconfd-2
atul      2740  0.0  0.3 507280  3408 ?        Ssl  Dec03   0:26 /usr/libexec/gnome-settings-daemon
atul      2741  0.0  0.1 286220  1624 ?        Ss   Dec03   0:00 seahorse-daemon
atul      2746  0.0  0.0 137388   844 ?        S    Dec03   0:00 /usr/libexec/gvfsd
atul      2760  0.0  0.5 447048  5116 ?        Sl   Dec03   0:25 metacity
atul      2767  0.0  0.7 502416  7600 ?        Sl   Dec03   0:34 gnome-panel
atul      2769  0.0  0.3 450232  3156 ?        S<sl Dec03   0:35 /usr/bin/pulseaudio --start --log-target=syslog
atul      2772  0.0  0.0  94828   252 ?        S    Dec03   0:00 /usr/libexec/pulse/gconf-helper
atul      2773  0.0  5.6 1199004 57544 ?       Sl   Dec03   1:18 nautilus
atul      2775  0.0  0.0 696412   256 ?        Ssl  Dec03   0:00 /usr/libexec/bonobo-activation-server --ac-activate --ior-output-fd=18
atul      2778  0.0  0.2  30400  2212 ?        S    Dec03   0:00 /usr/sbin/restorecond -u
atul      2783  0.0  0.4 469076  4244 ?        Sl   Dec03   0:02 gpk-update-icon
atul      2786  0.0  0.0 146404   900 ?        S    Dec03   0:00 /usr/libexec/gvfs-gdu-volume-monitor
atul      2787  0.0  0.2 375072  2924 ?        S    Dec03   0:00 gnome-volume-control-applet
atul      2788  0.0  0.5 331480  5988 ?        S    Dec03   0:48 /usr/libexec/wnck-applet --oaf-activate-iid=OAFIID:GNOME_Wncklet_Factory --oaf-ior-fd=18
atul      2789  0.0  0.2 476996  2900 ?        Sl   Dec03   0:00 /usr/libexec/trashapplet --oaf-activate-iid=OAFIID:GNOME_Panel_TrashApplet_Factory --oaf-ior-fd=24

Process tree
A process tree shows the child/parent relationships between processes. (When a process spawns another process, the spawned is called a child process while the other is the parent)


$ ps afjx






Thursday, 3 December 2015

Python Installation from Source in Linux

Monday, 30 November 2015

Monitoring Memory by DataStage Processes #1



Before we start monitoring memory, we need to be clear about why and when we have to monitor memory on the server.

Why & When?

  • Troubleshooting and identifying potential resource bottlenecks
  • Detecting memory leaks
  • Checking resource usage for better capacity planning
  • More memory, better performance



To monitor DataStage Memory Usage, we have to work on these 3 points -

1. Monitor memory leaks
               Analyzing memory usage can be useful in several scenarios. One of the most common is identifying memory leaks. A memory leak is a type of bug that causes a program to keep increasing its memory usage indefinitely.

2. Tune job design
               Comparing the amount of memory different job designs consume can help you tune your designs to be more memory efficient.

3. Tune job scheduling
               The last scenario is to tune job scheduling. Collecting memory usage by processes over a period of time can help you organize job scheduling to prevent peaks of memory consumption.


Monitoring Memory Usage with ps Command -

- Simple command available on all UNIX/Linux platforms
- Basic syntax to monitor memory usage

ps -e -o pid,ppid,user,vsz,etime,args

Where  -
pid - process id
ppid - parent's process id
user - user that owns process
vsz - amount of virtual memory
etime - elapsed time process has been running

args - command line that started process
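
As a rough illustration of what can be done with these columns, the Python sketch below totals the virtual memory (vsz) per user from captured ps output. The function name vsz_by_user is hypothetical, and the sample rows are invented for the example; the elapsed times in particular are not real measurements.

```python
from collections import defaultdict

# Invented sample in the column order pid, ppid, user, vsz, etime, args.
SAMPLE = """\
  PID  PPID USER        VSZ     ELAPSED COMMAND
 2773     1 atul    1199004  5-02:10:44 nautilus
17099     1 atul    1244044  4-01:00:12 /usr/lib64/firefox/firefox
 2272  2250 root     223428  5-03:15:09 /usr/bin/Xorg :0 -br
"""


def vsz_by_user(ps_text):
    """Sum the vsz column per user, skipping the header line."""
    totals = defaultdict(int)
    for line in ps_text.strip().splitlines()[1:]:
        # args is the last column and may contain spaces.
        pid, ppid, user, vsz, etime, args = line.split(None, 5)
        totals[user] += int(vsz)
    return dict(totals)


print(vsz_by_user(SAMPLE))
```

On a live server the text could come from running the ps command above through Python's subprocess module; collecting such totals at intervals is one way to spot the memory peaks mentioned under job scheduling.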


Other ps monitoring -- Check Memory Utilization by Datastage processes

More will be in next post.  Monitoring Memory by DataStage Processes #2




Wednesday, 25 November 2015

DataStage Scenario #10 - two realtime scenario



Hello guys, I hope you are enjoying solving the DataStage Scenarios. Today I am going to pose two real-time scenarios. Try to solve these :-)

Scn1:
We have to design a job which extracts data from table tab1 when we get some value in a file, file1. There is no relation between the table and the file.
Simple, huh? Let's make it a little more restricted. You cannot use a sequencer job; all the functionality is needed in a single parallel job.





When you are able to solve the first one, come to this -

Scn2:
Read source table Stab, which has 20 columns (Sc1, Sc2, Sc3, ...). We need to validate each column from Sc1 to Sc10 against columns Rc1 to Rc10 of another table, Rtab (that is, Sc1 with Rc1, Sc2 with Rc2, and so on). The condition is: if any column is invalid, the whole row is dropped and that column is captured in a single reject report. Design the job so that we get two rows in the reject file if two columns are invalid in a single input row.

Wish you luck !!




Monday, 23 November 2015

Python points #3 - Comparison

Friday, 20 November 2015

5 Tips For Better DataStage Design #5



#1. Use the Data Set Management utility, which is available in the Tools menu of the DataStage Designer or the DataStage Manager, to examine the schema, look at row counts, and delete a Parallel Data Set. You can also view the data itself.

#2. Use Sort stages instead of Remove Duplicates stages. The Sort stage has more grouping options and sort indicator options.

#3. For a quick check from UNIX of whether a DS job is running on the server:
ps -ef | grep 'DSD.RUN'



#4. Make use of an ORDER BY clause when a DB stage is used in a join. The intention is to use the database's power for sorting instead of DataStage resources. Keep the join partitioning as Auto. When using the ORDER BY clause, indicate the "don't sort" option with a Sort stage between the DB stage and the Join stage.

#5. There are two types of environment variables - string and encrypted. If you create an encrypted environment variable, it will appear as the string "*******" in the Administrator tool.






Thursday, 19 November 2015

Python points #2 - Data Type & String Manipulations

Wednesday, 18 November 2015

Installing R on CentOS/RedHat Linux





To use RStudio on Linux, we need to set up some config files.

First, create these 2 config files
$ touch /etc/rstudio/rserver.conf /etc/rstudio/rsession.conf

Edit /etc/rstudio/rserver.conf to change the port and the home address
$ vi /etc/rstudio/rserver.conf 

#default port is 8787

www-port=80
www-address=127.0.0.1

Note that after editing the /etc/rstudio/rserver.conf file you should always restart the server to apply your changes (and validate that your configuration entries were valid). You can do this by entering the following command:

$ sudo rstudio-server restart

After restarting the RStudio server, you can access RStudio in your browser at the URL below

http://127.0.0.1:80/

generic URL -
http://<RStudio home Address>:<Port>
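
As a small cross-check, the Python sketch below reads key=value settings like the ones shown above and builds the URL from them. It is only an illustration: the function name parse_conf is hypothetical, and the parsing assumes nothing beyond the "#" comments and key=value lines that appear in this post.

```python
def parse_conf(text):
    """Parse simple key=value lines, ignoring blanks and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


# The same content as the rserver.conf example in this post.
conf_text = "#default port is 8787\n\nwww-port=80\nwww-address=127.0.0.1\n"
conf = parse_conf(conf_text)

# Fall back to the values mentioned in the post when a key is missing.
url = "http://{}:{}/".format(
    conf.get("www-address", "127.0.0.1"),
    conf.get("www-port", "8787"),
)
print(url)  # http://127.0.0.1:80/
```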






Tuesday, 17 November 2015

Python points #1 - Syntax

Monday, 16 November 2015

List DataStage jobs which used this Parameter



Open the DataStage Administrator Client

Click on the Projects tab and select the project you would like to generate a list for.

Click the Command Button

In the command entry box type:

LIST DS_JOBS WITH JOBTYPE = 3 AND EVAL "TRANS('DS_JOBOBJECTS','J\':@RECORD<5>:'\ROOT',14,'X')" LIKE ...<VARNAME>...

<VARNAME> should be the name of the parameter or environment variable



Example:

LIST DS_JOBS WITH JOBTYPE = 3 AND EVAL "TRANS('DS_JOBOBJECTS','J\':@RECORD<5>:'\ROOT',14,'X')" LIKE ...TMPDIR...

Click Execute

If the output is on more than one page, click Next to page down, and click Close when finished.


In this example, a job type of 3 is a parallel job. Valid job type values are:
0 = Server
1 = Mainframe
2 = Sequence
3 = Parallel




