Monday, 28 December 2015

Python Points #5 - Lists

Thursday, 24 December 2015

Python Points #4 - Conditions

Wednesday, 23 December 2015

5 Tips For Better DataStage Design #6



#1. If you place a Copy or Filter stage immediately before or after a Transformer stage, you are adding stages and reducing efficiency, because a Transformer can do the work of both the Copy stage and the Filter stage.

#2. Work done by the Copy stage:
a) Column order can be altered.
b) Columns can be dropped.
c) Column names can be changed.



#3. When you need to run the same set of jobs again and again, create a job sequence that contains all of them. Running the sequence runs every job in it, in the order you define.

#4. Sort the data in the database as much as possible and reduce the use of the DataStage Sort stage for better job performance. In general, avoid doing work in DataStage that the database can do. That does not mean you should push all the complexity into SQL; handling that is what DataStage is for.

#5. Ensure that all character fields are trimmed before any processing. Extra spaces in the data can cause errors, such as lookup mismatches, that are hard to detect.
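
To see why tip #5 matters, here is a small Python illustration of the problem (not DataStage code; the lookup table and the incoming values are made up for the example). A key with a stray space simply does not match:

```python
# Reference data keyed by a clean code (made-up values)
lookup = {"IN": "India", "US": "United States"}

# Incoming values carrying stray leading/trailing spaces
incoming = ["IN ", " US", "IN"]

# Without trimming, only the already-clean value finds a match
untrimmed_hits = [code in lookup for code in incoming]

# After trimming, every value matches
trimmed_hits = [code.strip() in lookup for code in incoming]

print(untrimmed_hits)
print(trimmed_hits)
```

The mismatch raises no error; the rows just fail the lookup, which is why it is hard to spot.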





Like the pages below to get updates
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://groups.google.com/forum/#!forum/datagenx

Tuesday, 22 December 2015

Notepad++ tip - Find out the non-ascii characters


You are working on some code, and when you try to compile or run it you get a non-ASCII character error.
Here is how to track those characters down if you use Notepad++ as your text editor.



1. Ctrl-F ( Search -> Find )
2. Put [^\x00-\x7F]+ in the 'Find what' box
3. Select 'Regular expression' as the search mode
4. Voila !!

This will help you find or replace all non-ASCII characters in a text file.
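
The same pattern works outside Notepad++ too. A minimal Python sketch, assuming the text is already loaded as a string (the helper name `find_non_ascii` and the sample text are only for illustration):

```python
import re

# Same pattern as in the Notepad++ search box: one or more characters
# outside the ASCII range 0x00-0x7F.
NON_ASCII = re.compile(r'[^\x00-\x7F]+')

def find_non_ascii(text):
    """Return (line_number, column, matched_text) for every non-ASCII run."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in NON_ASCII.finditer(line):
            # Columns are reported 1-based, like an editor status bar
            hits.append((line_no, match.start() + 1, match.group()))
    return hits

# Made-up sample: curly quotes pasted into otherwise plain code
sample = u"print('ok')\nname = \u201cAtul\u201d\n"
for line_no, column, found in find_non_ascii(sample):
    print("line %d, column %d: %r" % (line_no, column, found))
```

Line and column numbers make it easy to jump to the offending character in any editor.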





Thursday, 17 December 2015

Python, IPython, Jupyter notebook, Graphlab Installation on Windows


In "Python Installation from Source in Linux" and "Data Science Tools Installation in Linux" we saw how to install these tools on Linux. Today we will learn how to set up these tools on Windows -




Python Installation:

1. Download the Python Windows installer from here -> https://www.python.org/downloads/release/python-2711/

2. Install it like any other Windows software

3. Now, set up the environment variable -
a. If you haven’t played with environment variables before, just stick to these instructions; you can set them up through the Windows GUI.
b. Right click on "My Computer", select "Properties" > "Advanced system settings" and click on the "Environment Variables" button
c. In the System Variables box, find the variable called "path" and click on the "Edit…" button
d. In the "Variable value" box, at the end of the entry, add the following text: ;C:\Python27;C:\Python27\Scripts (change the path as per your installation)
e. Click "OK" a couple of times and hey presto, your environment variables are set up.
f. Open cmd and type 'python'. If you get the Python prompt, you are good; otherwise, check the steps once again.

4. The next step is to set up easy_install, so go to the setuptools page (links to version 0.8) and download the ez_setup.py script. You can download it from here -> https://bitbucket.org/pypa/setuptools/raw/0.8/ez_setup.py and put it in the Python scripts directory (C:\Python27\Scripts)

5. Open a command prompt and type python ez_setup.py install. You’ll see a load of output whizz by, which will hopefully end as follows:

C:\Python27> python ez_setup.py install
Processing dependencies for setuptools==0.8
Finished processing dependencies for setuptools==0.8
C:\Python27>

6. easy_install has now been set up. You can test that it is there by typing easy_install into a command prompt; if it throws an error about no URLs, the tool has been set up successfully.

To use easy_install to get new libraries, just use the following syntax: easy_install <library name>
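
If `python` or `easy_install` is not recognised at the prompt, the PATH entry is the usual suspect. Here is a small sketch that lists which PATH directories actually contain a given file. The helper name `dirs_containing` is made up for this example, and the `.exe` names assume a standard Windows install:

```python
import os

def dirs_containing(filename, path_value=None):
    """Return the PATH directories that contain `filename`."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    found = []
    # os.pathsep is ';' on Windows and ':' on Linux
    for directory in path_value.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            found.append(directory)
    return found

# Assumed executable names for a Windows installation
for exe in ("python.exe", "easy_install.exe"):
    print("%s -> %s" % (exe, dirs_containing(exe) or "not found on PATH"))
```

Run it from any Python prompt; an empty result for `python.exe` means the entry added in step 3 did not take effect in that session.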


IPython Installation:

C:\Python27> easy_install ipython

Jupyter notebook Installation:

C:\Python27\Scripts> pip install jupyter

You can run the jupyter notebook as below -

C:\Python27\Scripts> jupyter notebook

Graphlab Create Installation:

C:\Python27\Scripts> pip install --upgrade --no-cache-dir https://pypi.python.org/packages/source/G/GraphLab-Create/GraphLab-Create-1.7.1.tar.gz#md5=caa4b1f78625a278dd016400d15bc5bd




Wednesday, 16 December 2015

Python Regular Expression quick guide





^ Matches the beginning of a line
$ Matches the end of the line
. Matches any character
\s Matches whitespace
\S Matches any non-whitespace character
* Repeats a character zero or more times
*? Repeats a character zero or more times (non-greedy)
+ Repeats a character one or more times
+? Repeats a character one or more times (non-greedy)
[aeiou] Matches a single character in the listed set
[^XYZ]  Matches a single character not in the listed set
[a-z0-9] The set of characters can include a range
( Indicates where string extraction is to start
) Indicates where string extraction is to end
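
A few of the symbols above in action with Python's `re` module. The sample line is made up for illustration:

```python
import re

# Made-up sample line, only for illustration
line = "From: atul.singh@example.com Sat Dec 12 09:14:16 2015"

# ^ anchors at the start of the line, \S+ takes a run of non-whitespace
# characters, and the parentheses mark the part to extract.
sender = re.findall(r'^From: (\S+)', line)

# [0-9]+ is a character range repeated one or more times.
numbers = re.findall(r'[0-9]+', line)

# Greedy .+ runs to the last colon; non-greedy .+? stops at the first one.
greedy = re.search(r'^F.+:', line).group()
lazy = re.search(r'^F.+?:', line).group()

print(sender)    # ['atul.singh@example.com']
print(numbers)   # ['12', '09', '14', '16', '2015']
print(greedy)    # From: atul.singh@example.com Sat Dec 12 09:14:
print(lazy)      # From:
```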





Tuesday, 15 December 2015

How to use Universe Shell (uvsh) in DataStage?


In DataStage administration, we sometimes have to use the DataStage command line (UniVerse shell) to get information directly from the DataStage UniVerse database.

When accessing it from the command line, a novice admin typically does this -

$ uvsh
This directory is not set up for DataStage.
Would you like to set it up(Y/N)?

Confused? What to do?

Always answer that question "no"; it means you're in the wrong place.
Always launch "uvsh" or "dssh" from one of two places: $DSHOME or inside a project directory. In the latter case you're good to go; in the former you'll need to LOGTO your project before you issue any SQL.



How to use UVSH?

## Go to $DSHOME
$ cd $DSHOME

## Source the dsenv file
$ . ./dsenv

## Run the uvsh command
$ $DSHOME/bin/uvsh

## At the uvsh prompt, switch to the project
> LOGTO <project_name>

Many DataStage admins prefer to run commands from the DataStage Administrator client or to use dssh instead of uvsh.

How to use DSSH?
## Go to $DSHOME and source the dsenv file
$ cd $DSHOME
$ . ./dsenv

## Run the dssh command
$ $DSHOME/bin/dssh

## At the dssh prompt, switch to the project
> LOGTO <project_name>




Sunday, 13 December 2015

Swirl Learn R in R




Step 1:  Install R
Step 2:  Install RStudio
Step 3:  Install swirl in R or RStudio
> install.packages("swirl")
Step 4: Load swirl and start it
> library("swirl")
> swirl()

Step 5: Install R courses from R Course Repository






Data Science Tools Installation in Linux

Friday, 11 December 2015

DataStage Scenario #11 - Get numeric or alphabets only



Goal - Extract the numeric part and the alphabetic part from a string, as below

Input:

Source
ATUL1234
SINGH374
I23S
C343LEAR





Output -

Source    Part1  Part2
ATUL1234  ATUL   1234
SINGH374  SINGH  374
I23S      IS     23
C343LEAR  CLEAR  343
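
The logic is simple to state: keep the letters for Part1 and the digits for Part2, without changing their order. Here is a Python sketch of that logic only; it is not the DataStage job design, and the function name `split_alpha_numeric` is made up for the example:

```python
def split_alpha_numeric(source):
    """Return (letters, digits) from source, keeping the original order."""
    part1 = "".join(ch for ch in source if ch.isalpha())
    part2 = "".join(ch for ch in source if ch.isdigit())
    return part1, part2

# Input values from the scenario above
for source in ["ATUL1234", "SINGH374", "I23S", "C343LEAR"]:
    part1, part2 = split_alpha_numeric(source)
    print("%-9s %-6s %s" % (source, part1, part2))
```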











Wednesday, 9 December 2015

ps command #3 - Sorting



To sort the ps command output, we can use the ps --sort option (this is not the Linux sort command). More details can be found in the ps man page.

--sort spec     specify sorting order. Sorting syntax is [+|-]key[,[+|-]key[,...]] Choose a multi-letter key from the 
                STANDARD FORMAT SPECIFIERS section. The "+" is optional since default direction is increasing numerical
                or lexicographic order. Identical to k. 
                For example: ps jax --sort=uid,-ppid,+pid


ps command output, sorted by memory used (high to low)

$ ps aux --sort -rss

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
atul     43584  0.0 16.0 633196 162468 ?       Sl   Dec05   1:01 evince /home/atul/Desktop/Learning/book.pdf
atul     17099  0.3 15.7 1244044 159208 ?      Sl   Dec04  10:00 /usr/lib64/firefox/firefox
root      2272  0.2  8.5 223428 86132 tty1     Ss+  Dec03  12:36 /usr/bin/Xorg :0 -br -verbose -audit 4 -auth /var/run/gdm/auth-for-gdm-kda2x4/database -nolisten tcp vt1
atul      2773  0.0  5.3 1199004 53952 ?       Sl   Dec03   1:18 nautilus
atul      2827  0.0  3.5 296192 36036 ?        Ss   Dec03   0:56 gnome-screensaver
atul     43834  0.0  1.2 990904 12892 ?        Sl   Dec05   1:45 /home/atul/Desktop/sublime_text
atul      2799  0.1  1.1 371080 11216 ?        S    Dec03   8:39 /usr/lib/vmware-tools/sbin64/vmtoolsd -n vmusr --blockFd 3
atul     22246  0.0  0.9 300112 10072 ?        Sl   Dec06   0:12 gnome-terminal
atul      2767  0.0  0.7 502416  7464 ?        Sl   Dec03   0:35 gnome-panel
atul     22937  0.0  0.7 305276  7364 ?        S    Dec06   0:00 gedit
atul      2811  0.0  0.6 324292  6332 ?        S    Dec03   0:00 python /usr/share/system-config-printer/applet.py
root     44117  0.0  0.6  50068  6132 ?        Ss   Dec05   0:02 /usr/sbin/restorecond -u
atul      2852  0.0  0.5 548844  5476 ?        S    Dec03   0:13 /usr/libexec/clock-applet --oaf-activate-iid=OAFIID:GNOME_ClockApplet_Factory --oaf-ior-fd=28
atul      2788  0.0  0.4 331480  5032 ?        S    Dec03   0:48 /usr/libexec/wnck-applet --oaf-activate-iid=OAFIID:GNOME_Wncklet_Factory --oaf-ior-fd=18
atul      2760  0.0  0.4 447048  4900 ?        Sl   Dec03   0:26 metacity
atul      2783  0.0  0.4 469076  4464 ?        Sl   Dec03   0:03 gpk-update-icon
atul      2817  0.0  0.3 262056  3608 ?        S    Dec03   0:01 bluetooth-applet



If you want the list from low to high, remove the '-' before the key

$ ps aux --sort rss

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         2  0.0  0.0      0     0 ?        S    Dec03   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S    Dec03   0:00 [migration/0]
root         4  0.0  0.0      0     0 ?        S    Dec03   0:05 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S    Dec03   0:00 [stopper/0]
root         6  0.0  0.0      0     0 ?        S    Dec03   0:02 [watchdog/0]
root         7  0.0  0.0      0     0 ?        S    Dec03   4:15 [events/0]
root         8  0.0  0.0      0     0 ?        S    Dec03   0:00 [events/0]
root         9  0.0  0.0      0     0 ?        S    Dec03   0:00 [events_long/0]
root        10  0.0  0.0      0     0 ?        S    Dec03   0:00 [events_power_ef]
root        11  0.0  0.0      0     0 ?        S    Dec03   0:00 [cgroup]
root        12  0.0  0.0      0     0 ?        S    Dec03   0:00 [khelper]
root        13  0.0  0.0      0     0 ?        S    Dec03   0:00 [netns]
root        14  0.0  0.0      0     0 ?        S    Dec03   0:00 [async/mgr]
root        15  0.0  0.0      0     0 ?        S    Dec03   0:00 [pm]
root        16  0.0  0.0      0     0 ?        S    Dec03   0:03 [sync_supers]
root        17  0.0  0.0      0     0 ?        S    Dec03   0:02 [bdi-default]


Sort ps output by pid -


$ ps aux --sort pid       # pid from low to high
$ ps aux --sort -pid      # pid from high to low
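
If you already have `ps aux` output saved as text (in a log file, say), the same ordering can be reproduced afterwards. Here is a rough Python sketch, assuming the standard `ps aux` column layout; the function name `sort_ps_rows` is made up, and the sample rows are shortened from the output above:

```python
def sort_ps_rows(ps_output, column, reverse=False):
    """Sort the data rows of `ps aux`-style text by a numeric column."""
    lines = ps_output.strip().splitlines()
    header = lines[0].split()
    index = header.index(column)
    # COMMAND may contain spaces, so split into at most len(header) fields.
    rows = [line.split(None, len(header) - 1) for line in lines[1:]]
    return sorted(rows, key=lambda row: float(row[index]), reverse=reverse)

sample = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
atul     22937  0.0  0.7 305276  7364 ?        S    Dec06   0:00 gedit
atul     43584  0.0 16.0 633196 162468 ?       Sl   Dec05   1:01 evince /home/atul/Desktop/Learning/book.pdf
atul      2773  0.0  5.3 1199004 53952 ?       Sl   Dec03   1:18 nautilus
"""

# reverse=True corresponds to the '-' prefix, as in --sort -rss
for row in sort_ps_rows(sample, "RSS", reverse=True):
    print("%s %s %s" % (row[1], row[5], row[10]))
```

Sorting inside ps itself is still simpler when you can rerun the command.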


GNU sort specifiers - 


STANDARD FORMAT SPECIFIERS

Here are the different keywords that may be used to control the output format (e.g. with option -o) or to sort the
selected processes with the GNU-style --sort option.

For example:  ps -eo pid,user,args --sort user

This version of ps tries to recognize most of the keywords used in other implementations of ps.

The following user-defined format specifiers may contain spaces: args, cmd, comm, command, fname, ucmd, ucomm, lstart,
bsdstart, start.

Some keywords may not be available for sorting.

CODE       HEADER   DESCRIPTION

%cpu       %CPU     cpu utilization of the process in "##.#" format. Currently, it is the CPU time used divided by the
                    time the process has been running (cputime/realtime ratio), expressed as a percentage. It will not
                    add up to 100% unless you are lucky. (alias pcpu).

%mem       %MEM     ratio of the process’s resident set size  to the physical memory on the machine, expressed as a
                    percentage. (alias pmem).

bsdstart   START    time the command started. If the process was started less than 24 hours ago, the output format is
                    " HH:MM", else it is "mmm dd" (where mmm is the three letters of the month).

bsdtime    TIME     accumulated cpu time, user + system. The display format is usually "MMM:SS", but can be shifted to
                    the right if the process used more than 999 minutes of cpu time.

c          C        processor utilization. Currently, this is the integer value of the percent usage over the lifetime
                    of the process. (see %cpu).

comm       COMMAND  command name (only the executable name). Modifications to the command name will not be shown.
                    A process marked <defunct> is partly dead, waiting to be fully destroyed by its parent. The output
                    in this column may contain spaces. (alias ucmd, ucomm). See also the args format keyword, the -f
                    option, and the c option.
                    When specified last, this column will extend to the edge of the display. If ps can not determine
                    display width, as when output is redirected (piped) into a file or another command, the output
                    width is undefined. (it may be 80, unlimited, determined by the TERM variable, and so on) The
                    COLUMNS environment variable or --cols option may be used to exactly determine the width in this
                    case. The w or -w option may be also be used to adjust width.

command    COMMAND  see args. (alias args, cmd).

cp         CP       per-mill (tenths of a percent) CPU usage. (see %cpu).

cputime    TIME     cumulative CPU time, "[dd-]hh:mm:ss" format. (alias time).

egroup     EGROUP   effective group ID of the process. This will be the textual group ID, if it can be obtained and
                    the field width permits, or a decimal representation otherwise. (alias group).

etime      ELAPSED  elapsed time since the process was started, in the form [[dd-]hh:]mm:ss.

euid       EUID     effective user ID. (alias uid).

euser      EUSER    effective user name. This will be the textual user ID, if it can be obtained and the field width
                    permits, or a decimal representation otherwise. The n option can be used to force the decimal
                    representation. (alias uname, user).

gid        GID      see egid. (alias egid).

lstart     STARTED  time the command started.

ni         NI       nice value. This ranges from 19 (nicest) to -20 (not nice to others), see nice(1). (alias nice).

pcpu       %CPU     see %cpu. (alias %cpu).

pgid       PGID     process group ID or, equivalently, the process ID of the process group leader. (alias pgrp).

pid        PID      process ID number of the process.

pmem       %MEM     see %mem. (alias %mem).

ppid       PPID     parent process ID.

rss        RSS      resident set size, the non-swapped physical memory that a task has used (in kiloBytes).
                    (alias rssize, rsz).

ruid       RUID     real user ID.

size       SZ       approximate amount of swap space that would be required if the process were to dirty all writable
                    pages and then be swapped out. This number is very rough!

start      STARTED  time the command started. If the process was started less than 24 hours ago, the output format is
                    "HH:MM:SS", else it is "  mmm dd" (where mmm is a three-letter month name).

sz         SZ       size in physical pages of the core image of the process. This includes text, data, and stack
                    space. Device mappings are currently excluded; this is subject to change. See vsz and rss.

time       TIME     cumulative CPU time, "[dd-]hh:mm:ss" format. (alias cputime).

tname      TTY      controlling tty (terminal). (alias tt, tty).

vsz        VSZ      virtual memory size of the process in KiB (1024-byte units). Device mappings are currently
                    excluded; this is subject to change. (alias vsize).
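
The etime keyword above prints elapsed time as [[dd-]hh:]mm:ss, which is awkward to compare or sort in a script. Here is a small Python sketch that converts that format to seconds; the function name `etime_to_seconds` is made up for this example:

```python
def etime_to_seconds(etime):
    """Convert ps etime text, [[dd-]hh:]mm:ss, to a number of seconds."""
    etime = etime.strip()
    days = 0
    if "-" in etime:
        # Optional leading day count, e.g. "2-03:04:05"
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:
        # Hours are optional, so pad on the left with zero
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

print(etime_to_seconds("05:30"))        # 330
print(etime_to_seconds("01:02:03"))     # 3723
print(etime_to_seconds("2-03:04:05"))   # 183845
```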








Tuesday, 8 December 2015

ps command #2 - Advance


ps command #1

Usually, when we monitor processes, we are looking for something that can impact server performance, or for one specific process. To do so, we grep the ps output -

This is how we can list all http processes -
$ ps aux | grep http
atul      7585  0.0  0.0 177676   592 ?        S    Dec06   0:00 /usr/libexec/gvfsd-http --spawner :1.7 /org/gtk/gvfs/exec_spaw/2
root     28848  0.0  0.0   2700   168 pts/0    D+   02:49   0:00 grep http

You can filter the ps command output by any keyword, as above.
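
Note that the grep process itself shows up in the result above, because its own command line contains the keyword. When filtering captured ps output in a script, that line can be dropped. A minimal Python sketch (the function name `filter_ps_lines` is made up; the sample lines come from the output above):

```python
def filter_ps_lines(ps_output, keyword):
    """Keep lines mentioning `keyword`, skipping the grep process itself."""
    matches = []
    for line in ps_output.splitlines():
        if keyword in line and "grep " + keyword not in line:
            matches.append(line)
    return matches

sample = """\
atul      7585  0.0  0.0 177676   592 ?        S    Dec06   0:00 /usr/libexec/gvfsd-http --spawner :1.7 /org/gtk/gvfs/exec_spaw/2
root     28848  0.0  0.0   2700   168 pts/0    D+   02:49   0:00 grep http
"""

for line in filter_ps_lines(sample, "http"):
    print(line)
```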

There are some ps options which can give you a customized output -

To see every process on the system using standard syntax:
$ ps -e
$ ps -ef
$ ps -eF
$ ps -ely


To see every process on the system using BSD syntax:
$ ps ax
$ ps axu


To print a process tree:
$ ps -ejH
$ ps axjf


To get info about threads:
$ ps -eLf
$ ps axms


To get security info:
$ ps -eo euser,ruser,suser,fuser,f,comm,label
$ ps axZ
$ ps -eM

To see every process running as root (real & effective ID) in user format:
$ ps -U root -u root u


To see every process with a user-defined format:
$ ps -eo pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:14,comm
$ ps axo stat,euid,ruid,tty,tpgid,sess,pgrp,ppid,pid,pcpu,comm
$ ps -eopid,tt,user,fname,tmout,f,wchan   

Print only the process IDs of process syslogd:
$ ps -C syslogd -o pid=
 #ps -C <process_name> -o pid=


Print only the name of PID 42:
$ ps -p 42 -o comm=
  #ps -p <process_id> -o comm=







Monday, 7 December 2015

ps command #1 - Basic



A Linux command to monitor the system processes consuming resources on the server. If you work on a Linux system, it is good to have at least a basic understanding of this command.

When run, the ps command takes a snapshot of the processes running at that moment and displays it on the terminal. The output can be used to analyse system performance or to identify a problematic process that puts the system at risk.

ps Command:


When we run the plain ps command, it displays very basic information -
$ ps
  PID TTY          TIME CMD
22396 pts/0    00:00:00 su
22402 pts/0    00:00:00 bash
22417 pts/0    00:00:00 su
22420 pts/0    00:00:00 bash
23332 pts/0    00:00:00 ps

PID - process id
TTY - terminal in which process is running
TIME - total cpu time taken till now
CMD - command


Let's try with one argument -f (full)
$ ps -f
UID        PID  PPID  C STIME TTY          TIME CMD
root     22396 22377  0 09:49 pts/0    00:00:00 su
root     22402 22396  0 09:49 pts/0    00:00:00 bash
root     22417 22402  0 09:50 pts/0    00:00:00 su
root     22420 22417  0 09:50 pts/0    00:00:00 bash
root     23337 22420  0 11:02 pts/0    00:00:00 ps -f

This output shows some more information -
UID - process owner user id 
PPID - parent process id
STIME - process start time


Let's play with some more arguments and see what the output looks like -

$ ps -ef
atul      7585     1  0 18:29 ?        00:00:00 /usr/libexec/gvfsd-http --spawner :1.7 /org/gtk/gvfs/exec_spaw/2
root     16991     1  0 Dec04 ?        00:00:00 /usr/sbin/bluetoothd --udev
atul     17099     1  0 Dec04 ?        00:09:26 /usr/lib64/firefox/firefox
atul     22246     1  0 19:28 ?        00:00:05 gnome-terminal
atul     22248 22246  0 19:28 ?        00:00:00 gnome-pty-helper
atul     22377 22246  0 19:35 pts/0    00:00:00 bash
root     22396 22377  0 19:35 pts/0    00:00:00 su
root     22402 22396  0 19:35 pts/0    00:00:00 bash
root     22417 22402  0 19:35 pts/0    00:00:00 su
root     22420 22417  0 19:35 pts/0    00:00:00 bash
atul     22937     1  0 20:06 ?        00:00:00 gedit
root     23282  1899  0 20:47 ?        00:00:00 /usr/libexec/hald-addon-rfkill-killswitch
root     24348  1810  0 22:07 ?        00:00:00 /sbin/dhclient -d -4 -sf /usr/libexec/nm-dhcp-client.action -pf /var/run/dhclient-eth1.pid -lf /var/lib/dhclient/dhclient-e738be73-e337-4f64-865e-aa936ac77c14-eth1.lease -cf /var/run/nm-dhclient-eth1.conf eth1
atul     27098  1236  2 22:23 ?        00:00:00 /usr/lib/rstudio-server/bin/rsession -u atul
root     27112     1  0 22:23 ?        00:00:00 /usr/libexec/fprintd


All process
To see all processes on the system (along with the command line arguments used to start each process) you could use:

$ ps aux


Processes for User
To see all processes for a particular user (along with the command line arguments for each process) you could use:

$ ps U <username> u

$ ps U atul u
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
atul      2695  0.0  0.0 229128   672 ?        Sl   Dec03   0:00 /usr/bin/gnome-keyring-daemon --daemonize --login
atul      2705  0.0  0.1 253264  1888 ?        Ssl  Dec03   0:01 gnome-session
atul      2713  0.0  0.0  20040   128 ?        S    Dec03   0:00 dbus-launch --sh-syntax --exit-with-session
atul      2714  0.0  0.1  32476  1356 ?        Ssl  Dec03   0:01 /bin/dbus-daemon --fork --print-pid 5 --print-address 7 --session
atul      2732  0.0  0.3 133360  3636 ?        S    Dec03   0:06 /usr/libexec/gconfd-2
atul      2740  0.0  0.3 507280  3408 ?        Ssl  Dec03   0:26 /usr/libexec/gnome-settings-daemon
atul      2741  0.0  0.1 286220  1624 ?        Ss   Dec03   0:00 seahorse-daemon
atul      2746  0.0  0.0 137388   844 ?        S    Dec03   0:00 /usr/libexec/gvfsd
atul      2760  0.0  0.5 447048  5116 ?        Sl   Dec03   0:25 metacity
atul      2767  0.0  0.7 502416  7600 ?        Sl   Dec03   0:34 gnome-panel
atul      2769  0.0  0.3 450232  3156 ?        S<sl Dec03   0:35 /usr/bin/pulseaudio --start --log-target=syslog
atul      2772  0.0  0.0  94828   252 ?        S    Dec03   0:00 /usr/libexec/pulse/gconf-helper
atul      2773  0.0  5.6 1199004 57544 ?       Sl   Dec03   1:18 nautilus
atul      2775  0.0  0.0 696412   256 ?        Ssl  Dec03   0:00 /usr/libexec/bonobo-activation-server --ac-activate --ior-output-fd=18
atul      2778  0.0  0.2  30400  2212 ?        S    Dec03   0:00 /usr/sbin/restorecond -u
atul      2783  0.0  0.4 469076  4244 ?        Sl   Dec03   0:02 gpk-update-icon
atul      2786  0.0  0.0 146404   900 ?        S    Dec03   0:00 /usr/libexec/gvfs-gdu-volume-monitor
atul      2787  0.0  0.2 375072  2924 ?        S    Dec03   0:00 gnome-volume-control-applet
atul      2788  0.0  0.5 331480  5988 ?        S    Dec03   0:48 /usr/libexec/wnck-applet --oaf-activate-iid=OAFIID:GNOME_Wncklet_Factory --oaf-ior-fd=18
atul      2789  0.0  0.2 476996  2900 ?        Sl   Dec03   0:00 /usr/libexec/trashapplet --oaf-activate-iid=OAFIID:GNOME_Panel_TrashApplet_Factory --oaf-ior-fd=24

Process tree
A process tree shows the child/parent relationships between processes. (When a process spawns another process, the spawned process is called the child and the other is the parent.)


$ ps afjx
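
To make the parent/child idea concrete, here is a short Python sketch that rebuilds a tree from PID and PPID pairs. The data is copied from the `ps -f` output earlier in this post; the function name `render_tree` is made up, and this is only an illustration of the relationship, not how ps draws its tree:

```python
from collections import defaultdict

# (pid, ppid, command) taken from the `ps -f` output earlier in this post
processes = [
    (22396, 22377, "su"),
    (22402, 22396, "bash"),
    (22417, 22402, "su"),
    (22420, 22417, "bash"),
    (23337, 22420, "ps -f"),
]

def render_tree(processes):
    """Return indented lines, one per process, children under their parent."""
    children = defaultdict(list)
    names = {}
    for pid, ppid, command in processes:
        children[ppid].append(pid)
        names[pid] = command

    lines = []

    def walk(pid, depth):
        lines.append("%s%d %s" % ("  " * depth, pid, names[pid]))
        for child in sorted(children[pid]):
            walk(child, depth + 1)

    # Roots are processes whose parent is not in the list.
    for pid, ppid, _ in processes:
        if ppid not in names:
            walk(pid, 0)
    return lines

print("\n".join(render_tree(processes)))
```

Each level of indentation is one generation: the first `su` started a `bash`, which started another `su`, and so on down to `ps -f`.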






Thursday, 3 December 2015

Python Installation from Source in Linux