
Monday, 8 August 2016

ETL Strategy #2


Continued from ETL Strategy #1
 

Delta from Sources extracted by Timestamp
This project will use the Timestamp to capture the deltas for most of the Operational Database sources where a date/time value can be used. The ETL process will extract data from the operational data stores based on a date/time column such as Update_dt when processing the delta records, and then populate it into the Initial Staging area. The flow chart below shows the flow step by step; a rough sketch of the control-table handling follows the step table.



The flow chart above is split into two parts: one for the initial load and the other for delta processing.

Ref #    Step Description
1    Insert a record into the control tables, manually or using scripts, for each ETL process. This is done only once, when a table gets loaded into the data warehouse for the first time.
2    Set the extract date on the control table to the desired initial load date. This is the timestamp the ETL process will use against the source system.
3    Run the ETL batch process, which reads the control tables for the extract timestamp.
4    Extract all data from the source system newer than the extract timestamp set on the control table.
5    Check whether the load completed successfully or failed with errors.
6    If the load failed with errors, the error handling service is called.
7    If the load completed successfully, the load flag is set to successful.
8    The maximum timestamp of the ETL load is obtained.
9    A new record is inserted into the control structure with the timestamp obtained in the previous step.
10   The process continues to pull the delta records in subsequent runs.
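
As a rough illustration of steps 1-4 and 8-9 above, the control-table handling might look like the sketch below. The table and column names (ETL_CONTROL, EXTRACT_TS, UPDATE_DT) and the run_sql wrapper are illustrative assumptions only, not taken from this project.

#!/bin/bash
# Sketch of a control-table-driven delta extract (all object names are hypothetical).
# run_sql is assumed to wrap your database command-line client; here it only echoes
# the SQL it would run so the sketch stays self-contained.
run_sql() { echo "SQL> $1" >&2; }

TABLE="CUSTOMER"

# Steps 1-2 (run once per table): seed the control table with the initial load date.
# run_sql "INSERT INTO ETL_CONTROL (SRC_TABLE, EXTRACT_TS) VALUES ('$TABLE', '1900-01-01 00:00:00')"

# Step 3: the batch process reads the extract timestamp from the control table.
LAST_TS=$(run_sql "SELECT MAX(EXTRACT_TS) FROM ETL_CONTROL WHERE SRC_TABLE = '$TABLE'")

# Step 4: extract everything newer than that timestamp into the Initial Staging area.
run_sql "INSERT INTO STG.$TABLE SELECT * FROM SRC.$TABLE WHERE UPDATE_DT > '$LAST_TS'"

# Steps 5-7 (success check, error handling service, load flag) belong to the job design.

# Steps 8-9: record the new high-water mark so the next run pulls only fresh deltas.
NEW_TS=$(run_sql "SELECT MAX(UPDATE_DT) FROM SRC.$TABLE")
run_sql "INSERT INTO ETL_CONTROL (SRC_TABLE, EXTRACT_TS) VALUES ('$TABLE', '$NEW_TS')"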
   


Delta from Sources extracted by comparison
Where a transaction Date or Timestamp is not available, a process will compare the new and current versions of a source to generate its delta. This strategy is mostly used when files are the source of data. It is manageable for the small to medium size files used in this project and should be avoided with larger source files. A transaction code (I=Insert; U=Update; D=Delete) will have to be generated so that the rest of the ETL stream can recognise the type of transaction and process it.
Files are either pushed to the ETL server or pulled from FTP servers to the ETL server. If the files contain delta records, the files are uploaded directly to the Data Warehouse. If a file is a full extract, the file comparison delta process is used to identify the changed records before uploading to the Data Warehouse.
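
A minimal sketch of such a comparison is shown below, assuming pipe-delimited flat files with the record key in the first field; the file names and layout are illustrative, not this project's.

#!/bin/bash
# Sketch: generate I/U/D delta records by comparing the previous and current
# full extract files. Assumes "|"-delimited files with the record key in field 1.

OLD=previous_full.dat
NEW=current_full.dat

# Isolate the lines that differ between the two versions.
comm -13 <(sort "$OLD") <(sort "$NEW") > new_only.dat   # added or changed (new image)
comm -23 <(sort "$OLD") <(sort "$NEW") > old_only.dat   # removed or changed (old image)

# Tag each changed record with a transaction code for the downstream ETL stream.
awk -F'|' '
  NR==FNR { old[$1] = $0; next }                        # pass 1: old-only records by key
  ($1 in old) { print "U|" $0; delete old[$1]; next }   # key in both files -> Update
  { print "I|" $0 }                                     # key only in the new file -> Insert
  END { for (k in old) print "D|" old[k] }              # keys left over -> Delete
' old_only.dat new_only.dat > delta.dat

The resulting delta.dat carries the transaction code as its first field, so the rest of the ETL stream can branch on I, U and D.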
         

E10 Validate Source Data transferred via FTP
Input:    Source Data File and Source Control File.
Output:    NONE.
Dependency: Availability of other systems files.
Functions:
•    Validate that the number of records in the Source File matches the count contained in the Source Control File. This guarantees that the right number of records has been transferred from Source to Target (see the sketch below).
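
A hedged sketch of this check, assuming the control file carries the expected record count on its first line (the file names and control-file layout are assumptions, not part of the E10 spec):

#!/bin/bash
# E10 sketch: confirm that the record count in the transferred data file matches
# the count declared in its control file.

DATA_FILE=source_data.dat
CTL_FILE=source_data.ctl

expected=$(head -1 "$CTL_FILE")      # expected record count from the control file
actual=$(wc -l < "$DATA_FILE")       # records actually received via FTP

if [ "$actual" -eq "$expected" ]; then
    echo "E10 OK: $actual records received as expected."
else
    echo "E10 FAILED: expected $expected records, received $actual." >&2
    exit 1                           # hand the failure to the error handling service
fi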








Tuesday, 19 July 2016

ETL Strategy #1



The ETL Load will be designed to extract data from all the required Data Sources. The data to feed the Next Gen BI Database will have to be brought from the sources with specific Extraction, Transformation & Load (ETL) processes. It is essential to define a good ETL strategy to ensure that the execution of these tasks supports the expected data volumes and that the design is scalable and low-maintenance.


Initial Load Strategy

The Initial Load process exists to support the requirement to include historical data that can't be included through the regular Delta Refresh.
For the Next Gen BI project, a full extract is expected to be available for the required sources to prepare an Initial Load prior to the regular Delta Refresh. This initial extraction of the data will then flow through the regular transformation and load process.
As discussed in the control process, control tables will be used to initiate the first iteration of the Initial ETL process: a list of source table names with extraction dates will be loaded into the control tables. The ETL process can be kicked off through the scheduler; it will read the control tables and process the full extract.
The rest of the process for an Initial Load is the same as the delta refresh, as shown in the flow chart under the section "Delta from Sources extracted by Timestamp". The only difference is the loading of the control tables to start the process the first time a table gets loaded into the Data Warehouse.


Delta Refresh or CDC Strategy

The Delta refresh process will apply only the appropriate transactions to the Data Warehouse. The result is a greatly reduced volume of information to be processed and applied. Also, the Delta transactions for the Warehouse can be reused as input for the different Data marts, since they will be part of the Staging area and already processed. As discussed in the Control Process Strategy, control tables will be used to control the delta refresh process.


Continued in ETL Strategy #2




Tuesday, 17 May 2016

Lookup Stage behaviour



Today, I am gonna ask you a question: what value will I get from a lookup when my datatype is integer (Not Null) and there is no match between the source and reference data?

Generally, we say NULL, as there is no match between source and reference. But that's not true.
So let's see how DataStage and the Lookup stage behave :-)

When Source and Reference are NULLable -
-       If there is no match between source and reference, we will get NULL in the output.

When Source and Reference are Not-NULLable -
-       If there is no match between source and reference, we will get the DataStage default for that datatype,
        such as 0 for integer and empty string ('') for varchar, when data goes out from the lookup stage.

So, be careful when you are planning to filter the data outside the lookup stage based on a referenced column's value: the field in the output is not null, so the transformer stage doesn't receive a null (it receives the default value, e.g. 0) and can't handle it as you expect.

Hoping this adds one more pointer to your learning. Let me know your thoughts in the comments section.





Monday, 7 March 2016

Data Warehouse Approaches #2



Top-down approach (Inmon)

The top-down approach views the data warehouse from the top of the entire analytic environment.

The data warehouse holds atomic or transaction data that is extracted from one or more source systems and integrated within a normalized, enterprise data model. From there, the data is summarized, dimensionalized, and distributed to one or more “dependent” data marts. These data marts are “dependent” because they derive all their data from a centralized data warehouse.

Sometimes, organizations supplement the data warehouse with a staging area to collect and store source system data before it can be moved and integrated within the data warehouse. A separate staging area is particularly useful if there are numerous source systems, large volumes of data, or small batch windows with which to extract data from source systems.


Pros/Cons 

The major benefit of a “top-down” approach is that it provides an integrated, flexible architecture to support downstream analytic data structures.
First, this means the data warehouse provides a departure point for all data marts, enforcing consistency and standardization so that organizations can achieve a single version of the truth. Second, the atomic data in the warehouse lets organizations re-purpose that data in any number of ways to meet new and unexpected business needs.

For example, a data warehouse can be used to create rich data sets for statisticians, deliver operational reports, or support operational data stores (ODS) and analytic applications. Moreover, users can query the data warehouse if they need cross-functional or enterprise views of the data.

On the downside, a top-down approach may take longer and cost more to deploy than other approaches, especially in the initial increments. This is because organizations must create a reasonably detailed enterprise data model as well as the physical infrastructure to house the staging area, data warehouse, and the data marts before deploying their applications or reports. (Of course, depending on the size of an implementation, organizations can deploy all three “tiers” within a single database.) This initial delay may cause some groups with their own IT budgets to build their own analytic applications. Also, it may not be intuitive or seamless for end users to drill through from a data mart to a data warehouse to find the details behind the summary data in their reports.


Bottom-up approach (Kimball)

In a bottom-up approach, the goal is to deliver business value by deploying dimensional data marts as quickly as possible. Unlike the top-down approach, these data marts contain all the data — both atomic and summary — that users may want or need, now or in the future. Data is modeled in a star schema design to optimize usability and query performance. Each data mart builds on the next, reusing dimensions and facts so users can query across data marts, if desired, to obtain a single version of the truth as well as both summary and atomic data.

The “bottom-up” approach consciously tries to minimize back-office operations, preferring to focus an organization’s effort on developing dimensional designs that meet end-user requirements. The “bottom-up” staging area is non-persistent, and may simply stream flat files from source systems to data marts using the file transfer protocol. In most cases, dimensional data marts are logically stored within a single database. This approach minimizes data redundancy and makes it easier to extend existing dimensional models to accommodate new subject areas.


Pros/Cons 

The major benefit of a bottom-up approach is that it focuses on creating user-friendly, flexible data structures using dimensional, star schema models. It also delivers value rapidly because it doesn’t lay down a heavy infrastructure up front.
Without an integration infrastructure, the bottom-up approach relies on a “dimensional bus” to ensure that data marts are logically integrated and stovepipe applications are avoided. To integrate data marts logically, organizations use “conformed” dimensions and facts when building new data marts. Thus, each new data mart is integrated with others within a logical enterprise dimensional model.
Another advantage of the bottom-up approach is that since the data marts contain both summary and atomic data, users do not have to “drill through” from a data mart to another structure to obtain detailed or transaction data. The use of a staging area also eliminates redundant extracts and overhead required to move source data into the dimensional data marts.

One problem with a bottom-up approach is that it requires organizations to enforce the use of standard dimensions and facts to ensure integration and deliver a single version of the truth. When data marts are logically arrayed within a single physical database, this integration is easily done. But in a distributed, decentralized organization, it may be too much to ask departments and business units to adhere and reuse references and rules for calculating facts. There can be a tendency for organizations to create “independent” or non-integrated data marts.

In addition, dimensional marts are designed to optimize queries, not support batch or transaction processing. Thus, organizations that use a bottom-up approach need to create additional data structures outside of the bottom-up architecture to accommodate data mining, ODSs, and operational reporting requirements. However, this may be achieved simply by pulling a subset of data from a data mart at night when users are not active on the system.








Sunday, 31 January 2016

Create a Daemon to Trace New Processes


Description

The following code can be used to create a daemon that will watch for processes that show up in the "ps -ef" output with certain characteristics. When it identifies such processes, it will attach to them with a trace utility (e.g. strace, truss, tusc... you must change the code to reflect whatever platform it is run on). The tool does not need to follow these processes with a fork flag, since it will separately trace any children that match the same "ps -ef" characteristics. This makes it useful for tracing DS PX programs that contain rsh, since truss's fork flag (i.e. "-f") blocks the rsh from executing.



Usage

The script below should be saved to a file such as /tmp/tracer.sh and given rwx permissions. The trace utility name that is appropriate for your platform should be substituted in the "ps -ef" pipeline and in the "for" loop. The script is then run using this syntax:
    /tmp/tracer.sh <search string>
As mentioned above, the search string can be any value that appears in the "ps -ef" output, such as a user id, a particular time, a command, or arguments to a command. The lines of the script that gather the list of commands to be traced also attempt to remove commands that should be ignored. If you find too many processes getting traced, identify why they were selected and alter those lines by adding a "grep -v" to the list of items being ignored.
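
A minimal sketch of such a tracer for a Linux host, assuming strace as the trace utility (substitute truss or tusc on other platforms):

#!/bin/bash
# Tracer daemon sketch: watch "ps -ef" for processes matching a search string
# and attach strace to each new PID found. Children that match the same search
# string are picked up on a later pass, so no fork-following flag is needed.

SEARCH="$1"
[ -z "$SEARCH" ] && { echo "usage: $0 <search string>"; exit 1; }

TRACED=""                              # PIDs we have already attached to

while true; do
    # Gather candidate PIDs, ignoring this script, the trace utility and grep itself.
    PIDS=$(ps -ef | grep "$SEARCH" | grep -v grep | grep -v strace | grep -v tracer.sh | awk '{print $2}')
    for PID in $PIDS; do
        case " $TRACED " in
            *" $PID "*) ;;             # already being traced, skip
            *)
                strace -p "$PID" -o "/tmp/trace.$PID" &    # attach in the background
                TRACED="$TRACED $PID"
                ;;
        esac
    done
    sleep 1
done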










Wednesday, 9 December 2015

ps command #3 - Sorting



To sort the ps command output, we can use the ps --sort option (note that this is not the Linux sort command). More details can be found on the man page of the ps command.

--sort spec     specify sorting order. Sorting syntax is [+|-]key[,[+|-]key[,...]] Choose a multi-letter key from the 
                STANDARD FORMAT SPECIFIERS section. The "+" is optional since default direction is increasing numerical
                or lexicographic order. Identical to k. 
                For example: ps jax --sort=uid,-ppid,+pid


ps command output - sorted by memory used (high to low)

$ ps aux --sort -rss

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
atul     43584  0.0 16.0 633196 162468 ?       Sl   Dec05   1:01 evince /home/atul/Desktop/Learning/book.pdf
atul     17099  0.3 15.7 1244044 159208 ?      Sl   Dec04  10:00 /usr/lib64/firefox/firefox
root      2272  0.2  8.5 223428 86132 tty1     Ss+  Dec03  12:36 /usr/bin/Xorg :0 -br -verbose -audit 4 -auth /var/run/gdm/auth-for-gdm-kda2x4/database -nolisten tcp vt1
atul      2773  0.0  5.3 1199004 53952 ?       Sl   Dec03   1:18 nautilus
atul      2827  0.0  3.5 296192 36036 ?        Ss   Dec03   0:56 gnome-screensaver
atul     43834  0.0  1.2 990904 12892 ?        Sl   Dec05   1:45 /home/atul/Desktop/sublime_text
atul      2799  0.1  1.1 371080 11216 ?        S    Dec03   8:39 /usr/lib/vmware-tools/sbin64/vmtoolsd -n vmusr --blockFd 3
atul     22246  0.0  0.9 300112 10072 ?        Sl   Dec06   0:12 gnome-terminal
atul      2767  0.0  0.7 502416  7464 ?        Sl   Dec03   0:35 gnome-panel
atul     22937  0.0  0.7 305276  7364 ?        S    Dec06   0:00 gedit
atul      2811  0.0  0.6 324292  6332 ?        S    Dec03   0:00 python /usr/share/system-config-printer/applet.py
root     44117  0.0  0.6  50068  6132 ?        Ss   Dec05   0:02 /usr/sbin/restorecond -u
atul      2852  0.0  0.5 548844  5476 ?        S    Dec03   0:13 /usr/libexec/clock-applet --oaf-activate-iid=OAFIID:GNOME_ClockApplet_Factory --oaf-ior-fd=28
atul      2788  0.0  0.4 331480  5032 ?        S    Dec03   0:48 /usr/libexec/wnck-applet --oaf-activate-iid=OAFIID:GNOME_Wncklet_Factory --oaf-ior-fd=18
atul      2760  0.0  0.4 447048  4900 ?        Sl   Dec03   0:26 metacity
atul      2783  0.0  0.4 469076  4464 ?        Sl   Dec03   0:03 gpk-update-icon
atul      2817  0.0  0.3 262056  3608 ?        S    Dec03   0:01 bluetooth-applet



If you want the list from low to high, remove the '-' before the argument

$ ps aux --sort rss

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         2  0.0  0.0      0     0 ?        S    Dec03   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S    Dec03   0:00 [migration/0]
root         4  0.0  0.0      0     0 ?        S    Dec03   0:05 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S    Dec03   0:00 [stopper/0]
root         6  0.0  0.0      0     0 ?        S    Dec03   0:02 [watchdog/0]
root         7  0.0  0.0      0     0 ?        S    Dec03   4:15 [events/0]
root         8  0.0  0.0      0     0 ?        S    Dec03   0:00 [events/0]
root         9  0.0  0.0      0     0 ?        S    Dec03   0:00 [events_long/0]
root        10  0.0  0.0      0     0 ?        S    Dec03   0:00 [events_power_ef]
root        11  0.0  0.0      0     0 ?        S    Dec03   0:00 [cgroup]
root        12  0.0  0.0      0     0 ?        S    Dec03   0:00 [khelper]
root        13  0.0  0.0      0     0 ?        S    Dec03   0:00 [netns]
root        14  0.0  0.0      0     0 ?        S    Dec03   0:00 [async/mgr]
root        15  0.0  0.0      0     0 ?        S    Dec03   0:00 [pm]
root        16  0.0  0.0      0     0 ?        S    Dec03   0:03 [sync_supers]
root        17  0.0  0.0      0     0 ?        S    Dec03   0:02 [bdi-default]


Sort ps output by pid -


$ ps aux --sort pid       # pid from low to high
$ ps aux --sort -pid      # pid from high to low


GNU sort specifiers - 


STANDARD FORMAT SPECIFIERS

Here are the different keywords that may be used to control the output format (e.g. with option -o) or to sort the
selected processes with the GNU-style --sort option.

For example:  ps -eo pid,user,args --sort user

This version of ps tries to recognize most of the keywords used in other implementations of ps.

The following user-defined format specifiers may contain spaces: args, cmd, comm, command, fname, ucmd, ucomm, lstart,
bsdstart, start.

Some keywords may not be available for sorting.

CODE       HEADER   DESCRIPTION

%cpu       %CPU     cpu utilization of the process in "##.#" format. Currently, it is the CPU time used divided by the
                    time the process has been running (cputime/realtime ratio), expressed as a percentage. It will not
                    add up to 100% unless you are lucky. (alias pcpu).

%mem       %MEM     ratio of the process’s resident set size  to the physical memory on the machine, expressed as a
                    percentage. (alias pmem).

bsdstart   START    time the command started. If the process was started less than 24 hours ago, the output format is
                    " HH:MM", else it is "mmm dd" (where mmm is the three letters of the month).

bsdtime    TIME     accumulated cpu time, user + system. The display format is usually "MMM:SS", but can be shifted to
                    the right if the process used more than 999 minutes of cpu time.

c          C        processor utilization. Currently, this is the integer value of the percent usage over the lifetime
                    of the process. (see %cpu).

comm       COMMAND  command name (only the executable name). Modifications to the command name will not be shown.
                    A process marked <defunct> is partly dead, waiting to be fully destroyed by its parent. The output
                    in this column may contain spaces. (alias ucmd, ucomm). See also the args format keyword, the -f
                    option, and the c option.
                    When specified last, this column will extend to the edge of the display. If ps can not determine
                    display width, as when output is redirected (piped) into a file or another command, the output
                    width is undefined. (it may be 80, unlimited, determined by the TERM variable, and so on) The
                    COLUMNS environment variable or --cols option may be used to exactly determine the width in this
                    case. The w or -w option may be also be used to adjust width.

command    COMMAND  see args. (alias args, cmd).

cp         CP       per-mill (tenths of a percent) CPU usage. (see %cpu).

cputime    TIME     cumulative CPU time, "[dd-]hh:mm:ss" format. (alias time).

egroup     EGROUP   effective group ID of the process. This will be the textual group ID, if it can be obtained and
                    the field width permits, or a decimal representation otherwise. (alias group).

etime      ELAPSED  elapsed time since the process was started, in the form [[dd-]hh:]mm:ss.

euid       EUID     effective user ID. (alias uid).

euser      EUSER    effective user name. This will be the textual user ID, if it can be obtained and the field width
                    permits, or a decimal representation otherwise. The n option can be used to force the decimal
                    representation. (alias uname, user).

gid        GID      see egid. (alias egid).

lstart     STARTED  time the command started.

ni         NI       nice value. This ranges from 19 (nicest) to -20 (not nice to others), see nice(1). (alias nice).

pcpu       %CPU     see %cpu. (alias %cpu).

pgid       PGID     process group ID or, equivalently, the process ID of the process group leader. (alias pgrp).

pid        PID      process ID number of the process.

pmem       %MEM     see %mem. (alias %mem).

ppid       PPID     parent process ID.

rss        RSS      resident set size, the non-swapped physical memory that a task has used (in kiloBytes).
                    (alias rssize, rsz).

ruid       RUID     real user ID.

size       SZ       approximate amount of swap space that would be required if the process were to dirty all writable
                    pages and then be swapped out. This number is very rough!

start      STARTED  time the command started. If the process was started less than 24 hours ago, the output format is
                    "HH:MM:SS", else it is "  mmm dd" (where mmm is a three-letter month name).

sz         SZ       size in physical pages of the core image of the process. This includes text, data, and stack
                    space. Device mappings are currently excluded; this is subject to change. See vsz and rss.

time       TIME     cumulative CPU time, "[dd-]hh:mm:ss" format. (alias cputime).

tname      TTY      controlling tty (terminal). (alias tt, tty).

vsz        VSZ      virtual memory size of the process in KiB (1024-byte units). Device mappings are currently
                    excluded; this is subject to change. (alias vsize).








Tuesday, 8 December 2015

ps command #2 - Advanced


ps command #1

Usually, when we are monitoring processes, we are targeting something that can impact our server performance, or some specific process. To do so, we grep the ps output -

This is how we can list all http processes -
$ ps aux | grep http
atul      7585  0.0  0.0 177676   592 ?        S    Dec06   0:00 /usr/libexec/gvfsd-http --spawner :1.7 /org/gtk/gvfs/exec_spaw/2
root     28848  0.0  0.0   2700   168 pts/0    D+   02:49   0:00 grep http

You can filter the ps command output by any keyword, as shown above.

There are some ps options which can give you a customized output -

To see every process on the system using standard syntax:
$ ps -e
$ ps -ef
$ ps -eF
$ ps -ely


To see every process on the system using BSD syntax:
$ ps ax
$ ps axu


To print a process tree:
$ ps -ejH
$ ps axjf


To get info about threads:
$ ps -eLf
$ ps axms


To get security info:
$ ps -eo euser,ruser,suser,fuser,f,comm,label
$ ps axZ
$ ps -eM

To see every process running as root (real & effective ID) in user format:
$ ps -U root -u root u


To see every process with a user-defined format:
$ ps -eo pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:14,comm
$ ps axo stat,euid,ruid,tty,tpgid,sess,pgrp,ppid,pid,pcpu,comm
$ ps -eopid,tt,user,fname,tmout,f,wchan   

Print only the process IDs of process syslogd:
$ ps -C syslogd -o pid=
 #ps -C <process_name> -o pid=


Print only the name of PID 42:
$ ps -p 42 -o comm=
  #ps -p <process_id> -o comm=







Monday, 7 December 2015

ps command #1 - Basic



ps is a Linux command to monitor the system processes consuming resources on the server. If you are working on a Linux system, it is good to have at least a basic understanding of this command.

When run, the ps program takes a snapshot of the processes running at that time and displays it on the terminal, which can be used to analyse system performance or identify any problematic process that could be a risk to the system.

ps Command:


When we run simple ps command, it will display very basic information -
$ ps
  PID TTY          TIME CMD
22396 pts/0    00:00:00 su
22402 pts/0    00:00:00 bash
22417 pts/0    00:00:00 su
22420 pts/0    00:00:00 bash
23332 pts/0    00:00:00 ps

PID - process id
TTY - terminal in which process is running
TIME - total cpu time taken till now
CMD - command


Let's try with one argument -f (full)
$ ps -f
UID        PID  PPID  C STIME TTY          TIME CMD
root     22396 22377  0 09:49 pts/0    00:00:00 su
root     22402 22396  0 09:49 pts/0    00:00:00 bash
root     22417 22402  0 09:50 pts/0    00:00:00 su
root     22420 22417  0 09:50 pts/0    00:00:00 bash
root     23337 22420  0 11:02 pts/0    00:00:00 ps -f

This output is displayed with some more information -
UID - process owner user id 
PPID - parent process id
STIME - process start time


Let's play with some arguments and see what the output will look like -

$ ps -ef
atul      7585     1  0 18:29 ?        00:00:00 /usr/libexec/gvfsd-http --spawner :1.7 /org/gtk/gvfs/exec_spaw/2
root     16991     1  0 Dec04 ?        00:00:00 /usr/sbin/bluetoothd --udev
atul     17099     1  0 Dec04 ?        00:09:26 /usr/lib64/firefox/firefox
atul     22246     1  0 19:28 ?        00:00:05 gnome-terminal
atul     22248 22246  0 19:28 ?        00:00:00 gnome-pty-helper
atul     22377 22246  0 19:35 pts/0    00:00:00 bash
root     22396 22377  0 19:35 pts/0    00:00:00 su
root     22402 22396  0 19:35 pts/0    00:00:00 bash
root     22417 22402  0 19:35 pts/0    00:00:00 su
root     22420 22417  0 19:35 pts/0    00:00:00 bash
atul     22937     1  0 20:06 ?        00:00:00 gedit
root     23282  1899  0 20:47 ?        00:00:00 /usr/libexec/hald-addon-rfkill-killswitch
root     24348  1810  0 22:07 ?        00:00:00 /sbin/dhclient -d -4 -sf /usr/libexec/nm-dhcp-client.action -pf /var/run/dhclient-eth1.pid -lf /var/lib/dhclient/dhclient-e738be73-e337-4f64-865e-aa936ac77c14-eth1.lease -cf /var/run/nm-dhclient-eth1.conf eth1
atul     27098  1236  2 22:23 ?        00:00:00 /usr/lib/rstudio-server/bin/rsession -u atul
root     27112     1  0 22:23 ?        00:00:00 /usr/libexec/fprintd


All processes
To see all processes on the system (along with the command line arguments used to start each process) you could use:

$ ps aux


Processes for User
To see all processes for a particular user (along with the command line arguments for each process) you could use:

$ ps U <username> u

$ ps U atul u
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
atul      2695  0.0  0.0 229128   672 ?        Sl   Dec03   0:00 /usr/bin/gnome-keyring-daemon --daemonize --login
atul      2705  0.0  0.1 253264  1888 ?        Ssl  Dec03   0:01 gnome-session
atul      2713  0.0  0.0  20040   128 ?        S    Dec03   0:00 dbus-launch --sh-syntax --exit-with-session
atul      2714  0.0  0.1  32476  1356 ?        Ssl  Dec03   0:01 /bin/dbus-daemon --fork --print-pid 5 --print-address 7 --session
atul      2732  0.0  0.3 133360  3636 ?        S    Dec03   0:06 /usr/libexec/gconfd-2
atul      2740  0.0  0.3 507280  3408 ?        Ssl  Dec03   0:26 /usr/libexec/gnome-settings-daemon
atul      2741  0.0  0.1 286220  1624 ?        Ss   Dec03   0:00 seahorse-daemon
atul      2746  0.0  0.0 137388   844 ?        S    Dec03   0:00 /usr/libexec/gvfsd
atul      2760  0.0  0.5 447048  5116 ?        Sl   Dec03   0:25 metacity
atul      2767  0.0  0.7 502416  7600 ?        Sl   Dec03   0:34 gnome-panel
atul      2769  0.0  0.3 450232  3156 ?        S<sl Dec03   0:35 /usr/bin/pulseaudio --start --log-target=syslog
atul      2772  0.0  0.0  94828   252 ?        S    Dec03   0:00 /usr/libexec/pulse/gconf-helper
atul      2773  0.0  5.6 1199004 57544 ?       Sl   Dec03   1:18 nautilus
atul      2775  0.0  0.0 696412   256 ?        Ssl  Dec03   0:00 /usr/libexec/bonobo-activation-server --ac-activate --ior-output-fd=18
atul      2778  0.0  0.2  30400  2212 ?        S    Dec03   0:00 /usr/sbin/restorecond -u
atul      2783  0.0  0.4 469076  4244 ?        Sl   Dec03   0:02 gpk-update-icon
atul      2786  0.0  0.0 146404   900 ?        S    Dec03   0:00 /usr/libexec/gvfs-gdu-volume-monitor
atul      2787  0.0  0.2 375072  2924 ?        S    Dec03   0:00 gnome-volume-control-applet
atul      2788  0.0  0.5 331480  5988 ?        S    Dec03   0:48 /usr/libexec/wnck-applet --oaf-activate-iid=OAFIID:GNOME_Wncklet_Factory --oaf-ior-fd=18
atul      2789  0.0  0.2 476996  2900 ?        Sl   Dec03   0:00 /usr/libexec/trashapplet --oaf-activate-iid=OAFIID:GNOME_Panel_TrashApplet_Factory --oaf-ior-fd=24

Process tree
A process tree shows the child/parent relationships between processes. (When a process spawns another process, the spawned one is called the child process while the other is the parent.)


$ ps afjx






Tuesday, 10 November 2015

Check Memory Utilization by Datastage processes



When we are running lots of DataStage jobs on a Linux DataStage server, or different environments are sharing the same server, it can cause a resource crunch on the server side which affects job performance.

It's always preferable to keep an eye on resource utilization while jobs are running. Mostly, DataStage admins set up a cron job with a resource monitoring script which is invoked every five minutes (or more), checks the resource statistics on the server and notifies them accordingly.

The following processes are started on the DataStage Engine server:

dsapi_slave - server side process for DataStage clients like Designer

osh - Parallel Engine process
DSD.StageRun - Server Engine Stage process
DSD.RUN - DataStage supervisor process used to initialize Parallel Engine and Server Engine jobs. There is one DSD.RUN process for each active job

ps auxw | head -1;ps auxw | grep dsapi_slave | sort -rn -k5  | head -10
USER   PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
atul 38846  0.0  0.0 103308   856 pts/0    S+   07:20   0:00 grep dsapi_slave


The example shown lists the top 10 dsapi_slave processes from a memory utilization perspective. We can substitute or add an appropriate argument for grep, like osh, DSD.RUN, or even the user name that was used to invoke a DataStage task, to get a list that matches your criteria.
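
For example, to look at the parallel engine processes instead (the extra grep -v just drops the grep command itself from the list):

ps auxw | head -1; ps auxw | grep osh | grep -v grep | sort -rn -k5 | head -10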




Friday, 16 October 2015

Datastage job processes type at server side



When we run DataStage jobs from the Designer, the Director, the command line or any third-party tool, four types of processes are invoked on the server side. Let's have a look at those -

dsapi_slave -  server side process for DataStage clients like Designer
osh -          Parallel Engine process
DSD.StageRun - Server Engine Stage process ( when server job is running )
DSD.RUN -      DataStage supervisor process used to initialize Parallel Engine and Server Engine jobs. There is one DSD.RUN process for each active job ( when parallel job is running )




* No. of dsapi_slave processes - how many clients are connected to the DataStage server. These can be Administrator, Designer or Director sessions.
* No. of DSD.RUN processes - total parallel jobs running
* No. of DSD.StageRun processes - total server jobs running

A quick way to get these counts from the shell is shown below.
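
For example (the square brackets in each pattern keep grep from counting its own process):

$ ps -ef | grep -c "[d]sapi_slave"     # connected DataStage clients
$ ps -ef | grep -c "[D]SD.RUN"         # active parallel jobs
$ ps -ef | grep -c "[D]SD.StageRun"    # active server job stages
$ ps -ef | grep -c "[o]sh"             # parallel engine (osh) processes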



