Saturday, 25 March 2017

Check if Python Pandas DataFrame Column is having NaN or NULL

Before implementing any algorithm on the given data, It is a best practice to explore it first so that you can get an idea about the data. Today, we will learn how to check for missing/Nan/NULL values in data.

1. Reading the data
Reading the csv data into storing it into a pandas dataframe.

2. Exploring data
Checking out the data, how it looks by using head command which fetch me some top rows from dataframe.

3. Checking NULLs
Pandas is proving two methods to check NULLs - isnull() and notnull()
These two returns TRUE and FALSE respectively if the value is NULL. So let's check what it will return for our data

isnull() test

notnull() test

Check 0th row, LoanAmount Column - In isnull() test it is TRUE and in notnull() test it is FALSE. It mean, this row/column is holding null.

But we will not prefer this way for large dataset, as this will return TRUE/FALSE matrix for each data point, instead we would interested to know the counts or a simple check if dataset is holding NULL or not.

Use any()
Python also provide any() method which returns TRUE if there is at least single data point which is true for checked condition.

Use all()
Returns TRUE if all the data points follow the condition.

Now, as we know that there are some nulls/NaN values in our data frame, let's check those out - 

data.isnull().sum() - this will return the count of NULLs/NaN values in each column.

If you want to get total no of NaN values, need to take sum once again -


If you want to get any particular column's NaN calculations - 

Here, I have attached the complete Jupyter Notebook for you -

If you want to download the data, You can get it from HERE.

Like the below page to get update

Thursday, 23 March 2017

Measures of Data Spread in Stats

What do we mean by SPREAD? - The measures which can tell us the variability of a dataset, width, average distribution falls into this category.

Let's see which measures we are taking about-

Input: 45, 67, 23, 12, 9, 43, 12, 17, 91
Sorted: 9, 12, 12, 17, 23, 43, 45, 67, 91

It is the simplest measures of Spread. It is the difference between max and min value of a dataset but this will not give you the idea about the data distribution. It may be given a wrong interpretation if our dataset is having outliers.

Range - Max - Min = 91 - 9 = 82

Interquartile Range (IQR):
IQR is the middle 50 percentile data which is difference between 75 percentile and 25 percentile. It is used in boxplot plotting. 

IQR = Q3 - Q1 = 56 - 12 = 44

Variance shows the distance of each element from its mean, If you simply sum this it will be zero and that is why we use squared distance to calculate it.

Standard Deviation (`\sigma` or s):
This measure is square root of Variance, the only difference between Variance and Standard deviation is the output unit as Variance.

`Variance = \sigma^2 or s^2 = \frac{\Sigma_{i=1}^N(x_i-\barx)^2}{N}`

`Standard Deviation = \sigma or s = \root{2}{\sigma^2} = \root{2}{\frac{\Sigma_{i=1}^N(x_i-\barx)^2}{N}}`

Like the below page to get update

Wednesday, 22 March 2017

The Three M in Statis : Measures of Center

In Statistics, 3M summary is very important as it tells a lot about data distribution. These Ms are - Mean, Median and Mode

Mean - Average
Median - Middile Value
Mode - Frequent Item count

for the calulations.

Like the below page to get update

Tuesday, 21 March 2017

Summary Statistics in Data Analysis

Summary statistics  are numbers that summarize properties of the data. i.e - Mean, Spread, tendency etc. We will see each one by one.

Let's take a input dataset -

Input: 45, 67, 23, 12, 9, 43, 12, 17, 91
Sorted: 9, 12, 12, 17, 23, 43, 45, 67, 91

Frequency: The frequency of an attribute value is the percentage of time the value occurs in the data set.
In our dataset, Frequency of 12 is 2.

Mode: The mode of a an attribute is the most frequent attribute value

Mode for our dataset is 2 as 12 is the most frequent item which occurs 2 time

Things to remember:
i- There is no mode if all the values are same
ii - Same is applicable if all values occurrence is 1 

Usually, Mode and Frequency are used for categorical data

Percentiles: This used for continuous data.
Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile
is a value xp of x such that p% of the observed values of x are less than xp.

How to calculate the Percentile:
1. Count the total item in dataset = N
2. Multiply the percentile p with total no of items = N*p
3. This will give you a no which can be a float or integer
4. If it is a float, round off it to nearest integer, named pth no
          i. Sort the data into increasing order
          ii. Now, pth no in this dataset is your percentile value
5. If it is an integer no
          i. Sort the data into increasing order
          ii. Now, average of pth no and (p+1)th no in this dataset is your percentile value

So when we say, 20% means -

No of items in dataset = 9
No of items which should be less than xp. - 9*20% = 1.8
Round off this to nearest integer - 2
Our dataset is already sorted in increasing order, so check the 2nd value - 12

likewise, 25%, 50% and 75% is - 9*25%, 9*50%, 9*75% = 2.25th, 4.5th, 6.75th
2th, 5th, 7th - 12, 23, 45

This is one way to calculate the percentile, If you use calculator or some other method, it might be slightly different.

Mean or Average:  Sum(all items) / Total no of element

Mean -  (9+12+12+17+23+43+45+67+91)/9 = 34.4

However, the mean is very sensitive to outliers. So to understand the data tendency, we go for median rather than means.

Median: Median is 50 percentile, or middle value

How to get Median/Middle value - a. Sort the data into increasing orderb. Get total no of elements - N     if N is even -  median =   ( N/2th element + [N/2 + 1]th element) / 2     if N is odd - median = ceil(N/2)th element

For our case, N = 9, which is odd, so ceil(9/2) = ceil(4.5) = 5th element 
Median = 23

Range:  Difference between Max and Min is called range. 

Input dataset range - 91-9 = 82

Variance: The variance or standard deviation is the most common measure of the spread of a set of points.

`variance(x) = \sigma^2 = \frac{1}{n-1}\Sigma_{i=1}^n(x_i-\bar{x})^2`

where `\bar{x}` is Mean of all value of x
m = total no of items in dataset
`\sigma` is standard deviation

Like the below page to get update

Sunday, 19 March 2017

5 Number Summary - Statistics Basics

What is 5 no summary?

5 no summary is an statistical measure to get the idea about the data tendency.

It includes :

1.  Minimum
2.  Q1 (25 percentile)
3.  Median (middle value or 50 percentile)
4.  Q3 (75 percentile)
5.  Maximum

How to calculate or get these values??

Input data :  45, 67, 23, 12, 9, 43, 12, 17, 91

Step1:  Sort the data

9, 12, 12, 17, 23, 43, 45, 67, 91

Step2:  You can easily get the minimum and maximum no

Min : 9
Max : 91

Step 3: Finding the median - Finding the middle value, dont confuse with Mean or Average. 

How to get Median/Middle value - 
a. Sort the data into increasing order
b. Get total no of elements - N
     if N is even -  median =   ( N/2th element + [N/2 + 1]th element) / 2
     if N is odd - median = ceil(N/2)th element

For our case, N = 9, which is odd, so ceil(9/2) = ceil(4.5) = 5th element 
Median = 23

Step 4: Finding our the Q1 and Q3 (called Quantile) is very easy. Divide the element list into 2 list by Median value - 

 (9, 12, 12, 17), 23, (43, 45, 67, 91) 

Now, Find out the Median for 1st list which is Q1 and Median for 2nd list which is Q3

As we can see, list1 and list2 both are having even no of elements so  - 

Median of list1 (Q1) =  ( N/2th element + [N/2 + 1]th element) / 2
                                  =  ( 4/2th element + [4/2 +1]th element) / 2
                                  =  ( 2nd element  + 3rd element ) /2
                                  =  (12 + 12 ) / 2 
                            Q1 = 12

Median of list2 (Q3) = ( 45 + 67 ) / 2
                                  = 112 / 2
                                  = 56 

We got the Q1 (12) and Q3 (56). 

Our 5 no summary is calculated which is -  

min, Q1, median, Q3, max 
9,     12,  23,         56, 91

Like the below page to get update

Wednesday, 15 March 2017

JDBC DSN Configuration in IIB (WMB)

1. mqsilist :

mqsireportproperties <BROKER> -c JDBCProviders -a -o AllReportableEntityNames

mqsireportproperties TESTNODE_atul.singh -c JDBCProviders -a -o AllReportableEntityNames

mqsireportproperties <BROKER> -c JDBCProviders -o Microsoft_SQL_Server -r

mqsireportproperties TESTNODE_atul.singh -c JDBCProviders -o Microsoft_SQL_Server -r

mqsicreateconfigurableservice <BROKER> -c JDBCProviders -o <JDBC_DSN_NAME> -n connectionUrlFormat -v "jdbc:sqlserver://<SERVER>:<PORT>;user=<USER>;password=<PASSWORD>;databaseName=<DB_NAME>"

mqsicreateconfigurableservice TESTNODE_atul.singh -c JDBCProviders -o SQLSrvrJdbc -n connectionUrlFormat -v "jdbc:sqlserver://IRIS-CSG-338:1433;user=sa;password=password@1;databaseName=sample"

mqsichangeproperties <BROKER> -c JDBCProviders -o <JDBC_DSN_NAME> -n databaseName,databaseSchemaNames,portNumber,securityIdentity,serverName -v <DB_NAME>,dbo,1433,<USER>@<SERVER>,<SERVER>

mqsichangeproperties TESTNODE_atul.singh -c JDBCProviders -o SQLSrvrJdbc -n databaseName,databaseSchemaNames,portNumber,securityIdentity,serverName -v sample,dbo,1433,sa@IRIS-CSG-338,IRIS-CSG-338

mqsichangeproperties <BROKER> -c JDBCProviders -o <JDBC_DSN_NAME> -n jarsURL -v "<DRIVER_JAR_PATH>"

mqsichangeproperties TESTNODE_atul.singh -c JDBCProviders -o SQLSrvrJdbc -n jarsURL -v "C:\Program Files\dbDrivers\sqljdbc_4.2\enu\jre7"

mqsichangeproperties <BROKER> -c JDBCProviders -o <JDBC_DSN_NAME> -n type4DatasourceClassName,type4DriverClassName -v "",""

mqsichangeproperties TESTNODE_atul.singh -c JDBCProviders -o SQLSrvrJdbc -n type4DatasourceClassName,type4DriverClassName -v "",""

mqsisetdbparms <BROKER> -n jdbc::<JDBC_DSN_NAME> -u <USER> -p <PASSWORD>

mqsisetdbparms TESTNODE_atul.singh -n jdbc::SQLSrvrJdbc -u sa -p password@1

mqsireportproperties <BROKER> -c JDBCProviders -o <JDBC_DSN_NAME> -r

mqsireportproperties TESTNODE_atul.singh -c JDBCProviders -o SQLSrvrJdbc -r

mqsistop <BROKER>
mqsistart <BROKER>

mqsistop TESTNODE_atul.singh
mqsistart TESTNODE_atul.singh

Like the below page to get update

Thursday, 9 March 2017

Perl Script to get content difference of a file

This Perl script will help you to get the content difference of a file available in two directories.

Usage: filename dir1 dir2

Working: Script will pick the filename from argument and start comparing its content and print the
difference. Here assumption is the same file is available in two directories.
But we can alter this script to take two file name and one directory or based on our need.

View or Download :

Like the below page to get update

Sunday, 5 March 2017

How to get DataStage Project Settings

DSListProjectSettings.bat is used to report all available settings of existing projects.


The DSListProjectSettings batch file reports all available project settings and environment variable names and values to standard output. The first section shows available DataStage project properties and their settings. The second section shows environment variable names and values. When run successfully, each section will report a status code of zero.


Like the below page to get update

Tuesday, 28 February 2017

ETL Preformance Tuning : Identification of Performance Bottleneck

The goal of performance tuning is to optimize session performance by eliminating performance bottlenecks. To tune the performance of a session, we need to identify a performance bottleneck, eliminate it, and then identify the next performance bottleneck until we are satisfied with the session performance. Test Load option can be used to run sessions while tune session performance.
The most common performance bottleneck occurs when the ETL Server writes to a target database. We can identify performance bottlenecks by the following methods:

Running test sessions. We can configure a test session to read from a flat file source or to write to a flat file target or any stage which can hold the data but not write, to identify source and target bottlenecks.

Studying performance details. We can create a set of information called performance details to identify session bottlenecks. Performance details provide information such as buffer input and output efficiency. Collect Performance Data option in Session Property will enable the Session to generate a Counter of Input and Output rows through each Transformation

Monitoring system performance. System monitoring tools can be used to view percent CPU usage, I/O waits, and paging to identify system bottlenecks.

Once the location of a performance bottleneck is determined, we can eliminate the bottleneck by following these guidelines:

Eliminate source and target database bottlenecks. 
Optimize the query, Increase the database network packet size, or   configuring index and key constraints.

Eliminate mapping bottlenecks. 
Fine tune the pipeline logic and transformation settings and options in  mappings to eliminate mapping bottlenecks.

Eliminate session bottlenecks. 
Session strategy can be optimized performance details can be used to help in tuning session  configuration.

Eliminate system bottlenecks. 
Have the system administrator analyze information from system monitoring tools and improve CPU and network performance.

If all the bottlenecks above are tuned, further optimization of session performance can be done by increasing the number of pipeline partitions in the session. Adding partitions can improve performance by utilizing more of the system hardware while processing the session.

Like the below page to get update

Thursday, 23 February 2017

Shell script to access DataStage Director ETL joblogs

We have various data warehouses hosted on AIX/Linux/Unix operating systems. DataStage Director is one of the ETL monitoring & scheduling tool used in numerous data warehouses. In case of ETL job failures we need to login to DataStage Director and check the log for error messages. The ETL job logs can tend to span into several pages with various types of informative messages. Usually we need to locate the error message under ‘Fatal’ message type. Doing this task by searching the DataStage log manually can be very time consuming and tiring.

This shell script facilitates the user to access the error messages from the ease of the  Linux screen. The script also has facility to email the filtered log messages to the user’s mail box.

Ø  Accept jobname and other parameters while script execution.
Ø  Establish proper environment settings for local DataStage use
Ø  Locate event ids for fatal errors in the ETL joblog.
Ø  Extract detail log for fatal errors
Ø  Mail the filtered job log with exact fatal error message to the user.

Like the below page to get update

Tuesday, 21 February 2017

PDF Split & Merge with Perl

Sharing two utility scripts (Perl), which can be used for extracting only few pages from a PDF file and also for combining different pdf files into a single pdf files. There might be lot of situation where in, we might require only few pages from a big PDF files. So instead of storing / carrying such a big PDF files, we could extract the page or pages which we require from the original big PDF files. Also we could combine different pdf files into a single pdf file using the second script. As an example, if we have extracted pages 10-15 and pages 100-120 from a big pdf file using the and
we can combine these two pdf files (i.e. pdf which contains pages 10-15 and pdf which contains
pages 100-120) into a single pdf file using

NOTE : These two perl scripts uses a perl module called PDF::API2. If this is not present on your system as part of the perl installation, you can download these modules from and install. Please see the installation section for more details.

These two scripts can be used on windows, unix or linux. Currently tested on Windows with active perl 5.8.8, but it should work on unix and linux as well. For the script to work on unix and linux, please change the variable called "path_separator" to "/" instead of "\\". This variable can be seen at the starting of the script. can be used both on windows and unix/linux without any modification



     perl -i <input pdf file> -p <page or page range which needs to be extracted>

        -i : Please give the full path to the input PDF file
        -p : Page Number or Page range which needs to be extracted from the input PDF

        example : To extract pages 3 to 5, execute

           perl -i /tmp/abc.pdf -p 3-5

        example : To extract only page 3, execute

        perl -i /tmp/abc.pdf -p 3

Executing with -h option will display the usage onto the screen

Example : perl ./ -h


perl <output pdf file with full path> <input pdf file 1> <input pdf file 2> etc

Execute the script with all the pdf file which needs to be merged.
Script will merge in the same order which is given in the input

i.e. If you execute like /tmp/out.pdf /tmp/abc.pdf /tmp/xyz.pdf

then pages from xyz.pdf will be after pages from abc.pdf

Executing with -h option will display the usage onto the screen

Example : ./ -h



Like the below page to get update

Thursday, 9 February 2017

get Queue Depths

Monday, 6 February 2017

Implementing Slowly Changing Dimension in ETL

Dimension is a term in data management and data warehousing that refers to logical groupings of data such as geographical location, customer information, or product information. Slowly Changing Dimensions (SCD) are dimensions that have data that slowly changes.

                    For example, you may have a Dimension in your database that tracks the sales records of your salespersons in different pharmacies. Creating sales reports seems simple enough, until a salesperson is transferred from one pharmacy to another. How do you record such a change in your sales Dimension?

When we need to track change, it is unacceptable to put everything into the fact table or make every dimension time-dependent to deal with these changes. We would quickly talk ourselves back into a full-blown normalized structure with the consequential loss of understand-ability and query performance. Instead, we take advantage of the fact that most dimensions are nearly constant over time. We can preserve the independent dimensional structure with only relatively minor adjustments to contend with the changes. We refer to these nearly constant dimensions as slowly changing dimensions. Since Ralph Kimball first introduced the notion of slowly changing dimensions in 1994, some IT professionals—in a never-ending quest to speak in acronym—have termed them SCDs.

For each attribute in our dimension tables, we must specify a strategy to handle change. In other words, when an attribute value changes in the operational world, how will we respond to the change in our dimensional models? In the following section we'll describe three basic techniques for dealing with attribute changes, along with a couple hybrid approaches. You may decide that you need to employ a combination of these techniques within a single dimension table.

Type 1: Overwrite the Value
With the type 1 response, we merely overwrite the old attribute value in the dimension row, replacing it with the current value. In so doing, the attribute always reflects the most recent assignment.

Type 2: Add a Dimension Row
A type 2 response is the predominant technique for maintaining the historical data when it comes to slowly changing dimensions. A type 2 SCD is a dimension where a new row is created when the value of an attribute changes

Type 3: Add a Dimension Column
While the type 2 response partitions history, it does not allow us to associate the new attribute value with old fact history or vice versa. A type 3 SCD is a dimension where an alternate old column is created when an attribute changes

Type 6: Hybrid Slowly Changing Dimension Techniques
The Type 6 method is one that combines the approaches of types 1, 2 and 3 (1 + 2 + 3 = 6). One possible explanation of the origin of the term was that it was coined by Ralph Kimball during a conversation with Stephen Pace from Kalido but has also been referred to by Tom Haughey. It is not frequently used because it has the potential to complicate end user access, but has some advantages over the other approaches especially when techniques are employed to mitigate the downstream complexity.
The approach is to use a Type 1 slowly changing dimension, but adding an additional pair of date columns to indicate the date range at which a particular row in the dimension applies and a flag to indicate if the record is the current record.

The aim is to move data from the source application system to the Reporting Warehouse for analysis and reporting of a worldwide pharmaceutical major. The code here is responsible for the movement of data between the Operational Data Store (ODS) and the data warehouse.

Implementation Flow:

Like the below page to get update

Friday, 3 February 2017

Learning Matplotlib #2

Thursday, 2 February 2017

Plotting in Python - Learning Matplotlib #1

Wednesday, 18 January 2017

5 Tips For Better DataStage Design #17

**  There is an automap button in some stages,it can maps fields with the same names.

**  When you add a shared container into your job you need to map the columns of the container to your job link. What you might miss is the extra option you get on the Columns tab "Load" button. In addition to the normal column load you get "Load from Container" which is a quick way to load the container metadata into your job.

**  Don't create a job from an empty canvas. Always copy and use an existing job. Don't create shared containers from a blank canvas, always build and test a full job and then turn part of it into a container.

**  If you want to copy and paste settings between jobs,you had better open two Designers,then you can have two property windows open at the same time and copy or compare them more easily.As most property windows in DataStage are modal and you can only have one property window open per Designer session.

**  You can load metadata into a stage by using the "Load" button on the column tab or by dragging and dropping a table definition from the Designer repository window onto a link in your job. For sequential file stages the drag and drop is faster as it loads both the column names and the format values in one go. If you used the load button you would need to load the column names and then the format details separately.

**  Maybe you often meet a Modify stage or stage function working incorrectly, trial and error should be often the only way to work out the syntax of a function. If you do this in a large and complex job, it can be consumed a lot of times to debug it. The better way is have a couple test jobs in your project with a row generator, a modify or transformer stage and a peek stage. Have a column of each type in this test job. Use this throughout your project as a quick way to test a function or conversion. By the way, to correctly running the transformer stage need install the c++ compiler.

Like the below page to get update

Monday, 16 January 2017

Linux Shell Utilities

This shell script can be called from any other shell script as "." and any of the functions can be called to use. Even the funcions can be kept in the .profile and called through calling .profile.

The following functions are provided here:
to_lower               [ Change the upper case word to lower case.]
to_upper              [ Change the lower case to upper case. ]
check_numeric    [ Checks whether an input is a number. ]
check_decimal     [ Checks whether an input is a floating point number. ]
check_null            [ Checks for NULLs. ]
string_length        [ Evaluates the length of a string. ]
concatAll              [ Concats 'n' number of strings. ]
token_n                [ Returns the n'th token of a string.]

How To:
This is how you have to call this script to utilize any function in your current script.

In the caller script:
# Calling

RV=`to_lower $x1`
RV1=`string_length $x1`

Find the complete script here -

Like the below page to get update

Saturday, 14 January 2017

Learning Numpy #2

Thursday, 12 January 2017

Learning Numpy #1

Numpy is a python library used for numerical calculations and this is better performant than pure python. In this notebook, I have shared some basics of Numpy and will share more in next few posts. I hope you find these useful.

Like the below page to get update

Wednesday, 11 January 2017

My Learning Path for Machine Learning

I am a Python Lover guy so my way includes lots of Python points. If you dont know the basics of this wonderful language, start it from HERE else you can follow the links which I am going to share.

Learning ML is not only studying ML algorithms, it includes Basic Algebra, Statistics, Algorithms, Programming and lot more. But no need to afraid as such :-) we need to start from somewhere.....

This is my github repo, you can fork it and follow me with these 2 links --

Fork Fork
Follow - Follow @atulsingh0
I am still updating this list and welcome you to update this as well.

Like the below page to get update