Saturday, 25 March 2017

Check if Python Pandas DataFrame Column is having NaN or NULL

Before implementing any algorithm on the given data, It is a best practice to explore it first so that you can get an idea about the data. Today, we will learn how to check for missing/Nan/NULL values in data.

1. Reading the data
Reading the csv data into storing it into a pandas dataframe.

2. Exploring data
Checking out the data, how it looks by using head command which fetch me some top rows from dataframe.

3. Checking NULLs
Pandas is proving two methods to check NULLs - isnull() and notnull()
These two returns TRUE and FALSE respectively if the value is NULL. So let's check what it will return for our data

isnull() test

notnull() test

Check 0th row, LoanAmount Column - In isnull() test it is TRUE and in notnull() test it is FALSE. It mean, this row/column is holding null.

But we will not prefer this way for large dataset, as this will return TRUE/FALSE matrix for each data point, instead we would interested to know the counts or a simple check if dataset is holding NULL or not.

Use any()
Python also provide any() method which returns TRUE if there is at least single data point which is true for checked condition.

Use all()
Returns TRUE if all the data points follow the condition.

Now, as we know that there are some nulls/NaN values in our data frame, let's check those out - 

data.isnull().sum() - this will return the count of NULLs/NaN values in each column.

If you want to get total no of NaN values, need to take sum once again -


If you want to get any particular column's NaN calculations - 

Here, I have attached the complete Jupyter Notebook for you -

If you want to download the data, You can get it from HERE.

Like the below page to get update

Thursday, 23 March 2017

Measures of Data Spread in Stats

What do we mean by SPREAD? - The measures which can tell us the variability of a dataset, width, average distribution falls into this category.

Let's see which measures we are taking about-

Input: 45, 67, 23, 12, 9, 43, 12, 17, 91
Sorted: 9, 12, 12, 17, 23, 43, 45, 67, 91

It is the simplest measures of Spread. It is the difference between max and min value of a dataset but this will not give you the idea about the data distribution. It may be given a wrong interpretation if our dataset is having outliers.

Range - Max - Min = 91 - 9 = 82

Interquartile Range (IQR):
IQR is the middle 50 percentile data which is difference between 75 percentile and 25 percentile. It is used in boxplot plotting. 

IQR = Q3 - Q1 = 56 - 12 = 44

Variance shows the distance of each element from its mean, If you simply sum this it will be zero and that is why we use squared distance to calculate it.

Standard Deviation (`\sigma` or s):
This measure is square root of Variance, the only difference between Variance and Standard deviation is the output unit as Variance.

`Variance = \sigma^2 or s^2 = \frac{\Sigma_{i=1}^N(x_i-\barx)^2}{N}`

`Standard Deviation = \sigma or s = \root{2}{\sigma^2} = \root{2}{\frac{\Sigma_{i=1}^N(x_i-\barx)^2}{N}}`

Like the below page to get update

Wednesday, 22 March 2017

The Three M in Statis : Measures of Center

In Statistics, 3M summary is very important as it tells a lot about data distribution. These Ms are - Mean, Median and Mode

Mean - Average
Median - Middile Value
Mode - Frequent Item count

for the calulations.

Like the below page to get update

Tuesday, 21 March 2017

Summary Statistics in Data Analysis

Summary statistics  are numbers that summarize properties of the data. i.e - Mean, Spread, tendency etc. We will see each one by one.

Let's take a input dataset -

Input: 45, 67, 23, 12, 9, 43, 12, 17, 91
Sorted: 9, 12, 12, 17, 23, 43, 45, 67, 91

Frequency: The frequency of an attribute value is the percentage of time the value occurs in the data set.
In our dataset, Frequency of 12 is 2.

Mode: The mode of a an attribute is the most frequent attribute value

Mode for our dataset is 2 as 12 is the most frequent item which occurs 2 time

Things to remember:
i- There is no mode if all the values are same
ii - Same is applicable if all values occurrence is 1 

Usually, Mode and Frequency are used for categorical data

Percentiles: This used for continuous data.
Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile
is a value xp of x such that p% of the observed values of x are less than xp.

How to calculate the Percentile:
1. Count the total item in dataset = N
2. Multiply the percentile p with total no of items = N*p
3. This will give you a no which can be a float or integer
4. If it is a float, round off it to nearest integer, named pth no
          i. Sort the data into increasing order
          ii. Now, pth no in this dataset is your percentile value
5. If it is an integer no
          i. Sort the data into increasing order
          ii. Now, average of pth no and (p+1)th no in this dataset is your percentile value

So when we say, 20% means -

No of items in dataset = 9
No of items which should be less than xp. - 9*20% = 1.8
Round off this to nearest integer - 2
Our dataset is already sorted in increasing order, so check the 2nd value - 12

likewise, 25%, 50% and 75% is - 9*25%, 9*50%, 9*75% = 2.25th, 4.5th, 6.75th
2th, 5th, 7th - 12, 23, 45

This is one way to calculate the percentile, If you use calculator or some other method, it might be slightly different.

Mean or Average:  Sum(all items) / Total no of element

Mean -  (9+12+12+17+23+43+45+67+91)/9 = 34.4

However, the mean is very sensitive to outliers. So to understand the data tendency, we go for median rather than means.

Median: Median is 50 percentile, or middle value

How to get Median/Middle value - a. Sort the data into increasing orderb. Get total no of elements - N     if N is even -  median =   ( N/2th element + [N/2 + 1]th element) / 2     if N is odd - median = ceil(N/2)th element

For our case, N = 9, which is odd, so ceil(9/2) = ceil(4.5) = 5th element 
Median = 23

Range:  Difference between Max and Min is called range. 

Input dataset range - 91-9 = 82

Variance: The variance or standard deviation is the most common measure of the spread of a set of points.

`variance(x) = \sigma^2 = \frac{1}{n-1}\Sigma_{i=1}^n(x_i-\bar{x})^2`

where `\bar{x}` is Mean of all value of x
m = total no of items in dataset
`\sigma` is standard deviation

Like the below page to get update

Sunday, 19 March 2017

5 Number Summary - Statistics Basics

What is 5 no summary?

5 no summary is an statistical measure to get the idea about the data tendency.

It includes :

1.  Minimum
2.  Q1 (25 percentile)
3.  Median (middle value or 50 percentile)
4.  Q3 (75 percentile)
5.  Maximum

How to calculate or get these values??

Input data :  45, 67, 23, 12, 9, 43, 12, 17, 91

Step1:  Sort the data

9, 12, 12, 17, 23, 43, 45, 67, 91

Step2:  You can easily get the minimum and maximum no

Min : 9
Max : 91

Step 3: Finding the median - Finding the middle value, dont confuse with Mean or Average. 

How to get Median/Middle value - 
a. Sort the data into increasing order
b. Get total no of elements - N
     if N is even -  median =   ( N/2th element + [N/2 + 1]th element) / 2
     if N is odd - median = ceil(N/2)th element

For our case, N = 9, which is odd, so ceil(9/2) = ceil(4.5) = 5th element 
Median = 23

Step 4: Finding our the Q1 and Q3 (called Quantile) is very easy. Divide the element list into 2 list by Median value - 

 (9, 12, 12, 17), 23, (43, 45, 67, 91) 

Now, Find out the Median for 1st list which is Q1 and Median for 2nd list which is Q3

As we can see, list1 and list2 both are having even no of elements so  - 

Median of list1 (Q1) =  ( N/2th element + [N/2 + 1]th element) / 2
                                  =  ( 4/2th element + [4/2 +1]th element) / 2
                                  =  ( 2nd element  + 3rd element ) /2
                                  =  (12 + 12 ) / 2 
                            Q1 = 12

Median of list2 (Q3) = ( 45 + 67 ) / 2
                                  = 112 / 2
                                  = 56 

We got the Q1 (12) and Q3 (56). 

Our 5 no summary is calculated which is -  

min, Q1, median, Q3, max 
9,     12,  23,         56, 91

Like the below page to get update

Wednesday, 15 March 2017

JDBC DSN Configuration in IIB (WMB)

1. mqsilist :

mqsireportproperties <BROKER> -c JDBCProviders -a -o AllReportableEntityNames

mqsireportproperties TESTNODE_atul.singh -c JDBCProviders -a -o AllReportableEntityNames

mqsireportproperties <BROKER> -c JDBCProviders -o Microsoft_SQL_Server -r

mqsireportproperties TESTNODE_atul.singh -c JDBCProviders -o Microsoft_SQL_Server -r

mqsicreateconfigurableservice <BROKER> -c JDBCProviders -o <JDBC_DSN_NAME> -n connectionUrlFormat -v "jdbc:sqlserver://<SERVER>:<PORT>;user=<USER>;password=<PASSWORD>;databaseName=<DB_NAME>"

mqsicreateconfigurableservice TESTNODE_atul.singh -c JDBCProviders -o SQLSrvrJdbc -n connectionUrlFormat -v "jdbc:sqlserver://IRIS-CSG-338:1433;user=sa;password=password@1;databaseName=sample"

mqsichangeproperties <BROKER> -c JDBCProviders -o <JDBC_DSN_NAME> -n databaseName,databaseSchemaNames,portNumber,securityIdentity,serverName -v <DB_NAME>,dbo,1433,<USER>@<SERVER>,<SERVER>

mqsichangeproperties TESTNODE_atul.singh -c JDBCProviders -o SQLSrvrJdbc -n databaseName,databaseSchemaNames,portNumber,securityIdentity,serverName -v sample,dbo,1433,sa@IRIS-CSG-338,IRIS-CSG-338

mqsichangeproperties <BROKER> -c JDBCProviders -o <JDBC_DSN_NAME> -n jarsURL -v "<DRIVER_JAR_PATH>"

mqsichangeproperties TESTNODE_atul.singh -c JDBCProviders -o SQLSrvrJdbc -n jarsURL -v "C:\Program Files\dbDrivers\sqljdbc_4.2\enu\jre7"

mqsichangeproperties <BROKER> -c JDBCProviders -o <JDBC_DSN_NAME> -n type4DatasourceClassName,type4DriverClassName -v "",""

mqsichangeproperties TESTNODE_atul.singh -c JDBCProviders -o SQLSrvrJdbc -n type4DatasourceClassName,type4DriverClassName -v "",""

mqsisetdbparms <BROKER> -n jdbc::<JDBC_DSN_NAME> -u <USER> -p <PASSWORD>

mqsisetdbparms TESTNODE_atul.singh -n jdbc::SQLSrvrJdbc -u sa -p password@1

mqsireportproperties <BROKER> -c JDBCProviders -o <JDBC_DSN_NAME> -r

mqsireportproperties TESTNODE_atul.singh -c JDBCProviders -o SQLSrvrJdbc -r

mqsistop <BROKER>
mqsistart <BROKER>

mqsistop TESTNODE_atul.singh
mqsistart TESTNODE_atul.singh

Like the below page to get update

Thursday, 9 March 2017

Perl Script to get content difference of a file

This Perl script will help you to get the content difference of a file available in two directories.

Usage: filename dir1 dir2

Working: Script will pick the filename from argument and start comparing its content and print the
difference. Here assumption is the same file is available in two directories.
But we can alter this script to take two file name and one directory or based on our need.

View or Download :

Like the below page to get update

Sunday, 5 March 2017

How to get DataStage Project Settings

DSListProjectSettings.bat is used to report all available settings of existing projects.


The DSListProjectSettings batch file reports all available project settings and environment variable names and values to standard output. The first section shows available DataStage project properties and their settings. The second section shows environment variable names and values. When run successfully, each section will report a status code of zero.


Like the below page to get update