Showing posts with label data. Show all posts
Showing posts with label data. Show all posts

Saturday, 12 August 2017

Delete all lines in Notepad++ except lines containing a pattern

Who dont want to keep the only items which needed and we want to have when working with lots of junk data and want to fetch only which requires :-)

This can be done very easily in NotePad++ .

1. Use Ctrl-F to open Search box and  Select "Mark"

2. Put the Pattern in "Find What" box (in my case, I want to keep "Singh" only)

3. Then Click on Mark All to Mark all the lines which is having "Singh" in it.

4. When Marked, it will look like below -

5. After marking, Close this pop-up and Go To Search --> Bookmark --> Remove Unmarked Lines

As soon as Click on Remove Unmarked Lines, it will delete the all lines other than marked line.

Like the below page to get update

Friday, 7 July 2017

ICONV mystery - the UV function

Iconv (Internal CONVersion) is a function supported by UniVerse DB (UV db) to convert the DATA, not only DATE, into internal format. DataStage Server Jobs are using lots of UV functions to manipulate the data.

Today, I will try to unwrap the mystery behind the Iconv function and try to put the details in simpler words. We will not go into data conversion details but date conversion which is used by DataStage :-)

Like most of other date functions (Parallel one), Iconv also accept the date(string) and its format.

Suppose, Date =   June 17, 2017

To Convert this date into internal format, we have to use -

Iconv("2017-06-17", D-YMD)  = 18066
Iconv("2017/06/17", D/YMD)   = 18066
Iconv("2017:17:06", D:YDM)  = 18066
Iconv("17-06-17", D-Y2MD)    = 18066

D-  --> D for Delimiter followed by delimiter char
Y --> year in YYYY
M --> month in MM
D --> date in DD

As we can see, if we provide the date format with date string, Iconv convert the date to an integer no and it is very important to do because now datastage can understand the given date and we can use Oconv function to re-format the date as required.

I will cover Oconv in next post, till then Keep Learning !!

Like the below page to get update

Tuesday, 9 May 2017

Measuring Data Similarity or Dissimilarity #2

Continuing from our last discussion 'Measuring Data Similarity or Dissimilarity #1',  In this post we are going to see how to calculate the similarity or dissimilarity between Numeric Data Types.

2. For Numeric Attribute:

For measuring the dissimilarity between two numeric data points, the easiest or most used way to calculate the 'Euclidean distance', Higher the value of distance, higher the dissimilarity.
           There are two more distance measuring methods named 'Manhattan distance' and 'Minkowski distance'. We are going to look into these one by one. 

a. Euclidean distance: 

Euclidean distance is widely used to calculate the dissimilarity between numeric data points, this is actually derived from 'Pythagoras Theorem' so also known as 'Pythagorean metric' or `L^2` norm.

Euclidean distance between two points `p(x_1, y_1)` and `q(x_2, y_2)` is the length which connects point p from point q.

`dis(p,q) = dis(q,p) = \sqrt((x_2 - x_1)^2 + (y_2 - y_1)^2) = \sqrt(\sum_(i=1)^N(q_i - p_i)^2)`

In One Dimention:

`dis(p,q) = dis(q,p) = \sqrt((q - p)^2) = q - p`

In Two Dimentions:

`dis(p,q) = dis(q,p) = \sqrt((q_1 - p_1)^2 + (q_2 - p_2)^2)`

In Three Dimentions:

`dis(p,q) = dis(q,p) = \sqrt((q_1 - p_1)^2 + (q_2 - p_2)^2 + (q_3 - p_3)^2)`

In N Dimentions:

`dis(p,q) = dis(q,p) = \sqrt((q_1 - p_1)^2 + (q_2 - p_2)^2 + (q_3 - p_3)^2 +.......................+ (q_N - p_N)^2)`

b. Manhattan distance: 

It is also known as "City Block" distance as it is calculated same as we calculate the distance between any two block of city. It is simple difference between the data points.

`dis(p, q) = |(x_2 - x_1)| + |(y_2 - y_1)| = \sum_(i=1)^N|(q_i - p_i)|`

Manhattan distance is also know as `L^1` norm.

c. Minkowski distance: 

This is the generalized form of Euclidean or Manhattan distance and represented as - 

`dis(p,q) = dis(q,p) = [(x_2 - x_1)^n + (y_2 - y_1)^n]^{1/n} = [\sum_(i=1)^N(q_i - p_i)^n]^{1/n}`

where n = 1, 2, 3.......

Like the below page to get update

Measuring Data Similarity or Dissimilarity #1

Yet another question is in data mining to measure whether two datasets are similar or not. There are so many ways to calculate these values based on Data Type. Let's see into these methods -

1. For Binary Attribute:

Binary attributes are those which is having only two states 0 or 1, where 0 means attribute is absent and 1 means it is present. For calculating similarity/dissimilarity between binary attributes we use contingency table -

Contingency Table

q - if i and j both are equal to 1
r - if i is 1 and j is 0
s - if i is 0 and j is 1
t - if i and j both are equal to 0
p - total ( q+r+s+t)

a. Symmetric Binary Dissimilarity - 

For symmetric binary attribute, each state is equally valuable. If i and j are symmetric binary attribute then dissimilarity is calculates as -

`  d(i, j) = \frac{r + s}{q + r + s + t}  `

b. Asymmetric Binary Dissimilarity - 

For asymmetric binary attribute, two states are not equally important. Any one state overshadow the other, such binary attribute are often called "monary" (having one state). For these kind of attribute, dissimilarity is calculates as - 

`d(i, j) = \frac{r + s}{q + r + s}`

likewise, we can calculate the similarity (asymmetric binary similarity)

` sim(i, j) = 1 - d(i, j) `

which leave us with  

` JC = sim(i, j) = \frac{q}{q + r + s} `

The coefficient sim(i, j) is also known as Jaccard coefficient. 

Like the below page to get update

Tuesday, 25 April 2017

Graphical Display of Basic Stats of Data

After a long time, got a chance to share somethings with you guys, so feeling awesome :-), Today we are gonna see the Graphical Display of Data Stats or sometime we call it Exploratory Data Analysis as well, This is the best way to understand your data in very less time and set your analysis path for it. So without doing more chats, let's start -

1. Scatter Plot

Very Basic, Very Easy and Most Used EDA(Exploratory Data Analysis) technique. It is 2-D plot between X and Y variables where X or Y can be numeric data features or columns.
               With this plot we can easily see if there is any relationship, pattern or trends between between these 2 features or any data outlier existing. It is also useful to explore possibility of correlation relationships. Correlation can be positive, negative or neutral.

Now, let's look into a scatter plot -
I am using IRIS dataset and Python matplotlib library for this illustration - -

scatter iris

2. Histogram

Histogram plot is one of the oldest plotting technique to summarize the data distribution of a attribute X. X can be numerical feature and height of bar is frequency. Resulting plot is also called Bar Chart.

histogram bar chart iris

3. Quantile Plot (Bar Charts)

Quantile Plot or Bar Charts also used to display the uni-variate variables data distribution as well as plot the percentile information with outlier detection.

qunatile box plot

Keep looking for this space for further update.

Happy Learning

Like the below page to get update

Saturday, 25 March 2017

Check if Python Pandas DataFrame Column is having NaN or NULL

Before implementing any algorithm on the given data, It is a best practice to explore it first so that you can get an idea about the data. Today, we will learn how to check for missing/Nan/NULL values in data.

1. Reading the data
Reading the csv data into storing it into a pandas dataframe.

2. Exploring data
Checking out the data, how it looks by using head command which fetch me some top rows from dataframe.

3. Checking NULLs
Pandas is proving two methods to check NULLs - isnull() and notnull()
These two returns TRUE and FALSE respectively if the value is NULL. So let's check what it will return for our data

isnull() test

notnull() test

Check 0th row, LoanAmount Column - In isnull() test it is TRUE and in notnull() test it is FALSE. It mean, this row/column is holding null.

But we will not prefer this way for large dataset, as this will return TRUE/FALSE matrix for each data point, instead we would interested to know the counts or a simple check if dataset is holding NULL or not.

Use any()
Python also provide any() method which returns TRUE if there is at least single data point which is true for checked condition.

Use all()
Returns TRUE if all the data points follow the condition.

Now, as we know that there are some nulls/NaN values in our data frame, let's check those out - 

data.isnull().sum() - this will return the count of NULLs/NaN values in each column.

If you want to get total no of NaN values, need to take sum once again -


If you want to get any particular column's NaN calculations - 

Here, I have attached the complete Jupyter Notebook for you -

If you want to download the data, You can get it from HERE.

Like the below page to get update

Thursday, 23 March 2017

Measures of Data Spread in Stats

What do we mean by SPREAD? - The measures which can tell us the variability of a dataset, width, average distribution falls into this category.

Let's see which measures we are taking about-

Input: 45, 67, 23, 12, 9, 43, 12, 17, 91
Sorted: 9, 12, 12, 17, 23, 43, 45, 67, 91

It is the simplest measures of Spread. It is the difference between max and min value of a dataset but this will not give you the idea about the data distribution. It may be given a wrong interpretation if our dataset is having outliers.

Range - Max - Min = 91 - 9 = 82

Interquartile Range (IQR):
IQR is the middle 50 percentile data which is difference between 75 percentile and 25 percentile. It is used in boxplot plotting. 

IQR = Q3 - Q1 = 56 - 12 = 44

Variance shows the distance of each element from its mean, If you simply sum this it will be zero and that is why we use squared distance to calculate it.

Standard Deviation (`\sigma` or s):
This measure is square root of Variance, the only difference between Variance and Standard deviation is the output unit as Variance.

`Variance = \sigma^2 or s^2 = \frac{\Sigma_{i=1}^N(x_i-\barx)^2}{N}`

`Standard Deviation = \sigma or s = \root{2}{\sigma^2} = \root{2}{\frac{\Sigma_{i=1}^N(x_i-\barx)^2}{N}}`

Like the below page to get update

Tuesday, 21 March 2017

Summary Statistics in Data Analysis

Summary statistics  are numbers that summarize properties of the data. i.e - Mean, Spread, tendency etc. We will see each one by one.

Let's take a input dataset -

Input: 45, 67, 23, 12, 9, 43, 12, 17, 91
Sorted: 9, 12, 12, 17, 23, 43, 45, 67, 91

Frequency: The frequency of an attribute value is the percentage of time the value occurs in the data set.
In our dataset, Frequency of 12 is 2.

Mode: The mode of a an attribute is the most frequent attribute value

Mode for our dataset is 2 as 12 is the most frequent item which occurs 2 time

Things to remember:
i- There is no mode if all the values are same
ii - Same is applicable if all values occurrence is 1 

Usually, Mode and Frequency are used for categorical data

summary statistics in data analysis

Percentiles: This used for continuous data.
Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile
is a value xp of x such that p% of the observed values of x are less than xp.

How to calculate the Percentile:
1. Count the total item in dataset = N
2. Multiply the percentile p with total no of items = N*p
3. This will give you a no which can be a float or integer
4. If it is a float, round off it to nearest integer, named pth no
          i. Sort the data into increasing order
          ii. Now, pth no in this dataset is your percentile value
5. If it is an integer no
          i. Sort the data into increasing order
          ii. Now, average of pth no and (p+1)th no in this dataset is your percentile value

So when we say, 20% means -

No of items in dataset = 9
No of items which should be less than xp. - 9*20% = 1.8
Round off this to nearest integer - 2
Our dataset is already sorted in increasing order, so check the 2nd value - 12

likewise, 25%, 50% and 75% is - 9*25%, 9*50%, 9*75% = 2.25th, 4.5th, 6.75th
2th, 5th, 7th - 12, 23, 45

This is one way to calculate the percentile, If you use calculator or some other method, it might be slightly different.

Mean or Average:  Sum(all items) / Total no of element

Mean -  (9+12+12+17+23+43+45+67+91)/9 = 34.4

However, the mean is very sensitive to outliers. So to understand the data tendency, we go for median rather than means.

Median: Median is 50 percentile, or middle value

How to get Median/Middle value - a. Sort the data into increasing orderb. Get total no of elements - N     if N is even -  median =   ( N/2th element + [N/2 + 1]th element) / 2     if N is odd - median = ceil(N/2)th element

For our case, N = 9, which is odd, so ceil(9/2) = ceil(4.5) = 5th element 
Median = 23

Range:  Difference between Max and Min is called range. 

Input dataset range - 91-9 = 82

Variance: The variance or standard deviation is the most common measure of the spread of a set of points.

`variance(x) = \sigma^2 = \frac{1}{n-1}\Sigma_{i=1}^n(x_i-\bar{x})^2`

where `\bar{x}` is Mean of all value of x
m = total no of items in dataset
`\sigma` is standard deviation

Like the below page to get update

Sunday, 19 March 2017

5 Number Summary - Statistics Basics

What is 5 no summary?

5 no summary is an statistical measure to get the idea about the data tendency.

It includes :

1.  Minimum
2.  Q1 (25 percentile)
3.  Median (middle value or 50 percentile)
4.  Q3 (75 percentile)
5.  Maximum

5 number summary

How to calculate or get these values??

Input data :  45, 67, 23, 12, 9, 43, 12, 17, 91

Step1:  Sort the data

9, 12, 12, 17, 23, 43, 45, 67, 91

Step2:  You can easily get the minimum and maximum no

Min : 9
Max : 91

Step 3: Finding the median - Finding the middle value, dont confuse with Mean or Average. 

How to get Median/Middle value - 
a. Sort the data into increasing order
b. Get total no of elements - N
     if N is even -  median =   ( N/2th element + [N/2 + 1]th element) / 2
     if N is odd - median = ceil(N/2)th element

For our case, N = 9, which is odd, so ceil(9/2) = ceil(4.5) = 5th element 
Median = 23

Step 4: Finding our the Q1 and Q3 (called Quantile) is very easy. Divide the element list into 2 list by Median value - 

 (9, 12, 12, 17), 23, (43, 45, 67, 91) 

Now, Find out the Median for 1st list which is Q1 and Median for 2nd list which is Q3

As we can see, list1 and list2 both are having even no of elements so  - 

Median of list1 (Q1) =  ( N/2th element + [N/2 + 1]th element) / 2
                                  =  ( 4/2th element + [4/2 +1]th element) / 2
                                  =  ( 2nd element  + 3rd element ) /2
                                  =  (12 + 12 ) / 2 
                            Q1 = 12

Median of list2 (Q3) = ( 45 + 67 ) / 2
                                  = 112 / 2
                                  = 56 

We got the Q1 (12) and Q3 (56). 

Our 5 no summary is calculated which is -  

min, Q1, median, Q3, max 
9,     12,  23,         56, 91

Like the below page to get update

Saturday, 14 January 2017

Learning Numpy #2

Thursday, 12 January 2017

Learning Numpy #1

Numpy is a python library used for numerical calculations and this is better performant than pure python. In this notebook, I have shared some basics of Numpy and will share more in next few posts. I hope you find these useful.

Click Here for Next Tutorial ~

Like the below page to get update

Wednesday, 11 January 2017

My Learning Path for Machine Learning

I am a Python Lover guy so my way includes lots of Python points. If you dont know the basics of this wonderful language, start it from HERE else you can follow the links which I am going to share.

Learning ML is not only studying ML algorithms, it includes Basic Algebra, Statistics, Algorithms, Programming and lot more. But no need to afraid as such :-) we need to start from somewhere.....

This is my github repo, you can fork it and follow me with these 2 links --

Fork Fork
Follow - Follow @atulsingh0
I am still updating this list and welcome you to update this as well.

Like the below page to get update

Friday, 6 January 2017

10 minutes with pandas library

Thursday, 5 January 2017

Learning Pandas #5 - read & write data from file

Wednesday, 4 January 2017

Learning Pandas #4 - Hierarchical Indexing

Sunday, 1 January 2017

Learning Pandas #3 - Working on Summary & MissingData

Saturday, 31 December 2016

Learning Pandas - DataFrame #2

Friday, 30 December 2016

Learning Pandas - Series #1

Thursday, 6 October 2016

Difference between IBM CDC and CDD

Now a days I am busy with an another IBM tool called CDC - Change Data Capture, now knows as IBM Infosphere Data Replication.
IBM CDC replicates your heterogeneous data in near real time to support data migrations, application consolidation, data synchronization, dynamic warehousing, master data management (MDM), business analytics and data quality processes.
In layman's terms, you can replicate any data automatically in near real time.

When we were at initial stage of our POC, we got a doubt between IBM CDC and CDD (Change Data Delivery) which one we have to use and what is the difference and all. Whenever we were googling about IBM CDD, getting the result with IBM CDC links, In some way we were sure that both are same tool or related with each other. Luckily we found an IBM link which says -
InfoSphere CDD is the exact same code (product) as InfoSphere CDC. The only difference is the licensing model. Please reach out to your IBM Sales Representative for additional details 

So we have contacted our IBM sales buddy to understand the licensing model and this is what we got to know -

* IBM CDC and CDD is same product
* IBM CDC comes with Source and Target Agents and there is individual licensing cost for each one agent.
i.e. - Assuming you are having 2 different database vender (Oracle, Db2) at source side and 3 target db vender (SQL Server, MySQL, Netezza) so you have to pay for 5 replication CDC agent which can be increase or decrease with no of different database software you are using.

* IBM CDD (Change Data Delivery) comes with little relief with costing but only if you already have IBM DataStage license. IBM CDD comes with multiple source agents software and one target agents for IBM DataStage which saves the cost.
 i.e . - Let's take the previous example, We need to buy 2 (for source dbs) + 1(for datastage) license to use Replication.
The benfit here is, you only need to pay for 1 target agent (for DataStage) and use DataStage to deliver your data to any target which saves multiple target license cost.

This is the only reason why they have different names despite being the same software - 
IBM CDC - Change Data Capture ( n source + n target )  [DataStage not required]
IBM CDD - Change Data Delivery ( n source  + 1 target ) [DataStage required]

Like the below page to get update

Wednesday, 3 August 2016

#3 How to Copy DataSet from One Server to Another Server

This post is third and last of How to Copy DataSet from One Server to Another Server Series

We have generated a populated a dataset and identified the files which we need to move to another server serverB from serverA


4. Reading the dataset on another server

This is the most crucial step, Now all 4 files are moved on serverB or the common location which can be accessible from serverB.

For my case, common dir is my home - /home/users/atul

A. Change the default.apt file
We need to change the fastname in default.apt (config file) which we copied from the serverA, [ NOT the default.apt for serverB]

Open the file in any text editor or vi and change as below screen shot -

Temporarily create the "resource disk" and "resource scratchdisk" location if not existing as defined in above config file.

B. Copy the dataset data files 

Move the dataset data file from common directory to "resource disk" as defined in config file.

cp ~/dummy.ds.* /opt/IBM/InformationServer/Server/DataSets/

Now, all files locations are like -

Config file and Dataset descriptor file - my home dir or common dir
Dataset data files - /opt/IBM/InformationServer/Server/DataSets/

Design a job which will read thess dataset files and populate data into sequential file or any other output.

Job Paramaters -
APT_CONFIG_FILE = /home/users/atul/default.apt

DataSet Properties
DataSet File - /home/users/atul/dummy.ds

That is all, you can read the copied dataset on serverB, you can populate this data to some other output such as seq file, table so that you can avoid the use of copied default.apt config file which is not for serverB.

Try it out, let me know if you have any question.

If you like this post, follow the below pages to get update