Tuesday, 20 June 2017

Crontab for Windows #iLoveScripting


While working on one of my project, I required to take backup of all the work which I have completed coz workplace is shared among many developers.
     So being a Linux person, I was looking for something simple like Crontab but ended with Windows Task Scheduler. 

Tool is simple but not as Linux Crontab But it did the work asked by me :-)

How to Use Task Scheduler - 

  • Login with Admin privilege user account  
  • Open Run and "Taskschd.msc"
  •  Or  Go to Start --> Control Panel --> System and Maintenance --> Administrative Tools --> Task Scheduler
  • Click on "Create Task" on right hand side
  • This will open a Wizard to create Task
  • Fill the Task Name, Owner, Privilege and Configured for as below - 
  • Now, Click on Next Tab - Trigger, Here you have to define the time when you want to execute the program
  • You can fill the different Setting to customize your schedule.
  • Now, Click on "Action Tab"
  • In this tab, you have to define the action, such as when triggered what program/script should be execute
  • Click OK
  • You can see your task created under "Task Scheduler Library"


For More details on Task Scheduler, You can visit below links -
https://technet.microsoft.com/en-us/library/cc748993(v=ws.11).aspx




Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Wednesday, 31 May 2017

Remove ctrl-M character from all files within Directory #iLoveScripting


Continuing our journey on #iLoveScripting,..............
This script will do the same task as "clnM.sh" but this will accept Directory Path as an input rather than the filename. It will iterate through each file within given directory and remove all Ctrl-M characters.


If you are unable to see the Script, Please find it here - LINK







Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Tuesday, 30 May 2017

Remove ctrl-M character from file #iLoveScripting


This is my first post under #iLoveScripting which will have lots of shell script which are helping me in my day to day task and sharing here for all guys for easing their work as well.

 The very magical script, which I have use, is "clnM.sh". This script is remove the ctrl-M characters (^M) from your windows file.

Usage:  clnM.sh <FILE>


If you are unable to see the Script, Please find it here - LINK




Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Sunday, 21 May 2017

dos2unix - A script to convert DOS to LINUX formatting #iLoveScripting



dos2unix - a simple filter to convert text files in DOS format to UNIX/LINUX end of line conventions by removing the carriage return character(\r).  This will leave the newline character(\n) which unix expects.

Usgae:
dos2unix [file1] :  Remove DOS End of Line (EOL) char from file1, write back to file1
dos2unix [file1] [file2] : Remove DOS EOL char from file1, write to file2
dos2unix -d [directory] : Remove DOS EOL char from all files in directory



Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Friday, 12 May 2017

#3 - Measuring Data Similarity or Dissimilarity


Continue from -
 'Measuring Data Similarity or Dissimilarity #1'
 'Measuring Data Similarity or Dissimilarity #2',


3. For Ordinal Attributes:

Ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them but the magnitude between successive values is not known. Ordinal values are same as Categorical Values but with the Order.

Such as, For "Performance" columns Values are - Best, Better, Good, Average, Below Average, Bad

These values are Categorical values with order or rank so called Ordinal Values. Ordinal attributes can also be derived from discretization of numeric attributes by splitting the value range into finite number of ordered categories.

We assign rank to these categories to calculate the similarity or dissimilarity, i.e. - There is an attribute f having N possible state can have `1, 2, 3........f_N` ranking.


Measuring Data Similarity or Dissimilarity for Ordinal Attributes


How to Calculate Similarity or Dissimilarity: 

1, Assign the Rank `R_if`to each category of attribute f having N possible states.
2. Normalize the Rank between [0.0, 1.0] so that each attribute have equal weight.
Can be calculated as

`R_in = \frac{R_if - 1}{N - 1}`

3. Now Similarity or Dissimilarity can be calculated with any distance measuring techniques. ( 'Measuring Data Similarity or Dissimilarity #2)






Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Tuesday, 9 May 2017

Measuring Data Similarity or Dissimilarity #2


Continuing from our last discussion 'Measuring Data Similarity or Dissimilarity #1',  In this post we are going to see how to calculate the similarity or dissimilarity between Numeric Data Types.

2. For Numeric Attribute:

For measuring the dissimilarity between two numeric data points, the easiest or most used way to calculate the 'Euclidean distance', Higher the value of distance, higher the dissimilarity.
           There are two more distance measuring methods named 'Manhattan distance' and 'Minkowski distance'. We are going to look into these one by one. 


a. Euclidean distance: 

Euclidean distance is widely used to calculate the dissimilarity between numeric data points, this is actually derived from 'Pythagoras Theorem' so also known as 'Pythagorean metric' or `L^2` norm.

Euclidean distance between two points `p(x_1, y_1)` and `q(x_2, y_2)` is the length which connects point p from point q.

`dis(p,q) = dis(q,p) = \sqrt((x_2 - x_1)^2 + (y_2 - y_1)^2) = \sqrt(\sum_(i=1)^N(q_i - p_i)^2)`

In One Dimention:

`dis(p,q) = dis(q,p) = \sqrt((q - p)^2) = q - p`

In Two Dimentions:

`dis(p,q) = dis(q,p) = \sqrt((q_1 - p_1)^2 + (q_2 - p_2)^2)`

In Three Dimentions:

`dis(p,q) = dis(q,p) = \sqrt((q_1 - p_1)^2 + (q_2 - p_2)^2 + (q_3 - p_3)^2)`

In N Dimentions:

`dis(p,q) = dis(q,p) = \sqrt((q_1 - p_1)^2 + (q_2 - p_2)^2 + (q_3 - p_3)^2 +.......................+ (q_N - p_N)^2)`


b. Manhattan distance: 

It is also known as "City Block" distance as it is calculated same as we calculate the distance between any two block of city. It is simple difference between the data points.

`dis(p, q) = |(x_2 - x_1)| + |(y_2 - y_1)| = \sum_(i=1)^N|(q_i - p_i)|`

Manhattan distance is also know as `L^1` norm.


c. Minkowski distance: 

This is the generalized form of Euclidean or Manhattan distance and represented as - 

`dis(p,q) = dis(q,p) = [(x_2 - x_1)^n + (y_2 - y_1)^n]^{1/n} = [\sum_(i=1)^N(q_i - p_i)^n]^{1/n}`

where n = 1, 2, 3.......






Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Measuring Data Similarity or Dissimilarity #1


Yet another question is in data mining to measure whether two datasets are similar or not. There are so many ways to calculate these values based on Data Type. Let's see into these methods -

1. For Binary Attribute:

Binary attributes are those which is having only two states 0 or 1, where 0 means attribute is absent and 1 means it is present. For calculating similarity/dissimilarity between binary attributes we use contingency table -

Contingency Table

q - if i and j both are equal to 1
r - if i is 1 and j is 0
s - if i is 0 and j is 1
t - if i and j both are equal to 0
p - total ( q+r+s+t)

a. Symmetric Binary Dissimilarity - 

For symmetric binary attribute, each state is equally valuable. If i and j are symmetric binary attribute then dissimilarity is calculates as -

`  d(i, j) = \frac{r + s}{q + r + s + t}  `


b. Asymmetric Binary Dissimilarity - 

For asymmetric binary attribute, two states are not equally important. Any one state overshadow the other, such binary attribute are often called "monary" (having one state). For these kind of attribute, dissimilarity is calculates as - 

`d(i, j) = \frac{r + s}{q + r + s}`

likewise, we can calculate the similarity (asymmetric binary similarity)

` sim(i, j) = 1 - d(i, j) `

which leave us with  

` JC = sim(i, j) = \frac{q}{q + r + s} `

The coefficient sim(i, j) is also known as Jaccard coefficient. 





Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Tuesday, 25 April 2017

Graphical Display of Basic Stats of Data



After a long time, got a chance to share somethings with you guys, so feeling awesome :-), Today we are gonna see the Graphical Display of Data Stats or sometime we call it Exploratory Data Analysis as well, This is the best way to understand your data in very less time and set your analysis path for it. So without doing more chats, let's start -

1. Scatter Plot

Very Basic, Very Easy and Most Used EDA(Exploratory Data Analysis) technique. It is 2-D plot between X and Y variables where X or Y can be numeric data features or columns.
               With this plot we can easily see if there is any relationship, pattern or trends between between these 2 features or any data outlier existing. It is also useful to explore possibility of correlation relationships. Correlation can be positive, negative or neutral.

Now, let's look into a scatter plot -
I am using IRIS dataset and Python matplotlib library for this illustration - -

scatter iris

2. Histogram

Histogram plot is one of the oldest plotting technique to summarize the data distribution of a attribute X. X can be numerical feature and height of bar is frequency. Resulting plot is also called Bar Chart.

histogram bar chart iris

3. Quantile Plot (Bar Charts)

Quantile Plot or Bar Charts also used to display the uni-variate variables data distribution as well as plot the percentile information with outlier detection.

qunatile box plot

Keep looking for this space for further update.

Happy Learning




Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Wednesday, 5 April 2017

NULL Handling in Sequential File Stage



DataStage has a mechanism for denoting NULL field values. It is slightly different in server and parallel jobs. In the sequential file stage a character or string may be used to represent NULL column values. Here's how represent NULL with the character "~":

Server Job:
1. Create a Sequential file stage and make sure there is an Output link from it.
2. Open the Sequential file stage and click the "Outputs" tab ans Select "Format"
3. On the right enter the "~" next to "Default NULL string:"

Parallel Job:
1. Create a Sequential file stage and make sure there is an Output link from it.
2. Open the Sequential file stage and click the "Outputs" tab ans Select "Format"
3. Right click on "Field defaults" ==> "Add sub-property" and select "Null field value"
4. Enter the "~" in the newly created field.





Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Saturday, 1 April 2017

Get Over Running DataStage Job Details #iLoveScripting


With the script below, we will get a list of jobs which are taking more time to complete than last run time. By some tweaks, we can use this script to monitor any kind of process.






Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Saturday, 25 March 2017

Check if Python Pandas DataFrame Column is having NaN or NULL


Before implementing any algorithm on the given data, It is a best practice to explore it first so that you can get an idea about the data. Today, we will learn how to check for missing/Nan/NULL values in data.

1. Reading the data
Reading the csv data into storing it into a pandas dataframe.


2. Exploring data
Checking out the data, how it looks by using head command which fetch me some top rows from dataframe.


3. Checking NULLs
Pandas is proving two methods to check NULLs - isnull() and notnull()
These two returns TRUE and FALSE respectively if the value is NULL. So let's check what it will return for our data

isnull() test

notnull() test

Check 0th row, LoanAmount Column - In isnull() test it is TRUE and in notnull() test it is FALSE. It mean, this row/column is holding null.

But we will not prefer this way for large dataset, as this will return TRUE/FALSE matrix for each data point, instead we would interested to know the counts or a simple check if dataset is holding NULL or not.

Use any()
Python also provide any() method which returns TRUE if there is at least single data point which is true for checked condition.


Use all()
Returns TRUE if all the data points follow the condition.


Now, as we know that there are some nulls/NaN values in our data frame, let's check those out - 

data.isnull().sum() - this will return the count of NULLs/NaN values in each column.


If you want to get total no of NaN values, need to take sum once again -

data.isnull().sum().sum()


If you want to get any particular column's NaN calculations - 




Here, I have attached the complete Jupyter Notebook for you -



If you want to download the data, You can get it from HERE.




Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Thursday, 23 March 2017

Measures of Data Spread in Stats


What do we mean by SPREAD? - The measures which can tell us the variability of a dataset, width, average distribution falls into this category.

Let's see which measures we are taking about-

Input: 45, 67, 23, 12, 9, 43, 12, 17, 91
Sorted: 9, 12, 12, 17, 23, 43, 45, 67, 91



Range:
It is the simplest measures of Spread. It is the difference between max and min value of a dataset but this will not give you the idea about the data distribution. It may be given a wrong interpretation if our dataset is having outliers.

Range - Max - Min = 91 - 9 = 82

Interquartile Range (IQR):
IQR is the middle 50 percentile data which is difference between 75 percentile and 25 percentile. It is used in boxplot plotting. 

IQR = Q3 - Q1 = 56 - 12 = 44

Variance:
Variance shows the distance of each element from its mean, If you simply sum this it will be zero and that is why we use squared distance to calculate it.

Standard Deviation (`\sigma` or s):
This measure is square root of Variance, the only difference between Variance and Standard deviation is the output unit as Variance.


`Variance = \sigma^2 or s^2 = \frac{\Sigma_{i=1}^N(x_i-\barx)^2}{N}`

`Standard Deviation = \sigma or s = \root{2}{\sigma^2} = \root{2}{\frac{\Sigma_{i=1}^N(x_i-\barx)^2}{N}}`







Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Wednesday, 22 March 2017

The Three M in Statis : Measures of Center


In Statistics, 3M summary is very important as it tells a lot about data distribution. These Ms are - Mean, Median and Mode

Mean - Average
Median - Middile Value
Mode - Frequent Item count

You can look into "SUMMARY STATISTICS IN DATA ANALYSIS"
for the calulations.




Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/