## Wednesday, 31 May 2017

### Remove ctrl-M character from all files within Directory #iLoveScripting

Continuing our journey on #iLoveScripting...

This script performs the same task as "clnM.sh", but it accepts a directory path as input rather than a filename. It iterates through each file within the given directory and removes all Ctrl-M characters.
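The actual script is behind the link below; as a minimal sketch of the idea (assuming we only need to handle regular files directly inside the directory, and using `tr` to drop the carriage returns), it might look like:

```shell
#!/bin/sh
# Sketch of the directory version of clnM.sh: strip Ctrl-M
# (carriage return, \r) from every regular file inside "$1".
clean_dir() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue                           # skip non-files
        tr -d '\r' < "$f" > "$f.tmp" && mv "$f.tmp" "$f"  # rewrite in place
    done
}

# Example: clean_dir /path/to/directory
```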

If you are unable to see the script, please find it here - LINK

## Tuesday, 30 May 2017

### Remove ctrl-M character from file #iLoveScripting

This is my first post under #iLoveScripting, a series that will collect the shell scripts that help me in my day-to-day tasks; I am sharing them here to ease your work as well.

A very magical script that I use is "clnM.sh". This script removes the Ctrl-M characters (^M) from your Windows files.

Usage:  clnM.sh <FILE>
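The real clnM.sh is linked below; a minimal sketch of the idea, using `tr` to delete the carriage returns and rewrite the file in place, could be:

```shell
#!/bin/sh
# Sketch of clnM.sh: remove Ctrl-M (^M, the carriage return \r)
# from the file given as the first argument, rewriting it in place.
clnM() {
    f="${1:?Usage: clnM.sh <FILE>}"
    tr -d '\r' < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
}
```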

If you are unable to see the script, please find it here - LINK

## Sunday, 21 May 2017

### dos2unix - A script to convert DOS to LINUX formatting #iLoveScripting

dos2unix - a simple filter to convert text files in DOS format to UNIX/Linux end-of-line conventions by removing the carriage return character (\r). This leaves only the newline character (\n), which Unix expects.

Usage:
dos2unix [file1] :  Remove DOS End of Line (EOL) char from file1, write back to file1
dos2unix [file1] [file2] : Remove DOS EOL char from file1, write to file2
dos2unix -d [directory] : Remove DOS EOL char from all files in directory
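A minimal sketch of a wrapper implementing the three modes above (the function name is illustrative, and the real dos2unix utility supports many more options):

```shell
#!/bin/sh
# Sketch of the three dos2unix modes described above.
d2u() {
    if [ "$1" = "-d" ]; then
        # -d <directory>: clean every regular file in the directory
        for f in "$2"/*; do
            [ -f "$f" ] || continue
            tr -d '\r' < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
        done
    elif [ "$#" -ge 2 ]; then
        tr -d '\r' < "$1" > "$2"                          # file1 -> file2
    else
        tr -d '\r' < "$1" > "$1.tmp" && mv "$1.tmp" "$1"  # file1 in place
    fi
}
```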

## Friday, 12 May 2017

### #3 - Measuring Data Similarity or Dissimilarity

Continued from -
'Measuring Data Similarity or Dissimilarity #1' and
'Measuring Data Similarity or Dissimilarity #2'.

### 3. For Ordinal Attributes:

An ordinal attribute is an attribute whose possible values have a meaningful order or ranking among them, but where the magnitude between successive values is not known. Ordinal values are like categorical values, but with an order.

For example, a "Performance" column may have the values - Best, Better, Good, Average, Below Average, Bad.

These values are categorical values with an order or rank, hence called ordinal values. Ordinal attributes can also be derived by discretizing a numeric attribute, splitting its value range into a finite number of ordered categories.

We assign a rank to these categories to calculate similarity or dissimilarity, i.e., an attribute f having N possible states can be ranked 1, 2, 3, ..., N.

#### How to Calculate Similarity or Dissimilarity:

1. Assign the rank R_if to each category of attribute f having N possible states.
2. Normalize the rank to the range [0.0, 1.0] so that each attribute has equal weight. It can be calculated as

R_in = \frac{R_if - 1}{N - 1}

3. Now similarity or dissimilarity can be calculated with any distance-measuring technique ('Measuring Data Similarity or Dissimilarity #2').
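As a worked example with the "Performance" attribute above (N = 6 states, ranked Bad = 1 through Best = 6; the rank assignment is chosen here for illustration):

```latex
% "Good" has rank R_if = 4 out of N = 6 states, so its normalized value is
R_in = \frac{R_if - 1}{N - 1} = \frac{4 - 1}{6 - 1} = 0.6
```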

## Tuesday, 9 May 2017

### Measuring Data Similarity or Dissimilarity #2

Continuing from our last discussion, 'Measuring Data Similarity or Dissimilarity #1', in this post we are going to see how to calculate the similarity or dissimilarity between numeric data types.

### 2. For Numeric Attribute:

For measuring the dissimilarity between two numeric data points, the easiest and most used way is to calculate the 'Euclidean distance'; the higher the value of the distance, the higher the dissimilarity.
There are two more distance-measuring methods, named 'Manhattan distance' and 'Minkowski distance'. We are going to look into these one by one.

#### a. Euclidean distance:

Euclidean distance is widely used to calculate the dissimilarity between numeric data points. It is derived from the 'Pythagorean theorem', so it is also known as the 'Pythagorean metric' or the L^2 norm.

The Euclidean distance between two points p(x_1, y_1) and q(x_2, y_2) is the length of the line segment connecting p and q.

dis(p,q) = dis(q,p) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} = \sqrt{\sum_{i=1}^{N}(q_i - p_i)^2}

In One Dimension:

dis(p,q) = dis(q,p) = \sqrt{(q - p)^2} = |q - p|

In Two Dimensions:

dis(p,q) = dis(q,p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}

In Three Dimensions:

dis(p,q) = dis(q,p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + (q_3 - p_3)^2}

In N Dimensions:

dis(p,q) = dis(q,p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + (q_3 - p_3)^2 + ... + (q_N - p_N)^2}
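As a quick check for the hypothetical points p = (1, 2) and q = (4, 6), the shell's awk can do the floating-point arithmetic:

```shell
#!/bin/sh
# Two-dimensional Euclidean distance between the hypothetical points
# p = (1, 2) and q = (4, 6): sqrt(3^2 + 4^2) = 5.
awk 'BEGIN { p1 = 1; p2 = 2; q1 = 4; q2 = 6
             print sqrt((q1 - p1)^2 + (q2 - p2)^2) }'
# prints 5
```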

#### b. Manhattan distance:

It is also known as the "city block" distance, as it is calculated the same way we would calculate the distance between any two blocks of a city. It is the simple sum of the absolute differences between the data points.

dis(p, q) = |x_2 - x_1| + |y_2 - y_1| = \sum_{i=1}^{N}|q_i - p_i|

Manhattan distance is also known as the L^1 norm.
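For the hypothetical points p = (1, 2) and q = (4, 6), the Manhattan distance works out as:

```latex
dis(p, q) = |4 - 1| + |6 - 2| = 3 + 4 = 7
```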

#### c. Minkowski distance:

This is the generalized form of the Euclidean and Manhattan distances, and is represented as -

dis(p,q) = dis(q,p) = [|x_2 - x_1|^n + |y_2 - y_1|^n]^{1/n} = [\sum_{i=1}^{N}|q_i - p_i|^n]^{1/n}

where n = 1, 2, 3, ...; n = 1 gives the Manhattan distance and n = 2 gives the Euclidean distance.
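Substituting values of n recovers the earlier measures; for the hypothetical points p = (1, 2) and q = (4, 6):

```latex
n = 1: \; (|4 - 1|^1 + |6 - 2|^1)^{1/1} = 7 \quad \text{(Manhattan distance)}
n = 2: \; (|4 - 1|^2 + |6 - 2|^2)^{1/2} = 5 \quad \text{(Euclidean distance)}
```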

### Measuring Data Similarity or Dissimilarity #1

Yet another question in data mining is how to measure whether two data objects are similar or not. There are many ways to calculate these values based on the data type. Let's look into these methods -

### 1. For Binary Attributes:

Binary attributes are those having only two states, 0 or 1, where 0 means the attribute is absent and 1 means it is present. For calculating the similarity/dissimilarity between two objects i and j described by binary attributes, we use a contingency table of attribute counts:

|           | j = 1 | j = 0 |
|-----------|-------|-------|
| **i = 1** | q     | r     |
| **i = 0** | s     | t     |

q - the number of attributes where i and j both equal 1
r - the number of attributes where i is 1 and j is 0
s - the number of attributes where i is 0 and j is 1
t - the number of attributes where i and j both equal 0
p - total (q + r + s + t)

#### a. Symmetric Binary Dissimilarity -

For a symmetric binary attribute, each state is equally valuable. If i and j are described by symmetric binary attributes, then the dissimilarity is calculated as -

  d(i, j) = \frac{r + s}{q + r + s + t}

#### b. Asymmetric Binary Dissimilarity -

For an asymmetric binary attribute, the two states are not equally important; one state overshadows the other, and such binary attributes are often called "monary" (having one state). For this kind of attribute the number of negative matches, t, is considered unimportant and is dropped, so the dissimilarity is calculated as -

d(i, j) = \frac{r + s}{q + r + s}

Likewise, we can calculate the similarity (asymmetric binary similarity) as

 sim(i, j) = 1 - d(i, j)

which leaves us with

 JC = sim(i, j) = \frac{q}{q + r + s}

The coefficient sim(i, j) is also known as Jaccard coefficient.
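As a small worked example, take two objects described by four asymmetric binary attributes, i = (1, 1, 0, 0) and j = (1, 0, 1, 0), so that q = 1, r = 1, s = 1, t = 1:

```latex
d(i, j) = \frac{r + s}{q + r + s} = \frac{1 + 1}{1 + 1 + 1} = \frac{2}{3},
\qquad
JC = sim(i, j) = \frac{q}{q + r + s} = \frac{1}{3}
```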