Showing posts with label Dissimilarity. Show all posts
Showing posts with label Dissimilarity. Show all posts

Friday, 12 May 2017

#3 - Measuring Data Similarity or Dissimilarity


Continue from -
 'Measuring Data Similarity or Dissimilarity #1'
 'Measuring Data Similarity or Dissimilarity #2',


3. For Ordinal Attributes:

Ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them but the magnitude between successive values is not known. Ordinal values are same as Categorical Values but with the Order.

Such as, For "Performance" columns Values are - Best, Better, Good, Average, Below Average, Bad

These values are Categorical values with order or rank so called Ordinal Values. Ordinal attributes can also be derived from discretization of numeric attributes by splitting the value range into finite number of ordered categories.

We assign rank to these categories to calculate the similarity or dissimilarity, i.e. - There is an attribute f having N possible state can have `1, 2, 3........f_N` ranking.


Measuring Data Similarity or Dissimilarity for Ordinal Attributes


How to Calculate Similarity or Dissimilarity: 

1, Assign the Rank `R_if`to each category of attribute f having N possible states.
2. Normalize the Rank between [0.0, 1.0] so that each attribute have equal weight.
Can be calculated as

`R_in = \frac{R_if - 1}{N - 1}`

3. Now Similarity or Dissimilarity can be calculated with any distance measuring techniques. ( 'Measuring Data Similarity or Dissimilarity #2)






Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Tuesday, 9 May 2017

Measuring Data Similarity or Dissimilarity #2


Continuing from our last discussion 'Measuring Data Similarity or Dissimilarity #1',  In this post we are going to see how to calculate the similarity or dissimilarity between Numeric Data Types.

2. For Numeric Attribute:

For measuring the dissimilarity between two numeric data points, the easiest or most used way to calculate the 'Euclidean distance', Higher the value of distance, higher the dissimilarity.
           There are two more distance measuring methods named 'Manhattan distance' and 'Minkowski distance'. We are going to look into these one by one. 


a. Euclidean distance: 

Euclidean distance is widely used to calculate the dissimilarity between numeric data points, this is actually derived from 'Pythagoras Theorem' so also known as 'Pythagorean metric' or `L^2` norm.

Euclidean distance between two points `p(x_1, y_1)` and `q(x_2, y_2)` is the length which connects point p from point q.

`dis(p,q) = dis(q,p) = \sqrt((x_2 - x_1)^2 + (y_2 - y_1)^2) = \sqrt(\sum_(i=1)^N(q_i - p_i)^2)`

In One Dimention:

`dis(p,q) = dis(q,p) = \sqrt((q - p)^2) = q - p`

In Two Dimentions:

`dis(p,q) = dis(q,p) = \sqrt((q_1 - p_1)^2 + (q_2 - p_2)^2)`

In Three Dimentions:

`dis(p,q) = dis(q,p) = \sqrt((q_1 - p_1)^2 + (q_2 - p_2)^2 + (q_3 - p_3)^2)`

In N Dimentions:

`dis(p,q) = dis(q,p) = \sqrt((q_1 - p_1)^2 + (q_2 - p_2)^2 + (q_3 - p_3)^2 +.......................+ (q_N - p_N)^2)`


b. Manhattan distance: 

It is also known as "City Block" distance as it is calculated same as we calculate the distance between any two block of city. It is simple difference between the data points.

`dis(p, q) = |(x_2 - x_1)| + |(y_2 - y_1)| = \sum_(i=1)^N|(q_i - p_i)|`

Manhattan distance is also know as `L^1` norm.


c. Minkowski distance: 

This is the generalized form of Euclidean or Manhattan distance and represented as - 

`dis(p,q) = dis(q,p) = [(x_2 - x_1)^n + (y_2 - y_1)^n]^{1/n} = [\sum_(i=1)^N(q_i - p_i)^n]^{1/n}`

where n = 1, 2, 3.......






Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Measuring Data Similarity or Dissimilarity #1


Yet another question is in data mining to measure whether two datasets are similar or not. There are so many ways to calculate these values based on Data Type. Let's see into these methods -

1. For Binary Attribute:

Binary attributes are those which is having only two states 0 or 1, where 0 means attribute is absent and 1 means it is present. For calculating similarity/dissimilarity between binary attributes we use contingency table -

Contingency Table

q - if i and j both are equal to 1
r - if i is 1 and j is 0
s - if i is 0 and j is 1
t - if i and j both are equal to 0
p - total ( q+r+s+t)

a. Symmetric Binary Dissimilarity - 

For symmetric binary attribute, each state is equally valuable. If i and j are symmetric binary attribute then dissimilarity is calculates as -

`  d(i, j) = \frac{r + s}{q + r + s + t}  `


b. Asymmetric Binary Dissimilarity - 

For asymmetric binary attribute, two states are not equally important. Any one state overshadow the other, such binary attribute are often called "monary" (having one state). For these kind of attribute, dissimilarity is calculates as - 

`d(i, j) = \frac{r + s}{q + r + s}`

likewise, we can calculate the similarity (asymmetric binary similarity)

` sim(i, j) = 1 - d(i, j) `

which leave us with  

` JC = sim(i, j) = \frac{q}{q + r + s} `

The coefficient sim(i, j) is also known as Jaccard coefficient. 





Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/