Monday, May 8, 2017


Measuring Data Similarity or Dissimilarity #1

by Atul Singh on May 08, 2017 in Analysis, data, Data Mining, Dissimilarity, Graphical, Plotting, Similarity, Statistics

A common question in data mining is how to measure whether two datasets are similar or not. There are several ways to calculate similarity and dissimilarity, and the right one depends on the data type. Let's look at these methods -

1. For Binary Attribute:

A binary attribute has only two states, 0 or 1, where 0 means the attribute is absent and 1 means it is present. To calculate the similarity or dissimilarity between two objects i and j described by binary attributes, we use a contingency table with the following counts -

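q - number of attributes where i and j are both equal to 1
r - number of attributes where i is 1 and j is 0
s - number of attributes where i is 0 and j is 1
t - number of attributes where i and j are both equal to 0
p - total number of attributes (q + r + s + t)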
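
As a rough illustration, the four counts can be tallied from two equal-length 0/1 vectors. The function name `contingency_counts` and the sample vectors below are made up for this sketch; they do not come from any library.

```python
from typing import Sequence, Tuple


def contingency_counts(i: Sequence[int], j: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return the counts (q, r, s, t) for two binary vectors of equal length."""
    if len(i) != len(j):
        raise ValueError("i and j must have the same number of attributes")
    q = r = s = t = 0
    for a, b in zip(i, j):
        if (a, b) == (1, 1):
            q += 1  # attribute present in both objects
        elif (a, b) == (1, 0):
            r += 1  # present in i only
        elif (a, b) == (0, 1):
            s += 1  # present in j only
        elif (a, b) == (0, 0):
            t += 1  # absent in both objects
        else:
            raise ValueError("attribute values must be 0 or 1")
    return q, r, s, t


# Hypothetical objects described by six binary attributes each.
i = [1, 0, 1, 0, 0, 0]
j = [1, 0, 1, 0, 1, 0]
q, r, s, t = contingency_counts(i, j)
p = q + r + s + t
print(q, r, s, t, p)
```

Raising an error on values other than 0 or 1 is a design choice for this sketch; real data may need cleaning or encoding first.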

a. Symmetric Binary Dissimilarity -

For a symmetric binary attribute, both states are equally valuable. If i and j are described by symmetric binary attributes, the dissimilarity is calculated as -

` d(i, j) = \frac{r + s}{q + r + s + t} `
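
A minimal sketch of this formula in Python, assuming the counts q, r, s and t are already known. The function name and the example counts are illustrative only.

```python
def symmetric_binary_dissimilarity(q: int, r: int, s: int, t: int) -> float:
    """d(i, j) = (r + s) / (q + r + s + t) for symmetric binary attributes."""
    p = q + r + s + t
    if p == 0:
        raise ValueError("no attributes to compare")
    return (r + s) / p


# Hypothetical counts: 2 shared presences, 1 mismatch, 3 shared absences.
d = symmetric_binary_dissimilarity(q=2, r=0, s=1, t=3)
print(round(d, 3))
```

Here one attribute out of six disagrees, so the dissimilarity is 1/6.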

b. Asymmetric Binary Dissimilarity -

For an asymmetric binary attribute, the two states are not equally important. One state (usually 1, meaning present) overshadows the other, so such binary attributes are often called "monary" (having one state). Since matches on 0 carry little information for these attributes, t is left out of the denominator, and the dissimilarity is calculated as -

`d(i, j) = \frac{r + s}{q + r + s}`
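
The same idea can be sketched for the asymmetric case. The function name is hypothetical, and raising an error when q + r + s is zero is an assumption of this sketch, since the formula is undefined when neither object has any attribute present.

```python
def asymmetric_binary_dissimilarity(q: int, r: int, s: int) -> float:
    """d(i, j) = (r + s) / (q + r + s); the 0-0 matches (t) are ignored."""
    denominator = q + r + s
    if denominator == 0:
        raise ValueError("undefined: no attribute is present in either object")
    return (r + s) / denominator


# Hypothetical counts: 2 shared presences and 1 mismatch.
d = asymmetric_binary_dissimilarity(q=2, r=0, s=1)
print(round(d, 3))
```

With q = 2, r = 0, s = 1 and t = 3, this gives 1/3, whereas keeping t in the denominator would give 1/6. Dropping the shared absences makes the single mismatch weigh more.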

Likewise, we can calculate the asymmetric binary similarity -

` sim(i, j) = 1 - d(i, j) `

which leaves us with

` JC = sim(i, j) = \frac{q}{q + r + s} `

The coefficient sim(i, j) is also known as the Jaccard coefficient.
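
The Jaccard coefficient can also be read in set terms: q is the size of the intersection of the attributes present in i and in j, and q + r + s is the size of their union. A small sketch under that reading, with a made-up function name and sample vectors:

```python
from typing import Sequence


def jaccard_coefficient(i: Sequence[int], j: Sequence[int]) -> float:
    """sim(i, j) = q / (q + r + s), computed as |intersection| / |union|."""
    if len(i) != len(j):
        raise ValueError("i and j must have the same number of attributes")
    # Positions of the attributes that are present (equal to 1) in each object.
    present_i = {k for k, v in enumerate(i) if v == 1}
    present_j = {k for k, v in enumerate(j) if v == 1}
    union = present_i | present_j
    if not union:
        raise ValueError("undefined: no attribute is present in either object")
    return len(present_i & present_j) / len(union)


# Hypothetical objects described by six binary attributes each.
i = [1, 0, 1, 0, 0, 0]
j = [1, 0, 1, 0, 1, 0]
print(round(jaccard_coefficient(i, j), 3))
```

For these vectors two attributes are present in both objects and three are present in at least one, so the coefficient is 2/3, which matches 1 - d(i, j) with d(i, j) = 1/3.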


