Showing posts with label Statistics. Show all posts
Showing posts with label Statistics. Show all posts

## Friday, 12 May 2017

### #3 - Measuring Data Similarity or Dissimilarity

Continue from -
'Measuring Data Similarity or Dissimilarity #1'
'Measuring Data Similarity or Dissimilarity #2',

### 3. For Ordinal Attributes:

Ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them but the magnitude between successive values is not known. Ordinal values are same as Categorical Values but with the Order.

Such as, For "Performance" columns Values are - Best, Better, Good, Average, Below Average, Bad

These values are Categorical values with order or rank so called Ordinal Values. Ordinal attributes can also be derived from discretization of numeric attributes by splitting the value range into finite number of ordered categories.

We assign rank to these categories to calculate the similarity or dissimilarity, i.e. - There is an attribute f having N possible state can have 1, 2, 3........f_N ranking.

#### How to Calculate Similarity or Dissimilarity:

1, Assign the Rank R_ifto each category of attribute f having N possible states.
2. Normalize the Rank between [0.0, 1.0] so that each attribute have equal weight.
Can be calculated as

R_in = \frac{R_if - 1}{N - 1}

3. Now Similarity or Dissimilarity can be calculated with any distance measuring techniques. ( 'Measuring Data Similarity or Dissimilarity #2)

## Tuesday, 9 May 2017

### Measuring Data Similarity or Dissimilarity #1

Yet another question is in data mining to measure whether two datasets are similar or not. There are so many ways to calculate these values based on Data Type. Let's see into these methods -

### 1. For Binary Attribute:

Binary attributes are those which is having only two states 0 or 1, where 0 means attribute is absent and 1 means it is present. For calculating similarity/dissimilarity between binary attributes we use contingency table -

q - if i and j both are equal to 1
r - if i is 1 and j is 0
s - if i is 0 and j is 1
t - if i and j both are equal to 0
p - total ( q+r+s+t)

#### a. Symmetric Binary Dissimilarity -

For symmetric binary attribute, each state is equally valuable. If i and j are symmetric binary attribute then dissimilarity is calculates as -

  d(i, j) = \frac{r + s}{q + r + s + t}

#### b. Asymmetric Binary Dissimilarity -

For asymmetric binary attribute, two states are not equally important. Any one state overshadow the other, such binary attribute are often called "monary" (having one state). For these kind of attribute, dissimilarity is calculates as -

d(i, j) = \frac{r + s}{q + r + s}

likewise, we can calculate the similarity (asymmetric binary similarity)

 sim(i, j) = 1 - d(i, j)

which leave us with

 JC = sim(i, j) = \frac{q}{q + r + s}

The coefficient sim(i, j) is also known as Jaccard coefficient.

## Tuesday, 25 April 2017

### Graphical Display of Basic Stats of Data

After a long time, got a chance to share somethings with you guys, so feeling awesome :-), Today we are gonna see the Graphical Display of Data Stats or sometime we call it Exploratory Data Analysis as well, This is the best way to understand your data in very less time and set your analysis path for it. So without doing more chats, let's start -

### 1. Scatter Plot

Very Basic, Very Easy and Most Used EDA(Exploratory Data Analysis) technique. It is 2-D plot between X and Y variables where X or Y can be numeric data features or columns.
With this plot we can easily see if there is any relationship, pattern or trends between between these 2 features or any data outlier existing. It is also useful to explore possibility of correlation relationships. Correlation can be positive, negative or neutral.

Now, let's look into a scatter plot -
I am using IRIS dataset and Python matplotlib library for this illustration - -

### 2. Histogram

Histogram plot is one of the oldest plotting technique to summarize the data distribution of a attribute X. X can be numerical feature and height of bar is frequency. Resulting plot is also called Bar Chart.

### 3. Quantile Plot (Bar Charts)

Quantile Plot or Bar Charts also used to display the uni-variate variables data distribution as well as plot the percentile information with outlier detection.

Keep looking for this space for further update.

Happy Learning

## Thursday, 23 March 2017

### Measures of Data Spread in Stats

What do we mean by SPREAD? - The measures which can tell us the variability of a dataset, width, average distribution falls into this category.

Let's see which measures we are taking about-

Input: 45, 67, 23, 12, 9, 43, 12, 17, 91
Sorted: 9, 12, 12, 17, 23, 43, 45, 67, 91

Range:
It is the simplest measures of Spread. It is the difference between max and min value of a dataset but this will not give you the idea about the data distribution. It may be given a wrong interpretation if our dataset is having outliers.

Range - Max - Min = 91 - 9 = 82

Interquartile Range (IQR):
IQR is the middle 50 percentile data which is difference between 75 percentile and 25 percentile. It is used in boxplot plotting.

IQR = Q3 - Q1 = 56 - 12 = 44

Variance:
Variance shows the distance of each element from its mean, If you simply sum this it will be zero and that is why we use squared distance to calculate it.

Standard Deviation (\sigma or s):
This measure is square root of Variance, the only difference between Variance and Standard deviation is the output unit as Variance.

Variance = \sigma^2 or s^2 = \frac{\Sigma_{i=1}^N(x_i-\barx)^2}{N}

Standard Deviation = \sigma or s = \root{2}{\sigma^2} = \root{2}{\frac{\Sigma_{i=1}^N(x_i-\barx)^2}{N}}

## Wednesday, 22 March 2017

### The Three M in Statis : Measures of Center

In Statistics, 3M summary is very important as it tells a lot about data distribution. These Ms are - Mean, Median and Mode

Mean - Average
Median - Middile Value
Mode - Frequent Item count

You can look into "SUMMARY STATISTICS IN DATA ANALYSIS"
for the calulations.

## Tuesday, 21 March 2017

### Summary Statistics in Data Analysis

Summary statistics  are numbers that summarize properties of the data. i.e - Mean, Spread, tendency etc. We will see each one by one.

Let's take a input dataset -

Input: 45, 67, 23, 12, 9, 43, 12, 17, 91
Sorted: 9, 12, 12, 17, 23, 43, 45, 67, 91

Frequency: The frequency of an attribute value is the percentage of time the value occurs in the data set.
In our dataset, Frequency of 12 is 2.

Mode: The mode of a an attribute is the most frequent attribute value

Mode for our dataset is 2 as 12 is the most frequent item which occurs 2 time

Things to remember:
i- There is no mode if all the values are same
ii - Same is applicable if all values occurrence is 1

Usually, Mode and Frequency are used for categorical data

Percentiles: This used for continuous data.
Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile
is a value xp of x such that p% of the observed values of x are less than xp.

How to calculate the Percentile:
1. Count the total item in dataset = N
2. Multiply the percentile p with total no of items = N*p
3. This will give you a no which can be a float or integer
4. If it is a float, round off it to nearest integer, named pth no
i. Sort the data into increasing order
ii. Now, pth no in this dataset is your percentile value
5. If it is an integer no
i. Sort the data into increasing order
ii. Now, average of pth no and (p+1)th no in this dataset is your percentile value

So when we say, 20% means -

No of items in dataset = 9
No of items which should be less than xp. - 9*20% = 1.8
Round off this to nearest integer - 2
Our dataset is already sorted in increasing order, so check the 2nd value - 12

likewise, 25%, 50% and 75% is - 9*25%, 9*50%, 9*75% = 2.25th, 4.5th, 6.75th
2th, 5th, 7th - 12, 23, 45

This is one way to calculate the percentile, If you use calculator or some other method, it might be slightly different.

Mean or Average:  Sum(all items) / Total no of element

Mean -  (9+12+12+17+23+43+45+67+91)/9 = 34.4

However, the mean is very sensitive to outliers. So to understand the data tendency, we go for median rather than means.

Median: Median is 50 percentile, or middle value

How to get Median/Middle value - a. Sort the data into increasing orderb. Get total no of elements - N     if N is even -  median =   ( N/2th element + [N/2 + 1]th element) / 2     if N is odd - median = ceil(N/2)th element

For our case, N = 9, which is odd, so ceil(9/2) = ceil(4.5) = 5th element
Median = 23

Range:  Difference between Max and Min is called range.

Input dataset range - 91-9 = 82

Variance: The variance or standard deviation is the most common measure of the spread of a set of points.

variance(x) = \sigma^2 = \frac{1}{n-1}\Sigma_{i=1}^n(x_i-\bar{x})^2

where \bar{x} is Mean of all value of x
m = total no of items in dataset
\sigma is standard deviation

## Sunday, 19 March 2017

### What is 5 no summary?

5 no summary is an statistical measure to get the idea about the data tendency.

It includes :

1.  Minimum
2.  Q1 (25 percentile)
3.  Median (middle value or 50 percentile)
4.  Q3 (75 percentile)
5.  Maximum

### How to calculate or get these values??

Input data :  45, 67, 23, 12, 9, 43, 12, 17, 91

Step1:  Sort the data

9, 12, 12, 17, 23, 43, 45, 67, 91

Step2:  You can easily get the minimum and maximum no

Min : 9
Max : 91

Step 3: Finding the median - Finding the middle value, dont confuse with Mean or Average.

How to get Median/Middle value -
a. Sort the data into increasing order
b. Get total no of elements - N
if N is even -  median =   ( N/2th element + [N/2 + 1]th element) / 2
if N is odd - median = ceil(N/2)th element

For our case, N = 9, which is odd, so ceil(9/2) = ceil(4.5) = 5th element
Median = 23

Step 4: Finding our the Q1 and Q3 (called Quantile) is very easy. Divide the element list into 2 list by Median value -

(9, 12, 12, 17), 23, (43, 45, 67, 91)

Now, Find out the Median for 1st list which is Q1 and Median for 2nd list which is Q3

As we can see, list1 and list2 both are having even no of elements so  -

Median of list1 (Q1) =  ( N/2th element + [N/2 + 1]th element) / 2
=  ( 4/2th element + [4/2 +1]th element) / 2
=  ( 2nd element  + 3rd element ) /2
=  (12 + 12 ) / 2
Q1 = 12

Median of list2 (Q3) = ( 45 + 67 ) / 2
= 112 / 2
= 56

We got the Q1 (12) and Q3 (56).

Our 5 no summary is calculated which is -

min, Q1, median, Q3, max
9,     12,  23,         56, 91