Showing posts with label Stats. Show all posts
Showing posts with label Stats. Show all posts

Thursday, 23 March 2017

Measures of Data Spread in Stats


What do we mean by SPREAD? - The measures which can tell us the variability of a dataset, width, average distribution falls into this category.

Let's see which measures we are taking about-

Input: 45, 67, 23, 12, 9, 43, 12, 17, 91
Sorted: 9, 12, 12, 17, 23, 43, 45, 67, 91



Range:
It is the simplest measures of Spread. It is the difference between max and min value of a dataset but this will not give you the idea about the data distribution. It may be given a wrong interpretation if our dataset is having outliers.

Range - Max - Min = 91 - 9 = 82

Interquartile Range (IQR):
IQR is the middle 50 percentile data which is difference between 75 percentile and 25 percentile. It is used in boxplot plotting. 

IQR = Q3 - Q1 = 56 - 12 = 44

Variance:
Variance shows the distance of each element from its mean, If you simply sum this it will be zero and that is why we use squared distance to calculate it.

Standard Deviation (`\sigma` or s):
This measure is square root of Variance, the only difference between Variance and Standard deviation is the output unit as Variance.


`Variance = \sigma^2 or s^2 = \frac{\Sigma_{i=1}^N(x_i-\barx)^2}{N}`

`Standard Deviation = \sigma or s = \root{2}{\sigma^2} = \root{2}{\frac{\Sigma_{i=1}^N(x_i-\barx)^2}{N}}`







Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Wednesday, 22 March 2017

The Three M in Statis : Measures of Center


In Statistics, 3M summary is very important as it tells a lot about data distribution. These Ms are - Mean, Median and Mode

Mean - Average
Median - Middile Value
Mode - Frequent Item count

You can look into "SUMMARY STATISTICS IN DATA ANALYSIS"
for the calulations.




Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Tuesday, 21 March 2017

Summary Statistics in Data Analysis


Summary statistics  are numbers that summarize properties of the data. i.e - Mean, Spread, tendency etc. We will see each one by one.

Let's take a input dataset -

Input: 45, 67, 23, 12, 9, 43, 12, 17, 91
Sorted: 9, 12, 12, 17, 23, 43, 45, 67, 91

Frequency: The frequency of an attribute value is the percentage of time the value occurs in the data set.
In our dataset, Frequency of 12 is 2.


Mode: The mode of a an attribute is the most frequent attribute value

Mode for our dataset is 2 as 12 is the most frequent item which occurs 2 time

Things to remember:
i- There is no mode if all the values are same
ii - Same is applicable if all values occurrence is 1 

Usually, Mode and Frequency are used for categorical data

summary statistics in data analysis

Percentiles: This used for continuous data.
Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile
is a value xp of x such that p% of the observed values of x are less than xp.

How to calculate the Percentile:
1. Count the total item in dataset = N
2. Multiply the percentile p with total no of items = N*p
3. This will give you a no which can be a float or integer
4. If it is a float, round off it to nearest integer, named pth no
          i. Sort the data into increasing order
          ii. Now, pth no in this dataset is your percentile value
5. If it is an integer no
          i. Sort the data into increasing order
          ii. Now, average of pth no and (p+1)th no in this dataset is your percentile value

So when we say, 20% means -

No of items in dataset = 9
No of items which should be less than xp. - 9*20% = 1.8
Round off this to nearest integer - 2
Our dataset is already sorted in increasing order, so check the 2nd value - 12

likewise, 25%, 50% and 75% is - 9*25%, 9*50%, 9*75% = 2.25th, 4.5th, 6.75th
2th, 5th, 7th - 12, 23, 45

This is one way to calculate the percentile, If you use calculator or some other method, it might be slightly different.

Mean or Average:  Sum(all items) / Total no of element

Mean -  (9+12+12+17+23+43+45+67+91)/9 = 34.4

However, the mean is very sensitive to outliers. So to understand the data tendency, we go for median rather than means.

Median: Median is 50 percentile, or middle value

How to get Median/Middle value - a. Sort the data into increasing orderb. Get total no of elements - N     if N is even -  median =   ( N/2th element + [N/2 + 1]th element) / 2     if N is odd - median = ceil(N/2)th element

For our case, N = 9, which is odd, so ceil(9/2) = ceil(4.5) = 5th element 
Median = 23

Range:  Difference between Max and Min is called range. 

Input dataset range - 91-9 = 82


Variance: The variance or standard deviation is the most common measure of the spread of a set of points.

`variance(x) = \sigma^2 = \frac{1}{n-1}\Sigma_{i=1}^n(x_i-\bar{x})^2`

where `\bar{x}` is Mean of all value of x
m = total no of items in dataset
`\sigma` is standard deviation





Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/


Sunday, 19 March 2017

5 Number Summary - Statistics Basics


What is 5 no summary?

5 no summary is an statistical measure to get the idea about the data tendency.

It includes :

1.  Minimum
2.  Q1 (25 percentile)
3.  Median (middle value or 50 percentile)
4.  Q3 (75 percentile)
5.  Maximum


5 number summary

How to calculate or get these values??

Input data :  45, 67, 23, 12, 9, 43, 12, 17, 91

Step1:  Sort the data

9, 12, 12, 17, 23, 43, 45, 67, 91

Step2:  You can easily get the minimum and maximum no

Min : 9
Max : 91

Step 3: Finding the median - Finding the middle value, dont confuse with Mean or Average. 

How to get Median/Middle value - 
a. Sort the data into increasing order
b. Get total no of elements - N
     if N is even -  median =   ( N/2th element + [N/2 + 1]th element) / 2
     if N is odd - median = ceil(N/2)th element

For our case, N = 9, which is odd, so ceil(9/2) = ceil(4.5) = 5th element 
Median = 23

Step 4: Finding our the Q1 and Q3 (called Quantile) is very easy. Divide the element list into 2 list by Median value - 

 (9, 12, 12, 17), 23, (43, 45, 67, 91) 

Now, Find out the Median for 1st list which is Q1 and Median for 2nd list which is Q3

As we can see, list1 and list2 both are having even no of elements so  - 

Median of list1 (Q1) =  ( N/2th element + [N/2 + 1]th element) / 2
                                  =  ( 4/2th element + [4/2 +1]th element) / 2
                                  =  ( 2nd element  + 3rd element ) /2
                                  =  (12 + 12 ) / 2 
                            Q1 = 12

Median of list2 (Q3) = ( 45 + 67 ) / 2
                                  = 112 / 2
                                  = 56 

We got the Q1 (12) and Q3 (56). 

Our 5 no summary is calculated which is -  

min, Q1, median, Q3, max 
9,     12,  23,         56, 91




Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/