Summary Statistics in Data Analysis

by Atul Singh on March 21, 2017 in Analysis, Analytics, data, Data Mining, Mining, Statistics, Stats, Summary

Summary statistics are numbers that summarize properties of the data, e.g. the mean, the spread and the central tendency. We will look at each one in turn.

Let's take an input dataset -

Input: 45, 67, 23, 12, 9, 43, 12, 17, 91
Sorted: 9, 12, 12, 17, 23, 43, 45, 67, 91

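Frequency: The frequency of an attribute value is the number of times the value occurs in the data set. Expressed as a percentage of all values, it is called the relative frequency.
In our dataset, the frequency of 12 is 2, and its relative frequency is 2/9, about 22%.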
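
As a quick illustration, the count and the relative frequency of each value can be tallied with Python's `collections.Counter`. This is only a minimal sketch; the variable names are chosen here for illustration.

```python
from collections import Counter

data = [45, 67, 23, 12, 9, 43, 12, 17, 91]

# Absolute frequency: how many times each value occurs
counts = Counter(data)

# Relative frequency: share of the dataset taken by each value
relative = {value: count / len(data) for value, count in counts.items()}

print(counts[12])              # 2
print(round(relative[12], 3))  # 0.222
```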

Mode: The mode of an attribute is the most frequent attribute value.

The mode for our dataset is 12, as 12 is the most frequent item; it occurs 2 times.

Things to remember:
i - By a common convention, there is no mode if all the values occur equally often
ii - In particular, this applies if every value occurs only once

Usually, Mode and Frequency are used for categorical data
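
The mode can be sketched in a few lines of Python. The helper `modes` below is a hypothetical name, and its handling of ties (return every tied value, and return an empty list when several distinct values all occur equally often) is an assumed convention rather than the only possible one.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or [] when no value stands out."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Assumed convention: several distinct values, all equally frequent -> no mode
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

data = [45, 67, 23, 12, 9, 43, 12, 17, 91]
print(modes(data))       # [12]
print(modes([1, 2, 3]))  # []
```

Because it only counts occurrences, the same function works on categorical values such as strings.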

Percentiles: These are used for ordinal or continuous data.
Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile
is a value x_p of x such that p% of the observed values of x are less than x_p.

How to calculate the percentile:
1. Count the total number of items in the dataset = N
2. Multiply N by the percentile p, taken as a percentage = N*p/100
3. This gives a number k, which can be fractional or a whole number
4. If k is fractional, round it to the nearest integer
i. Sort the data into increasing order
ii. The k^th item in the sorted data is your percentile value
5. If k is a whole number
i. Sort the data into increasing order
ii. The average of the k^th and (k+1)^th items in the sorted data is your percentile value
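
The steps above can be sketched in Python as follows. The function name `percentile`, the position variable `k`, the half-up rounding and the clamping at the two ends of the data are assumptions made for this illustration, not part of the recipe itself.

```python
import math

def percentile(values, p):
    """p-th percentile (0 <= p <= 100) using the rounding recipe above."""
    data = sorted(values)          # sort into increasing order
    n = len(data)
    if n == 0:
        raise ValueError("percentile of an empty dataset is undefined")
    pos = n * p / 100              # N * p%, may be fractional
    if not pos.is_integer():
        # Fractional position: round half up to the nearest integer.
        # (Python's built-in round() would turn 4.5 into 4, so it is avoided.)
        k = math.floor(pos + 0.5)
        k = max(1, min(k, n))      # assumption: clamp to a valid 1-based position
        return data[k - 1]
    k = int(pos)
    if k < 1:                      # assumption: p = 0 maps to the smallest value
        return data[0]
    if k >= n:                     # assumption: p = 100 maps to the largest value
        return data[-1]
    # Whole-number position: average the k-th and (k+1)-th items
    return (data[k - 1] + data[k]) / 2

data = [45, 67, 23, 12, 9, 43, 12, 17, 91]
print(percentile(data, 20))  # 12
print(percentile(data, 50))  # 23
```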

So when we ask for the 20th percentile (p = 20), it means -
No of items in dataset = 9
No of items which should be less than x_p = 9*20% = 1.8
Round this off to the nearest integer - 2
Our dataset is already sorted in increasing order, so take the 2nd value - 12

Likewise, for 25%, 50% and 75% the positions are 9*25%, 9*50%, 9*75% = 2.25, 4.5, 6.75,
which round to the 2nd, 5th and 7th items - 12, 23, 45

This is one way to calculate the percentile. If you use a calculator or some other method, the result might be slightly different.
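
To see how much the method matters, here is a small comparison using `statistics.quantiles` from the Python standard library (available from Python 3.8), which interpolates between neighbouring items instead of rounding the position. It is shown only as an example of "some other method".

```python
import statistics

data = [45, 67, 23, 12, 9, 43, 12, 17, 91]

# n=4 asks for the three cut points at 25%, 50% and 75%
exclusive = statistics.quantiles(data, n=4, method="exclusive")
inclusive = statistics.quantiles(data, n=4, method="inclusive")

print(exclusive)  # [12.0, 23.0, 56.0]
print(inclusive)  # [12.0, 23.0, 45.0]
```

On this dataset the two methods agree on the 25th and 50th percentiles but give 56 and 45 for the 75th.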

Mean or Average: Sum(all items) / Total no of elements

Mean - (9+12+12+17+23+43+45+67+91)/9 = 319/9 = 35.4

However, the mean is very sensitive to outliers. So to understand the central tendency of the data, we often go for the median rather than the mean.
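
A short Python sketch makes the sensitivity visible. The outlier below (91 mistyped as 9100) is a made-up value used purely for illustration, and `mean` is a small helper defined here.

```python
import statistics

data = [45, 67, 23, 12, 9, 43, 12, 17, 91]

def mean(values):
    """Arithmetic mean: sum of all items divided by the number of items."""
    return sum(values) / len(values)

# Hypothetical outlier: the largest value 91 mistyped as 9100
with_outlier = [9100 if x == 91 else x for x in data]

print(round(mean(data), 1))             # 35.4
print(round(mean(with_outlier), 1))     # 1036.4
print(statistics.median(data))          # 23
print(statistics.median(with_outlier))  # 23
```

One bad value moves the mean from about 35 to over 1000, while the median stays at 23.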

Median: The median is the 50th percentile, or the middle value of the sorted data.

How to get the median (middle value):
a. Sort the data into increasing order
b. Get the total no of elements - N
if N is even - median = (N/2th element + [N/2 + 1]th element) / 2
if N is odd - median = ceil(N/2)th element

For our case, N = 9, which is odd, so ceil(9/2) = ceil(4.5) = 5th element

Median = 23
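
The median rule can be written directly in Python. This is a minimal sketch; the function name `median` is chosen here, and in practice `statistics.median` from the standard library does the same job.

```python
import math

def median(values):
    """Middle value of the sorted data, following the even/odd rule above."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    if n % 2 == 0:
        # Even N: average of the (N/2)th and (N/2 + 1)th elements (1-based)
        return (data[n // 2 - 1] + data[n // 2]) / 2
    # Odd N: the ceil(N/2)th element (1-based)
    return data[math.ceil(n / 2) - 1]

data = [45, 67, 23, 12, 9, 43, 12, 17, 91]
print(median(data))          # 23
print(median([1, 2, 3, 4]))  # 2.5
```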

Range: Difference between Max and Min is called range.

Input dataset range - 91-9 = 82
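
In Python the range is a one-liner with the built-in `max` and `min`; the name `value_range` is used here only to avoid shadowing the built-in `range`.

```python
data = [45, 67, 23, 12, 9, 43, 12, 17, 91]

# Range = largest value minus smallest value
value_range = max(data) - min(data)
print(value_range)  # 82
```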

Variance: The variance, together with its square root, the standard deviation, is the most common measure of the spread of a set of points.

`variance(x) = \sigma^2 = \frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2`
where `\bar{x}` is the mean of all values of x,
n is the total no of items in the dataset, and
`\sigma` is the standard deviation
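
Applied to our dataset, the formula can be checked with a few lines of Python. This is a sketch using the n-1 (sample) form given above; the variable names are illustrative, and the result is cross-checked against `statistics.variance`, which uses the same n-1 denominator.

```python
import math
import statistics

data = [45, 67, 23, 12, 9, 43, 12, 17, 91]

n = len(data)
x_bar = sum(data) / n  # mean of all values

# Sum of squared deviations from the mean, divided by n - 1
variance = sum((x - x_bar) ** 2 for x in data) / (n - 1)
std_dev = math.sqrt(variance)

print(round(variance, 2))  # 815.53
print(round(std_dev, 2))   # 28.56

# Cross-check against the standard library
print(math.isclose(variance, statistics.variance(data)))  # True
```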


