Showing posts with label Graphical. Show all posts

Wednesday, 31 July 2019

Frequency Distribution #2 - #UnlockStats

Starting from the point where we left off - Frequency Distribution #1 - #UnlockStats
Below is the table of 100 students grouped by height category -

Height (in)    No of Students
60-62          5
63-65          18
66-68          38
69-71          31
72-74          8


A histogram consists of a set of rectangles whose bases lie on a horizontal axis, centered at the class marks, with widths equal to the class interval size and heights proportional to the class frequencies.

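The histogram shows how the data is distributed. In our example, each category has a width of 3 inches and the distribution is slightly left-skewed: the peak is at 66-68, and a little more of the data sits to the right of the peak (69-74, 39 students) than to the left (60-65, 23 students).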

Frequency Polygon:

A frequency polygon is a line graph of the class frequencies plotted against the class marks, where class mark = (UCL + LCL) / 2, with UCL and LCL being the upper and lower class limits.
It can be obtained by connecting the midpoints of the tops of the rectangles in the histogram.
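
As a quick illustration, here is a minimal sketch that computes the class marks for the height table above and pairs them with the frequencies. The names `classes`, `class_mark` and `polygon_points` are my own, and the text bars are only a rough stand-in for a real plot.

```python
# Class limits (LCL, UCL) and frequencies from the height table in this post.
classes = [(60, 62), (63, 65), (66, 68), (69, 71), (72, 74)]
frequencies = [5, 18, 38, 31, 8]


def class_mark(lcl, ucl):
    """Midpoint of a class: (UCL + LCL) / 2."""
    return (ucl + lcl) / 2


# The frequency polygon connects these (class mark, frequency) points.
polygon_points = [
    (class_mark(lcl, ucl), freq)
    for (lcl, ucl), freq in zip(classes, frequencies)
]

# Rough text rendering: one '#' per student in the class.
for mark, freq in polygon_points:
    print(f"{mark:5.1f}  {freq:3d}  {'#' * freq}")
```

Joining the points in `polygon_points` with straight lines gives the frequency polygon.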

Box Plots:

A box plot draws a box that contains the middle 50% of the data values (from the first quartile to the third quartile). It also draws two whiskers that extend from the box to the minimum and maximum values.
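
The five numbers a box plot is built from can be sketched as below. The sample is hypothetical (the post only has grouped data), and I am assuming the default "exclusive" method of Python's `statistics.quantiles`; other quartile conventions give slightly different values.

```python
import statistics

# Hypothetical raw heights in inches; NOT taken from the post's table.
sample_heights = [61, 63, 64, 65, 66, 66, 67, 67, 68, 68, 69, 70, 70, 71, 73]


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


minimum, q1, median, q3, maximum = five_number_summary(sample_heights)
print("box from", q1, "to", q3, "with the median line at", median)
print("whiskers reach", minimum, "and", maximum)
```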

Relative Frequency Distribution:

The relative frequency of a class is the frequency of the class divided by the total frequency of all the classes (the total number of data points), usually expressed as a percentage.

Height (in)    Relative Frequency Distribution (%)
60-62          5
63-65          18
66-68          38
69-71          31
72-74          8
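
The calculation is a one-liner per class. Below is a small sketch using the frequencies from the height table; the function name `relative_frequencies` is my own choice.

```python
height_classes = ["60-62", "63-65", "66-68", "69-71", "72-74"]
frequencies = [5, 18, 38, 31, 8]


def relative_frequencies(freqs):
    """Return each class frequency as a percentage of the total frequency."""
    total = sum(freqs)
    return [100 * f / total for f in freqs]


rel_freq = relative_frequencies(frequencies)
for label, pct in zip(height_classes, rel_freq):
    print(f"{label}  {pct:.1f}%")
```

Because the total here is exactly 100 students, each percentage happens to equal the raw count; with any other total the two would differ.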

Cumulative Frequency Distribution: 

The total frequency of all values less than the upper class boundary of a given class interval is called the cumulative frequency up to and including that class interval.

Height (in)      No of Students    Cum. Freq. Distribution
60-62 (<=62)     5                 5
63-65 (<=65)     18                5 + 18 = 23
66-68 (<=68)     38                23 + 38 = 61
69-71 (<=71)     31                61 + 31 = 92
72-74 (<=74)     8                 92 + 8 = 100

A line plot of the cumulative frequency against the upper class boundary is called a cumulative frequency polygon, or ogive.
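
The running totals can be produced with `itertools.accumulate`. This is a sketch: I am assuming the upper class boundary is the upper class limit plus 0.5 (the usual convention when heights are recorded to the nearest inch), which the post does not state.

```python
from itertools import accumulate

upper_limits = [62, 65, 68, 71, 74]
frequencies = [5, 18, 38, 31, 8]

# Running total of the class frequencies.
cumulative = list(accumulate(frequencies))

# Ogive points: (upper class boundary, cumulative frequency).
# ASSUMPTION: boundary = upper limit + 0.5, not given in the post.
ogive_points = [(limit + 0.5, cum) for limit, cum in zip(upper_limits, cumulative)]

for boundary, cum in ogive_points:
    print(f"<= {boundary}: {cum}")
```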

Cumulative Relative Frequency Distribution:

Height (in)    No of Students    Cum. Rel. Freq. Distribution (%)
<=62           5                 5
<=65           18                23
<=68           38                61
<=71           31                92
<=74           8                 100

For example, 23% of the students have a height of 65 inches or less.
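
A statement like this can be looked up from the table in code. The helper `percent_at_or_below` is a hypothetical name, and the sketch only answers for heights that are upper class limits, since grouped data cannot say more without interpolation.

```python
from itertools import accumulate

upper_limits = [62, 65, 68, 71, 74]
frequencies = [5, 18, 38, 31, 8]


def percent_at_or_below(limit):
    """Cumulative relative frequency (%) up to a given upper class limit."""
    total = sum(frequencies)
    for upper, cum in zip(upper_limits, accumulate(frequencies)):
        if upper == limit:
            return 100 * cum / total
    raise ValueError(f"{limit} is not an upper class limit in the table")


print(percent_at_or_below(65))
print(percent_at_or_below(71))
```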

Types of Frequency Curves:

a. Symmetrical or bell-shaped curves are characterized by the fact that observations equidistant from the central maximum have the same frequency.
b. Curves that have tails to the left are said to be skewed to the left.
c. Curves that have tails to the right are said to be skewed to the right.
d. Curves that have approximately equal frequencies across their values are said to be uniformly distributed.
e. In a J-shaped or reverse J-shaped frequency curve, the maximum occurs at one end or the other.
f. A U-shaped curve has maxima at both ends and a minimum in between.
g. A bimodal frequency curve has two maxima.
h. A multimodal frequency curve has more than two maxima.
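
To check the direction of skew with a number rather than by eye, one option is the moment coefficient of skewness computed from the grouped data. This measure is my own addition (the post does not name one), and treating every student in a class as sitting at the class mark is an approximation.

```python
# Class marks and frequencies from the height table in this post.
class_marks = [61, 64, 67, 70, 73]
frequencies = [5, 18, 38, 31, 8]


def grouped_skewness(marks, freqs):
    """Moment coefficient of skewness, m3 / m2**1.5, for grouped data."""
    n = sum(freqs)
    mean = sum(m * f for m, f in zip(marks, freqs)) / n
    m2 = sum(f * (m - mean) ** 2 for m, f in zip(marks, freqs)) / n
    m3 = sum(f * (m - mean) ** 3 for m, f in zip(marks, freqs)) / n
    return m3 / m2 ** 1.5


skewness = grouped_skewness(class_marks, frequencies)
# Negative: tail to the left. Positive: tail to the right. Near 0: symmetric.
print(round(skewness, 2))
```

For the height data the value is small and negative (roughly -0.2), so the curve is close to a bell shape with a mild tail to the left.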


Tuesday, 9 May 2017

Measuring Data Similarity or Dissimilarity #1

Another common question in data mining is how to measure whether two data objects are similar or not. There are many ways to calculate this, depending on the data type. Let's look into these methods -

1. For Binary Attribute:

Binary attributes are those that have only two states, 0 or 1, where 0 means the attribute is absent and 1 means it is present. To calculate the similarity/dissimilarity between two objects i and j described by binary attributes, we use a contingency table -

Contingency Table

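q - number of attributes that equal 1 for both i and j
r - number of attributes that equal 1 for i and 0 for j
s - number of attributes that equal 0 for i and 1 for j
t - number of attributes that equal 0 for both i and j
p - total number of attributes (q + r + s + t)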

a. Symmetric Binary Dissimilarity - 

For a symmetric binary attribute, each state is equally valuable. If objects i and j are described by symmetric binary attributes, the dissimilarity is calculated as -

`  d(i, j) = \frac{r + s}{q + r + s + t}  `

b. Asymmetric Binary Dissimilarity - 

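For an asymmetric binary attribute, the two states are not equally important. One state (usually coded 1) overshadows the other, so such binary attributes are often called "monary" (having one state). The number of negative matches, t, is treated as unimportant and dropped from the denominator. For this kind of attribute, the dissimilarity is calculated as -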

`d(i, j) = \frac{r + s}{q + r + s}`

Likewise, we can calculate the similarity (asymmetric binary similarity) -

` sim(i, j) = 1 - d(i, j) `

which leaves us with -

` JC = sim(i, j) = \frac{q}{q + r + s} `

The coefficient sim(i, j) is also known as the Jaccard coefficient.
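
Here is a minimal sketch of the three formulas in code. The two objects `i` and `j` are hypothetical 0/1 vectors and the function names are my own. The sketch assumes at least one attribute is 1 in either object, otherwise the asymmetric formulas divide by zero.

```python
def contingency_counts(i, j):
    """Count q, r, s, t for two equal-length lists of 0/1 values."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    return q, r, s, t


def symmetric_dissimilarity(i, j):
    q, r, s, t = contingency_counts(i, j)
    return (r + s) / (q + r + s + t)


def asymmetric_dissimilarity(i, j):
    q, r, s, _ = contingency_counts(i, j)  # negative matches t are ignored
    return (r + s) / (q + r + s)


def jaccard(i, j):
    q, r, s, _ = contingency_counts(i, j)
    return q / (q + r + s)


# Hypothetical objects described by six binary attributes.
i = [1, 0, 1, 0, 0, 0]
j = [1, 0, 1, 0, 1, 0]

print(contingency_counts(i, j))        # q=2, r=0, s=1, t=3
print(symmetric_dissimilarity(i, j))   # 1/6
print(asymmetric_dissimilarity(i, j))  # 1/3
print(jaccard(i, j))                   # 2/3
```

Note how the three negative matches (t = 3) pull the symmetric dissimilarity down to 1/6, while the asymmetric version ignores them and reports 1/3.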


Tuesday, 25 April 2017

Graphical Display of Basic Stats of Data

After a long time, I got a chance to share something with you guys, so I'm feeling awesome :-). Today we are going to look at the graphical display of data stats, which is sometimes also called Exploratory Data Analysis. It is the best way to understand your data in very little time and to set the direction of your analysis. So without further chat, let's start -

1. Scatter Plot

Very basic, very easy, and one of the most used EDA (Exploratory Data Analysis) techniques. It is a 2-D plot of two variables, X and Y, where X and Y are numeric data features or columns.
With this plot we can easily see whether there is any relationship, pattern, or trend between these two features, and whether any outliers exist. It is also useful for exploring possible correlation. Correlation can be positive, negative, or neutral (no correlation).

Now, let's look into a scatter plot -
I am using the IRIS dataset and the Python matplotlib library for this illustration -

[Figure: scatter plot of the Iris dataset]
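
To put a number on what a scatter plot suggests, one option is the Pearson correlation coefficient. This is my own addition, not something the plot itself computes, and the three small datasets below are made up for illustration rather than taken from IRIS.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one rising, one falling, one with no linear trend.
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]
no_trend = [2, 1, 3, 1, 2]

print(pearson_correlation(xs, rising))    # close to +1: positive correlation
print(pearson_correlation(xs, falling))   # close to -1: negative correlation
print(pearson_correlation(xs, no_trend))  # close to 0: no linear correlation
```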

2. Histogram

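The histogram is one of the oldest plotting techniques for summarizing the data distribution of an attribute X. When X is a numerical feature, its range is split into bins and the height of each bar is the frequency of values in that bin. When X is categorical, the resulting plot is more commonly called a bar chart.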

[Figure: histogram of the Iris dataset]
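
Behind the picture, a histogram is just a count of values per bin. Here is a minimal sketch with equal-width bins. The values are hypothetical (not the actual IRIS measurements) and `histogram_counts` is a name I made up.

```python
# Hypothetical raw measurements; NOT the actual IRIS values.
values = [4.3, 4.9, 5.0, 5.1, 5.4, 5.8, 6.1, 6.3, 6.4, 7.0, 7.7]


def histogram_counts(data, start, width, bins):
    """Count how many values fall in each bin [start + k*width, start + (k+1)*width)."""
    counts = [0] * bins
    for v in data:
        index = int((v - start) // width)
        if 0 <= index < bins:  # values outside the covered range are skipped
            counts[index] += 1
    return counts


counts = histogram_counts(values, start=4.0, width=1.0, bins=4)
for k, count in enumerate(counts):
    low = 4.0 + k
    print(f"[{low:.1f}, {low + 1.0:.1f})  {'#' * count}")
```

The bar heights in the plot are exactly these counts. Changing `width` changes how smooth or spiky the histogram looks.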

3. Quantile Plot (Box Plot)

Quantile plots or box plots are also used to display the data distribution of a univariate variable. A box plot also shows the percentile information (quartiles) and helps with outlier detection.

[Figure: quantile box plot]
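
One common way box plots flag outliers is the 1.5 x IQR rule. I am assuming that rule here since the post does not say which one was used. The sample is hypothetical, and the quartiles use the default method of `statistics.quantiles`.

```python
import statistics

# Hypothetical measurements with one suspiciously large value.
sample = [2.0, 2.2, 2.3, 2.5, 2.8, 3.0, 3.0, 3.1, 3.2, 3.4, 5.0]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


print(iqr_outliers(sample))
```

Points returned by `iqr_outliers` are the ones a box plot would draw as separate dots beyond the whiskers.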

Keep watching this space for further updates.

Happy Learning
