Showing posts with label Concept. Show all posts

Wednesday, 31 July 2019

Frequency Distribution #2 - #UnlockStats

Starting from the point where we left off - Frequency Distribution #1 - #UnlockStats
Below is the table for 100 students and their heights categories -
Height (in)    No of Students
60-62          5
63-65          18
66-68          38
69-71          31
72-74          8
Total          100


Histogram:

A histogram consists of a set of rectangles with bases on the horizontal axis, centers at the class marks, widths equal to the class interval sizes, and heights proportional to the class frequencies.

The histogram shows how the data is distributed. In our example, each category has a width of 3 and the distribution is slightly left-skewed: the longer tail is on the left, while more of the data lies on the right side of the histogram.
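As a rough sketch, the histogram for the height table can be drawn in plain text with Python. The `classes` and `frequencies` lists simply restate the table; the function name and the one-`#`-per-student scale are illustrative choices, not part of the original post.

```python
# Height classes (lower limit, upper limit) and their frequencies from the table.
classes = [(60, 62), (63, 65), (66, 68), (69, 71), (72, 74)]
frequencies = [5, 18, 38, 31, 8]

def text_histogram(classes, frequencies):
    """Return one text bar per class; the bar length equals the class frequency."""
    lines = []
    for (lower, upper), freq in zip(classes, frequencies):
        class_mark = (lower + upper) / 2  # centre of the rectangle
        lines.append(f"{lower}-{upper} (mark {class_mark:g}) | {'#' * freq} {freq}")
    return lines

for line in text_histogram(classes, frequencies):
    print(line)
```

Each bar plays the role of one rectangle: the label gives its position on the horizontal axis and the run of `#` characters gives its height.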









Frequency Polygon:

A Frequency Polygon is a line graph of the class frequencies plotted against the class marks, where class mark = ( UCL + LCL ) / 2.
It can be obtained by connecting the midpoints of the tops of the rectangles in the histogram.
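A minimal sketch of the points that make up the frequency polygon for the height table, assuming the class limits and frequencies given earlier (the function name `polygon_points` is just an illustration):

```python
# Class limits (LCL, UCL) and frequencies from the height table.
classes = [(60, 62), (63, 65), (66, 68), (69, 71), (72, 74)]
frequencies = [5, 18, 38, 31, 8]

def polygon_points(classes, frequencies):
    """Return (class mark, frequency) pairs, the vertices of the frequency polygon."""
    return [((lcl + ucl) / 2, freq) for (lcl, ucl), freq in zip(classes, frequencies)]

points = polygon_points(classes, frequencies)
print(points)
```

Joining these points with straight lines, in order, gives the polygon.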



























Box Plots:


A box plot shows a box that contains the middle 50% of the data values. It also shows two whiskers that extend from the box to the maximum and minimum values.
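The five numbers a box plot is drawn from can be sketched with Python's standard `statistics` module. The sample values below are only an illustration, and the quartiles use the "exclusive" convention; other conventions can give slightly different box edges.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
    return ordered[0], q1, median, q3, ordered[-1]

sample = [45, 67, 23, 12, 9, 43, 12, 17, 91]  # illustrative sample
summary = five_number_summary(sample)
print(summary)
```

The box spans Q1 to Q3 with a line at the median, and the whiskers run out to the minimum and maximum.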

Relative Frequency Distribution:

The Relative Frequency of a class is the frequency of the class divided by the total frequency of all the classes (the total number of data points), usually expressed as a percentage.
Height (in)    Relative Frequency Distribution (%)
60-62          5
63-65          18
66-68          38
69-71          31
72-74          8
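The calculation can be sketched as follows, assuming the frequencies from the height table (the names are illustrative). Because the total here is exactly 100, the percentages happen to equal the raw counts.

```python
# Class frequencies from the height table.
frequencies = {"60-62": 5, "63-65": 18, "66-68": 38, "69-71": 31, "72-74": 8}

def relative_frequencies(freqs):
    """Return each class frequency as a percentage of the total frequency."""
    total = sum(freqs.values())
    return {label: 100 * count / total for label, count in freqs.items()}

relative = relative_frequencies(frequencies)
print(relative)
```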

Cumulative Frequency Distribution: 

The total frequency of all values less than the upper-class boundary of a given class interval is called the cumulative frequency up to and including that class interval. 
Height (in)      No of Students    Cum. Freq. Distribution
60-62 (<=62)     5                 5
63-65 (<=65)     18                5 + 18 = 23
66-68 (<=68)     38                23 + 38 = 61
69-71 (<=71)     31                61 + 31 = 92
72-74 (<=74)     8                 92 + 8 = 100

A line plot of Cumulative Frequency against Upper Class Boundary is called a Cumulative Frequency polygon, or ogive.
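A short sketch of the running totals and the ogive points, assuming the height table's frequencies. The upper class boundaries are taken as the upper class limit plus 0.5, which assumes whole-inch classes with a 1-inch gap between them.

```python
from itertools import accumulate

upper_class_limits = [62, 65, 68, 71, 74]
frequencies = [5, 18, 38, 31, 8]

# Running total of the class frequencies.
cumulative = list(accumulate(frequencies))

# Ogive points: (upper class boundary, cumulative frequency).
ogive_points = [(ucl + 0.5, cum) for ucl, cum in zip(upper_class_limits, cumulative)]

print(cumulative)
print(ogive_points)
```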

Cumulative Relative Frequency Distribution:

Height (in)    No of Students    Cum. Rel. Freq. Distribution (%)
<=62           5                 5
<=65           18                23
<=68           38                61
<=71           31                92
<=74           8                 100

For example, 23% of the students have a height of 65 inches or less.

Types of Frequency Curves:

a. Symmetrical or bell-shaped curves are characterized by the fact that observations equidistant from the central maximum have the same frequency.
b. Curves that have tails to the left are said to be skewed to the left.
c. Curves that have tails to the right are said to be skewed to the right.
d. Curves that have approximately equal frequencies across their values are said to be uniformly distributed.
e. In a J-shaped or reverse J-shaped frequency curve, the maximum occurs at one end or the other.
f. A U-shaped curve has maxima at both ends and a minimum in between.
g. A bimodal frequency curve has two maxima.
h. A multimodal frequency curve has more than two maxima.
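One common way to tell a symmetrical curve from a left- or right-skewed one numerically is the moment coefficient of skewness, which is not covered in the post itself. A hedged sketch for grouped data, using class marks as stand-ins for the values in each class: a result near zero suggests symmetry, a negative result a tail to the left, and a positive result a tail to the right.

```python
def grouped_skewness(class_marks, frequencies):
    """Moment coefficient of skewness, treating every value as its class mark."""
    n = sum(frequencies)
    mean = sum(m * f for m, f in zip(class_marks, frequencies)) / n
    # Second and third central moments of the grouped data.
    m2 = sum(f * (m - mean) ** 2 for m, f in zip(class_marks, frequencies)) / n
    m3 = sum(f * (m - mean) ** 3 for m, f in zip(class_marks, frequencies)) / n
    return m3 / m2 ** 1.5

# Height table from the post: class marks and frequencies.
height_skew = grouped_skewness([61, 64, 67, 70, 73], [5, 18, 38, 31, 8])
print(height_skew)
```

For the height table the value comes out negative, which is consistent with a slight skew to the left.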



Like the below page to get the update  
Facebook Page      Facebook Group      Twitter Feed      Telegram Group     


Tuesday, 30 July 2019

Frequency Distribution #1 - #UnlockStats


Raw Data:

Raw data are collected data that have not been organized in any way.

Array:

An array is a list of raw numerical data in ascending or descending order of magnitude.

Frequency Distribution:

When summarizing a large amount of data, we group it into classes or categories; the number of individuals belonging to each class is called the Class Frequency.
A tabular arrangement of data by classes or categories together with the class frequencies is called a Frequency Distribution.

Example:

Below is the table for 100 students and their heights categories -

Height (in) No of Students
60-62 5
63-65 18
66-68 38
69-71 31
72-74 8
total 100
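As a minimal sketch, a table like this can be built from raw data by counting how many values fall inside each class. The raw heights below are hypothetical (the post does not list the 100 original measurements), and the function name is only illustrative.

```python
from collections import Counter

def frequency_distribution(values, classes):
    """Count how many values fall in each (lower limit, upper limit) class."""
    counts = Counter()
    for value in values:
        for lower, upper in classes:
            if lower <= value <= upper:
                counts[(lower, upper)] += 1
                break
    # A Counter returns 0 for classes that received no values.
    return [counts[c] for c in classes]

classes = [(60, 62), (63, 65), (66, 68), (69, 71), (72, 74)]
raw_heights = [61, 64, 67, 67, 70, 66, 73, 68, 65, 69]  # hypothetical sample
table = frequency_distribution(raw_heights, classes)
print(table)
```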

Class Intervals and Class Limits:

A symbol defining a class, such as 63-65, is called a Class Interval. It is also called a Closed Class Interval because the class has both end numbers.
The end numbers of a class are called Class Limits, such as 66 and 68, where 66 is the Lower Class Limit and 68 is the Upper Class Limit. A class that has either no upper class limit or no lower class limit is called an Open Class Interval, such as the category 65+ years.

Class Boundaries:

A Class Boundary is obtained by adding the upper class limit of a category to the lower class limit of the next category and dividing by 2.

Upper Class Boundary (n) = { UCL(n) + LCL(n+1) } / 2

For the 63-65 category, 65.5 { (65+66)/2 } is the upper class boundary and 62.5 { (62+63)/2 } is the lower class boundary.


Size/Width of a Class Interval:

The difference between the lower and upper class boundaries is called the size or width of a Class Interval, such as -
For the 63-65 category, the Width is 65.5 - 62.5 = 3

The Class Mark:

The Class Mark is the mid-point of a class interval and can be calculated as below -

Class Mark (n) = { UCL(n) + LCL(n) } / 2

For the 63-65 category, the Class Mark is (63 + 65) / 2 = 64
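The definitions of class boundaries, width and class mark can be sketched together in Python. This is only an illustration: it assumes equally spaced classes, so the 1-inch gap between neighbouring limits is also used for the outermost boundaries, and the function name is made up for the example.

```python
classes = [(60, 62), (63, 65), (66, 68), (69, 71), (72, 74)]

def describe_classes(classes):
    """Return boundaries, width and class mark for each (LCL, UCL) class."""
    # Gap between one class's upper limit and the next class's lower limit.
    gap = classes[1][0] - classes[0][1]
    rows = []
    for lcl, ucl in classes:
        lower_boundary = lcl - gap / 2
        upper_boundary = ucl + gap / 2
        rows.append({
            "class": f"{lcl}-{ucl}",
            "boundaries": (lower_boundary, upper_boundary),
            "width": upper_boundary - lower_boundary,
            "mark": (lcl + ucl) / 2,
        })
    return rows

rows = describe_classes(classes)
for row in rows:
    print(row)
```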


Histograms and Frequency Polygons are two graphical representations of a frequency distribution. We will discuss them in the next post.

Till then, Happy Learning.........







Sunday, 28 July 2019

Let's #UnlockStats - Extending #UnlockAI


I've been asked many times how to start with Machine Learning, or which course to join to learn it. I always give the same reply - Machine Learning is a journey where you travel with a good friend (i.e. Python, R, Java.....), face some obstacles (i.e. Mathematical and Statistical Concepts...) and make new friends along the way (i.e. the many other things you need to understand in this ML world - the Problem & its Domain, Algorithms, Comparison, tiring testing). So in simple words, ML is not a destination you have to reach but a journey you have to live. A little dramatic... isn't it :-)

Anyway, I am starting with the very basic stats (statistics) that you should be aware of alongside the other ML things. I hope you will like it... Please comment if you have any query or request.


Use #UnlockStats to fetch all Stats posts and #UnlockAI for all AI/ML posts.







Monday, 3 June 2019

Containerization - What & Why ??


Containerization is a word that describes holding something. It is literally taken from the world of freight transportation, where it allows a lot of different products/items to be put into one box and moved around the world without worrying about damage. Quite a definition :) Isn't it ?
In simple words, or should I say in IT terms, Containerization is a way to give the user a sandbox environment with the required software at specific versions, which you can flush whenever you are done with it and re-instantiate when needed.

Now the question arises: what is the difference between a Virtual Machine and a Container? That is what we are going to discuss next.

In the last decade, Virtual Machines (VMs) allowed IT giants/users to take one physical machine and host different applications and their variants in VMs that share the resources of the host machine. But this comes with a small price: the bottleneck of shared resources. How many VMs your physical machine can host depends entirely on its resources, such as storage, processing power and memory, because each VM needs these to hold its guest OS and the application with its dependencies. The guest OS itself eats a lot of host storage & memory and needs to be patched in a timely manner to support your application.

The VM stack looks somewhat like the figure below -

[Figure: VM stack]

Containerization removes the guest OS dependency and uses the host machine and its OS, which substantially reduces the size of the container as well as its resource consumption and brings many more advantages over a virtual machine. The containerization stack is as below -

[Figure: Containerization stack]
In the next post, we will discuss the pros and cons of VMs and Containerization.






Sunday, 25 March 2018

Let's #UnlockAI


Hi Guys, I am writing this post after so many days; I hope you didn't take this absence otherwise :-)
Today, I am gonna start a new #hashtag #UnlockAI where we learn the basics of Machine Learning and Deep Learning, their concepts, algorithms and limitations, under one umbrella. I will try to cover every topic in detail and with a proper example so that it is easy to understand along with the underlying mathematics.




Please post your queries or topics you want to discuss under this #hashtag #UnlockAI







Sunday, 14 January 2018

Mongo DB - Installation and Configuration


MongoDB is an open-source document database and the leading NoSQL database, written in C++.
  
MongoDB features:
    Document-Oriented Storage
    Full Index Support
    Replication & High Availability
    Auto-Sharding
    Querying
    Fast In-Place Updates
    Map/Reduce
    GridFS


Reduce cost, accelerate time to market, and mitigate risk with proactive support and enterprise-grade capabilities.


Today, we will see how to install and run MongoDB.

MongoDB Installation on Linux


1. DOWNLOAD the stable version of MongoDB. It will be a tar file.
2. Extract the tar file to some directory.
 
$ tar -xvf mongodb.tar -C /learn/mongodb


3. Change the ownership of the folder to the user who will run the DB. In my case, User - hduser and Group - hadoop
$ chown -R hduser:hadoop /learn/mongodb

4. Add the environment variables to .bashrc
export MONGO_HOME=/learn/mongodb
export PATH=$PATH:$MONGO_HOME/bin







5. Create the default DB directory for MongoDB
$ mkdir -p /data/db
$ chown -R hduser:hadoop /data/db

This is the default path; you can specify your own db path when starting MongoDB -






$ mongod --dbpath /app/mongodata
This command will start MongoDB. In another terminal you can start working on the db. "--dbpath /app/mongodata" is totally optional.

If you use just $ mongod , it will start and use the default db directory which we created in step 5.


Please don't close the current terminal, as that can kill the mongodb process.







6. Start working on MongoDB
$ mongo










Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Thursday, 23 March 2017

Measures of Data Spread in Stats


What do we mean by SPREAD? - The measures that tell us about the variability of a dataset, such as its width and how far the values typically lie from the center, fall into this category.

Let's see which measures we are talking about -

Input: 45, 67, 23, 12, 9, 43, 12, 17, 91
Sorted: 9, 12, 12, 17, 23, 43, 45, 67, 91



Range:
It is the simplest measure of Spread. It is the difference between the max and min values of a dataset, but it does not give you any idea about how the data is distributed. It may lead to a wrong interpretation if our dataset has outliers.

Range = Max - Min = 91 - 9 = 82

Interquartile Range (IQR):
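IQR is the spread of the middle 50% of the data, i.e. the difference between the 75th percentile (Q3) and the 25th percentile (Q1). It is used when drawing box plots. Here Q1 = 12 is the median of the lower half (9, 12, 12, 17) and Q3 = 56 is the median of the upper half (43, 45, 67, 91).

IQR = Q3 - Q1 = 56 - 12 = 44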
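Both values can be checked with a few lines of Python. This sketch uses `statistics.quantiles` with the "exclusive" method, which reproduces the quartiles above for this dataset; other quartile conventions (for example the "inclusive" method) can give a different Q3 and therefore a different IQR.

```python
import statistics

data = [45, 67, 23, 12, 9, 43, 12, 17, 91]

# Range: difference between the largest and smallest value.
data_range = max(data) - min(data)

# Quartiles; quantiles() sorts the data internally.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(data_range, q1, q3, iqr)
```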

Variance:
Variance measures how far each element is from the mean. If you simply sum these distances, the result is zero, which is why we use the squared distances to calculate it.

Standard Deviation (`\sigma` or s):
This measure is the square root of the Variance. The only difference between Variance and Standard Deviation is the unit of the output: Variance is in squared units, while Standard Deviation is in the same unit as the data.


`\text{Variance} = \sigma^2 \text{ or } s^2 = \frac{\sum_{i=1}^{N}(x_i-\bar{x})^2}{N}`

`\text{Standard Deviation} = \sigma \text{ or } s = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N}(x_i-\bar{x})^2}{N}}`
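The two formulas can be followed step by step in Python on the input data from this post. This is a sketch of the divide-by-N form shown above; the standard library's `statistics.pvariance` and `statistics.pstdev` use the same definition and serve as a cross-check.

```python
import math
import statistics

data = [45, 67, 23, 12, 9, 43, 12, 17, 91]

mean = sum(data) / len(data)

# Plain deviations from the mean cancel out, so their sum is (numerically) zero.
deviation_sum = sum(x - mean for x in data)

# Variance: mean of the squared deviations (divide by N, as in the formula).
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Standard deviation: square root of the variance, in the data's own unit.
std_dev = math.sqrt(variance)

print(variance, std_dev)
print(statistics.pvariance(data), statistics.pstdev(data))
```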








Wednesday, 22 March 2017

The Three M in Stats : Measures of Center


In Statistics, the 3M summary is very important as it tells a lot about the data distribution. These Ms are - Mean, Median and Mode

Mean - Average
Median - Middle Value
Mode - Most Frequent Value

You can look into "SUMMARY STATISTICS IN DATA ANALYSIS" for the calculations.
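A quick sketch of the three measures using Python's standard `statistics` module. The sample values are only an illustration.

```python
import statistics

data = [45, 67, 23, 12, 9, 43, 12, 17, 91]  # illustrative sample

mean = statistics.mean(data)      # sum of the values divided by their count
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)
```

Here the mean is pulled upwards by the large value 91, while the median stays at the middle of the sorted data.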





Monday, 14 November 2016

DataStage Partitioning #3



Best allocation of Partitions in DataStage for storage area

Srno | No of Ways | Volume of Data | Best way of Partition | Allocation of Configuration File (Node)
1 | DB2 EEE extraction in serial | Low | - | 1
2 | DB2 EEE extraction in parallel | High | Node number = current node (key) | 64 (Depends on how many nodes are allocated)
3 | Partition or Repartition in the Stages of DataStage | Any | Modulus (it should be a single key, and that too an integer) or Hash (any number of keys with different data types) | 8 (Depends on how many nodes are allocated for the job)
4 | Writing into DB2 | Any | DB2 | -
5 | Writing into Dataset | Any | Same | 1,2,4,8,16,32,64 etc… (Based on the incoming records it writes into it.)
6 | Writing into Sequential File | Low | - | 1
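To make the difference between Modulus and Hash partitioning concrete, here is a small Python sketch of the general idea. It is not DataStage's internal algorithm: the function names are made up for the example and CRC32 is only a stand-in hash function.

```python
import zlib

def modulus_partition(key, num_partitions):
    """Modulus partitioning: needs a single integer key."""
    return key % num_partitions

def hash_partition(keys, num_partitions):
    """Hash partitioning: works for any number of keys of any data type."""
    # Join the key values into one byte string and hash it deterministically.
    joined = "|".join(str(k) for k in keys).encode("utf-8")
    return zlib.crc32(joined) % num_partitions

# Rows with the same key always land on the same partition (node).
print(modulus_partition(10, 8))
print(hash_partition(("IN", 42), 8))
```

In both cases rows that share the same key value go to the same partition, which is the property key-based stages rely on.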

 

Best allocation of Partitions in DataStage for each stage

S. No | Stage | Best way of Partition | Important points
1 | Join | Left and Right link: Hash or Modulus | All the input links should be sorted based on the joining key and partitioned with higher key order.
2 | Lookup | Main link: Hash or same; Reference link: Entire | Both the links need not be in the sorted order
3 | Merge | Master and update link: Hash or Modulus | All the input links should be sorted based on the merging key and partitioned with higher key order. Pre-sort makes merge “lightweight” for memory.
4 | Remove Duplicate, Aggregator | Hash or Modulus | If the input link is in sorted order based on the key it will perform better.
5 | Sort | Hash or Modulus | Sorting happens after partitioning
6 | Transformer, Funnel, Copy, Filter | Same | None
7 | Change Capture | Left and Right link: Hash or Modulus | Both the input links should be in the sorted order based on the key and partitioned with higher key order.




