Showing posts with label Analysis. Show all posts
Showing posts with label Analysis. Show all posts

Saturday, 25 March 2017

Check if Python Pandas DataFrame Column is having NaN or NULL

Before implementing any algorithm on the given data, It is a best practice to explore it first so that you can get an idea about the data. Today, we will learn how to check for missing/Nan/NULL values in data.

1. Reading the data
Reading the csv data into storing it into a pandas dataframe.

2. Exploring data
Checking out the data, how it looks by using head command which fetch me some top rows from dataframe.

3. Checking NULLs
Pandas is proving two methods to check NULLs - isnull() and notnull()
These two returns TRUE and FALSE respectively if the value is NULL. So let's check what it will return for our data

isnull() test

notnull() test

Check 0th row, LoanAmount Column - In isnull() test it is TRUE and in notnull() test it is FALSE. It mean, this row/column is holding null.

But we will not prefer this way for large dataset, as this will return TRUE/FALSE matrix for each data point, instead we would interested to know the counts or a simple check if dataset is holding NULL or not.

Use any()
Python also provide any() method which returns TRUE if there is at least single data point which is true for checked condition.

Use all()
Returns TRUE if all the data points follow the condition.

Now, as we know that there are some nulls/NaN values in our data frame, let's check those out - 

data.isnull().sum() - this will return the count of NULLs/NaN values in each column.

If you want to get total no of NaN values, need to take sum once again -


If you want to get any particular column's NaN calculations - 

Here, I have attached the complete Jupyter Notebook for you -

If you want to download the data, You can get it from HERE.

Like the below page to get update

Wednesday, 22 March 2017

The Three M in Statis : Measures of Center

In Statistics, 3M summary is very important as it tells a lot about data distribution. These Ms are - Mean, Median and Mode

Mean - Average
Median - Middile Value
Mode - Frequent Item count

for the calulations.

Like the below page to get update

Tuesday, 21 March 2017

Summary Statistics in Data Analysis

Summary statistics  are numbers that summarize properties of the data. i.e - Mean, Spread, tendency etc. We will see each one by one.

Let's take a input dataset -

Input: 45, 67, 23, 12, 9, 43, 12, 17, 91
Sorted: 9, 12, 12, 17, 23, 43, 45, 67, 91

Frequency: The frequency of an attribute value is the percentage of time the value occurs in the data set.
In our dataset, Frequency of 12 is 2.

Mode: The mode of a an attribute is the most frequent attribute value

Mode for our dataset is 2 as 12 is the most frequent item which occurs 2 time

Things to remember:
i- There is no mode if all the values are same
ii - Same is applicable if all values occurrence is 1 

Usually, Mode and Frequency are used for categorical data

Percentiles: This used for continuous data.
Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile
is a value xp of x such that p% of the observed values of x are less than xp.

How to calculate the Percentile:
1. Count the total item in dataset = N
2. Multiply the percentile p with total no of items = N*p
3. This will give you a no which can be a float or integer
4. If it is a float, round off it to nearest integer, named pth no
          i. Sort the data into increasing order
          ii. Now, pth no in this dataset is your percentile value
5. If it is an integer no
          i. Sort the data into increasing order
          ii. Now, average of pth no and (p+1)th no in this dataset is your percentile value

So when we say, 20% means -

No of items in dataset = 9
No of items which should be less than xp. - 9*20% = 1.8
Round off this to nearest integer - 2
Our dataset is already sorted in increasing order, so check the 2nd value - 12

likewise, 25%, 50% and 75% is - 9*25%, 9*50%, 9*75% = 2.25th, 4.5th, 6.75th
2th, 5th, 7th - 12, 23, 45

This is one way to calculate the percentile, If you use calculator or some other method, it might be slightly different.

Mean or Average:  Sum(all items) / Total no of element

Mean -  (9+12+12+17+23+43+45+67+91)/9 = 34.4

However, the mean is very sensitive to outliers. So to understand the data tendency, we go for median rather than means.

Median: Median is 50 percentile, or middle value

How to get Median/Middle value - a. Sort the data into increasing orderb. Get total no of elements - N     if N is even -  median =   ( N/2th element + [N/2 + 1]th element) / 2     if N is odd - median = ceil(N/2)th element

For our case, N = 9, which is odd, so ceil(9/2) = ceil(4.5) = 5th element 
Median = 23

Range:  Difference between Max and Min is called range. 

Input dataset range - 91-9 = 82

Variance: The variance or standard deviation is the most common measure of the spread of a set of points.

`variance(x) = \sigma^2 = \frac{1}{n-1}\Sigma_{i=1}^n(x_i-\bar{x})^2`

where `\bar{x}` is Mean of all value of x
m = total no of items in dataset
`\sigma` is standard deviation

Like the below page to get update

Friday, 3 February 2017

Learning Matplotlib #2

Saturday, 14 January 2017

Learning Numpy #2

Thursday, 12 January 2017

Learning Numpy #1

Numpy is a python library used for numerical calculations and this is better performant than pure python. In this notebook, I have shared some basics of Numpy and will share more in next few posts. I hope you find these useful.

Like the below page to get update

Wednesday, 11 January 2017

My Learning Path for Machine Learning

I am a Python Lover guy so my way includes lots of Python points. If you dont know the basics of this wonderful language, start it from HERE else you can follow the links which I am going to share.

Learning ML is not only studying ML algorithms, it includes Basic Algebra, Statistics, Algorithms, Programming and lot more. But no need to afraid as such :-) we need to start from somewhere.....

This is my github repo, you can fork it and follow me with these 2 links --

Fork Fork
Follow - Follow @atulsingh0
I am still updating this list and welcome you to update this as well.

Like the below page to get update

Friday, 6 January 2017

10 minutes with pandas library

Thursday, 5 January 2017

Learning Pandas #5 - read & write data from file

Wednesday, 4 January 2017

Learning Pandas #4 - Hierarchical Indexing

Sunday, 1 January 2017

Learning Pandas #3 - Working on Summary & MissingData

Saturday, 31 December 2016

Learning Pandas - DataFrame #2

Friday, 30 December 2016

Learning Pandas - Series #1

Wednesday, 21 December 2016

R Points #2 - DataFrame & List Basics

Sunday, 18 December 2016

R Points #1 - Matrix & Factor Basics

Monday, 22 February 2016

Python Points #10d - Writing into Files

Wednesday, 17 February 2016

Python Points #10c - File Methods

Sunday, 7 February 2016

Python Points #10b - Reading Files