
Saturday, 25 March 2017

Check if a Python Pandas DataFrame Column Has NaN or NULL Values

Before implementing any algorithm on the given data, it is best practice to explore it first so that you can get an idea of what the data looks like. Today, we will learn how to check for missing/NaN/NULL values in data.

1. Reading the data
Reading the CSV data and storing it in a pandas DataFrame.
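A minimal sketch of this step (the file name train.csv is an assumption; the attached notebook and the download link below have the actual data):

import pandas as pd

# read the CSV file into a pandas DataFrame
data = pd.read_csv("train.csv")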

2. Exploring data
Checking how the data looks by using the head() method, which fetches the top few rows of the DataFrame.
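For example, continuing with the DataFrame loaded above (the exact output depends on your data):

# show the first 5 rows
print(data.head())

# column names and data types are also worth a quick look
print(data.dtypes)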

3. Checking NULLs
Pandas provides two methods to check for NULLs - isnull() and notnull().
These return True and False respectively if the value is NULL. So let's check what they return for our data

isnull() test

notnull() test

Check the 0th row, LoanAmount column - in the isnull() test it is True and in the notnull() test it is False. It means this cell is holding a null.
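A small sketch of these element-wise checks, continuing with the DataFrame loaded above (LoanAmount is the column from the example):

# True where a value is missing, False otherwise
print(data.isnull())

# the inverse check
print(data.notnull())

# the same checks work on a single column
print(data['LoanAmount'].isnull())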

But we would not prefer this way for a large dataset, as it returns a True/False matrix for every data point. Instead, we are interested in the counts, or in a simple check of whether the dataset holds any NULLs at all.

Use any()
Pandas also provides the any() method, which returns True if there is at least a single data point for which the checked condition is true.

Use all()
Returns True only if all the data points satisfy the condition.
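A short sketch of both checks, continuing with the same DataFrame (they work column-wise by default; chain any() twice to collapse to a single Boolean):

# True for each column that contains at least one NaN
print(data.isnull().any())

# a single Boolean: does the whole DataFrame contain any NaN?
print(data.isnull().any().any())

# True for each column where every value is non-null
print(data.notnull().all())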

Now that we know there are some nulls/NaN values in our DataFrame, let's check them out -

data.isnull().sum() - this will return the count of NULLs/NaN values in each column.

If you want to get the total number of NaN values, take the sum once again -


If you want the NaN count of any particular column -
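Putting those three counts together, continuing with the same DataFrame (LoanAmount is the column from the example above):

# NaN count per column
print(data.isnull().sum())

# total NaN count in the whole DataFrame
print(data.isnull().sum().sum())

# NaN count for a single column
print(data['LoanAmount'].isnull().sum())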

Here, I have attached the complete Jupyter Notebook for you -

If you want to download the data, You can get it from HERE.


Thursday, 23 March 2017

Measures of Data Spread in Stats

What do we mean by SPREAD? Measures that tell us about the variability of a dataset - its width and how the values are distributed around the average - fall into this category.

Let's see which measures we are talking about -

Input: 45, 67, 23, 12, 9, 43, 12, 17, 91
Sorted: 9, 12, 12, 17, 23, 43, 45, 67, 91

Range:
It is the simplest measure of spread. It is the difference between the max and min values of a dataset, but it does not give you an idea of how the data is distributed, and it may give a misleading picture if the dataset has outliers.

Range = Max - Min = 91 - 9 = 82

Interquartile Range (IQR):
The IQR covers the middle 50 percent of the data; it is the difference between the 75th percentile (Q3) and the 25th percentile (Q1). It is used when plotting boxplots.

IQR = Q3 - Q1 = 56 - 12 = 44
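A quick sketch with NumPy (note that np.percentile interpolates linearly by default, so its quartiles can differ slightly from the hand calculation above, which takes the medians of the two halves):

import numpy as np

data = np.array([45, 67, 23, 12, 9, 43, 12, 17, 91])

# range
print(data.max() - data.min())    # 82

# interquartile range
q1, q3 = np.percentile(data, [25, 75])
print(q3 - q1)                    # may differ from 44 depending on the quartile convention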

Variance (`\sigma^2` or `s^2`):
Variance shows how far each element lies from its mean. If you simply sum the signed distances from the mean, the result is zero, and that is why we use the squared distances to calculate it.

Standard Deviation (`\sigma` or s):
This measure is the square root of the variance. The only difference between variance and standard deviation is the unit of the output: the standard deviation is in the same unit as the data, whereas the variance is in squared units.

`\text{Variance} = \sigma^2 \text{ or } s^2 = \frac{\sum_{i=1}^{N}(x_i-\bar{x})^2}{N}`

`\text{Standard Deviation} = \sigma \text{ or } s = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N}(x_i-\bar{x})^2}{N}}`
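A matching sketch in NumPy (np.var and np.std divide by N, as in the formulas above; pass ddof=1 if you want the sample versions that divide by N-1):

import numpy as np

data = np.array([45, 67, 23, 12, 9, 43, 12, 17, 91])

# population variance and standard deviation
print(np.var(data))
print(np.std(data))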


Tuesday, 21 March 2017

Summary Statistics in Data Analysis

Summary statistics are numbers that summarize properties of the data, e.g. mean, spread, central tendency, etc. We will see each one, one by one.

Let's take an input dataset -

Input: 45, 67, 23, 12, 9, 43, 12, 17, 91
Sorted: 9, 12, 12, 17, 23, 43, 45, 67, 91

Frequency: The frequency of an attribute value is the number of times (or the percentage of times) the value occurs in the data set.
In our dataset, the frequency of 12 is 2 (2 out of 9 values, about 22%).

Mode: The mode of an attribute is the most frequent attribute value.

The mode of our dataset is 12, since it is the most frequent item, occurring 2 times.

Things to remember:
i - There is no mode if all the values are the same
ii - The same applies if every value occurs only once

Usually, Mode and Frequency are used for categorical data

Percentiles: These are used for continuous data.
Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile
is a value xp of x such that p% of the observed values of x are less than xp.

How to calculate the percentile:
1. Count the total number of items in the dataset = N
2. Multiply the percentile p by the total number of items = N*p
3. This gives you a number which can be a float or an integer
4. If it is a float, round it off to the nearest integer, call this the pth position
          i. Sort the data into increasing order
          ii. Now, the pth item in this dataset is your percentile value
5. If it is an integer
          i. Sort the data into increasing order
          ii. Now, the average of the pth and (p+1)th items in this dataset is your percentile value

So when we say 20%, it means -

Number of items in the dataset = 9
Number of items which should be less than xp = 9*20% = 1.8
Round this off to the nearest integer - 2
Our dataset is already sorted in increasing order, so check the 2nd value - 12

Likewise, the 25%, 50% and 75% positions are 9*25%, 9*50%, 9*75% = 2.25th, 4.5th, 6.75th,
i.e. the 2nd, 5th and 7th values - 12, 23, 45

This is one way to calculate the percentile; if you use a calculator or some other method, the result might be slightly different.
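For comparison, a quick sketch with NumPy (np.percentile interpolates linearly between data points by default, so its answers can differ a little from the hand method above):

import numpy as np

data = np.array([45, 67, 23, 12, 9, 43, 12, 17, 91])

# 20th, 25th, 50th and 75th percentiles
print(np.percentile(data, [20, 25, 50, 75]))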

Mean or Average:  Sum(all items) / Total number of elements

Mean = (9+12+12+17+23+43+45+67+91)/9 = 319/9 ≈ 35.4

However, the mean is very sensitive to outliers, so to understand the central tendency of the data we go for the median rather than the mean.

Median: The median is the 50th percentile, or middle value.

How to get Median/Middle value -
a. Sort the data into increasing order
b. Get total no of elements - N
     if N is even -  median =   ( N/2th element + [N/2 + 1]th element) / 2
     if N is odd - median = ceil(N/2)th element

For our case, N = 9, which is odd, so ceil(9/2) = ceil(4.5) = 5th element 
Median = 23

Range:  Difference between Max and Min is called range. 

Input dataset range - 91-9 = 82

Variance: The variance or standard deviation is the most common measure of the spread of a set of points.

`\text{variance}(x) = \sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2`

where `\bar{x}` is the mean of all values of x,
n = total number of items in the dataset,
`\sigma` is the standard deviation
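A short sketch that reproduces these summary statistics with NumPy (ddof=1 gives the n-1 divisor used in the formula above; the default divides by n):

import numpy as np

data = np.array([45, 67, 23, 12, 9, 43, 12, 17, 91])

print(data.mean())                 # mean, about 35.4
print(np.median(data))             # median, 23
print(data.max() - data.min())     # range, 82
print(np.var(data, ddof=1))        # sample variance
print(np.std(data, ddof=1))        # sample standard deviation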


Sunday, 19 March 2017

5 Number Summary - Statistics Basics

What is the 5 number summary?

The 5 number summary is a statistical measure used to get an idea of the tendency and spread of the data.

It includes :

1.  Minimum
2.  Q1 (25 percentile)
3.  Median (middle value or 50 percentile)
4.  Q3 (75 percentile)
5.  Maximum

How to calculate or get these values??

Input data :  45, 67, 23, 12, 9, 43, 12, 17, 91

Step1:  Sort the data

9, 12, 12, 17, 23, 43, 45, 67, 91

Step2:  You can easily get the minimum and maximum no

Min : 9
Max : 91

Step 3: Finding the median - that is, finding the middle value; don't confuse it with the Mean or Average.

How to get Median/Middle value - 
a. Sort the data into increasing order
b. Get total no of elements - N
     if N is even -  median =   ( N/2th element + [N/2 + 1]th element) / 2
     if N is odd - median = ceil(N/2)th element

For our case, N = 9, which is odd, so ceil(9/2) = ceil(4.5) = 5th element 
Median = 23

Step 4: Finding out Q1 and Q3 (called quartiles) is very easy. Divide the element list into 2 lists around the median value -

 (9, 12, 12, 17), 23, (43, 45, 67, 91) 

Now, find the median of the 1st list, which is Q1, and the median of the 2nd list, which is Q3.

As we can see, list1 and list2 both have an even number of elements, so -

Median of list1 (Q1) =  ( N/2th element + [N/2 + 1]th element) / 2
                                  =  ( 4/2th element + [4/2 +1]th element) / 2
                                  =  ( 2nd element  + 3rd element ) /2
                                  =  (12 + 12 ) / 2 
                            Q1 = 12

Median of list2 (Q3) = ( 45 + 67 ) / 2
                                  = 112 / 2
                                  = 56 

We got the Q1 (12) and Q3 (56). 

Our 5 number summary is now calculated -

min, Q1, median, Q3, max 
9,     12,  23,         56, 91
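A quick cross-check with NumPy (np.percentile interpolates by default, so Q1 and Q3 can come out slightly different from the split-the-halves method used above):

import numpy as np

data = np.array([45, 67, 23, 12, 9, 43, 12, 17, 91])

# min, Q1, median, Q3, max
print(np.percentile(data, [0, 25, 50, 75, 100]))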


Saturday, 14 January 2017

Learning Numpy #2

Thursday, 12 January 2017

Learning Numpy #1

NumPy is a Python library used for numerical calculations, and it performs much better than pure Python for this kind of work. In this notebook, I have shared some basics of NumPy and will share more in the next few posts. I hope you find these useful.
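A couple of the basics covered there, as a tiny illustrative sketch (not the full notebook):

import numpy as np

a = np.array([1, 2, 3, 4])    # create an array from a Python list
b = np.arange(0, 8, 2)        # evenly spaced values: [0 2 4 6]

print(a + b)                  # element-wise arithmetic
print(a.reshape(2, 2))        # reshape into a 2x2 matrix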


Wednesday, 11 January 2017

My Learning Path for Machine Learning

I am a Python lover, so my path includes lots of Python material. If you don't know the basics of this wonderful language, start from HERE; otherwise you can follow the links which I am going to share.

Learning ML is not only about studying ML algorithms; it includes basic algebra, statistics, algorithms, programming and a lot more. But there is no need to be afraid :-) we just need to start somewhere.....

This is my github repo, you can fork it and follow me with these 2 links --

Fork
Follow @atulsingh0
I am still updating this list and welcome you to update this as well.


Friday, 6 January 2017

10 minutes with pandas library

Thursday, 5 January 2017

Learning Pandas #5 - read & write data from file

Wednesday, 4 January 2017

Learning Pandas #4 - Hierarchical Indexing

Sunday, 1 January 2017

Learning Pandas #3 - Working on Summary & MissingData

Saturday, 31 December 2016

Learning Pandas - DataFrame #2

Friday, 30 December 2016

Learning Pandas - Series #1

Thursday, 6 October 2016

Difference between IBM CDC and CDD

Nowadays I am busy with another IBM tool called CDC - Change Data Capture, now known as IBM InfoSphere Data Replication.
IBM CDC replicates your heterogeneous data in near real time to support data migrations, application consolidation, data synchronization, dynamic warehousing, master data management (MDM), business analytics and data quality processes.
In layman's terms, you can replicate any data automatically in near real time.

When we were at the initial stage of our POC, we had a doubt about IBM CDC versus CDD (Change Data Delivery) - which one should we use and what is the difference? Whenever we googled IBM CDD, we got results pointing to IBM CDC links, so in some way we were sure that both are the same tool or related to each other. Luckily we found an IBM link which says -
InfoSphere CDD is the exact same code (product) as InfoSphere CDC. The only difference is the licensing model. Please reach out to your IBM Sales Representative for additional details 

So we contacted our IBM sales buddy to understand the licensing model, and this is what we got to know -

* IBM CDC and CDD are the same product.
* IBM CDC comes with source and target agents, and there is an individual licensing cost for each agent.
i.e. - Assuming you have 2 different database vendors (Oracle, Db2) on the source side and 3 target DB vendors (SQL Server, MySQL, Netezza), you have to pay for 5 replication CDC agents, and this count increases or decreases with the number of different database products you are using.

* IBM CDD (Change Data Delivery) offers some relief on cost, but only if you already have an IBM DataStage license. IBM CDD comes with multiple source agents and a single target agent for IBM DataStage, which saves cost.
 i.e. - Taking the previous example, we need to buy 2 (for the source DBs) + 1 (for DataStage) licenses to use replication.
The benefit here is that you only need to pay for 1 target agent (for DataStage) and then use DataStage to deliver your data to any target, which saves the cost of multiple target licenses.

This is the only reason why they have different names despite being the same software - 
IBM CDC - Change Data Capture ( n source + n target )  [DataStage not required]
IBM CDD - Change Data Delivery ( n source  + 1 target ) [DataStage required]


Wednesday, 3 August 2016

#3 How to Copy DataSet from One Server to Another Server

This post is the third and last part of the How to Copy DataSet from One Server to Another Server series.

We have generated and populated a dataset, and identified the files which we need to move from serverA to another server, serverB.


4. Reading the dataset on another server

This is the most crucial step. Now all 4 files have been moved to serverB, or to a common location which is accessible from serverB.

For my case, common dir is my home - /home/users/atul

A. Change the default.apt file
We need to change the fastname in the default.apt (config file) which we copied from serverA [NOT the default.apt of serverB].

Open the file in any text editor or vi and change it as shown in the below screenshot -

Temporarily create the "resource disk" and "resource scratchdisk" locations, if they do not already exist, as defined in the above config file.

B. Copy the dataset data files 

Move the dataset data files from the common directory to the "resource disk" location defined in the config file.

cp ~/dummy.ds.* /opt/IBM/InformationServer/Server/DataSets/

Now, all files locations are like -

Config file and Dataset descriptor file - my home dir or common dir
Dataset data files - /opt/IBM/InformationServer/Server/DataSets/

Design a job which will read these dataset files and populate the data into a sequential file or any other output.

Job Parameters -
APT_CONFIG_FILE = /home/users/atul/default.apt

DataSet Properties
DataSet File - /home/users/atul/dummy.ds

That is all; you can now read the copied dataset on serverB. You can populate this data to some other output such as a sequential file or a table, so that you can avoid further use of the copied default.apt config file, which does not belong to serverB.

Try it out, let me know if you have any question.


Monday, 1 August 2016

#2 How to Copy DataSet from One Server to Another Server

This post is the second part of How to Copy DataSet from One Server to Another Server.


After generating the dummy dataset, the next step is to identify the files which we need to copy.

2. Files which we need to move

a. APT_CONFIG_FILE - the configuration file used by the dataset
b. DataSet descriptor file - the *.ds file, in our case dummy.ds
c. DataSet data files - the actual data files, which are stored in the RESOURCE DISK location

So let's get all the paths which we need to access -

APT_CONFIG_FILE = /opt/IBM/InformationServer/Server/Configulations/default.apt
RESOURCE DISK = /opt/IBM/InformationServer/Server/DataSets
DATASET LOC = /home/users/atul/dummy.ds

Use commands or any FTP tool to copy these files to a shared location which is accessible from the other server (serverB).

In my case, I have stored all of them in my Linux home directory, which is common to both servers.

So I have executed these commands to copy all the required files into my home directory.

cp  /opt/IBM/InformationServer/Server/Configulations/default.apt ~
cp  /opt/IBM/InformationServer/Server/DataSets/dummy.ds.* ~
cp  /home/users/atul/dummy.ds ~

Now, my home directory has these files -

You can copy these 4 files to serverB, where you want to move your dataset. I am not doing that, as my home directory is common to both servers.

3. Why we need these files only

The config file was used by DataStage to create the dataset (descriptor file, data files, data file location).
So we need the config file, the dataset descriptor file and the dataset data files.


Saturday, 30 July 2016

#1 How to Copy DataSet from One Server to Another Server

Hi Guys...
I've been asked so many times how we can move/copy a dataset from one server to another, so here is the way I follow.

As a very first step, analyze whether you can avoid this by using some other way, like creating a sequential file and FTPing it, or loading the data into a temporary table which is accessible on the other server, or, if you are using DataStage packs, via MQ, XML or JSON formats, etc. I am suggesting these solutions because they are easy to design and they guarantee the data quality at the other end.

If above solutions are not possible, please follow the below steps -

Points I am going to cover here -
1. Generating a dummy dataset
2. Files which we need to move
3. Why we need these files only
4. Reading the dataset on another server


1. Generating a dummy dataset

I have created a dummy job which generates a dataset with the default APT_CONFIG_FILE, which has 2 nodes.

Here, I am generating 10 dummy rows with the help of a Row Generator stage and storing them into a dataset.

a. Config File - I am using the default config file (replaced the server name in "fastname" with serverA)

APT_CONFIG_FILE = /opt/IBM/InformationServer/Server/Configulations/default.apt

Check out the "resource disk" location in the config file; we need it for further processing.

RESOURCE DISK = /opt/IBM/InformationServer/Server/DataSets

b. dataset location - I have created this dataset in my home dir named dummy.ds

DATASET LOC = /home/atul/dummy.ds

Keep looking for next post........


Tuesday, 17 May 2016

Lookup Stage behaviour

Today, I am going to ask you a question: what value will I get from a lookup when my datatype is integer (Not Null) and there is no match between source and reference data???

Generally, we say NULL, as there is no match between source and reference. But that's not true.
So let's see how DataStage and the Lookup stage behave :-)
When Source and Reference are NULLable -
-       If there is no match between source and reference, we will get NULL in the output.

When Source and Reference are Not-NULLable -
-       If there is no match between source and reference, we will get the DataStage default for that datatype,
        such as 0 for integer and an empty string ('') for varchar, when the data goes out of the Lookup stage.

So, be careful when you are planning to filter the data outside the Lookup stage based on a referenced column's value: since the field in the output is not null, the Transformer stage does not receive a null (it comes through with the default value 0) and cannot handle it as you expect.

Hopefully this adds one more pointer to your learning. Let me know your thoughts in the comments section.


Tuesday, 10 May 2016

ETL Data Validation

    Organizations typically have “dirty data” that must be cleansed or scrubbed before being loaded into the data warehouse.  In an ideal world, there would not be dirty data.  The data in operational systems would be clean. Unfortunately, this is virtually never the case.  The data in these source systems is the result of poor data quality practices and little can be done about the data that is already there.  While organizations should move toward improving data quality at the source system level, nearly all data warehousing initiatives must cope with dirty data, at least in the short term. There are many reasons for dirty data, including:

Dummy values.  Inappropriate values have been entered into fields.  For example, a customer service representative, in a hurry and not perceiving entering correct data as being particularly important, might enter the store’s ZIP code rather than the customer’s ZIP, or enters 999-99-9999 whenever a SSN is unknown.  The operational system accepts the input, but it is not correct.

Absence of data.  Data was not entered for certain fields. This is not always attributable to lazy data entry habits and the lack of edit checks, but to the fact that different business units may have different needs for certain data values in order to run their operations.  For example, the department that originates mortgage loans may have a federal reporting requirement to capture the sex and ethnicity of a customer, whereas the department that originates consumer loans does not.

Multipurpose fields.  A field is used for multiple purposes; consequently, it does not consistently store the same thing.  This can happen with packaged applications that include fields that are not required to run the application.  Different departments may use the “extra” fields for their own purposes, and as a result, what is stored in the fields is not consistent.

Cryptic data.  It is not clear what data is stored in a field.  The documentation is poor and the attribute name provides little help in understanding the field’s content.  The field may be derived from other fields or the field may have been used for different purposes over the years.

Contradicting data.  The data should be the same but it isn’t.  For example, a customer may have different addresses in different source systems.

Inappropriate use of address lines.  Data has been incorrectly entered into address lines.  Address lines are commonly broken down into, for example, Line 1 for first, middle, and last name, Line 2 for street address, Line 3 for apartment number, and so on.  Data is not always entered into the correct line, which makes it difficult to parse the data for later use.

Violation of business rules.  Some of the values stored in a field are inconsistent with business reality.  For example, a source system may have recorded an adjustable rate mortgage loan where the value of the minimum interest rate is higher than the value of the maximum interest rate. 

Reused primary keys.  A primary key is not unique; it is used with multiple occurrences.  There are many ways that this problem can occur.  For example, assume that a branch bank has a unique identifier (i.e., a primary key).  The branch is closed and the primary key is no longer in use.  But two years later, a new branch is opened, and the old identifier is reused.  The primary key is the same for the old and the new branch.

Non-unique identifiers.  An item of interest, such as a customer, has been assigned multiple identifiers.  For example, in the health care field, it is common for health care providers to assign their own identifier to patients.  This makes it difficult to integrate patient records to provide a comprehensive understanding of a patient’s health care history.

Data integration problems.  The data is difficult or impossible to integrate.  This can be due to non-unique identifiers, or the absence of an appropriate primary key.  To illustrate, for decades customers have been associated with their accounts through a customer name field on the account record.  Integrating multiple customer accounts in this situation can be difficult.  When we examine all the account records that belong to one customer, we find different spellings or abbreviations of the same customer name, sometimes the customer is recorded under an alias or a maiden name, and occasionally two or three customers have a joint account and all of their names are squeezed into one name field.


Sunday, 27 March 2016

Data Warehouse Glossary #3

Drill Through:
Data analysis that goes from an OLAP cube into the relational database.

Data Warehousing:
The process of designing, building, and maintaining a data warehouse system.

Conformed Dimension:
A dimension that has exactly the same meaning and content when being referred from different fact tables.

Central Warehouse
A database created from operational extracts that adheres to a single, consistent, enterprise data model to ensure consistency of decision-support data across the corporation. A style of computing where all the information systems are located and managed from a single physical location.

Change Data Capture
The process of capturing changes made to a production data source. Change data capture is typically performed by reading the source DBMS log. It consolidates units of work, ensures data is synchronized with the original source, and reduces data volume in a data warehousing environment.

Classic Data Warehouse Development
The process of building an enterprise business model, creating a system data model, defining and designing a data warehouse architecture, constructing the physical database, and lastly populating the warehouses database.

Data Access Tools
An end-user oriented tool that allows users to build SQL queries by pointing and clicking on a list of tables and fields in the data warehouse.

Data Analysis and Presentation Tools
Software that provides a logical view of data in a warehouse. Some create simple aliases for table and column names; others create data that identify the contents and location of data in the warehouse.

Data Dictionary
A database about data and database structures. A catalog of all data elements, containing their names, structures, and information about their usage. A central location for metadata. Normally, data dictionaries are designed to store a limited set of available metadata, concentrating on the information relating to the data elements, databases, files and programs of implemented systems.

Data Warehouse Architecture
An integrated set of products that enable the extraction and transformation of operational data to be loaded into a database for end-user analysis and reporting.

Data Warehouse Architecture Development
A SOFTWARE AG service program that provides an architecture for a data warehouse that is aligned with the needs of the business. This program identifies and designs a warehouse implementation increment and ensures the required infrastructure, skill sets, and other data warehouse foundational aspects are in place for a Data Warehouse Incremental Delivery.

Data Warehouse Engines
Relational databases (RDBMS) and Multi-dimensional databases (MDBMS). Data warehouse engines require strong query capabilities, fast load mechanisms, and large storage requirements.

Data Warehouse Incremental Delivery
A SOFTWARE AG program that delivers one data warehouse increment from design review through implementation.

Data Warehouse Infrastructure
A combination of technologies and the interaction of technologies that support a data warehousing environment.

Data Warehouse Management Tools
Software that extracts and transforms data from operational systems and loads it into the data warehouse.

Data Warehouse Network
An industry organization for know-how exchange. SOFTWARE AG was the first vendor member of the Data Warehouse Network.

Functional Data Warehouse
A warehouse that draws data from nearby operational systems. Each functional warehouse serves a distinct and separate group (such as a division), functional area (such as manufacturing), geographic unit, or product marketing group.

OLTP
On-Line Transaction Processing. OLTP describes the requirements for a system that is used in an operational environment.

Scalability
The ability to scale to support larger or smaller volumes of data and more or fewer users. The ability to increase or decrease size or capability in cost-effective increments with minimal impact on the unit cost of business and the procurement of additional services.

The logical and physical definition of data elements, physical characteristics and inter-relationships.

Slice and Dice
A term used to describe a complex data analysis function provided by MDBMS tools.

Warehouse Business Directory
Provides business professionals access to the data warehouse by browsing a catalog of contents.

Warehouse Technical Directory
Defines and manages an information life cycle, a definition of warehouse construction, change management, impact analysis, distribution and operation of a warehouse.

Rules applied to change data.