Showing posts with label Concept. Show all posts
Showing posts with label Concept. Show all posts

Thursday, 23 March 2017

Measures of Data Spread in Stats

What do we mean by SPREAD? - The measures which can tell us the variability of a dataset, width, average distribution falls into this category.

Let's see which measures we are taking about-

Input: 45, 67, 23, 12, 9, 43, 12, 17, 91
Sorted: 9, 12, 12, 17, 23, 43, 45, 67, 91

It is the simplest measures of Spread. It is the difference between max and min value of a dataset but this will not give you the idea about the data distribution. It may be given a wrong interpretation if our dataset is having outliers.

Range - Max - Min = 91 - 9 = 82

Interquartile Range (IQR):
IQR is the middle 50 percentile data which is difference between 75 percentile and 25 percentile. It is used in boxplot plotting. 

IQR = Q3 - Q1 = 56 - 12 = 44

Variance shows the distance of each element from its mean, If you simply sum this it will be zero and that is why we use squared distance to calculate it.

Standard Deviation (`\sigma` or s):
This measure is square root of Variance, the only difference between Variance and Standard deviation is the output unit as Variance.

`Variance = \sigma^2 or s^2 = \frac{\Sigma_{i=1}^N(x_i-\barx)^2}{N}`

`Standard Deviation = \sigma or s = \root{2}{\sigma^2} = \root{2}{\frac{\Sigma_{i=1}^N(x_i-\barx)^2}{N}}`

Like the below page to get update

Wednesday, 22 March 2017

The Three M in Statis : Measures of Center

In Statistics, 3M summary is very important as it tells a lot about data distribution. These Ms are - Mean, Median and Mode

Mean - Average
Median - Middile Value
Mode - Frequent Item count

for the calulations.

Like the below page to get update

Monday, 14 November 2016

DataStage Partitioning #3

Best allocation of Partitions in DataStage for storage area

No of Ways
Volume of Data
Best way of Partition
Allocation of Configuration File (Node)
DB2 EEE  extraction in serial
DB2 EEE extraction in parallel
Node number = current node (key)
64 (Depends on how many nodes are allocated)
Partition or Repartition in the Stages of DataStage
Modulus (It should be single key that to integer)
Hash (Any number of keys with different data type)
8 (Depends on how many nodes are allocated for the job)
Writing into DB2
Writing into Dataset
1,2,4,8,16,32,64 etc… (Based on the incoming records it writes into it.)
Writing into Sequential File


Best allocation of Partitions in DataStage for each stage

S. No
Best way of Partition
Important points
Left and Right link: Hash or Modulus
All the input links should be sorted based on the joining key and partitioned with higher key order.

Main link: Hash or same
Reference link: Entire
Both the links need not be in the sorted order

Master and update link: Hash or Modulus
All the input links should be sorted based on the merging key and partitioned with higher key order. Pre-sort makes merge “lightweight” for memory.

Remove Duplicate, Aggregator
Hash or Modulus
If the input link is in sorted order based on the key it will perform better.

Hash or Modulus
Sorting happens after partitioning

Transformer, Funnel, Copy, Filter
Change Capture
Left and Right link: Hash or Modulus
Both the input links should be in the sorted order based on the key and partitioned with higher key order.

Like the below page to get update

Friday, 11 November 2016

DataStage Partitioning #2

Keyless partition
    Round Robin
    Entire
    Same
    Random

Round Robin
The first record goes to the first processing node, the second to the second processing node, and so on. When DataStage reaches the last processing node in the system, it starts over. This method is useful for resizing partitions of an input data set that are not equal in size. The round robin method always creates approximately equal-sized partitions. This method is the one normally used when DataStage initially partitions data.
Example: Assume 8 nodes are allocated to store the records then 1st record will go into the first node and the 2nd record will go into the second node ……8th record will go into the eighth node and the 9th record will go into the first node and so on….

Send all rows down all partitions.
Example: Assume 8 nodes are allocated, then in all the 8 nodes all the records will be passed.

Preserve the same partitioning.
Example: Two stages in a job (Sort and Dataset). In sort stage you have done “Hash” partition and in the dataset you have given “Same” partition. In the dataset the data will be preserved with the hash partition.

DataStage uses a random algorithm to choose where the rows goes. The result of Random is that you cannot know where a row will end up.

Application Execution: Parallel jobs can be executed in two ways
> Sequential
> Parallel
In the first slot of the below figure sequential execution is shown. Parallel job can be executed in two processing, SMP and MPP, in the second and third slot of the below figure it is shown.

Ref -

Like the below page to get update

Thursday, 10 November 2016

DataStage Partitioning #1

Partitioning mechanism divides a portion of data into smaller segments, which is then processed independently by each node in parallel. It helps make a benefit of parallel architectures like SMP, MPP, Grid computing and Clusters.

Partition is logical. Partition is to divide memory or mass storage into isolated sections. Memory space will be split into many partitions to have high parallelism. In DOS systems, you can partition a disk, and each partition will behave like a separate disk drive.

In hash partitioning no specified space will be allocated to a partition in the memory. The partition space is allocated depending upon the data.

Why Partition?
•    Ability to run multiple operating systems, or multiple versions of an operating system, on the same server
•    Ability to improve workload balancing and distribution by managing processor allocations across applications and users on the server
•    Ability to leverage hardware models such as “Capacity on Demand” and "Pay as You Grow.”

Types of partition
  • Hash
  • Modulus
  • DB2
  • Auto
  • Random
  • Range
  • Round Robin
  • Entire
  • Same


DataStage inserts partitioners as necessary to ensure correct result. Generally chooses Round Robin or Same. Since Datastage has limited awareness of data and business rules, best practice is to explicitly specify partitioning as per requirement when processing requires groups of related records.

Key based partition
  • Hash
  • Modulus
  • DB2
  • Range

Determines partition based on key value(s). One or more keys with different data type are supported. DataStage’s internal algorithm applied to key values determines the partition. All key values are converted to characters before the algorithm is applied.
Example: Key is State. All “CA” rows go into one partition; all “MA” rows go into one partition. Two rows of the same state never go into different partitions.


Partition based on modulus of key divided by the number of partitions. Key is an Integer type. ( partition=MOD(key_value/number of partition) )
Example: Key is OrderNumber (Integer type). Rows with the same order number will all go into the same partition.

Matches DB2 EEE partitioning, DB2 published its hashing algorithm and DataStage copies that.
Example: This partition is used when loading data into the DB2 table. It takes the partition key from the loading DB2 table and inserts the records effectively. If the partition key is defined in the DB2 database then it takes that Partition key otherwise it defaults to primary key.

The partition is chosen based on a range map, which maps ranges of values to specified partitions. This is similar to Hash, but partition mapping is user-determined and partitions are ordered. Range partitioning requires processing the data twice which makes it hard to find a reason for using it.

This figure gives the clear view of Key based Partitioning and repartitioning.

DataStage's parallel technology operates by a divide-and-conquer technique, splitting the largest integration jobs into subsets ("partition parallelism") and flowing these subsets concurrently across all available processors ("pipeline parallelism"). This combination of pipeline and partition parallelism delivers true linear scalability (defined as an increase in performance proportional to the number of processors) and makes hardware the only mitigating factor to performance.

                However, downstream processes may need data partitioned differently. Consider a transformation that is based on customer last name, but the enriching needs to occur on zip code - for house-holding purposes - with loading into the warehouse based on customer credit card number (more on parallel database interfaces below). With dynamic data re-partitioning, data is re-partitioned on-the-fly between processes - without landing the data to disk - based on the downstream process data partitioning needs.

Ref -

Like the below page to get update

Sunday, 16 October 2016

Jenkins with Windows #1

One of my team member is assigned to install and configure "jenkins" on our server so out of curiosity, I asked what is this now ?? but didn't get a satisfactory answer :-) so I thought of having my hand dirty with it. Here, I am sharing whatever I learn.

What is Jenkins:-
Wikipedia sources says,  Jenkins is an open source automation server written in Java. Jenkins helps automating the non-human part of the whole software development process, with now common things like Continuous Integration, but by further empowering teams to implement the technical part of a Continuous Delivery.

What is Continuous Integration & Continuous Delivery:-
CI is a process that most developers follow to keep their code base intact. It's mostly a common practice when you work in a group environment. For example, an analogy for this would be constructing a new home. There will be multiple contractors working on the site. So, if we have installed the window glasses and the paint person comes in and paints the house there are high chances that he will drop some paint on the glasses or end up breaking the glass. So, the inspector comes and checks it every day to see if something broke. The same process is applied for constructing a new code. CI system gathers all your code from different developers and makes sure it compiles and build fine. This is good. But, not complete. I will get to that once I complete talking about Jenkins.

Jenkins is the inspector in the analogy. Jenkins is nothing but a middle man between your code repo and your build server. It checks for changes on your server every few minutes. If it found them, it gathers them and sends them to your build server. That's what Jenkins is.

Basically Continuous Integration is the practice of running your tests on a non-developer machine automatically everytime someone pushes new code into the source repository.

This has the tremendous advantage of always knowing if all tests work and getting fast feedback. The fast feedback is important so you always know right after you broke the build (introduced changes that made either the compile/build cycle or the tests fail) what you did that failed and how to revert it.

If you only run your tests occasionally the problem is that a lot of code changes may have happened since the last time and it is rather hard to figure out which change introduced the problem. When it is run automatically on every push then it is always pretty obvious what and who introduced the problem.

Built on top of Continuous Integration are Continuous Deployment/Delivery where after a successful test run your instantly and automatically release the latest version of your codebase. Makes deployment a non-issue and helps you speed up your development.

                              Jenkins offers the following major features out of the box, and many more can be added through plugins:

Developer time is focused on work that matters — Much of the work of frequent integrations is handled by automated build and testing systems, meaning developer time isn't wasted on large-scale error-ridden integrations.

Software quality is improved — Any issues are detected and resolved almost immediately, keeping software in a state where it can be safely released at any time.

Faster Development - Integration costs are reduced both because serious integration issues are less likely and because much of the work of integration is automated.

Easy installation: Just run java -jar jenkins.war, deploy it in a servlet container. No additional install, no database. Prefer an installer or native package? We have those as well.

Easy configuration: Jenkins can be configured entirely from its friendly web GUI with extensive on-the-fly error checks and inline help.

Rich plugin ecosystem: Jenkins integrates with virtually every SCM or build tool that exists.

Extensibility: Most parts of Jenkins can be extended and modified, and it's easy to create new Jenkins plugins. This allows you to customize Jenkins to your needs.

Distributed builds: Jenkins can distribute build/test loads to multiple computers with different operating systems. Building software for OS X, Linux, and Windows? No problem.

Check out the part 2 for Installation.


Like the below page to get update

Tuesday, 17 May 2016

Lookup Stage behaviour

Today, I am gonna ask you a question, What value I will get from lookup when my datatype is integer (Not Null) and there is no match b/w source and reference data???

Generally, we say, NULL as there is no match b/w source and reference. But that's not true.
So let's see how the DataStage and Lookup behave :-)
When Source and Reference are NULLable -
-       If there is no match b/ source and reference, we will get NULL in output 

When Source and Reference are Not-NULLable -
-       If there is no match b/ source and reference, we will get DataStage Defaults for that datatype.
        such as - 0 for integer and empty string or '' for varchar when data is going out from lookup stage.

So, Be careful when you are planning to filter the data outside lookup stage based on referenced columns value as field in output file is not null, transformer stage don't receive a null (because it comes with the default value 0) and can't handle it as you expec.

Hoping, this will add one pointer in your learning. Let me know your thoughts in comment section.

Like the below page to get update

Thursday, 3 March 2016

Data Warehouse Approaches #1

It has been said there are as many ways to build data warehouses as there are companies to build them. Each data warehouse is unique because it must adapt to the needs of business users in different functional areas, whose companies face different business conditions and competitive pressures.
Nonetheless, three major approaches to building a data warehousing environment exist. These approaches are generally referred to as:

1. Top-down
2. Bottom-up
3. Hybrid

Although we have been building data warehouses since the early 1990s, there is still a great deal of confusion about the similarities and differences among these architectures. This is especially true of the "top-down" and "bottom-up" approaches, which have existed the longest and occupy the polar ends of the development spectrum.
As a result, some organizations fail to adopt a clear vision for the way the data warehousing environment can and should evolve. Others, paralysed by confusion or fear of deviating from prescribed tenets for success, cling too rigidly to one approach or another, undermining their ability to respond flexibly to new or unexpected situations. Ideally, organizations need to borrow concepts and tactics from each approach to create environments that uniquely meets their needs.

Semantic and Substantive Differences The two most influential approaches are championed by industry heavyweights Bill Inmon and Ralph Kimball, both prolific authors and consultants in the data warehousing field.
Inmon, who is credited with coining the term "data warehousing" in the early 1990s, advocates a top-down approach, in which companies first build a data warehouse followed by data marts.
Kimball’s approach, on the other hand, is often called bottom-up because it starts and ends with data marts, negating the need for a physical data warehouse altogether.

On the surface, there is considerable friction between top-down and bottom-up approaches. But in reality, the differences are not as stark as they may appear. Both approaches advocate building a robust enterprise architecture that adapts easily to changing business needs and delivers a single version of the truth. In some cases, the differences are more semantic than substantive in nature. For example, both approaches collect data from source systems into a single data store, from which data marts are populated. But while "top-down" subscribers call this a data warehouse, "bottom-up" adherents often call this a "staging area."

Nonetheless, significant differences exist between the two approaches, Data warehousing professionals need to understand the substantial, subtle, and semantic differences among the approaches and which industry "gurus" or consultants advocate each approach. This will provide a clearer understanding of the different routes to achieve data warehousing success and how to translate between the advice and rhetoric of the different approaches.

Like the below page to get update

Saturday, 27 February 2016

Audit Strategy in ETL #2

First - Audit Strategy in ETL #1

As shown in the above diagram, the audit services are dependent on source systems and target systems for capturing audit data; unlike error handling services are dependent on ETL process to capture error data. Audit services are created as shared services that plug in to each ETL process.

Within the proposed Next Gen BI environment there are 4 stages (Initial Staging, Clean Staging, EDW, and Data Marts) where data is landed after extracting from source systems.
There will be at least one ETL job flow process at each of these stages extracting and loading data. Audit services are called at each stage by the ETL job flow process and the audit data is captured and reported.

Some of the key audit data that will be captured by the audit services is given below :

Field # Field Description
1 Data Server Name A name that identifies a database service.  This is called a "server" by some products.  Oracle refers to it by the name "instance" or "database".
2 Database Name
3 Data Table Name A name that identifies a Target Data Table within a Database.
4 Source Name A name that identifies the file, table or other data structure that is the origin of the data to be loaded into the target table.
5 ETL Process Name DataStage Sequencer
6 ETL Sub Process Name DataStage Job
7 New Rows in Source Count The number or rows in the source table that have been added since the prior run date.  This is the number that the process attempted to add to the target table.  (If the source is a file, this field should be NULL.)
8 Changed Rows in Source Count The number of rows in the source table which were modified since the prior run date.  This is the number that the process attempted to modify in the target. (If the source is a file, this field will be NULL.)
9 Processed Rows in Source Count The total number of rows in the source table or file that the process attempted to load to the target.  (If the source is a table, this is the sum of the New_Rows_in_Source_Cnt and Changed_Rows_in_Source_Cnt.)
10 Starting Row Count The number of rows in a table at the beginning of a Table Load.
11 Ending Row Count The number of rows in a table at the conclusion of a Table Load.
12 Insert Row Count The number of relational database table rows that are processed in an INSERT operation.
13 Update Row Count The number of relational database table rows that are processed in an UPDATE operation.
14 DW_CREATE_DTTM                            Date/time at which audit data was captured

Like the below page to get update