Thursday, 31 March 2016

echo the 'echo'




"echo" command is the very basic command of Linux/Unix. Most of us known this command to print whatever is being passed to it as argument. So let's revise once more if we are missing something here.


"echo" command prints its argument by single space.
[atul@localhost ~]$ echo Atul learns commands 
Atul learns commands

Printing the arguments separated by a single space means echo collapses any extra whitespace between its arguments.
[atul@localhost ~]$ echo Atul          learns                         commands 
Atul learns commands

If you want to preserve the spaces between them, make these three arguments into one by quoting them.
[atul@localhost ~]$ echo "Atul     learns   commands" 
Atul     learns   commands



One more example:
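Here is an illustrative run (not from the original post) that shows both behaviours at once: the spacing inside the quoted argument is preserved, while the gap between the two arguments collapses to a single space.

[atul@localhost ~]$ echo "Atul     learns     Linux"          commands
Atul     learns     Linux commands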



** Need to remember -
* echo prints its arguments separated by a single space, no matter whether they are separated by one space or many on the command line.
* If you want to keep multiple spaces in an argument string, make it a single argument by quoting it.





Monday, 28 March 2016

10 Scenario-based Interview Questions #1 - DataStage


1. Design a job which converts a single source row into three target rows.
2. Design a job which identifies duplicate rows in the input.
3. Design a job which fetches the header and footer of the input file.
4. Design a job which segregates unique and duplicate records into different files.
5. Design a job which removes the header from the input file.

For more ->  DataStage Scenario
6. Design a job which removes the footer from the input file.
7. Design a job which sends a mail if the footer is not present in the input file.
8. Design a job which extracts the alternate records from the input file.
9. Design a job which extracts the Nth row from the input file.
10. Design a job which extracts data from two input files and loads them alternately into the target.


For more ->  DataStage Scenario

Sunday, 27 March 2016

Data Warehouse Glossary #3


Drill Through:
Data analysis that goes from an OLAP cube into the relational database.

Data Warehousing:
The process of designing, building, and maintaining a data warehouse system.

Conformed Dimension:
A dimension that has exactly the same meaning and content when referred to from different fact tables.

Central Warehouse
A database created from operational extracts that adheres to a single, consistent, enterprise data model to ensure consistency of decision-support data across the corporation. A style of computing where all the information systems are located and managed from a single physical location.

Change Data Capture
The process of capturing changes made to a production data source. Change data capture is typically performed by reading the source DBMS log. It consolidates units of work, ensures data is synchronized with the original source, and reduces data volume in a data warehousing environment.

Classic Data Warehouse Development
The process of building an enterprise business model, creating a system data model, defining and designing a data warehouse architecture, constructing the physical database, and lastly populating the warehouse database.

Data Access Tools
An end-user oriented tool that allows users to build SQL queries by pointing and clicking on a list of tables and fields in the data warehouse.

Data Analysis and Presentation Tools

Software that provides a logical view of data in a warehouse. Some create simple aliases for table and column names; others create data that identify the contents and location of data in the warehouse.



Data Dictionary
A database about data and database structures. A catalog of all data elements, containing their names, structures, and information about their usage. A central location for metadata. Normally, data dictionaries are designed to store a limited set of available metadata, concentrating on the information relating to the data elements, databases, files and programs of implemented systems.

Data Warehouse Architecture
An integrated set of products that enable the extraction and transformation of operational data to be loaded into a database for end-user analysis and reporting.

Data Warehouse Architecture Development
A SOFTWARE AG service program that provides an architecture for a data warehouse that is aligned with the needs of the business. This program identifies and designs a warehouse implementation increment and ensures the required infrastructure, skill sets, and other data warehouse foundational aspects are in place for a Data Warehouse Incremental Delivery.

Data Warehouse Engines
Relational databases (RDBMS) and multi-dimensional databases (MDBMS). Data warehouse engines require strong query capabilities, fast load mechanisms, and large storage capacity.

Data Warehouse Incremental Delivery
A SOFTWARE AG program that delivers one data warehouse increment from design review through implementation.

Data Warehouse Infrastructure
A combination of technologies and the interaction of technologies that support a data warehousing environment.

Data Warehouse Management Tools
Software that extracts and transforms data from operational systems and loads it into the data warehouse.

Data Warehouse Network
An industry organization for know-how exchange. SOFTWARE AG was the first vendor member of the Data Warehouse Network.

Functional Data Warehouse
A warehouse that draws data from nearby operational systems. Each functional warehouse serves a distinct and separate group (such as a division), functional area (such as manufacturing), geographic unit, or product marketing group.

OLTP
On-Line Transaction Processing. OLTP describes the requirements for a system that is used in an operational environment.

Scalability
The ability to scale to support larger or smaller volumes of data and more or fewer users. The ability to increase or decrease size or capability in cost-effective increments with minimal impact on the unit cost of business and the procurement of additional services.

Schema
The logical and physical definition of data elements, physical characteristics and inter-relationships.

Slice and Dice
A term used to describe a complex data analysis function provided by MDBMS tools.

Warehouse Business Directory
Provides business professionals access to the data warehouse by browsing a catalog of contents.

Warehouse Technical Directory
Defines and manages an information life cycle, a definition of warehouse construction, change management, impact analysis, distribution and operation of a warehouse.

Transformers
Rules applied to change data.



Wednesday, 23 March 2016

5 Tips For Better DataStage Design #11




  • When writing intermediate results that will only be shared between parallel jobs, always write to persistent data sets (using Data Set stages). You should ensure that the data is partitioned, and that the partitions, and sort order, are retained at every stage. Avoid format conversion or serial I/O.
  • Data Set stages should be used to create restart points in the event that a job or sequence needs to be rerun. But, because data sets are platform and configuration specific, they should not be used for long-term backup and recovery of source data.
  • Depending on available system resources, it might be possible to optimize overall processing time at run time by allowing smaller jobs to run concurrently. However, care must be taken to plan for scenarios when source files arrive later than expected, or need to be reprocessed in the event of a failure.
  • Parallel configuration files allow the degree of parallelism and resources used by parallel jobs to be set dynamically at run time. Multiple configuration files should be used to optimize overall throughput and to match job characteristics to available hardware resources in development, test, and production modes (a sample configuration file is sketched after this list).
  • The proper configuration of scratch and resource disks and the underlying file system and physical hardware architecture can significantly affect overall job performance.
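A minimal parallel configuration file might look like the sketch below. This is an illustrative two-node example only; the node names, fastname, and the resource/scratch disk paths are placeholders and must match your own environment.

{
    node "node1"
    {
        fastname "etlserver"
        pools ""
        resource disk "/data/ds/resource" {pools ""}
        resource scratchdisk "/data/ds/scratch" {pools ""}
    }
    node "node2"
    {
        fastname "etlserver"
        pools ""
        resource disk "/data/ds/resource" {pools ""}
        resource scratchdisk "/data/ds/scratch" {pools ""}
    }
}

Pointing APT_CONFIG_FILE at a file with more or fewer nodes changes the degree of parallelism without modifying the job design.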







Tuesday, 22 March 2016

Transformer Stage alternative - A Quick DataStage Recipe



What to use instead of the "Transformer" stage

Copy Stage

Use "Copy" stage instead of "Transformer" Stage for following:
Renaming columns
Dropping columns
Default type conversions
Job design placeholder between stages


Modify Stage

Use "Modify" stage
Non default type conversions
Null handling
Character string trimming



Filter Stage

Use "Filter" Stage
Using constraints on output data


Will add more.......







Monday, 21 March 2016

Data Warehouse Glossary #2


Dimension:
A variable, perspective or general category of information that is used to organize and analyze information in a multi-dimensional data cube.

Drill Down:
The ability of a data-mining tool to move down into increasing levels of detail in a data mart, data warehouse or multi-dimensional data cube.

Drill Up:
The ability of a data-mining tool to move back up into higher levels of data in a data mart, data warehouse or multi-dimensional data cube.

Executive Information System (EIS):
A type of decision support system designed for executive management that reports summary level information as opposed to greater detail derived in a decision support system.

Extraction, Transformation and Loading (ETL) Tool:
Software that is used to extract data from a data source such as an operational system or data warehouse, modify the data and then load it into a data mart, data warehouse or multi-dimensional data cube.

Granularity:
The level of detail in a data store or report.

Hierarchy:
The organization of data, e.g. a dimension, into an outline or logical tree structure.  The strata of a hierarchy are referred to as levels.  The individual elements within a level are referred to as categories.  The next lower level in a hierarchy is the child; the next higher level containing the children is their parent.

Legacy System:
Older systems developed on platforms that tend to be one or more generations behind the current state-of-the-art applications.  Data marts and warehouses were developed in large part due to the difficulty in extracting data from these systems and the inconsistencies and incompatibilities among them.

Level:
A tier or strata in a dimensional hierarchy. Each lower level represents an increasing degree of detail.  Levels in a location dimension might include country, region, state, county, city, zip code, etc.

Measure:
A quantifiable variable or value stored in a multi-dimensional OLAP cube.  It is a value in the cell at the intersection of two or more dimensions.

Member:
One of the data points for a level of a dimension.

Meta Data:
Information in a data mart or warehouse that describes the tables, fields, data types, attributes and other objects in the data warehouse and how they map to their data sources.  Meta data is contained in database catalogs and data dictionaries.

Multi-Dimensional Online Analytical Processing (MOLAP):
Software that creates and analyzes multi-dimensional cubes to store its information.



Non-Volatile Data:
Data that is static or that does not change.  In transaction processing systems the data is updated on a continual regular basis.  In a data warehouse the database is added to or appended, but the existing data seldom changes.

Normalization:
The process of eliminating duplicate information in a database by creating a separate table that stores the redundant information.  For example, it would be highly inefficient to re-enter the address of an insurance company with every claim.  Instead, the database uses a key field to link the claims table to the address table.  Operational or transaction processing systems are typically “normalized”.  On the other hand, some data warehouses find it advantageous to de-normalize the data allowing for some degree of redundancy.

Online Analytical Processing (OLAP):
The process employed by multi-dimensional analysis software to analyze the data resident in data cubes.  There are different types of OLAP systems named for the type of database employed to create them and the data structures produced.

Open Database Connectivity (ODBC):
A database standard developed by Microsoft and the SQL Access Group Consortium that defines the “rules” for accessing or retrieving data from a database.

Relational Database Management System:
Database management systems that have the ability to link tables of data through a common or key field.  Most databases today use relational technologies and support a standard programming language called Structured Query Language (SQL).

Relational Online Analytical Processing (ROLAP):
OLAP software that employs a relational strategy to organize and store the data in its database.

Replication:
The process of copying data from one database table to another.

Scalable:
The attribute or capability of a database to significantly expand the number of records that it can manage.  It also refers to hardware systems and their ability to be expanded or upgraded to increase their processing speed and handle larger volumes of data.

Structured Query Language (SQL):
A standard programming language used by contemporary relational database management systems.

Synchronization:
The process by which the data in two or more separate databases are synchronized so that the records contain the same information.  If the fields and records are updated in one database, the same fields and records are updated in the other.

Dimensional Model: 
A type of data modeling suited for data warehousing. In a dimensional model, there are two types of tables: dimension tables and fact tables. Dimension tables record information on each dimension, and fact tables record all the "facts", or measures.

Dimensional Table: 
A dimension table stores records related to one particular dimension. No facts are stored in a dimension table.

Drill Across:
Data analysis across dimensions.

Friday, 18 March 2016

Data Warehouse Glossary #1



Ad Hoc Query:

A database search that is designed to extract specific information from a database.  It is ad hoc if it is designed at the point of execution as opposed to being a “canned” report.  Most ad hoc query software uses the structured query language (SQL).

Aggregation:

The process of summarizing or combining data.

Catalog:

A component of a data dictionary that describes and organizes the various aspects of a database such as its folders, dimensions, measures, prompts, functions, queries and other database objects.  It is used to create queries, reports, analyses and cubes.

Cross Tab:

A type of multi-dimensional report that displays values or measures in cells created by the intersection of two or more dimensions in a table format.

Dashboard:

A data visualization method and workflow management tool that brings together useful information on a series of screens and/or web pages.  Some of the information that may be contained on a dashboard includes reports, web links, calendars, news, tasks, e-mail, etc.  When incorporated into a DSS or EIS, key performance indicators may be represented as graphics that are linked to various hyperlinks, graphs, tables and other reports.  The dashboard draws its information from multiple sources: applications, office products, databases, the Internet, etc.

Cube:

A multi-dimensional matrix of data that has multiple dimensions (independent variables) and measures (dependent variables) that are created by an Online Analytical Processing System (OLAP).  Each dimension may be organized into a hierarchy with multiple levels.  The intersection of two or more dimensional categories is referred to as a cell.


Data-based Knowledge:

Factual information used in the decision-making process that is derived from data marts or warehouses using business intelligence tools.  Data warehousing organizes information into a format so that it represents an organization's knowledge with respect to a particular subject area, e.g. finance or clinical outcomes.

Data Cleansing:

The process of cleaning or removing errors, redundancies and inconsistencies in the data that is being imported into a data mart or data warehouse.  It is part of the quality assurance process.

Data Mart:

A database that is similar in structure to a data warehouse, but is typically smaller and is focused on a more limited area.  Multiple, integrated data marts are sometimes referred to as an Integrated Data Warehouse.  Data marts may be used in place of a larger data warehouse or in conjunction with it.  They are typically less expensive to develop and faster to deploy and are therefore becoming more popular with smaller organizations.

Data Migration:

The transfer of data from one platform to another.  This may include conversion from one language, file structure and/or operating environment to another.

Data Mining:

The process of researching data marts and data warehouses to detect specific patterns in the data sets.  Data mining may be performed on databases and multi-dimensional data cubes with ad hoc query tools and OLAP software.  The queries and reports are typically designed to answer specific questions to uncover trends or hidden relationships in the data.

Data Scrubbing:

See Data Cleansing


Data Transformation:

The modification of transaction data extracted from one or more data sources before it is loaded into the data mart or warehouse.  The modifications may include data cleansing, translation of data into a common format so that it can be aggregated and compared, summarizing the data, etc.

Data Warehouse:

An integrated, non-volatile database of historical information that is designed around specific content areas and is used to answer questions regarding an organization's operations and environment.

Database Management System:

The software that is used to create data warehouses and data marts.  For the purposes of data warehousing, they typically include relational database management systems and multi-dimensional database management systems.  Both types of database management systems create the database structures, store and retrieve the data and include various administrative functions.

Decision Support System (DSS):

A set of queries, reports, rule-based analyses, tables and charts that are designed to aid management with their decision-making responsibilities.  These functions are typically “wrapped around” a data mart or data warehouse.  The DSS tends to employ more detailed level data than an EIS.








Tuesday, 15 March 2016

yet another Shell script to access DataStage job logs



Introduction


We have various data warehouses hosted on AIX/Linux/Unix operating systems. DataStage Director is one of the ETL monitoring and scheduling tools used in numerous data warehouses. In case of ETL job failures, we need to log in to DataStage Director and check the log for error messages. ETL job logs can span several pages with various types of informative messages, and usually we need to locate the error message under the 'Fatal' message type. Doing this by searching the DataStage log manually can be very time-consuming and tiring.

This shell script lets the user access the error messages from the ease of the AIX/Linux command line. The script also has a facility to email the filtered log messages to the user's mailbox.


Tool Logic 


Accept the job name and other parameters at script execution.
Establish the proper environment settings for local DataStage use.
Locate the event ids for fatal errors in the ETL job log.
Extract the detail log for those fatal errors.
Mail the filtered job log with the exact fatal error message to the user (a sketch of this logic is shown below).


Usage 

getDSjob
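The original script is not reproduced here, so below is a minimal sketch of the same logic, assuming the standard dsjob command-line client and a default engine location. The script name getDSjob, its parameters (project, job name, mail recipient) and all paths are illustrative assumptions, not the author's actual implementation.

#!/bin/sh
# getDSjob - sketch only; adjust DSHOME, paths and recipient for your environment
# Usage: getDSjob <project> <jobname> <mail_to>

PROJECT=$1
JOB=$2
MAILTO=$3

# 1. Establish the DataStage environment (default engine path assumed)
DSHOME=${DSHOME:-/opt/IBM/InformationServer/Server/DSEngine}
. $DSHOME/dsenv
PATH=$PATH:$DSHOME/bin

OUT=/tmp/${JOB}_fatal.log
> $OUT

# 2. Locate the event ids of FATAL entries, then pull the detail for each one
dsjob -logsum -type FATAL -max 50 $PROJECT $JOB |
while read EVENTID REST
do
    # skip any summary/status lines that are not numeric event ids
    case $EVENTID in
        *[!0-9]*|'') continue ;;
    esac
    dsjob -logdetail $PROJECT $JOB $EVENTID >> $OUT
done

# 3. Mail the filtered log to the user
mailx -s "DataStage FATAL log for job $JOB" $MAILTO < $OUT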








Monday, 14 March 2016

Machine Learning - Learning Steps



Step 1
INTRODUCTION TO MACHINE LEARNING
Overview of Machine Learning
Supervised vs. Unsupervised Learning
Classification vs. Regression
Real-world Applications

Step 2
WORKING WITH REAL-WORLD DATA
Cleaning and mining real-world data
Data pre-processing
Exploratory data analysis and visualisation

Step 3
BUILDING YOUR FIRST CLASSIFICATION MODEL
The K Nearest Neighbour (KNN) algorithm
Reporting performance metrics
Decision boundary visualisation



Step 4
VALIDATION AND OPTIMISATION
Validation techniques
The bias-variance trade-off
Hyper-parameter tuning, grid search and model selection

Step 5
RANDOM FORESTS
Decision Trees
Ensemble models
Random Forests
Extremely Randomised Trees

Step 6
PRACTICE
Build and optimise a classifier on new real-world data

Step 7
NEURAL NETWORKS
Biological inspiration and architecture
Network topologies
Learning algorithms and cost functions

Step 8
DEEP LEARNING
Motivation and architecture
Real-world examples
Impact and limitations of Deep Learning






Machine Learning Links you must Visit



1. Scikit-Learn Tutorial Series - http://buff.ly/1XnDWv6
2. 7 Free Machine Learning Courses - http://buff.ly/1XoBapa 
3. k-nearest neighbor algorithm using Python - http://buff.ly/1SLyUZX
4. 7 Steps to Mastering Machine Learning With Python - http://buff.ly/1SLyZwR


Analytics in Python 

1. Learning Pandas #1 - Series - http://bit.ly/2hAlZ0u
2. Learning Pandas #2 - DataFrame  -  http://bit.ly/2ii6Mlu
3. Learning Pandas #3 - Working on Summary & Missing Data - http://bit.ly/2iUROTB
4. Learning Pandas #4 - Hierarchical Indexing - http://bit.ly/2i8AMx9







DataScience Links you must Visit



1. Python (and R) for Data Science - sample code, libraries, projects, tutorials - http://buff.ly/22bwhXi
2. 19 Worst Mistakes at Data Science Job Interviews - http://buff.ly/1XnJ3eJ 















Sunday, 13 March 2016

Capture Unmatched records from Join Stage - A Quick DataStage Recipe



How to capture unmatched records from the "Join" stage, which does not provide reject handling for unmatched records (for example, with an Inner Join).



Solution:

Use an Outer Join operation. All rows on the outer link are output regardless of whether there is a match on the key values.


How to:

Set the join type to Outer in the "Join" stage.
When a match does not occur, the Join stage inserts NULL values into the unmatched columns.
A Filter stage can then be used to test for NULL values in those columns and capture the unmatched records.





Friday, 11 March 2016

Python Points #12 - execute OS commands

Thursday, 10 March 2016

Python Points #11 - Set

Tuesday, 8 March 2016

Python - Get line no with print statement


Sometimes, when we are working on code that has a lot of print statements, it is difficult to tell which print statement produced a given line of output.

Printing the line number helps a lot while debugging, and it helped me create a better Python learning script that generates output tagged with the code line number, which is easy to relate back to the code.




Code:
from inspect import currentframe

def lno():
    # grab the current frame and step one frame back to the caller,
    # so f_lineno is the line number of the print statement itself
    cf = currentframe()
    val = str(cf.f_back.f_lineno) + ". "
    return val

print "this is Me", lno()
print lno(), "Hi! there"

Output:

this is Me  8.
9.  Hi! there


** This adds a little overhead, as the function is called with every print statement, so use it wisely :-)






Monday, 7 March 2016

Data Warehouse Approaches #2



Top-down approach(Inmon)

The top-down approach views the data warehouse from the top of the entire analytic environment.

The data warehouse holds atomic or transaction data that is extracted from one or more source systems and integrated within a normalized, enterprise data model. From there, the data is summarized, dimensionalized, and distributed to one or more “dependent” data marts. These data marts are “dependent” because they derive all their data from a centralized data warehouse.

Sometimes, organizations supplement the data warehouse with a staging area to collect and store source system data before it can be moved and integrated within the data warehouse. A separate staging area is particularly useful if there are numerous source systems, large volumes of data, or small batch windows with which to extract data from source systems.


Pros/Cons 

The major benefit of a “top-down” approach is that it provides an integrated, flexible architecture to support downstream analytic data structures.
First, this means the data warehouse provides a departure point for all data marts, enforcing consistency and standardization so that organizations can achieve a single version of the truth. Second, the atomic data in the warehouse lets organizations re-purpose that data in any number of ways to meet new and unexpected business needs.

For example, a data warehouse can be used to create rich data sets for statisticians, deliver operational reports, or support operational data stores (ODS) and analytic applications. Moreover, users can query the data warehouse if they need cross-functional or enterprise views of the data.

On the downside, a top-down approach may take longer and cost more to deploy than other approaches, especially in the initial increments. This is because organizations must create a reasonably detailed enterprise data model as well as the physical infrastructure to house the staging area, data warehouse, and the data marts before deploying their applications or reports. (Of course, depending on the size of an implementation, organizations can deploy all three “tiers” within a single database.) This initial delay may cause some groups with their own IT budgets to build their own analytic applications. Also, it may not be intuitive or seamless for end users to drill through from a data mart to a data warehouse to find the details behind the summary data in their reports.


Bottom-up approach(Kimball)

In a bottom-up approach, the goal is to deliver business value by deploying dimensional data marts as quickly as possible. Unlike the top-down approach, these data marts contain all the data — both atomic and summary — that users may want or need, now or in the future. Data is modeled in a star schema design to optimize usability and query performance. Each data mart builds on the next, reusing dimensions and facts so users can query across data marts, if desired, to obtain a single version of the truth as well as both summary and atomic data.

The “bottom-up” approach consciously tries to minimize back-office operations, preferring to focus an organization’s effort on developing dimensional designs that meet end-user requirements. The “bottom-up” staging area is non-persistent, and may simply stream flat files from source systems to data marts using the file transfer protocol. In most cases, dimensional data marts are logically stored within a single database. This approach minimizes data redundancy and makes it easier to extend existing dimensional models to accommodate new subject areas.


Pros/Cons 

The major benefit of a bottom-up approach is that it focuses on creating user-friendly, flexible data structures using dimensional, star schema models. It also delivers value rapidly because it doesn’t lay down a heavy infrastructure up front.
Without an integration infrastructure, the bottom-up approach relies on a “dimensional bus” to ensure that data marts are logically integrated and stovepipe applications are avoided. To integrate data marts logically, organizations use “conformed” dimensions and facts when building new data marts. Thus, each new data mart is integrated with others within a logical enterprise dimensional model.
Another advantage of the bottom-up approach is that since the data marts contain both summary and atomic data, users do not have to “drill through” from a data mart to another structure to obtain detailed or transaction data. The use of a staging area also eliminates redundant extracts and overhead required to move source data into the dimensional data marts.

One problem with a bottom-up approach is that it requires organizations to enforce the use of standard dimensions and facts to ensure integration and deliver a single version of the truth. When data marts are logically arrayed within a single physical database, this integration is easily done. But in a distributed, decentralized organization, it may be too much to ask departments and business units to adhere and reuse references and rules for calculating facts. There can be a tendency for organizations to create “independent” or non-integrated data marts.

In addition, dimensional marts are designed to optimize queries, not support batch or transaction processing. Thus, organizations that use a bottom-up approach need to create additional data structures outside of the bottom-up architecture to accommodate data mining, ODSs, and operational reporting requirements. However, this may be achieved simply by pulling a subset of data from a data mart at night when users are not active on the system.








Sunday, 6 March 2016

Python SyntaxError - Non-ASCII character '\xe2' in file


If you get the below error while running your Python code -

SyntaxError: Non-ASCII character '\xe2' in file .\set_learn.py on line 32, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

The quickest fix is to declare the file encoding as PEP 263 suggests (add # -*- coding: utf-8 -*- as the first line of the script). If instead you want to remove the offending non-ASCII characters and you are using Notepad++, here is how to resolve it -

1. By converting the Text Encoding

Go to Menu -> Encoding -> Convert to UTF-8

and save the file.


2. By searching for the \xe2 value and replacing it with empty

Open the Replace dialog (Ctrl-H)
Find [\xe2]+
or find [^\x00-\x7F]+ to delete all non-ASCII characters
Select the search mode as Regular Expression
Leave "Replace with" empty and click Replace All


3. In Linux

a. Find the lines which contain bad characters -
grep -nP "[\x80-\xFF]" INPUT_FILE


b. Some ways to remove them
sed 's/[^[:print:]]//g' INPUT_FILE > clean-file
sed 's/[\x80-\xff]//g' INPUT_FILE > clean-file
tr -cd '\11\12\15\40-\176' < INPUT_FILE > clean-file

** A word of caution - these commands may remove some characters which you need in the file, as we are matching by character range, so take a backup of your file first.






Thursday, 3 March 2016

Data Warehouse Approaches #1



It has been said there are as many ways to build data warehouses as there are companies to build them. Each data warehouse is unique because it must adapt to the needs of business users in different functional areas, whose companies face different business conditions and competitive pressures.
Nonetheless, three major approaches to building a data warehousing environment exist. These approaches are generally referred to as:

1. Top-down
2. Bottom-up
3. Hybrid

Although we have been building data warehouses since the early 1990s, there is still a great deal of confusion about the similarities and differences among these architectures. This is especially true of the "top-down" and "bottom-up" approaches, which have existed the longest and occupy the polar ends of the development spectrum.
As a result, some organizations fail to adopt a clear vision for the way the data warehousing environment can and should evolve. Others, paralysed by confusion or fear of deviating from prescribed tenets for success, cling too rigidly to one approach or another, undermining their ability to respond flexibly to new or unexpected situations. Ideally, organizations need to borrow concepts and tactics from each approach to create environments that uniquely meet their needs.



Semantic and Substantive Differences

The two most influential approaches are championed by industry heavyweights Bill Inmon and Ralph Kimball, both prolific authors and consultants in the data warehousing field.
Inmon, who is credited with coining the term "data warehousing" in the early 1990s, advocates a top-down approach, in which companies first build a data warehouse followed by data marts.
Kimball’s approach, on the other hand, is often called bottom-up because it starts and ends with data marts, negating the need for a physical data warehouse altogether.

On the surface, there is considerable friction between top-down and bottom-up approaches. But in reality, the differences are not as stark as they may appear. Both approaches advocate building a robust enterprise architecture that adapts easily to changing business needs and delivers a single version of the truth. In some cases, the differences are more semantic than substantive in nature. For example, both approaches collect data from source systems into a single data store, from which data marts are populated. But while "top-down" subscribers call this a data warehouse, "bottom-up" adherents often call this a "staging area."

Nonetheless, significant differences exist between the two approaches. Data warehousing professionals need to understand the substantial, subtle, and semantic differences among the approaches and which industry "gurus" or consultants advocate each approach. This will provide a clearer understanding of the different routes to achieve data warehousing success and how to translate between the advice and rhetoric of the different approaches.






Wednesday, 2 March 2016

5 Tips For Better DataStage Design #10



1. Establish baselines (especially with I/O), use copy with no output
2. Avoid the use of only one flow for tuning/performance testing.  Prototyping can be a powerful tool.
3. Work in increments...change 1 thing at a time.
4. Evaluate data skew:  repartition to balance the data flow
5. Isolate and Solve - determine which stage is causing a problem.

6. Distribute file systems (if possible) to eliminate bottlenecks
7. Do NOT involve the RDBMS in initial testing.
8. Understand and evaluate the tuning knobs available






Tuesday, 1 March 2016

ETL Development Standards



These development standards provide consistency in the artifacts the ETL developers create.  This consistency improves testing, operational support, maintenance, and performance.  

Code Development Guidelines


1 Commenting in Code and Objects

As a primary guideline, where it is possible and does not interfere with the operation of the applications, all code must contain a developer comment/note.  
All ETL jobs must have a proper annotation (short description of the functionality of the job).
The target output files (.csv files) should not contain any leading or trailing spaces.
While deciding the record-level delimiter, the "delimiter collision" issue needs to be considered: no character that can appear as part of the data should be used as the field delimiter default. For example (illustrative), if a name field may contain "Singh, Atul", a comma-delimited record would split that value into two fields, so a delimiter such as a pipe (|) that never occurs in the data is a safer choice.

2 ETL Naming Standards

The standardized naming conventions ease the burden on developers switching from one project to another.  Knowing the names and where things are located is very useful to understand before the design and development phases begin.

The following tables identify DataStage elements and their standard naming conventions.


2.1 Job and Properties Naming Conventions

GUI Component Entity Convention
Designer Parallel Job <<Application>>_<<job_Name>>_JOB
Designer Sequence  <<Application>>_<<job_Name>>_SEQ
Designer Server Job  <<Application>>_<<job_Name>>_SVR
Designer Parameter  <<Application>>_<<job_Name>>_PARM

2.2 Job Processing Stage Naming Conventions

GUI Component Entity Convention
Designer Aggregator  AGG_<<PrimaryFunction>>
Designer Copy  CP_<<PrimaryFunction>>
Designer Filter  FLT_<<PrimaryFunction>>
Designer Funnel  FNL_<<PrimaryFunction>>
Designer Join (Inner)  JN_<<PrimaryFunction>>
Designer FTP Enterprise FTP_<<PrimaryFunction>>
Designer Lookup  LKP_<< Value Name or table Name>>
Designer Merge  MRG_<<PrimaryFunction>>
Designer Modify  MOD_<<PrimaryFunction>>
Designer Sort  SRT_<<PrimaryFunction>>

2.3 Links Naming Conventions

GUI Component Entity Convention
Designer Reference (Lookup)  Lnk_Ref_<<Number or Additional descriptor, if needed to form a unique object name>>
Designer Reject (Lookup, File, DB)  Lnk_Rej_<<Number or Additional descriptor, if needed to form a unique object name>>
Designer Input  Lnk_In_<<Number or Additional descriptor, if needed to form a unique object name>>
Designer Output  Lnk_Out_<<Number or Additional descriptor, if needed to form a unique object name>>
Designer Delete  Lnk_Del_<<Number or Additional descriptor, if needed to form a unique object name>>
Designer Insert  Lnk_Ins_<<Number or Additional descriptor, if needed to form a unique object name>>
Designer Update  Lnk_Upd_<<Number or Additional descriptor, if needed to form a unique object name>>

2.4 Data Store Naming Conventions:

In the case of a data store, the class word refers to the type of data store (e.g. Dataset, Sequential File, Table, View, and so forth).

GUI Component Entity Convention
Designer Database  DB_<<DatabaseName>>
Designer Table  TBL_<<TableName>>
Designer View  VIEW_<<ViewName>>
Designer Dimension  DM_<<TableName>>
Designer Fact  TRAN_<<TableName>>
Designer Source  SRC_<<Table or ObjectName>>
Designer Target  TRGT_<<Table or ObjectName>>

2.5 File Stage Naming Conventions:

GUI Component Entity Convention
Designer Sequential File  SEQ_
Designer Complex Flat File  CFF_
Designer Parallel dataset  DS_






