Tuesday, 28 February 2017

ETL Performance Tuning: Identification of Performance Bottlenecks


The goal of performance tuning is to optimize session performance by eliminating performance bottlenecks. To tune a session, we identify a performance bottleneck, eliminate it, and then identify the next one, repeating until we are satisfied with the session performance. The Test Load option can be used to run sessions while tuning session performance.
The most common performance bottleneck occurs when the ETL Server writes to a target database. We can identify performance bottlenecks by the following methods:



Running test sessions. We can configure a test session to read from a flat file source, or to write to a flat file target (or any stage that can hold the data without writing it onward), to isolate source and target bottlenecks.

Studying performance details. We can create a set of information called performance details to identify session bottlenecks. Performance details provide information such as buffer input and output efficiency. The Collect Performance Data option in the session properties enables the session to generate counters of input and output rows through each transformation.

Monitoring system performance. System monitoring tools can be used to view percent CPU usage, I/O waits, and paging to identify system bottlenecks; a few quick checks are shown below.
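
For example, on a Unix-like system a few standard commands give a quick first look while the session runs (exact flags vary by platform):

vmstat 5            # CPU usage, run queue, and paging activity
iostat -x 5         # per-device I/O utilization and wait times
sar -u 5 5          # CPU utilization sampled over time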


Once the location of a performance bottleneck is determined, we can eliminate the bottleneck by following these guidelines:

Eliminate source and target database bottlenecks. 
Optimize the query, increase the database network packet size, or configure index and key constraints.

Eliminate mapping bottlenecks. 
Fine-tune the pipeline logic and the transformation settings and options in mappings to eliminate mapping bottlenecks.

Eliminate session bottlenecks. 
Optimize the session strategy; performance details can be used to help tune the session configuration.

Eliminate system bottlenecks. 
Have the system administrator analyze information from system monitoring tools and improve CPU and network performance.


Once all the bottlenecks above are eliminated, session performance can be optimized further by increasing the number of pipeline partitions in the session. Adding partitions can improve performance by utilizing more of the system hardware while processing the session.





Thursday, 23 February 2017

Shell script to access DataStage Director ETL job logs



We have various data warehouses hosted on AIX/Linux/Unix operating systems. DataStage Director is one of the ETL monitoring and scheduling tools used in numerous data warehouses. When an ETL job fails, we need to log in to DataStage Director and check the log for error messages. ETL job logs can span several pages filled with various types of informative messages, and usually we need to locate the error message under the 'Fatal' message type. Searching the DataStage log for it manually can be very time-consuming and tiring.




This shell script lets the user read those error messages from the comfort of the Linux screen, and it can also email the filtered log messages to the user's mailbox. The script performs the following steps; a minimal sketch follows the list.


- Accept the job name and other parameters at script execution.
- Establish the proper environment settings for local DataStage use.
- Locate the event IDs for fatal errors in the ETL job log.
- Extract the detail log for each fatal error.
- Mail the filtered job log, with the exact fatal error message, to the user.
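
As an illustration, here is a minimal sketch of the approach using the dsjob command-line interface. Project, job, and path names are placeholders, it assumes the event id is the first column of the -logsum output, and the real script adds validation and formatting:

#!/bin/sh
# Sketch only: scan a job's log for FATAL events and mail the details.

PROJECT=$1      # DataStage project name
JOB=$2          # job whose log we want to scan
MAILTO=$3       # recipient of the filtered log

# Environment settings for local DataStage use (adjust DSHOME per install)
DSHOME=${DSHOME:-/opt/IBM/InformationServer/Server/DSEngine}
. $DSHOME/dsenv
PATH=$PATH:$DSHOME/bin

LOGFILE=/tmp/${JOB}_fatal.log
: > $LOGFILE

# Locate event ids of FATAL entries, then pull the detail for each one
for EVENT in $(dsjob -logsum -type FATAL $PROJECT $JOB | awk '{print $1}')
do
    dsjob -logdetail $PROJECT $JOB $EVENT >> $LOGFILE
done

# Mail the filtered job log to the user (only if anything was found)
[ -s $LOGFILE ] && mailx -s "DataStage fatal errors: $JOB" $MAILTO < $LOGFILE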





Tuesday, 21 February 2017

PDF Split & Merge with Perl


Sharing two Perl utility scripts: one extracts only a few pages from a PDF file, and the other combines different PDF files into a single PDF. There are many situations where we need only a few pages from a big PDF file, so instead of storing or carrying the whole file we can extract just the page or pages we require. We can also combine different PDF files into a single file using the second script. For example, if we have extracted pages 10-15 and pages 100-120 from a big PDF using pdf_extract.pl, we can combine those two PDFs (one containing pages 10-15 and one containing pages 100-120) into a single PDF using pdf_merge.pl.


NOTE : These two Perl scripts use a Perl module called PDF::API2. If it is not present on your system as part of the Perl installation, you can download it from www.cpan.org and install it. Please see the installation section for more details.


These two scripts can be used on Windows, Unix, or Linux. They are currently tested on Windows with ActivePerl 5.8.8, but they should work on Unix and Linux as well. For the pdf_extract.pl script to work on Unix and Linux, please change the variable called "path_separator" to "/" instead of "\\"; this variable can be seen at the start of the script. pdf_merge.pl can be used on Windows and Unix/Linux without any modification.


Usage:

1) pdf_extract.pl

     perl pdf_extract.pl -i <input pdf file> -p <page or page range which needs to be extracted>

        where
        -i : Please give the full path to the input PDF file
        -p : Page Number or Page range which needs to be extracted from the input PDF

        example : To extract pages 3 to 5, execute

           perl pdf_extract.pl -i /tmp/abc.pdf -p 3-5

        example : To extract only page 3, execute

        perl pdf_extract.pl -i /tmp/abc.pdf -p 3


Executing with the -h option will display the usage on the screen

Example : perl ./pdf_extract.pl -h
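
The core of the extraction logic comes down to a few PDF::API2 calls. A simplified sketch of the idea (not the full script, which also handles argument parsing and the path_separator variable mentioned above):

#!/usr/bin/perl
# Sketch only: copy a page range from one PDF into a new PDF.
use strict;
use warnings;
use PDF::API2;

my ($in, $first, $last) = ('/tmp/abc.pdf', 3, 5);   # sample values

my $old = PDF::API2->open($in);     # source document
my $new = PDF::API2->new();         # target for the extracted pages

for my $p ($first .. $last) {
    $new->import_page($old, $p, 0); # 0 = append at the end
}
$new->saveas('/tmp/abc_3-5.pdf');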


2) pdf_merge.pl


perl pdf_merge.pl <output pdf file with full path> <input pdf file 1> <input pdf file 2> etc

Execute the script with all the PDF files that need to be merged.
The script merges them in the same order in which they are given in the input.

i.e., if you execute pdf_merge.pl /tmp/out.pdf /tmp/abc.pdf /tmp/xyz.pdf

then the pages from xyz.pdf will come after the pages from abc.pdf

Executing with the -h option will display the usage on the screen

Example : ./pdf_merge.pl -h
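
The merge logic is similarly small. A simplified sketch of the idea behind pdf_merge.pl (again omitting argument checks and the -h usage output):

#!/usr/bin/perl
# Sketch only: append every page of each input PDF to one output PDF.
use strict;
use warnings;
use PDF::API2;

my ($out, @inputs) = @ARGV;          # output file first, then the inputs
my $merged = PDF::API2->new();

for my $file (@inputs) {             # keep the order given on the command line
    my $src = PDF::API2->open($file);
    for my $p (1 .. $src->pages()) { # pages() returns the page count
        $merged->import_page($src, $p, 0);  # 0 = append at the end
    }
}
$merged->saveas($out);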


CodeBase:

README File:
pdf_extract.pl
pdf_merge.pl








Thursday, 9 February 2017

get Queue Depths

Monday, 6 February 2017

Implementing Slowly Changing Dimension in ETL


Dimension is a term in data management and data warehousing that refers to logical groupings of data such as geographical location, customer information, or product information. Slowly Changing Dimensions (SCD) are dimensions that have data that slowly changes.

For example, you may have a dimension in your database that tracks the sales records of your salespersons in different pharmacies. Creating sales reports seems simple enough, until a salesperson is transferred from one pharmacy to another. How do you record such a change in your sales dimension?

When we need to track change, it is unacceptable to put everything into the fact table or make every dimension time-dependent to deal with these changes. We would quickly talk ourselves back into a full-blown normalized structure, with the consequential loss of understandability and query performance. Instead, we take advantage of the fact that most dimensions are nearly constant over time. We can preserve the independent dimensional structure with only relatively minor adjustments to contend with the changes. We refer to these nearly constant dimensions as slowly changing dimensions. Since Ralph Kimball first introduced the notion of slowly changing dimensions in 1994, some IT professionals, in a never-ending quest to speak in acronyms, have termed them SCDs.

For each attribute in our dimension tables, we must specify a strategy to handle change. In other words, when an attribute value changes in the operational world, how will we respond to the change in our dimensional models? In the following sections we'll describe three basic techniques for dealing with attribute changes, along with a hybrid approach. You may decide that you need to employ a combination of these techniques within a single dimension table.

Type 1: Overwrite the Value
With the type 1 response, we merely overwrite the old attribute value in the dimension row, replacing it with the current value. In so doing, the attribute always reflects the most recent assignment.

Type 2: Add a Dimension Row
A type 2 response is the predominant technique for maintaining historical data when it comes to slowly changing dimensions. A type 2 SCD is a dimension where a new row is created when the value of an attribute changes.
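
Using the earlier salesperson example, with illustrative surrogate keys and column names: when salesperson 101 transfers from the Downtown pharmacy to the Riverside pharmacy, a type 2 dimension keeps both versions as separate rows, and new facts join to the new surrogate key:

SKey    SalespersonID    Pharmacy
1001    101              Downtown     (historical row)
2005    101              Riverside    (current row)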

Type 3: Add a Dimension Column
While the type 2 response partitions history, it does not allow us to associate the new attribute value with old fact history, or vice versa. A type 3 SCD is a dimension where an alternate "old" column is created when an attribute changes.

Type 6: Hybrid Slowly Changing Dimension Techniques
The Type 6 method is one that combines the approaches of types 1, 2 and 3 (1 + 2 + 3 = 6). One possible explanation of the origin of the term was that it was coined by Ralph Kimball during a conversation with Stephen Pace from Kalido but has also been referred to by Tom Haughey. It is not frequently used because it has the potential to complicate end user access, but has some advantages over the other approaches especially when techniques are employed to mitigate the downstream complexity.
The approach is to use a type 1 slowly changing dimension, but to add an additional pair of date columns to indicate the date range over which a particular row in the dimension applies, plus a flag to indicate whether the record is the current one.
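
Continuing the same illustration with hypothetical column names, the transfer would be recorded as:

SKey    SalespersonID    Pharmacy     EffDate       EndDate       CurrentFlag
1001    101              Downtown     2015-01-01    2017-02-06    N
2005    101              Riverside    2017-02-06    9999-12-31    Y

The date pair bounds the period each row applies to, and the flag marks the current record.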

The aim is to move data from the source application system to the Reporting Warehouse for the analysis and reporting needs of a worldwide pharmaceutical major. The code here is responsible for the movement of data between the Operational Data Store (ODS) and the data warehouse.

Implementation Flow:







Friday, 3 February 2017

Learning Matplotlib #2

Thursday, 2 February 2017

Plotting in Python - Learning Matplotlib #1