Tuesday, 21 February 2017

PDF Split & Merge with Perl


Sharing two Perl utility scripts: one for extracting a few pages from a PDF file, and one for combining several PDF files into a single PDF. There are many situations where we need only a few pages from a big PDF, so instead of storing or carrying the whole file we can extract just the page or pages we need from the original. We can then combine different PDF files into a single PDF using the second script. For example, if we have extracted pages 10-15 and pages 100-120 from a big PDF using pdf_extract.pl, we can combine these two PDFs (the one containing pages 10-15 and the one containing
pages 100-120) into a single PDF using pdf_merge.pl.


NOTE: These two Perl scripts use a Perl module called PDF::API2. If it is not already present as part of your Perl installation, you can download it from www.cpan.org and install it. Please see the installation section for more details.


These two scripts can be used on Windows, Unix, or Linux. They are currently tested on Windows with ActivePerl 5.8.8, but they should work on Unix and Linux as well. For pdf_extract.pl to work on Unix or Linux, change the variable "path_separator" from "\\" to "/"; this variable is defined at the top of the script. pdf_merge.pl can be used on both Windows and Unix/Linux without any modification.


Usage:

1) pdf_extract.pl

     perl pdf_extract.pl -i <input pdf file> -p <page or page range which needs to be extracted>

        where
        -i : Please give the full path to the input PDF file
        -p : Page Number or Page range which needs to be extracted from the input PDF

        example : To extract pages 3 to 5, execute

           perl pdf_extract.pl -i /tmp/abc.pdf -p 3-5

        example : To extract only page 3, execute

        perl pdf_extract.pl -i /tmp/abc.pdf -p 3


Executing with the -h option will display the usage on the screen

Example : perl ./pdf_extract.pl -h
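
For reference, the core of the extraction can be sketched with PDF::API2 roughly as below. This is only a minimal sketch and not the full pdf_extract.pl (the real script also handles argument parsing, the path_separator setting and validation); the file names and the page range used here are just placeholders.

#!/usr/bin/perl
use strict;
use warnings;
use PDF::API2;

# Placeholder input/output files and page range
my $in_file  = '/tmp/abc.pdf';
my $out_file = '/tmp/abc_3-5.pdf';
my ($first, $last) = (3, 5);

my $src = PDF::API2->open($in_file);   # existing PDF
my $out = PDF::API2->new();            # new, empty PDF

# Copy the requested pages into the new document, keeping their order
for my $page_no ($first .. $last) {
    $out->import_page($src, $page_no, 0);   # 0 = append as the last page
}

$out->saveas($out_file);
$src->end();
$out->end();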


2) pdf_merge.pl


perl pdf_merge.pl <output pdf file with full path> <input pdf file 1> <input pdf file 2> etc

Execute the script with all the PDF files that need to be merged.
The script will merge them in the same order as they are given in the input

i.e. if you execute pdf_merge.pl /tmp/out.pdf /tmp/abc.pdf /tmp/xyz.pdf

then the pages from xyz.pdf will come after the pages from abc.pdf

Executing with the -h option will display the usage on the screen

Example : ./pdf_merge.pl -h
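
The merge logic can be sketched in the same way. Again, this is only a minimal sketch of the idea behind pdf_merge.pl, with placeholder file names; the real script takes the output file and the input files from the command line.

#!/usr/bin/perl
use strict;
use warnings;
use PDF::API2;

# Placeholder output file followed by the inputs, in the order they should appear
my $out_file = '/tmp/out.pdf';
my @in_files = ('/tmp/abc.pdf', '/tmp/xyz.pdf');

my $out = PDF::API2->new();

for my $file (@in_files) {
    my $src   = PDF::API2->open($file);
    my $count = $src->pages();                    # number of pages in this input
    $out->import_page($src, $_, 0) for 1 .. $count;
    $src->end();
}

$out->saveas($out_file);
$out->end();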


CodeBase:

README File:
pdf_extract.pl
pdf_merge.pl








Thursday, 9 February 2017

get Queue Depths

Monday, 6 February 2017

Implementing Slowly Changing Dimension in ETL


In data management and data warehousing, a dimension is a logical grouping of data such as geographical location, customer information, or product information. Slowly Changing Dimensions (SCDs) are dimensions whose data changes slowly.

For example, you may have a Dimension in your database that tracks the sales records of your salespersons in different pharmacies. Creating sales reports seems simple enough, until a salesperson is transferred from one pharmacy to another. How do you record such a change in your sales Dimension?

When we need to track change, it is unacceptable to put everything into the fact table or make every dimension time-dependent to deal with these changes. We would quickly talk ourselves back into a full-blown normalized structure with the consequential loss of understandability and query performance. Instead, we take advantage of the fact that most dimensions are nearly constant over time. We can preserve the independent dimensional structure with only relatively minor adjustments to contend with the changes. We refer to these nearly constant dimensions as slowly changing dimensions. Since Ralph Kimball first introduced the notion of slowly changing dimensions in 1994, some IT professionals, in a never-ending quest to speak in acronyms, have termed them SCDs.

For each attribute in our dimension tables, we must specify a strategy to handle change. In other words, when an attribute value changes in the operational world, how will we respond to the change in our dimensional models? In the following section we'll describe three basic techniques for dealing with attribute changes, along with a couple hybrid approaches. You may decide that you need to employ a combination of these techniques within a single dimension table.

Type 1: Overwrite the Value
With the type 1 response, we merely overwrite the old attribute value in the dimension row, replacing it with the current value. In so doing, the attribute always reflects the most recent assignment.
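
As a small illustration only (this is not the ETL code referred to later in the post), a Type 1 change is just an overwrite. The DBI data source, credentials and the dim_salesperson table below are made up for the example.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Made-up DSN, credentials and table, for illustration only
my $dbh = DBI->connect('dbi:ODBC:DWH', 'etl_user', 'secret', { RaiseError => 1 });

# Type 1: overwrite the attribute in place; the old value is lost
$dbh->do(q{
    UPDATE dim_salesperson
       SET pharmacy       = ?
     WHERE salesperson_id = ?
}, undef, 'Pharmacy B', 'SP042');

$dbh->disconnect;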

Type 2: Add a Dimension Row
A type 2 response is the predominant technique for maintaining historical data when it comes to slowly changing dimensions. A type 2 SCD is a dimension where a new row is created whenever the value of an attribute changes.
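
A minimal sketch of a Type 2 change is shown below, again with a made-up dim_salesperson table and connection details; the dimension's surrogate key is assumed to be generated by the database (for example an identity column), and start_date, end_date and current_flag are assumed housekeeping columns.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:ODBC:DWH', 'etl_user', 'secret',
                       { RaiseError => 1, AutoCommit => 0 });

my ($salesperson_id, $new_pharmacy) = ('SP042', 'Pharmacy B');

# Close off the row that was current until now
$dbh->do(q{
    UPDATE dim_salesperson
       SET current_flag = 'N',
           end_date     = CURRENT_DATE
     WHERE salesperson_id = ?
       AND current_flag   = 'Y'
}, undef, $salesperson_id);

# Add a new row carrying the changed attribute; the surrogate key is
# assumed to be generated by the database (identity/sequence)
$dbh->do(q{
    INSERT INTO dim_salesperson
           (salesperson_id, pharmacy, start_date, end_date, current_flag)
    VALUES (?, ?, CURRENT_DATE, NULL, 'Y')
}, undef, $salesperson_id, $new_pharmacy);

$dbh->commit;
$dbh->disconnect;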

Type 3: Add a Dimension Column
While the type 2 response partitions history, it does not allow us to associate the new attribute value with old fact history, or vice versa. A type 3 SCD is a dimension where an alternate column holding the old value is created when an attribute changes.
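
A sketch of the Type 3 approach with the same made-up table: the prior value is kept in an extra column before the attribute itself is overwritten. The ALTER TABLE is a one-off structural change and is shown only for completeness.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:ODBC:DWH', 'etl_user', 'secret',
                       { RaiseError => 1, AutoCommit => 0 });

# One-off structural change: a column to hold the prior value
# (normally done once, outside the regular ETL run)
$dbh->do(q{ ALTER TABLE dim_salesperson ADD previous_pharmacy VARCHAR(50) });

# On change: keep the current value in the "previous" column, then overwrite
$dbh->do(q{
    UPDATE dim_salesperson
       SET previous_pharmacy = pharmacy,
           pharmacy          = ?
     WHERE salesperson_id    = ?
}, undef, 'Pharmacy B', 'SP042');

$dbh->commit;
$dbh->disconnect;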

Type 6: Hybrid Slowly Changing Dimension Techniques
The Type 6 method combines the approaches of types 1, 2 and 3 (1 + 2 + 3 = 6). One possible explanation of the origin of the term is that it was coined by Ralph Kimball during a conversation with Stephen Pace from Kalido; it has also been referred to by Tom Haughey. It is not frequently used because it can complicate end-user access, but it has some advantages over the other approaches, especially when techniques are employed to mitigate the downstream complexity.
The approach is to use a Type 1 slowly changing dimension, but to add an additional pair of date columns to indicate the date range over which a particular row applies, and a flag to indicate whether the record is the current record.
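
One common way to realise the 1 + 2 + 3 combination is sketched below, again with made-up column names: each row keeps the value as it was at the time (historical_pharmacy), a current_pharmacy column is overwritten on every row in Type 1 style, and the date pair plus current_flag mark which row applies when.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:ODBC:DWH', 'etl_user', 'secret',
                       { RaiseError => 1, AutoCommit => 0 });

my ($salesperson_id, $new_pharmacy) = ('SP042', 'Pharmacy C');

# Type 2 part: close off the row that was current until now
$dbh->do(q{
    UPDATE dim_salesperson
       SET current_flag = 'N',
           end_date     = CURRENT_DATE
     WHERE salesperson_id = ?
       AND current_flag   = 'Y'
}, undef, $salesperson_id);

# Type 2 + Type 3 part: add a new row; historical_pharmacy keeps the value
# as it stands for this version of the member
$dbh->do(q{
    INSERT INTO dim_salesperson
           (salesperson_id, historical_pharmacy, current_pharmacy,
            start_date, end_date, current_flag)
    VALUES (?, ?, ?, CURRENT_DATE, NULL, 'Y')
}, undef, $salesperson_id, $new_pharmacy, $new_pharmacy);

# Type 1 part: overwrite current_pharmacy on ALL rows for this member,
# so historical rows also show the latest assignment
$dbh->do(q{
    UPDATE dim_salesperson
       SET current_pharmacy = ?
     WHERE salesperson_id   = ?
}, undef, $new_pharmacy, $salesperson_id);

$dbh->commit;
$dbh->disconnect;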

The aim is to move data from the source application system to the Reporting Warehouse for the analysis and reporting needs of a worldwide pharmaceutical major. The code here is responsible for the movement of data between the Operational Data Store (ODS) and the data warehouse.

Implementation Flow:







Friday, 3 February 2017

Learning Matplotlib #2

Thursday, 2 February 2017

Plotting in Python - Learning Matplotlib #1