Showing posts with label Strategy. Show all posts
Showing posts with label Strategy. Show all posts

Monday, 8 August 2016

ETL Strategy #2


Continued......ETL Strategy #1
 

Delta from Sources extracted by Timestamp
This project will use the Timestamp to capture the deltas for most of the Operational Database sources where a date/time value can be used. ETL process will extract data from the operational data stores based on date/time value column like Update_dt during processing of the delta records, and then populate it into the Initial Staging area. The flow chart shown below shows step by step flow.



As shown in the flow chart above. It is shown in two parts, one for initial load and the other for delta processing.

Ref #              Step Description
1    Insert record into control tables manually or using scripts for each ETL process. This is done only once when a table gets loaded for the first time in data warehouse 
2    Set the extract date to desired Initial load date on the control table. This is the timestamp which the ETL process will use to go against the source system.
3    Run ETL batch process which will read the control tables for extract timestamp.
4    Extract all data from source system greater than the set extract timestamp on the control table.
5    Check if the load completed successfully or failed with errors.
6    If the load failed with errors, then the error handling service is called.
7    If the load completed successfully then the load flag is set to successful.
8    The max timestamp of  the ETL load is obtained
9    A new record is inserted to the control structure with the timestamp obtained in the above step.
10    The process continues to pull the delta records with the subsequent runs.
   


Delta from Sources extracted by comparison
Where a transaction Date or Timestamp is not available, a process will compare the new and current version of a source to generate its delta. This strategy is mostly used for files as source of data. This is manageable for small to medium size files that are used in this project and should be avoided with larger source file. A transaction code (I=Insert; U=Update; D=Delete) will have to be generated so that the rest of the ETL stream can recognise the type of transaction and process it.
Files are pushed into ETL server or they are pulled from the FTP servers to ETL server. If the files contain delta records, then the files are uploaded directly to the Data warehouse. If the file is a full extract file, then the file comparison delta process will be used to identify the changed records before uploading to the Data warehouse.
         

E10 Validate Source Data transferred via FTP
Input:    Source Data File and Source Control File.
Output:    NONE.
Dependency: Availability of other systems files.
Functions:
•    Validate if the number of records in the Source File is the same number as the one contained in the Source Control File.  This will guarantee that the right number of records has been transferred from Source to Target.







Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Tuesday, 19 July 2016

ETL Strategy #1



The ETL Load will be designed to extract data from all the required Data Sources. The data to feed the Next Gen BI Database will have to be brought from the sources with specific Extraction, Transformation & Load (ETL) processes. It is essential to define a good ETL strategy to ensure that the execution of these tasks will support appropriate data volumes and Design Should be Scalable, and Maintenance free.


Initial Load Strategy

The Initial Load process is there to support the requirement to include historical data that can’t be included through the regular Delta Refresh.
For the Next Gen BI project, it is expected to have a full extract available for the required sources to prepare an Initial Load prior to the regular Delta Refresh. This Initial extraction of the data will then flow through the regular transformation and load process.
As discussed in the control process, Control tables will be used to initiate the first iteration of the Initial ETL process, a list of source table name with extraction dates will be loaded in the control tables. ETL process can be kicked off through the scheduler and the ETL process will read the control tables and process the full extract.
The rest of the process for an Initial load is same as the delta refresh. As shown in the flow chart under the section (Delta from Sources extracted by Timestamp), The only difference is the loading of the control tables to start the process for the first time when a table gets loaded in the Data Warehouse.


Delta Refresh or CDC Strategy

The Delta refresh process will apply only the appropriate transactions to the Data Warehouse. The result is a greatly reduced volume of information to be processed and applied. Also the Delta transactions for the Warehouse can be reused as input for the different Data mart, since they will be part of the Staging area and already processed. As discussed in the Control Process Strategy, control tables will be used to control delta refresh process.


Continued......ETL Strategy #2



Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Saturday, 27 February 2016

Audit Strategy in ETL #2


First - Audit Strategy in ETL #1




As shown in the above diagram, the audit services are dependent on source systems and target systems for capturing audit data; unlike error handling services are dependent on ETL process to capture error data. Audit services are created as shared services that plug in to each ETL process.

Within the proposed Next Gen BI environment there are 4 stages (Initial Staging, Clean Staging, EDW, and Data Marts) where data is landed after extracting from source systems.
There will be at least one ETL job flow process at each of these stages extracting and loading data. Audit services are called at each stage by the ETL job flow process and the audit data is captured and reported.


Some of the key audit data that will be captured by the audit services is given below :


Field # Field Description
1 Data Server Name A name that identifies a database service.  This is called a "server" by some products.  Oracle refers to it by the name "instance" or "database".
2 Database Name
3 Data Table Name A name that identifies a Target Data Table within a Database.
4 Source Name A name that identifies the file, table or other data structure that is the origin of the data to be loaded into the target table.
5 ETL Process Name DataStage Sequencer
6 ETL Sub Process Name DataStage Job
7 New Rows in Source Count The number or rows in the source table that have been added since the prior run date.  This is the number that the process attempted to add to the target table.  (If the source is a file, this field should be NULL.)
8 Changed Rows in Source Count The number of rows in the source table which were modified since the prior run date.  This is the number that the process attempted to modify in the target. (If the source is a file, this field will be NULL.)
9 Processed Rows in Source Count The total number of rows in the source table or file that the process attempted to load to the target.  (If the source is a table, this is the sum of the New_Rows_in_Source_Cnt and Changed_Rows_in_Source_Cnt.)
10 Starting Row Count The number of rows in a table at the beginning of a Table Load.
11 Ending Row Count The number of rows in a table at the conclusion of a Table Load.
12 Insert Row Count The number of relational database table rows that are processed in an INSERT operation.
13 Update Row Count The number of relational database table rows that are processed in an UPDATE operation.
14 DW_CREATE_DTTM                            Date/time at which audit data was captured






Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Friday, 26 February 2016

Audit Strategy in ETL #1



The Audit service will provide simple measures on the ETL process: number of records input/output, number of rejected records, number of records loaded, Starting and Ending Row Count in the Target tables, New Rows in the source Table/Files, Inserted Rows and Updated Rows.

These measures are captured during the batch load process and stored in the audit tables for load audit checks and reporting purposes. Each process of the ETL should produce these key statistics by calling the audit services at the beginning and ending of a load process. All Audit tables will be stored in the staging repository. Exception reports can be produced for action to be taken by the business users and production support personnel.





Ref #           Step Description
1 Represents all the source systems from which data is extracted. Audit services will collect all the source audit data during the ETL process from source systems
2 Represents the ETL process which extracts data from Source systems and loads the data to target systems. Mostly audit data is not captured at the ETL process level.
3 Represents the target systems where the data gets loaded after the completion of the ETL process. Audit services will collect all the target audit data at the end of the ETL process from Target systems.
4 Audit Tables are internal tables where all the audit data related to source and target systems are loaded. Entry is made in this table every time an ETL batch process runs. These tables reside internally in the staging area. Audit reports can be generated using these tables.
5 Once the Audit data is captured. Audit check routines can be used to detect any discrepancy in the source or the target systems. This will be an automated process which will run at the completion of each ETL job based on the predefined rules. 
6 As Mentioned in the previous step, These audit rules can be set to fail the job and stop the batch process or the batch process can continue if the rules did not fail.
7 Once the Audit check routines are complete for any job. Audit flag will be set to specify a success or failure on the audit check.
8 If the audit check routine does not indicate a failure flag, next batch process continues to run.
9 If the audit check routine indicates a failure, then the batch process is aborted.
10 When the batch process is aborted, the control passes to the alert system where appropriate action is taken by sending email notifications and pager messages.
11 Audit reports can be generated which will be useful to audit the ETL process and as well the source systems.


Continue in next part -  Audit Strategy in ETL #2





Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/