My e-Notes about DataScience, Machine Learning, Python, Data Analytics, DataStage, DWH and ETL Concepts


Monday, 8 August 2016

ETL Strategy #2

Continued......ETL Strategy #1

Delta from Sources extracted by Timestamp
This project will use the Timestamp to capture the deltas for most of the Operational Database sources where a date/time value can be used. ETL process will extract data from the operational data stores based on date/time value column like Update_dt during processing of the delta records, and then populate it into the Initial Staging area. The flow chart shown below shows step by step flow.

As shown in the flow chart above. It is shown in two parts, one for initial load and the other for delta processing.

Ref #              Step Description
1    Insert record into control tables manually or using scripts for each ETL process. This is done only once when a table gets loaded for the first time in data warehouse 
2    Set the extract date to desired Initial load date on the control table. This is the timestamp which the ETL process will use to go against the source system.
3    Run ETL batch process which will read the control tables for extract timestamp.
4    Extract all data from source system greater than the set extract timestamp on the control table.
5    Check if the load completed successfully or failed with errors.
6    If the load failed with errors, then the error handling service is called.
7    If the load completed successfully then the load flag is set to successful.
8    The max timestamp of  the ETL load is obtained
9    A new record is inserted to the control structure with the timestamp obtained in the above step.
10    The process continues to pull the delta records with the subsequent runs.

Delta from Sources extracted by comparison
Where a transaction Date or Timestamp is not available, a process will compare the new and current version of a source to generate its delta. This strategy is mostly used for files as source of data. This is manageable for small to medium size files that are used in this project and should be avoided with larger source file. A transaction code (I=Insert; U=Update; D=Delete) will have to be generated so that the rest of the ETL stream can recognise the type of transaction and process it.
Files are pushed into ETL server or they are pulled from the FTP servers to ETL server. If the files contain delta records, then the files are uploaded directly to the Data warehouse. If the file is a full extract file, then the file comparison delta process will be used to identify the changed records before uploading to the Data warehouse.

E10 Validate Source Data transferred via FTP
Input:    Source Data File and Source Control File.
Output:    NONE.
Dependency: Availability of other systems files.
•    Validate if the number of records in the Source File is the same number as the one contained in the Source Control File.  This will guarantee that the right number of records has been transferred from Source to Target.

Like the below page to get update

No comments:

Post a comment


The postings on this site are my own and don't necessarily represent IBM's or other companies positions, strategies or opinions. All content provided on this blog is for informational purposes and knowledge sharing only.
The owner of this blog makes no representations as to the accuracy or completeness of any information on this site or found by following any link on this site. The owner will not be liable for any errors or omissions in this information nor for the availability of this information. The owner will not be liable for any losses, injuries, or damages from the display or use of his information.