My e-Notes about DataScience, Machine Learning, Python, Data Analytics, DataStage, DWH and ETL Concepts


Thursday, 14 April 2016

ETL Code Review Checklist

Design jobs for restartability; if a job is not designed to be restartable, document the reason.
Do not follow a Sequential File stage with "Same" partitioning.
Check that APT_CONFIG_FILE is exposed as a job parameter; this is required to change the number of nodes at runtime (a sample configuration file is sketched after this checklist).
Do not hard-code parameters.
Do not hard-code directory paths; pass them as job parameters (see the path example after this checklist).
Do not use fork-joins to generate lookup data sets.
Use "Hash" aggregation for limited distinct key values.  Outputs after all rows are read.
Use "Sort" aggregation for large number of distinct key values.  Data must be pre-sorted.  Outputs after each aggregation group.
Use multiple aggregators to reduce collection time when aggregating all rows.  Define a constant key column using row generator.  First aggregator sums in parallel.  Second aggregator sums sequentially.
Make sure sequences are not too long; break them up into logical units of work.
Is error handling done properly? It is preferred to propagate errors from lower-level jobs to the highest level (e.g., a sequence).
What is the volume of the extracted data? Is there a WHERE clause in the SQL to limit it?
Are the correct scripts invoked to clean up datasets after the job completes?
Is there a reject process in place?
Can jobs be combined to reduce the total number of jobs, or split to reduce complexity?
Avoid increasing the number of nodes when a job has too many stages; doing so increases the number of processes spawned.
Is volume and growth information available for the Lookup/Join tables?
Check whether any query uses SELECT *. This is not advised; list only the required columns in the statement (see the sample query after this checklist).
Check the partitioning and sorting at each stage.
When a sequence is used, make sure none of the parameters passed are left blank.
Check that there are separate jobs for at least extract, transform, and load.
Check that each stage and the job itself are annotated, and that the job properties have the author, date, etc. filled in.
Check the naming conventions of jobs, stages, and links.
Avoid Peek stages in production jobs; they are generally used only for debugging during development.
Make sure the developer has not suppressed valid warnings.
Verify that the jobs conform to the Flat File and Dataset naming specification.  This is especially important for cleaning up files and logging errors appropriately.
Verify that all fields are written to the Reject flat files.  This is necessary for debugging and reconciliation.
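
For the APT_CONFIG_FILE check, a minimal two-node configuration file is sketched below, assuming a single server; the host name "etl_host" and the resource/scratch directories are placeholders and must match your environment.

    {
      node "node1"
      {
        fastname "etl_host"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
      }
      node "node2"
      {
        fastname "etl_host"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
      }
    }

Passing the configuration file name through a job parameter lets the same job run with a different node count at runtime without recompiling.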
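
For the hard-coding checks, the contrast below uses a hypothetical job parameter named pLandingDir (the #...# syntax is how DataStage references job parameters in stage properties); the literal path is illustrative only.

    Hard-coded:     /prod/etl/landing/customer_extract.dat
    Parameterized:  #pLandingDir#/customer_extract.dat

With the parameterized form, the same job can be promoted from development to test to production by changing only the parameter value, not the job design.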
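
For the extract-volume and SELECT * checks, a sketch of the preferred query shape is shown below; the table name, column names, and date filter are illustrative only.

    -- List only the columns the job actually uses,
    -- and limit the extract volume with a WHERE clause.
    SELECT customer_id,
           customer_name,
           last_updated_ts
    FROM   customer
    WHERE  last_updated_ts >= '2016-04-01'

An explicit column list insulates the job from table changes and reduces the data moved, while the WHERE predicate (ideally driven by a job parameter) keeps the extract volume predictable.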
