ETL Code Review Checklist

by Atul Singh on April 13, 2016 in Checklist, Datastage, designer, develop, Environment, ETL, Parameter, Partitioning, Review, Variable

Guideline
Design jobs for restartability. If a job is not restartable, what is the reason?
Do not follow a Sequential File stage with "Same" partitioning.
Check that the APT_CONFIG_FILE parameter has been added to the job. It is required to change the number of nodes at runtime.
Do not hard-code parameters.
Do not hard-code directory paths.
Do not use fork-joins to generate lookup data sets.
Use "Hash" aggregation when there is a limited number of distinct key values. It outputs results only after all rows are read.
Use "Sort" aggregation when there is a large number of distinct key values. The data must be pre-sorted. It outputs results after each aggregation group.
Use multiple aggregators to reduce collection time when aggregating all rows. Define a constant key column using a Row Generator. The first aggregator sums in parallel; the second aggregator sums sequentially.
Make sure sequences are not too long. Break them up into logical units of work.
Is the error handling done properly? It is preferred to propagate errors from lower-level jobs to the highest level (for example, a sequence).
What is the volume of extracted data? Is there a WHERE clause in the SQL?
Are the correct scripts invoked to clean up datasets after the job completes?
Is there a reject process in place?
Can jobs be combined to reduce the number of jobs, or split to reduce complexity?
Increasing the number of nodes is not recommended if the job has too many stages, because this increases the number of processes spun off.
Is volume and growth information available for the Lookup/Join tables?
Check whether any query uses SELECT *. This is not advised; list the required columns explicitly in the statement instead.
Check the partitioning and sorting at each stage.
When a sequence is used, make sure none of the parameters passed are left blank.
Check that there are separate jobs for at least extract, transform and load.
Check that each stage and the job itself is annotated. The job properties should have the author, date, etc. filled out.
Check the naming conventions of the jobs, stages and links.
Avoid Peek stages in production jobs. Peeks are generally used for debugging during development.
Make sure the developer has not suppressed warnings that are valid.
Verify that the jobs conform to the Flat File and Dataset naming specification. This is especially important for cleaning up files and logging errors appropriately.
Verify that all fields are written to the Reject flat files. This is necessary for debugging and reconciliation.
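
To make the aggregation guidelines above more concrete, here is a minimal sketch in plain Python. It is not DataStage code and does not use any DataStage API. The function names (hash_aggregate, sort_aggregate, two_stage_total) and the sample rows are made up for illustration only. It shows why the hash method can only output after all rows are read, why the sort method needs pre-sorted input but can output after each group, and how a two-step sum splits the work.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def hash_aggregate(rows):
    """Hash method: one running total per distinct key is held in memory.
    Nothing can be emitted until every row has been read."""
    totals = defaultdict(int)
    for key, amount in rows:
        totals[key] += amount
    return dict(totals)


def sort_aggregate(sorted_rows):
    """Sort method: the input must already be sorted by key.
    Each group is emitted as soon as its last row has been seen,
    so only one group is held in memory at a time."""
    for key, group in groupby(sorted_rows, key=itemgetter(0)):
        yield key, sum(amount for _, amount in group)


def two_stage_total(partitions):
    """Two-step total: the first step sums each partition on its own
    (this is the part an engine could run in parallel); the second
    step sums the few partial results sequentially."""
    partial_sums = [sum(amount for _, amount in part) for part in partitions]
    return sum(partial_sums)


# Hypothetical sample data: (key, amount) pairs.
rows = [("A", 10), ("B", 5), ("A", 7), ("C", 1), ("B", 3)]

hash_result = hash_aggregate(rows)
sort_result = list(sort_aggregate(sorted(rows, key=itemgetter(0))))
grand_total = two_stage_total([rows[:3], rows[3:]])

print(hash_result)
print(sort_result)
print(grand_total)
```

The trade-off is memory versus ordering: the hash approach grows with the number of distinct keys, while the sort approach keeps memory flat but pays for (or relies on) a sort upstream.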


About Atul Singh
I am a Data Consultant at a Canadian financial firm. My keen interests range from Data Analytics, ML, Kubernetes and NLP to ETL. I love to blog and travel in my spare time. If you’d like to get in touch, feel free to say hello through any of the social links.
