My e-Notes about Cloud, K8s, OpenShift, DataScience, Machine Learning, Python, Data Analytics, DataStage, DWH and ETL Concepts


Wednesday, 1 July 2020

How to start with any ETL tool?

Lots of my readers/friends/colleague  asked me this same question again and again so I thought of sharing my experience how I have started with DataStage? What topics I have choose to start... blah...blah...blah....

I am not so expert to teach or preach you but sharing some steps which can help you to easy start with any ETL tool, These steps can be differ from person to person and their understanding or Knowledge :-)

ETL -- Extract, Transform and Load 

Step 1: Learn Basics of SQLs
        This step is very crucial, You should have understanding of RDBMS concept and hands on SQL basics, so that, you can correlate with ETL functionality. You should be aware with the fact that there is nothing which can be done in ETL but not in SQL one way or other.

Step 2: Understanding of ETL tool architecture  
        This is the step where we got confuse and start learning or studying the Installation of tool and Administration. Actually, we are trying to be Sachin Tendulkar (a cricket reference) without having any knowledge of Batting :-) . So don't do that.
ETL Architecture means - understanding of tools type and its functionality, how they communicate with each other, how the request flows from one tool client to another. Try to get basics, If you feel little uncomfortable with it, no worries, everyone feels the same way when they start learning how to walk.

Step 3: Understanding of Tool functionality
        We will now try to go deep down, basically, with every ETL tool, there are 3 tools/clients available (can be more) - Admin Client, Designing Client, Monitoring Client. Pick the last 2 first - Designing and Monitoring if want to be an ETL developer and pick first and last if going for admin. You can find huge database on ETL tools helping docs over internet.
I am not going to discuss Admin or Monitoring but Designer client as this is the place where you are going to fight most of your battle. 

Designer tool - Learning basics of E(extract) and L(load) part is quite easy, so skipping these, we should put our focus on T(transformation) part, like - What are the stages/functions provided by the tool to manipulate the data? Can we relate them with SQL functions? 
        Try to design very simple design/jobs which is simply extracting the data from DB or flat file, doing some basic manipulations/transformation on data and loading that data to a flat file. Don't try to design a very complex or smart job in first go but if you think you can, go ahead, who am I to hold you.  

Step 4: How to design a job/flow 
        A lot of preaching till now, now moving to action. Next, what I am going to share may or may not work for you, this is just an idea which might be useful. I am directly jumping on designing part. 
    As soon as, I get the requirement, I tend to analyze what are my sources/targets and what transformations are required, after than I pseudo-code them on a paper or in mind ;-). Next, start with the same on ETL designer tool, add source and target without adding any transformation in between and voila!! you just developed your first job. 
    Now, relate the SQL functions with ETL tool component/stage and with the ask. Suppose, you have to get the max salary from each department to get the highest paid employee, you have to think, how you have to achieve the same in SQL (need to work with group by clause on department and use max function on salary) and now same you have to perform on ETL, check which component/stage provide you the same functionality and which kind of input data it is asking for. Now, it's time to align the input data as expected by transformation component/stage and link with the output. You need to continue the same till you get your desired result, it might be possible, it might be possible, in starting, you feel fuzzy or confused but as you design, you grow your knowledge in that area. 

    As I told, there is nothing in ETL which can't be done in SQLs. So relate the tools functionality with SQL term, this will help you to remember the stage functions. If I can say about DataStage, this is basic functionality provide by stage.

SQL DataStage
Group By Aggregator
Where Filter
union all Funnel
join join, lookup, merge
surrogate key surrogate key generator
dummy column column generator
top rows head
botton rows tail
order by sort
case stmt switch

Please let me know what do you think in comment sections. 
Till then....Happy Learning !!! 

No comments:

Post a Comment


The postings on this site are my own and don't necessarily represent IBM's or other companies positions, strategies or opinions. All content provided on this blog is for informational purposes and knowledge sharing only.
The owner of this blog makes no representations as to the accuracy or completeness of any information on this site or found by following any link on this site. The owner will not be liable for any errors or omissions in this information nor for the availability of this information. The owner will not be liable for any losses, injuries, or damages from the display or use of his information.