
Wednesday, 18 January 2017

5 Tips For Better DataStage Design #17





**  Some stages have an automap button that maps fields with the same names.

**  When you add a shared container into your job you need to map the columns of the container to your job link. What you might miss is the extra option you get on the Columns tab "Load" button. In addition to the normal column load you get "Load from Container" which is a quick way to load the container metadata into your job.

**  Don't create a job from an empty canvas. Always copy and use an existing job. Don't create shared containers from a blank canvas, always build and test a full job and then turn part of it into a container.



**  If you want to copy and paste settings between jobs, open two Designer sessions. Most property windows in DataStage are modal and you can only have one open per Designer session, so with two sessions you can have two property windows open at the same time and copy or compare them more easily.

**  You can load metadata into a stage by using the "Load" button on the column tab or by dragging and dropping a table definition from the Designer repository window onto a link in your job. For sequential file stages the drag and drop is faster as it loads both the column names and the format values in one go. If you used the load button you would need to load the column names and then the format details separately.

**  When a Modify stage or stage function misbehaves, trial and error is often the only way to work out the syntax of a function. Doing this inside a large, complex job wastes a lot of debugging time. A better approach is to keep a couple of test jobs in your project with a Row Generator, a Modify or Transformer stage, and a Peek stage, with a column of each type. Use these throughout the project as a quick way to test a function or conversion. Note that a C++ compiler must be installed for Transformer stages to compile and run correctly.





Monday, 22 August 2016

5 Tips For Better DataStage Design #15



1. Stage variables do not accept null values, so no nullable column should be mapped directly to a stage variable without null handling.

2. Avoid using the SetNull() function in stage variables, because it causes a compilation error.


3. If the input links are not already partitioned on the join key, they should be hash partitioned on the join key in the Join stage. With multiple join keys, it is recommended to partition on one key and sort by the other keys.

4. If an input link needs to be repartitioned, clear the preserve-partitioning flag in the previous stage; otherwise the job log will show a warning.

5. If the reference database table holds a small volume of data, a Lookup stage is a good choice.

6. It is advisable to avoid the Transformer stage where a simpler stage will do. The Transformer is not written in the DataStage native language; its logic is generated as C++ code, so every time you compile the job this code is compiled and embedded alongside the native code in the executable, which can degrade job performance.






Tuesday, 3 May 2016

Otherwise Constraint - A Quick DataStage Recipe


Recipe:

How to use "Otherwise" constraint in Transformer Stage



How To:

To use "Otherwise" constraint in Transformer stage, Order of link is important.
Typically link with "Otherwise" constraint should be last in Transformer stage link order
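
A loose analogy, if it helps: the Transformer's link constraints evaluate like an if/elif chain, and the "Otherwise" link acts as the final else, catching every row no earlier constraint claimed. A minimal Python sketch (the amount column and thresholds are made up for illustration):

```python
# Transformer link constraints behave like an if/elif chain; the
# "Otherwise" link is the final else, so it must come last.
def route(row):
    if row["amount"] > 1000:   # constraint on output link 1
        return "high_value"
    elif row["amount"] > 100:  # constraint on output link 2
        return "mid_value"
    else:                      # "Otherwise" link: everything left over
        return "otherwise"

print(route({"amount": 50}))   # -> otherwise
```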





Monday, 25 April 2016

Suppress Warnings - A Quick DataStage Recipe



Recipe:

How to suppress job warnings in DS job log

HowTo:

Add "rule" in Message Handler


Method:

From DS Director, open the Message Handler and add a rule,
selecting the warning message as an example of the message text.
Or
Open the job log, select the message you want to suppress,
then right-click and add it to the Message Handler.






Monday, 18 April 2016

10 Scenario based Interview Questions #2 - DataStage



11. Design a job which inserts the data if it does not exist in the target and updates it if it does.
12. Design a job which includes a header and footer in the output file.
13. Design a job which checks whether currency data is in the 'Rs. 9999.99' format or not (see the sketch after this list).
14. Design a job which validates a date passed in YYYYMMDD format (see the sketch after this list).


15. Design a job which calculates the date difference in hours and minutes (HH:MM).
16. Design a job which first deletes the data if it exists in the target and then loads.
17. Design a job which checks whether each column has a value or not.
18. Design a job which populates the date dimension table.
19. Design a job which transposes columns into rows with the help of a Transformer stage.
20. Design a job which removes duplicates without using the Remove Duplicates stage.
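
For a flavour of what questions 13 and 14 are testing, here is a minimal Python sketch of the two validations (a regex for the currency format, strptime for the date; in a real job this logic would sit in a Transformer constraint or stage function):

```python
import re
from datetime import datetime

def is_valid_currency(value: str) -> bool:
    """Check the 'Rs. 9999.99' shape: 'Rs. ', up to 4 digits, 2 decimals."""
    return re.fullmatch(r"Rs\. \d{1,4}\.\d{2}", value) is not None

def is_valid_date(value: str) -> bool:
    """Validate a YYYYMMDD string, including month/day ranges."""
    try:
        datetime.strptime(value, "%Y%m%d")
        return True
    except ValueError:
        return False

print(is_valid_currency("Rs. 9999.99"))  # True
print(is_valid_date("20160231"))         # False: 31 Feb does not exist
```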


For more ->  DataStage Scenario




Thursday, 14 April 2016

ETL Code Review Checklist




Guideline
Design jobs for restartability; if a job is not designed for restartability, document the reason.
Do not follow a Sequential File stage with "Same" partitioning.
Check that the APT_CONFIG_FILE parameter is added; it is required to change the number of nodes at runtime.
Do not hard-code parameters.
Do not hard-code directory paths.
Do not use fork-joins to generate lookup data sets.
Use "Hash" aggregation for limited distinct key values.  Outputs after all rows are read.
Use "Sort" aggregation for large number of distinct key values.  Data must be pre-sorted.  Outputs after each aggregation group.
Use multiple aggregators to reduce collection time when aggregating all rows.  Define a constant key column using row generator.  First aggregator sums in parallel.  Second aggregator sums sequentially.
Make sure sequences are not too long.  Break up into logical units of work.
Is the error handling done properly? It is prefered to propogate errors from lower jobs to the highest level( ex a sequence)
What is the volume of extract data( is there a where clause in the SQL)
Are the correct scripts to clean up datasets after job complete revoked ?
Is there a reject process in place ?
Can we combine or split so we can reduce number of jobs or complexity respectively?
It is not recommended to have an increase in the number of nodes if there are too many stages in the job( this increases the number of processes spun off)
Volume information and growth information for the Lookup/Join tables?
Check if there is a select * in any of the queries. It is not advised to have select * , instead the required columns have to be added in the statement
Check the paritioning and sorting at each stage
When a sequence is used make sure none of the parameters passed are left blank
Check if there are separate jobs for atleast extract, transform and load 
Check if there is annotation for each stage and the job, the job properties should have the author,date etc filled out
Check for naming convention of the jobs, stages and links
Try avoiding peeks in production jobs, peeks are generally used for debug in the development
Make sure the developer has not suppressed many warnings that are valid
Verify that the jobs conform to the Flat File and Dataset naming specification.  This is especially important for cleaning up files and logging errors appropriately.
Verify that all fields are written to the Reject flat files.  This is necessary for debugging and reconciliation.








Sunday, 10 April 2016

5 Tips For Better DataStage Design #12



1. Use the minimum number of Sort stages in a DataStage job. In the Sort stage, set the “Don’t Sort (Previously Sorted)” option where the data is already sorted; this improves Sort stage performance. The same hash key should be used. In the Transformer stage, “Preserve Sort Order” can be used to maintain the sort order.

2. Use the minimum number of stages in a job; too many stages hurt job performance.
If a job has too many stages, decompose it into a number of smaller jobs. Containers are the best way to improve visualization and readability. If the existing active stages occupy almost all of the CPU resources, performance can be improved by running multiple parallel copies of the same stage process, which is done using a shared container.





3. Using a minimum of stage variables in the Transformer is good practice; performance degrades as more stage variables are used.

4. Use runtime column propagation with care. Columns that are not needed in the job flow should not be propagated from one stage to another or from one job to the next; the best option is to disable RCP.

5. When columns need to be renamed or new columns added, using a Copy or Modify stage is good practice.






Monday, 4 April 2016

DataStage Scenario #16 - Cross duplicate Removal


Remove duplicate rows where the source and destination may be swapped (a sketch of the logic follows the sample output).


Input:


source   destination   distance
city1 city2 500
city2 city1 500
city3 city4 500
city4 city3 500 
city5 city7 700
city7 city5 700



Output:

source   destination   distance
city1 city2 500
city3 city4 500
city5 city7 700
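
One way to reason about the solution: build an order-independent key from the source/destination pair, then keep only the first row per key. In DataStage this key could be built in a Transformer (e.g. with the min/max of the two columns) and fed to a Remove Duplicates stage; here is the same logic as a Python sketch:

```python
# Deduplicate rows where (source, destination) may appear in either order.
rows = [
    ("city1", "city2", 500), ("city2", "city1", 500),
    ("city3", "city4", 500), ("city4", "city3", 500),
    ("city5", "city7", 700), ("city7", "city5", 700),
]

seen = set()
output = []
for src, dst, dist in rows:
    key = (min(src, dst), max(src, dst))  # order-independent key
    if key not in seen:
        seen.add(key)
        output.append((src, dst, dist))

for row in output:
    print(*row)  # city1 city2 500 / city3 city4 500 / city5 city7 700
```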






Monday, 28 March 2016

10 Scenario based Interview Questions #1 - DataStage


1. Design a job which converts a single source row into three target rows.
2. Design a job which can identify duplicate rows in the input.
3. Design a job which fetches the input file's header and footer.
4. Design a job which segregates unique and duplicate records into different files.
5. Design a job which removes the header from the input file.

6. Design a job which removes the footer from the input file.
7. Design a job which sends a mail if the footer is missing from the input file.
8. Design a job which extracts alternate records from the input file (see the sketch after this list).
9. Design a job which extracts the Nth row from the input file (see the sketch after this list).
10. Design a job which extracts data from two input files and loads them alternately into the target.
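
As a taste of the row-number tricks behind questions 8 and 9, here is a Python sketch; in DataStage the usual tool is the @INROWNUM system variable in a Transformer constraint:

```python
def alternate_records(rows):
    """Keep every other record (odd row numbers, 1-based)."""
    return [row for i, row in enumerate(rows, start=1) if i % 2 == 1]

def nth_row(rows, n):
    """Return the Nth record (1-based), or None if out of range."""
    return rows[n - 1] if 1 <= n <= len(rows) else None

data = ["r1", "r2", "r3", "r4", "r5"]
print(alternate_records(data))  # ['r1', 'r3', 'r5']
print(nth_row(data, 3))         # 'r3'
```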


For more ->  DataStage Scenario

Wednesday, 23 March 2016

5 Tips For Better DataStage Design #11




  • When writing intermediate results that will only be shared between parallel jobs, always write to persistent data sets (using Data Set stages). You should ensure that the data is partitioned, and that the partitions, and sort order, are retained at every stage. Avoid format conversion or serial I/O.
  • Data Set stages should be used to create restart points in the event that a job or sequence needs to be rerun. But, because data sets are platform and configuration specific, they should not be used for long-term backup and recovery of source data.
  • Depending on available system resources, it might be possible to optimize overall processing time at run time by allowing smaller jobs to run concurrently. However, care must be taken to plan for scenarios when source files arrive later than expected, or need to be reprocessed in the event of a failure.
  • Parallel configuration files allow the degree of parallelism and resources used by parallel jobs to be set dynamically at run time. Multiple configuration files should be used to optimize overall throughput and to match job characteristics to available hardware resources in development, test, and production modes.
  • The proper configuration of scratch and resource disks and the underlying file system and physical hardware architecture can significantly affect overall job performance.







Tuesday, 22 March 2016

Transformer Stage alternative - A Quick DataStage Recipe



What to use instead of "Transformer" Stage

Copy Stage

Use "Copy" stage instead of "Transformer" Stage for following:
Renaming columns
Dropping columns
Default type conversions
Job design placeholder between stages


Modify Stage

Use "Modify" stage
Non default type conversions
Null handling
Character string trimming



Filter Stage

Use "Filter" Stage
Using constraints on output data


Will add more.......







Wednesday, 2 March 2016

5 Tips For Better DataStage Design #10



1. Establish baselines (especially with I/O); use a Copy stage with no output.
2. Avoid the use of only one flow for tuning/performance testing.  Prototyping can be a powerful tool.
3. Work in increments...change 1 thing at a time.
4. Evaluate data skew:  repartition to balance the data flow
5. Isolate and Solve - determine which stage is causing a problem.

6. Distribute file systems (if possible) to eliminate bottlenecks.
7. Do NOT involve the RDBMS in initial testing.
8. Understand and evaluate the tuning knobs available






Tuesday, 23 February 2016

DataStage Scenario #15 - Get First & Last Date of Last Month



Design a job which generates the first and last date of the last month and passes them into an SQL statement that executes inside a parallel job.
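
The date arithmetic itself is simple; in the job it would typically be computed up front (in a sequence or routine) and passed to the parallel job as parameters for the SQL. A Python sketch of the calculation:

```python
from datetime import date, timedelta

# First and last date of the previous month, relative to the run date.
today = date.today()
last_of_prev = today.replace(day=1) - timedelta(days=1)  # last day of last month
first_of_prev = last_of_prev.replace(day=1)              # first day of last month

print(first_of_prev.isoformat(), last_of_prev.isoformat())
```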










Monday, 15 February 2016

5 Tips For Better DataStage Design #9



#1. Always save the metadata (for source, target or lookup definitions) in the repository to ensure re-usability and consistency.

#2. Make sure that pathname/format details are not hard-coded and that job parameters are used for them. These details are generally set as environment variables.




#3. Ensure that all file names from external sources are parameterized. This saves the developer the trouble of changing the job if a file name changes. File names/datasets created within the job for intermediate purposes can be hard-coded.

#4. Ensure that the environment variable $APT_DISABLE_COMBINATION is set to ‘False’.
Ensure that $APT_STRING_PADCHAR is set to spaces.

#5. Parameters used across jobs should have the same names. This helps avoid unnecessary confusion.

#6. Be consistent about where the slashes in a path live: either in the design or in the variable.





Friday, 5 February 2016

DataStage Scenario #14 - Get Day specific file



Hi Guys,
Design a job for one more real-time scenario -


Requirement -
The job runs once a day and reads a file from a folder, but the file name changes each day.

File Name -- FileName_Date.txt

Here -
Date - the job run date
FileName - the file name, which changes daily


Example -

FileName on Monday -   InfoSphere_20160201.txt
FileName on Tuesday -   Info_Search_20160202.txt
FileName on Wednesday -   InfoLables_20160203.txt
FileName on Thursday -   InfoLocation_20160204.txt
FileName on Friday -   InfoOptions_20160205.txt
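
Note that the prefix changes day to day but the suffix is always the run date, so one approach is to match on the date part alone. A minimal Python sketch (the landing folder path is a made-up example; in DataStage the resolved name would usually arrive as a job parameter built by a sequence or the scheduler):

```python
import glob
from datetime import date

# The prefix varies daily but the suffix is always <rundate>.txt,
# so match any file ending in _YYYYMMDD.txt for today's run date.
pattern = f"/data/landing/*_{date.today():%Y%m%d}.txt"  # hypothetical path
matches = glob.glob(pattern)
print(matches)
```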





Monday, 1 February 2016

5 Tips For Better DataStage Design #8



#1. Templates should be created to enhance reusability and enforce coding standards. Jobs should be created from templates.
#2. The template should contain the standard job flow along with proper naming conventions for components, proper job-level annotation, and short/long descriptions. A change-record section should be kept in the log description to track changes.



#3. Don't copy the job design alone; copy using 'Save As' or the 'Create copy' option at the job level.
#4. Log off the DataStage connection after completing work to avoid locked jobs.
#5. Create common lookup jobs.
#6. Some extraction jobs can be created to build reference datasets, which can then be used in different conversion modules.






Wednesday, 13 January 2016

DataStage Scenario #12 - Combine data column wise



Input 1:

Col1 Col2 Col3
1 2 3
4 5 6
7 8 9



Input 2:

Col4 Col5 Col6
a b c
d e f
g h i



Output:

Col1 Col2 Col3 Col4 Col5 Col6
1 2 3 a b c
4 5 6 d e f
7 8 9 g h i
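
The trick is that the two inputs are matched by row position, not by any key. In DataStage you could generate a row-number column on each link (for example with @INROWNUM in a Transformer) and join on it; the equivalent logic as a Python sketch:

```python
# Combine two inputs column-wise by row position (a horizontal join).
left  = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
right = [("a", "b", "c"), ("d", "e", "f"), ("g", "h", "i")]

combined = [l + r for l, r in zip(left, right)]
for row in combined:
    print(*row)  # 1 2 3 a b c ...
```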





Wednesday, 23 December 2015

5 Tips For Better DataStage Design #6



#1. If you use a Copy or a Filter stage immediately before or after a Transformer stage, you reduce efficiency by using more stages than needed, because a Transformer can do the job of both a Copy stage and a Filter stage.

#2. Work done by the "Copy" stage:
a) Column order can be altered.
b) Columns can be dropped.
c) Column names can be changed.



#3. When you need to run the same set of jobs again and again, create a sequencer containing all the jobs you need to run; running the sequencer runs them all, in whatever order your requirement dictates.

#4. Sort the data as much as possible in the database and reduce the use of DataStage sorts for better job performance. Push work that the database can do out of DataStage, but that does not mean all the complexity belongs in SQL; that is what DataStage is for.

#5. Ensure that all character fields are trimmed before any processing. Extra spaces in the data can lead to errors such as lookup mismatches, which are hard to detect.






Wednesday, 25 November 2015

DataStage Scenario #10 - two realtime scenario



Hello guys, hoping you are enjoying solving the DataStage scenarios. Today I am going to pose two real-time scenarios; try to solve them :-)

Scn1:
Design a job which extracts data from table tab1 when some value arrives in file file1. There is no relation between the table and the file.
Simple, huh? Let's make it a little more restricted: you cannot use a sequencer job; all the functionality must live in a single parallel job.





When you are able to solve the first one, move on to this -

Scn2:
Read source table Stab, which has 20 columns (Sc1, Sc2, Sc3, ...). Validate each column from Sc1 to Sc10 against columns Rc1 to Rc10 of another table, Rtab (Sc1 against Rc1, Sc2 against Rc2, and so on). If any column is invalid, the whole row is dropped and that column is captured in a single reject report. Design it so that we get two rows in the reject file if two columns are invalid in a single input row. (A sketch of the reject logic follows.)
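
To make the expected reject behaviour concrete, here is a Python sketch of Scn2's logic with stand-in data (the column names and reference values are illustrative only):

```python
# Compare Sc1..ScN with the reference Rc1..RcN and emit one reject
# record per invalid column, so a row with two bad columns produces
# two reject rows and the row itself is dropped from the load.
reference  = {"Rc1": "A", "Rc2": "B"}   # stand-in reference values
source_row = {"Sc1": "A", "Sc2": "X"}   # stand-in source row

rejects = []
for i in range(1, 3):                   # 1..10 in the real job
    if source_row[f"Sc{i}"] != reference[f"Rc{i}"]:
        rejects.append(f"Sc{i}")

if rejects:
    for col in rejects:
        print("reject:", col, source_row)  # one reject row per bad column
else:
    print("load:", source_row)
```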

Wish you luck !!




Monday, 9 November 2015

DataStage Scenario #9 - Add Header & Trailer



Scenario:
         Design a job to add Header and Trailer to input data.

Example Input:

Atul
Neeraj
Anita
Amruta
Divya
Swapnil
Pramod
Vivek
Ashish
Amit
Santosh



Output

Employee Name
Atul
Neeraj
Anita
Amruta
Divya
Swapnil
Pramod
Vivek
Ashish
Amit
Santosh
Employee Count : 11
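
For reference, the expected output can be produced with a simple sketch like the following (in a DataStage job this is typically a header row plus an Aggregator counting records, funneled in order; the Python below just shows the shape of the result):

```python
# Add a header line and a trailer carrying the record count.
names = ["Atul", "Neeraj", "Anita", "Amruta", "Divya", "Swapnil",
         "Pramod", "Vivek", "Ashish", "Amit", "Santosh"]

lines = ["Employee Name"] + names + [f"Employee Count : {len(names)}"]
print("\n".join(lines))
```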