
Wednesday, 5 April 2017

NULL Handling in Sequential File Stage



DataStage has a mechanism for denoting NULL field values, and it differs slightly between server and parallel jobs. In the Sequential File stage, a character or string may be used to represent NULL column values. Here's how to represent NULL with the character "~":

Server Job:
1. Create a Sequential file stage and make sure there is an Output link from it.
2. Open the Sequential File stage, click the "Outputs" tab, and select "Format".
3. On the right, enter "~" next to "Default NULL string:".

Parallel Job:
1. Create a Sequential file stage and make sure there is an Output link from it.
2. Open the Sequential File stage, click the "Outputs" tab, and select "Format".
3. Right-click "Field defaults", choose "Add sub-property", and select "Null field value".
4. Enter "~" in the newly created field.
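
Outside DataStage, the same convention is easy to picture. Here is a minimal Python sketch (the file name and column layout are hypothetical) of how a reader would map the "~" marker back to NULL:

```python
import csv

NULL_STRING = "~"   # the marker configured in the stage as the NULL representation

def read_with_nulls(path):
    """Read a comma-delimited file, mapping the NULL marker back to Python's None."""
    with open(path, newline="") as fh:
        for row in csv.reader(fh):
            yield [None if field == NULL_STRING else field for field in row]

# A row written as  1,~,Smith  comes back as ['1', None, 'Smith']
for record in read_with_nulls("customers.txt"):
    print(record)
```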






Wednesday, 18 January 2017

5 Tips For Better DataStage Design #17





**  Some stages have an automap button; it maps fields with the same names in one click.

**  When you add a shared container to your job, you need to map the container's columns to your job link. What you might miss is the extra option on the Columns tab "Load" button: in addition to the normal column load you get "Load from Container", which is a quick way to load the container metadata into your job.

**  Don't create a job from an empty canvas; always copy and adapt an existing job. Don't create shared containers from a blank canvas either; always build and test a full job, then turn part of it into a container.



**  If you want to copy and paste settings between jobs, open two Designer clients; that way you can have two property windows open at the same time and copy or compare them more easily. Most property windows in DataStage are modal, and you can only have one property window open per Designer session.

**  You can load metadata into a stage by using the "Load" button on the Columns tab or by dragging and dropping a table definition from the Designer repository window onto a link in your job. For Sequential File stages the drag and drop is faster, as it loads both the column names and the format values in one go; with the Load button you would need to load the column names and then the format details separately.

**  When a Modify stage or a stage function misbehaves, trial and error is often the only way to work out the syntax of a function, and doing that inside a large, complex job wastes a lot of debugging time. A better way is to keep a couple of test jobs in your project with a Row Generator, a Modify or Transformer stage, and a Peek stage, with a column of each type, and use them throughout the project as a quick way to test a function or conversion. (Note that compiling jobs with a Transformer stage requires a C++ compiler to be installed.) A rough Python analogue of this harness appears below.
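
This sketch mimics the row-generator -> function-under-test -> peek pattern; the column names and sample values are made up:

```python
from datetime import date

# Row generator: one sample value per data type you care about.
sample_rows = [
    {"int_col": 42, "dec_col": 9999.99, "str_col": "  padded  ",
     "date_col": date(2017, 4, 5)},
]

def function_under_test(row):
    # The conversion being prototyped (here a trim) before it goes into a big job.
    row["str_col"] = row["str_col"].strip()
    return row

# Peek: print the transformed rows.
for row in sample_rows:
    print(function_under_test(row))
```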





Monday, 29 August 2016

Modify Stage - What's been promised


The Modify stage is one of the most under-used stages in DataStage, yet it is very useful for performance tuning. Developers are advised not to use a Transformer stage merely for trimming or NULL handling; however, they should switch to the Modify stage only when they are aware of and comfortable with the syntax and derivations it supports, as there are no drop-down or right-click options to help with its functions/syntax.

The definition of the Modify stage, as documented by IBM -

"The Modify stage alters the record schema of its input data set. The modified data set is then output. You can drop or keep columns from the schema, or change the type of a column. The Modify stage is a processing stage. It can have a single input link and a single output link."


http://www.ibm.com/support/knowledgecenter/SSZJPZ_11.3.0/com.ibm.swg.im.iis.ds.parjob.dev.doc/topics/c_deeref_Modify_Stage.html
The operations offered by the Modify stage are (a rough code sketch of these operations follows the list) -
1. Dropping of Columns
2. Keeping Columns
3. Create Duplicate Columns
4. Rename Columns
5. Change Nullability of Columns
6. Change Data Type of Columns
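
As a mental model only (this is not Modify stage syntax), the six operations amount to simple record-schema edits. A hypothetical Python sketch:

```python
record = {"cust_id": "007", "name": "Smith", "balance": "150.75"}

# 1-2. Drop / keep columns
kept = {k: v for k, v in record.items() if k in ("cust_id", "balance")}

# 3. Duplicate a column
kept["balance_copy"] = kept["balance"]

# 4. Rename a column
kept["customer_id"] = kept.pop("cust_id")

# 5-6. Change nullability / data type: parse to float, allowing None for empty input
kept["balance"] = float(kept["balance"]) if kept["balance"] != "" else None

print(kept)  # {'balance': 150.75, 'balance_copy': '150.75', 'customer_id': '007'}
```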


The stage supports only one input link and one output link.

All these operations are easily done in other stages such as Copy or Transformer, so why is the Modify stage needed, and when should we use it?

The answer is simple: performance tuning of jobs.

** Why not use a Transformer -
Whenever we call Transformer functions, data is processed through the stage's generated C++ code (the Transformer implementation), which introduces latency. This delay is negligible for a small number of records but grows with volume, so prefer the Modify stage when the number of records to process is high.

Keep watching this space, as we are going to cover a lot of tips on the Modify stage.


Get this Article as PDF - http://bit.ly/2fdroHR and  http://bit.ly/2fdrDCr




Monday, 22 August 2016

5 Tips For Better DataStage Design #15



1. Stage variables do not accept NULL values, so no nullable column should be mapped directly to a stage variable without null handling.

2. Use of the SetNull() function in stage variables should be avoided because it causes a compilation error.


3. If input links are not already partitioned on the join key, they should be hash partitioned on the join key in the Join stage. In the case of multiple join keys, it is recommended to partition on one key and sort by the other keys (a sketch of this partitioning idea follows this list).

4. If there is a need to repartition an input link, clear the preserve-partitioning flag in the previous stage; otherwise the job log will show a warning.

5. If the database table used as reference data has a small volume, the Lookup stage is a good choice.

6. It is always advisable to avoid the Transformer stage where a simpler stage will do, because the Transformer is not written in DataStage's native language; its logic is generated as C++ code. Every time you compile a job, that code is compiled and embedded alongside the native code in the executable, which can degrade the performance of the job.
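
For tip 3, the idea behind hash partitioning on the join key is that rows with the same key always land in the same partition, which is what the Join stage needs to match them up. A small Python simulation (key values and partition count are made up):

```python
# Rows with the same join key always map to the same partition within a run.
NUM_PARTITIONS = 4

rows = [{"cust_id": cid, "amount": amt} for cid, amt in
        [("A1", 10), ("B2", 20), ("A1", 30), ("C3", 40)]]

partitions = [[] for _ in range(NUM_PARTITIONS)]
for row in rows:
    partitions[hash(row["cust_id"]) % NUM_PARTITIONS].append(row)

for i, part in enumerate(partitions):
    print(i, part)   # both "A1" rows end up in the same partition
```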






Thursday, 16 June 2016

5 Tips For Better DataStage Design #14



1. The use of the Lookup stage depends upon the volume of data. The sparse lookup type should be used when the primary input data volume is small. If the reference data volume is large, the Lookup stage should be avoided.

2. Using an ORDER BY clause in the database is preferable to using a Sort stage.



3. In DataStage Administrator, tune the 'Project Tunables' for better performance.

4. The use of the Funnel stage reduces the performance of a job; when it is needed, the Funnel stage should be run in continuous mode.

5. If a hash file is used only for lookups, enable "Preload to memory"; this will improve performance.







Monday, 25 April 2016

Suppress Warnings - A Quick DataStage Recipe



Recipe:

How to suppress job warnings in the DS job log

HowTo:

Add "rule" in Message Handler


Method:

From DS Director, open the Message Handler and add a rule,
selecting the warning message as an example of the message text.
Or:
Open the job log, select the message you want to suppress,
then right-click and add it to the Message Handler.






Monday, 18 April 2016

10 Scenario based Interview Questions #2 - DataStage



11. Design a job which inserts the data if it does not exist in the target and updates it if it does.
12. Design a job which includes a header and footer in the output file.
13. Design a job which checks whether Currency data is in the 'Rs. 9999.99' format or not.
14. Design a job which validates the date passed in. The date input is in YYYYMMDD format.

For more ->  DataStage Scenario

15. Design a job which computes the date difference in hours and minutes (HH:MM).
16. Design a job which deletes the data from the target if it exists, and then loads it.
17. Design a job which checks whether each column has a value or not.
18. Design a job which populates the Date dimension table.
19. Design a job which transposes columns into rows with the help of a Transformer stage.
20. Design a job which removes duplicates without using the Remove Duplicates stage (see the sketch below).
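
As an illustration, question 20 is commonly solved by sorting on the key and comparing each row's key to the previous one (the stage-variable trick in a Transformer). A hypothetical Python equivalent of that logic:

```python
# Sort on the key, then keep a row only when its key differs from the
# previous row's key -- the "compare to previous" stage-variable pattern.
rows = [{"id": 2}, {"id": 1}, {"id": 2}, {"id": 3}, {"id": 1}]

prev_key = object()          # sentinel that never equals a real key
deduped = []
for row in sorted(rows, key=lambda r: r["id"]):
    if row["id"] != prev_key:
        deduped.append(row)
    prev_key = row["id"]

print(deduped)  # [{'id': 1}, {'id': 2}, {'id': 3}]
```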


For more ->  DataStage Scenario




Sunday, 10 April 2016

5 Tips For Better DataStage Design #12



1. A minimum number of Sort stages should be used in a DataStage job. The "Don't sort if previously sorted" option in the Sort stage should be set to "true", which improves Sort stage performance; the same hash key should be used. In the Transformer stage, the "Preserve Sort Order" option can be used to maintain sort order.

2. A minimum number of stages should be used in a job; otherwise it affects the performance of the job.
If a job has too many stages, it should be decomposed into a number of smaller jobs. The use of containers is the best way to improve visualization and readability. If the existing active stages occupy almost all the CPU resources, performance can be improved by running multiple parallel copies of the same stage process. This is done by using a shared container.





3. Using a minimum number of stage variables in a Transformer is good practice; performance degrades as more stage variables are used.

4. Column propagation should be handled with care. Columns which are not needed in the job flow should not be propagated from one stage to another or from one job to the next. The safest option is to disable RCP.

5. When there is a need to rename columns or add new columns, use of a Copy or Modify stage is good practice.






Monday, 4 April 2016

DataStage Scenario #16 - Cross duplicate Removal


Need to remove duplicates where the source and destination may be swapped.


Input:


source   destination   distance
city1 city2 500
city2 city1 500
city3 city4 500
city4 city3 500 
city5 city7 700
city7 city5 700



Output:

source   destination   distance
city1 city2 500
city3 city4 500
city5 city7 700
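
One way to think about the solution: treat each (source, destination) pair as unordered, so (city1, city2) and (city2, city1) share the same key, and keep only the first occurrence. A Python sketch of that idea; in DataStage, a similar effect could be had by deriving the min/max of the two city columns in a Transformer and then removing duplicates on that pair:

```python
# (city1, city2) and (city2, city1) collapse to the same order-independent key,
# so the second occurrence of each pair is dropped.
rows = [
    ("city1", "city2", 500), ("city2", "city1", 500),
    ("city3", "city4", 500), ("city4", "city3", 500),
    ("city5", "city7", 700), ("city7", "city5", 700),
]

seen = set()
for src, dst, dist in rows:
    key = tuple(sorted((src, dst)))
    if key not in seen:
        seen.add(key)
        print(src, dst, dist)
```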






Wednesday, 23 March 2016

5 Tips For Better DataStage Design #11




  • When writing intermediate results that will only be shared between parallel jobs, always write to persistent data sets (using Data Set stages). You should ensure that the data is partitioned, and that the partitions, and sort order, are retained at every stage. Avoid format conversion or serial I/O.
  • Data Set stages should be used to create restart points in the event that a job or sequence needs to be rerun. But, because data sets are platform and configuration specific, they should not be used for long-term backup and recovery of source data.
  • Depending on available system resources, it might be possible to optimize overall processing time at run time by allowing smaller jobs to run concurrently. However, care must be taken to plan for scenarios when source files arrive later than expected, or need to be reprocessed in the event of a failure.
  • Parallel configuration files allow the degree of parallelism and resources used by parallel jobs to be set dynamically at run time. Multiple configuration files should be used to optimize overall throughput and to match job characteristics to available hardware resources in development, test, and production modes.
  • The proper configuration of scratch and resource disks and the underlying file system and physical hardware architecture can significantly affect overall job performance.







Tuesday, 22 March 2016

Transformer Stage alternative - A Quick DataStage Recipe



What to use instead of "Transformer" Stage

Copy Stage

Use "Copy" stage instead of "Transformer" Stage for following:
Renaming columns
Dropping columns
Default type conversions
Job design placeholder between stages


Modify Stage

Use "Modify" stage
Non default type conversions
Null handling
Character string trimming



Filter Stage

Use "Filter" Stage
Using constraints on output data


Will add more...







Wednesday, 2 March 2016

5 Tips For Better DataStage Design #10



1. Establish baselines (especially with I/O); use a Copy stage with no output.
2. Avoid using only one flow for tuning/performance testing. Prototyping can be a powerful tool.
3. Work in increments: change one thing at a time.
4. Evaluate data skew: repartition to balance the data flow.
5. Isolate and solve: determine which stage is causing a problem.

6. Distribute file systems (if possible) to eliminate bottlenecks.
7. Do NOT involve the RDBMS in initial testing.
8. Understand and evaluate the tuning knobs available.






Tuesday, 1 March 2016

ETL Development Standards



These development standards provide consistency in the artifacts the ETL developers create.  This consistency improves testing, operational support, maintenance, and performance.  

Code Development Guidelines


1 Commenting in Code and Objects

As a primary guideline, wherever it is possible and does not interfere with the operation of the application, all code must contain developer comments/notes.
All ETL jobs must have proper annotation (a short description of the functionality of the job).
The target output files (.csv files) should not contain any leading or trailing spaces.
When deciding on a record-level delimiter, the "delimiter collision" issue needs to be considered: no character that can appear as part of the data should be used as a field delimiter.

2 ETL Naming Standards

Standardized naming conventions ease the burden on developers switching from one project to another. Knowing what things are named and where they are located is very useful to understand before the design and development phases begin.

The following table identifies DataStage elements and their standard naming convention.


2.1 Job and Properties Naming Conventions

GUI Component Entity Convention
Designer Parallel Job <<Application>>_<<job_Name>>_JOB
Designer Sequence  <<Application>>_<<job_Name>>_SEQ
Designer Server Job  <<Application>>_<<job_Name>>_SVR
Designer Parameter  <<Application>>_<<job_Name>>_PARM

2.2 Job Processing Stage Naming Conventions

GUI Component Entity Convention
Designer Aggregator  AGG_<<PrimaryFunction>>
Designer Copy  CP_<<PrimaryFunction>>
Designer Filter  FLT_<<PrimaryFunction>>
Designer Funnel  FNL_<<PrimaryFunction>>
Designer Join (Inner)  JN_<<PrimaryFunction>>
Designer FTP Enterprise FTP_<<PrimaryFunction>>
Designer Lookup  LKP_<< Value Name or table Name>>
Designer Merge  MRG_<<PrimaryFunction>>
Designer Modify  MOD_<<PrimaryFunction>>
Designer Sort  SRT_<<PrimaryFunction>>

2.3 Links Naming Conventions

GUI Component Entity Convention
Designer Reference (Lookup)  Lnk_Ref_<<Number or Additional descriptor, if needed to form a unique object name>>
Designer Reject (Lookup, File, DB)  Lnk_Rej_<<Number or Additional descriptor, if needed to form a unique object name>>
Designer Input  Lnk_In_<<Number or Additional descriptor, if needed to form a unique object name>>
Designer Output  Lnk_Out_<<Number or Additional descriptor, if needed to form a unique object name>>
Designer Delete  Lnk_Del_<<Number or Additional descriptor, if needed to form a unique object name>>
Designer Insert  Lnk_Ins_<<Number or Additional descriptor, if needed to form a unique object name>>
Designer Update  Lnk_Upd_<<Number or Additional descriptor, if needed to form a unique object name>>

2.4 Data Store Naming Conventions:

In the case of a data store, the class word refers to the type of data store (e.g. Dataset, Sequential File, Table, View, and so forth).

GUI Component Entity Convention
Designer Database  DB_<<DatabaseName>>
Designer Table  TBL_<<TableName>>
Designer View  VIEW_<<ViewName>>
Designer Dimension  DM_<<TableName>>
Designer Fact  TRAN_<<TableName>>
Designer Source SRC_<<Table or Object Name>>
Designer  Target  TRGT_<<Table or objectName>>

2.5 File Stage Naming Conventions:

GUI Component Entity Convention
Designer Sequential File  SEQ_
Designer Complex Flat File  CFF_
Designer Parallel dataset  DS_








Tuesday, 23 February 2016

DataStage Scenario #15 - Get First & Last Date of Last Month



Design a job which generates the first and last dates of the last month and passes them into an SQL statement that executes inside a parallel job.
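
In a real job these two dates would typically be computed up front and passed in as job parameters. The date arithmetic itself is simple; a Python sketch:

```python
from datetime import date, timedelta

today = date.today()
last_of_prev = today.replace(day=1) - timedelta(days=1)   # last day of last month
first_of_prev = last_of_prev.replace(day=1)               # first day of last month

# These two values would be passed as job parameters into the SQL.
print(first_of_prev.isoformat(), last_of_prev.isoformat())
```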










Monday, 15 February 2016

5 Tips For Better DataStage Design #9



#1. Always save the metadata (for source, target or lookup definitions) in the repository to ensure re-usability and consistency.

#2. Make sure that the pathname/format details are not hard-coded and that job parameters are used for them. These details are generally set as environment variables.




#3. Ensure that all file names from external sources are parameterized. This saves the developer the trouble of changing the job if a file name changes. File names/datasets created in the job for intermediate purposes can be hard-coded.

#4. Ensure that the environment variable $APT_DISABLE_COMBINATION is set to ‘False’.
Ensure that $APT_STRING_PADCHAR is set to spaces.

#5. Parameters used across jobs should have the same names. This helps avoid unnecessary confusion.

#6. Be consistent about where the slashes in a path live: either in the design or in the variable.





Friday, 5 February 2016

DataStage Scenario #14 - Get Day specific file



Hi Guys,
Design jobs for one more real-time scenario -


Requirement -
The job runs once a day and reads a file from a folder, but the file name changes every day.

File Name -- FileName_Date.txt

Here -
Date -  Job run date
File - File Name


Example -

FileName on Monday -   InfoSphere_20160201.txt
FileName on Tuesday -   Info_Search_20160202.txt
FileName on Wednesday -   InfoLables_20160203.txt
FileName on Thursday -   InfoLocation_20160204.txt
FileName on Friday -   InfoOptions_20160205.txt
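
The selection logic boils down to finding whatever file in the folder ends with the run date. In DataStage this is usually handled by building the file name into a job parameter (for example, from a sequence or a before-job routine); a Python sketch of the lookup, with a hypothetical folder path:

```python
import glob
from datetime import date

run_date = date.today().strftime("%Y%m%d")                # e.g. "20160201"
matches = glob.glob(f"/data/incoming/*_{run_date}.txt")   # hypothetical folder

if matches:
    print("File to read:", matches[0])
else:
    print("No file found for", run_date)
```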





Monday, 1 February 2016

5 Tips For Better DataStage Design #8



#1. Templates should be created to enhance reusability and enforce coding standards. Jobs should be created using templates.
#2. The template should contain the standard job flow along with proper naming conventions for components, proper job-level annotation, and a short/long description. A change-record section should be kept in the log description to track changes.



#3. Don't copy the job design only; copy using the 'Save As' or create-copy option at the job level.
#4. Log off the DataStage connection after completing work, to avoid locked jobs.
#5. Create common lookup jobs.
#6. Extraction jobs can be created to build reference datasets, which can then be used in different conversion modules.






Wednesday, 13 January 2016

DataStage Scenario #12 - Combine data column wise



Input 1:

Col1 Col2 Col3
1 2 3
4 5 6
7 8 9



Input 2:

Col4 Col5 Col6
a b c
d e f
g h i



Output:

Col1 Col2 Col3 Col4 Col5 Col6
1 2 3 a b c
4 5 6 d e f
7 8 9 g h i
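
One hedged approach is to pair row i of the first input with row i of the second; in DataStage this is often done by generating a row-number key on both links and joining on it. The intended pairing, sketched in Python:

```python
left  = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]                    # Col1-Col3
right = [("a", "b", "c"), ("d", "e", "f"), ("g", "h", "i")]  # Col4-Col6

# Pair row i of the first input with row i of the second input.
for l_row, r_row in zip(left, right):
    print(*l_row, *r_row)
```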



