
Thursday, 14 September 2017

Evaluation Sequence in Transformer Stage - A Quick DataStage Recipe



Recipe:

What is the evaluation sequence in the Transformer stage, i.e. the order in which stage variables, loop variables, and derivations are evaluated?

Ingredients:

1. Transformer Stage
     a. Stage Variables
     b. Loop Variables
     c. Derivations


How To:

Evaluate each stage variable initial value
For each input row to process:
    Evaluate each stage variable derivation value, unless the derivation is empty
    For each output link:
        Evaluate each column derivation value
        Write the output record
    Next output link
Next input row


** The stage variables and the columns within a link are evaluated in the order in which they are displayed in the Transformer editor. Similarly, the output links are also evaluated in the order in which they are displayed.
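For illustration only, here is a small Python model of that sequence. The names (stage_vars, output_links, derivation and so on) are hypothetical rather than DataStage APIs, and loop variables/loop conditions are left out for simplicity.

# Illustrative model of the Transformer evaluation sequence (not a DataStage API).
def run_transformer(input_rows, stage_vars, output_links):
    # 1. Evaluate each stage variable's initial value (in editor display order).
    values = {name: var.initial_value for name, var in stage_vars.items()}

    for row in input_rows:                     # 2. For each input row to process:
        for name, var in stage_vars.items():   # 3. Evaluate each stage variable derivation,
            if var.derivation is not None:     #    unless the derivation is empty.
                values[name] = var.derivation(row, values)

        for link in output_links:              # 4. For each output link (display order):
            record = {col.name: col.derivation(row, values)  # 5. Evaluate column derivations.
                      for col in link.columns}
            link.write(record)                 # 6. Write the output record.

Because stage variables keep their values from one row to the next, the display order decides whether one variable's derivation sees another variable's value from the current row or from the previous row, which is exactly why the ordering note above matters.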





Sunday, 5 March 2017

How to get DataStage Project Settings


DSListProjectSettings.bat is used to report all available settings of existing projects.

USAGE: DSListProjectSettings <SERVER> <USER> <PASSWORD> <PROJECT>

The DSListProjectSettings batch file reports all available project settings and environment variable names and values to standard output. The first section shows available DataStage project properties and their settings. The second section shows environment variable names and values. When run successfully, each section will report a status code of zero.
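For a quick illustration, the report can be captured from a script. This is a minimal Python sketch under the assumption that the batch file is reachable on the Windows client machine's PATH; the server, credentials and project name below are placeholders.

# Run DSListProjectSettings.bat and save its report (placeholder arguments).
import subprocess

result = subprocess.run(
    ["cmd", "/c", "DSListProjectSettings.bat", "MYSERVER", "dsadm", "secret", "MYPROJECT"],
    capture_output=True, text=True, check=True,
)

# The report goes to standard output: project properties first,
# then environment variable names and values.
with open("MYPROJECT_settings.txt", "w", encoding="utf-8") as out:
    out.write(result.stdout)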






Wednesday, 28 September 2016

DS_PXDEBUG - DataStage Parallel Debugging Variable





* Controlled with an environment variable, not exposed in the GUI. Set DS_PXDEBUG to activate the feature (e.g. DS_PXDEBUG=1).
* A warning is logged when a job is run with this debug feature on.
* Debug output is collected under a new project-level directory "Debugging" on the server, with subdirectories on a per-job basis, named after the job (created as required). For multi-instance jobs run with a non-empty invocation ID, the directory will be "<jobname>.<invocationID>" (see the sketch after this list).
* Internally turns on the osh environment variable APT_MSG_FILELINE so that warnings/errors issued by osh have the source file name and line number attached.
* Internally turns on the osh environment variable APT_ENGLISH_MESSAGES so that unlocalised copies of PX-originated error/warning messages are issued in addition to the localised copy (where available).

* Internally turns on the osh environment variables APT_PM_PLAYER_TIMING, APT_PM_PLAYER_MEMORY and APT_RECORD_COUNTS for more reporting from players.
* Places the content of the job's RT_SC<jobnum> directory in the debug location (includes the job parameter file, osh script, parent shell script, and any osh and compile scripts associated with Transformers). These will be in the same character set as the original files.
* Places the content of the job's RT_BP<jobnum>.O directory in the debug location. Includes library file binaries for PX Transformers (plus possibly binaries associated with any Server portions of the job).
* A dump of environment variable values at startup (same as in the log) is placed in a named file in the debug location.
* A dump of osh command options is placed in a named file in the debug location. Note that this is as issued from the Server wrapper code. Particularly in the case of Windows, it may not represent exactly what is received by the osh command line, due to the action of the OshWrapper program and the interpretation of quotes and backslash escapes.
* A copy of the received raw osh output messages is placed in a named file in the debug location. These will typically be in the host character set, even though on an NLS system Orchestrate will be originating them in UTF-8.
* A copy of the PX configuration file is placed in the debug location. This will be in the same character set as the original file.
* This new feature collects together and enhances a number of debug features already exposed with other environment variables. In order to minimise code impact risk, the original features will not be removed at this stage.
* The exception is the "dump of raw osh output messages"; it was previously placed in the &COMO& directory. If the old and the new debug options are both enabled, the new one will take precedence and there will not be a copy in &COMO&. Again this decision has been taken to minimise code change.
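As a minimal sketch of where to look afterwards, based on the layout described above (the project path, job name and invocation ID below are placeholders and will differ per installation):

# Locate the DS_PXDEBUG output collected for a job run (placeholder paths/names).
import os

project_dir = "/opt/IBM/InformationServer/Server/Projects/MYPROJECT"  # placeholder
job_name = "MyJob"
invocation_id = ""  # non-empty only for multi-instance runs

sub_dir = f"{job_name}.{invocation_id}" if invocation_id else job_name
debug_dir = os.path.join(project_dir, "Debugging", sub_dir)

# The directory should contain the job parameter file, osh script, environment
# variable dump, osh command options, raw osh messages, PX configuration file, etc.
for name in sorted(os.listdir(debug_dir)):
    print(name)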

Contributed by Chris Thornton, 2/2/2007






Monday, 22 August 2016

5 Tips For Better DataStage Design #15



1. Stage variables do not accept null values, so no nullable column should be mapped directly to a stage variable without null handling (see the sketch after this list).

2. Use of the SetNull() function in stage variable derivations should be avoided because it causes a compilation error.


3. If the input links are not already partitioned on the join key, they should be hash partitioned on the join key in the Join stage. In case of multiple join keys, it is recommended to partition on one key and sort by the other keys.

4. If there is a need to repartition data on an input link, clear the preserve-partitioning flag in the previous stage; otherwise a warning will be generated in the job log.

5. If the reference database table has a small volume of data, it is good to use the Lookup stage.

6. It is advisable to avoid the Transformer stage where possible, because the Transformer stage is not written in the DataStage native language; its logic is generated as C++ code. So every time you compile the job, that generated code is embedded with the native code in the executable file, which can degrade the performance of the job.
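Tips 1 and 2 come down to one pattern: supply a substitute value before a nullable column ever reaches a stage variable. The sketch below is only a conceptual analogue in Python (a NullToValue-style guard), not Transformer derivation syntax.

# Conceptual analogue of null handling before a non-nullable stage variable.
def null_to_value(column_value, substitute):
    # Return the column value, or the substitute when the value is null (None).
    return substitute if column_value is None else column_value

input_col = None                            # a nullable input column value
stage_var = null_to_value(input_col, "")    # the "stage variable" never sees a null
print(repr(stage_var))                      # ''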






Thursday, 14 April 2016

ETL Code Review Checklist




Guideline
Design jobs for restartability; if a job is not designed for restart, what is the reason?
Do not follow a Sequential File stage with "Same" partitioning.
Check if the APT_CONFIG_FILE parameter is added. This is required to change the number of nodes during runtime.
Do not hard-code parameters.
Do not hard-code directory paths.
Do not use fork-joins to generate lookup data sets.
Use "Hash" aggregation for a limited number of distinct key values. Outputs after all rows are read.
Use "Sort" aggregation for a large number of distinct key values. Data must be pre-sorted. Outputs after each aggregation group.
Use multiple aggregators to reduce collection time when aggregating all rows. Define a constant key column using a Row Generator. The first aggregator sums in parallel; the second aggregator sums sequentially (see the sketch after this checklist).
Make sure sequences are not too long. Break them up into logical units of work.
Is the error handling done properly? It is preferred to propagate errors from lower-level jobs to the highest level (e.g. a sequence).
What is the volume of extracted data (is there a WHERE clause in the SQL)?
Are the correct scripts to clean up datasets after job completion invoked?
Is there a reject process in place?
Can we combine or split jobs so that we reduce the number of jobs or the complexity, respectively?
It is not recommended to increase the number of nodes if there are too many stages in the job (this increases the number of processes spun off).
Is volume and growth information available for the Lookup/Join tables?
Check if there is a SELECT * in any of the queries. SELECT * is not advised; instead, the required columns should be listed explicitly in the statement.
Check the partitioning and sorting at each stage.
When a sequence is used, make sure none of the parameters passed are left blank.
Check if there are separate jobs for at least extract, transform and load.
Check if there is an annotation for each stage and for the job; the job properties should have the author, date, etc. filled out.
Check the naming convention of the jobs, stages and links.
Try to avoid Peek stages in production jobs; peeks are generally used for debugging during development.
Make sure the developer has not suppressed warnings that are valid.
Verify that the jobs conform to the Flat File and Dataset naming specification. This is especially important for cleaning up files and logging errors appropriately.
Verify that all fields are written to the Reject flat files. This is necessary for debugging and reconciliation.
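The multiple-aggregator item above (sum everything by aggregating in parallel first and then aggregating the few partial results sequentially) can be sketched in Python; the partition list here simply stands in for the parallel nodes.

# Two-stage aggregation sketch: partial sums per partition (parallel),
# then a final sum over the partial results (sequential).
partitions = [
    [1.0, 2.0, 3.0],       # rows seen by node 1
    [4.0, 5.0],            # rows seen by node 2
    [6.0, 7.0, 8.0, 9.0],  # rows seen by node 3
]

# First "aggregator": each node sums only its own rows (a constant key column
# means every row belongs to the same group).
partial_sums = [sum(rows) for rows in partitions]

# Second "aggregator": a single sequential pass over the few partial results.
total = sum(partial_sums)
print(partial_sums, total)   # [6.0, 9.0, 30.0] 45.0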








Sunday, 10 April 2016

5 Tips For Better DataStage Design #12



1. Use the minimum number of Sort stages in a DataStage job. The "Don't sort if previously sorted" option in the Sort stage should be set to "true", which improves Sort stage performance; the same hash key should be used. In the Transformer stage, the "Preserve Sort Order" option can be used to maintain the sort order.

2. Use the minimum number of stages in a job; otherwise it affects the performance of the job.
If a job has too many stages, it should be decomposed into a smaller number of small jobs. Containers are the best way to improve visualisation and readability. If the existing active stages occupy almost all of the CPU resources, performance can be improved by running multiple parallel copies of the same stage process; this is done by using a shared container.





3. Using the minimum number of stage variables in the Transformer is good practice; performance degrades as more stage variables are used.

4. Column propagation should be handled with care. Columns that are not needed in the job flow should not be propagated from one stage to another or from one job to the next. The best option is to disable RCP (runtime column propagation).

5. When there is a need to rename columns or add new columns, using a Copy or Modify stage is good practice.






Monday, 15 February 2016

5 Tips For Better DataStage Design #9



#1. Always save the metadata (for source, target or lookup definitions) in the repository to ensure re-usability and consistency.

#2. Make sure that pathname/format details are not hard-coded and that job parameters are used instead. These details are generally set as environment variables.




#3. Ensure that all file names from external sources are parameterized. This saves the developer the trouble of changing the job if a file name changes (see the sketch after this list). File names/datasets created within the job for intermediate purposes can be hard-coded.

#4. Ensure that the environment variable $APT_DISABLE_COMBINATION is set to ‘False’.
Ensure that $APT_STRING_PADCHAR is set to spaces.

#5. The parameters used across jobs should have the same name. This helps avoid unnecessary confusion.

#6. Be consistent with where the slashes in the path live: either in the design or in the variable. (Thomas McNicol)
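One common way to honour tip #3 is to pass the file name as a job parameter at run time rather than baking it into the design. The sketch below is a Python wrapper around the standard dsjob -run -param name=value command; the project, job and parameter names (MYPROJECT, LoadCustomers, SRC_FILE) are placeholders.

# Run a job with the source file name supplied as a parameter (placeholder names).
import subprocess

cmd = [
    "dsjob", "-run",
    "-param", "SRC_FILE=/data/incoming/customers_20160215.dat",
    "-jobstatus",                  # wait for the job and report its status
    "MYPROJECT", "LoadCustomers",
]
subprocess.run(cmd, check=True)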



