My e-Notes about Cloud, K8s, OpenShift, DataScience, Machine Learning, Python, Data Analytics, DataStage, DWH and ETL Concepts


Tuesday, 1 December 2020

5 Tips For Better DataStage Design #17

1. For data passed between jobs, such as parallel jobs within a sequence, use Datasets rather than sequential files.

2. Utility jobs that dump Datasets to sequential files are useful for debugging.

3. When processing sequential files with fixed-length columns that are larger than 500 MB or have more than 50 columns:
     - Define just one field spanning the total record length in the Sequential File stage, then add a Column Import stage immediately after it to describe each column's name, data type, scale, etc.
     - Use the Column Export stage (the inverse of the Column Import stage) to create sequential files in the same manner.
     - Enable the 'Read from multiple nodes' option on the Sequential File stage and set it to the number of partitions the job will run with.
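To make the Column Import / Column Export idea concrete, here is a minimal plain-Python sketch of what those two stages do conceptually: splitting a single fixed-length record field into typed columns, and packing it back. The record layout (10-char id, 20-char name, 8-char amount) is purely hypothetical, not a DataStage API.

```python
import struct

# Hypothetical fixed-length layout: 10-char id, 20-char name, 8-char amount
LAYOUT = struct.Struct("10s20s8s")

def column_import(record: bytes) -> dict:
    """Split one fixed-length record into typed columns (Column Import idea)."""
    raw_id, raw_name, raw_amount = LAYOUT.unpack(record)
    return {
        "id": raw_id.decode().strip(),
        "name": raw_name.decode().strip(),
        "amount": float(raw_amount),
    }

def column_export(row: dict) -> bytes:
    """Pack typed columns back into one fixed-length record (Column Export idea)."""
    return LAYOUT.pack(
        row["id"].ljust(10).encode(),
        row["name"].ljust(20).encode(),
        f"{row['amount']:8.2f}".encode(),
    )

# One 38-byte record read as a single field, then described column by column
rec = b"0000000042" + b"Jane Doe".ljust(20) + b"  123.45"
row = column_import(rec)
```

In a real job the Sequential File stage would read each 38-byte record as one raw field, and the Column Import stage would apply a layout like `LAYOUT` to produce the individual columns downstream.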

4. Always use full path names when referencing files or scripts.


Disclaimer

The postings on this site are my own and don't necessarily represent IBM's or other companies' positions, strategies, or opinions. All content provided on this blog is for informational purposes and knowledge sharing only.
The owner of this blog makes no representations as to the accuracy or completeness of any information on this site or found by following any link on this site. The owner will not be liable for any errors or omissions in this information nor for the availability of this information. The owner will not be liable for any losses, injuries, or damages from the display or use of his information.