Monday, 29 February 2016

Building Test data from Live Data in DataStage



DataStage provides the stages below to generate test data from live (real) data. We are all well aware of these stages, but perhaps not of this particular use for them (I know :-), we aren't).

Think about it!



Head stage: selects the first N records from each partition of an input data set and copies the selected records to an output data set.






Tail stage: selects the last N records from each partition of an input data set and copies the selected records to an output data set.






Sample stage: samples an input data set. It operates in two modes. In Percent mode, it selects rows by means of a random number generator and writes a given percentage of them to each output data set; in Period mode, it extracts every Nth row from each partition, where N is the period you supply.
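To make the idea concrete, here is a minimal Python sketch (plain Python, not DataStage code) that mimics what the Head, Tail, and Sample stages do on a simple list of rows; the function names and the 100-row stand-in data are just for illustration.

import random

def head(rows, n):
    """First N rows - what the Head stage does per partition."""
    return rows[:n]

def tail(rows, n):
    """Last N rows - what the Tail stage does per partition."""
    return rows[-n:]

def sample_percent(rows, percent, seed=42):
    """Percent mode: keep roughly the given percentage of rows at random."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() * 100 < percent]

def sample_period(rows, period):
    """Period mode: keep every Nth row."""
    return rows[::period]

live_rows = list(range(1, 101))            # stand-in for live data
print(head(live_rows, 5))                  # [1, 2, 3, 4, 5]
print(tail(live_rows, 5))                  # [96, 97, 98, 99, 100]
print(len(sample_percent(live_rows, 10)))  # roughly 10 rows
print(sample_period(live_rows, 25))        # [1, 26, 51, 76]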


Filter stage: transfers, unmodified, the records of the input data set which satisfy the specified requirements and filters out all other records.  You can specify different requirements to route rows down different output links.



External Filter stage: allows you to specify a UNIX command that acts as a filter on the data you are processing. For example, you could use the stage to grep a data set for a certain string or pattern and discard records that do not contain a match.
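The same grep-style idea can be sketched in ordinary Python; the pattern and the sample rows below are placeholders, not anything DataStage-specific.

import re

def grep_rows(rows, pattern):
    """Keep only rows whose text matches the pattern (grep-like behaviour)."""
    compiled = re.compile(pattern)
    return [row for row in rows if compiled.search(row)]

rows = ["CUST001,ACTIVE", "CUST002,CLOSED", "CUST003,ACTIVE"]
print(grep_rows(rows, "ACTIVE"))   # ['CUST001,ACTIVE', 'CUST003,ACTIVE']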



Sequential File stage: the Filter option lets you specify that the data is passed through a filter program before being written to a file or files on output, or before being placed in a data set on input.









Saturday, 27 February 2016

Audit Strategy in ETL #2


First part: Audit Strategy in ETL #1




As shown in the above diagram, the audit services depend on the source and target systems for capturing audit data, unlike the error-handling services, which depend on the ETL process to capture error data. Audit services are created as shared services that plug in to each ETL process.

Within the proposed Next Gen BI environment there are four stages (Initial Staging, Clean Staging, EDW, and Data Marts) where data is landed after being extracted from the source systems.
There will be at least one ETL job flow process at each of these stages extracting and loading data. Audit services are called at each stage by the ETL job flow process, and the audit data is captured and reported.


Some of the key audit data that will be captured by the audit services is given below:


Field # - Field Name - Description
1. Data Server Name - A name that identifies a database service. This is called a "server" by some products; Oracle refers to it as an "instance" or "database".
2. Database Name - A name that identifies the database that contains the target table.
3. Data Table Name - A name that identifies a target data table within a database.
4. Source Name - A name that identifies the file, table, or other data structure that is the origin of the data to be loaded into the target table.
5. ETL Process Name - The DataStage sequencer.
6. ETL Sub Process Name - The DataStage job.
7. New Rows in Source Count - The number of rows in the source table that have been added since the prior run date. This is the number that the process attempted to add to the target table. (If the source is a file, this field should be NULL.)
8. Changed Rows in Source Count - The number of rows in the source table that were modified since the prior run date. This is the number that the process attempted to modify in the target. (If the source is a file, this field will be NULL.)
9. Processed Rows in Source Count - The total number of rows in the source table or file that the process attempted to load to the target. (If the source is a table, this is the sum of the New_Rows_in_Source_Cnt and Changed_Rows_in_Source_Cnt.)
10. Starting Row Count - The number of rows in the table at the beginning of a table load.
11. Ending Row Count - The number of rows in the table at the conclusion of a table load.
12. Insert Row Count - The number of relational database table rows processed in an INSERT operation.
13. Update Row Count - The number of relational database table rows processed in an UPDATE operation.
14. DW_CREATE_DTTM - The date/time at which the audit data was captured.
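As a rough illustration only, one captured audit record could look like the following Python dictionary; the field names and sample values are mine, mirroring the list above, and are not a fixed standard.

from datetime import datetime

# Hypothetical audit record mirroring the fields listed above;
# names and values are illustrative only.
audit_record = {
    "data_server_name": "PRODDB01",
    "database_name": "EDW",
    "data_table_name": "CUSTOMER_DIM",
    "source_name": "STG.CUSTOMER",
    "etl_process_name": "seq_load_customer",      # DataStage sequencer
    "etl_sub_process_name": "job_load_customer",  # DataStage job
    "new_rows_in_source_cnt": 1200,
    "changed_rows_in_source_cnt": 300,
    "processed_rows_in_source_cnt": 1500,         # new + changed for table sources
    "starting_row_cnt": 100000,
    "ending_row_cnt": 101200,
    "insert_row_cnt": 1200,
    "update_row_cnt": 300,
    "dw_create_dttm": datetime.now(),
}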







Friday, 26 February 2016

Audit Strategy in ETL #1



The audit service will provide simple measures of the ETL process: the number of records input/output, the number of rejected records, the number of records loaded, the starting and ending row counts in the target tables, new rows in the source tables/files, inserted rows, and updated rows.

These measures are captured during the batch load process and stored in the audit tables for load audit checks and reporting purposes. Each ETL process should produce these key statistics by calling the audit services at the beginning and end of a load process. All audit tables will be stored in the staging repository. Exception reports can be produced for action to be taken by the business users and production support personnel.
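A minimal sketch of those begin/end audit calls, assuming hypothetical get_row_count() and write_audit_record() helpers supplied by the caller, could look like this:

# Sketch of the begin/end audit calls described above.
# get_row_count() and write_audit_record() are hypothetical helpers.

def run_load_with_audit(target_table, load_fn, get_row_count, write_audit_record):
    starting_count = get_row_count(target_table)   # audit call at the beginning
    stats = load_fn()                               # the actual ETL load
    ending_count = get_row_count(target_table)      # audit call at the end
    write_audit_record({
        "data_table_name": target_table,
        "starting_row_cnt": starting_count,
        "ending_row_cnt": ending_count,
        "insert_row_cnt": stats.get("inserted", 0),
        "update_row_cnt": stats.get("updated", 0),
    })
    return stats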





Ref #   Step Description
1. Represents all the source systems from which data is extracted. Audit services will collect all the source audit data from the source systems during the ETL process.
2. Represents the ETL process which extracts data from the source systems and loads it to the target systems. Audit data is mostly not captured at the ETL process level.
3. Represents the target systems where the data gets loaded after the completion of the ETL process. Audit services will collect all the target audit data from the target systems at the end of the ETL process.
4. Audit tables are internal tables where all the audit data related to the source and target systems is loaded. An entry is made in these tables every time an ETL batch process runs. They reside internally in the staging area, and audit reports can be generated from them.
5. Once the audit data is captured, audit check routines can be used to detect any discrepancy in the source or the target systems. This is an automated process which runs at the completion of each ETL job based on predefined rules (a minimal sketch appears after this list).
6. As mentioned in the previous step, these audit rules can be set to fail the job and stop the batch process, or the batch process can continue if the rules do not fail.
7. Once the audit check routines are complete for a job, an audit flag is set to indicate success or failure of the audit check.
8. If the audit check routine does not indicate a failure flag, the next batch process continues to run.
9. If the audit check routine indicates a failure, then the batch process is aborted.
10. When the batch process is aborted, control passes to the alert system, where appropriate action is taken by sending email notifications and pager messages.
11. Audit reports can be generated which will be useful for auditing the ETL process as well as the source systems.
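As a rough sketch of steps 5-10 only, assuming the audit record fields from part #2 and a hypothetical send_alert() helper, the check-and-abort logic could look like this:

# Illustrative audit check: compare what the source said it sent with what
# the target actually absorbed. Real rules and thresholds would be configurable.

def audit_check(audit_record):
    expected = audit_record["processed_rows_in_source_cnt"]
    actual = audit_record["insert_row_cnt"] + audit_record["update_row_cnt"]
    return "SUCCESS" if expected == actual else "FAILURE"

def run_batch_step(audit_record, send_alert):
    flag = audit_check(audit_record)                               # step 7: set audit flag
    if flag == "FAILURE":
        send_alert(f"Audit check failed for {audit_record['data_table_name']}")
        raise RuntimeError("Batch aborted by audit check")          # steps 9-10: abort and alert
    return flag                                                     # step 8: continue the batch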


Continued in the next part - Audit Strategy in ETL #2






Tuesday, 23 February 2016

DataStage Scenario #15 - Get First & Last Date of Last Month



Design a job which can generate the first and last dates of the previous month and pass them into an SQL query that executes inside a parallel job.
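The date arithmetic itself, independent of how you wire it into the job (for example via job parameters), can be sketched in Python like this:

from datetime import date, timedelta

def last_month_bounds(run_date=None):
    """Return (first_date, last_date) of the month before run_date."""
    run_date = run_date or date.today()
    last_of_prev = run_date.replace(day=1) - timedelta(days=1)   # last day of previous month
    first_of_prev = last_of_prev.replace(day=1)                  # first day of previous month
    return first_of_prev, last_of_prev

first_dt, last_dt = last_month_bounds(date(2016, 2, 29))
print(first_dt, last_dt)   # 2016-01-01 2016-01-31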










Monday, 22 February 2016

Python Points #10d - Writing into Files

Wednesday, 17 February 2016

Python Points #10c - File Methods

Monday, 15 February 2016

5 Tips For Better DataStage Design #9



#1. Always save the metadata (for source, target or lookup definitions) in the repository to ensure re-usability and consistency.

#2. Make sure that the pathname/format details are not hard coded and that job parameters are used instead. These details are generally set as environment variables.




#3. Ensure that all file names from external sources are parameterized. This will save the developer the trouble of changing the job if a file name changes. File names/datasets created within the job for intermediate purposes can be hard coded.

#4. Ensure that the environment variable $APT_DISABLE_COMBINATION is set to 'False', and that $APT_STRING_PADCHAR is set to spaces.

#5. The parameters used across jobs should have the same names. This helps to avoid unnecessary confusion.

#6. Be consistent with where the slashes in the path live - either in the job design or in the variable. (Thomas McNicol)





Sunday, 7 February 2016

Python Points #10b - Reading Files

Saturday, 6 February 2016

Python Points #10a - Reading Files

Friday, 5 February 2016

DataStage Scenario #14 - Get Day specific file



Hi guys,
Design a job for one more real-time scenario -


Requirement -
The job will run once a day and read a file from a folder, but the file name changes every day.

File name -- FileName_Date.txt

Here -
Date - the job run date
FileName - the base file name


Example -

FileName on Monday -   InfoSphere_20160201.txt
FileName on Tuesday -   Info_Search_20160202.txt
FileName on Wednesday -   InfoLables_20160203.txt
FileName on Thursday -   InfoLocation_20160204.txt
FileName on Friday -   InfoOptions_20160205.txt
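Outside DataStage, the file-pick logic can be sketched in Python: build the _YYYYMMDD suffix from the run date and pick whichever file in the folder carries it. The folder path used here is only a placeholder.

import glob
import os
from datetime import date

def file_for_run_date(folder, run_date=None):
    """Pick the file whose name ends with _<run date as YYYYMMDD>.txt."""
    run_date = run_date or date.today()
    suffix = run_date.strftime("%Y%m%d")
    matches = glob.glob(os.path.join(folder, f"*_{suffix}.txt"))
    if not matches:
        raise FileNotFoundError(f"No file found for {suffix} in {folder}")
    return matches[0]

# e.g. on 2016-02-02 this would pick .../Info_Search_20160202.txt
# print(file_for_run_date("/data/landing", date(2016, 2, 2)))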





Thursday, 4 February 2016

DataStage Scenario #13 - Validate Decimals



Input:

Data
1 2 3 4. 345
423 4w 4232.23
42342.43243
56765.74 545
4 4o 646. 4343
3232. 4232
f432 4.6 66
43 4 6 4.3 3



Expected Output:

Data             Valid
1234. 345        1234.345
4234w4232.23     0
42342.43243      42342.43243
56765.74 545     56765.74545
4 4o 646. 4343   0
3232. 4232       3232.4232
f432 4.6 66      0
43 4 6 4.3 3     43464.33
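One way to read the expected output: remove the spaces, and the value is valid only if what remains is a plain decimal number (digits with a single decimal point); otherwise output 0. A small Python sketch of that reading:

import re

def validate_decimal(raw):
    """Strip spaces and return the value if it is a clean decimal, else '0'."""
    cleaned = raw.replace(" ", "")
    return cleaned if re.fullmatch(r"\d+\.\d+", cleaned) else "0"

rows = ["1 2 3 4. 345", "423 4w 4232.23", "42342.43243", "56765.74 545",
        "4 4o 646. 4343", "3232. 4232", "f432 4.6 66", "43 4 6 4.3 3"]
for row in rows:
    print(row, "->", validate_decimal(row))
# e.g. "1 2 3 4. 345" -> 1234.345, "423 4w 4232.23" -> 0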





Wednesday, 3 February 2016

Python Points #9 - Tuples

Monday, 1 February 2016

5 Tips For Better DataStage Design #8



#1. Templates have to be created to enhance reusability and enforce coding standards. Jobs should be created using templates.
#2. The template should contain the standard job flow along with proper naming conventions for components, proper job-level annotation, and short/long descriptions. A change-record section should be kept in the log description to track changes.



#3. Don't copy the job design only; copy using the 'Save As' or 'Create Copy' option at the job level.
#4. The DataStage connection should be logged off after completion of work to avoid locked jobs.
#5. Create common lookup jobs.
#6. Some extraction jobs can be created to create reference datasets. The datasets can then be used in different conversion modules.




