
Friday, 11 November 2016

DataStage Partitioning #2


Keyless partition
    Round Robin
    Entire
    Same
    Random

Round Robin
The first record goes to the first processing node, the second to the second processing node, and so on. When DataStage reaches the last processing node in the system, it starts over. This method is useful for resizing partitions of an input data set that are not equal in size. The round robin method always creates approximately equal-sized partitions. This method is the one normally used when DataStage initially partitions data.
Example: Assume 8 nodes are allocated to store the records. The 1st record goes to the first node, the 2nd record to the second node, and so on; the 8th record goes to the eighth node, and the 9th record wraps back to the first node.
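A minimal Python sketch of this assignment rule (the 8-node count comes from the example above and is purely illustrative; DataStage handles this internally):

```python
NUM_NODES = 8  # taken from the example above; purely illustrative

def round_robin_node(record_index, num_nodes=NUM_NODES):
    """Return the 0-based node a record lands on under round-robin partitioning."""
    return record_index % num_nodes

# Records 1..8 land on nodes 1..8; the 9th wraps back to node 1.
for i in range(9):
    print(f"record {i + 1} -> node {round_robin_node(i) + 1}")
```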

Entire
Send all rows down all partitions.
Example: Assume 8 nodes are allocated; every record is sent to all 8 nodes.

Same
Preserve the same partitioning.
Example: A job has two stages (Sort and Data Set). The Sort stage uses “Hash” partitioning and the Data Set stage uses “Same” partitioning, so the Data Set stage preserves the hash partitioning produced by the Sort stage.

Random
DataStage uses a random algorithm to choose where each row goes, so you cannot predict which partition a row will end up in.


Application Execution: Parallel jobs can be executed in two ways
> Sequential
> Parallel
The first slot of the figure below shows sequential execution. A parallel job can run on two processing architectures, SMP and MPP, shown in the second and third slots of the figure.




Ref - www.ibm.com



Thursday, 10 November 2016

DataStage Partitioning #1


The partitioning mechanism divides data into smaller segments, which are then processed independently by each node in parallel. It takes advantage of parallel architectures such as SMP, MPP, grid computing and clusters.

A partition is logical: partitioning divides memory or mass storage into isolated sections. Splitting the space into many partitions allows a high degree of parallelism. In DOS systems, for example, you can partition a disk, and each partition behaves like a separate disk drive.


Note:
In hash partitioning, no fixed amount of memory is allocated to a partition; the partition space is allocated depending on the data.



Why Partition?
•    Ability to run multiple operating systems, or multiple versions of an operating system, on the same server
•    Ability to improve workload balancing and distribution by managing processor allocations across applications and users on the server
•    Ability to leverage hardware models such as “Capacity on Demand” and “Pay as You Grow”.

Types of partition
  • Hash
  • Modulus
  • DB2
  • Auto
  • Random
  • Range
  • Round Robin
  • Entire
  • Same

Auto

DataStage inserts partitioners as necessary to ensure correct results, generally choosing Round Robin or Same. Since DataStage has limited awareness of your data and business rules, best practice is to explicitly specify the partitioning whenever the processing requires groups of related records.

Key based partition
  • Hash
  • Modulus
  • DB2
  • Range

Hash
Determines the partition based on key value(s). One or more keys of different data types are supported. DataStage’s internal algorithm, applied to the key values, determines the partition. All key values are converted to characters before the algorithm is applied.
Example: Key is State. All “CA” rows go into one partition; all “MA” rows go into one partition. Two rows of the same state never go into different partitions.
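DataStage’s internal hashing algorithm is not published, so the sketch below only illustrates the behaviour described above, with MD5 standing in for the real hash; the partition count of 4 is an arbitrary assumption:

```python
import hashlib

NUM_PARTITIONS = 4  # arbitrary illustrative value

def hash_partition(*key_values, num_partitions=NUM_PARTITIONS):
    """Map one or more key values to a partition.

    Key values are converted to characters first (as noted above), then a
    hash of the combined string picks the partition. MD5 is only a stand-in
    for DataStage's internal algorithm.
    """
    key_string = "|".join(str(v) for v in key_values)
    digest = hashlib.md5(key_string.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions

# All "CA" rows land on the same partition; likewise all "MA" rows.
print(hash_partition("CA"), hash_partition("CA"), hash_partition("MA"))
```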

Modulus

The partition is based on the modulus of the key value by the number of partitions; the key must be an Integer type: partition = MOD(key_value, number_of_partitions).
Example: Key is OrderNumber (Integer type). Rows with the same order number will all go into the same partition.
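A one-line Python equivalent of that formula (the order numbers and the partition count of 4 are invented for illustration):

```python
NUM_PARTITIONS = 4  # arbitrary illustrative value

def modulus_partition(key_value, num_partitions=NUM_PARTITIONS):
    """partition = MOD(key_value, number_of_partitions); the key must be an integer."""
    return key_value % num_partitions

# Rows sharing an OrderNumber always land on the same partition.
for order_number in (1001, 1002, 1005, 1001):
    print(order_number, "->", modulus_partition(order_number))
```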

DB2
Matches DB2 EEE partitioning; DB2 published its hashing algorithm and DataStage replicates it.
Example: This partitioning is used when loading data into a DB2 table. It takes the partition key of the target DB2 table and places the records accordingly. If a partition key is defined in the DB2 database, that key is used; otherwise it defaults to the primary key.

Range
The partition is chosen based on a range map, which maps ranges of key values to specified partitions. This is similar to Hash, but the partition mapping is user-determined and the partitions are ordered. Range partitioning requires processing the data twice (once to build the range map and once to partition the data), which makes it hard to find a reason for using it.
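A small Python sketch of a range map, assuming integer keys; the boundary values are invented, whereas in DataStage the map is built from a prior pass over the data:

```python
import bisect

# Upper bounds of each range; keys above the last bound go to the final partition.
RANGE_BOUNDARIES = [100, 500, 1000]  # invented for illustration

def range_partition(key_value, boundaries=RANGE_BOUNDARIES):
    """Return the ordered partition whose value range contains the key."""
    return bisect.bisect_left(boundaries, key_value)

for key in (42, 250, 999, 5000):
    print(key, "-> partition", range_partition(key))
```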

This figure gives a clear view of key-based partitioning and repartitioning.



DataStage's parallel technology operates by a divide-and-conquer technique, splitting the largest integration jobs into subsets ("partition parallelism") and flowing these subsets concurrently across all available processors ("pipeline parallelism"). This combination of pipeline and partition parallelism delivers true linear scalability (defined as an increase in performance proportional to the number of processors) and makes hardware the only mitigating factor to performance.

However, downstream processes may need data partitioned differently. Consider a transformation that is based on customer last name, but the enriching needs to occur on zip code - for house-holding purposes - with loading into the warehouse based on customer credit card number (more on parallel database interfaces below). With dynamic data re-partitioning, data is re-partitioned on-the-fly between processes - without landing the data to disk - based on the downstream process data partitioning needs.
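A rough Python sketch of that idea: the same in-memory rows are grouped by last name for the transformation, then regrouped by zip code for the enrichment step, with nothing written to disk in between. The row contents and partition count are invented for illustration.

```python
from collections import defaultdict

NUM_PARTITIONS = 3  # invented for illustration

def partition_rows(rows, key, num_partitions=NUM_PARTITIONS):
    """Group rows into partitions by hashing one column; a stand-in for
    DataStage re-partitioning data on-the-fly between stages."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions

rows = [
    {"last_name": "Smith", "zip": "10001", "card_no": "4111..."},
    {"last_name": "Jones", "zip": "10001", "card_no": "4222..."},
]

by_last_name = partition_rows(rows, "last_name")  # transformation stage
by_zip = partition_rows(rows, "zip")              # house-holding / enrichment stage
print(sorted(len(v) for v in by_zip.values()))    # both rows share a zip, so they end up together
```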


Ref - www.ibm.com




Tuesday, 10 May 2016

ETL Data Validation



    Organizations typically have “dirty data” that must be cleansed or scrubbed before being loaded into the data warehouse.  In an ideal world, there would not be dirty data.  The data in operational systems would be clean. Unfortunately, this is virtually never the case.  The data in these source systems is the result of poor data quality practices and little can be done about the data that is already there.  While organizations should move toward improving data quality at the source system level, nearly all data warehousing initiatives must cope with dirty data, at least in the short term. There are many reasons for dirty data, including:

Dummy values.  Inappropriate values have been entered into fields.  For example, a customer service representative, in a hurry and not perceiving correct data entry as particularly important, might enter the store’s ZIP code rather than the customer’s ZIP, or enter 999-99-9999 whenever an SSN is unknown.  The operational system accepts the input, but it is not correct.
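A minimal sketch of how such a check might be expressed during validation; the placeholder list, the store ZIP value, and the function name are all hypothetical:

```python
DUMMY_SSNS = {"999-99-9999", "000-00-0000"}  # hypothetical list of known placeholder SSNs
STORE_ZIP = "60601"                          # hypothetical ZIP code of the store itself

def looks_like_dummy(ssn, customer_zip):
    """Flag records whose SSN or ZIP looks like a placeholder rather than real data."""
    return ssn in DUMMY_SSNS or customer_zip == STORE_ZIP

print(looks_like_dummy("999-99-9999", "10001"))  # True  - placeholder SSN
print(looks_like_dummy("123-45-6789", "60601"))  # True  - ZIP matches the store's own
print(looks_like_dummy("123-45-6789", "10001"))  # False
```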

Absence of data.  Data was not entered for certain fields. This is not always attributable to lazy data entry habits and the lack of edit checks, but to the fact that different business units may have different needs for certain data values in order to run their operations.  For example, the department that originates mortgage loans may have a federal reporting requirement to capture the sex and ethnicity of a customer, whereas the department that originates consumer loans does not.

Multipurpose fields.  A field is used for multiple purposes; consequently, it does not consistently store the same thing.  This can happen with packaged applications that include fields that are not required to run the application.  Different departments may use the “extra” fields for their own purposes, and as a result, what is stored in the fields is not consistent.

Cryptic data.  It is not clear what data is stored in a field.  The documentation is poor and the attribute name provides little help in understanding the field’s content.  The field may be derived from other fields or the field may have been used for different purposes over the years.

Contradicting data.  The data should be the same but it isn’t.  For example, a customer may have different addresses in different source systems.

Inappropriate use of address lines.  Data has been incorrectly entered into address lines.  Address lines are commonly broken down into, for example, Line 1 for first, middle, and last name, Line 2 for street address, Line 3 for apartment number, and so on.  Data is not always entered into the correct line, which makes it difficult to parse the data for later use.

Violation of business rules.  Some of the values stored in a field are inconsistent with business reality.  For example, a source system may have recorded an adjustable rate mortgage loan where the value of the minimum interest rate is higher than the value of the maximum interest rate. 

Reused primary keys.  A primary key is not unique; it is used with multiple occurrences.  There are many ways that this problem can occur.  For example, assume that a branch bank has a unique identifier (i.e., a primary key).  The branch is closed and the primary key is no longer in use.  But two years later, a new branch is opened, and the old identifier is reused.  The primary key is the same for the old and the new branch.

Non-unique identifiers.  An item of interest, such as a customer, has been assigned multiple identifiers.  For example, in the health care field, it is common for health care providers to assign their own identifier to patients.  This makes it difficult to integrate patient records to provide a comprehensive understanding of a patient’s health care history.

Data integration problems.  The data is difficult or impossible to integrate.  This can be due to non-unique identifiers, or the absence of an appropriate primary key.  To illustrate, for decades customers have been associated with their accounts through a customer name field on the account record.  Integrating multiple customer accounts in this situation can be difficult.  When we examine all the account records that belong to one customer, we find different spellings or abbreviations of the same customer name, sometimes the customer is recorded under an alias or a maiden name, and occasionally two or three customers have a joint account and all of their names are squeezed into one name field.





Thursday, 4 February 2016

DataStage Scenario #13 - Validate Decimals



Input:

Data
1 2 3 4. 345
423 4w 4232.23
42342.43243
56765.74 545
4 4o 646. 4343
3232. 4232
f432 4.6 66
43 4 6 4.3 3



Expected Output:

Data              Valid
1234. 345         1234.345
4234w4232.23      0
42342.43243       42342.43243
56765.74 545      56765.7455
4 4o 646. 4343    0
3232. 4232        3232.4232
f432 4.6 66       0
43 4 6 4.3 3      43464.33
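The scenario doesn't prescribe a solution. One possible rule, sketched in Python under the assumption that embedded spaces are removed and anything that isn't a clean decimal afterwards becomes 0:

```python
import re

def validate_decimal(raw):
    """Remove embedded spaces; return the result if it is a clean decimal, else the string "0"."""
    collapsed = raw.replace(" ", "")
    # Valid means digits with at most one decimal point and nothing else.
    return collapsed if re.fullmatch(r"\d+(\.\d+)?", collapsed) else "0"

for value in ("1 2 3 4. 345", "423 4w 4232.23", "42342.43243", "43 4 6 4.3 3"):
    print(repr(value), "->", validate_decimal(value))
```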





Monday, 2 November 2015

5 Tips For Better DataStage Design #4



1) When using the AsInteger() function in a DataStage transformer, always trim the input column before passing it to the function, because extra spaces or unwanted characters generate zeros where actual integer values are expected (a Python analogue is sketched after this list). Use the APT_STRING_PADCHAR=0x20 (space) environment variable for fixed-field padding.

2) Len(col) will return an incorrect length if the input column contains non-ASCII or double-byte characters, so check the job's NLS settings to fix this.



3) To remove embedded spaces from decimal data, use the StripWhiteSpace(input.field) function to strip all spaces.

4) To get the DataStage job number, open the log view of the job in DataStage Director and double-click any log entry. The job number is listed under the field "Job Number:".

5) Set the two parameters APT_NO_PART_INSERTION and APT_NO_SORT_INSERTION to TRUE to prevent DataStage from automatically inserting partitioning or sorting operators at compile time, which can improve job performance. This also removes the warning "When checking operator: User inserted sort "<name>" does not fulfill the sort requirements of the downstream operator "<name>"".
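For tip 1, a small Python analogue of the trim-before-convert pattern (AsInteger and Trim are DataStage functions; the Python below only mirrors the idea and is not their implementation):

```python
def as_integer_safe(value):
    """Trim padding before converting, mirroring Trim() before AsInteger();
    returns 0 when the trimmed value still is not a clean integer."""
    trimmed = value.strip()
    try:
        return int(trimmed)
    except ValueError:
        return 0

print(as_integer_safe("  123  "))  # 123 -- untrimmed padding characters would otherwise cause trouble
print(as_integer_safe("12a3"))     # 0
```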




