My e-Notes about Cloud, K8s, OpenShift, DataScience, Machine Learning, Python, Data Analytics, DataStage, DWH and ETL Concepts


Wednesday, 9 November 2016

DataStage Partitioning #1

The partitioning mechanism divides the data into smaller segments, each of which is then processed independently by a node in parallel. This lets a job take advantage of parallel architectures such as SMP, MPP, grid computing and clusters.

A partition is a logical division. To partition is to divide memory or mass storage into isolated sections; memory space is split into many partitions to achieve high parallelism. In DOS systems, for example, you can partition a disk, and each partition behaves like a separate disk drive.

In hash partitioning, no fixed amount of memory is allocated to a partition in advance. The space each partition takes depends on the data.

Why Partition?
•    Ability to run multiple operating systems, or multiple versions of an operating system, on the same server
•    Ability to improve workload balancing and distribution by managing processor allocations across applications and users on the server
•    Ability to leverage hardware models such as “Capacity on Demand” and “Pay as You Grow.”

Types of partition
  • Hash
  • Modulus
  • DB2
  • Auto
  • Random
  • Range
  • Round Robin
  • Entire
  • Same


DataStage inserts partitioners as necessary to ensure correct results, generally choosing Round Robin or Same. Because DataStage has limited awareness of the data and the business rules, the best practice is to specify partitioning explicitly whenever processing requires groups of related records to stay together.
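To see why a keyless default can be a problem for grouped processing, here is a minimal Python sketch of round-robin distribution. It only illustrates the idea and is not DataStage code; the function name and the sample rows are made up for this example.

```python
def round_robin_partition(rows, num_partitions):
    """Deal rows out to partitions in turn, ignoring their content."""
    partitions = [[] for _ in range(num_partitions)]
    for index, row in enumerate(rows):
        partitions[index % num_partitions].append(row)
    return partitions


# Hypothetical rows: the state code is the "group" a later stage cares about.
rows = ["CA-1", "CA-2", "MA-1", "CA-3"]
parts = round_robin_partition(rows, 2)
# The CA rows end up spread over both partitions, so a per-state
# aggregation running on each partition would see only part of the group.
```

The row counts are balanced, but related records are split across partitions, which is the case where a key-based method should be chosen explicitly.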

Key based partition
  • Hash
  • Modulus
  • DB2
  • Range

Hash: Determines the partition based on key value(s). One or more keys with different data types are supported. DataStage’s internal algorithm, applied to the key values, determines the partition. All key values are converted to characters before the algorithm is applied.
Example: Key is State. All “CA” rows go into one partition; all “MA” rows go into one partition. Two rows with the same state never go into different partitions.
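The State example can be sketched in a few lines of Python. This is only an illustration: DataStage's internal hashing algorithm is not shown in this post, so `zlib.crc32` is used as a stand-in, and the function and field names are assumptions.

```python
import zlib
from collections import defaultdict


def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition derived from a hash of its key value."""
    partitions = defaultdict(list)
    for row in rows:
        # Convert the key to characters first, then hash it.
        # crc32 is a stand-in, NOT DataStage's actual algorithm.
        key_text = str(row[key]).encode("utf-8")
        partitions[zlib.crc32(key_text) % num_partitions].append(row)
    return dict(partitions)


# Hypothetical sample rows.
rows = [
    {"state": "CA", "customer": "A"},
    {"state": "MA", "customer": "B"},
    {"state": "CA", "customer": "C"},
    {"state": "NY", "customer": "D"},
    {"state": "MA", "customer": "E"},
]
parts = hash_partition(rows, "state", 4)
```

Whatever hash is used, equal key values always produce the same partition number, so all “CA” rows land together. Different states may still share a partition, and partition sizes depend on how the key values are distributed.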


Modulus: The partition is the remainder of the key value divided by the number of partitions. The key must be an integer type. ( partition = MOD(key_value, number of partitions) )
Example: Key is OrderNumber (Integer type). Rows with the same order number will all go into the same partition.
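As a quick worked example of the modulus rule, the following Python sketch assigns a few made-up order numbers to 4 partitions. The function name and the sample values are assumptions for illustration only.

```python
def modulus_partition(key_value: int, num_partitions: int) -> int:
    """Return the partition number as the remainder of key / partition count."""
    return key_value % num_partitions


# Hypothetical order numbers; 1001 appears twice on purpose.
order_numbers = [1001, 1002, 1003, 1001, 1006]
assignments = [(n, modulus_partition(n, 4)) for n in order_numbers]
# Both rows with order number 1001 map to partition 1.
```

Because the result depends only on the key value, repeated order numbers always land in the same partition.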

DB2: Matches DB2 EEE partitioning. DB2 published its hashing algorithm, and DataStage copies it.
Example: This partitioning is used when loading data into a DB2 table. It takes the partition key from the target DB2 table and inserts the records effectively. If a partition key is defined in the DB2 database, that key is used; otherwise it defaults to the primary key.

Range: The partition is chosen based on a range map, which maps ranges of key values to specified partitions. This is similar to Hash, but the partition mapping is user-determined and the partitions are ordered. Range partitioning requires processing the data twice (once to build the range map and once to partition the rows), which makes it hard to find a reason for using it.
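The two passes can be sketched in Python as follows. This is a simplified illustration, not how DataStage builds its range map; the function names and the sample keys are assumptions.

```python
from bisect import bisect_right


def build_range_map(sample_keys, num_partitions):
    """First pass: pick boundary values so partitions get similar row counts."""
    ordered = sorted(sample_keys)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def range_partition(key_value, boundaries):
    """Second pass: locate the ordered partition whose range holds the key."""
    return bisect_right(boundaries, key_value)


# Hypothetical key values.
keys = [5, 42, 17, 88, 63, 29, 71, 3, 95, 50, 34, 12]
boundaries = build_range_map(keys, 3)
assignments = {k: range_partition(k, boundaries) for k in keys}
```

With these keys the boundaries are 29 and 63, so keys below 29 go to partition 0, keys from 29 up to 62 go to partition 1, and the rest go to partition 2. The partitions are ordered by key, which is what distinguishes this method from Hash.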

This figure gives a clear view of key-based partitioning and repartitioning.

DataStage's parallel technology operates by a divide-and-conquer technique, splitting the largest integration jobs into subsets ("partition parallelism") and flowing these subsets concurrently across all available processors ("pipeline parallelism"). This combination of pipeline and partition parallelism delivers true linear scalability (defined as an increase in performance proportional to the number of processors) and makes hardware the only limiting factor on performance.

However, downstream processes may need data partitioned differently. Consider a transformation that is based on customer last name, while the enriching needs to occur on zip code (for house-holding purposes) and loading into the warehouse is based on customer credit card number. With dynamic data re-partitioning, data is re-partitioned on the fly between processes, without landing the data on disk, based on the partitioning needs of the downstream process.
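A rough Python sketch of this idea follows: rows are first partitioned on last name, then streamed in memory into new partitions keyed on zip code. It is an illustration under stated assumptions, not DataStage's implementation; `zlib.crc32` stands in for the real hash, and the function names and customer rows are invented.

```python
import zlib


def partition_by(rows, key, num_partitions):
    """Hash-partition rows on the given key (crc32 is only a stand-in hash)."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        index = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions


def repartition(partitions, key, num_partitions):
    """Stream rows out of the existing partitions into new ones, in memory."""
    # A generator expression avoids building an intermediate copy of the data.
    all_rows = (row for part in partitions for row in part)
    return partition_by(all_rows, key, num_partitions)


# Hypothetical customer rows.
customers = [
    {"last_name": "Smith", "zip": "02139"},
    {"last_name": "Jones", "zip": "02139"},
    {"last_name": "Smith", "zip": "94105"},
    {"last_name": "Lee", "zip": "94105"},
]
by_name = partition_by(customers, "last_name", 3)  # for the transformation
by_zip = repartition(by_name, "zip", 3)            # for house-holding
```

After the first step both Smith rows share a partition; after the second, rows with the same zip code share a partition instead, and nothing was written to disk in between.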



The postings on this site are my own and don't necessarily represent IBM's or other companies' positions, strategies or opinions. All content provided on this blog is for informational purposes and knowledge sharing only.
The owner of this blog makes no representations as to the accuracy or completeness of any information on this site or found by following any link on this site. The owner will not be liable for any errors or omissions in this information nor for the availability of this information. The owner will not be liable for any losses, injuries, or damages from the display or use of this information.