Wednesday, 30 November 2016

Learning Graphlab - SFrame #1

Hoping you went through the last post (Link -> Getting Started with Graphlab). In this post we will get some hands-on practice with the SFrame data type of GraphLab, which is analogous to the DataFrame of the pandas Python library. We will cover:

i. Reading a CSV file
ii. Saving a data set
iii. Loading a data set
iv. Checking total rows and columns
v. Checking column names and data types
vi. Adding a new column
vii. Deleting a column
viii. Renaming a column
ix. Swapping column positions
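The post's screenshots are not available here, so as a stand-in, here is a minimal sketch of the nine operations using pandas, whose DataFrame the post compares SFrame to. GraphLab's own calls follow the same pattern (graphlab.SFrame.read_csv, sf.save, graphlab.load_sframe, sf.num_rows(), sf.column_names(), sf.rename, sf.swap_columns). The file people.csv and its columns are made-up examples.

```python
import pandas as pd

# Create a tiny sample CSV so the sketch is self-contained
# (people.csv and its columns are hypothetical).
with open('people.csv', 'w') as f:
    f.write('name,age\nAlice,30\nBob,25\n')

df = pd.read_csv('people.csv')           # i.    read the CSV file
df.to_pickle('people.pkl')               # ii.   save the data set
df = pd.read_pickle('people.pkl')        # iii.  load the data set
rows, cols = df.shape                    # iv.   total rows and columns
print(df.dtypes)                         # v.    column names and data types
df['age_next_year'] = df['age'] + 1      # vi.   add a new column
df = df.drop(columns=['age_next_year'])  # vii.  delete a column
df = df.rename(columns={'name': 'full_name'})  # viii. rename a column
df = df[['age', 'full_name']]            # ix.   swap the column positions
```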


Sunday, 27 November 2016

Getting Started with Graphlab - A Python library for Machine Learning

Before starting with GraphLab, we have to configure our system with some basic tools such as Python and Jupyter Notebook. You can find the 'How-To' on this link -

What is GraphLab?
GraphLab Create is a Python library, backed by a C++ engine, for quickly building large-scale, high-performance data products. Some key features of GraphLab Create are:
  • Analyze terabyte scale data at interactive speeds, on your desktop.
  • A Single platform for tabular data, graphs, text, and images.
  • State of the art machine learning algorithms including deep learning, boosted trees, and factorization machines.
  • Run the same code on your laptop or in a distributed system, using a Hadoop Yarn or EC2 cluster.
  • Focus on tasks or machine learning with the flexible API.
  • Visualize data for exploration and production monitoring.
After installing the GraphLab library we can use it like any other Python library.

To get started, open a Python notebook in Jupyter Notebook and execute the commands below to see GraphLab working -

 a. Importing Graphlab - 


b. Reading CSV file
This method will parse the input file and convert it into an SFrame variable.


c. Getting Started with SFrame 

i. View content of SFrame variable sf


ii. View Head lines (top lines) 


iii. View Tail lines (last lines)
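The command screenshots from the original post did not survive, so here is a self-contained sketch of steps (a)-(c) using the pandas equivalents, with the corresponding GraphLab calls shown in the comments. The file data.csv and its contents are made up for illustration.

```python
import pandas as pd                      # a. GraphLab: import graphlab

# A hypothetical sample file so the sketch runs on its own.
with open('data.csv', 'w') as f:
    f.write('id,city\n1,Pune\n2,Delhi\n3,Mumbai\n')

# b. Read the CSV file into a table
df = pd.read_csv('data.csv')             # GraphLab: sf = graphlab.SFrame.read_csv('data.csv')

# c. Getting started
print(df)                                # i.   view the content     (GraphLab: sf)
print(df.head(2))                        # ii.  view the top lines   (GraphLab: sf.head(2))
print(df.tail(2))                        # iii. view the last lines  (GraphLab: sf.tail(2))
```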


Monday, 21 November 2016

Reading DSParams - the DataStage parameter file

I am sharing a utility which can help you read the DSParams file, which holds all the environment-level DataStage parameters.

It is a utility to view the contents of the DSParams file - useful when trying to see what the customer has set at the project level.

$ cat DSParams | ./ | more
$ cat DSParams | ./ > outputfile

1. Copy the script text below to a file on a UNIX system.
2. Set execute permissions on the file (chmod 777).
3. Usually perl is in /usr/bin/perl, but you may have to adjust this path if necessary (hint: "which perl" should tell you which one to use).
4. cat the DSParams file from the project you are concerned with and pipe the output to this script. You may have to use the fully qualified path for the file.
5. Capture the output to the screen or to a file. A file may be useful if you want the customer to send the info to you by email.


Monday, 14 November 2016

DataStage Partitioning #3

Best allocation of Partitions in DataStage for storage area

  • DB2 EEE extraction in serial / DB2 EEE extraction in parallel — best way of partition: Node number = current node (key); nodes in the configuration file: 64 (depends on how many nodes are allocated).
  • Partition or repartition in the stages of DataStage — best way of partition: Modulus (needs a single key, which must be an integer) or Hash (any number of keys, of any data type); nodes in the configuration file: 8 (depends on how many nodes are allocated for the job).
  • Writing into DB2 / Dataset / Sequential File — nodes in the configuration file: 1, 2, 4, 8, 16, 32, 64 etc. (based on the incoming records it writes).


Best allocation of Partitions in DataStage for each stage

1. Join — Left and right links: Hash or Modulus. All the input links should be sorted on the join key and partitioned with the higher key order.

2. Lookup — Main link: Hash or Same; reference link: Entire. The links need not be in sorted order.

3. Merge — Master and update links: Hash or Modulus. All the input links should be sorted on the merge key and partitioned with the higher key order; the pre-sort makes the merge "lightweight" on memory.

4. Remove Duplicates, Aggregator — Hash or Modulus. Performs better if the input link is already sorted on the key.

5. Sort — Hash or Modulus. Sorting happens after partitioning.

6. Transformer, Funnel, Copy, Filter — Any partition method; no key or sort requirement.

7. Change Capture — Left and right links: Hash or Modulus. Both input links should be sorted on the key and partitioned with the higher key order.


Friday, 11 November 2016

DataStage Partitioning #2

Keyless partition
    Round Robin
    Entire
    Same
    Random

Round Robin
The first record goes to the first processing node, the second to the second processing node, and so on. When DataStage reaches the last processing node in the system, it starts over. This method is useful for resizing partitions of an input data set that are not equal in size. The round robin method always creates approximately equal-sized partitions. This method is the one normally used when DataStage initially partitions data.
Example: Assume 8 nodes are allocated to store the records then 1st record will go into the first node and the 2nd record will go into the second node ……8th record will go into the eighth node and the 9th record will go into the first node and so on….
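The dealing described above can be sketched in a few lines of Python (an illustration only, not DataStage internals):

```python
# Round robin: deal each record to the next node in turn, wrapping
# around after the last one, so partition sizes stay nearly equal.
def round_robin(records, n_nodes):
    partitions = [[] for _ in range(n_nodes)]
    for i, record in enumerate(records):
        partitions[i % n_nodes].append(record)
    return partitions

# 10 records over 8 nodes: records 1-8 go to nodes 1-8,
# record 9 wraps around to node 1, record 10 to node 2.
parts = round_robin(list(range(1, 11)), 8)
```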

Entire
Send all rows down all partitions.
Example: Assume 8 nodes are allocated; then all the records are passed to all 8 nodes.

Same
Preserve the same partitioning.
Example: Two stages in a job (Sort and Dataset). In the Sort stage you have chosen "Hash" partitioning, and in the Dataset stage you have chosen "Same". In the Dataset the data will be preserved with the hash partitioning.

Random
DataStage uses a random algorithm to choose where each row goes. The result of Random is that you cannot know where a row will end up.
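As a toy illustration (not DataStage code), the Entire and Random methods look like this:

```python
import random

# Entire: every node receives a full copy of all the records.
def entire(records, n_nodes):
    return [list(records) for _ in range(n_nodes)]

# Random: each record goes to a randomly chosen node, so you cannot
# predict where any given row will end up.
def random_partition(records, n_nodes, seed=None):
    rng = random.Random(seed)
    partitions = [[] for _ in range(n_nodes)]
    for record in records:
        partitions[rng.randrange(n_nodes)].append(record)
    return partitions
```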

Application Execution: parallel jobs can be executed in two ways:
> Sequential
> Parallel
The first slot of the figure below shows sequential execution. A parallel job can execute in two processing modes, SMP and MPP, shown in the second and third slots of the figure.

Ref -


Thursday, 10 November 2016

DataStage Partitioning #1

The partitioning mechanism divides data into smaller segments, which are then processed independently by each node in parallel. It helps exploit parallel architectures such as SMP, MPP, grid computing and clusters.

Partitioning is logical: it divides memory or mass storage into isolated sections. Memory space is split into many partitions to achieve high parallelism. In DOS systems, you can partition a disk, and each partition behaves like a separate disk drive.

In hash partitioning, no fixed space is allocated to a partition in memory; the partition space is allocated depending on the data.

Why Partition?
•    Ability to run multiple operating systems, or multiple versions of an operating system, on the same server
•    Ability to improve workload balancing and distribution by managing processor allocations across applications and users on the server
•    Ability to leverage hardware models such as “Capacity on Demand” and "Pay as You Grow.”

Types of partition
  • Hash
  • Modulus
  • DB2
  • Auto
  • Random
  • Range
  • Round Robin
  • Entire
  • Same


DataStage inserts partitioners as necessary to ensure correct results, generally choosing Round Robin or Same. Since DataStage has limited awareness of your data and business rules, best practice is to explicitly specify the partitioning whenever processing requires groups of related records.

Key based partition
  • Hash
  • Modulus
  • DB2
  • Range

Hash
Determines the partition based on key value(s). One or more keys with different data types are supported. DataStage's internal algorithm, applied to the key values, determines the partition. All key values are converted to characters before the algorithm is applied.
Example: Key is State. All "CA" rows go into one partition and all "MA" rows go into another; two rows with the same state never go into different partitions.


Modulus
Partition based on the modulus of the key divided by the number of partitions. The key must be an integer type: partition = MOD(key_value, number_of_partitions).
Example: Key is OrderNumber (integer type). Rows with the same order number will all go into the same partition.
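A rough sketch of the two methods just described. This is illustrative only: DataStage's real hash algorithm is internal, so Python's built-in hash of the stringified key stands in for it here, mirroring the note that key values are converted to characters first.

```python
# Hash: any number of keys, of any data type; the key values are
# turned into characters and hashed to pick the partition.
def hash_partition(key_values, n_partitions):
    return hash(''.join(str(k) for k in key_values)) % n_partitions

# Modulus: a single integer key;
# partition = MOD(key_value, number of partitions).
def modulus_partition(order_number, n_partitions):
    return order_number % n_partitions

# All "CA" rows land in one partition; all rows with the same
# OrderNumber land in one partition.
p_ca = hash_partition(['CA'], 4)
p_order = modulus_partition(1008, 4)
```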

DB2
Matches DB2 EEE partitioning; DB2 published its hashing algorithm and DataStage copies it.
Example: This partitioning is used when loading data into a DB2 table. It takes the partition key from the DB2 table being loaded and inserts the records accordingly. If a partition key is defined in the DB2 database it uses that key; otherwise it defaults to the primary key.

Range
The partition is chosen based on a range map, which maps ranges of key values to specified partitions. This is similar to Hash, but the partition mapping is user-determined and the partitions are ordered. Range partitioning requires processing the data twice, which makes it hard to find a reason for using it.

The figure below gives a clear view of key-based partitioning and repartitioning.

DataStage's parallel technology operates by a divide-and-conquer technique, splitting the largest integration jobs into subsets ("partition parallelism") and flowing these subsets concurrently across all available processors ("pipeline parallelism"). This combination of pipeline and partition parallelism delivers true linear scalability (defined as an increase in performance proportional to the number of processors) and makes hardware the only mitigating factor to performance.

However, downstream processes may need data partitioned differently. Consider a transformation that is based on customer last name, but the enriching needs to occur on zip code - for house-holding purposes - with loading into the warehouse based on customer credit card number (more on parallel database interfaces below). With dynamic data re-partitioning, data is re-partitioned on-the-fly between processes - without landing the data to disk - based on the downstream process data partitioning needs.

Ref -


Monday, 7 November 2016

Modify Stage - Drop Columns

Sunday, 6 November 2016

Export the jobs from DS windows client

Exporting/importing DataStage jobs is an occasional activity for a developer (deployment time :-)), but it becomes very tedious if the job list is long or if exporting/importing is a daily routine.

So I have written a batch script (windows script) which we can execute from the Client Machine (where datastage clients are installed) and automate this process.

DSjobExportClient.bat :
The export script reads the job names from the file (ExportJobList.txt) and exports each jobName.dsx from the project to the export base location, maintaining the folder structure specified in the ExportJobList.txt file. The files "ExportJobList.txt" and “” should be updated before running the export script.

Copy the above file to any location on your DS Windows client machine and update the "ExportJobList.txt" and “” files.
-    ExportJobList.txt
-    ExportJobs.bat :

ExportJobList.txt :

DsExportJobsClient.bat :
