Saturday, 31 December 2016

Learning Pandas - DataFrame #2

Friday, 30 December 2016

Learning Pandas - Series #1

Wednesday, 28 December 2016

Learning Graphlab - SFrame #2

In the last post, Learning Graphlab - SFrame #1, we learned the basics of SFrame, like how to create an SFrame and how to add or delete its columns. In this post, we will revise those basics once again and learn some advanced features of SFrame. Have a good learning !!!

You can view the Jupyter Notebook for the same HERE




Thursday, 22 December 2016

DataStage Scenario #17 - Get Transitive relation between columns


Goal : To get the data from two columns which have a transitive relationship between them, that is, if

A -> B
B -> C

then 

A -> C



Input
Col1 Col2
a b
b c
s u
u p
1 2
2 3






Output
Col1 Col2
a c
s p
1 3
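
One way to check the expected output outside DataStage is a simple self-join. The sketch below uses pandas (an assumption for illustration only, not the DataStage solution): join the data to itself on Col2 = Col1 so that each A -> B row picks up its matching B -> C row.

import pandas as pd

df = pd.DataFrame({"Col1": ["a", "b", "s", "u", "1", "2"],
                   "Col2": ["b", "c", "u", "p", "2", "3"]})

# Self-join: the right copy's Col1 must equal the left copy's Col2
merged = df.merge(df, left_on="Col2", right_on="Col1", suffixes=("", "_next"))
result = merged[["Col1", "Col2_next"]].rename(columns={"Col2_next": "Col2"})
print(result)    # a -> c, s -> p, 1 -> 3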





Wednesday, 21 December 2016

R Points #2 - DataFrame & List Basics

Sunday, 18 December 2016

R Points #1 - Matrix & Factor Basics

Saturday, 17 December 2016

R Points #0 - Basics n Vector





Monday, 12 December 2016

vmware player powering on internal error



Last night, I struggled with the "vmware player powering on internal error" for almost an hour while trying to run a VMware OS on my machine, and I tried so many tweaks to resolve it, but with no success.
After a lot of googling, I found one solution which worked for me, so I am sharing it here in case it helps someone stuck like me :)


Thursday, 8 December 2016

Notepad++ tip - Format JSON file


Notepad++ is a very powerful tool with lots of plugins and functionality that can save us a lot of work. Today, we will see how to deal with JSON data in Notepad++.

1. First of all, whenever you open a data or code file, always select the respective language style. Here is how you do it -

Open the data/code file --> Go to the Language menu --> Select the respective language setting (in our case it is J --> JSON)

After doing this, you will see that the code/data text is much easier on your eyes.

2. Install some plugins. For JSON, install the ones below -
a. JSON Viewer
b. JSToolNpp

Go to Plugins --> Plugin Manager --> Show Plugin Manager --> Available --> Select & Install

3. To format the JSON, select all the content, then use Ctrl+Alt+Shift+M or Ctrl+Alt+J
4. Your JSON file has been formatted :-)
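
For example, a compact one-line record (made up purely for illustration) such as

{"name":"sample","tags":["datastage","json"],"active":true}

comes out of the formatter looking like this -

{
    "name": "sample",
    "tags": [
        "datastage",
        "json"
    ],
    "active": true
}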

You can download some useful plugins from here ->  http://bit.ly/2h5iygb
If you don't have access, use this - https://db.tt/H5VKcNA0

Place the plugin folder into your Notepad++ installation directory and restart Notepad++.






Wednesday, 7 December 2016

Import the jobs from DS windows client #iLoveScripting


As we have already discussed a script which can export DataStage jobs from your client system (http://bit.ly/2frNPKj), we can likewise write another one to import the jobs. Let's see how -

DsImportJobsClient.bat :

This script reads all the *.dsx job files from the specified directory and its sub-directories and imports them into the specified project. It can also build (only BUILD) an existing package created in Information Server Manager and send it to the specified location on the client machine.
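
As a rough illustration of the directory scan the script performs, here is a minimal Python sketch. The directory, project name and run_import helper are all assumptions for illustration; substitute your actual DataStage client import command inside run_import.

import os

DSX_DIR = r"C:\exports\jobs"    # assumed export directory
PROJECT = "MY_PROJECT"          # assumed target project

def run_import(dsx_path):
    # Placeholder: call the actual DataStage client import command for dsx_path here
    print("importing %s into %s" % (dsx_path, PROJECT))

# Walk the directory and all sub-directories, picking up every *.dsx file
for root, dirs, files in os.walk(DSX_DIR):
    for name in files:
        if name.lower().endswith(".dsx"):
            run_import(os.path.join(root, name))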

To use the build feature, you need to make sure the package has been created with all the needed jobs, then saved and closed. Only updates to the selected jobs are picked up automatically; to add or delete a job, you need to do it manually.

Modify the Import.properties and ImportJobList.txt files, go to the .bat directory and then execute the importAndBuild.bat.




Import.properties :


ImportJobList.txt :


DsImportJobsClient.bat :






Wednesday, 30 November 2016

Learning Graphlab - SFrame #1


Hoping you went through the last post (link -> Getting Started with Graphlab). In this post we will do some hands-on work with the SFrame datatype of Graphlab, which is similar to the DataFrame of the pandas Python library.

i. Reading a CSV file

ii. Saving a dataset

iii. Loading a dataset

iv. Checking the total number of rows and columns

v. Checking column names and data types

vi. Adding a new column

vii. Deleting a column

viii. Renaming a column

ix. Swapping columns (location)

A combined sketch of all these operations is given below.
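
A minimal sketch of the above, assuming GraphLab Create is installed; the file name data.csv and the column names used here (flag, old_name, new_name, col_a, col_b) are placeholders rather than anything from the original notebook:

import graphlab

sf = graphlab.SFrame.read_csv('data.csv')     # i.   read a CSV file into an SFrame
sf.save('my_dataset')                          # ii.  save the dataset to disk
sf = graphlab.load_sframe('my_dataset')        # iii. load it back

print(sf.num_rows())                           # iv.  number of rows
print(sf.num_columns())                        #      number of columns
print(sf.column_names())                       # v.   column names
print(sf.column_types())                       #      column data types

sf['flag'] = 1                                 # vi.  add a new (constant) column
sf.remove_column('flag')                       # vii. delete a column
sf.rename({'old_name': 'new_name'})            # viii. rename a column (old_name must exist)
sf.swap_columns('col_a', 'col_b')              # ix.  swap the position of two columns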







Sunday, 27 November 2016

Getting Started with Graphlab - A Python library for Machine Learning


Before starting with Graphlab, we have to configure our system with some basic tools such as Python, Jupyter Notebook etc. You can find the 'How-To' at this link - http://bit.ly/2gvuG95

What is GraphLab ??
GraphLab Create is a Python library, backed by a C++ engine, for quickly building large-scale, high-performance data products. Some key features of GraphLab Create are:
  • Analyze terabyte scale data at interactive speeds, on your desktop.
  • A Single platform for tabular data, graphs, text, and images.
  • State of the art machine learning algorithms including deep learning, boosted trees, and factorization machines.
  • Run the same code on your laptop or in a distributed system, using a Hadoop Yarn or EC2 cluster.
  • Focus on tasks or machine learning with the flexible API.
  • Visualize data for exploration and production monitoring.
After installing the Graphlab library, we can use it like any other Python library.

Use Jupyter Notebook to start. Open a Python notebook in Jupyter and execute the commands below to see graphlab working (a combined sketch follows this list) -

a. Importing Graphlab

b. Reading a CSV file
This method parses the input file and converts it into an SFrame variable.

c. Getting started with SFrame

i. View the content of the SFrame variable sf

ii. View the head lines (top rows)

iii. View the tail lines (last rows)
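
A minimal sketch of these first steps, assuming GraphLab Create is installed and using a made-up file name people.csv:

import graphlab                                 # a. import the library

sf = graphlab.SFrame.read_csv('people.csv')     # b. parse the CSV file into an SFrame

sf                                              # c.i.   in a Jupyter cell, this displays the SFrame content
sf.head()                                       # c.ii.  view the top rows
sf.tail()                                       # c.iii. view the last rows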








Monday, 21 November 2016

Reading DSParam - datastage parameter file



I am sharing a utility which can help you read the DSParams file, which holds all the project-level DataStage environment parameters.

It is a utility to view the contents of a DSParams file - useful when trying to see everything the customer has set at the project level.



Usage:
$ cat DSParams | ./DSParamReader.pl | more
or
$ cat DSParams | ./DSParamReader.pl > outputfile


Instructions:
1. copy script text below to a file (DSParamReader.pl) on a UNIX system
2. Set execute permissions on this file. chmod 777 envvar.pl
3. Usually perl is in /usr/bin/perl but you might have to adjust this path if neccessary. (hint "which perl" should tell you which one to use)
4. cat the DSParams file from the project you are concerned with and redirect the output to this script. You may have to put the Fully Qualified Path for this file.
5. capture the output to screen or file. File may be useful to have the customer send the info to you in email.
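
A rough Python sketch of the same idea (not the Perl utility itself): it only assumes that DSParams uses ini-style [section] headers and simply groups the raw lines under them, so verify it against your own DSParams file before relying on it.

import sys

# Read DSParams from stdin and print each [section] header followed by its entries
section = None
for raw in sys.stdin:
    line = raw.rstrip("\n")
    if line.startswith("[") and line.endswith("]"):
        section = line.strip("[]")
        print("\n=== %s ===" % section)
    elif line.strip():
        print("  " + line)

Usage (with a hypothetical script name):
$ cat DSParams | python dsparam_reader.py | more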









Monday, 14 November 2016

DataStage Partitioning #3



Best allocation of Partitions in DataStage for storage area

Srno | No of Ways | Volume of Data | Best Way of Partition | Allocation of Configuration File (Nodes)
-----|------------|----------------|-----------------------|------------------------------------------
1 | DB2 EEE extraction in serial | Low | - | 1
2 | DB2 EEE extraction in parallel | High | Node number = current node (key) | 64 (depends on how many nodes are allocated)
3 | Partition or repartition in the stages of DataStage | Any | Modulus (single integer key) or Hash (any number of keys, any data type) | 8 (depends on how many nodes are allocated for the job)
4 | Writing into DB2 | Any | DB2 | -
5 | Writing into Dataset | Any | Same | 1, 2, 4, 8, 16, 32, 64 etc. (based on the incoming records it writes)
6 | Writing into Sequential File | Low | - | 1

 

Best allocation of Partitions in DataStage for each stage

S. No | Stage | Best Way of Partition | Important Points
------|-------|-----------------------|------------------
1 | Join | Left and right links: Hash or Modulus | All the input links should be sorted on the joining key and partitioned with higher key order.
2 | Lookup | Main link: Hash or Same; Reference link: Entire | The links need not be in sorted order.
3 | Merge | Master and update links: Hash or Modulus | All the input links should be sorted on the merging key and partitioned with higher key order. Pre-sorting makes the merge "lightweight" for memory.
4 | Remove Duplicates, Aggregator | Hash or Modulus | Performs better if the input link is sorted on the key.
5 | Sort | Hash or Modulus | Sorting happens after partitioning.
6 | Transformer, Funnel, Copy, Filter | Same | None
7 | Change Capture | Left and right links: Hash or Modulus | Both input links should be sorted on the key and partitioned with higher key order.





Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/