Friday, 29 September 2017

Letters and Symbols in Markdown

Thursday, 28 September 2017

Mathematics in Markdown


From Wikipedia -
Markdown is a lightweight markup language with plain text formatting syntax, designed so that it can be converted to HTML and many other formats using a tool of the same name. Markdown is often used to format readme files, for writing messages in online discussion forums, and to create rich text using a plain text editor (extensions: *.markdown, *.md).
The best thing about Markdown files is that you can convert them to HTML without any issue.



I was introduced to Markdown files when I started to put my code on GitHub (https://github.com/atulsingh0). It started with a few ups and downs, but once I fell for it, I found it easy to write ReadMe files or math equations in Markdown with a little help.

In this tutorial, I have focused on the mathematics part only. For writing math formulas, Markdown uses LaTeX symbols for Greek letters, brackets, sign operators and lots of other symbols.

I have consolidated a few of them and will add more.
Hoping you will find it useful - Direct Link
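
For example, a few of the LaTeX commands that Markdown math rendering (MathJax and the like) understands - a small illustrative sample, not the full table from the linked sheet:

\alpha, \beta, \gamma, \Delta                % Greek letters
\leq, \geq, \neq, \approx                    % comparison operators
\frac{a}{b}, \sqrt{x}, \sum_{i=1}^{N} x_i    % fractions, roots, sums
\left( \frac{x}{y} \right)                   % auto-sized brackets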










Thursday, 14 September 2017

Evaluation Sequence in Transformer Stage - A Quick DataStage Recipe



Recipe:

What is the evaluation sequence in the Transformer stage, i.e. in what order are stage variables, loop variables and derivations evaluated?

Ingredients:

1. Transformer Stage
     a. Stage Variables
     b. Loop Variables
     c. Derivations


How To:

Evaluate each stage variable initial value
For each input row to process:
    Evaluate each stage variable derivation value, unless the derivation is empty
    For each output link:
        Evaluate each column derivation value
        Write the output record
    Next output link
Next input row


** The stage variables and the columns within a link are evaluated in the order in which they are displayed in the Transformer editor. Similarly, the output links are also evaluated in the order in which they are displayed.
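
To make this ordering concrete, here is a small Python sketch that mimics the sequence above. The stage-variable and link names are made up for illustration; this shows only the logical order, not how DataStage executes internally.

# Illustrative only: mimic the Transformer evaluation order described above.
# Stage variables, columns and links are evaluated in display order.
stage_vars = {"svCount": 0}                         # initial values, evaluated once

input_rows = [{"amt": 10}, {"amt": 25}]             # hypothetical input data

output_links = {                                    # hypothetical output links, in display order
    "lnk_out": lambda row, sv: {"AMT": row["amt"], "ROWNUM": sv["svCount"]},
    "lnk_copy": lambda row, sv: {"AMT": row["amt"]},
}

for row in input_rows:                              # for each input row
    stage_vars["svCount"] += 1                      # stage variable derivations first
    for link_name, derive in output_links.items():  # then each output link, in order
        record = derive(row, stage_vars)            # evaluate each column derivation
        print(link_name, record)                    # "write" the output record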





Tuesday, 12 September 2017

A Newer Version of Jupyter Notebook - Jupyter Lab


We all use Jupyter Notebook (previously known as IPython Notebook) a lot when researching something or just playing around :-)
For those who don't know what it is: it is a browser-based notebook which holds the Python code as well as the executed output. You can export it to HTML, PDF or its native format (*.ipynb).
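
For example, an export to HTML can also be done programmatically with the nbconvert package (the notebook filename here is just a placeholder):

from nbconvert import HTMLExporter

# Convert a notebook file to an HTML document.
exporter = HTMLExporter()
body, resources = exporter.from_filename("my_notebook.ipynb")

with open("my_notebook.html", "w", encoding="utf-8") as f:
    f.write(body)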

Coming back to the topic: the next generation of Jupyter Notebook, JupyterLab, is now available. Try it once and I am sure you will fall in love with it. Let's quickly check how you can get it -
Installation:
$ pip install jupyterlab

Execution:
$ jupyter lab

Features which I like the most:
1. A file browser on the left side of the notebook window, for easy access to files
2. Five quick-access buttons on the leftmost panel (Files, Running, Commands, CellTools & Tabs)
3. Each notebook opens in the same browser tab, i.e. there is one browser tab and multiple Jupyter notebook tabs open inside it


For more details, visit - https://github.com/jupyterlab/jupyterlab






Monday, 11 September 2017

DataStage - Calling a Script on a Remote Server


Step 1: Set up the UNIX servers so that login happens automatically, without prompting for a password.

1. SSH must be installed on both servers (primary and remote).
2. A user ID on both servers.

You can find the step-by-step details at this link - Configuring_SSH_on_Linux


Step 2: Creating Datastage job to run script in a remote server

1. Create a new sequencer job
2. Add an Execute Command stage
3. In the Command text value in the ExecCommand tab, type -

 ssh UserB@ServerB ksh /home/b/test.ksh

This command will execute a script in the remote server.





Friday, 8 September 2017

BI Report Testing Trends


Extending my quite old post (ETL Testing - Trends and Challenge - link) by sharing my thoughts on BI Report Testing Issues and Solutions. Feel free to add your views and comments if any.

Business Intelligence
Business Intelligence (BI) and Data Warehouse (DW) systems allow companies to use their data for data analysis, business scenarios and forecasting, business planning, operation optimization and financial management and compliance. To construct Data Warehouses, Extract, Transform and Load (ETL) technologies help to collect data from various sources, transform the data depending on business rules and needs, and load the data into a destination database.

Consolidating data into a single corporate view enables the gathering of intelligence about your business, pulling from different data sources, such as your Physician, Hospitals, Labs and Claims systems.


  • BI Report Testing Trends
Once the ETL part is tested, the data shown on the reports holds utmost importance. The QA team should verify the reported data against the source data for consistency and accuracy.
  • Verify Report data from source (DWH tables/views)
QA should verify the report data (at field/column level) against the source by writing the required SQL themselves, based on the different filter criteria available on the report filter page.
  • Creating SQLs 
Create SQL queries to fetch and verify the data from source and target. Sometimes it is not possible to reproduce the complex transformations done in ETL; in such cases the data can be exported to a file and the calculations performed there.
  • GUI & Layout
Verifying Report GUI (selection page) and layout (report output layout).
  • Performance verification
Verify the report's performance (the report's response time should be within the predefined time limit specified by the business need). The report's performance can also be tested with multiple concurrent users (the number of users expected to access the report at the same time, a limit that should be defined by the business need).
  • Security verification
Verify that only authorized users can access the report, or specific parts of the report (if those parts should not be available to general users).






Saturday, 12 August 2017

Delete all lines in Notepad++ except lines containing a pattern


Who doesn't want to keep only the items they need when working with lots of junk data, fetching only what is required? :-)

This can be done very easily in Notepad++.

1. Use Ctrl-F to open the Search box and select the "Mark" tab.

2. Put the pattern in the "Find what" box (in my case, I want to keep only lines containing "Singh").




3. Then click on "Mark All" to mark all the lines which have "Singh" in them.

4. When Marked, it will look like below -






5. After marking, close this pop-up and go to Search --> Bookmark --> Remove Unmarked Lines.



As soon as you click on "Remove Unmarked Lines", it deletes all lines other than the marked ones.
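
If you need the same result in a scripted, repeatable way (in the spirit of #iLoveScripting), a small Python sketch can do it as well - the file names and the pattern below are just placeholders:

# Keep only lines containing a pattern, drop everything else
# (same effect as Mark All + Remove Unmarked Lines).
pattern = "Singh"                                        # text to keep, as in the example above
with open("input.txt", encoding="utf-8") as src:         # hypothetical input file
    kept = [line for line in src if pattern in line]
with open("input_filtered.txt", "w", encoding="utf-8") as dst:
    dst.writelines(kept)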




Wednesday, 26 July 2017

View Hidden Files in WinSCP


Recently I was stuck with a problem where I had to download a configuration file from a server, and the file was hidden. Usually we use WinSCP or FileZilla to do this kind of task quickly, so I went for WinSCP. But I was unable to locate the file in WinSCP as it is hidden.

Google helped me once again... This is how we can view hidden files in WinSCP.


Short Way:
  • Use the Ctrl+Alt+H key combination to toggle the display of hidden files.

Long Way:
  • Go to Options --> Preferences --> Panels
  • On the right-hand side, check the Show Hidden Files option






Friday, 14 July 2017

Utility to Get Last Run Log of DataStage Job


I love the new feature of IBM DataStage to fetch the last run log of any job with the help of the "dsjob" command.
In older versions of DataStage it was very tedious to get the last run log, but from v9.1 onwards IBM added an additional option to the dsjob command. Let's see how this works -


  • To Fetch Last Run Log:
  • To Fetch Second Last Run Log
  • To Fetch Third Last Run Log
  • To Fetch Nth Last Run Log






Thursday, 13 July 2017

How to Run Python Code in Notepad++


Can you run Python code from Notepad++? The question seems a little odd, but we can tweak our Notepad++ settings and configure it that way. Let's see how -



1. Write a few lines of Python code. You can use the lines below -

print("Today we are going to learn how to use notepad++ to run the python code")
print("As first step, we have to write few python code line")

input("Press Enter to Exit..........")




2. Save this file; in my case, it is saved as "npp_run.py".
3. Check the Python executable path on your system. In my case it is C:\tools\Anaconda3\python.exe (it can be different depending on your Python installation).

4. Now go to the Run menu or press F5. This will open a run window as below -

5. Put the command below in 'The Program to Run' box (first the generic form, then with my actual Python path) -
Python_Executable_Path $(FULL_CURRENT_PATH)
C:\tools\Anaconda3\python.exe $(FULL_CURRENT_PATH)



6. Save this run configuration by clicking the Save button in the same window.

7. Choose a shortcut key combination for Run (you can use Ctrl + Alt + Shift + a key) and save.

8. You can see this combination under Run Menu.


9. Now you can run the Python code by pressing that key combination (in my case - F9).




Things to remember:
1. This tweak is not a replacement for a Python IDE :-) such as PyCharm, Spyder or many others.
2. Always put input("Press Enter to Exit..........") on the very last line of your code, otherwise the console window closes as soon as the script finishes and you will not be able to see the output.





Friday, 7 July 2017

ICONV mystery - the UV function


Iconv (Internal CONVersion) is a function supported by the UniVerse database (UV DB) to convert data, not only dates, into an internal format. DataStage server jobs use lots of UV functions to manipulate data.

Today I will try to unwrap the mystery behind the Iconv function and put the details in simpler words. We will not go into general data conversion, only the date conversion used by DataStage :-)

Like most other date functions (the parallel ones), Iconv also accepts the date (as a string) and its format.

Suppose, Date =   June 17, 2017

To Convert this date into internal format, we have to use -

Iconv("2017-06-17", D-YMD)  = 18066
Iconv("2017/06/17", D/YMD)   = 18066
Iconv("2017:17:06", D:YDM)  = 18066
Iconv("17-06-17", D-Y2MD)    = 18066



D-  --> D for the date conversion code, followed by the delimiter character
Y --> year in YYYY
M --> month in MM
D --> day in DD

As we can see, if we provide the date format along with the date string, Iconv converts the date to an integer. This matters because DataStage can now understand the given date, and we can use the Oconv function to re-format it as required. The internal value is simply the number of days since 31 December 1967, which is day 0 of the UniVerse internal date format.
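
As a quick sanity check outside DataStage, the same internal number can be reproduced in Python, assuming the day-0 epoch of 31 December 1967:

from datetime import date

# UniVerse/Pick internal dates count days from 31 Dec 1967 (day 0).
epoch = date(1967, 12, 31)
internal = (date(2017, 6, 17) - epoch).days
print(internal)   # 18066, matching Iconv("2017-06-17", "D-YMD")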

I will cover Oconv in next post, till then Keep Learning !!





Wednesday, 5 July 2017

Conditionally Aborting Jobs with Transformer Stage


How do you develop a job which stops processing when FROM_DATE and TO_DATE are equal in the data? Or how do you abort the job when the reject row count is more than 50?

The above scenarios can be implemented using the Transformer stage, but how? Let's check this out -

  • The Transformer can be used to conditionally abort a job when incoming data matches a specific rule. 
    • In our case 1, it is FROM_DATE  = TO_DATE 
    • In our case 2, it is some reject condition 
  • Create a new output link that will handle rows that match the abort rule. 
  • Within the link constraints dialog box, apply the abort rule to this output link
  • Set the “Abort After Rows” count to the number of rows allowed before the job should be aborted .
    • In case 1, it should be 1, as we want to abort the job as soon as FROM_DATE equals TO_DATE 
    • In case 2, it should be 50, as we want to abort the job when the reject condition has more than 50 records 

But, since the Transformer will abort the entire job flow immediately, it is possible that valid rows will not have been flushed from Sequential File (export) buffers, or committed to database tables.
It is important to set the Sequential File buffer flush or database commit parameters appropriately, otherwise we have to manually remove the data that has already been written to the sequential file or database.






Tuesday, 20 June 2017

Crontab for Windows #iLoveScripting


While working on one of my projects, I needed to take a backup of all the work I had completed, because the workspace is shared among many developers.
So, being a Linux person, I was looking for something simple like crontab but ended up with Windows Task Scheduler.

The tool is not as simple as Linux crontab, but it did the job I asked of it :-)

How to Use Task Scheduler - 

  • Login with Admin privilege user account  
  • Open Run and type "Taskschd.msc"
  •  Or  Go to Start --> Control Panel --> System and Maintenance --> Administrative Tools --> Task Scheduler
  • Click on "Create Task" on right hand side
  • This will open a Wizard to create Task
  • Fill in the task name, owner, privileges and "Configure for" as below - 
  • Now click on the next tab, Triggers. Here you have to define the time at which you want to execute the program.
  • You can fill in the different settings to customize your schedule.
  • Now click on the Actions tab.
  • On this tab you define the action, i.e. which program/script should be executed when the task is triggered.
  • Click OK
  • You can see your task created under "Task Scheduler Library"


For more details on Task Scheduler, you can visit the link below -
https://technet.microsoft.com/en-us/library/cc748993(v=ws.11).aspx





Wednesday, 31 May 2017

Remove ctrl-M character from all files within Directory #iLoveScripting


Continuing our journey on #iLoveScripting...
This script does the same task as "clnM.sh", but it accepts a directory path as input rather than a filename. It iterates through each file within the given directory and removes all Ctrl-M characters.


If you are unable to see the Script, Please find it here - LINK
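
As a rough sketch of the same idea (not the author's script itself), a Python version that strips carriage returns from every file in a directory could look like this:

import sys
from pathlib import Path

# Strip Ctrl-M (carriage return, \r) characters from every file in a directory.
def clean_dir(directory):
    for path in Path(directory).iterdir():
        if path.is_file():
            data = path.read_bytes().replace(b"\r", b"")
            path.write_bytes(data)
            print("cleaned:", path)

if __name__ == "__main__":
    clean_dir(sys.argv[1])   # usage: python clean_dir.py /path/to/dir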








Tuesday, 30 May 2017

Remove ctrl-M character from file #iLoveScripting


This is my first post under #iLoveScripting, a series which will have lots of shell scripts that help me in my day-to-day tasks, shared here so they can ease your work as well.

The very magical script which I use is "clnM.sh". This script removes the Ctrl-M characters (^M) from your Windows files.

Usage:  clnM.sh <FILE>


If you are unable to see the Script, Please find it here - LINK





Sunday, 21 May 2017

dos2unix - A script to convert DOS to LINUX formatting #iLoveScripting



dos2unix - a simple filter to convert text files in DOS format to UNIX/Linux end-of-line conventions by removing the carriage return character (\r). This leaves only the newline character (\n), which Unix expects.

Usage:
dos2unix [file1] :  Remove DOS End of Line (EOL) char from file1, write back to file1
dos2unix [file1] [file2] : Remove DOS EOL char from file1, write to file2
dos2unix -d [directory] : Remove DOS EOL char from all files in directory




Friday, 12 May 2017

#3 - Measuring Data Similarity or Dissimilarity


Continued from
'Measuring Data Similarity or Dissimilarity #1' and
'Measuring Data Similarity or Dissimilarity #2'.


3. For Ordinal Attributes:

An ordinal attribute is an attribute whose possible values have a meaningful order or ranking among them, but where the magnitude between successive values is not known. Ordinal values are the same as categorical values, but with an order.

Such as, For "Performance" columns Values are - Best, Better, Good, Average, Below Average, Bad

These values are categorical values with an order or rank, hence called ordinal values. Ordinal attributes can also be derived from the discretization of numeric attributes by splitting the value range into a finite number of ordered categories.

We assign a rank to these categories to calculate the similarity or dissimilarity, i.e. an attribute f having N possible states can have ranks `1, 2, 3, ..., N`.


Measuring Data Similarity or Dissimilarity for Ordinal Attributes


How to Calculate Similarity or Dissimilarity: 

1. Assign the rank `R_if` to each category of attribute f, which has N possible states.
2. Normalize the rank into the range [0.0, 1.0] so that each attribute has equal weight. The normalized rank `Z_if` can be calculated as

`Z_if = \frac{R_if - 1}{N - 1}`

3. Now the similarity or dissimilarity can be calculated with any of the distance measuring techniques from 'Measuring Data Similarity or Dissimilarity #2'.
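
A small Python illustration of these steps, using the hypothetical "Performance" attribute above (the rank assignment and the choice of Manhattan distance are just for illustration):

# Ordinal attribute: map categories to ranks, normalize to [0, 1], then use any distance measure.
ranks = {"Bad": 1, "Below Average": 2, "Average": 3, "Good": 4, "Better": 5, "Best": 6}
N = len(ranks)

def normalize(category):
    return (ranks[category] - 1) / (N - 1)      # Z_if = (R_if - 1) / (N - 1)

# Dissimilarity between two objects on this attribute (1-D Manhattan distance):
d = abs(normalize("Best") - normalize("Average"))
print(d)   # 1.0 - 0.4 = 0.6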







Tuesday, 9 May 2017

Measuring Data Similarity or Dissimilarity #2


Continuing from our last discussion, 'Measuring Data Similarity or Dissimilarity #1', in this post we are going to see how to calculate the similarity or dissimilarity between numeric data types.

2. For Numeric Attribute:

For measuring the dissimilarity between two numeric data points, the easiest and most used way is to calculate the 'Euclidean distance'; the higher the distance, the higher the dissimilarity.
There are two more distance measures, named 'Manhattan distance' and 'Minkowski distance'. We are going to look into these one by one.


a. Euclidean distance: 

Euclidean distance is widely used to calculate the dissimilarity between numeric data points. It is actually derived from the 'Pythagorean theorem', so it is also known as the 'Pythagorean metric' or the `L^2` norm.

The Euclidean distance between two points `p(x_1, y_1)` and `q(x_2, y_2)` is the length of the line segment connecting p and q.

`dis(p,q) = dis(q,p) = \sqrt((x_2 - x_1)^2 + (y_2 - y_1)^2) = \sqrt(\sum_(i=1)^N(q_i - p_i)^2)`

In one dimension:

`dis(p,q) = dis(q,p) = \sqrt((q - p)^2) = |q - p|`

In two dimensions:

`dis(p,q) = dis(q,p) = \sqrt((q_1 - p_1)^2 + (q_2 - p_2)^2)`

In three dimensions:

`dis(p,q) = dis(q,p) = \sqrt((q_1 - p_1)^2 + (q_2 - p_2)^2 + (q_3 - p_3)^2)`

In N dimensions:

`dis(p,q) = dis(q,p) = \sqrt((q_1 - p_1)^2 + (q_2 - p_2)^2 + (q_3 - p_3)^2 + ... + (q_N - p_N)^2)`


b. Manhattan distance: 

It is also known as "City Block" distance as it is calculated same as we calculate the distance between any two block of city. It is simple difference between the data points.

`dis(p, q) = |(x_2 - x_1)| + |(y_2 - y_1)| = \sum_(i=1)^N|(q_i - p_i)|`

Manhattan distance is also known as the `L^1` norm.


c. Minkowski distance: 

This is the generalized form of the Euclidean and Manhattan distances and is represented as -

`dis(p,q) = dis(q,p) = [|x_2 - x_1|^n + |y_2 - y_1|^n]^{1/n} = [\sum_(i=1)^N|q_i - p_i|^n]^{1/n}`

where n = 1, 2, 3, ... (n = 1 gives the Manhattan distance and n = 2 gives the Euclidean distance).
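
A quick Python sketch of the three distances on two example points (the points themselves are arbitrary):

# Euclidean, Manhattan and Minkowski distances between two N-dimensional points.
def minkowski(p, q, n):
    return sum(abs(qi - pi) ** n for pi, qi in zip(p, q)) ** (1 / n)

p, q = (1, 2, 3), (4, 6, 3)
print(minkowski(p, q, 1))   # Manhattan (L1): 3 + 4 + 0 = 7
print(minkowski(p, q, 2))   # Euclidean (L2): sqrt(9 + 16 + 0) = 5.0
print(minkowski(p, q, 3))   # Minkowski with n = 3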







Measuring Data Similarity or Dissimilarity #1


Yet another question in data mining is how to measure whether two data objects are similar or not. There are many ways to calculate these values depending on the data type. Let's look into these methods -

1. For Binary Attribute:

Binary attributes are those having only two states, 0 or 1, where 0 means the attribute is absent and 1 means it is present. For calculating the similarity/dissimilarity between objects described by binary attributes, we use a contingency table -

Contingency table (counts of attributes over objects i and j):

q - number of attributes where both i and j equal 1
r - number of attributes where i is 1 and j is 0
s - number of attributes where i is 0 and j is 1
t - number of attributes where both i and j equal 0
p - total (q + r + s + t)

a. Symmetric Binary Dissimilarity - 

For a symmetric binary attribute, each state is equally valuable. If objects i and j are described by symmetric binary attributes, then the dissimilarity is calculated as -

`  d(i, j) = \frac{r + s}{q + r + s + t}  `


b. Asymmetric Binary Dissimilarity - 

For an asymmetric binary attribute, the two states are not equally important; one state overshadows the other. Such binary attributes are often called "monary" (having one state). For this kind of attribute the number of negative matches, t, is considered unimportant and is dropped, so the dissimilarity is calculated as -

`d(i, j) = \frac{r + s}{q + r + s}`

Likewise, we can calculate the similarity (asymmetric binary similarity) as

` sim(i, j) = 1 - d(i, j) `

which leaves us with

` JC = sim(i, j) = \frac{q}{q + r + s} `

The coefficient sim(i, j) is also known as the Jaccard coefficient.
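
A small Python illustration of these formulas, using two hypothetical objects described by binary attributes (1 = present, 0 = absent):

# Count q, r, s, t from two binary vectors and compute the dissimilarities above.
i = [1, 0, 1, 1, 0, 0]
j = [1, 1, 1, 0, 0, 0]

q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)   # both 1
r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)   # i = 1, j = 0
s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)   # i = 0, j = 1
t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)   # both 0

d_symmetric  = (r + s) / (q + r + s + t)
d_asymmetric = (r + s) / (q + r + s)
jaccard      = q / (q + r + s)                          # = 1 - d_asymmetric

print(q, r, s, t)                 # 2 1 1 2
print(d_symmetric)                # 2/6 = 0.333...
print(d_asymmetric, jaccard)      # 0.5 0.5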






Tuesday, 25 April 2017

Graphical Display of Basic Stats of Data



After a long time I got a chance to share something with you guys, so I am feeling awesome :-). Today we are going to see the graphical display of data statistics, sometimes also called Exploratory Data Analysis (EDA). This is the best way to understand your data in very little time and set your analysis path. So without more chat, let's start -

1. Scatter Plot

A very basic, very easy and most-used EDA (Exploratory Data Analysis) technique. It is a 2-D plot between X and Y variables, where X and Y are numeric data features or columns.
With this plot we can easily see whether there is any relationship, pattern or trend between these two features, or whether any outliers exist. It is also useful for exploring possible correlation relationships; correlation can be positive, negative or neutral.

Now, let's look at a scatter plot. I am using the Iris dataset and the Python matplotlib library for this illustration -

[Figure: scatter plot of two Iris features]

2. Histogram

The histogram is one of the oldest plotting techniques for summarizing the data distribution of an attribute X. X is a numerical feature, and the height of each bar is a frequency. The resulting plot is also called a bar chart.

[Figure: histogram of an Iris feature]

3. Quantile Plot (Box Plot)

Quantile or box plots are also used to display the distribution of a univariate variable, as well as to show percentile information and help detect outliers.

[Figure: quantile/box plot of the Iris features]
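
The original plots are embedded as images; a minimal sketch of how such plots can be produced with matplotlib and the Iris dataset (here loaded via scikit-learn) is below:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
sepal_len, sepal_wid = iris.data[:, 0], iris.data[:, 1]

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].scatter(sepal_len, sepal_wid, c=iris.target)   # scatter plot of two features
axes[0].set_title("Scatter: sepal length vs width")
axes[1].hist(sepal_len, bins=20)                       # histogram of one feature
axes[1].set_title("Histogram: sepal length")
axes[2].boxplot(iris.data)                             # box plot per feature (percentiles, outliers)
axes[2].set_title("Box plot: all features")
plt.tight_layout()
plt.show()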

Keep watching this space for further updates.

Happy Learning



