
Saturday, 31 December 2016

Learning Pandas - DataFrame #2

Saturday, 30 July 2016

#1 How to Copy DataSet from One Server to Another Server

Hi Guys...
I've been asked many times how to move or copy a dataset from one server to another, so here is the approach I follow.

At the very first step, analyze whether you can avoid this altogether by some other means: create a sequential file and FTP it, load the data into a temporary table that is accessible on the other server, or, if you are using the DataStage packs, exchange it via MQ, XML, or JSON formats. I suggest these solutions first because they are easy to design and they guarantee data quality at the other end.

If the above solutions are not possible, please follow the steps below -

Points I am going to cover here -
1. Generating a dummy dataset
2. Files which we need to move
3. Why we need these files only
4. Reading the dataset on another server


1. Generating a dummy dataset

I have created a dummy job that generates a dataset using the default APT_CONFIG_FILE, which has 2 nodes.

Here, I am generating 10 dummy rows with the help of the Row Generator stage and storing them in a dataset.

a. Config File - I am using the default config file (replaced the server name in "fastname" with serverA)

APT_CONFIG_FILE = /opt/IBM/InformationServer/Server/Configurations/default.apt

Check out the "resource disk" location in the config file; we will need it for further processing.

RESOURCE DISK = /opt/IBM/InformationServer/Server/DataSets

b. Dataset location - I have created this dataset in my home directory, named dummy.ds

DATASET LOC = /home/atul/dummy.ds
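With those two locations in hand, the files that would need to move can be sketched as a dry run. Note this is only an illustration: the destination serverB, the scp transfer, and the `dummy.ds.*` segment-file pattern are assumptions, and the real segment file names are the ones listed inside the descriptor.

```shell
#!/bin/sh
# Dry-run sketch of which files would move to the other server.
# serverB and the scp commands are assumptions for illustration;
# the descriptor and resource-disk paths come from the job above.
DESCRIPTOR=/home/atul/dummy.ds
RESOURCE_DISK=/opt/IBM/InformationServer/Server/DataSets

# 1. The .ds file is only a small descriptor; it records the
#    config file and the paths of the data segment files.
echo "scp $DESCRIPTOR serverB:$DESCRIPTOR"

# 2. The actual data lives in segment files under the resource
#    disk of each node, so those must be copied as well
#    (the wildcard pattern here is a simplification).
echo "scp $RESOURCE_DISK/dummy.ds.* serverB:$RESOURCE_DISK/"
```

Because the descriptor stores absolute paths, the resource disk on the target server has to match the paths recorded in the .ds file, which is why the config files on both servers matter.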

Keep looking for next post........


Monday, 14 March 2016

Machine Learning Links you must Visit

1. Scikit-Learn Tutorial Series -
2. 7 Free Machine Learning Courses - 
3. k-nearest neighbor algorithm using Python -
4. 7 Steps to Mastering Machine Learning With Python -

Analytics in Python 

1. Learning Pandas #1 - Series -
2. Learning Pandas #2 - DataFrame  -
3. Learning Pandas #3 - Working on Summary & Missing Data -
4. Learning Pandas #4 - Hierarchical Indexing -


DataScience Links you must Visit

1. Python (and R) for Data Science - sample code, libraries, projects, tutorials -
2. 19 Worst Mistakes at Data Science Job Interviews - 


Wednesday, 17 February 2016

Python Points #10c - File Methods

Tuesday, 12 January 2016

Python Points #6 - Strings

Sunday, 13 December 2015

Swirl Learn R in R

Step 1:  Install R
Step 2:  Install RStudio
Step 3:  Install swirl in R or RStudio
> install.packages("swirl")              
Step 4: Using R
> library("swirl")
> swirl()            

Step 5: Install R courses from the R Course Repository


Data Science Tools Installation in Linux

Wednesday, 9 September 2015

DataStage Terminology

 DataStage Term                     Description
DataStage Administrator Client program used to manage DataStage projects.
DataStage server Server engine component to which the DataStage client programs connect.
DataStage Director Client program used to run and monitor DataStage jobs.
DataStage Designer Graphical design tool that developers use to design and develop DataStage jobs.
DataStage Manager Program used to view and manage the contents of the DataStage repository. Please refer to the "DataStage Manager Guide."
Stage Component that represents a data source, a processing step, or a target in a DataStage job.
Source The origin from which data is extracted, for example a file or a database table.
Target The destination of the data; for example, the file to be loaded to AML (output by DataStage).
Category Category name used to classify jobs in DataStage.
Container A reusable group of stages that encapsulates a common process so it can be called from jobs.
Job Program that defines how to extract, transform, integrate, and load data into a target database.
Job templates Model jobs used as a starting point for jobs that perform similar processing.
Job parameters Variables included in the job design, for example a file name or a password.
Job sequence A controlling job that starts and runs other jobs.
Scratch disk Disk space used to store temporary data, such as virtual records of a data set.
Table definition Definition that describes the required data, including information about the associated tables and columns. Also referred to as metadata.
Partitioning The DataStage mechanism that splits data so that large volumes can be processed at high speed.
Parallel engine Engine that controls DataStage jobs running on multiple nodes.
Parallel jobs The DataStage job type that can use parallel processing.
Project A collection of jobs and the components required to develop and run them; the entry point to DataStage from the client. A DataStage project must be licensed.
Metadata Data about data; for example, a table definition that describes the columns in which data is structured.
Link Connects the stages of a job to form a data flow or a reference lookup.
Routine A function that is used in common across jobs.
Column definition Defines the columns contained in a data table, including the name and data type of each column.
Environmental parameters Variables referenced in the job design, for example a file name or a password.
DB2 stage Stage that reads from and writes to a DB2 database.
FileSet stage Stage that stores data in a collection of files.
Lookup Stage Stage that performs lookups against reference data held in a file, text file, or table used in DataStage.
LookupFileSet Stage Stage that stores a lookup table.
Sequential Stage Stage that reads and writes text files.
Custom Stage Stage implemented in the C language for processing that cannot be achieved with the stages DataStage provides as standard.
Copy Stage Stage that copies a data set.
Stage Generator Stage that generates a dummy data set.
DataSet stage Stage that reads and writes the data files used by the parallel engine.
Funnel Stage Stage that combines multiple input data sets into a single output data set.
Filter Stage Stage that extracts records that meet specified criteria from the input data set.
Merge Stage Stage that joins two or more input records.
LinkCorrector Stage Stage that collects data that was previously partitioned.
RemoveDuplicate Stage Stage that removes duplicate records from a data set.
ChangeCapture Stage Stage that compares two data sets and records the differences between them.
RowImport Stage Stage that imports columns from a string or binary column.
RowExport Stage Stage that exports columns of other types to a string or binary column.
Transformer Stage Stage that edits fields and performs type conversions.
Modify Stage Stage that converts data to a specified data type or converts NULL values to a specified value.
XML input stage Stage that reads XML files and extracts the required elements from the XML data.
Sort Stage Stage that sorts data in ascending or descending order.
Join Stage Stage that joins two or more input records.
RowGenerator Stage Stage that generates rows of mock data.
ColumnGenerator Stage Stage that adds columns of mock data to an existing data set.
Aggregator stage Stage that aggregates data.
Pivot Stage Stage that converts multiple columns of a record into multiple rows.
Peek Stage Stage that prints record column values, typically for debugging.
Stream link Link that represents a flow of data.
Reference Links Input links that represent reference data.
Reject link Link that outputs the data that does not meet the criteria you specify.
Integer type Data type that represents integer values.
Decimal type Data type that represents numbers containing a decimal point.
NULL value Special value indicating an unknown value; not the same as 0, a blank, or an empty string.
DataSet A collection of data.
SQLStatement Statement used to manipulate the data in a table.
TWS Stands for Tivoli Workload Scheduler, a product used to create job nets.
Hash One of the data-partitioning methods in DataStage; partitioning is performed using a hash value.
Modulus One of the data-partitioning methods in DataStage; partitioning is performed by taking the key value modulo the number of partitions.
Same One of the data-partitioning methods in DataStage; the partitions output by the previous stage are passed through as-is, without repartitioning.
Job Properties Property sheet used to configure settings for a job.
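To make the Hash, Modulus, and Same partitioning entries above more concrete, here is a tiny shell sketch of modulus partitioning. This is not DataStage itself, just the arithmetic, with a made-up integer key and a 2-partition layout:

```shell
#!/bin/sh
# Modulus partitioning sketch: each record lands in partition
# (key mod number_of_partitions). Keys and partition count are
# invented for illustration.
PARTITIONS=2
for key in 10 11 12 13; do
    echo "key=$key -> partition $((key % PARTITIONS))"
done
```

Hash partitioning works the same way except that a hash function is applied to the key first, which also handles non-numeric keys.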


Saturday, 22 August 2015

Linux Shell Script Scenario - 1

Write a Shell Script to read a parameter file and run the other script with these parameters.

Parameter File :

Script_Name = 'script.ksh'
Arg1       = 13
Arg2       = 36 

Now, Read this parameter file and kick off the command like below -

./script.ksh 13 36
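One possible solution sketch is below. It creates the parameter file shown above (the file name params.txt is an assumption), parses each value by stripping spaces and quotes, and builds the command line; the final `echo` stands in for actually executing the script.

```shell
#!/bin/sh
# Sketch: read the parameter file and build the command line.
# The file name "params.txt" is assumed for this example.
cat > params.txt <<'EOF'
Script_Name = 'script.ksh'
Arg1       = 13
Arg2       = 36
EOF

# Extract the value for a given key: take everything after '=',
# then strip spaces and single quotes.
get_val() {
    grep "^$1" params.txt | cut -d'=' -f2 | tr -d " '"
}

SCRIPT=$(get_val Script_Name)
ARG1=$(get_val Arg1)
ARG2=$(get_val Arg2)

# In the real scenario this would be:  ./$SCRIPT $ARG1 $ARG2
echo "./$SCRIPT $ARG1 $ARG2"
rm -f params.txt
```

Running it prints `./script.ksh 13 36`, the command the scenario asks for.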

For more scenarios -  CLICK HERE

Thursday, 20 August 2015

DataStage - A Journey from VMark to IBM

DataStage was conceived at VMark, a spin-off from Prime Computer that developed two notable products: the UniVerse database and the DataStage ETL tool. The first VMark ETL prototype was built by Lee Scheffler in the first half of 1996[1]. Peter Weyman was VMark VP of Strategy and identified the ETL market as an opportunity. He appointed Lee Scheffler as the architect and conceived the product brand name "Stage" to signify modularity and component-orientation[2]. This tag was used to name DataStage and was subsequently used in the related products QualityStage, ProfileStage, MetaStage and AuditStage. Lee Scheffler presented the DataStage product overview to the board of VMark in June 1996 and it was approved for development. The product was in alpha testing in October, beta testing in November, and was generally available in January 1997.

VMark acquired UniData in October 1997 and renamed itself Ardent Software[3]. In 1999 Ardent Software was acquired by Informix[4], the database software vendor. In April 2001 IBM acquired Informix and took just the database business, leaving the data integration tools to be spun off as an independent software company called Ascential Software[5]. In March 2005 IBM acquired Ascential Software[6] and made DataStage part of the WebSphere family as WebSphere DataStage. In 2006 the product was released as part of the IBM Information Server under the Information Management family, but was still known as WebSphere DataStage. In 2008 the suite was renamed InfoSphere Information Server and the product was renamed InfoSphere DataStage.

DataStage Editions
Enterprise Edition: the name given to the version of DataStage that had a parallel processing architecture and parallel ETL jobs.
Server Edition: the name of the original version of DataStage representing Server Jobs. Early DataStage versions only contained Server Jobs. DataStage 5 added Sequence Jobs and DataStage 6 added Parallel Jobs via Enterprise Edition.
MVS Edition: mainframe jobs, developed on a Windows or Unix/Linux platform and transferred to the mainframe as compiled mainframe jobs.
DataStage for PeopleSoft: a server edition with prebuilt PeopleSoft EPM jobs under an OEM arrangement with PeopleSoft and Oracle Corporation.
DataStage TX: for processing complex transactions and messages, formerly known as Mercator.
DataStage SOA: Real Time Integration pack can turn server or parallel jobs into SOA services.

Tuesday, 18 August 2015

Putty - Command Line Magic

We all use the SSH client PuTTY in our day-to-day tasks, and it is very irritating to log in to different servers again and again. Today I come up with the PuTTY command line, which makes this very easy. Want to log in? Just a click and voila!!!

So let's start --

a. First of all, make a shortcut of your putty.exe file by right click --> Send to --> Desktop
b. This will display like below; just rename it with your server name or address so you know which server it is going to connect to when you click on it.

c. Here, I have used my Linux server 192.168.37.129
d. Now, Right click on Putty Shortcut  ---> Properties. This will display like below -

e. We have to edit the Target command  -- 

For me  :-
Server Address - 192.168.37.129
User Name - atul
Password - atul

edited command ---    -ssh user@server -pw password
For my case  ---    -ssh atul@192.168.37.129 -pw atul

Add this command to the Target value, after whatever exists there. So the Target's new value is (in my case) -   C:\_mine\putty.exe -ssh atul@192.168.37.129 -pw atul

f. Click on Apply and OK.
g. For accessing without entering username and password, simply click on edited shortcut.

** Caution ** :  Do not edit PuTTY shortcuts like this on a public computer, as your username and password are stored in plain text and can be misused by anyone.

Saturday, 15 August 2015

MongoDB - Installation and Configuration in Linux

MongoDB is an open-source document database, and the leading NoSQL database. It is written in C++.
MongoDB features:

    Document-Oriented Storage
    Full Index Support
    Replication & High Availability
    Fast In-Place Updates


Today, we will see how to install and run MongoDB.
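As a preview, the install typically looks like the following dry-run sketch. The package name, the yum-based distro, and the service command are assumptions for illustration; the official mongodb-org repository would need to be configured first, and the details depend on your distribution.

```shell
#!/bin/sh
# Dry-run sketch of a typical MongoDB install on a yum-based
# distro. Package and service names are assumptions here.
MONGO_PKG=mongodb-org
echo "sudo yum install -y $MONGO_PKG"   # install server + shell tools
echo "sudo service mongod start"        # start the daemon
echo "mongo --eval 'db.version()'"      # connect and verify the install
```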