Showing posts with label dataset. Show all posts

Tuesday, 16 January 2018

Python Pickle - to save your efforts


This tutorial is for Python beginners who have just started getting their hands dirty in Python :-) Often, when we are working on a data set and have completed a lot of cleaning and pre-processing steps, the gurus advise saving that intermediate dataset. That way you avoid redoing all the steps if something unexpected happens with Python, the Jupyter notebook, or your system (such as a reboot without your permission, weird though :-/).



So, let's learn how to save any dataset or variable with the pickle library -

To Import:


To Save: 


To Read:
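
The three steps above (import, save, read) can be sketched together as follows. This is a minimal example, not the original post's code: the variable `cleaned_data`, the file name `cleaned_data.pkl`, and the use of the system temp directory are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a cleaned, pre-processed dataset.
cleaned_data = {"names": ["a", "b"], "scores": [1.5, 2.5]}

# Hypothetical file location; any writable path works.
pickle_path = os.path.join(tempfile.gettempdir(), "cleaned_data.pkl")

# To save: open the file in binary write mode and dump the object.
with open(pickle_path, "wb") as f:
    pickle.dump(cleaned_data, f)

# To read: open the file in binary read mode and load the object back.
with open(pickle_path, "rb") as f:
    restored_data = pickle.load(f)

# The restored object should equal the one that was saved.
print(restored_data == cleaned_data)
```

Only load pickle files that you created yourself or otherwise trust, because unpickling can execute arbitrary code.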





Like the pages below to get updates
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Wednesday, 3 August 2016

#3 How to Copy DataSet from One Server to Another Server

This post is the third and last in the How to Copy DataSet from One Server to Another Server series.

We have generated and populated a dataset and identified the files which we need to move from serverA to another server, serverB.

Continued...

4. Reading the dataset on another server

This is the most crucial step. All 4 files have now been moved to serverB or to a common location which is accessible from serverB.

In my case, the common dir is my home - /home/users/atul


A. Change the default.apt file
We need to change the fastname in the default.apt (config file) that we copied from serverA [NOT the default.apt of serverB].

Open the file in vi or any other text editor and change the fastname value from serverA to serverB.


Temporarily create the "resource disk" and "resource scratchdisk" locations defined in the above config file if they do not already exist.
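
As a rough sketch of step A in Python, the fastname change and the list of directories to create could look like this. The sample config text, the scratch disk path, and the helper names `set_fastname` and `resource_paths` are assumptions for illustration only; your copied default.apt may differ.

```python
import re

# Hypothetical sample of a copied two-node config file (not the author's actual file).
SAMPLE_CONFIG = '''{
  node "node1"
  {
    fastname "serverA"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/DataSets" {pools ""}
    resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
  }
  node "node2"
  {
    fastname "serverA"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/DataSets" {pools ""}
    resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
  }
}
'''


def set_fastname(config_text, new_host):
    """Return config_text with every fastname value replaced by new_host."""
    pattern = re.compile(r'(fastname\s+")[^"]*(")')
    return pattern.sub(lambda m: m.group(1) + new_host + m.group(2), config_text)


def resource_paths(config_text):
    """Return the distinct 'resource disk' and 'resource scratchdisk' paths."""
    found = re.findall(r'resource\s+(?:disk|scratchdisk)\s+"([^"]+)"', config_text)
    return sorted(set(found))


updated_config = set_fastname(SAMPLE_CONFIG, "serverB")
paths_to_create = resource_paths(updated_config)
print(paths_to_create)
```

After reviewing the printed paths, each missing directory can be created on serverB with `os.makedirs(path, exist_ok=True)`, given sufficient permissions.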

B. Copy the dataset data files 

Copy the dataset data files from the common directory to the "resource disk" location defined in the config file.

cp ~/dummy.ds.* /opt/IBM/InformationServer/Server/DataSets/


Now the files are located as follows -

Config file and Dataset descriptor file - my home dir or common dir
Dataset data files - /opt/IBM/InformationServer/Server/DataSets/
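
Before designing the read job, the layout above can be sanity-checked with a small Python sketch. The function name `check_layout` and its default file names are hypothetical; pass the directories used on your own server.

```python
import glob
import os


def check_layout(common_dir, resource_disk,
                 dataset_name="dummy.ds", config_name="default.apt"):
    """Return a list of problems found in the expected file layout."""
    problems = []
    # The config file and the descriptor file should sit in the common dir.
    for name in (config_name, dataset_name):
        if not os.path.isfile(os.path.join(common_dir, name)):
            problems.append("missing in common dir: " + name)
    # The data files (dummy.ds.*) should sit in the resource disk location.
    if not glob.glob(os.path.join(resource_disk, dataset_name + ".*")):
        problems.append("no data files in resource disk for: " + dataset_name)
    return problems


# Example call with the paths from this post (run it on serverB):
# print(check_layout("/home/users/atul",
#                    "/opt/IBM/InformationServer/Server/DataSets"))
```

An empty list means all three kinds of files were found where the job expects them.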


Design a job which reads these dataset files and writes the data to a sequential file or any other output.


Job Parameters -
APT_CONFIG_FILE = /home/users/atul/default.apt

DataSet Properties
DataSet File - /home/users/atul/dummy.ds

That is all. You can now read the copied dataset on serverB. You can write this data to some other output, such as a sequential file or a table, so that you no longer need the copied default.apt config file, which was not made for serverB.

Try it out, and let me know if you have any questions.





Monday, 1 August 2016

#2 How to Copy DataSet from One Server to Another Server


This post is the second part of the How to Copy DataSet from One Server to Another Server series.

Continued...

After generating the dummy dataset, the next step is to identify the files which we need to copy.

2. Files which we need to move

a. APT_CONFIG_FILE - the configuration file which was used to create the dataset
b. DataSet Descriptor file - the *.ds file; in our case it is dummy.ds
c. DataSet Data files - the actual data files, which are stored in the RESOURCE DISK location

So let's get all the paths which we need to access -

APT_CONFIG_FILE = /opt/IBM/InformationServer/Server/Configurations/default.apt
RESOURCE DISK = /opt/IBM/InformationServer/Server/DataSets
DATASET LOC = /home/users/atul/dummy.ds



Use commands or any FTP tool to copy these files to a shared location that is accessible from the other server (serverB).

In my case, I have stored all of them in my Linux home directory, which is common to both servers.

So I executed these commands to copy all the required files into my home directory.


cp  /opt/IBM/InformationServer/Server/Configurations/default.apt ~
cp  /opt/IBM/InformationServer/Server/DataSets/dummy.ds.* ~
cp  /home/users/atul/dummy.ds ~
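
The same collection step can be sketched in Python. The helper name `collect_dataset_files` is hypothetical, and the sketch assumes the data files are named after the descriptor (dummy.ds.*), as in the cp commands above. It skips any file that already lives in the destination directory, which avoids a "same file" error when the descriptor is already in the home directory.

```python
import glob
import os
import shutil


def collect_dataset_files(config_file, resource_disk, descriptor, dest_dir):
    """Copy the config file, descriptor, and matching data files into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    name = os.path.basename(descriptor)
    # Data files share the descriptor's name as a prefix, e.g. dummy.ds.*
    data_files = sorted(glob.glob(os.path.join(resource_disk, name + ".*")))
    collected = []
    for src in [config_file, descriptor] + data_files:
        target = os.path.join(dest_dir, os.path.basename(src))
        # Skip the copy when the source is already in the destination.
        if os.path.abspath(src) != os.path.abspath(target):
            shutil.copy2(src, target)
        collected.append(target)
    return collected


# Example call with the paths from this post (run it on serverA):
# collect_dataset_files(
#     "/opt/IBM/InformationServer/Server/Configurations/default.apt",
#     "/opt/IBM/InformationServer/Server/DataSets",
#     "/home/users/atul/dummy.ds",
#     os.path.expanduser("~"),
# )
```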


Now my home directory contains these files -


You can copy these 4 files to serverB, where you want to move your dataset. I am not doing so because my home directory is common to both servers.

3. Why we need these files only

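DataStage used the config file to create the dataset (the descriptor file, the data files, and the data file locations). The descriptor points to the data files, and the config file defines the nodes and resource disks they were written to.
So we need all three: the config file, the dataset descriptor file, and the dataset data files.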






Saturday, 30 July 2016

#1 How to Copy DataSet from One Server to Another Server



Hi Guys...
I've been asked many times how to move or copy a dataset from one server to another, so here is the approach that I follow.

As a very first step, analyze whether you can avoid this by some other means, such as creating a sequential file and transferring it with FTP, loading the data into a temporary table that is accessible from the other server, or, if you use DataStage packs, sending it via MQ, XML or JSON formats. I suggest these solutions because they are easy to design and guarantee data quality at the other end.

If the above solutions are not possible, please follow the steps below -

Points I am going to cover here -
1. Generating a dummy dataset
2. Files which we need to move
3. Why we need these files only
4. Reading the dataset on another server


 

1. Generating a dummy dataset

I have created a dummy job which generates a dataset using the default APT_CONFIG_FILE, which has 2 nodes.






Here, I am generating 10 dummy rows with the help of the Row Generator stage and storing them in a dataset.

a. Config File - I am using the default config file (I have replaced the server name in "fastname" with serverA)

APT_CONFIG_FILE = /opt/IBM/InformationServer/Server/Configurations/default.apt


Check the "resource disk" location in the config file; we need it for further processing.

RESOURCE DISK = /opt/IBM/InformationServer/Server/DataSets

b. Dataset location - I have created this dataset, named dummy.ds, in my home dir

DATASET LOC = /home/users/atul/dummy.ds


Look out for the next post...






Wednesday, 23 March 2016

5 Tips For Better DataStage Design #11




  • When writing intermediate results that will only be shared between parallel jobs, always write to persistent data sets (using Data Set stages). You should ensure that the data is partitioned, and that the partitions, and sort order, are retained at every stage. Avoid format conversion or serial I/O.
  • Data Set stages should be used to create restart points in the event that a job or sequence needs to be rerun. But, because data sets are platform and configuration specific, they should not be used for long-term backup and recovery of source data.
  • Depending on available system resources, it might be possible to optimize overall processing time at run time by allowing smaller jobs to run concurrently. However, care must be taken to plan for scenarios when source files arrive later than expected, or need to be reprocessed in the event of a failure.
  • Parallel configuration files allow the degree of parallelism and resources used by parallel jobs to be set dynamically at run time. Multiple configuration files should be used to optimize overall throughput and to match job characteristics to available hardware resources in development, test, and production modes.
  • The proper configuration of scratch and resource disks and the underlying file system and physical hardware architecture can significantly affect overall job performance.







Monday, 1 February 2016

5 Tips For Better DataStage Design #8



#1. Templates should be created to enhance reusability and enforce coding standards. Jobs should be created from templates.
#2. The template should contain the standard job flow along with proper naming conventions for components, proper job-level annotation, and short/long descriptions. A change record section should be kept in the long description to keep track of changes.



#3. Don't copy only the job design. Copy the job using 'Save As' or the Create Copy option at the job level.
#4. The DataStage connection should be logged off after completion of work to avoid locked jobs.
#5. Create common lookup jobs.
#6. Some extraction jobs can be created to build reference datasets. These datasets can then be used in different conversion modules.




