Showing posts with label Sort. Show all posts

Wednesday, 2 January 2019

How to iterate MongoDB Cursor in Python - Intermediate I


When you query MongoDB, always store the result in a variable, called a cursor, before performing any operation on the data. This keeps the result set in the variable without cluttering your output area. A PyMongo cursor also supports a few functions that give you information without actually displaying the data, such as the count of retrieved documents or the distinct values of a particular key.

In this session, we will learn about the MongoDB cursor. For this exercise too, we are going to use the 'USER' database hosted on a free-tier MongoDB Atlas (M0) server.


The Jupyter Notebook can also be accessed HERE



The next post in this series and more on MongoDB can be found here -> LINK





Like the below page to get the update  
Facebook Page      Facebook Group      Twitter Feed      Google+ Feed      Telegram Group     


Thursday, 16 June 2016

5 Tips For Better DataStage Design #14



1. The use of the Lookup stage depends on the volume of data. The Sparse lookup type should be used when the primary input data volume is small. If the reference data volume is large, the Lookup stage should be avoided.

2. Using an ORDER BY clause in the database is better than using a Sort stage.



3. In DataStage Administrator, tune the 'Project Tunable' settings for better performance.

4. The Funnel stage reduces the performance of a job. When you do use it, the Funnel stage should be run in continuous mode.

5. If a hash file is used only for lookup, enable "Preload to memory". This will improve performance.






Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://datagenx.slack.com/messages/datascience/

Sunday, 10 April 2016

5 Tips For Better DataStage Design #12



1. Use the minimum number of Sort stages in a DataStage job. Set the “Don’t sort if previously sorted” option in the Sort stage to “true”, which improves Sort stage performance. The same hash key should be used. In the Transformer stage, “Preserve Sort Order” can be used to maintain the sort order.

2. Use the minimum number of stages in a job; otherwise the performance of the job suffers.
If a job has too many stages, decompose it into a number of smaller jobs. Containers are a good way to improve visualization and readability. If the existing active stages occupy almost all the CPU resources, performance can be improved by running multiple parallel copies of the same stage process. This is done by using a shared container.





3. Using a minimum of stage variables in the Transformer is good practice. Performance degrades as more stage variables are used.

4. Take care with column propagation. Columns that are not needed in the job flow should not be propagated from one stage to another or from one job to the next. The best option is to disable RCP (Runtime Column Propagation).

5. When columns need to be renamed or new columns added, using a Copy or Modify stage is good practice.






Sunday, 24 January 2016

ps command #4 - Sorting with sort command



We can also sort the ps command output with the Unix sort command, which is easy to use. Pipe the ps output to sort with the proper arguments and voilà! You get the output the way you want it.

Let's see how this works. [ sort command arguments can differ per your Linux flavour and version ]
 I am using CentOS 6.3.


1. Display the top CPU-consuming processes (Column 3 - %CPU)
$ ps aux | head -1; ps aux | sort -k3 -nr |grep -v 'USER'| head
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND 
atul     21210  2.0  0.1 110232  1140 pts/3    R+   00:13   0:00 ps aux
hduser    2671  0.8  4.1 960428 42436 pts/1    Sl+  Aug22   5:29 mongod
root      1447  0.2  0.3 185112  3384 ?        Sl   Aug22   1:36 /usr/sbin/vmtoolsd
atul      2478  0.2  2.1 448120 21876 ?        Sl   Aug22   1:51 /usr/lib/vmware-tools/sbin64/vmtoolsd -n vmusr --blockFd 3
rtkit     2359  0.1  0.1 168448  1204 ?        SNl  Aug22   0:44 /usr/libexec/rtkit-daemon
root         7  0.1  0.0      0     0 ?        S    Aug22   0:53 [events/0]
root      2204  0.1  4.3 147500 43872 tty1     Ss+  Aug22   0:45 /usr/bin/Xorg :0 -nr -verbose -audit 4 -auth /var/run/gdm/auth-for-gdm-wEmBs1/database -nolisten tcp vt1
root       920  0.0  0.0      0     0 ?        S    Aug22   0:00 [bluetooth]
root         9  0.0  0.0      0     0 ?        S    Aug22   0:00 [khelper]
root         8  0.0  0.0      0     0 ?        S    Aug22   0:00 [cgroup] 

On my Linux, the sort command arguments are:
-kn  ==> use column n as the start of the sort key, e.g. -k4 for column 4
-n   ==> compare the key as a number
-r   ==> reverse order

sort -k3 -nr ==> sort on the third column of the output, numerically and in reverse (largest to smallest)
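
To see these options at work without depending on a live process list, here is a minimal sketch that pushes three made-up rows through the same sort call. The rows and the function names (sample_rows, sort_by_cpu) are hypothetical, used only for illustration.

```shell
# Hypothetical sample rows in "USER PID %CPU" layout (made up, not real ps output).
sample_rows() {
  printf '%s\n' \
    'atul 101 0.8' \
    'root 7 10.5' \
    'hduser 2671 2.0'
}

# Sort on column 3, numerically, largest first: the same flags as above.
sort_by_cpu() {
  sort -k3 -nr
}

sample_rows | sort_by_cpu
# prints:
# root 7 10.5
# hduser 2671 2.0
# atul 101 0.8
```

If you drop -n, sort compares the column as text, so 10.5 may end up below 2.0. That is why -n matters for numeric columns such as %CPU and %MEM.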

2. Display the top 10 memory-consuming processes (Column 4 - %MEM)
$ ps aux | head -1; ps aux | sort -k4 -nr |grep -v 'USER'| head
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND 
root      2204  0.1  4.3 147500 43872 tty1     Ss+  Aug22   0:46 /usr/bin/Xorg :0 -nr -verbose -audit 4 -auth /var/run/gdm/auth-for-gdm-wEmBs1/database -nolisten tcp vt1
hduser    2671  0.8  4.1 960428 42436 pts/1    Sl+  Aug22   5:32 mongod
atul      2458  0.0  2.3 943204 23624 ?        S    Aug22   0:16 nautilus
atul      2516  0.0  2.2 275280 22316 ?        Ss   Aug22   0:06 gnome-screensaver
atul      2478  0.2  2.1 448120 21876 ?        Sl   Aug22   1:52 /usr/lib/vmware-tools/sbin64/vmtoolsd -n vmusr --blockFd 3
atul      2507  0.0  1.6 321388 16680 ?        S    Aug22   0:01 python /usr/share/system-config-printer/applet.py
atul      2589  0.0  1.4 292556 14600 ?        Sl   Aug22   0:14 gnome-terminal
atul      2536  0.0  1.3 395832 13372 ?        S    Aug22   0:00 /usr/bin/gnote --panel-applet --oaf-activate-iid=OAFIID:GnoteApplet_Factory --oaf-ior-fd=22
atul      2502  0.0  1.3 474620 13952 ?        Sl   Aug22   0:01 gpk-update-icon
atul      2537  0.0  1.2 459964 12736 ?        S    Aug22   0:10 /usr/libexec/clock-applet --oaf-activate-iid=OAFIID:GNOME_ClockApplet_Factory --oaf-ior-fd=34
 




3. Display the processes by time (Column 4 - TIME)
$ ps vx | head -1; ps vx | sort -k4 -r| grep -v 'PID' | head
PID TTY      STAT   TIME  MAJFL   TRS   DRS   RSS %MEM COMMAND
 2478 ?        Sl     1:52    351   593 447526 21876  2.1 /usr/lib/vmware-tools/sbin64/vmtoolsd -n vmusr --blockFd 3
 2458 ?        S      0:16    228  1763 941440 23624  2.3 nautilus
 2589 ?        Sl     0:15     28   296 292259 14600  1.4 gnome-terminal
 2421 ?        Ssl    0:14     22    34 500541 9676  0.9 /usr/libexec/gnome-settings-daemon
 2479 ?        S      0:13     23   403 310472 11996  1.1 nm-applet --sm-disable
 2537 ?        S      0:10     37   168 459795 12736  1.2 /usr/libexec/clock-applet --oaf-activate-iid=OAFIID:GNOME_ClockApplet_Factory --oaf-ior-fd=34
 2444 ?        Ssl    0:10     25    64 445791 4872  0.4 /usr/bin/pulseaudio --start --log-target=syslog
 2445 ?        S      0:07     51   593 322206 12684  1.2 gnome-panel
 2516 ?        Ss     0:06      4   151 275128 22316  2.2 gnome-screensaver
 2522 ?        Sl     0:05      5    41 231870 1960  0.1 /usr/libexec/gvfs-afc-volume-monitor

 


4. Display the top 10 processes by real memory usage (Column 8 - RSS)
$ ps vx | head -1; ps vx | sort -k8 -nr| grep -v 'PID' | head 
PID TTY      STAT   TIME  MAJFL   TRS   DRS   RSS %MEM COMMAND
 2458 ?        S      0:16    228  1763 941440 23624  2.3 nautilus
 2516 ?        Ss     0:06      4   151 275128 22316  2.2 gnome-screensaver
 2478 ?        Sl     1:52    351   593 447526 21876  2.1 /usr/lib/vmware-tools/sbin64/vmtoolsd -n vmusr --blockFd 3
 2507 ?        S      0:01     73     2 321385 16680  1.6 python /usr/share/system-config-printer/applet.py
 2589 ?        Sl     0:15     28   296 292259 14600  1.4 gnome-terminal
 2502 ?        Sl     0:01     29   257 474362 13952  1.3 gpk-update-icon
 2536 ?        S      0:00     92  1607 394224 13372  1.3 /usr/bin/gnote --panel-applet --oaf-activate-iid=OAFIID:GnoteApplet_Factory --oaf-ior-fd=22
 2537 ?        S      0:10     37   168 459795 12744  1.2 /usr/libexec/clock-applet --oaf-activate-iid=OAFIID:GNOME_ClockApplet_Factory --oaf-ior-fd=34
 2445 ?        S      0:07     51   593 322206 12684  1.2 gnome-panel
 2438 ?        Sl     0:03     30   542 433105 12512  1.2 metacity


Like the examples above, you can create many more one-liners for yourself. But before using any of these commands, check how ps and sort behave on your system.
Most Unix flavours have their own arguments for ps and sort, but the basics are the same. To sort any command output by a particular column, first understand that output and column, and then use the sort command.
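
As one more one-liner idea, here is a small sketch that keeps the header line on top without running the command twice and without grep -v. The helper name sort_keep_header is hypothetical, and the sample input is made up so the result does not depend on your process list.

```shell
# Print the first line (the header) untouched, then sort the remaining lines.
# Usage: some_command | sort_keep_header <sort options...>
sort_keep_header() {
  IFS= read -r header        # consume only the header line from stdin
  printf '%s\n' "$header"
  sort "$@"                  # the rest of stdin goes to sort
}

# Made-up input: a header plus three rows, sorted on column 2, largest first.
printf 'NAME VAL\na 1\nb 3\nc 2\n' | sort_keep_header -k2 -nr
# prints:
# NAME VAL
# b 3
# c 2
# a 1
```

With real data the same helper could be used as: ps aux | sort_keep_header -k3 -nr | head -n 11 (11 lines = header + top 10). As always, check it against your own ps and sort first.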

I hope, you find this helpful. Keep Learning !!





Like the below page to get update  
https://www.facebook.com/datastage4you
https://twitter.com/datagenx
https://plus.google.com/+AtulSingh0/posts
https://groups.google.com/forum/#!forum/datagenx

Friday, 8 January 2016

5 Tips For Better DataStage Design #7



#1. If the partition type needs to be changed for the next immediate stage, ‘Propagate partition’ should be set to ‘Clear’ in the current stage.



#2. Make sure that appropriate partitioning and sorting are used in the stages wherever possible. This improves performance. Make sure that you understand the partitioning being used; otherwise, leave it set to Auto.

#3. For fixed width files, final delimiter should be set to 'none' in the file format property.

#4. If a processing stage requires a key (such as Remove Duplicates, Merge, Join, etc.), the stage keys, sorting keys, and partitioning keys should be the same and in the same order.

#5. To improve Funnel performance, all the input links must be hash partitioned on the sort keys.






Wednesday, 23 December 2015

5 Tips For Better DataStage Design #6



#1. If you use a Copy or Filter stage immediately before or immediately after a Transformer stage, you reduce efficiency by using more stages than needed, because a Transformer can do the job of both a Copy stage and a Filter stage.

#2. Work done by the Copy stage:
a) Column order can be altered.
b) Columns can be dropped.
c) Column names can be changed.



#3. When you need to run the same sequence of jobs again and again, create a sequencer with all the jobs that you need to run. Running this sequencer will run all the jobs. You can arrange the jobs in whatever sequence you require.

#4. Sort the data in the DB as much as possible and reduce the use of the DataStage Sort stage for better job performance. Avoid doing work in DataStage that can be done in the DB. That does not mean you have to put all the complexity into SQL; that is what we use DataStage for.

#5. Ensure that all character fields are trimmed before any processing. Extra spaces in the data can lead to errors, such as lookup mismatches, that are hard to detect.
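
The trimming tip (#5) is easy to demonstrate outside DataStage. The sketch below is plain shell, not DataStage code; the trim helper and the sample values are hypothetical. It only shows why a key with trailing spaces fails an exact-match lookup until it is trimmed.

```shell
# Strip leading and trailing whitespace from a single value.
trim() {
  printf '%s\n' "$1" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

raw='ABC   '   # key as it arrives from the source, padded with spaces
ref='ABC'      # key as stored in the reference data

# Untrimmed comparison: the padding makes the keys differ.
if [ "$raw" = "$ref" ]; then echo "raw: match"; else echo "raw: mismatch"; fi

# Trimmed comparison: the keys now agree.
if [ "$(trim "$raw")" = "$ref" ]; then echo "trimmed: match"; else echo "trimmed: mismatch"; fi
# prints:
# raw: mismatch
# trimmed: match
```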






Friday, 20 November 2015

5 Tips For Better DataStage Design #5



#1. Use the Data Set Management utility, which is available in the Tools menu of the DataStage Designer or the DataStage Manager, to examine the schema, look at row counts, and delete a Parallel Data Set. You can also view the data itself.

#2. Use Sort stages instead of Remove Duplicates stages. The Sort stage has more grouping options and sort indicator options.

#3. To quickly check from UNIX whether a DS job is running on the server:
ps -ef | grep 'DSD.RUN'



#4. Make use of an ORDER BY clause when a DB stage is used in a join. The intention is to use the database's power for sorting instead of DataStage resources. Keep the join partitioning as Auto. When using the ORDER BY clause, indicate the “don’t sort” option with a Sort stage placed between the DB stage and the Join stage.

#5. There are two types of environment variables: string and encrypted. If you create an encrypted environment variable, it appears as the string "*******" in the Administrator tool.
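
One catch with the check in tip #3: a plain grep 'DSD.RUN' can also list the grep process itself. A common workaround is to put one character of the pattern in brackets. The sketch below shows the idea on two made-up ps -ef style lines; the function name filter_ds_jobs and the sample lines are hypothetical.

```shell
# Keep only lines that contain the text "DSD.RUN".
# The pattern '[D]SD\.RUN' matches "DSD.RUN", but grep's own command line
# contains "[D]SD\.RUN", which the pattern does not match.
filter_ds_jobs() {
  grep '[D]SD\.RUN'
}

# Made-up sample lines: one job process and the grep process itself.
sample_ps_lines() {
  printf '%s\n' \
    'dsadm  1234     1  0 10:00 ?      00:00:01 phantom DSD.RUN MyJob 0/0/1/0/0' \
    'atul    999   900  0 10:01 pts/0  00:00:00 grep [D]SD\.RUN'
}

sample_ps_lines | filter_ds_jobs
# prints only the first sample line (the one with "phantom DSD.RUN MyJob")
```

On a real server this would be used as: ps -ef | grep '[D]SD\.RUN'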






Monday, 5 October 2015

Hashing & Sorting Criteria in stages


As we are all aware, Round Robin is regarded as the best partitioning method, but it distributes the data across all partitions irrespective of the key (Round Robin is a keyless partitioning method), which is usually not what we want. When the key has to be considered, the method to use is Hash.

DataStage sorting and hashing improve data processing speed, which is one of our targets in projects. So, let's list some important stages and see whether they need partitioning or sorting to perform better.



Stages             Partition (Hash)   Sort
Sort               Yes                No
Aggregator         Yes                Yes
Join               Yes                Yes
Remove Duplicate   No                 No
Merge              Yes                Yes
Lookup             No                 No
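
To make the difference between keyless and keyed partitioning concrete, here is a rough shell sketch. It is not how DataStage implements partitioning: the function names are hypothetical, and cksum is used only as a stand-in hash. Each output line is prefixed with the partition number it was assigned to.

```shell
# Round robin: rows go to partitions 0..N-1 in turn, ignoring the key.
round_robin() {   # usage: round_robin N
  awk -v n="$1" '{ printf "%d %s\n", (NR - 1) % n, $0 }'
}

# Hash: the partition is derived from the key (first field), so rows with
# the same key always land in the same partition.
hash_partition() {   # usage: hash_partition N
  n=$1
  while IFS= read -r line; do
    key=${line%% *}                                    # first field is the key
    crc=$(printf '%s' "$key" | cksum | cut -d' ' -f1)  # stand-in hash value
    printf '%d %s\n' "$((crc % n))" "$line"
  done
}

# Made-up rows: key in column 1, payload in column 2.
printf 'A 1\nA 2\nB 3\nB 4\n' | round_robin 2
printf 'A 1\nA 2\nB 3\nB 4\n' | hash_partition 2
```

With round robin the two A rows end up in different partitions, while with hash partitioning every A row gets the same partition number. That is what key-based stages such as Join or Aggregator need.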







