Monday, January 26, 2009

The nl (number lines) command

Once in a while, I come across a task that I need to perform, maybe with a Python script, and then I am amazed to find that a tool for it is already built into Linux/Unix.

I had a series of numbers (stored as a column) representing a physical quantity measured with respect to time. The data was stored as a text file. I had to add a time column (in seconds) before the data column in the file. The first row had a time of 1 second, the second row a time of 2 seconds, etc. In short, I just needed to add the line number to each row.

Ordinarily, one could use Python to open the file, read each row and add an incrementing number in front of it. But with Linux, all I had to do was

nl file1.txt > file2.txt

where 'nl' is the number-lines command. The file 'file1.txt' contains the single column of measured values, and 'file2.txt' will contain two columns: the seconds column and the data column. Note that by default nl numbers only non-empty lines.
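For comparison, the Python route mentioned above might look roughly like the sketch below. The function name number_lines, the tab separator and the sample values are my own choices for illustration; unlike nl's default, this version numbers every line it is given.

```python
def number_lines(lines, start=1):
    """Prefix each line with an incrementing number and a tab."""
    return ["%d\t%s" % (n, line) for n, line in enumerate(lines, start)]


# Hypothetical measured values, one per row.
values = ["0.52", "0.61", "0.58"]
for row in number_lines(values):
    print(row)
```

The one-line nl command does the same job without any script, which is the point of this post.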

The nl command can also work across logical pages and can number headers and footers separately. Please check the man page for more details.

Dec 28, 2009
I found a second method to add line numbers:

To number every line, including blank lines, use cat -n file1.txt > file2.txt. If you wish to number only the non-blank lines, use cat -b file1.txt > file2.txt
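The difference between the two cat options can be pictured with a small Python sketch. The function name number_rows and its skip_blank flag are hypothetical, and the width-6 padding only imitates the look of the usual cat output; it is not guaranteed to match it byte for byte.

```python
def number_rows(lines, skip_blank=False):
    """Number lines; with skip_blank=True, blank lines get no number (as with cat -b)."""
    numbered = []
    n = 0
    for line in lines:
        if skip_blank and line == "":
            numbered.append(line)  # the blank line is kept but not counted
            continue
        n += 1
        numbered.append("%6d\t%s" % (n, line))
    return numbered


rows = ["first", "", "second"]
print(number_rows(rows))                   # every line is numbered
print(number_rows(rows, skip_blank=True))  # the blank line is passed through
```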

Friday, January 23, 2009

xargs - taking output of one command and making input for another

If you would like to take the output of one command and pass it on to another UNIX program, you are in luck. As with many things in Unix/Linux, there are several ways to perform this operation.

For example, if I need to find all jobs that are running in the queue and get their complete details, I could run either of the following two commands:

qstat -f `qstat | grep R | awk '{print $1}'`


qstat | grep R | awk '{print $1}' | xargs -n1 qstat -f

In the first version, the command within `backticks` is evaluated first to obtain the list of jobs that are "running". Within it, 'qstat | grep R' gives the rows for running jobs (strictly, any row that contains the letter R). The awk command then prints the first field of each row, which identifies the job.

In the second version, the output of 'qstat | grep R | awk '{print $1}'' is piped to the xargs command, which builds the 'qstat -f' command lines from it. Both versions give the same information. The difference is that with -n1, xargs runs 'qstat -f' once per job instead of once for the whole list, so a long list of jobs cannot exceed the command-line length limit. With the -0 option, xargs can also handle input items that contain whitespace, provided the items are separated by null characters.
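To make the two invocation styles concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the sample rows are made up, the column layout is simplified, and the function names are hypothetical. It matches a standalone R field rather than any R in the row, which is stricter than 'grep R'.

```python
def running_job_ids(qstat_output):
    """Return the first field of every row that has a standalone 'R' field."""
    ids = []
    for row in qstat_output.splitlines():
        fields = row.split()
        if "R" in fields[1:]:
            ids.append(fields[0])
    return ids


def backtick_style(ids):
    """One command line that receives every job ID at once."""
    return [["qstat", "-f"] + ids]


def xargs_n1_style(ids):
    """One command line per job ID, as with xargs -n1."""
    return [["qstat", "-f", job_id] for job_id in ids]


# Made-up rows; the real qstat layout may differ.
sample = """101.server jobA   alice 00:10:00 R batch
102.server Render bob   0        Q batch
103.server jobC   carol 01:00:00 R batch"""

ids = running_job_ids(sample)
print(backtick_style(ids))
print(xargs_n1_style(ids))
```

In this sample, row 102 contains the letter R in its job name, so 'grep R' would have picked it up even though the job is queued; checking the field itself avoids that.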

We will continue further with the use of xargs and calculate the total time all the currently running jobs would need.

qstat | grep R | awk '{print $1}' | xargs -n1 qstat -f | grep Resource_List.walltime | awk '{split($3,a,":");sum+=a[1]}END{print sum}'

As seen earlier, the output of the xargs command becomes the input of the 'grep Resource_List.walltime' command. The output of that command is the rows that contain the wallclock time for each of these jobs. The awk command then takes the 3rd field of each row ($3 in the command), which contains the wallclock time formatted as hh:mm:ss.

This string is split at ":" and the first element, namely the hours, is available as "a[1]" in the command. This is repeated for each job, and at the end the total is printed. Note that only the hours are summed; the minutes and seconds are ignored.
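The awk step can be mirrored in Python as a rough sketch. The function names and the two sample rows are hypothetical; the second function is an extension of my own that also counts minutes and seconds, which the awk one-liner does not do.

```python
def total_walltime_hours(lines):
    """Sum the hh part of the third field, like sum+=a[1] in the awk command."""
    total = 0
    for line in lines:
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "Resource_List.walltime":
            total += int(fields[2].split(":")[0])
    return total


def total_walltime_seconds(lines):
    """Variant (not in the original command) that includes minutes and seconds."""
    total = 0
    for line in lines:
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "Resource_List.walltime":
            hours, minutes, seconds = (int(part) for part in fields[2].split(":"))
            total += hours * 3600 + minutes * 60 + seconds
    return total


# Made-up rows in the "name = hh:mm:ss" shape assumed by the command above.
rows = [
    "    Resource_List.walltime = 12:00:00",
    "    Resource_List.walltime = 02:30:00",
]
print(total_walltime_hours(rows))
print(total_walltime_seconds(rows))
```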

I will keep posting more of such tidbits in the future ....

Thursday, January 22, 2009

Creating module files

A complex program in Linux is generally installed across multiple locations. For example, a C++ library like Magick++ will, after installation, consist of include files, library files and binary files. Each of these is located in a different folder, possibly under different parent directories as well.

In one of our installations, Magick++ is installed in /usr/local/magick++/magick++/. Under this directory, the include files are in include, the library files are in lib and the binary files are in bin.

For a user's program that uses the Magick++ library, the paths to all three have to be in the user's environment variables. These are set using a module file.

A module file is a Tcl script. An example of a module file for adding the paths to Magick++ is given below.

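#%Module1.0
## Magick++ Module
proc ModulesHelp { } {
    puts stderr "\tThis module adds PATH that allow you to compile Magick++"
}

set MAGICK_LIB_HOME "/usr/local/magick++/magick++/lib"
set MAGICK_BIN_HOME "/usr/local/magick++/magick++/bin"
set MAGICK_INCLUDE_HOME "/usr/local/magick++/magick++/include"

append-path PATH $MAGICK_BIN_HOME
append-path LD_LIBRARY_PATH $MAGICK_LIB_HOME
append-path LD_INCLUDE_PATH $MAGICK_INCLUDE_HOME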


The module file begins with #%Module, which identifies it as a module file. The proc ModulesHelp prints a helpful message whenever "module help magick++" is typed at the Linux command prompt. The next three lines create variables that store the locations of the lib, bin and include directories. Finally, these paths are appended to the environment variables LD_LIBRARY_PATH, PATH and LD_INCLUDE_PATH respectively.

To invoke this module file and add all these paths to the environment variables, type "module load magick++" or "module add magick++" at the Linux command line.

To remove these paths from the environment variables, type "module unload magick++".

In addition to appending paths, we can also prepend paths, set and unset environment variables, set and unset aliases etc. Refer to the manpage for more details.
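Conceptually, appending a path on load and removing it on unload amounts to editing a colon-separated string. The Python sketch below only illustrates that idea; the function names are hypothetical and the real module command may differ in details such as how it treats duplicates.

```python
def append_path(env, name, directory, sep=":"):
    """Add directory to the end of env[name] unless it is already listed."""
    current = env.get(name, "")
    parts = current.split(sep) if current else []
    if directory not in parts:
        parts.append(directory)
    env[name] = sep.join(parts)
    return env


def remove_path(env, name, directory, sep=":"):
    """Drop directory from env[name]; delete the variable if nothing is left."""
    parts = [p for p in env.get(name, "").split(sep) if p and p != directory]
    if parts:
        env[name] = sep.join(parts)
    else:
        env.pop(name, None)
    return env


# A plain dict stands in for the shell environment.
env = {"PATH": "/usr/bin"}
append_path(env, "PATH", "/usr/local/magick++/magick++/bin")
print(env["PATH"])
remove_path(env, "PATH", "/usr/local/magick++/magick++/bin")
print(env["PATH"])
```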

Configuring Make file

Any of us who use Linux will eventually end up installing software from source. The most common method for installing software written in C, C++, Fortran etc. is using a Makefile.

The Makefile contains the list of commands that will be used to create the various libraries, binaries etc.

In the simplest scenario, the installation will involve

./configure
make
make install
make clean

The first step, "./configure", prepares the Makefile with all the relevant configuration for the system on which the software is being installed. This could include the location where the files will be stored after installation, the type of CPU etc. The second step, "make", compiles the program, and "make install" places it in the appropriate locations. "make clean" clears any temporary files that have been created.

Depending on the scenario and the type of software being installed, different configuration options may have to be set. In the example below, we configure the installation of Magick++ (the C++ library for ImageMagick) so that it is installed in /usr/local/magick++/magick++/ instead of the default location /usr/local/.

./configure --with-quantum-depth=8 --prefix=/usr/local/magick++/magick++/ --exec-prefix=/usr/local/magick++/magick++/

--prefix is the location under which architecture-independent files, such as the include files, are installed. If not specified, it is assumed to be /usr/local.

--exec-prefix is the location under which architecture-dependent files, such as the bin and lib files, are installed.

By default, --exec-prefix takes the same value as --prefix.
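How the two options combine can be sketched in Python. This follows the default directory layout that a typical configure script documents (bin and lib under the exec prefix, include under the prefix); the function name install_dirs is hypothetical, and an individual package may override any of these directories.

```python
import posixpath


def install_dirs(prefix="/usr/local", exec_prefix=None):
    """Return the default bin, lib and include directories for the given prefixes."""
    if exec_prefix is None:
        exec_prefix = prefix  # the exec prefix falls back to the prefix
    return {
        "bindir": posixpath.join(exec_prefix, "bin"),
        "libdir": posixpath.join(exec_prefix, "lib"),
        "includedir": posixpath.join(prefix, "include"),
    }


print(install_dirs())
print(install_dirs(prefix="/usr/local/magick++/magick++/"))
```

Since both options are set to the same directory in the configure command above, all three directories end up under /usr/local/magick++/magick++/.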

There are many other configuration parameters that can be set, which again depends on the software. I will keep posting more such configuration.

Saturday, January 3, 2009

Spell checker in Unicode using Python

I have a project where I had to perform a spell check on characters recognized by an optical character recognition (OCR) program. My first choice was to search for an existing program, preferably written in Python, my favorite language for such work. You can download the complete file here.

Amazingly, I found this work by Peter Norvig. It is a very well documented and well written piece of code.

But I had a few issues that I needed to fix, so I could not use it directly.

1. In my program, Unicode needs to be the default for all input and output, unlike Peter's program, which works on ASCII.

This is done in the following code

#!/usr/bin/python -Wall
# -*- coding: utf-8 -*-

import re, collections, pprint,os
import sys
import codecs

if __name__ == '__main__':
    # Python 2 only: setdefaultencoding is hidden at startup, reload(sys) restores it
    reload(sys)
    sys.setdefaultencoding('iso8859-1')
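The code above is Python 2; reload and sys.setdefaultencoding are not available in that form in Python 3. A possible Python 3 sketch, under the assumption that the word list is a text file with one word per line, passes the encoding explicitly when opening the file. The function name read_words is hypothetical.

```python
import io


def read_words(path, encoding="utf-8"):
    """Read one word per line, decoding the file with an explicit encoding."""
    with io.open(path, encoding=encoding) as handle:
        return [line.strip() for line in handle if line.strip()]
```

Calling read_words with encoding="iso8859-1" would decode a Latin-1 file, which is the encoding the Python 2 snippet sets as its default.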

2. The alphabet also includes the Unicode characters applicable in my situation:
alphabet = u'abcdefghijklmnopqrstuvwxyzàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿß'

3. Python's Unicode support is smart enough to convert these characters correctly from upper case to lower case. All that needs to be done is to call .lower() on the Unicode string, as in the following function.

def words(text): return re.findall(u'[abcdefghijklmnopqrstuvwxyzàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿß]+',text.lower())
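As a quick check of the lower-casing behaviour, here is a Python 3 sketch of the same function, where every str is already Unicode. The constant names ALPHABET and WORD_RE and the sample sentence are my own additions.

```python
import re

ALPHABET = "abcdefghijklmnopqrstuvwxyzàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿß"
WORD_RE = re.compile("[" + ALPHABET + "]+")


def words(text):
    """Lower-case the text and return every run of alphabet characters."""
    return WORD_RE.findall(text.lower())


print(words("Crème BRÛLÉE, 2 cafés"))
```

The accented capitals are lower-cased along with the plain ones, and the digit and the comma are dropped because they are not in the alphabet.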

4. Peter's program trains on words by determining the probability of their occurrence. In simple terms, it counts the number of times a word appears in a standard piece of text. The larger the piece of text, the more representative it is of the real world. This approach did not work in my case, as I do not have a piece of text in which a word is repeated multiple times.

In my case, I have a list of words in a text file. Almost every word appears only once. So the rank of a word was based not on its frequency but on its ordinality.

The ord function in Python returns the Unicode code point of a character. In the function below, I first determine the ordinality of each word among the possible candidates (i.e., the original set of words) by summing the code points of its characters. Then the ordinality of the word to be spell checked is found in the same way. The difference between the two ordinalities is determined, and the location of the lowest value gives the location of the correct word in the candidates.

def best_candidate(candidates, word):
    clist = list(candidates)
    # Find ordinality (sum of code points) for the complete list
    so = []
    for cl in clist:
        sum_ord = 0
        for c in cl:
            sum_ord = sum_ord + ord(c)
        so.append(sum_ord)

    # Find ordinality of the given word
    sum_ord = 0
    for c in word:
        sum_ord = sum_ord + ord(c)

    # Find difference in ordinality and also lowest value location
    # (the absolute difference is assumed here)
    so_item_l = []
    for so_item in so:
        so_item_l.append(abs(so_item - sum_ord))
    min_loc = so_item_l.index(min(so_item_l))

    return clist[min_loc]
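A compact Python 3 sketch of the same idea, with a small worked example, is given below. The helper name ordinality and the sample words are my own, and taking the absolute difference is an assumption about how "difference" is meant.

```python
def ordinality(word):
    """Sum of the Unicode code points of the characters in word."""
    return sum(ord(c) for c in word)


def best_candidate(candidates, word):
    """Return the candidate whose ordinality is closest to that of word."""
    clist = list(candidates)
    target = ordinality(word)
    # min() keeps the first candidate on a tie, like list.index(min(...))
    return min(clist, key=lambda cand: abs(ordinality(cand) - target))


# "cat" sums to 312, "dog" to 314 and the misread "cas" to 311.
print(ordinality("cat"), ordinality("dog"), ordinality("cas"))
print(best_candidate(["dog", "cat"], "cas"))
```

One limitation worth keeping in mind: words made of the same letters in a different order, such as "act" and "cat", have the same ordinality, so this measure cannot tell them apart.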