
Saturday, May 10, 2014

My new textbook and job

So many things have happened in the last few months, and I have not had time to add new entries to my blog. The first is the textbook that I co-authored with Sridevi Pudipeddi, which was released at the end of February. It was a marathon run to complete the book and get it ready for publishing. I also moved to a new job in California in April.


During my work as an image processing consultant at the Minnesota Supercomputing Institute, I worked with students in various disciplines of science. In all these cases, images were acquired using x-ray, CT, MRI, electron microscopy or optical microscopy. It is important that students have knowledge of both the physical methods of obtaining images and the analytical processing methods, so that they understand the science behind the images. Thus, a course in image acquisition and processing has broad appeal across the STEM disciplines and is useful for transforming undergraduate and graduate curricula to better prepare students for their future.

There are books that discuss image acquisition alone and books that discuss image processing alone, yet the image processing algorithms depend on the image acquisition method. We wrote a book that discusses both, so that students can learn from one source. You can check out a sample chapter of the book at reedwith.us. You can buy the book at Amazon by clicking on the image below.


I also changed jobs. I started working as a Senior Engineer at Elekta in Sunnyvale, CA. I will be focusing mostly on x-ray and CT during my tenure at Elekta.

Thursday, February 14, 2013

Checkpointing in Python

Scientific programs are known to be computationally expensive. They generally require a significant amount of time for processing; a program might run for several hours or days. It is not always possible to guarantee that the machine on which the program is running will be available to the user over such a long period. In such cases, checkpointing can be used to store the state of the program at different times, so that the program can be restarted without having to repeat the computation from the beginning. In this post, we will discuss a method for checkpointing Python programs.

The first program, shown below, adds all numbers from 1 to 200 and prints the sum. This program might look too simplistic for scientific computation, but it provides a good platform for discussing checkpointing. Instead of computing the sum from 1 to 200, imagine computing it up to a very large value, such as a trillion. In that case, the computation might run for hours, and if the program is interrupted, it has to be started again from the beginning.
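The original listings were embedded in the post; a minimal reconstruction of the first program looks like this:

total_sum = 0
for i in range(1, 201):
    total_sum = total_sum + i

print(total_sum)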

The second program is the checkpointed version of the first. In checkpointing, the current state of the program is stored in a file. Whenever the program runs, it first looks for the checkpoint file so that it can restore its state; if the file does not exist, the program assumes it is being run for the first time. During the computation, the program writes its state to the file at regular intervals. The exact interval depends on the problem being solved. If the program completes successfully, the content of the checkpoint file is no longer needed, and the file is removed from the disk.
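A reconstruction along those lines is sketched below; the file name variable checkpt_file and the variables start and total_sum come from the description, while the dictionary layout and the sleep duration are assumptions:

import os
import pickle
import time

checkpt_file = 'checkpoint.pkl'

# Restore the state from the checkpoint file if it exists;
# otherwise assume this is the first run and use default values.
if os.path.exists(checkpt_file):
    with open(checkpt_file, 'rb') as fp:
        state = pickle.load(fp)
    start = state['start']
    total_sum = state['total_sum']
else:
    start = 1
    total_sum = 0

for i in range(start, 201):
    total_sum = total_sum + i
    time.sleep(0.1)               # slow the execution for the demo
    if i % 5 == 0:                # checkpoint every 5th iteration
        with open(checkpt_file, 'wb') as fp:
            pickle.dump({'start': i + 1, 'total_sum': total_sum}, fp)

print(total_sum)
os.unlink(checkpt_file)           # remove the checkpoint on success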

At the start, the program checks whether the checkpoint file exists; if it does, it reads the content and restores the variables start and total_sum. If the checkpoint file does not exist, default values are used. During the computation, the current state of the program is written to the checkpoint file (checkpt_file) every 5th iteration. The sleep statement is added to slow the execution. If the program completes successfully, the checkpoint file is removed using the os.unlink method.

The checkpoint file is written and read using Python's pickle module, so any Python datatype that can be pickled can be stored. Alternative formats such as HDF5, CSV or XLS could also be used. pickle was chosen because it is built into Python and because the stored state is a picklable dictionary.

The image below is a snapshot taken after running the second program. The program was interrupted at the 7th iteration. When it was restarted from the command line, it began with iteration 6, as the state of the program up to iteration 5 had been stored in the checkpoint file.

Friday, October 12, 2012

Installing a large number of packages in R



I have been learning R; more specifically, how to install and maintain R for users. Recently, I had to install the latest version, 2.15.1, from source. Once it was installed, the next step was to install the packages from the older version (2.15.0) in the latest version. Although installing packages in R is as simple as invoking the function install.packages(), it quickly becomes cumbersome when you have to install more than 400 packages.

Instead I resorted to a combination of R and Python to complete this process.

First, determine the list of all packages in the older version of R using the following commands:

packs <- installed.packages()      # one row per installed package
exc <- names(packs[,'Package'])    # extract the package names

Here, exc contains the package names as a character vector.

Then, store this list in a text file, 'test.Rdata', so that it can be processed using Python.

write(exc,'test.Rdata')

To install more than one package at a time, a command similar to the one below can be used. This command installs three packages: BiocInstaller, coda and DEGseq. However, the aim of this blog post is to describe a method for installing far more packages than these three.

install.packages(c("BiocInstaller", "coda", "DEGseq"),dependencies=TRUE) 

The Python script that generates this command is given below. It reads the list of package names and joins them, adding quotation marks and commas as appropriate.


fp = open("test.Rdata", "r")   # the list of package names written from R
names = ['"' + line.strip() + '"' for line in fp.readlines()]
fp.close()
s = 'install.packages(c(' + ','.join(names) + '),dependencies=TRUE)'
print(s)


Finally, copy the output of the Python program into the R command line and wait a few hours for the installation to finish.

I am assuming this is a common problem.  How do you handle it?  You can give your advice in the form of comments.

PS: The concatenation can be performed using any other scripting language, like Perl, PHP, bash etc.

Saturday, March 3, 2012

Comparing Matlab and Python for large image processing problems

Matlab is a popular scripting language for numerical computing. It is popular and powerful due to its various toolboxes. Since Matlab is developed commercially, dedicated programmers are always working on adding new features and enhancing existing ones. Large image data sets created using the latest imaging modalities typically take a long time to process. The typical approach is to run the processing in parallel on many cores, either on a desktop, a cluster or a supercomputer. Commercial software like Matlab requires a dedicated license, such as the Distributed Computing Server, for running such parallel jobs, which adds cost.

Graphical Processing Unit (GPU) programming is becoming popular and easier than ever. The license cost for GPU programming in Matlab is typically lower than that for parallel programming. GPU programming is useful if there is a large amount of computation and little data transfer. This limit exists because of the small bandwidth between the CPU and the GPU, the path that data must take for processing on the GPU. Processing of large image data sets typically involves a large amount of I/O and data transfer between the CPU and GPU, and hence may not scale well.

Advantage Python

Python is a free and open-source scripting language. It can be scaled to a large number of processors. It has been shown that there is no significant difference in computational time between a program written in Matlab and one written in Python using numpy. The author of that comparison found that numpy run times are of the same order of magnitude as Fortran and C programs with optimization enabled. Although the timings are for slightly older versions of the various languages, my experience has been that the relative processing times remain similar. The processing time will also vary depending on the nature of the algorithm and the expertise of the programmer.

Python has parallel processing modules like mpi4py that allow scaling an application to a large number of cores. In a supercomputing environment with a large number of cores, Python can run on as many of them as needed. Matlab, on the other hand, can only scale to the extent of the number of licenses.
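As a flavor of what mpi4py code looks like, here is a minimal sketch (the work split is a toy example of my own, not from any particular application): each rank computes a partial sum and the results are combined with a reduction. It would be launched with something like mpiexec -n 16 python partial_sum.py.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank sums every size-th number in 1..200; together the
# ranks cover the whole range exactly once.
local = sum(range(rank + 1, 201, size))
total = comm.reduce(local, op=MPI.SUM, root=0)

if rank == 0:
    print(total)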

Python also has GPU programming capability through pycuda, in case GPU programming suits the application.
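A minimal pycuda sketch, assuming a CUDA-capable GPU and a working pycuda installation, could look like the following; the gpuarray interface hides the kernel-launch details:

import numpy as np
import pycuda.autoinit            # sets up a CUDA context on import
import pycuda.gpuarray as gpuarray

a = np.random.randn(1024).astype(np.float32)
a_gpu = gpuarray.to_gpu(a)        # copy the data to the GPU
b = (2 * a_gpu).get()             # elementwise math on the GPU, then copy back

print(np.allclose(b, 2 * a))

Note that the transfer in to_gpu and back in get is exactly the CPU-GPU traffic discussed above, which is why I/O-heavy image pipelines may not benefit.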

Python is a more general-purpose language than Matlab. Hence, integrating databases, servers, file handling and string handling into a program is easy.

Disadvantage Python

Since Python is an open-source package, it does not have as wide a variety of canned functions and toolboxes as Matlab. Hence, the user has to develop some of them. I hope that over time this issue will be resolved.

Friday, February 17, 2012

Undergraduate education in image processing

SURVEY URL:  http://goo.gl/ORDDz

Image acquisition and processing have become a standard method for qualifying and quantifying experimental measurements in various Science, Technology, Engineering and Mathematics (STEM) disciplines. Discoveries in the medical sciences have been made possible by advances in diagnostic imaging such as x-ray-based computed tomography (CT) and magnetic resonance imaging (MRI). Biological and cellular functions have been revealed by new imaging techniques in light-based microscopy. Advances in materials science have been aided by electron microscopy analysis of nanoparticles. All these examples and many more require knowledge of both the physical methods of obtaining images and the analytical processing methods to understand the science behind the images.

Imaging technology continues to advance, with new modalities and methods available to students and researchers in STEM disciplines. Thus, a course in image acquisition and processing would have broad appeal across the STEM disciplines and be useful for transforming undergraduate and graduate curricula to better prepare students for their future.

Image analysis is an extraordinarily practical technique that need not be limited to highly analytical individuals with a math or engineering background. Since researchers in biology, medicine and chemistry, along with students and scientists from mathematics, physics and various engineering fields, use these techniques regularly, there is a need for a course that provides a gradual introduction to both acquisition and processing.

Such a course will prepare students in common image acquisition techniques like CT, MRI, light microscopy and electron microscopy. It will also introduce practical aspects of image acquisition such as noise, resolution, etc. As part of the curriculum, students will program image processing routines in Python. They will be introduced to Python modules such as numpy and scipy, and will learn the various image processing operations: segmentation, morphological operations, measurements and visualization.
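As a taste of the kind of exercise such a course might include, here is a small scipy.ndimage sketch (the toy image and threshold are invented for illustration) that segments an image, cleans it up morphologically and measures the resulting objects:

import numpy as np
from scipy import ndimage

# A toy "image": two bright blobs on a dark background
img = np.zeros((64, 64))
img[10:20, 10:20] = 1.0
img[40:55, 30:45] = 1.0

binary = img > 0.5                        # segmentation by thresholding
cleaned = ndimage.binary_opening(binary)  # morphological clean-up
labels, n = ndimage.label(cleaned)        # connected-component labelling
sizes = ndimage.sum(cleaned, labels, list(range(1, n + 1)))  # measurements

print(n, sizes)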

We wanted to ground our understanding of this need in data collected by surveying people who are interested in image processing, or who wish they had had such a course during their senior year of undergraduate study or in graduate school. We created a survey to obtain your feedback. It will take only a minute of your time. We request that you fill in as much information as you can. Please forward the URL or this blog post to your friends as well.


Sunday, February 12, 2012

Python modules for scientific image processing



Recently my friend Nick Labello, who works at the University of Chicago, carried out a large-scale image processing job. It took one week to process data acquired over 12 hours. The program was written in Matlab and run on a desktop. If the processing time needs to be reduced, the best course of action is to parallelize the program and run it on multiple cores or nodes. Matlab parallelization can be expensive, as it is closed-source commercial software. Python, on the other hand, is free and open-source, and hence can be scaled to a large number of cores. Python also has parallelization modules like mpi4py that ease the task. With a clear choice of programming language, Nick worked on evaluating the various Python modules; the chosen module was then used to rewrite the Matlab program. Specifically, he reviewed the following:

  1. Numpy
  2. Scipy
  3. Pymorph
  4. Mahotas
  5. Scikits-image
  6. Python Imaging Library (PIL)
Nick's views on the various modules, along with some of mine, are given below.

Numpy

It doesn't actually give you any image processing capabilities, but all of the image processing libraries rely on Numpy arrays to store the image data. As a happy result, it is trivial to bounce between the different image processing libraries in one piece of code, because they all read and write the same datatype.
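To illustrate (a made-up example; any of the libraries below would work the same way), an image is simply a Numpy array, and the result of a scipy.ndimage call is again a plain array that can be handed straight to the next library:

import numpy as np
from scipy import ndimage

img = np.zeros((64, 64), dtype=np.uint8)
img[16:48, 16:48] = 255                         # a bright square

smooth = ndimage.gaussian_filter(img, sigma=2)  # scipy works on the array...
print(type(smooth), smooth.max())               # ...and returns a Numpy array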

Scipy

It provides stable, solid, easy-to-use, very basic functionality (erosion, dilation, etc.) through its ndimage subpackage. It is missing a lot of the more advanced functions you might find in Matlab, such as functions to find endpoints, perimeter pixels, etc.; these must be pieced together from elementary operations. The documentation for ndimage is NOT very good for new users. I had a hard time with it at first.

Pymorph

It has lots of image processing functions that rely on nothing but Numpy; it does not depend on Scipy, so it is a pure-Python library. A second interesting thing about PyMorph is that it has quite a lot of functionality compared to the other libraries. Unfortunately, since it is written in Python, it is hundreds of times slower than Scipy and the other libraries. This will become an issue for advanced image processing users.


Mahotas  

It provides only the most basic functions, but they are blazing fast; in my experience, twice as fast as the equivalent functions in Scipy.

Scikits-Image  

It picks up where Scipy leaves off and offers quite a few functions not found in Scipy.

Python Imaging Library (PIL)  

It is free to use but is not open-source. It has very few image processing algorithms and is not necessarily useful for scientific imaging.


Enthought  

It is __not__ an image processing library. It is a Python distribution ready for scientific programming. It has over 100 scientific packages pre-installed and is compiled against fast MKL/BLAS. My tests ran a little faster in Enthought Python than in regular Python, by 2-3%. Your mileage may vary. The big advantage, though, is that it has everything I need except Mahotas and PyMorph. Also, Mahotas installs easily on Enthought, whereas it was difficult on the original Python installation due to various dependencies. Enthought is free for students and employees of universities and colleges; others need to pay for the service.

Ravi's view: Personally, I do not have much of a problem installing Python modules into the regular Python interpreter. The only time it becomes difficult is when installing scientific packages like Scipy that have numerous dependencies, such as the Boost libraries. Ready-made binary packages do eliminate the installation steps but are not necessarily well tuned for optimal performance. Enthought is a great alternative that is just as easy to install and yet optimized for performance.

Wednesday, October 6, 2010

Spamming of HTML forms - one case

Recently I found that a newspaper, in its online edition, switched from an image-based CAPTCHA system to a mathematical puzzle in order to prevent spamming of its comment section by computer programs. A screen capture of the same can be found below.



The problem with such a system is that the puzzle can be easily solved by a computer, which defeats the purpose of using it to tell humans and computers apart. To test my own skill, I wanted to write a program that could download the page, read it and solve the puzzle. Using the information obtained, I could then post comments without human intervention.

To accomplish this task, I used the usual suspects: Python and the HTML parser BeautifulSoup. BeautifulSoup reads a string of HTML or XML and converts it to a tree. Using the tree, it is easy to navigate through the tags or search for a particular one based on id or name. It is also powerful enough to differentiate tags based on the CSS class in HTML tags.


1. import urllib
2. from BeautifulSoup import BeautifulSoup
3. import string,re

4. doc = urllib.urlopen('http://www.somesight.com/comment/reply/1854565').read()
5. soup = BeautifulSoup(''.join(doc))

6. a = soup.findAll("span",{"class":"field-prefix"})
7. b = a[0].contents[0].split("=")[0].split("+")
8. c = [int(bs) for bs in b]
9. captcha_response = sum(c)
10. print a,captcha_response

11. token1 = soup.findAll("input",id="edit-captcha-token")
12. token1_val = token1[0]['value']
13. print token1,token1_val


The two important pieces of information that I need are captcha_response, the solution to the mathematical problem, and the captcha token, a hidden HTML field in the webpage. Line #6 searches for the class field-prefix in span tags. This tag contains the string for the mathematical puzzle that needs to be solved. I obtain the contents of this string and split it in order to obtain the individual numbers as a list. Finally, I convert those numbers from strings to integers in line #8 and sum them in line #9.

Line #11 searches for the hidden captcha token, stored in the input tag with id="edit-captcha-token".

Armed with these two pieces of information, we can post any name and comment to the form. The comments were moderated, but it would still require a lot of human intervention to clear out the spam.

I informed the webmaster of this issue, and they have since moved to an image-based system. I removed all references to the site in this blog post and the program in order to preserve their anonymity.

Friday, July 16, 2010

Keep tab on items you lent using borrow-err.com

In my free time, I have been working on a small but useful and interesting project. It resulted in the site http://www.borrow-err.com/

The reason for creating the site was my forgetfulness: I lend books to others and then forget about it. So, I decided to create a website where I can keep track of the items I lend to others. If you would like to try it, just key in the details of the items and the name of the borrower on the home page. The website will then send you a reminder email every month for the items you lent. If you are lucky and get your items back, you can remove them from the list using the links provided in the email.

You can access it using any browser. The page is also light enough that it loads fine in mobile browsers. So whether you are at home, at the office or on the road, you can use borrow-err.com to keep track of the items you lent. Try it and give me your feedback.

Remember: borrowers err, so you need borrow-err.com

October 5th, 2010: The previous version had bare-minimum styling, so I made some changes; I believe the new version is more pleasing to the eye.

Thursday, February 19, 2009

Python and Abaqus

Recently I had the opportunity to work with a student who needed to perform a finite element analysis on roughly 400 files using Abaqus. Processing such a large number of files through the graphical user interface would have been impractical. We were happy to learn that Abaqus has a Python scripting module which could help us automate the task.

We set about writing the Python program from the individual function calls given in the manuals, but considering the scope and complexity of the software, that approach quickly became difficult. Instead, we resorted to creating "macros" and modifying them for our purpose.

Creating Macros

Macros in Abaqus let you perform a series of operations and record them as Python scripts. The scripts are stored by default in "abaqusMacros.py", with each macro recorded as a function. Since the macro was created for one particular model, we modified the names in the function "Entire_Work_Flow" to be generic, so that other models could be loaded. We then added other functions that call the function created by the macro.

The other function created was "getvalues", which obtains the relevant von Mises stress values from the ODB file. The main function reads each solid model (.sat file) in a given directory and passes the filename to the macro function. It then calls the getvalues function and stores the result in a CSV file for further analysis.

To run the script, type "abaqus cae -noGUI scriptname.py".

The Entire_Work_Flow listing was trimmed to show only the relevant lines, those that correspond to creating the names of the parts, the instance and the job.
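The full script was embedded in the post and is not reproduced here. A sketch of the driver portion is given below; Entire_Work_Flow and getvalues are the functions described above (assumed to be importable from abaqusMacros.py and to take the model name), and the directory layout and CSV columns are my own guesses:

import csv
import glob
import os

from abaqusMacros import Entire_Work_Flow, getvalues  # assumed location

# Run the recorded workflow on every solid model in the directory,
# harvest the von Mises stress values from the resulting ODB file,
# and collect everything into one CSV file for further analysis.
results = []
for sat_file in glob.glob(os.path.join('models', '*.sat')):
    model_name = os.path.splitext(os.path.basename(sat_file))[0]
    Entire_Work_Flow(model_name)                 # macro-derived function
    results.append([model_name] + list(getvalues(model_name)))

with open('stresses.csv', 'w') as fp:
    csv.writer(fp).writerows(results)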



Saturday, January 3, 2009

Spell checker in Unicode using Python

I had a project where I needed to perform spell checking on text recognized by an optical character recognition (OCR) program. My first step was to search for an existing program, preferably written in Python, my favorite language for such work. You can download the complete file here.

Amazingly, I found this work by Peter Norvig. It was a very well documented and well written piece of code.

But I had a few issues that I needed to fix, so I could not use it directly.

1. In my program, Unicode characters need to be defined as the default for all input and output, unlike Peter's program, which works on ASCII.

This is performed by the following code:


#!/usr/bin/python -Wall
# -*- coding: utf-8 -*-

import re, collections, pprint, os
import sys
import codecs

if __name__ == '__main__':
    ...
    reload(sys)
    sys.setdefaultencoding('iso8859-1')


2. The alphabet also has to include the Unicode characters applicable in my situation:
alphabet = u'abcdefghijklmnopqrstuvwxyzàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿß'

3. Python's Unicode support is smart enough to pick the correct characters when converting from upper case to lower case. All that needs to be done is to call the .lower() method on the Unicode text, as in the following function.


def words(text): return re.findall(u'[abcdefghijklmnopqrstuvwxyzàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿß]+',text.lower())


4. Peter's program trains on words by determining the probability of their occurrence; in simple terms, it counts the number of times a word appears in a standard piece of text. The larger the piece of text, the more representative it is of the real world. This scenario did not hold in my case, as I do not have a piece of text in which words are repeated multiple times.

In my case, I have a list of words in a text file, and almost every word appears exactly once. So a word is ranked not by frequency but by its ordinality.

The ord function in Python returns the Unicode code point of a character. In the function below, I first compute the ordinality of each word among the candidates (i.e., the original set of words), where the ordinality of a word is the sum of the code points of its characters. Then the ordinality of the word to be spell checked is computed. The absolute difference between the two ordinalities is taken, and the location of the lowest value gives the position of the correct word among the candidates.


def best_candidate(candidates, word):
    clist = list(candidates)

    # Find the ordinality of every candidate word
    so = []
    for cl in clist:
        sum_ord = 0
        for c in cl:
            sum_ord = sum_ord + ord(c)
        so.append(sum_ord)

    # Find the ordinality of the given word
    sum_ord = 0
    for c in word:
        sum_ord = sum_ord + ord(c)

    # Find the differences in ordinality and the location of the lowest value
    so_item_l = []
    for so_item in so:
        so_item_l.append(abs(so_item - sum_ord))
    min_loc = so_item_l.index(min(so_item_l))

    return clist[min_loc]