Saturday, May 10, 2014

My new textbook and job

So many things have happened in the last few months that I have not had time to add new entries to my blog. The first is the textbook that I co-authored with Sridevi Pudipeddi, which was released at the end of February.  It was a marathon run to complete the book and get it ready for publishing.  I also moved to a new job in California in April.


During my work as an image processing consultant at the Minnesota Supercomputing Institute, I worked with students in various scientific disciplines.  In all these cases, images were acquired using x-ray, CT, MRI, electron microscopes and optical microscopes.  It is important that students know both the physical methods of obtaining images and the analytical processing methods in order to understand the science behind the images. Thus, a course in image acquisition and processing has broad appeal across the STEM disciplines and is useful for transforming undergraduate and graduate curricula to better prepare students for their future.

There are books that discuss image acquisition alone and there are books on image processing alone.  However, image processing algorithms depend on the image acquisition method. We wrote a book that discusses both, so that students can learn from one source. You can check out a sample chapter of the book at reedwith.us. You can buy the book at Amazon by clicking on the image below.


I also changed jobs.  I started working as a Senior Engineer at Elekta in Sunnyvale, CA. I will be focusing mostly on x-ray and CT during my tenure at Elekta.

Monday, September 9, 2013

Installing a large number of packages in R - method 2

In one of my previous blog posts, I discussed a method for installing multiple R packages from one version of R into another.  In that post, I used a combination of R and Python.  In this post, I will present a method that uses only R.

There are two functions in the file (listed below), listmultipack and installmultipack.  You can copy the content of the listing below and name it 'packageinstall.r'.

The function listmultipack reads the list of all packages installed in a given R version and writes them to a file. The default file name is 'requirements.txt'.  The name has been chosen to follow the Python convention.  Alternatively, you can choose any other file name.

Once the new R version is installed on the same machine or a different machine, the installmultipack function can be used to install all the packages listed in 'requirements.txt' or in the file name that you chose.  Finally, the function prints all warnings that were generated during the installation.

To use the functions, load the file at the R command prompt using

>> source("packageinstall.r")

To write the list of packages to the file 'mylist.txt', type

>> listmultipack('mylist.txt')

To install all the packages in the mylist.txt file, type

>> installmultipack('mylist.txt')




Alternatively, if you do not have a previous installation of R and would like to install multiple R packages using one command, you can create a requirements file yourself by listing the package names in a text file.  Each package name should be on a line by itself.

Thursday, February 14, 2013

Checkpointing in Python

Scientific programs are known to be computationally expensive. They generally require a significant amount of processing time. For example, a program might run for several hours or days. It is not always possible to guarantee that the machine on which the program is running will be available to the user over such a long period. In such cases, checkpointing can be used to store the state of the program at different times, so that the program can be resumed without restarting the computation from the beginning. In this post, we will discuss a method for checkpointing Python programs.

The first program shown below adds all the numbers from 1 to 200 and prints the sum. This program might look too simplistic for scientific computation, but it provides a good platform for discussing checkpointing. Instead of computing the sum from 1 to 200, imagine computing it up to a really high value, such as trillions. In such a case, the computation might run for hours. If the program is interrupted, the computation needs to be started again from the beginning.
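A minimal sketch of such a program, assuming a plain loop rather than a closed-form formula so that it resembles a long-running computation. The function name sum_to is an assumption made for illustration and is not taken from the original listing.

```python
def sum_to(n):
    """Add the integers 1..n one at a time, like a long-running loop would."""
    total_sum = 0
    for i in range(1, n + 1):
        total_sum += i
    return total_sum


if __name__ == "__main__":
    # If this loop is interrupted, all progress is lost and the next run
    # starts again from 1.
    print(sum_to(200))
```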

The second program is the checkpointed version of the first program. In checkpointing, the current state of the program is stored in a file. Whenever the program runs, it looks for the checkpoint file so that it can restore the program state. If the file does not exist, the program assumes that it is being run for the first time. During the computation, the program writes its state to the file at regular intervals. The exact interval depends on the problem being solved. If the program runs successfully, the content of the checkpoint file is no longer needed and the file is removed from the disk.

In lines 7-12, the program checks whether the checkpoint file exists and, if it does, reads its content and stores it in the variables start and total_sum. If the checkpoint file does not exist, the program uses default values. During the computation, the current state of the program is written to the checkpoint file (checkpt_file) every 5th iteration. The sleep statement is added to slow the execution. If the program completes successfully, the checkpoint file is removed using the os.unlink method (line 28).

The checkpoint file is written and read using Python's pickle module. Thus, any Python data type that can be pickled can be stored. Alternative formats such as HDF5, CSV, XLS etc. can also be used to store the state. pickle was chosen because it is built into Python and because the stored state is a picklable dictionary.
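The following is a minimal sketch of a checkpointed sum along the lines described above. It is not the original listing, so the line numbers cited in the text do not apply to it. The function name checkpointed_sum, the default file name checkpoint.pkl, and the every, delay and interrupt_at parameters are assumptions made for illustration. interrupt_at only simulates an interruption so that a restart can be demonstrated. Writing to a temporary file and renaming it with os.replace is an addition here, assumed so that an interruption during a write cannot leave a truncated checkpoint.

```python
import os
import pickle
import time


def checkpointed_sum(n, checkpt_file="checkpoint.pkl", every=5,
                     delay=0.0, interrupt_at=None):
    """Sum 1..n, saving the state every `every` iterations."""
    # Restore the state if a checkpoint exists; otherwise use defaults.
    start, total_sum = 1, 0
    if os.path.exists(checkpt_file):
        with open(checkpt_file, "rb") as f:
            state = pickle.load(f)
        start, total_sum = state["start"], state["total_sum"]

    for i in range(start, n + 1):
        if interrupt_at is not None and i == interrupt_at:
            # Hypothetical stand-in for a crash or a killed job.
            raise RuntimeError("simulated interruption")
        total_sum += i
        if i % every == 0:
            # Store the next iteration to run and the sum so far.
            state = {"start": i + 1, "total_sum": total_sum}
            tmp_file = checkpt_file + ".tmp"
            with open(tmp_file, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp_file, checkpt_file)
        time.sleep(delay)  # a non-zero delay slows the loop for demonstration

    # The run finished, so the checkpoint is no longer needed.
    if os.path.exists(checkpt_file):
        os.unlink(checkpt_file)
    return total_sum


if __name__ == "__main__":
    print(checkpointed_sum(200))
```

Calling checkpointed_sum(200, interrupt_at=7) stops during iteration 7 and leaves a checkpoint holding the state after iteration 5. A second call without interrupt_at then resumes from iteration 6 and removes the file when it finishes.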

The image below is a snapshot taken after running the second program. The program was interrupted at the 7th iteration. When the program was restarted from the command line, it began with iteration 6, as the state of the program up to iteration 5 had been stored in the checkpoint file.