Friday, October 12, 2012

Installing a large number of packages in R

I have been learning R, more specifically how to install and maintain R for users.  Recently, I had to install the latest version, 2.15.1, from source. Once it was installed, the next step was to install the packages from the older version (2.15.0) into the new one.  Although installing a package in R is as simple as invoking the function install.packages(), it quickly becomes cumbersome when you have to install more than 400 packages.

Instead I resorted to a combination of R and Python to complete this process.

First, determine the list of all packages installed in the older version of R using the following commands:

packs <- installed.packages()
exc <- names(packs[,'Package'])

Here, exc is a character vector of package names.

Then, store this list in a plain text file, 'test.RData' (one package name per line), so that it can be processed using Python.  In R this can be done with, for example:

write(exc, file="test.RData")


To install more than one package at a time, a command similar to the one below can be used. This command installs the three packages BiocInstaller, coda and DEGseq.  However, the aim of this blog post is to describe a method for installing many more than three packages.

install.packages(c("BiocInstaller", "coda", "DEGseq"), dependencies=TRUE)

The Python script that generates this command is given below.  It reads the column of package names and concatenates them, adding quotation marks and commas as appropriate.

fp = open("test.RData", "r")
names = [line.strip() for line in fp if line.strip()]
fp.close()
# Join the quoted names with commas; joining (rather than appending a
# comma after every name) avoids a trailing comma, which R would reject.
s = 'install.packages(c(' + ','.join('"' + n + '"' for n in names) + '),dependencies=TRUE)'
print(s)

Finally, copy the output of the Python program into the R command line and wait a few hours for the installation to finish.

I am assuming this is a common problem.  How do you handle it?  You can share your advice in the comments.

PS:  The concatenation can be performed in any other scripting language, such as Perl, PHP or bash.

Saturday, March 3, 2012

Comparing Matlab and Python for large image processing problems

Matlab is a popular scripting language for numerical computing.  It is popular and powerful due to its various toolboxes.  Since Matlab is developed commercially, dedicated programmers are constantly adding new features and enhancing existing ones.  Large image data sets created using the latest imaging modalities typically take a long time to process.  The typical approach is to run the processing in parallel on many cores, either on a desktop, a cluster or a supercomputer.  Commercial software like Matlab requires a dedicated license, such as the Distributed Computing Server, to run such parallel jobs, which adds cost.

Graphical Processing Unit (GPU) programming is becoming popular and easier than ever.  The license cost for GPU programming in Matlab is typically lower than that for parallel programming.  GPU programming is useful if there is a large amount of computation and relatively little data transfer.  This limit exists because of the small bandwidth between the CPU and the GPU, the typical path that data has to take for processing on the GPU.  Processing large image data sets typically involves a large amount of I/O and data transfer between the CPU and GPU, and hence may not scale well.

Advantage Python

Python is a free and open-source scripting language.  It can be scaled to a large number of processors.  It has been shown that there is no significant difference in computational time between a program written in Matlab and one written in Python using Numpy.   The author found that Numpy run times are of the same order of magnitude as Fortran and C programs with optimization enabled.  Although the timings are for slightly older versions of the various languages, my experience has shown that the range of processing times remains similar.  Processing time will also vary depending on the nature of the algorithm and the expertise of the programmer.

Python has parallel processing modules, like mpi4py, that allow scaling an application to a large number of cores. In a supercomputing environment with a large number of cores, Python can run on as many of them as needed.  Matlab, on the other hand, can only scale to the extent of the number of licenses.
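As a minimal sketch of this idea, assuming the work is a list of image file names: the even-split logic below is plain Python, and the mpi4py calls that would use it are shown only in comments.  The function names and file names here are mine, for illustration.

```python
# Sketch: dividing a list of image files across MPI ranks.
# Only the partitioning logic is exercised here; the mpi4py usage
# is indicated in comments below.

def partition(items, rank, size):
    """Return the near-even slice of work assigned to one rank."""
    n, extra = divmod(len(items), size)
    start = rank * n + min(rank, extra)
    stop = start + n + (1 if rank < extra else 0)
    return items[start:stop]

# With mpi4py, each process would select its share like this:
#   from mpi4py import MPI
#   comm = MPI.COMM_WORLD
#   my_files = partition(all_files, comm.Get_rank(), comm.Get_size())
#   for f in my_files:
#       process_image(f)   # hypothetical per-image routine

if __name__ == "__main__":
    files = ["img%03d.tif" % i for i in range(10)]
    for r in range(3):
        print(r, partition(files, r, 3))
```

Because each rank computes its own slice from its rank number, no central scheduler is needed; the pattern scales to however many processes the scheduler grants.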

Python also has GPU programming capability through pycuda, in case GPU programming suits the application.

Python is a more general-purpose language than Matlab.  Hence, integrating databases, servers, file handling and string handling into a program is easy.

Disadvantage Python

Since Python is an open-source package, it does not have as wide a variety of canned functions and toolboxes as Matlab.  Hence, the user has to develop some of them.  I hope that over time this issue will be resolved.

Friday, February 17, 2012

Undergraduate education in image processing


Image acquisition and processing have become a standard method for qualifying and quantifying experimental measurements in various Science, Technology, Engineering and Mathematics (STEM) disciplines.  Discoveries in the medical sciences have been made possible by advances in diagnostic imaging such as x-ray based computed tomography (CT) and magnetic resonance imaging (MRI).  Biological and cellular functions have been revealed by new imaging techniques in light-based microscopy.  Advancements in materials science have been aided by electron microscopy analysis of nanoparticles.  All these examples, and many more, require knowledge of both the physical methods used to obtain images and the analytical processing methods used to understand the science behind the images.

Imaging technology continues to advance, with new modalities and methods available to students and researchers in STEM disciplines.  Thus, a course in image acquisition and processing would have broad appeal across the STEM disciplines and be useful for transforming undergraduate and graduate curricula to better prepare students for their future.

Image analysis is an extraordinarily practical technique that need not be limited to highly analytical individuals with a math or engineering background.  Since researchers in biology, medicine and chemistry, along with students and scientists from mathematics, physics and various engineering fields, use these techniques regularly, there is a need for a course that provides a gradual introduction to both acquisition and processing.

Such a course would prepare students in common image acquisition techniques like CT, MRI, light microscopy and electron microscopy.   It would also introduce practical aspects of image acquisition such as noise and resolution.  Students would program image processing routines in Python as part of the curriculum.  They would be introduced to Python modules such as Numpy and Scipy, and would learn various image processing operations such as segmentation, morphological operations, measurements and visualization.
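As a toy example of the kind of exercise such a course might include, here is segmentation by thresholding followed by a binary erosion, written in plain Python on a tiny "image" stored as a list of lists.  In practice students would use Numpy/Scipy; the function names and data here are mine, for illustration only.

```python
# Segmentation by thresholding, then a 3x3 binary erosion, on a toy image.

def threshold(img, t):
    """Segmentation: 1 where the pixel value exceeds t, else 0."""
    return [[1 if v > t else 0 for v in row] for row in img]

def erode(mask):
    """Binary erosion with a 3x3 square: keep a pixel only if it and
    all eight of its neighbours are 1; border pixels become 0."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = int(all(mask[y + dy][x + dx]
                                for dy in (-1, 0, 1)
                                for dx in (-1, 0, 1)))
    return out

img = [[0, 10, 10, 10, 0],
       [0, 90, 95, 92, 0],
       [0, 91, 99, 90, 0],
       [0, 93, 96, 94, 0],
       [0, 10, 10, 10, 0]]
print(threshold(img, 50))          # bright 3x3 block segmented out
print(erode(threshold(img, 50)))   # erosion shrinks it to its centre
```

A simple measurement, such as the segmented area, is then just the sum of the mask, which is how thresholding, morphology and measurement connect in a typical pipeline.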

We wanted to confirm our understanding of this need with data collected by surveying people interested in image processing, or people who wish they had had such a course during their senior year of undergraduate study or in graduate school.   We created a survey to obtain your feedback.    It will take only a minute of your time.   We request that you fill in as much information as you can.  Please forward this URL or this blog post to your friends as well.

Sunday, February 12, 2012

Python modules for scientific image processing

Recently my friend Nick Labello, working at the University of Chicago, tackled a large-scale, data-intensive image processing problem.  It took one week to process data acquired over 12 hours.  The program was written in Matlab and run on a desktop. If the processing time needs to be reduced, the best course of action is to parallelize the program and run it on multiple cores / nodes.  Matlab parallelization can be expensive, as it is closed-source commercial software.  Python, on the other hand, is free and open source, and hence can be parallelized to a large number of cores.  Python also has parallelization modules like mpi4py that ease the task.  With a clear choice of programming language, Nick evaluated the various Python modules, and the chosen module was used to rewrite the Matlab program.   Specifically, he reviewed the following:

  1. Numpy
  2. Scipy
  3. Pymorph
  4. Mahotas
  5. Scikits-image
  6. Python Imaging Library (PIL)
Nick’s views on the various modules, along with some of mine, are given below.


Numpy

It doesn't actually give you any image processing capabilities, but all of the image processing libraries rely on Numpy arrays to store the image data.  As a happy result, it is trivial to bounce between the different image processing libraries in the same code, because they all read and write the same datatype.

Scipy

It provides stable, solid, easy-to-use, very basic functionality (erosion, dilation, etc.).    It is missing a lot of the more advanced functions you might find in Matlab, such as functions to find endpoints, perimeter pixels, etc.; these must be pieced together from elementary operations.  The documentation for ndimage is NOT very good for new users; I had a hard time with it at first.
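As an illustration of piecing such functions together from elementary operations: perimeter pixels can be computed as the mask minus its own erosion.  The sketch below uses only Numpy (the function names are mine; in practice scipy.ndimage's binary_erosion would play the same role).

```python
import numpy as np

# Perimeter pixels "pieced together" from an elementary operation:
# a pixel is on the perimeter if it is foreground but at least one
# 4-neighbour is background, i.e. perimeter = mask AND NOT erode(mask).

def erode4(mask):
    """Binary erosion with a 4-connected cross, via array shifts."""
    m = np.pad(mask, 1, mode="constant", constant_values=0)
    return (m[1:-1, 1:-1] & m[:-2, 1:-1] & m[2:, 1:-1]
            & m[1:-1, :-2] & m[1:-1, 2:]).astype(mask.dtype)

def perimeter(mask):
    return mask & ~erode4(mask)

square = np.zeros((7, 7), dtype=np.uint8)
square[1:6, 1:6] = 1            # a 5x5 filled square
print(perimeter(square))        # only the square's outline remains
```

The same pattern (combine a primitive with logical operations) covers many of the "missing" Matlab conveniences mentioned above.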

Pymorph

It has lots of image processing functions that rely on nothing but Numpy; it does not depend on Scipy and is a Python-only library.  A second interesting thing about PyMorph is that it has quite a lot of functionality compared to the other libraries.  Unfortunately, since it is written in pure Python, it is hundreds of times slower than Scipy and the other libraries.  This will become an issue for advanced image processing users.


Mahotas

It provides only the most basic functions, but they are blazing fast; in my experience, twice as fast as the equivalent functions in Scipy.


Scikits-image

It picks up where Scipy leaves off and offers quite a few functions not found in Scipy.

Python Imaging Library (PIL)  

It is free to use but is not open source.  It has very few image processing algorithms and is not necessarily useful for scientific imaging.


Enthought

It is not an image processing library; it is a Python distribution ready for scientific programming.  It has over 100 scientific packages pre-installed and is compiled against the fast MKL/BLAS libraries.  My tests ran 2-3% faster in Enthought Python than in regular Python; your mileage may vary.  The big advantage, though, is that it has everything I need except Mahotas and PyMorph.  Also, Mahotas installs easily on Enthought, whereas it was difficult on the original Python installation due to various dependencies. Enthought is free for students and employees of universities and colleges; others need to pay for it.

Ravi's view: Personally, I do not have much trouble installing Python modules into the regular Python interpreter. The only time it becomes difficult is when installing scientific packages like Scipy that have numerous dependencies, such as the Boost libraries. Ready-made binary packages do eliminate the installation steps but are not necessarily well tuned for optimal performance. Enthought is a great alternative that is just as easy to install and yet optimized for performance.

Wednesday, February 8, 2012

Plotting a three-variable graph using Matlab

Recently, a user wanted to visualize the effect of four different test conditions on three different parameters.  This visualization helps in understanding the effect of a change in one parameter on the others.

The user suggested plotting the three parameters along three different axes.  For example, a parameter set with values [95.0, 1.2, 4.5] corresponds to the coordinates [95.0, 0.0, 0.0], [0.0, 1.2, 0.0] and [0.0, 0.0, 4.5].  These three coordinates form a triangle.  The shape of the triangle differs for the various test conditions, making it easier to visualize their effect on the parameters.
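The vertex mapping described above is simple to state in code.  A small sketch, in Python for illustration (the function name is mine):

```python
# Map a parameter triple [p1, p2, p3] to one point on each axis;
# the three points are the vertices of the triangle to be plotted.

def triangle_vertices(params):
    p1, p2, p3 = params
    return [(p1, 0.0, 0.0), (0.0, p2, 0.0), (0.0, 0.0, p3)]

print(triangle_vertices([95.0, 1.2, 4.5]))
# → [(95.0, 0.0, 0.0), (0.0, 1.2, 0.0), (0.0, 0.0, 4.5)]
```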

I was not initially sure whether I could accomplish this using standard Matlab plots.  I searched Google but was not successful, as I did not have a good search term.  I resorted to writing an OpenGL program using GL_TRIANGLES.   I later found that Matlab has similar functionality: triangles and other polygons can be easily constructed in Matlab using the "patch" function.

The program

The variable vals contains the values to be plotted.  Each column is one test condition, and the rows contain the parameters to be plotted along the axes.  The for loop runs over each column, creates the x, y and z coordinates, and stores them in a, b and c.  The patch command creates a triangle from the three coordinates; the last parameter in the patch command is the color of the patch.  By default, a patch is rendered with all surfaces opaque in the specified color.  Since there were too many surfaces, the patches were made transparent and the edges were given different line styles and thicknesses using "plottools".  The resulting plot can be seen below.

% The variable vals has three rows and four columns.  The rows contain
% the co-ordinate values along the x, y and z axes respectively; each
% column is one test condition.  The four columns will result in four
% triangular surfaces.
vals = [11.11, 3.55, 4.97, 2.14;  % (the remaining two rows are truncated in the original post)
        ];

hold on;
for i = 1:size(vals,2) % For each column in vals
   a = [vals(1,i)     0            0];
   b = [0         vals(2,i)        0];
   c = [0             0       vals(3,i)];
   % Create a triangular patch from the three vertices; the last
   % argument sets the patch color.
   patch([a(1); b(1); c(1)], [a(2); b(2); c(2)], [a(3); b(3); c(3)], 'b');
end
grid on;