Monday, September 9, 2013

Installing large number of packages in R - method 2

In one of my previous posts, I discussed a method for reinstalling the packages from one R version in another. That method used a combination of R and Python. In this post, I present a method that uses only R.

The file (listed below) contains two functions, listmultipack and installmultipack. You can copy the contents of the listing into a file named 'packageinstall.r'.

The function listmultipack reads the list of all packages installed in a given R version and writes their names to a file. The default file name is 'requirements.txt', chosen to follow the Python convention. Alternatively, you can pass any other file name.

Once the new R version is installed on the same machine or on a different machine, the installmultipack function can be used to install all the packages listed in 'requirements.txt' or in the file you chose. Finally, the function prints all warnings that were generated during the installation.

To use the functions, first load the file at the R command prompt using

>> source("packageinstall.r")

To write the list of packages to the file 'mylist.txt', type

>> listmultipack('mylist.txt')

To install all the packages in the mylist.txt file, type

>> installmultipack('mylist.txt')




Alternatively, if you do not have a previous installation of R and would like to install multiple R packages with one command, you can create the requirements file by hand: list the package names in a text file, one package name per line.

Thursday, February 14, 2013

Checkpointing in Python

Scientific programs are often computationally expensive and can run for several hours or days. It is not always possible to guarantee that the machine on which the program is running will remain available for that long. In such cases, checkpointing can be used to store the state of the program at different times, so that an interrupted program can resume from the last stored state instead of restarting the computation from the beginning. In this post, we discuss a method for checkpointing Python programs.

The first program shown below adds all the numbers from 1 to 200 and prints the sum. This program might look too simplistic for scientific computation, but it provides a good platform for discussing checkpointing. Instead of computing the sum from 1 to 200, imagine computing it up to a very large value, such as a trillion. In that case, the computation might run for hours, and if the program is interrupted, it has to start again from the beginning.
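The original listing is not reproduced here, so the following is only a minimal sketch of what such an uncheckpointed program might look like. The function name plain_sum and its parameter n are assumptions made for illustration.

```python
def plain_sum(n=200):
    """Add the integers from 1 to n with an explicit loop (no checkpointing)."""
    total_sum = 0
    for i in range(1, n + 1):
        total_sum += i
    return total_sum


if __name__ == "__main__":
    # If this loop is interrupted, all progress is lost and the
    # computation has to start again from i = 1.
    print(plain_sum())
```

The explicit loop is deliberate: it stands in for a long-running computation whose intermediate state (the loop index and the running total) is worth saving.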

The second program is the checkpointed version of the first program. In checkpointing, the current state of the program is stored in a file. Whenever the program starts, it looks for the checkpoint file so that it can restore the program state. If the file does not exist, the program assumes that it is being run for the first time. During the computation, the program writes its state to the file at regular intervals. The appropriate interval depends on the problem being solved. If the program runs to completion, the checkpoint file is no longer needed and is removed from the disk.

In lines 7-12, the program checks whether the checkpoint file exists and, if it does, reads its content and stores it in the variables start and total_sum. If the checkpoint file does not exist, it applies default values. During the computation, the current state of the program is written to the checkpoint file (checkpt_file) at every 5th iteration. The sleep statement is added to slow the execution. If the program completes successfully, the checkpoint file is removed using the os.unlink method (line 28).
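The steps described above can be sketched roughly as follows. This is not the original listing, so the line numbers mentioned in the text will not match it. The names start, total_sum and checkpt_file come from the description; the function name checkpointed_sum, the dictionary keys, the file name 'checkpoint.pkl', and the interrupt_at parameter (used only to simulate an interruption) are assumptions made for illustration. The sleep call is left out to keep the sketch fast.

```python
import os
import pickle


def checkpointed_sum(n=200, checkpt_file="checkpoint.pkl", every=5,
                     interrupt_at=None):
    """Add 1..n, saving the state to checkpt_file at every `every`-th iteration."""
    # Restore the saved state if a checkpoint exists, otherwise use defaults.
    if os.path.exists(checkpt_file):
        with open(checkpt_file, "rb") as f:
            state = pickle.load(f)
        start, total_sum = state["start"], state["total_sum"]
    else:
        start, total_sum = 1, 0

    for i in range(start, n + 1):
        # Hypothetical hook that simulates the user interrupting the run.
        if interrupt_at is not None and i == interrupt_at:
            raise KeyboardInterrupt("simulated interruption")
        total_sum += i
        if i % every == 0:
            # Store the next iteration to run together with the running total.
            with open(checkpt_file, "wb") as f:
                pickle.dump({"start": i + 1, "total_sum": total_sum}, f)

    # The run completed, so the checkpoint is no longer needed.
    if os.path.exists(checkpt_file):
        os.unlink(checkpt_file)
    return total_sum


if __name__ == "__main__":
    print(checkpointed_sum())
```

Calling checkpointed_sum(interrupt_at=7) stops the run at iteration 7 and leaves behind a checkpoint written at iteration 5. A second call without interrupt_at then resumes at iteration 6 and returns the same total as an uninterrupted run.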

The checkpoint file is written and read using Python's pickle module. Thus, any Python object that can be pickled can be stored. Alternative formats such as HDF5, CSV, or XLS can also be used. pickle was chosen because it is built into Python and because the data being stored is a picklable dictionary.
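As a small illustration of why a dictionary is a convenient container for the program state, the sketch below pickles a state dictionary to bytes and restores it. The keys and values are assumptions chosen to resemble the summation example, not taken from the original listing.

```python
import pickle

# Hypothetical program state: the next iteration to run and the running total.
state = {"start": 6, "total_sum": 15}

# Serialize to bytes and back; pickle.dump and pickle.load do the same with a file.
blob = pickle.dumps(state)
restored = pickle.loads(blob)

print(restored == state)  # True
```

More state can be added to the same dictionary later without changing how the checkpoint is read or written.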

The image below is a snapshot taken after running the second program. The program was interrupted at the 7th iteration. When the program was restarted from the command line, it began with iteration 6, because the state of the program up to iteration 5 had been stored in the checkpoint file.