Scientific programs are often computationally expensive and can take a significant amount of time to run, sometimes several hours or days. It is not always possible to guarantee that the machine on which the program is running will be available to the user for that long. In such cases, checkpointing can be used to store the state of the program at different times, so that an interrupted program can resume from the last saved state instead of restarting the computation from the beginning. In this post, we will discuss a method for checkpointing Python programs.
The first program shown below adds all the numbers from 1 to 200 and prints the sum. This program might look too simplistic for scientific computation, but it provides a good platform for discussing checkpointing. Instead of computing the sum from 1 to 200, imagine computing it up to a really high value, such as trillions. In such a case, the computation might run for hours. If the program is interrupted, the computation has to be started again from the beginning.
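As a point of reference, a program of this kind could look like the minimal sketch below. It is not the original listing; the function name `running_sum` is a hypothetical choice made for illustration.

```python
# A minimal sketch of the un-checkpointed computation.
# `running_sum` is a hypothetical name, not taken from the original program.
def running_sum(n):
    """Add the integers from 1 to n with an explicit loop."""
    total_sum = 0
    for i in range(1, n + 1):
        total_sum += i
    return total_sum


total = running_sum(200)
print(total)  # 20100
```

All of the state lives in `i` and `total_sum`, so if the loop is interrupted, both are lost and the next run starts again from 1.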
The second program is the checkpointed version of the first program. In checkpointing, the current state of the program is stored in a file. Whenever the program runs, it looks for the checkpoint file so that it can restore its state. If the file does not exist, the program assumes that it is being run for the first time. During the computation, the program writes its state to the file at regular intervals. The exact interval depends on the problem being solved. If the program runs to completion, the content of the checkpoint file is no longer needed, so the file is removed from the disk.
In lines 7-12, the program checks whether the checkpoint file exists and, if it does, reads its content and stores it in the variables start and total_sum. If the checkpoint file does not exist, it applies default values. During the computation, the current state of the program is written to the checkpoint file (checkpt_file) every 5th iteration. The sleep statement is added to slow the execution. If the program completes successfully, the checkpoint file is removed using the os.unlink function (line 28).
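The steps described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the original listing, so the line numbers quoted in the text do not apply to it. The file name `checkpoint.pkl`, the helper functions, and the `every` and `delay` parameters are hypothetical; `start`, `total_sum`, and `checkpt_file` follow the names used in the text.

```python
import os
import pickle
import time

checkpt_file = "checkpoint.pkl"  # hypothetical file name


def load_checkpoint(path):
    """Return (start, total_sum) from the checkpoint, or defaults on a first run."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            state = pickle.load(f)
        return state["start"], state["total_sum"]
    return 1, 0  # first run: begin at 1 with an empty sum


def save_checkpoint(path, start, total_sum):
    """Store the next iteration to run and the sum accumulated so far."""
    with open(path, "wb") as f:
        pickle.dump({"start": start, "total_sum": total_sum}, f)


def checkpointed_sum(n, path=checkpt_file, every=5, delay=0.0):
    """Sum 1..n, saving the state every `every` iterations."""
    start, total_sum = load_checkpoint(path)
    for i in range(start, n + 1):
        total_sum += i
        if i % every == 0:
            # Iteration i is done, so a restart should begin at i + 1.
            save_checkpoint(path, i + 1, total_sum)
        time.sleep(delay)  # slows the loop so it can be interrupted by hand
    if os.path.exists(path):
        os.unlink(path)  # finished: the checkpoint is no longer needed
    return total_sum


result = checkpointed_sum(200)
print(result)  # 20100
```

Passing a non-zero `delay` (for example `checkpointed_sum(200, delay=0.5)`) leaves enough time to interrupt the run with Ctrl+C. Starting the program again then resumes from the iteration after the last saved checkpoint.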
The checkpoint file is written and read using Python's pickle module. Thus, any Python object that can be pickled can be stored. Alternative formats such as HDF5, CSV, or XLS can also be used to store the state. pickle was chosen because it is built into Python and because the stored state is a dictionary, which can be pickled.
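To illustrate one of the alternative formats, the sketch below stores the same two values in a CSV file using the standard library's csv module. The function names and the column layout are assumptions made for this example. CSV only suits simple, flat state like this; unlike pickle, it stores everything as text, so values must be converted back when loading.

```python
import csv
import os
import tempfile


def save_state_csv(path, start, total_sum):
    """Write the state as a header row followed by one data row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["start", "total_sum"])
        writer.writerow([start, total_sum])


def load_state_csv(path):
    """Return (start, total_sum) from the CSV file, or defaults on a first run."""
    if not os.path.exists(path):
        return 1, 0
    with open(path, newline="") as f:
        row = next(csv.DictReader(f))
    # CSV stores text, so convert the fields back to integers.
    return int(row["start"]), int(row["total_sum"])


# Demonstrate a first run followed by a restore, inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "checkpoint.csv")
    first_run = load_state_csv(demo_path)
    save_state_csv(demo_path, 6, 15)
    restored = load_state_csv(demo_path)

print(first_run, restored)  # (1, 0) (6, 15)
```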
The image below is a snapshot taken after running the second program. The program was interrupted at the 7th iteration. When it was restarted from the command line, it began at iteration 6, as the state of the program up to iteration 5 had been stored in the checkpoint file.