Monday, March 1, 2010

Carrying Out a Single Experiment

You have now created your directory structure, and you have created a directory for the current data, with the intention of carrying out a particular experiment in that directory. How do you proceed?

The general principle is that you should record every operation that you perform, and make those operations as transparent and reproducible as possible. In practice, this means that I create either a README file, in which I store every command line that I used while performing the experiment, or a driver script (I usually call this runall) that carries out the entire experiment automatically.

The choices that you make at this point will depend strongly upon what development environment you prefer. If you are working in a language such as Matlab or R, you may be able to store everything as a script in that language. If you are using compiled code, then you will need to store the command lines separately. Personally, I work in a combination of shell scripts, Python, and C. The appropriate mix of these three languages depends upon the complexity of the experiment. Whatever you decide, you should end up with a file that is parallel to the lab notebook entry. The lab notebook contains a prose description of the experiment, whereas the driver script contains all the gory details.

Here are some rules of thumb that I try to follow when developing the driver script:

  1. Record every operation that you perform.

  2. Comment generously. The driver script typically involves little in the way of complicated logic, but often invokes various scripts that you have written, as well as a possibly eclectic collection of Unix utilities. Hence, for this type of script, a reasonable rule of thumb is that someone should be able to understand what you are doing solely from reading the comments. Note that I am refraining from advocating a particular mode of commenting for compiled code or more complex scripts—there are many schools of thought on the correct way to write such comments.

  3. Avoid editing intermediate files by hand. Doing so means that your script will only be semi-automatic, because the next time you run the experiment, you will have to redo the editing operation. Many simple editing operations can be performed using standard Unix utilities such as sed, awk, grep, head, tail, sort, cut, and paste.

  4. Store all file and directory names in this script. If the driver script calls other scripts or functions, then files and directory names should be passed from the driver script to these auxiliary scripts. Forcing all of the file and directory names to reside in one place makes it much easier to keep track of and modify the organization of your output files.

  5. Use relative pathnames to access other files within the same project. If you use absolute pathnames, then your script will not work for people who check out a copy of your project in their local directories (see “The Value of Version Control” below).

  6. Make the script restartable. I find it useful to wrap each long-running step of the experiment in a conditional of the form if (<output file does not exist>) then <perform operation>. If I want to rerun selected parts of the experiment, then I can delete the corresponding output files.
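
As a rough illustration of rules 4 through 6, here is a minimal Python sketch of a restartable driver. The step names, the results directory, the log file, and the placeholder that stands in for the real computation are all hypothetical. Writing to a temporary file before renaming it is my own addition, so that an interrupted step does not leave behind an output that looks finished.

```python
from pathlib import Path

# Hypothetical layout. All file and directory names live in this one
# place (rule 4) and are relative to the project directory (rule 5).
RESULTS_DIR = Path("results")
STEPS = [
    ("normalize", RESULTS_DIR / "normalized.txt"),
    ("cluster", RESULTS_DIR / "clusters.txt"),
]


def run_step(name, output, log):
    """Run one step unless its output file already exists (rule 6).

    Returns True if the step ran and False if it was skipped.
    """
    if output.exists():
        return False
    output.parent.mkdir(parents=True, exist_ok=True)
    log.parent.mkdir(parents=True, exist_ok=True)
    # Placeholder for the real work: a real driver would call an
    # auxiliary script or function here, passing it the output path.
    tmp = output.with_name(output.name + ".tmp")
    tmp.write_text(f"output of {name}\n")
    tmp.replace(output)  # rename only after the step has finished
    # Record the operation that was performed (rule 1).
    with log.open("a") as handle:
        handle.write(f"ran {name} -> {output}\n")
    return True


def run_all(steps=STEPS, log=RESULTS_DIR / "runall.log"):
    """Run every pending step in order; return the names of those that ran."""
    return [name for name, output in steps if run_step(name, output, log)]
```

Calling run_all() a second time does nothing, because every output already exists. Deleting one output file and calling run_all() again reruns only that step.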

For experiments that take a long time to run, I find it useful to be able to obtain a summary of the experiment's progress thus far. In these cases, I create two driver scripts, one to run the experiment (runall) and one to summarize the results (summarize). The final line of runall calls summarize, which in turn creates a plot, table, or HTML page that summarizes the results of the experiment. The summarize script is written in such a way that it can interpret a partially completed experiment, showing how much of the computation has been performed thus far.
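
A summarize script along these lines might look like the following Python sketch. It assumes, hypothetically, that each step of the experiment writes exactly one output file, so that progress can be read off from which files exist. It produces a plain-text table rather than a plot or HTML page.

```python
from pathlib import Path


def summarize(expected_outputs):
    """Return a plain-text progress table for a possibly unfinished run.

    expected_outputs is a list of the output files that a complete run
    would produce, in the order in which the steps are carried out.
    """
    lines = []
    done = 0
    for output in expected_outputs:
        finished = Path(output).exists()
        done += finished
        status = "done" if finished else "pending"
        lines.append(f"{status:<8}{output}")
    lines.append(f"{done} of {len(expected_outputs)} steps complete")
    return "\n".join(lines)
```

Because the function only checks which files exist, it can be run at any time while the experiment is still in progress, as well as from the final line of runall.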
