Biology: The Value of Version Control

Version control software was originally developed to maintain and coordinate the development of complex software engineering projects. Modern version control systems such as Subversion are based on a central repository that stores all versions of a given collection of related files. Multiple individuals can “check out” a working copy of these files into their local directories, make changes, and then check the changes back into the central repository.

I find version control software to be invaluable for managing computational experiments, for three reasons. First, the software provides a form of backup. Although our university computer systems are automatically backed up on a nightly basis, my laptop's backup schedule is more erratic. Furthermore, after mistakenly overwriting a file, it is often easier to retrieve yesterday's version from Subversion than to send an e-mail to the system administator. Indeed, one of my graduate students told me he would breathe a sigh of relief after typing svn commit, because that command stores a snapshot of his working directory in the central repository.

Second, version control provides a historical record that can be useful for tracking down bugs or understanding old results. Typically, a script or program will evolve throughout the course of a project. Rather than storing many copies of the script with slightly different names, I rely upon the version control system to keep track of those versions. If I need to reproduce exactly an experiment that I performed three months ago, I can use the version control software to check out a copy of the state of my project at that time. Note that most version control software can also assign a logical “tag” to a particular state of the repository, allowing you to easily retrieve that state later.

Third, and perhaps most significantly, version control is invaluable for collaborative projects. The repository allows collaborators to work simultaneously on a collection of files, including scripts, documentation, or a draft manuscript. If two individuals edit the same file in parallel, then the version control software will automatically merge the two versions and flag lines that were edited by both people. It is not uncommon, in the hours before a looming deadline, for me to talk by phone with a remote collaborator while we both edit the same document, checking in changes every few minutes.

Although the basic idea of version control software seems straightforward, using a system such as Subversion effectively requires some discipline. First, version control software is most useful when it is used regularly. A good rule of thumb is that changes should be checked in at least once a day. This ensures that your historical record is complete and that a recent backup is always available if you mistakenly overwrite a file. If you are in the midst of editing code, and you have caused a once-compilable program to no longer work, it is possible to check in your changes on a “branch” of the project, effectively stating that this is a work in progress. Once the new functionality is implemented, then the branch can be merged back into the “trunk” of the project. Only then will your changes be propagated to other members of the project team.

Second, version control should only be used for files that you edit by hand. Automatically generated files, whether they are compiled programs or the results of a computational experiment, do not belong under version control. These files tend to be large, so checking them into the project wastes disk space, both because they will be duplicated in the repository and in every working copy of the project, and also because these files will tend to change as you redo your experiment multiple times. Binary files are particularly wasteful: Because version control software operates on a line-by-line basis, the version history of a binary file is simply a complete copy of all versions of that file. There are exceptions to this rule, such as relatively small data files that will not change through the experiment, but these exceptions are rare.

One practical difficulty with not checking in automatically generated files is that each time you issue an update command, the version control software is likely to complain about all of these files in your working directory that have not been checked in. To avoid scrolling through multiple screens of filenames at each update, Subversion and CVS provide functionality to tell the system to ignore certain files or types of files.

Biology

bidvertiser

My Blog List

Search This Blog

Monday, March 1, 2010

The Value of Version Control

No comments:

Post a Comment

Followers

Blog Archive

About Me