
Sunday, March 7, 2010

Scientists measure impacts of changing climate on ocean biology



The study, Climate Variability on the East Coast (CliVEC), will also help validate ocean colour satellite measurements and refine biogeochemistry models of ocean processes.

Researchers from NOAA, NASA and Old Dominion University are collaborating through an existing NOAA Fisheries Service field program, the Ecosystem Monitoring or EcoMon program.

The EcoMon surveys are conducted six times each year by the Northeast Fisheries Science Center (NEFSC) at 120 randomly selected stations throughout the continental shelf and slope of the northeastern US, from Cape Hatteras, N.C., into Canadian waters to cover all of Georges Bank and the Gulf of Maine.

This area is known as the Northeast US continental shelf Large Marine Ecosystem.

The climate study team will participate in three annual EcoMon cruises aboard the 155-foot NOAA Fisheries Survey Vessel Delaware II, based at the NEFSC's laboratory in Woods Hole, Massachusetts.

Findings from the climate impact project, funded by NASA, will help scientists better understand how annual and decadal-scale climate variability affects the growth of phytoplankton, which is the basis of the oceanic food chain.

The project will also examine organic carbon distributions along the continental margin of the East Coast and collect data for ocean acidification studies.

According to laboratory colleague Jon Hare, an oceanographer and plankton specialist, "The CliVEC program will provide a more complete understanding of the northeast US shelf ecosystem."

"It extends our EcoMon survey efforts, and we are excited about the new knowledge and advances in satellite models that we will all gain from this collaboration and pooling of resources," he said.

The team of scientists from NASA's Goddard Space Flight Center (GSFC) and Old Dominion University (ODU) is conducting water sampling and experiments to quantify primary productivity and carbon distributions.

"Phytoplankton are the foundation of the food chain in the ocean and produce about half of the oxygen on Earth," said Antonio Mannino from NASA's Goddard Space Flight Center (GSFC).

"By understanding the distribution of phytoplankton populations and how they react to natural and anthropogenic forcing, we can better predict future responses of phytoplankton and possibly even fisheries," he added.

Zoo conducts experiment on preserving Va. state bat

From the outset, the National Zoo said, it knew it was risky to work with the Virginia big-eared bat, the odd-looking winged creature that happens to be Virginia's state bat.

But looking for a way to help the species survive a disease threat, the zoo set up quarters for 40 of the animals at the Smithsonian Conservation Biology Institute in Front Royal, Va. The idea was to learn how to keep at least some of the bats alive in case wild populations were devastated.

But efforts to maintain the big-eared bats in confinement "have proved challenging," and only 11 remain alive, the zoo said Friday.

A big problem was getting the animals to eat.

Normally, the big-eared bats dine in flight, picking juicy insects out of the air. In the experiment, some bats learned to eat mealworms (insect larvae) from pans. But even some of those failed to thrive.

"They stress easily and do not do well in captivity," said Jeremy Coleman, of the U.S. Fish and Wildlife Service, a sponsor of the Zoo's project.

In recent years, some wild bat populations have been imperiled by a new threat: white-nose syndrome. The disease has spread to Virginia from the Northeast, said Coleman, the service's national white-nose coordinator.

The Virginia big-eared bat's susceptibility has not been confirmed but is suspected, he said.

Fossils of snake eating dino eggs found in India

BANGKOK (AP) — The fossilized remains of a 67 million-year-old snake found coiled around a dinosaur egg offer rare insight into the ancient reptile's dining habits and evolution, scientists said Tuesday.

The findings, which appeared in Tuesday's issue of the journal PLoS Biology, provide the first evidence that the 11.5-foot- (3.5-meter-) long snake fed on eggs and hatchlings of sauropod dinosaurs, meaning it was one of the few predators to prey on the long-necked herbivores.

They also suggest that, as early as 100 million years ago, snakes were developing mobile jaws similar to those of today's large-mouthed snakes, including vipers and boas.

"This is an early, well preserved snake, and it is doing something. We are capturing it's behavior," said University of Michigan paleontologist Jeff Wilson, who is credited with recognizing the snake bones amid the crushed dinosaur eggs and bones of hatchlings.

"We have information about what this early snake did for living," he said. "It also helps us understand the early evolution of snakes both anatomically and ecologically."

Dhananjay Mohabey of India's Geological Survey discovered the fossilized remains in 1987, but he was only able to make out the dinosaur eggshells and limb bones. Wilson examined the fossils in 2001 and was "astonished" to find a predator in the midst of the sauropod's nest.

"I saw the characteristic vertebral locking mechanism of snakes alongside dinosaur eggshell and larger bones, and I knew it was an extraordinary specimen," Wilson said.

Mohabey theorized that the snake — dubbed Sanajeh indicus, which means "ancient gaped one" in Sanskrit — had just arrived at the nest and was in the process of gobbling a hatchling emerging from its egg. But the entire scene was "frozen in time" when it was hit by a storm or some other disaster and buried under layers of sediment.

"We think the hatchlings had just exited its egg, and the activity attracted the snake," Mohabey said, adding that the site in Western state of Gujarat has revealed about 30 sauropod nests and at least two other snake specimens.

Michael Benton of the University of Bristol, also writing in PLoS Biology, said it can be difficult to determine the behavior of ancient organisms. But he said that it was "most likely, as the authors argue, that this snake was waiting and snatching juveniles as they hatched."

"Of course, we cannot be entirely sure unless further specimens come to light showing the bones of juvenile dinosaurs in the stomach region of the snake," Benton said.

Ashok Sahni, a senior scientist at the Indian National Science Academy who was also not involved in the dig, described the find as "truly remarkable" because it is rare for fossil bones to be preserved at the site of fossilized eggs.

Friday, March 5, 2010

Foundations of modern biology

Cell theory



Cell theory states that the cell is the fundamental unit of life, and that all living things are composed of one or more cells or the secreted products of those cells (e.g. shells). All cells arise from other cells through cell division. In multicellular organisms, every cell in the organism's body derives ultimately from a single cell in a fertilized egg. The cell is also considered to be the basic unit in many pathological processes. Additionally, the phenomenon of energy flow occurs in cells in processes that are part of the function known as metabolism. Finally, cells contain hereditary information (DNA), which is passed from cell to cell during cell division.

Evolution

A central organizing concept in biology is that life changes and develops through evolution, and that all known life-forms share a common origin. The term evolution was introduced into the scientific lexicon by Jean-Baptiste de Lamarck in 1809; fifty years later, Charles Darwin established evolution as a viable theory by articulating its driving force, natural selection. (Alfred Russel Wallace is recognized as the co-discoverer of the concept, having independently arrived at the idea of natural selection.) Evolution is now used to explain the great variations of life found on Earth.

Darwin theorized that species and breeds developed through the processes of natural selection and artificial selection or selective breeding. Genetic drift was embraced as an additional mechanism of evolutionary development in the modern synthesis of the theory.

The evolutionary history of the species—which describes the characteristics of the various species from which it descended—together with its genealogical relationship to every other species is known as its phylogeny. Widely varied approaches to biology generate information about phylogeny. These include the comparisons of DNA sequences conducted within molecular biology or genomics, and comparisons of fossils or other records of ancient organisms in paleontology. Biologists organize and analyze evolutionary relationships through various methods, including phylogenetics, phenetics, and cladistics. (For a summary of major events in the evolution of life as currently understood by biologists, see evolutionary timeline.)

The theory of evolution postulates that all organisms on the Earth, both living and extinct, have descended from a common ancestor or an ancestral gene pool. This last universal common ancestor of all organisms is believed to have appeared about 3.5 billion years ago. Biologists generally regard the universality of the genetic code as definitive evidence in favor of the theory of universal common descent for all bacteria, archaea, and eukaryotes (see: origin of life).

Genetics

Genes are the primary units of inheritance in all organisms. A gene is a unit of heredity and corresponds to a region of DNA that influences the form or function of an organism in specific ways. All organisms, from bacteria to animals, share the same basic machinery that copies and translates DNA into proteins. Cells transcribe a DNA gene into an RNA version of the gene, and a ribosome then translates the RNA into a protein, a sequence of amino acids. The translation code from RNA codon to amino acid is the same for most organisms, but slightly different for some. For example, a sequence of DNA that codes for insulin in humans will also code for insulin when inserted into other organisms, such as plants.
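
To make the idea of a near-universal genetic code concrete, here is a minimal sketch (my own illustration, not drawn from the text) that translates a short mRNA sequence into amino acids using a codon table. Only the handful of codons needed for this toy example are listed; the point is that the same table yields the same protein regardless of which organism the gene came from.

    # Toy codon table: a few entries of the (near-universal) genetic code.
    CODON_TABLE = {
        "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "AAA": "Lys",
        "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
    }

    def translate(mrna):
        """Read the mRNA three bases at a time until a stop codon."""
        protein = []
        for i in range(0, len(mrna) - 2, 3):
            amino_acid = CODON_TABLE.get(mrna[i:i + 3], "???")
            if amino_acid == "STOP":
                break
            protein.append(amino_acid)
        return protein

    print(translate("AUGUUUGGCAAAUAA"))  # ['Met', 'Phe', 'Gly', 'Lys']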

DNA usually occurs as linear chromosomes in eukaryotes, and circular chromosomes in prokaryotes. A chromosome is an organized structure consisting of DNA and histones. The set of chromosomes in a cell and any other hereditary information found in the mitochondria, chloroplasts, or other locations is collectively known as its genome. In eukaryotes, genomic DNA is located in the cell nucleus, along with small amounts in mitochondria and chloroplasts. In prokaryotes, the DNA is held within an irregularly shaped body in the cytoplasm called the nucleoid. The genetic information in a genome is held within genes, and the complete assemblage of this information in an organism is called its genotype.

Homeostasis

Homeostasis is the ability of an open system to regulate its internal environment to maintain stable conditions by means of multiple dynamic equilibrium adjustments controlled by interrelated regulation mechanisms. All living organisms, whether unicellular or multicellular, exhibit homeostasis.

In order to maintain dynamic equilibrium and effectively carry out certain functions, a system must detect and respond to perturbations. After detecting a perturbation, a biological system normally responds through negative feedback, counteracting the deviation by reducing or increasing the activity of an organ or system. One example is the release of glucagon when blood sugar levels are too low, which raises them back toward normal.
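
As a rough illustration of negative feedback (a toy model of my own, not taken from the text), the sketch below repeatedly measures how far a regulated value has drifted from its set point and applies a response that opposes the deviation, so the value settles back toward the set point:

    SET_POINT = 90.0   # target level for the regulated quantity (arbitrary units)
    GAIN = 0.5         # strength of the corrective response

    def regulate(value, steps=10):
        for _ in range(steps):
            error = SET_POINT - value   # deviation from the set point
            value += GAIN * error       # response opposes the deviation
        return value

    print(regulate(60.0))  # starts low, climbs back toward 90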

Energy

The survival of a living organism depends on the continuous input of energy. The chemical reactions responsible for an organism's structure and function extract energy from the substances that act as its food and transform them to help form new cells and sustain them. In this process, the molecules that constitute food play two roles: first, they contain energy that can be transformed and used in biological chemical reactions; second, they provide the raw material for building new molecular structures made of biomolecules.

Nearly all of the energy needed for life processes originates from the Sun. Plants and other autotrophs use solar energy via a process known as photosynthesis to convert raw materials into organic molecules, such as ATP, whose bonds can be broken to release energy. A few ecosystems, however, depend entirely on energy extracted by chemotrophs from methane, sulfides, or other non-luminal energy sources.

Some of the captured energy is used to produce biomass to sustain life and provide energy for its growth and development. The majority of the rest of this energy is lost as heat and waste molecules. The most important processes for converting the energy trapped in chemical substances into energy useful to sustain life are metabolism and cellular respiration.

Structural

Molecular biology is the study of biology at a molecular level. This field overlaps with other areas of biology, particularly with genetics and biochemistry. Molecular biology chiefly concerns itself with understanding the interactions between the various systems of a cell, including the interrelationship of DNA, RNA, and protein synthesis and learning how these interactions are regulated.

Cell biology studies the structural and physiological properties of cells, including their behaviors, interactions, and environment. This is done on both the microscopic and molecular levels, for single-celled organisms such as bacteria as well as the specialized cells in multicellular organisms such as humans. Understanding the structure and function of cells is fundamental to all of the biological sciences. The similarities and differences between cell types are particularly relevant to molecular biology.

Anatomy considers the forms of macroscopic structures such as organs and organ systems.

Genetics is the science of genes, heredity, and the variation of organisms. Genes encode the information necessary for synthesizing proteins, which in turn play a large role in influencing (though, in many instances, not completely determining) the final phenotype of the organism. In modern research, genetics provides important tools in the investigation of the function of a particular gene, or the analysis of genetic interactions. Within organisms, genetic information generally is carried in chromosomes, where it is represented in the chemical structure of particular DNA molecules.

Developmental biology studies the process by which organisms grow and develop. Originating in embryology, modern developmental biology studies the genetic control of cell growth, differentiation, and "morphogenesis," which is the process that progressively gives rise to tissues, organs, and anatomy. Model organisms for developmental biology include the roundworm Caenorhabditis elegans, the fruit fly Drosophila melanogaster, the zebrafish Danio rerio, the mouse Mus musculus, and the weed Arabidopsis thaliana. (A model organism is a species that is extensively studied to understand particular biological phenomena, with the expectation that discoveries made in that organism will provide insight into the workings of other organisms.)

Physiological

Physiology studies the mechanical, physical, and biochemical processes of living organisms by attempting to understand how all of the structures function as a whole. The theme of "structure to function" is central to biology. Physiological studies have traditionally been divided into plant physiology and animal physiology, but some principles of physiology are universal, no matter what particular organism is being studied. For example, what is learned about the physiology of yeast cells can also apply to human cells. The field of animal physiology extends the tools and methods of human physiology to non-human species. Plant physiology borrows techniques from both research fields.

Physiology studies how, for example, the nervous, immune, endocrine, respiratory, and circulatory systems function and interact. The study of these systems is shared with medically oriented disciplines such as neurology and immunology.

Evolutionary

Evolutionary research is concerned with the origin and descent of species, as well as their change over time, and includes scientists from many taxonomically oriented disciplines. For example, it generally involves scientists who have special training in particular groups of organisms, in fields such as mammalogy, ornithology, botany, or herpetology, but who use those organisms as systems to answer general questions about evolution.

Evolutionary biology is partly based on paleontology, which uses the fossil record to answer questions about the mode and tempo of evolution, and partly on the developments in areas such as population genetics and evolutionary theory. In the 1980s, developmental biology re-entered evolutionary biology from its initial exclusion from the modern synthesis through the study of evolutionary developmental biology. Related fields which are often considered part of evolutionary biology are phylogenetics, systematics, and taxonomy.

Systematics

Multiple speciation events create a tree-structured system of relationships between species. The role of systematics is to study these relationships and thus the differences and similarities between species and groups of species. However, systematics was an active field of research long before evolutionary thinking was common. The classification, taxonomy, and nomenclature of biological organisms are administered by the International Code of Zoological Nomenclature, the International Code of Botanical Nomenclature, and the International Code of Nomenclature of Bacteria for animals, plants, and bacteria, respectively. The classification of viruses, viroids, prions, and all other sub-viral agents that demonstrate biological characteristics is conducted under the International Code of Virus Classification and Nomenclature. However, several other viral classification systems do exist.

Traditionally, living things have been divided into five kingdoms: Monera; Protista; Fungi; Plantae; Animalia.

However, many scientists now consider this five-kingdom system outdated. Modern alternative classification systems generally begin with the three-domain system: Archaea (originally Archaebacteria); Bacteria (originally Eubacteria); and Eukaryota (including protists, fungi, plants, and animals). These domains reflect whether the cells have nuclei or not, as well as differences in the chemical composition of the cell exteriors.

Further, each kingdom is broken down recursively until each species is separately classified. The order is: Domain; Kingdom; Phylum; Class; Order; Family; Genus; Species.

There is also a series of intracellular parasites that are "on the edge of life" in terms of metabolic activity, meaning that many scientists do not actually classify these structures as alive, due to their lack of at least one or more of the fundamental functions by which life is defined. They are classified as viruses, viroids, prions, or satellites.

The scientific name of an organism is generated from its genus and species. For example, humans are listed as Homo sapiens: Homo is the genus and sapiens is the species. When writing the scientific name of an organism, it is proper to capitalize the first letter of the genus and write the species epithet entirely in lowercase. Additionally, the entire term is italicized or underlined.
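
As a small illustration of these formatting rules, here is a minimal sketch (a hypothetical helper of my own, not part of any nomenclature code) that capitalizes the genus, lowercases the species epithet, and marks the binomial for italics:

    def format_binomial(genus, species):
        # Genus capitalized, species epithet lowercase, whole name italicized
        # (asterisks stand in for italics in plain text).
        return "*{} {}*".format(genus.capitalize(), species.lower())

    print(format_binomial("homo", "SAPIENS"))  # *Homo sapiens*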

The dominant classification system is called Linnaean taxonomy, which includes ranks and binomial nomenclature. How organisms are named is governed by international agreements such as the International Code of Botanical Nomenclature (ICBN), the International Code of Zoological Nomenclature (ICZN), and the International Code of Nomenclature of Bacteria (ICNB).

A merging draft, BioCode, was published in 1997 in an attempt to standardize nomenclature in these three areas, but it has yet to be formally adopted. The BioCode draft has received little attention since 1997; its originally planned implementation date of January 1, 2000, has passed unnoticed. However, a 2004 paper concerning the cyanobacteria does advocate future adoption of a BioCode, with interim steps consisting of reducing the differences between the codes. The International Code of Virus Classification and Nomenclature (ICVCN) remains outside the BioCode.

Ecology

Ecology studies the distribution and abundance of living organisms, and the interactions between organisms and their environment. The habitat of an organism can be described as the local abiotic factors such as climate and geology, in addition to the other organisms and biotic factors that share its environment. One reason that biological systems can be difficult to study is that so many different interactions with other organisms and the environment are possible, even on the smallest of scales. A microscopic bacterium responding to a local sugar gradient is responding to its environment as much as a lion is responding to its environment when it searches for food in the African savanna. For any given species, behaviors can be co-operative, aggressive, parasitic, or symbiotic. Matters become more complex when two or more different species interact in an ecosystem. Studies of this type are within the province of ecology.

Ecological systems are studied at several different levels, from individuals and populations to ecosystems and the biosphere. The term population biology is often used interchangeably with population ecology, although population biology is more frequently used when studying diseases, viruses, and microbes, while population ecology is more commonly used when studying plants and animals. As can be surmised, ecology is a science that draws on several disciplines.

Ethology studies animal behavior (particularly that of social animals such as primates and canids), and is sometimes considered a branch of zoology. Ethologists have been particularly concerned with the evolution of behavior and the understanding of behavior in terms of the theory of natural selection. In one sense, the first modern ethologist was Charles Darwin, whose book, The Expression of the Emotions in Man and Animals, influenced many ethologists.

Biogeography studies the spatial distribution of organisms on the Earth, focusing on topics like plate tectonics, climate change, dispersal and migration, and cladistics.

Monday, March 1, 2010

Introduction

Most bioinformatics coursework focuses on algorithms, with perhaps some components devoted to learning programming skills and learning how to use existing bioinformatics software. Unfortunately, for students who are preparing for a research career, this type of curriculum fails to address many of the day-to-day organizational challenges associated with performing computational experiments. In practice, the principles behind organizing and documenting computational experiments are often learned on the fly, and this learning is strongly influenced by personal predilections as well as by chance interactions with collaborators or colleagues.

The purpose of this article is to describe one good strategy for carrying out computational experiments. I will not describe profound issues such as how to formulate hypotheses, design experiments, or draw conclusions. Rather, I will focus on relatively mundane issues such as organizing files and directories and documenting progress. These issues are important because poor organizational choices can lead to significantly slower research progress. I do not claim that the strategies I outline here are optimal. These are simply the principles and practices that I have developed over 12 years of bioinformatics research, augmented with various suggestions from other researchers with whom I have discussed these issues.

Principles

The core guiding principle is simple: Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why. This “someone” could be any of a variety of people: someone who read your published article and wants to try to reproduce your work, a collaborator who wants to understand the details of your experiments, a future student working in your lab who wants to extend your work after you have moved on to a new job, your research advisor, who may be interested in understanding your work or who may be evaluating your research skills. Most commonly, however, that “someone” is you. A few months from now, you may not remember what you were up to when you created a particular set of files, or you may not remember what conclusions you drew. You will either have to then spend time reconstructing your previous experiments or lose whatever insights you gained from those experiments.

This leads to the second principle, which is actually more like a version of Murphy's Law: Everything you do, you will probably have to do over again. Inevitably, you will discover some flaw in your initial preparation of the data being analyzed, or you will get access to new data, or you will decide that your parameterization of a particular model was not broad enough. This means that the experiment you did last week, or even the set of experiments you've been working on over the past month, will probably need to be redone. If you have organized and documented your work clearly, then repeating the experiment with the new data or the new parameterization will be much, much easier.

To see how these two principles are applied in practice, let's begin by considering the organization of directories and files with respect to a particular project.

File and Directory Organization

When you begin a new project, you will need to decide upon some organizational structure for the relevant directories. It is generally a good idea to store all of the files relevant to one project under a common root directory. The exception to this rule is source code or scripts that are used in multiple projects. Each such program might have a project directory of its own.

Within a given project, I use a top-level organization that is logical, with chronological organization at the next level, and logical organization below that. A sample project, called msms, illustrates this layout. At the root of most of my projects, I have a data directory for storing fixed data sets, a results directory for tracking computational experiments performed on that data, a doc directory with one subdirectory per manuscript, and directories such as src for source code and bin for compiled binaries or scripts.
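
As a minimal sketch of this layout (my own, assuming the directory names described above), the following creates the top-level skeleton for a new project:

    from pathlib import Path

    def create_project(root):
        """Create the standard top-level directories for a new project."""
        for subdir in ["data", "results", "doc", "src", "bin"]:
            Path(root, subdir).mkdir(parents=True, exist_ok=True)

    create_project("msms")  # msms/data, msms/results, msms/doc, msms/src, msms/bin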

Within the data and results directories, it is often tempting to apply a similar, logical organization. For example, you may have two or three data sets against which you plan to benchmark your algorithms, so you could create one directory for each of them under data. In my experience, this approach is risky, because the logical structure of your final set of experiments may look drastically different from the form you initially designed. This is particularly true under the results directory, where you may not even know in advance what kinds of experiments you will need to perform. If you try to give your directories logical names, you may end up with a very long list of directories with names that, six months from now, you no longer know how to interpret.

Instead, I have found that organizing my data and results directories chronologically makes the most sense. Indeed, with this approach, the distinction between data and results may not be useful. Instead, one could imagine a top-level directory called something like experiments, with subdirectories with names like 2008-12-19. Optionally, the directory name might also include a word or two indicating the topic of the experiment therein. In practice, a single experiment will often require more than one day of work, and so you may end up working a few days or more before creating a new subdirectory. Later, when you or someone else wants to know what you did, the chronological structure of your work will be self-evident.
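
A minimal sketch (my own) of the chronological naming scheme: each new experiment directory is named by date, with an optional topic suffix.

    from datetime import date
    from pathlib import Path

    def new_experiment_dir(root="results", topic=""):
        name = date.today().isoformat() + ("-" + topic if topic else "")
        path = Path(root, name)                 # e.g., results/2008-12-19
        path.mkdir(parents=True, exist_ok=True)
        return path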

Below a single experiment directory, the organization of files and directories is logical, and depends upon the structure of your experiment. In many simple experiments, you can keep all of your files in the current directory. If you start creating lots of files, then you should introduce some directory structure to store files of different types. This directory structure will typically be generated automatically from a driver script, as discussed below.

The Lab Notebook

In parallel with this chronological directory structure, I find it useful to maintain a chronologically organized lab notebook. This is a document that resides in the root of the results directory and that records your progress in detail. Entries in the notebook should be dated, and they should be relatively verbose, with links or embedded images or tables displaying the results of the experiments that you performed. In addition to describing precisely what you did, the notebook should record your observations, conclusions, and ideas for future work. Particularly when an experiment turns out badly, it is tempting simply to link the final plot or table of results and start a new experiment. Before doing that, it is important to document how you know the experiment failed, since the interpretation of your results may not be obvious to someone else reading your lab notebook.
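
For those keeping a "home-brew" notebook as plain files, a minimal sketch (my own) of the dated-entry idea might look like the following, appending each entry under a dated heading in a single document in the results directory:

    from datetime import date

    def add_entry(text, notebook="results/notebook.txt"):
        with open(notebook, "a") as handle:
            handle.write("\n=== {} ===\n{}\n".format(date.today().isoformat(), text))

    add_entry("Reran the benchmark with the corrected input; see the linked plot.")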

In addition to the primary text describing your experiments, it is often valuable to transcribe notes from conversations as well as e-mail text into the lab notebook. These types of entries provide a complete picture of the development of the project over time.

In practice, I ask members of my research group to put their lab notebooks online, behind password protection if necessary. When I meet with a member of my lab or a project team, we can refer to the online lab notebook, focusing on the current entry but scrolling up to previous entries as necessary. The URL can also be provided to remote collaborators to give them status updates on the project.

Note that if you would rather not create your own “home-brew” electronic notebook, several alternatives are available. For example, a variety of commercial software systems have been created to help scientists create and maintain electronic lab notebooks. Furthermore, especially in the context of collaborations, storing the lab notebook on a wiki-based system or on a blog site may be appealing.

Carrying Out a Single Experiment

You have now created your directory structure, and you have created a directory for the current date, with the intention of carrying out a particular experiment in that directory. How do you proceed?

The general principle is that you should record every operation that you perform, and make those operations as transparent and reproducible as possible. In practice, this means that I create either a README file, in which I store every command line that I used while performing the experiment, or a driver script (I usually call this runall) that carries out the entire experiment automatically. The choices that you make at this point will depend strongly upon what development environment you prefer. If you are working in a language such as Matlab or R, you may be able to store everything as a script in that language. If you are using compiled code, then you will need to store the command lines separately. Personally, I work in a combination of shell scripts, Python, and C. The appropriate mix of these three languages depends upon the complexity of the experiment. Whatever you decide, you should end up with a file that is parallel to the lab notebook entry. The lab notebook contains a prose description of the experiment, whereas the driver script contains all the gory details.
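
As one possible shape for such a driver (a sketch of my own, written in Python rather than shell, with placeholder command lines rather than any real pipeline), every command is written to a log before it is run, so the record of operations and the execution itself stay in sync:

    import subprocess

    COMMANDS = [
        "bin/preprocess.py data/raw.txt results/clean.txt",      # hypothetical step
        "bin/run-model.py results/clean.txt results/model.out",  # hypothetical step
    ]

    with open("runall.log", "a") as log:
        for command in COMMANDS:
            log.write(command + "\n")                    # record every operation
            subprocess.run(command.split(), check=True)  # stop if a step fails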

Here are some rules of thumb that I try to follow when developing the driver script:

  1. Record every operation that you perform.

  2. Comment generously. The driver script typically involves little in the way of complicated logic, but often invokes various scripts that you have written, as well as a possibly eclectic collection of Unix utilities. Hence, for this type of script, a reasonable rule of thumb is that someone should be able to understand what you are doing solely from reading the comments. Note that I am refraining from advocating a particular mode of commenting for compiled code or more complex scripts—there are many schools of thought on the correct way to write such comments.

  3. Avoid editing intermediate files by hand. Doing so means that your script will only be semi-automatic, because the next time you run the experiment, you will have to redo the editing operation. Many simple editing operations can be performed using standard Unix utilities such as sed, awk, grep, head, tail, sort, cut, and paste.

  4. Store all file and directory names in this script. If the driver script calls other scripts or functions, then files and directory names should be passed from the driver script to these auxiliary scripts. Forcing all of the file and directory names to reside in one place makes it much easier to keep track of and modify the organization of your output files.

  5. Use relative pathnames to access other files within the same project. If you use absolute pathnames, then your script will not work for people who check out a copy of your project in their local directories (see “The Value of Version Control” below).

  6. Make the script restartable. I find it useful to guard each long-running step of the experiment with a check of the form “if the output file does not already exist, then run the step that creates it.” If I want to rerun selected parts of the experiment, then I can delete the corresponding output files (see the sketch after this list).
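
The restartable pattern in item 6 can be as simple as the following sketch (my own, with hypothetical file names): a long-running step is skipped whenever its output file already exists, so deleting that file is all it takes to rerun just that step.

    import os
    import subprocess

    def run_step(command, output_file):
        if not os.path.exists(output_file):   # only redo steps whose output is missing
            subprocess.run(command, check=True)

    run_step(["bin/align.py", "data/reads.fa", "results/alignments.txt"],
             "results/alignments.txt")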

For experiments that take a long time to run, I find it useful to be able to obtain a summary of the experiment's progress thus far. In these cases, I create two driver scripts, one to run the experiment (runall) and one to summarize the results (summarize). The final line of runall calls summarize, which in turn creates a plot, table, or HTML page that summarizes the results of the experiment. The summarize script is written in such a way that it can interpret a partially completed experiment, showing how much of the computation has been performed thus far.
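
A summarize script of this kind can be very small; the sketch below (my own, with hypothetical output names) simply reports how many of the expected result files exist so far, which is enough to gauge the progress of a partially completed experiment.

    from pathlib import Path

    EXPECTED = ["results/fold-{}.out".format(i) for i in range(10)]

    done = [name for name in EXPECTED if Path(name).exists()]
    print("{}/{} steps completed".format(len(done), len(EXPECTED)))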

Handling and Preventing Errors

During the development of a complicated set of experiments, you will introduce errors into your code. Such errors are inevitable, but they are particularly problematic if they are difficult to track down or, worse, if you don't know about them and hence draw invalid conclusions from your experiment. Here are three suggestions for error handling.

First, write robust code to detect errors. Even in a simple script, you should check for bogus parameters, invalid input, etc. Whenever possible, use robust library functions to read standard file formats rather than writing ad hoc parsers.

Second, when an error does occur, abort. I typically have my program print a message to standard error and then exit with a non-zero exit status. Such behavior might seem like it makes your program brittle; however, if you try to skip over the problematic case and continue on to the next step in the experiment, you run the risk that you will never notice the error. A corollary of this rule is that your code should always check the return codes of commands executed and functions called, and abort when a failure is observed.

Third, whenever possible, create each output file using a temporary name, and then rename the file after it is complete. This allows you to easily make your scripts restartable and, more importantly, prevents partial results from being mistaken for full results.
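
These three suggestions can be combined in a few lines; the sketch below (my own, with hypothetical commands) checks the return code of each step, aborts with a non-zero exit status on failure, and writes output to a temporary name that is renamed only once the file is complete.

    import os
    import subprocess
    import sys

    def run_and_rename(command, output_file):
        temp_file = output_file + ".tmp"
        result = subprocess.run(command + [temp_file])
        if result.returncode != 0:           # always check return codes
            sys.stderr.write("error: {} failed\n".format(" ".join(command)))
            sys.exit(1)                      # abort rather than continue silently
        os.rename(temp_file, output_file)    # partial output never looks complete

    run_and_rename(["bin/score.py", "results/clean.txt"], "results/scores.txt")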

Command Lines versus Scripts versus Programs

The design question that you will face most often as you formulate and execute a series of computational experiments is how much effort to put into software engineering. Depending upon your temperament, you may be tempted to execute a quick series of commands in order to test your hypothesis immediately, or you may be tempted to over-engineer your programs to carry out your experiment in a pleasingly automatic fashion. In practice, I find that a happy medium between these two often involves iterative improvement of scripts. An initial script is designed with minimal functionality and without the ability to restart in the middle of partially completed experiments. As the functionality of the script expands and the script is used more often, it may need to be broken into several scripts, or it may get “upgraded” from a simple shell script to Python, or, if memory or computational demands are too high, from Python to C or a mix thereof.

In practice, therefore, the scripts that I write tend to fall into these four categories:

  1. Driver script. This is a top-level script; hence, each directory contains only one or two scripts of this type.

  2. Single-use script. This is a simple script designed for a single use. For example, the script might convert an arbitrarily formatted file associated with this project into a format used by some of your existing scripts. This type of script resides in the same directory as the driver script that calls it.

  3. Project-specific script. This type of script provides a generic functionality used by multiple experiments within the given project. I typically store such scripts in a directory immediately below the project root directory (e.g., the msms/bin/parse-sqt.py file).

  4. Multi-project script. Some functionality is generic enough to be useful across many projects. I maintain a set of these generic scripts, which perform functions such as extracting specified sequences from a FASTA file, generating an ROC curve, splitting a file for n-fold cross-validation, etc.

Regardless of how general a script is supposed to be, it should have a clearly documented interface. In particular, every script or program, no matter how simple, should be able to produce a fairly detailed usage statement that makes it clear what the inputs and outputs are and what options are available.
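
In Python, even a throwaway script gets a serviceable usage statement almost for free with argparse; the sketch below (my own, describing a hypothetical sequence-extraction script) documents its inputs and output and prints usage when run with -h or with missing arguments.

    import argparse

    parser = argparse.ArgumentParser(
        description="Extract the named sequences from a FASTA file.")
    parser.add_argument("fasta_file", help="input FASTA file")
    parser.add_argument("id_file", help="file listing one sequence ID per line")
    parser.add_argument("--output", default="-",
                        help="where to write the selected sequences (default: stdout)")
    args = parser.parse_args()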

The Value of Version Control

Version control software was originally developed to maintain and coordinate the development of complex software engineering projects. Modern version control systems such as Subversion are based on a central repository that stores all versions of a given collection of related files. Multiple individuals can “check out” a working copy of these files into their local directories, make changes, and then check the changes back into the central repository.

I find version control software to be invaluable for managing computational experiments, for three reasons. First, the software provides a form of backup. Although our university computer systems are automatically backed up on a nightly basis, my laptop's backup schedule is more erratic. Furthermore, after mistakenly overwriting a file, it is often easier to retrieve yesterday's version from Subversion than to send an e-mail to the system administrator. Indeed, one of my graduate students told me he would breathe a sigh of relief after typing svn commit, because that command stores a snapshot of his working directory in the central repository.

Second, version control provides a historical record that can be useful for tracking down bugs or understanding old results. Typically, a script or program will evolve throughout the course of a project. Rather than storing many copies of the script with slightly different names, I rely upon the version control system to keep track of those versions. If I need to reproduce exactly an experiment that I performed three months ago, I can use the version control software to check out a copy of the state of my project at that time. Note that most version control software can also assign a logical “tag” to a particular state of the repository, allowing you to easily retrieve that state later.

Third, and perhaps most significantly, version control is invaluable for collaborative projects. The repository allows collaborators to work simultaneously on a collection of files, including scripts, documentation, or a draft manuscript. If two individuals edit the same file in parallel, then the version control software will automatically merge the two versions and flag lines that were edited by both people. It is not uncommon, in the hours before a looming deadline, for me to talk by phone with a remote collaborator while we both edit the same document, checking in changes every few minutes.

Although the basic idea of version control software seems straightforward, using a system such as Subversion effectively requires some discipline. First, version control software is most useful when it is used regularly. A good rule of thumb is that changes should be checked in at least once a day. This ensures that your historical record is complete and that a recent backup is always available if you mistakenly overwrite a file. If you are in the midst of editing code, and you have caused a once-compilable program to no longer work, it is possible to check in your changes on a “branch” of the project, effectively stating that this is a work in progress. Once the new functionality is implemented, then the branch can be merged back into the “trunk” of the project. Only then will your changes be propagated to other members of the project team.

Second, version control should only be used for files that you edit by hand. Automatically generated files, whether they are compiled programs or the results of a computational experiment, do not belong under version control. These files tend to be large, so checking them into the project wastes disk space, both because they will be duplicated in the repository and in every working copy of the project, and also because these files will tend to change as you redo your experiment multiple times. Binary files are particularly wasteful: Because version control software operates on a line-by-line basis, the version history of a binary file is simply a complete copy of all versions of that file. There are exceptions to this rule, such as relatively small data files that will not change through the experiment, but these exceptions are rare.

One practical difficulty with not checking in automatically generated files is that each time you issue an update command, the version control software is likely to complain about all of these files in your working directory that have not been checked in. To avoid scrolling through multiple screens of filenames at each update, Subversion and CVS provide functionality to tell the system to ignore certain files or types of files.

Conclusion

Many of the ideas outlined above have been described previously either in the context of computational biology or in general scientific computation. In particular, much has been written about the need to adopt sound software engineering principles and practices in the context of scientific software development. For example, Baxter et al. propose a set of five “best practices” for scientific software projects, and Wilson describes a variety of standard software engineering tools that can be used to make a computational scientist's life easier.

Although many practical issues described above apply generally to any type of scientific computational research, working with biologists and biological data does present some of its own issues. For example, many biological data sets are stored in central data repositories. Basic record keeping—recording in the lab notebook the URL as well as the version number and download date for a given data set—may be sufficient to track simpler data sets. But for very large or dynamic data, it may be necessary to use a more sophisticated approach. For example, Boyle et al. discuss how best to manage complex data repositories in the context of a scientific research program.

In addition, the need to make results accessible to and understandable by wet lab biologists may have practical implications for how a project is managed. For example, to make the results more understandable, significant effort may need to go into the prose descriptions of experiments in the lab notebook, rather than simply including a figure or table with a few lines of text summarizing the major conclusion. More practically, differences in operating systems and software may cause logistical difficulties. For example, computer scientists may prefer to write their documents in the LaTeX typesetting language, whereas biologists may prefer Microsoft Word.

As I mentioned in the Introduction, I intend this article to be more descriptive than prescriptive. Although I hope that some of the practices I describe above will prove useful for many readers, the most important take-home message is that the logistics of efficiently performing accurate, reproducible computational experiments is a subject worthy of consideration and discussion. Many relevant topics have not been covered here, including good coding practices, methods for automation of experiments, the logistics of writing a manuscript based on your experimental results, etc. I therefore encourage interested readers to post comments and suggestions.

Acknowledgments

I am grateful for helpful input from Zafer Aydin, Mark Diekhans, and Michael Hoffman.

Human Body Project Ideas

Human body projects and studies allow us to get a better understanding of the human body. Not only do we gain an improved knowledge of anatomical functions, but we gain a greater understanding of human behavior as well. The following human body project ideas provide suggestions for topics that can be explored through experimentation.

Behavioral Project Ideas:


  • Does your sense of smell alter your sense of taste?
  • Which sense (taste, smell, touch) is best for identifying foods?
  • Does music affect blood pressure?
  • How does caffeine affect the body?
  • Does exercise affect memory retention?
  • Does music affect hand-eye coordination?
  • Does the weather affect a person's mood?
  • Does playing video games affect a person's heart rate?
  • Do colors evoke human emotions?
  • Does gender affect reaction time?
  • Is yawning contagious?
  • Does smiling affect a person's mood?
  • Does human behavior change during a full moon?
  • Does room temperature affect concentration?
  • Does sight affect the ability to determine sound direction?

Biological Project Ideas:

  • Does a person's BMI affect blood pressure?
  • Do all people have the same normal body temperature?
  • Which types of exercise increase muscle growth the most?
  • How do various types of acid (phosphoric acid, citric acid, etc.) affect tooth enamel?
  • Does a person's heart rate and blood pressure vary during the day?
  • Does exercise affect lung capacity?
  • Does blood vessel elasticity affect blood pressure?
  • Is calcium necessary for bone strength?
  • Does light intensity affect peripheral vision?
  • Does eye color affect a person's ability to distinguish colors?
  • Do different stressors (heat, cold, etc.) affect nerve sensitivity?
  • Do food smells affect saliva production?

Animal Projects:

Animal projects and studies are important to understanding various biological processes in animals and even humans. Scientists study animals in order to learn ways to improve animal health for farm production, wildlife preservation, and human companionship. They also study animals to discover new methods to improve human health. Animal studies give us a better understanding of disease development and prevention, as well as standards for normal and abnormal behavior. The following animal project ideas introduce areas of animal studies that can be explored through experimentation.

Amphibian and Fish Project Ideas:

  • Does temperature affect tadpole growth?
  • Do water pH levels affect tadpole growth?
  • Does water temperature affect amphibian respiration?
  • Does magnetism affect limb regeneration in newts?
  • Does water temperature affect fish color?
  • Does the size of a population of fish affect growth?
  • Does music affect fish activity?
  • Does the amount of light affect fish activity?

Bird Project Ideas:

  • Which types of plants attract hummingbirds?
  • What factors increase egg-laying in birds?
  • Do different bird species prefer a particular color of bird seed?
  • Do certain bird species prefer to eat in a group or alone?
  • Do certain bird species prefer one type of habitat over another?

Insect Project Ideas:

  • How does temperature affect the growth of butterflies?
  • How does light affect ants?
  • Do different colors attract or repel insects?
  • How does pollution affect insects?
  • How do insects adapt to pesticides?
  • Do magnetic fields affect insects?
  • Does soil acidity affect insects?
  • Does color affect insect eating habits?
  • Does light or heat attract insects to lamps at night?
  • Do insects behave differently in a larger population as opposed to a smaller population?
  • What factors cause crickets to chirp more often?
  • What substances do mosquitoes find attractive or repellant?

Mammal Project Ideas:

  • Does light variation alter animal sleep habits?
  • Do cats or dogs have better night vision?
  • Does music affect an animal's mood?
  • Do bird sounds affect cat behavior?
  • Which animal sense has the greatest effect on short term memory?
  • Does animal saliva have antimicrobial properties?
  • Does colored water affect animal drinking habits?

Plant Project Ideas

Plants are tremendously important to life on Earth. They are the foundation of food chains in almost every ecosystem. Plants also play a major role in the environment by influencing climate and producing life-giving oxygen. Plant project studies allow us to learn about plant biology and the potential uses of plants in other fields such as medicine, agriculture, and biotechnology. The following plant project ideas provide suggestions for topics that can be explored through experimentation.

  • Do magnetic fields affect plant growth?
  • Do different colors of light affect the direction of plant growth?
  • Do sounds (music, noise, etc.) affect plant growth?
  • Do different colors of light affect the rate of photosynthesis?
  • What are the effects of acid rain on plant growth?
  • Do household detergents affect plant growth?
  • Can plants conduct electricity?
  • Does cigarette smoke affect plant growth?
  • Does soil temperature affect root growth?
  • Does caffeine affect plant growth?
  • Does water salinity affect plant growth?
  • Does artificial gravity affect seed germination?
  • Does freezing affect seed germination?
  • Does burned soil affect seed germination?
  • Does seed size affect plant height?
  • Does fruit size affect the amount of seeds in the fruit?
  • Do vitamins or fertilizers promote plant growth better?
  • Do fertilizers extend plant life during drought?
  • Does leaf size affect plant transpiration rates?
  • Can plant spices inhibit bacterial growth?
  • Do different types of artificial light affect plant growth?