Lessons introduce and reinforce basic skills in the Unix shell and R, and are designed for learners with no programming experience. The general topics and skills covered include:
Microbes are ideal organisms for exploring 'Long-term Evolution Experiments' (LTEEs) - thousands of generations can be generated and stored in a way that would be virtually impossible for more complex eukaryotic systems. In Lenski et.al., 12 populations of Escherichia coli were propagated for more than 40,000 generations in a glucose-limited minimal medium. This medium was supplemented with citrate which E. coli cannot metabolize in the aerobic conditions of the experiment. Sequencing of the populations at regular time points reveals that spontaneous citrate-using mutants (Cit+) appeared in a population of E.coli (designated Ara-3) at around 31,000 generations. It should be noted that spontaneous Cit+ mutants are extraordinarily rare - inability to metabolize citrate is one of the defining characters of the E. coli species. Eventually, Cit+ mutants became the dominate population as the experimental growth medium contained a high concentration of citrate relative to glucose.
Data Carpentry workshops follow a narrative to illustrate key steps in the data analysis from the perspective of real experiment (and experimenter). We choose the Lenski data for several reasons, including:
Purpose of these lessons
These lessons teach fundamental data management and analysis skills needed to work with genomic data. Example workflows are used to illustrate the researchers skills needed to handle files, use bioinformatics tools, and analyze their output.
Teaching layout
Lessons are a series of modules designed for use at a two-day Data Carpentry workshop. They can also be covered by learners independently on their own - provided they follow the Doing this on your own setup requirements. Most lessons include work at the command line in the Unix shell (particularly the Bash shell). We will also cover visualization of datasets in R and with some other common genome visualization tools.
Intended audience
We created for learners who are just starting to analyze a genomic dataset - which we define as any project that takes high-throughput (next-generation) sequence data to any number of endpoints (genome/transcriptome assembly, variant detection, RNA/ChIP-Seq, etc.).
Content Contributors: Jon Badalamenti, Amanda Charbonneau, Shari Ellis, Chris Fields, Bob Freeman, Chris Hamm, Randall Hayes, Josh Herr, Kate Hertweck, Adina Howe, Hilmar Lapp, Michael Linderman, Andréa Matsunaga, Blaine Marchant, Sue McClatchy, Sheldon McKay, Susan Miller, Jeramia Ory, Deborah Paul, Mary Shelley, Mike Smorul, Sara Stevens, Tracy Teal, Juan Ugalde, Jason Williams, Laura Williams, Ryan Williams, Greg Wilson
Blount, Z.D., Barrick, J.E., Davidson, C.J., Lenski, R.E. Genomic analysis of a key innovation in an experimental Escherichia coli population (2012) Nature, 489 (7417), pp. 513-518.