Statistical Methods in Bioinformatics, 20.3.-8.5.2009, 5 credits

 

NB! No lectures on the following days: 18.3., 10.4., 1.5.

 

Lecturer: Professor Jukka Corander, Department of Mathematics, Åbo Akademi University

 

Lectures are held Wed 15-17 and Fri 13-15 in auditorium Lindman, Gadolinia building.

 

Examination: There will be no written examination, but the participants are required to do a set of home assignments (see below for details).

 

Bioinformatics is one of the scientific fields that have witnessed extremely rapid development during the past two decades. It generally refers to the computational and mathematical methods used for data storage, retrieval and analysis, as well as general modeling, within all branches of molecular biology. For a short description, see this Wikipedia page. Due to the enormous range of situations encountered in bioinformatics, any attempt to gain deeper insight must focus on certain subtopics. In this course the main focus is on probability-model-based statistical approaches to solving important problems in molecular biology, although some more heuristic methods will also be discussed. The general topics to be discussed are:

 

1. Phylogenetic models and their mathematical properties

2. Modeling the structure of DNA using Markovian models

3. Alignment of sequences

4. Microarray data analysis

 

Given the number of issues that need to be covered within the above topics, the time may turn out to be insufficient for the microarray data analysis part. The course is open to anyone interested in these topics; however, the mathematical content of most of the material is such that participants should preferably have previous experience of probability calculus, stochastic processes (in particular Markov chains) and at least introductory statistics. Some of the material uses advanced statistical methods, such as Bayesian inference.

 

 

Course material:

 

1. Mathematical properties of phylogenetic models, free e-book, available here (only certain parts will be considered). Here you can find the basic details of Markov chains.

2. Bayesian inference about phylogenetic models, paper 1, paper 2, site for the phylogenetic software MrBayes (see e.g. the manual).

3. Comparison of Bayesian and bootstrapped maximum likelihood inferences about phylogenetic trees, link to a paper by Erixon et al. (2003).

4. Probabilistic models for DNA structures, paper 1, paper 2, paper 3. Here you can find the basic details of Markov chains.

5. Alignment of sequences, paper 1, paper 2, BLAST

6. Microarrays. Course material from Fall term 2007. Links to relevant overall background information concerning molecular biology are provided.

 

A soft introduction to coalescent theory can be found here.

 

Some additional links kindly provided by José Gama:

 

What is Bioinformatics? A Proposed Definition and Overview of the Field
http://bioinfo.mbb.yale.edu/e-print/whatis-mim/gerstein_manuscript.pdf

What is T, G, A and C?
The Structures of Life
http://publications.nigms.nih.gov/structlife/structlife.pdf

BioDataBases
http://bio.fsu.edu/~stevet/BSC5936/BioDataBases.ppt
Multiple Alignment
http://bio.fsu.edu/~stevet/BSC5936/MultipleAlignment.ppt
More...
http://bio.fsu.edu/~stevet/BSC5936/

Molecular Biology for Computer Scientists
http://www.biostat.wisc.edu/~craven/hunter.pdf

 

Examination: There will be no written examination, but to gain the 5 credits, the participants are required to complete the following set of assignments and to deliver a detailed written report on the findings. If they wish, participants may do the assignments in teams of 2 or 3 persons and submit a joint report.

Exercise 1: Use Seq-Gen (or some other comparable software, if you prefer) to simulate 5 DNA sequences of length 50 bases under both the Jukes-Cantor and Kimura 2-parameter models and any phylogenetic tree of your choice. Fit both models to both data sets using the MEGA4 and MrBayes software. Check how the estimated phylogeny corresponds to the generating model. Compare the bootstrap support values and the posterior probabilities of the internal nodes of the optimal tree. Also fit a neighbor-joining tree to the data using MEGA4 and compare the results with the earlier ones.
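
If you prefer to script the simulation step yourself instead of using Seq-Gen, the following Python sketch shows one possible way to evolve sequences along a fixed tree under the Jukes-Cantor model. The tree shape, branch lengths and function names are illustrative choices and not part of the assignment.

import math
import random

BASES = "ACGT"

def jc_evolve(seq, t):
    """Evolve a sequence along a branch of length t under the Jukes-Cantor model."""
    p_change = 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))   # total probability of observing a different base
    out = []
    for b in seq:
        if random.random() < p_change:
            out.append(random.choice([x for x in BASES if x != b]))
        else:
            out.append(b)
    return "".join(out)

def simulate(tree, seq):
    """Recursively simulate leaf sequences for a nested-tuple tree.

    A node is either a leaf label (str) or a tuple
    (left_subtree, left_branch_length, right_subtree, right_branch_length)."""
    if isinstance(tree, str):
        return {tree: seq}
    left, t_left, right, t_right = tree
    leaves = {}
    leaves.update(simulate(left, jc_evolve(seq, t_left)))
    leaves.update(simulate(right, jc_evolve(seq, t_right)))
    return leaves

if __name__ == "__main__":
    root = "".join(random.choice(BASES) for _ in range(50))
    # Corresponds to ((A:0.1,B:0.1):0.05,(C:0.1,(D:0.05,E:0.05):0.05):0.05) in Newick notation
    tree = (("A", 0.1, "B", 0.1), 0.05,
            ("C", 0.1, ("D", 0.05, "E", 0.05), 0.05), 0.05)
    for name, s in sorted(simulate(tree, root).items()):
        print(">" + name)
        print(s)

The Kimura 2-parameter case differs only in the per-branch substitution probabilities, which distinguish transitions from transversions.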

Exercise 2: Take the two sets of 5 sequences generated in the first exercise and remove a random number of bases from each end of each sequence. Let the number of bases removed from each end be uniformly distributed on [5,15]. If you don't wish to do this programmatically, simply chop out bits of the sequences manually, such that the bit lengths are in the interval [5,15]. Align both resulting sets of 5 sequences separately using the Clustal program, available online e.g. here. Check how well the alignment matches the original alignment in your data.
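
If you do the trimming programmatically, a sketch along the following lines would implement the uniform removal of 5 to 15 bases from both ends before the sequences are passed to Clustal; the function name and the example sequence are illustrative only.

import random

def trim_ends(seq, low=5, high=15):
    """Remove a uniformly drawn number of bases (between low and high, inclusive) from each end."""
    left = random.randint(low, high)
    right = random.randint(low, high)
    return seq[left:len(seq) - right]

if __name__ == "__main__":
    example = "".join(random.choice("ACGT") for _ in range(50))
    print(example)
    print(trim_ends(example))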

Exercise 3: Generate 5 random sequences of the letters ACGT according to a simple Markov model of order 3 (discussed in point 4 of the course material above), such that each sequence has length 200. Define a word of length 10 letters in the alphabet ACGT; for instance, the word could be AACGTTAACC. Let this word be a generating configuration and simulate 5 realizations from it using the following approach. Independently for all 10 letters in the generating configuration, change a letter randomly to some other letter with probability 0.1. Thus, you get 5 words which are slightly altered variants of the generating configuration. For each of the 5 sequences of length 200 you generated earlier, insert one of the five word realizations into a randomly chosen position within the sequence (each word should only be used once!). Use the Weeder program (available online here) to analyze jointly the 5 new sequences, each of length 210. NB! Generate the sequences as lower-case letters acgt for easy access to Weeder using the FASTA format! The purpose is to check how well the partially conserved words of length 10 letters are discovered among the uninteresting parts of the sequences by the Weeder method. Repeat the task once more using probability 0.3 instead for changing a letter randomly in the previously defined word. Compare the results with the earlier ones.
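
One possible way to script the data generation is sketched below in Python. The randomly drawn order-3 transition table, the sequence names and the example word aacgttaacc are illustrative assumptions; the output is printed as lower-case FASTA, as required for Weeder.

import random
from itertools import product

BASES = "acgt"

def random_transition_table(order=3):
    """Draw a transition probability vector over acgt for every 3-letter context."""
    table = {}
    for ctx in product(BASES, repeat=order):
        w = [random.random() for _ in BASES]
        total = sum(w)
        table[ctx] = [x / total for x in w]
    return table

def markov_sequence(length, table, order=3):
    """Generate a sequence of the given length from an order-3 Markov chain."""
    seq = [random.choice(BASES) for _ in range(order)]
    while len(seq) < length:
        probs = table[tuple(seq[-order:])]
        seq.append(random.choices(BASES, weights=probs)[0])
    return "".join(seq[:length])

def mutate_word(word, p):
    """Change each letter independently to a different letter with probability p."""
    return "".join(random.choice([x for x in BASES if x != b]) if random.random() < p else b
                   for b in word)

if __name__ == "__main__":
    word = "aacgttaacc"                       # the generating configuration (an illustrative choice)
    table = random_transition_table()
    for i in range(5):
        background = markov_sequence(200, table)
        realization = mutate_word(word, 0.1)  # repeat with p = 0.3 for the second run
        pos = random.randint(0, len(background))
        planted = background[:pos] + realization + background[pos:]
        print(">seq%d" % (i + 1))             # lower-case FASTA for Weeder
        print(planted)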

Lecture diary:

Weeks 12-16: The following sections of the book by TK have been discussed: Ch. 1, Ch. 2 pp. 29-35, 38 (from 2.1.6) - 52 (not the proof of 2.3.2), (see also p. 55), 61-62, 71-75, Ch. 3 pp. 77-79, 82-87, Ch. 4 pp. 91-104, 110-116, Ch. 6 pp. 131-155.

Week 16: Discussion of the Neighbor-Joining tree estimation algorithm and the use of the bootstrap (example 1, example 2). Examples with real data using MEGA.
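
For readers who want to see the Neighbor-Joining update formulas written out, a compact Python sketch of the agglomeration step is given below. The 4-taxon distance matrix is a made-up illustration; real analyses in the course are done with MEGA as in the lectures.

def neighbor_joining(names, D):
    """Return the list of join events produced by Neighbor-Joining.

    names is a list of taxon labels and D a symmetric dict-of-dicts of
    pairwise distances; D is extended in place as new internal nodes appear.
    Each event is (node_a, node_b, branch_a, branch_b, new_node_label)."""
    nodes = list(names)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        r = {i: sum(D[i][j] for j in nodes if j != i) for i in nodes}
        # Q-criterion: join the pair minimising (n - 2) d(i, j) - r(i) - r(j)
        i, j = min(((a, b) for a in nodes for b in nodes if a < b),
                   key=lambda p: (n - 2) * D[p[0]][p[1]] - r[p[0]] - r[p[1]])
        bi = 0.5 * D[i][j] + (r[i] - r[j]) / (2.0 * (n - 2))   # branch length to i
        bj = D[i][j] - bi                                      # branch length to j
        new = "(%s,%s)" % (i, j)
        D[new] = {}
        for k in nodes:
            if k not in (i, j):
                d = 0.5 * (D[i][k] + D[j][k] - D[i][j])        # distance from the new node
                D[new][k] = d
                D[k][new] = d
        joins.append((i, j, bi, bj, new))
        nodes.remove(i)
        nodes.remove(j)
        nodes.append(new)
    a, b = nodes
    joins.append((a, b, D[a][b], 0.0, "(%s,%s)" % (a, b)))     # final connecting edge
    return joins

if __name__ == "__main__":
    names = ["A", "B", "C", "D"]
    M = [[0, 5, 9, 9],
         [5, 0, 10, 10],
         [9, 10, 0, 8],
         [9, 10, 8, 0]]
    D = {a: {b: M[x][y] for y, b in enumerate(names)} for x, a in enumerate(names)}
    for event in neighbor_joining(names, D):
        print(event)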

Week 17: The UPGMA (= average linkage clustering) method, see examples. Examples with real data using MEGA. Monte Carlo based search for tree topologies.
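
Similarly, the UPGMA agglomeration step can be sketched in a few lines of Python. The dict-of-dicts distance matrix format matches the Neighbor-Joining sketch above, and the merge heights are half of the average between-cluster distances, which gives the ultrametric property; the example matrix is again made up.

def upgma(names, D):
    """Return the UPGMA merge events (cluster_a, cluster_b, height).

    D is a symmetric dict-of-dicts of distances; cluster sizes are tracked so
    that the distance to a merged cluster equals the average pairwise
    distance between the members of the two clusters."""
    size = {name: 1 for name in names}
    nodes = list(names)
    merges = []
    while len(nodes) > 1:
        a, b = min(((x, y) for x in nodes for y in nodes if x < y),
                   key=lambda p: D[p[0]][p[1]])
        new = "(%s,%s)" % (a, b)
        merges.append((a, b, D[a][b] / 2.0))       # ultrametric height of the new node
        size[new] = size[a] + size[b]
        D[new] = {}
        for k in nodes:
            if k not in (a, b):
                d = (size[a] * D[a][k] + size[b] * D[b][k]) / size[new]
                D[new][k] = d
                D[k][new] = d
        nodes.remove(a)
        nodes.remove(b)
        nodes.append(new)
    return merges

if __name__ == "__main__":
    names = ["A", "B", "C", "D"]
    M = [[0, 2, 6, 6],
         [2, 0, 6, 6],
         [6, 6, 0, 4],
         [6, 6, 4, 0]]
    D = {a: {b: M[x][y] for y, b in enumerate(names)} for x, a in enumerate(names)}
    for merge in upgma(names, D):
        print(merge)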

Updated by Jukka Corander, March 10th, 2009.