Chapter 4: Introduction to Probability and Statistical Analysis of Sequence Alignments

One of the most important recent advances in sequence analysis is the development of methods to assess the significance of a local alignment between DNA or protein sequences. For sequences that are obviously related (two proteins that are clearly in the same family, or two matching or overlapping DNA fragments), such an analysis is hardly necessary. The question of significance arises when two sequences are not so clearly similar but nevertheless align in a promising way. In such a case, a significance test can help the biologist decide whether an alignment found by the computer program is one that would be expected between related sequences or one that would just as likely be found if the sequences were not related. A significance test is also critical for evaluating the results of a database search, performed with the BLAST and FASTA programs (Chapter 6), for sequences similar to a query sequence. The test is applied to every matched sequence so that the most significant matches can be reported. Finally, a significance test can help to identify regions in a single sequence whose unusual composition suggests an interesting function.

Our goal here is to examine the significance of sequence alignment scores obtained by the dynamic programming method. The underlying theory is now sufficiently developed, and sufficiently supported by experimental data, to provide a reliable evaluation of local sequence alignments. This chapter outlines the major features of statistical testing and probability calculations and shows how to use them to evaluate the significance of a sequence alignment.
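
To make the central question concrete, the following is a minimal sketch of one simple way to ask whether a local alignment score would "just as likely be found if the sequences were not related": compare the observed score with scores obtained after shuffling one of the sequences, which preserves its composition but destroys its order. This is an illustrative randomization test, not necessarily the method developed in this chapter. The function names, the scoring values (match +1, mismatch -1, gap -2), and the number of shuffles are all assumptions chosen for illustration.

```python
import random


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score by dynamic programming (linear gap penalty).

    The scoring values are illustrative assumptions, not recommended settings.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def shuffle_p_value(a, b, trials=200, seed=0):
    """Estimate how often a shuffled copy of b scores at least as well as b.

    Returns (observed_score, estimated_p_value). Shuffling keeps the
    composition of b but removes any order shared with a.
    """
    rng = random.Random(seed)
    observed = local_alignment_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if local_alignment_score(a, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (trials + 1)
```

For example, `shuffle_p_value(seq1, seq2)` returns the observed score together with the estimated fraction of shuffled comparisons that score at least as high. A small fraction suggests that the alignment is unlikely to arise between unrelated sequences of the same composition. The estimate cannot go below 1/(trials + 1), so very small probabilities require many shuffles.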


