Bioinformatics Online Home

Chapter 6: Database Searching for Similar Sequences

Similarity searches in sequence databases have become a mainstay of bioinformatics, and large sequencing projects in which all of the genomic DNA of an organism is sequenced have become commonplace. Similarity searches are also remarkably useful for inferring the function of genes whose sequences have been determined in the laboratory but for which no biological information is available. In these searches, the sequence of the gene of interest is compared to every sequence in a sequence database, and the similar ones are identified. Alignments with the best-matching sequences are shown and scored. If a query sequence can be readily aligned to a database sequence of known function, structure, or biochemical activity, the query sequence is predicted to have similar properties.

The strength of these predictions depends on the quality of the alignment between the sequences. As a rough rule, for searches of a protein sequence database with a query protein sequence, the prediction is very strong if more than one-half of the amino acid sequence is identical in the alignment. For searches of a nucleic acid sequence database with a nucleic acid query sequence, sequences that encode proteins should be translated first, because related protein sequences are more readily identified. If only nucleic acid sequences are compared, most of the positions should be identical, with few gaps, for a strong prediction. As the degree of similarity decreases, confidence in the prediction also decreases. The programs used for these database searches provide statistical evaluations that serve as a guide for interpreting the alignment scores.
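The rough one-half identity rule for protein alignments can be made concrete with a short sketch. The function names below are hypothetical, and the choice of denominator (aligned columns in which neither sequence has a gap) is an assumption made for illustration; search programs may report identity over a different length.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of gap-free alignment columns in which the two residues match.

    Both inputs are rows of one pairwise alignment, so they must have equal
    length. Assumption: columns containing a gap are excluded from the count.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # skip columns that contain a gap
        compared += 1
        if x == y:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


def is_strong_protein_match(aligned_a: str, aligned_b: str) -> bool:
    """Apply the rough rule: more than one-half of the residues identical."""
    return percent_identity(aligned_a, aligned_b) > 50.0
```

For example, `percent_identity("AAAA", "AAGG")` gives 50.0, which does not exceed the threshold, so `is_strong_protein_match` returns `False` for that pair. Identity alone says nothing about how much of the sequence the alignment covers or how many gaps it contains, so it is only a first check.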

Previous chapters have described methods for aligning sequences or for finding common patterns within sequences. The purpose of making alignments is to discover whether sequences are homologous, i.e., likely to be derived from a common ancestral sequence. If homology can be established with confidence, the sequences are likely to have maintained the same function as they diverged from each other during evolution. If an alignment can be found that would rarely be observed between random sequences, the sequences can be predicted to be related with a high degree of confidence. The presence of one or more conserved patterns in a group of sequences is also useful for establishing evolutionary and structure–function relationships among them.
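One way to illustrate the idea of an alignment that "would rarely be observed between random sequences" is to compare the observed score with scores obtained after shuffling one of the sequences. The sketch below is an empirical illustration only, not the statistical method used by any particular search program. The function names, the simple match/mismatch/gap scores, and the number of trials are all assumptions chosen for brevity.

```python
import random


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            if score > best:
                best = score
        prev = curr
    return best


def shuffle_p_value(a, b, trials=200, seed=0):
    """Estimate how often a shuffled copy of b scores at least as well as b.

    Returns (observed_score, estimated_p). Shuffling preserves the length and
    composition of b while destroying the order of its residues.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = local_alignment_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if local_alignment_score(a, "".join(letters)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return observed, (hits + 1) / (trials + 1)
```

A small estimated value suggests that the observed alignment is unlikely to arise from sequences of the same composition arranged at random. The resolution of the estimate is limited by the number of trials, so with 200 shuffles it cannot go below about 0.005.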

The methods used for establishing sequence relationships in database searches are summarized in Table 6.1. In addition to standard searches of a sequence database with a query sequence, two family-based searches are possible. A matrix representation of a family of related protein sequences may be used to search a sequence database for additional proteins in the same family. Alternatively, a query protein sequence may be searched for sequence patterns that represent a protein family to determine whether the sequence belongs to that family. Genomic DNA sequences may also be searched for consensus regulatory patterns such as those representing transcription-factor-binding sites, promoter recognition signals, or mRNA splicing sites; these types of searches are discussed in Chapter 9.
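A matrix representation of a family can be sketched as a table of log-odds scores, one column per alignment position, that is slid along a target sequence. The example below is a minimal illustration under stated assumptions: a gap-free alignment, a uniform background frequency for all 20 amino acids, a pseudocount of 1, and a toy set of aligned segments invented for the example. The function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_scoring_matrix(aligned, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Build log-odds scores (one dict per column) from gap-free aligned sequences."""
    width = len(aligned[0])
    if any(len(s) != width for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)  # assumption: uniform background frequencies
    total = len(aligned) + pseudocount * len(alphabet)
    matrix = []
    for col in range(width):
        counts = Counter(s[col] for s in aligned)
        matrix.append({
            r: math.log2(((counts[r] + pseudocount) / total) / background)
            for r in alphabet
        })
    return matrix


def best_window(matrix, sequence):
    """Return (score, start) of the highest-scoring window, or None if too short."""
    width = len(matrix)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Residues outside the alphabet receive the lowest score in that column.
        score = sum(col.get(r, min(col.values())) for col, r in zip(matrix, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Toy family of aligned segments (made up for illustration).
family = ["GDSGG", "GDSGG", "GDSGA", "GDAGG"]
matrix = build_scoring_matrix(family)
hit = best_window(matrix, "MKTGDSGGPLV")
```

Here the best window starts at index 3 of the target, where the segment `GDSGG` occurs. In a database search, the same scan would be repeated over every sequence, and the highest-scoring sequences would be reported as candidate family members.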


© 2004 by Cold Spring Harbor Laboratory Press. All rights reserved.
No part of these pages, either text or image, may be used for any purpose other than personal use. Therefore, reproduction, modification, storage in a retrieval system, or retransmission, in any form or by any means, electronic, mechanical, or otherwise, for reasons other than personal use, is strictly prohibited without prior written permission.

