Bioinformatics Online

Chapter 6 Problems

Web sites mentioned in the problems can be found in the chapter or by using a search engine. Sequences may be retrieved in FASTA format from Entrez or a protein sequence database such as PIR or SwissProt.

1. FASTA uses a lookup table as a rapid way to find common letters and words in the same order and of approximately the same separation in two sequences. Produce a lookup table for single amino acids in the following two protein sequences, and then explain how this information will be used to determine what the alignment should be.
sequence 1: ACNGTSCHQE
sequence 2: GCHCLSAGQD
Amino Acid | Position in Sequence 1 | Position in Sequence 2 | Offset Value*
*Offset value is defined as the position in sequence 1 less the position in sequence 2
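
The lookup-table idea can be sketched in a few lines of Python. This is only an illustration of the single-residue case described in this problem, not the FASTA program itself; the function names `build_lookup` and `offset_counts` are made up for this sketch. Each residue is mapped to its positions in each sequence, every shared residue contributes one offset per pair of positions, and the offset that occurs most often marks the diagonal along which the two sequences share the most residues.

```python
from collections import Counter, defaultdict


def build_lookup(seq):
    """Map each residue to the list of 1-based positions where it occurs."""
    table = defaultdict(list)
    for pos, residue in enumerate(seq, start=1):
        table[residue].append(pos)
    return dict(table)


def offset_counts(seq1, seq2):
    """Count offsets (position in seq1 minus position in seq2) for shared residues."""
    table1, table2 = build_lookup(seq1), build_lookup(seq2)
    counts = Counter()
    for residue in table1.keys() & table2.keys():
        for p1 in table1[residue]:
            for p2 in table2[residue]:
                counts[p1 - p2] += 1
    return counts


seq1 = "ACNGTSCHQE"
seq2 = "GCHCLSAGQD"
counts = offset_counts(seq1, seq2)
# The most frequent offset marks the diagonal with the most matching residues.
best_offset, n_matches = counts.most_common(1)[0]
print(best_offset, n_matches)
```

Printing `build_lookup(seq1)` and `build_lookup(seq2)` side by side gives the rows of the table requested above.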

2. Retrieve the protein sequence of the E. coli RecA protein from SwissProt or Entrez and then submit the sequence to the University of Virginia FASTA server. The PIR identifier of the query sequence is RQECA and the GenBank index is 72985 (search for gi|72985 in Entrez). An identifier number or else the sequence itself in FASTA format may be pasted into the sequence entry window using the "Enter query sequence" drop-down window. Search the database described as the NCBI Human proteins library and use the default search parameters provided by the program.

Answer the following questions:
  1. Identify the name and gi (GenBank index) of the highest scoring sequence.
  2. How many standard deviations above the mean is this score? Note that z´ is a normalized score, calculated as z´ = 50 + 10z, where z is the raw z score. The raw z score is the number of standard deviations by which a given score s differs from the mean, z = (s - m)/σ, where m is the mean and σ is the standard deviation of the scores.
  3. Using Equation 1 relating z score to probability of such a score between unrelated sequences, what is the probability of an alignment between unrelated sequences achieving this high a z score?
  4. How many database sequences were searched? What is the expect value (E) for a search of this many sequences achieving a score as high as z?
  5. By looking at the scores and E values from this search, what is the approximate value of z´ (z´ = 50 + 10z) that corresponds to an expect value of 0.02 (an approximate cutoff for significance)? How many sequences reached this high a score?
  6. Is the alignment of the highest scoring sequence with RecA protein significant and why? How could the significance be further tested?
  7. What biological information (protein structure and function) does this match suggest about the bacterial RecA protein and the human protein?
  8. What was the lowest reported score in this search, and is this score significant?
  9. What scoring matrix and gap penalties were used as default values by the FASTA program?
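
The conversions in questions 2 through 4 can be sketched as follows. Equation 1 is not reproduced on this page, so the extreme value form used here, P(Z ≥ z) = 1 - exp(-exp(-1.2825z - 0.5772)), is an assumption that should be checked against the chapter. The expect value is taken as E = P × D, where D is the number of database sequences searched. The function names are illustrative only.

```python
import math


def raw_z(z_normalized):
    """Invert z' = 50 + 10z to recover the number of standard deviations z."""
    return (z_normalized - 50.0) / 10.0


def prob_z_at_least(z):
    """Assumed extreme value form: P(Z >= z) = 1 - exp(-exp(-1.2825 z - 0.5772))."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-math.exp(-1.2825 * z - 0.5772))


def expect_value(z, n_sequences):
    """E = P * D: expected number of unrelated sequences scoring at least z."""
    return prob_z_at_least(z) * n_sequences


# Hypothetical inputs for illustration; substitute the values from the FASTA output.
z = raw_z(120.0)
print(z, prob_z_at_least(z), expect_value(z, 10000))
```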

3. For protein database searches, the BLASTP algorithm first makes a list of three-letter words in the query sequence and then scores these words for matches with themselves and with all other possible words using the BLOSUM62 scoring matrix. The 50 highest-scoring matches are kept. Database sequences are then scanned for matches to these high-scoring words, and if such are found, a local alignment is made with the query sequence by dynamic programming. Use the BLOSUM62 scoring matrix in Figure 3.16, page 105. Note that the matrix values are in half-bit units.
  1. Suppose that the three-letter word HFA is in the query sequence, what is the log odds score of a match of HFA with itself?
  2. Scan through the table and find the highest-scoring match with H (say amino acid X). What would be the score for HFA in our query sequence matching XFA in the database sequence?
  3. Scan again and find any worst-scoring match with H (say amino acid Y; like X above, Y is a placeholder and not necessarily tyrosine). What is the score for a match of HFA with YFA?
  4. Repeat the last two questions for the second and third letters in HFA.
  5. How many possible matches are there with HFA? (BLASTP uses approximately the best 50.)
  6. How many words will be searched for, starting with a query sequence that is 300 amino acids long?
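
The word-scoring step can be sketched as below. The three matrix rows are transcribed from the commonly distributed BLOSUM62 table in half-bit units and should be checked against Figure 3.16; only the rows for H, F, and A are included, so `word_score` works only for query words built from those three letters. The function names and the `keep` parameter are illustrative, and the sketch ranks words by score without reproducing the rest of the BLASTP search.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for H, F, and A (half-bit units), columns in AMINO_ACIDS order.
# Transcribed by hand for this sketch; verify against Figure 3.16.
ROWS = {
    "H": [-2, 0, 1, -1, -3, 0, 0, -2, 8, -3, -3, -1, -2, -1, -2, -1, -2, -2, 2, -3],
    "F": [-2, -3, -3, -3, -2, -3, -3, -3, -1, 0, 0, -3, 0, 6, -4, -2, -2, 1, 3, -1],
    "A": [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
}
SCORE = {q: dict(zip(AMINO_ACIDS, row)) for q, row in ROWS.items()}


def word_score(query_word, db_word):
    """Sum the substitution scores position by position."""
    return sum(SCORE[q][d] for q, d in zip(query_word, db_word))


def best_words(query_word, keep=50):
    """Score every possible word of the same length and keep the top scorers."""
    candidates = ("".join(w) for w in product(AMINO_ACIDS, repeat=len(query_word)))
    return sorted(candidates, key=lambda w: word_score(query_word, w), reverse=True)[:keep]


def n_query_words(query_length, word_length=3):
    """Number of overlapping words of the given length in a query."""
    return max(query_length - word_length + 1, 0)


print(word_score("HFA", "HFA"), best_words("HFA")[:5], n_query_words(300))
```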

4. Run the E. coli RecA protein against the yeast genome on the BLAST server. Choose the BLASTP program and carefully review the various option windows on the page that comes up. Choose yeast as the genome database to be searched. Enter the RecA sequence in FASTA format or the PIR identifier into the input data window and indicate which choice was made in the small option window just above the input data window. Otherwise, use the default parameters provided by the program. You must wait in a queue for the results, then click on the format results window.

Answer the following questions:
  1. In the diagram that comes up, click the mouse on the yeast sequence which best matches the RecA query sequence. Identify the name and gi (GenBank index) of the highest-scoring sequence and the score in bits.
  2. What scoring matrix and gap penalties were used?
  3. What values of K and λ were used for calculating the expect values (E) for the gapped alignment (note that there are two sets of these parameters: one for ungapped and one for gapped alignments)? Where do these values come from?
  4. The score shown in the program output is in units of "normalized bits" = [(λ × raw score) - ln K] / ln 2. The raw score is shown in parentheses. What are the units of the raw score (those of the BLOSUM62 matrix)? Calculate the raw score in bits from the "normalized bits."
  5. How many database sequences were searched?
  6. Calculate the expect value E for a search of this many sequences achieving a score as high as that found in part 1. In the formula, be sure to use the effective lengths of the sequences given in the program output.
  7. By looking at the scores and E values from this search, what is the approximate value of the alignment score in normalized bits that corresponds to an expect value E of 0.06 (close to an approximate cutoff of 0.02–0.05 for significance)? How many sequences reached this high a score?
  8. Is the alignment of the highest-scoring sequence with RecA protein significant and why? What biological information (protein structure and function) does this match suggest about the bacterial RecA protein and the yeast protein?
  9. What was the lowest reported score in this search and is this score significant?
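
The arithmetic in questions 4 and 6 can be sketched as follows. The bit-score formula is the one quoted in question 4. The expect value is computed here as E = m × n × 2^(-S'), where S' is the score in normalized bits and m and n are the effective lengths of the query and the database; this form is assumed to be the formula the problem refers to. The numerical inputs are placeholders, so substitute the λ, K, and effective lengths printed in the BLAST output.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized bits = (lambda * raw score - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def raw_from_bits(bits, lam, k):
    """Invert bit_score to recover the raw score."""
    return (bits * math.log(2) + math.log(k)) / lam


def expect_value(bits, query_eff_len, db_eff_len):
    """E = m * n * 2**(-S') using effective lengths m and n."""
    return query_eff_len * db_eff_len * 2.0 ** (-bits)


# Placeholder values for illustration only.
lam, k = 0.267, 0.041
bits = bit_score(100, lam, k)
print(bits, raw_from_bits(bits, lam, k), expect_value(bits, 300, 1_000_000))
```

Because 2^(-S') = K × exp(-λS), the same E is obtained from the raw score S as K × m × n × exp(-λS).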

5. PSI-BLAST is a version of the BLAST algorithm that uses the results from an initial search for similar protein sequences to construct a type of scoring matrix that can then be used for additional rounds of searches, called iterations. The variability found in each column of the scoring matrix allows additional sequences that have different combinations of amino acids in the sequence positions to be found. The algorithm provides a rapid but less precise search than other methods because the scoring matrix produced is only approximate and includes most of the original query sequence. A note of caution: The iterations can lead to the addition of sequences that do not share a region with the original query sequence, but instead share a different region with some of the sequences added earlier; i.e., these new sequences are not true family members but alien sequences. The process stops when no more sequences are found. The user can control the number of sequences to be included at each iteration or else use the score cutoff recommended by the program. The method is often used to perform a rapid and preliminary search for members of a sequence family. The sequences found can then be multiply aligned by other better-defined methods.
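
The idea of a scoring matrix with one column per sequence position can be sketched as below. This is a deliberately simplified construction, not the PSI-BLAST procedure: it assumes an ungapped alignment, a uniform background frequency for all 20 amino acids, a fixed pseudocount, and no sequence weighting. The function name `build_pssm` and the toy sequences are made up for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm


# Toy alignment: column 1 is invariant, columns 2 and 3 vary.
pssm = build_pssm(["ACD", "ACE", "AGD"])
print(round(pssm[0]["A"], 2), round(pssm[1]["C"], 2), round(pssm[1]["G"], 2))
```

A column with several well-represented residues gives positive scores to each of them, which is how a variable column lets sequences with different residue combinations score well.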

Perform the following analysis and answer these questions:

We provide a protein sequence of a DNA polymerase called iota that replicates past sites of DNA damage and makes mutations. This is a mouse homolog (Entrez search for gi|6755274) of a yeast gene called RAD30. Submit the sequence to PSI-BLAST searching the nr (nonredundant) Genpro database. Use the given (default) options of the program. Repeat the search for an additional iteration using the cutoff scores recommended by the program.

  1. How many matches were found above the cutoff score after the initial search?
  2. Using the Web links provided, identify some of the highest-scoring sequences. What classes of organisms do the matched genes originate from? Is this sequence representative of a protein family found in just a few or many organisms?
  3. How many additional matches were found after the first iteration, and do most appear to be the same type of function, e.g., DNA repair or replication?

6. MAST search with PSSMs obtained from MEME and BLOCKS alignments. We will use two Web sites that search for common patterns in a submitted group of protein sequences: the BLOCKS server at the Fred Hutchinson Cancer Research Center in Seattle, and the MEME server at the San Diego Supercomputer Center, University of California at San Diego (UCSD). These sites provide examples of well-defined pattern analyses. A family of related sequences from a PSI-BLAST search should usually be subjected to further analysis by these other methods. These searches produce a log odds scoring matrix (position-specific scoring matrix or PSSM; see Chapter 5) that may then be used to search through other sequences for the same pattern. There is no provision for gaps. The MAST program, also at UCSD, searches every sequence in a protein sequence database for high-scoring matches to the patterns.

The BLOCKS server has a number of very useful programs for sequence analysis and maintains a database of aligned sequence patterns from related sequences called the BLOCKS database. A BLOCK defines a region of similarity that is a signature of a particular protein family. A family may be defined by one or more BLOCKS. A single sequence may be aligned with all of the existing BLOCKS in the database to determine whether the sequence carries any of the patterns represented by the database. The BLOCKS server searches sequentially through the sequences for common patterns and also uses the Gibbs sampler to locate patterns. MEME uses the expectation maximization algorithm to locate patterns.

These servers produce large volumes of output, and MEME e-mails the results in Web page (HTML) format. Five related repair proteins in the RecA-Rad51 family are to be analyzed for common patterns (search for gi|54866, gi|118683, gi|132224, gi|3914552, and gi|1350566 in Entrez). These proteins bind to single-stranded and double-stranded DNAs and promote base-pairing between the molecules that can lead to genetic recombination. Retrieve them in FASTA format and paste them together in series in the FASTA msa format (see Chapter 2, p. 53) using a simple text editor.
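
For readers who prefer a script to a text editor, pasting the records together can be sketched as follows. The function name `merge_fasta` is made up, the records are assumed to be already downloaded as text, and the two short records in the usage line are placeholders rather than the real sequences.

```python
def merge_fasta(records):
    """Join FASTA-formatted strings into one multi-sequence FASTA text."""
    cleaned = []
    for text in records:
        # Drop blank lines and stray whitespace around each record.
        lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
        if not lines or not lines[0].startswith(">"):
            raise ValueError("each record must start with a '>' header line")
        cleaned.append("\n".join(lines))
    return "\n".join(cleaned) + "\n"


# Placeholder records for illustration only.
print(merge_fasta([">seq_a\nACDEFG\n\n", ">seq_b\nGHIKLM"]))
```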

  1. BLOCKS search: Perform a BLOCKS search of these protein sequences on the BLOCKS Web site and answer these questions:
    1. How many blocks were found by the MOTIF program and by the Gibbs sampler, and approximately how long were they?
    2. Were any of the patterns found by the MOTIF program the same as those found by the Gibbs sampler?
    3. Are the patterns convincing; i.e., do at least some of the columns have a majority of one amino acid or is there a lot of variation?
    4. How do the relative positions of each pattern in the five original sequences compare?
  2. MEME search: Submit the same five sequences to the MEME Web site, requesting a search for three patterns that may or may not be present in all of the sequences with one copy per sequence. Use the default options of MEME. Examine the results of the MEME analysis and answer the following questions. (Note that MEME sends two files: the first one showing the patterns found, and the second a map of the sequence showing the relative positions of the patterns.)
    1. How many patterns were found and approximately how long were they?
    2. How does the relative position of each pattern in the five original sequences compare?
  3. MAST search: Use the first MEME output file to search the SwissProt database to find additional family members that share the same patterns. A very large output file will be produced. Scan the file, noting the expect values for the aligned regions, and answer the following questions:
    1. Can additional members of this family be identified by this approach? Give three examples of different types of organisms that are in the matched list.
    2. How does the relative order of the patterns in the matched sequences compare with those in the query sequences? Would you expect these sequences to align well?
    3. In the PSSM-to-sequence alignments shown, how was the alignment score determined?
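
Because a PSSM search makes no provision for gaps, matching a pattern to a sequence reduces to sliding the matrix along the sequence and summing one column score per residue. The sketch below shows only that raw sum; MAST goes further and converts scores to statistical significance values, which is not reproduced here. The function name, the default score for residues missing from a column, and the toy matrix are all made up for illustration.

```python
def best_pssm_match(pssm, sequence, default=-4.0):
    """Slide the PSSM along the sequence; return (best score, 0-based start)."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the PSSM")
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # One score per column; residues absent from a column get the default.
        score = sum(col.get(res, default) for col, res in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Toy three-column PSSM with made-up scores.
toy_pssm = [
    {"G": 3.0, "A": 1.0},
    {"K": 2.5, "R": 1.5},
    {"T": 3.0, "S": 2.0},
]
print(best_pssm_match(toy_pssm, "MAGKTLLGRS"))  # prints (8.5, 2)
```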


© 2004 by Cold Spring Harbor Laboratory Press. All rights reserved.
No part of these pages, either text or image, may be used for any purpose other than personal use. Therefore, reproduction, modification, storage in a retrieval system, or retransmission, in any form or by any means, electronic, mechanical, or otherwise, for reasons other than personal use, is strictly prohibited without prior written permission.

