Bioinformatics Online Home
 CHAPTER 3 PROBLEMS
   


PART I. DOT MATRIX ANALYSIS

Using DNA Strider on a Macintosh (DNA Strider is available from Dr. Christian Marck, [email protected])

  1. Protein Sequence Comparison

    Compare two Escherichia coli phage repressor protein sequences by copying FASTA-formatted sequences of the phage λ cI repressor protein (accession no. RPBPL) and the phage P22 c2 repressor protein (accession no. RPBP22) from the GenBank display window at http://www.ncbi.nlm.nih.gov/entrez/ into two DNA Strider protein windows as follows:

    1. Highlight lamc1.pro (RPBPL) with the mouse and, using the Copy command in the Edit menu, copy the sequence to the clipboard.
    2. Start DNA Strider and open a new protein sequence window.
    3. Paste the lamc1.pro sequence into the new protein window using the Paste command in the Edit menu.
    4. Place the cursor at the start of the sequence.
    5. Return to the editor and, using the same procedure as above, copy the p22c2.pro sequence (RPBP22) into the clipboard, and then into a second new protein window in DNA Strider.
    6. Place the cursor at the start of the sequence. The sequences in the two windows, one in the top window and the other in the bottom window, may now be compared using the matrix option of DNA Strider.
    7. Hold down the Option key, open the matrix pull-down menu, and choose the protein matrix option. Then release the mouse button.
    8. In the window that appears, set the protein matrix options on the right. Choose a window and stringency of 1, and a scale of auto by sliding the cursor in the windows to these choices. Choose an identity matrix. Then click on the matrix button for proteins.
    9. Examine the matrix for a diagonal row of dots, which marks a region of sequence similarity. Note the background of chance matches that also appears; it is reduced in the steps below by using a larger window.
    10. Close the matrix window and then repeat the matrix analysis by choosing a window of 2 and a stringency of 2. Note that the similarity stands out much more clearly.
    11. Repeat the matrix analysis again looking for a stringency of 2 in a window of 3 amino acids. Note that the region of similarity stands out more clearly still but that the resolution, i.e., the exact position of the individual amino acid matches, is not as clear.
    12. Repeat the analysis using the amino acid scoring matrix BLOSUM62.
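The window and stringency settings used in the steps above can be illustrated with a short Python sketch. It assumes the simplest reading of those settings: a dot is placed wherever a window of the given length contains at least `stringency` identical residues. DNA Strider's own implementation may differ in detail (for example, in where within the window the dot is drawn), and the two peptides are made-up toy strings, not the repressor sequences.

```python
def dot_matrix(seq_a, seq_b, window=1, stringency=1):
    """Return the set of 0-based (i, j) window start positions at which the
    window of seq_a starting at i and the window of seq_b starting at j
    share at least `stringency` identical residues (identity scoring)."""
    if not 1 <= stringency <= window:
        raise ValueError("stringency must be between 1 and window")
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the matrix as text: seq_a runs across, seq_b runs down."""
    lines = ["  " + seq_a]
    for j, residue in enumerate(seq_b):
        row = "".join("*" if (i, j) in dots else "." for i in range(len(seq_a)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


# Hypothetical toy peptides sharing the segment TAYIAK.
a = "MKTAYIAKQR"
b = "GSTAYIAKNP"

loose = dot_matrix(a, b, window=1, stringency=1)   # every identity is a dot
strict = dot_matrix(a, b, window=3, stringency=2)  # 2 of 3 must match

print(render(a, b, loose))
print()
print(render(a, b, strict))
```

With window 1 and stringency 1, isolated chance identities off the diagonal are drawn as dots; with 2 matches required in a window of 3, most of them drop out while the shared segment remains.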

  2. DNA Sequence Comparison

    1. To retrieve the DNA sequences of the above two repressor genes, go to the GenBank entries for the phage λ and phage P22 genomes and locate the genes in the features table (NC_001416, complementary strand positions 37227..37940, and NC_002371, complementary strand positions 12764..13414, respectively).
    2. Find the coding sequence entry (CDS) for the repressor genes in Features and click the mouse on the CDS link. A new window will come up with the DNA sequence.
    3. Copy and paste the sequences into two DNA windows in DNA Strider.
    4. Use the DNA matrix option in the matrix window to obtain a dot matrix analysis of these DNA sequences with the identity matrix, first using a stringency of 1 and a window of 1, and then a stringency of 7 and a window of 10.
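Both genes lie on the complementary strand, so the sequence shown by the CDS link is the reverse complement of the listed genome positions. The following Python sketch shows that relationship, assuming 1-based inclusive coordinates as used in the GenBank features table. The function names and the 12-base toy genome are illustrative only.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Complement each base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def feature_length(start, end):
    """Length of a feature given 1-based inclusive coordinates."""
    return end - start + 1


def extract_feature(genome, start, end, complement=False):
    """Return genome[start..end] (1-based, inclusive); reverse-complement it
    when the feature is annotated on the complementary strand."""
    if not 1 <= start <= end <= len(genome):
        raise ValueError("coordinates fall outside the genome")
    segment = genome[start - 1:end]
    return reverse_complement(segment) if complement else segment


# Made-up toy genome: positions 4..9 read TTACAT on the top strand,
# which is ATGTAA when read on the complementary strand.
toy_genome = "GGGTTACATCCC"
print(extract_feature(toy_genome, 4, 9, complement=True))

# Lengths implied by the coordinates quoted in the problem.
print(feature_length(37227, 37940), feature_length(12764, 13414))
```

The two lengths computed from the quoted coordinates are 714 and 651 bases, both multiples of 3, as expected for coding sequences.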

  3. Self-comparison for Finding Repeated Sequences

    1. Open a GenBank window for the haptoglobin hp2 protein sequence (accession no. 1006264A).
    2. Copy the sequence into a new protein window in DNA Strider.
    3. Use the protein self-matrix option to compare the sequence to itself. Use window 1, stringency 1, and identity matrix.
    4. Note the presence of any repeated elements and where they are.
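In a self-matrix, the main diagonal is always filled, and a repeated element shows up as a shorter diagonal line parallel to it. The Python sketch below lists such off-diagonal runs of identical residues. It is one simple way to read repeats off a window 1, stringency 1 identity comparison, not the method DNA Strider uses, and the test sequence is a made-up string rather than the haptoglobin sequence.

```python
def repeated_segments(seq, min_length=4):
    """Return (i, j, length) triples, with 0-based i < j, for every maximal
    run where seq[i:i+length] == seq[j:j+length] and length >= min_length.
    Each diagonal of the self-matrix (fixed offset j - i) is scanned once."""
    n = len(seq)
    hits = []
    for offset in range(1, n):            # offset 0 is the main diagonal
        run = 0
        for i in range(n - offset):
            if seq[i] == seq[i + offset]:
                run += 1
                continue
            if run >= min_length:
                hits.append((i - run, i - run + offset, run))
            run = 0
        if run >= min_length:             # run reaching the matrix edge
            end = n - offset
            hits.append((end - run, end - run + offset, run))
    return hits


# Hypothetical sequence in which MKVLAAGIV occurs twice.
toy = "MKVLAAGIVRESTMKVLAAGIVQ"
for i, j, length in repeated_segments(toy):
    print(f"{toy[i:i + length]} at positions {i + 1} and {j + 1}")
```

Raising `min_length` plays a role similar to raising the window and stringency in the dot matrix: short chance matches are ignored and only longer repeats are reported.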

  4. Complex Repeated Elements

    1. Obtain the human and chicken erythroid transcription factors (accession nos. CAA35120 and P17678, respectively) from GenBank.
    2. Copy and paste the sequences into two protein windows in DNA Strider.
    3. Compare these sequences, first each to itself and then to each other, using the same stringency and window settings in each case (stringency/window of 2/3, or much higher, such as 15/23).
    4. What primary structure features do these proteins share? Look at the sequences and see if you can identify any features, e.g., repeats of the same amino acid, that are affecting the appearance of the dot matrix.

  5. Sequence Complexity

    When the same sequence characters are repeated many times, the complexity of the sequence is said to be low; i.e., only a few of the available sequence characters are used, or only a single character may be present. Such regions can make alignments look artificially good and score artificially high. They become quite apparent on the self-matrix as horizontal or vertical rows of dots.

    1. Examine the self-matrix pattern for the human erythroid factor above (stringency 1, window 1, identity matrix) and describe what is observed around sequence positions 55 and 265.
    2. Examine the sequence and report what is found in the sequence at these positions.
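Low-complexity regions can also be located numerically. The Python sketch below uses the Shannon entropy of the residue composition in a sliding window as the complexity measure. This is a substitute technique chosen for illustration, not something the problem prescribes, and the window length, entropy cutoff, and test sequence are arbitrary assumptions.

```python
import math
from collections import Counter


def window_entropy(seq, window=12):
    """Shannon entropy (bits) of the residue composition of each window.
    A window made of a single repeated residue scores 0."""
    scores = []
    for start in range(len(seq) - window + 1):
        counts = Counter(seq[start:start + window])
        entropy = -sum(
            (c / window) * math.log2(c / window) for c in counts.values()
        )
        scores.append(entropy)
    return scores


def low_complexity_regions(seq, window=12, max_entropy=1.5):
    """Merge overlapping or adjacent low-entropy windows into
    (start, end) regions, 1-based and inclusive."""
    regions = []
    for start, entropy in enumerate(window_entropy(seq, window)):
        if entropy > max_entropy:
            continue
        first, last = start + 1, start + window
        if regions and first <= regions[-1][1] + 1:
            regions[-1] = (regions[-1][0], max(regions[-1][1], last))
        else:
            regions.append((first, last))
    return regions


# Hypothetical sequence: a run of 15 alanines between two varied flanks.
toy = "MKTLVDERWQ" + "A" * 15 + "GHNPFYCIST"
print(low_complexity_regions(toy))
```

Because each window is scored as a whole, the reported region extends a few residues into the flanks on either side of the alanine run; a smaller window or a lower cutoff narrows it.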

Using EMBOSS Dot Matrix Software

For the instructor. A knowledgeable computer support person will need to compile the EMBOSS programs on a UNIX or Linux server (Mac OS X is an alternative, but more time-consuming, option) and then provide X server access to PCs from the server as discussed in Figure 3.5 and the text. The EMBOSS programs are well documented, and online help is accessed by typing tfm followed by the program name, e.g., dotmatcher. Some of the displays done above with DNA Strider cannot be reproduced because dotmatcher has a minimum window size of 3. An alternative dot matrix program, dotter, is described in the text. The sequences should be saved as text files in FASTA format in a location on the server that is convenient for student access. This task is good practice for students to do themselves if they have an account on the server. For now, they could save the GenBank files on a PC and then move them to the server, for example. It is also a good idea to use a text editor to make a protein scoring matrix that scores identities as 1 and mismatches as 0, and to place this matrix in the EMBOSS data directory with the other scoring matrices. In Chapter 12, students will learn how to retrieve sequences from GenBank directly using Perl scripts.
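Instead of typing the identity matrix by hand, it can be generated. The Python sketch below prints a matrix that scores identities as 1 and mismatches as 0. It assumes the layout used by the NCBI-style BLOSUM files (comment lines starting with #, a header row of residue letters, then one labeled row per residue); check this against the matrices already present in the EMBOSS data directory before installing it. The residue order and the choice to score B, Z, X, and * as matching themselves are also assumptions.

```python
# Residue order copied from the usual BLOSUM62 layout (an assumption).
RESIDUES = "ARNDCQEGHILKMFPSTWYVBZX*"


def identity_matrix_text(residues=RESIDUES):
    """Return the text of a scoring matrix with 1 on the diagonal
    and 0 everywhere else."""
    lines = ["# Identity matrix: match = 1, mismatch = 0"]
    lines.append("  " + " ".join(f"{r:>2}" for r in residues))
    for row in residues:
        scores = " ".join(f"{int(row == col):>2}" for col in residues)
        lines.append(f"{row} {scores}")
    return "\n".join(lines) + "\n"


matrix_text = identity_matrix_text()
print(matrix_text)
```

The printed text can be redirected to a file and copied to the data directory under whatever name the support person chooses.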

For the students. For the above problems, it is best to have FASTA files of the sequences (other sequence formats can also be used).

  1. Run dotmatcher on the remote server using the X-Window client program. The program prompts for sequence names, reads in the sequences, and prompts for window size and stringency.
  2. For most input prompts, press Return to accept the default, which is usually a reasonable choice.
  3. You can also use different scoring matrices to test their effect on the results, but these must be given as options when you type the name of the program, e.g., "dotmatcher -matrixfile=mychoice".
  4. Type "tfm dotmatcher" to read about all of the options.
  5. If all goes well, the results will be displayed in a window.
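To avoid the interactive prompts, the same choices can be given on the command line. The Python sketch below only assembles and prints such a command; it does not run it. It assumes the dotmatcher qualifiers -asequence, -bsequence, -windowsize, -threshold, and -matrixfile, with -threshold taken here to play the role the text calls stringency; confirm the names with "tfm dotmatcher" on your server. The file names are the ones used earlier in this problem set, and mychoice is the placeholder matrix name from the step above.

```python
import shlex


def dotmatcher_command(a_file, b_file, window, threshold, matrix_file=None):
    """Build the argument list for a non-interactive dotmatcher run."""
    command = [
        "dotmatcher",
        "-asequence", a_file,
        "-bsequence", b_file,
        "-windowsize", str(window),
        "-threshold", str(threshold),
    ]
    if matrix_file is not None:
        command += ["-matrixfile", matrix_file]
    return command


command = dotmatcher_command(
    "lamc1.pro", "p22c2.pro", window=3, threshold=2, matrix_file="mychoice"
)
print(shlex.join(command))  # paste the printed line into the server shell
```

Building the command as a list keeps each option paired with its value, which makes it easy to repeat the run with several window and threshold settings.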




 

© 2004 by Cold Spring Harbor Laboratory Press. All rights reserved.
No part of these pages, either text or image, may be used for any purpose other than personal use. Therefore, reproduction, modification, storage in a retrieval system, or retransmission, in any form or by any means, electronic, mechanical, or otherwise, for reasons other than personal use, is strictly prohibited without prior written permission.

 

 