Bioinformatics Online Home

Problems: Chapter 9


Problem 1

The objective is to make a good guess at the locations of the four genes (lacI, lacZ, lacY, and lacA) of the E. coli lac operon within the operon sequence, first using the simple ORF-scanning function in DNA Strider and then using a Web site that searches for genes with a hidden Markov model of E. coli genes.

  1. Using DNA Strider, retrieve the lac operon sequence from GenBank (accession no. J01636). If DNA Strider is not available, then work through steps 8–10 instead.
    1. Start DNA Strider.
    2. Open a new DNA window.
    3. Open the E. coli lac operon sequence.
    4. Copy and paste the sequence from the browser window into the Strider window.
    5. Position the cursor at the first base in the sequence in the Strider window.
    6. From the "DNA" menu, choose "ORF Map" and then "6 frame" to show Met codons (half bar) and termination codons (full bar) in all six possible reading frames.
    7. On a piece of paper, note where the longest ORFs are located.
    8. If DNA Strider is not available, then examine Figure 9.1 and find the longest ORFs.
    9. Examine the feature table of GenBank entry J01636, looking for CDS (coding sequence) features.
    10. Compare the entries to your annotation.
  2. GeneMark Web site
    1. Go to the GeneMark Web site at Georgia Tech (or perform a Web search for GeneMark) and paste in the lac operon sequence. If you have problems accessing the GeneMark site, you can instead use the CDSB option of the Gene Finder at the Baylor College of Medicine.
    2. Choose E. coli as the organism.
    3. Report the start and stop positions and the lengths of the predicted ORFs, and compare them to the rough locations found with DNA Strider. What differences (if any) exist between the GeneMark output and the gene locations you predicted with DNA Strider?
  3. The EMBOSS programs described in Chapters 3 and 12, such as PLOTORF, GETORF, and SIXPACK, may also be used to find the longest ORFs. However, this requires that the instructor has installed these programs and that students have access to a UNIX or Linux server.
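
For readers with neither DNA Strider nor EMBOSS at hand, the six-frame ORF scan used in this problem can be approximated with a short script. The following is a minimal sketch in Python, not the algorithm used by DNA Strider, GETORF, or GeneMark. It assumes an ORF runs from the first ATG after a stop codon to the next in-frame stop, and it ignores alternative start codons such as GTG. The function names and the `min_len` parameter are illustrative choices.

```python
# Minimal six-frame ORF scan (illustrative; ORF = first ATG ... next in-frame stop).
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Yield (start, end) 0-based half-open spans of ATG..stop ORFs in one frame."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                  # first Met codon after the previous stop
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3         # span includes the stop codon
            start = None


def find_orfs(seq, min_len=90):
    """Scan all six frames; report 1-based forward-strand coordinates, longest first."""
    seq = seq.upper()
    n = len(seq)
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for a, b in orfs_in_frame(s, frame):
                if b - a < min_len:
                    continue
                if strand == "+":
                    left, right = a + 1, b
                else:
                    # map positions on the reverse complement back to the forward strand
                    left, right = n - b + 1, n - a
                found.append({"strand": strand, "frame": frame + 1,
                              "left": left, "right": right, "length": b - a})
    return sorted(found, key=lambda orf: orf["length"], reverse=True)
```

Calling `find_orfs(sequence)` on the pasted J01636 sequence returns a list that can be compared by eye with the CDS features in the GenBank entry and with the GeneMark predictions. Because the start-codon rule here is a simplification, the predicted left ends may differ from the annotated starts.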

Problem 2

  1. Align the cDNA and genomic sequences using the lalign Web server (perform a Web search to find it) or a local copy of lalign.
    Sequences: Retrieve the following sequences in FASTA format and either paste them into the sequence window of an alignment program Web site or else save them to a local file on your computer using a simple text editor for analysis with a local program.
    1. Arabidopsis ATRAD1 cDNA sequence, GenBank accession no. AF160500.
    2. Retrieving the genomic sequence of a gene is less straightforward, because GenBank usually does not store it as a separate entry; instead, it is generated from the chromosomal sequence held on large sequence fragments (contigs). The Arabidopsis Rad1 genomic DNA sequence reads backward on the complementary strand of entry no. AB010072, from position 66706 to 62831. The best way to retrieve this sequence is to open GenBank nucleotide entry no. AB010072, find this coding sequence, and click on CDS. A new window will appear with the coding sequence.
    3. Then open a new browser window at a Web site that generates the complementary strand of a sequence, and paste in the sequence.
    4. Run the program and retrieve the complementary sequence from the new window that comes up. This sequence can then be used as the genomic sequence of the gene in a third browser window with the alignment program.
    5. Note: These retrieval steps are more easily performed by writing a simple computer program (a Perl script or Perl wrapper) on a local machine that retrieves the cDNA and genomic sequences automatically, extracts the desired genomic sequence, makes the complementary strand, and then performs the local alignment with the cDNA sequence. The methodology is described in Chapter 12. You still have to know the accession numbers of the GenBank entries that include the sequences of interest.
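
The extraction and complementing steps above can be sketched in a few lines. This is a minimal illustration written in Python rather than the Perl approach of Chapter 12. It assumes the contig has already been saved locally as FASTA text, and it performs no retrieval from GenBank and no alignment. The helper names `read_fasta` and `extract_region` are hypothetical.

```python
# Extract a region given 1-based inclusive coordinates; if the first coordinate
# is larger than the last, the region is read backward on the complementary strand.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def read_fasta(text):
    """Return the sequence of a single-record FASTA text, dropping the header line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return "".join(line for line in lines if not line.startswith(">")).upper()


def extract_region(seq, first, last):
    """Return seq[first..last] (1-based, inclusive), reverse-complemented if first > last."""
    if first <= last:
        return seq[first - 1:last]
    return reverse_complement(seq[last - 1:first])
```

With the AB010072 contig loaded through `read_fasta`, a call such as `extract_region(contig, 66706, 62831)` would return the 3876 bases of the region in the orientation of the gene, ready to be pasted into the alignment program.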

    Alternatively, The Arabidopsis Information Resource (TAIR) provides information on the chromosomal locus, mRNA, and genomic sequences of this gene (which is called UVH1 or ATRAD1 on this site) and provides a gene model. Try the following:

    1. The location of the ATUVH1 gene can be found with the TAIR SeqViewer. Choose SeqViewer. The first view is of all five chromosomes of Arabidopsis.
    2. Choose only the gene models box, and then search for the ATRAD1 gene. A new view will appear with the gene location shown on chromosome 5.
    3. Choose an 80-kb viewing range, click on the gene location mark, and then look in the expanded view for locus AT5G41150.
    4. Once the gene has been located, a mouse click will open a new window with new links to the cDNA (CDS for coding sequence) and the genome sequence (the TAIR accession number of the sequences is AT5G41150.1).
    5. Note that the locus page also includes information on the predicted gene structure that we are trying to find by sequence alignment of the cDNA and genome sequences.

    The accuracy of your findings below can also be confirmed on the gene information page in GenBank or TAIR.

    1. What do the gaps between the aligned genomic and cDNA sequences represent?
    2. Look for 5′ and 3′ splice junctions near the ends of the gapped regions.
      1. Use the Arabidopsis table of consensus splice sites to work out where the junctions lie on the genomic sequence, and use this information to find the ends of the exons.
      2. Predict and indicate the positions of all exons in the genomic DNA sequence.
      3. Note: The alignment program reads along both sequences, and when it places a gap, the ends of the gap may not fall where we expect them based on our knowledge of the expected splice sites; some adjustment of the gap may be necessary.
  2. Submit the above genomic sequence to the GenScan server and compare the results of its analysis with yours.
    1. How accurate is the GenScan server?
    2. What differences (if any) exist between the GenScan output and the gene locations you predicted using lalign?
  3. Several sites (see Table 3.1) specialize in aligning cDNA or EST sequences with the genome. This analysis can locate the gene on the genome or reveal the gene structure (the locations of exons and introns). One such site is GeneSeqer. On this site, predict the mRNA sequence of ATRAD1 using the provided model of Arabidopsis gene structure, then align the genomic and cDNA sequences of ATRAD1 obtained in step 1 above and compare the result to that found in step 2.
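
The gap adjustment mentioned in step 1 can be illustrated with a small script. The sketch below, in Python, assumes that only the nearly invariant GT and AG dinucleotides at the intron ends are checked; the full Arabidopsis consensus table is more detailed. A gap is moved only to a position that leaves the spliced product unchanged, which is the situation in which an aligner's placement is arbitrary. The function names and the `max_shift` parameter are illustrative.

```python
def adjust_intron(genomic, start, end, max_shift=5):
    """Slide an alignment gap so that the intron reads GT...AG, if possible.

    start and end are 0-based half-open genomic coordinates of the gap as placed
    by the aligner. A shifted placement is accepted only if removing it gives
    the same spliced sequence, i.e. the alignment itself is unchanged.
    Returns the adjusted (start, end), or None if no such placement exists.
    """
    genomic = genomic.upper()
    spliced = genomic[:start] + genomic[end:]
    # try the smallest shifts first: 0, -1, +1, -2, +2, ...
    for k in sorted(range(-max_shift, max_shift + 1), key=abs):
        s, e = start + k, end + k
        if s < 0 or e > len(genomic):
            continue
        intron = genomic[s:e]
        same_product = genomic[:s] + genomic[e:] == spliced
        if same_product and intron.startswith("GT") and intron.endswith("AG"):
            return s, e
    return None


def exons_from_introns(seq_len, introns):
    """Convert sorted 0-based half-open intron spans to 1-based inclusive exon spans."""
    exons, pos = [], 0
    for s, e in introns:
        exons.append((pos + 1, s))
        pos = e
    exons.append((pos + 1, seq_len))
    return exons
```

For each gap reported by lalign, `adjust_intron` suggests a placement consistent with GT...AG, and `exons_from_introns` then lists the exon coordinates for comparison with the GenScan and GeneSeqer output. A return value of `None` signals a gap that needs inspection by hand.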


© 2004 by Cold Spring Harbor Laboratory Press. All rights reserved.
No part of these pages, either text or image, may be used for any purpose other than personal use. Therefore, reproduction, modification, storage in a retrieval system, or retransmission, in any form or by any means, electronic, mechanical, or otherwise, for reasons other than personal use, is strictly prohibited without prior written permission.

