Bioinformatics Online Home

Chapter 2 Problems



1. This problem gives practice in using the Entrez search program at the National Center for Biotechnology Information (NCBI) to find the amino acid sequence of the human heat shock factor HSF1. Such searches normally return a large number of matches. We will use the Entrez Boolean search features, which restrict the reported matches to those that meet a series of required conditions. This allows us to narrow the search to the sequence that we want.

The SRS Web site given above also provides powerful database search routines especially designed for the retrieval of large data sets. The student is encouraged to repeat some of the following exercises on that site.

  1. Go to the Entrez Web site and choose Protein from the drop-down window in the upper left.
  2. Enter the terms <heat shock factor> (without the angled brackets) in the search window and click GO. This search finds any sequence entry in the available protein sequence databases that has these three words anywhere in the text. Click History to see how many matches (hits) are found.
  3. Now narrow the search by entering the same terms surrounded by quotes: "heat shock factor". The matches must now include this exact phrase. This time click Preview to go directly to the number of hits in the protein database. What is the number now?
  4. Now limit the search to human proteins. Click Preview/Index, go to Add Terms, choose Organism in the first box, type human in the second, click AND, and then click Preview. The history will now show the results of a search for database entries that contain the term "heat shock factor" AND originate from humans as the organism. How many hits are there now?
  5. We can limit the hits to RefSeq, NCBI's curated reference sequence database, which gives a best representative sequence entry for each protein. Click Limits and, in the Limited To section of the page, ignore the boxes on the left and choose RefSeq in the right box. Then click GO and History. Now we have all human heat shock factors in RefSeq.
  6. The gene of interest is HSF1. Click clear in the text entry box at the top of the page, type HSF1, and click Preview. There should now be one entry left in History. Clicking on the number 1 provides the sequence.
  7. There are other ways of arriving at this final sequence. As another example, pull out all human protein sequences in RefSeq and all HSF1 sequences in all organisms and then select the human one using another Boolean search feature of Entrez. First clear History, clear the upper text box, and reselect Limits, or else just reload the Entrez page and choose Protein in the upper left box. Enter human in the text box at the top, click Limits, and then in the Limited To area, choose Organism in the upper left box and RefSeq in the right box. Click GO and then History. Now we have a complete list of all human proteins in RefSeq.
  8. Now replace human with HSF1 in the upper text field, click Limits, and in the Limited To area, choose gene name in the upper left box and RefSeq in the right box. Click GO and then History. The result should be a small number of HSF1 proteins.
  9. Finally, note the numbers at the beginning of the two lines that start with a pound sign (#) in History; these identify the last two searches. Go to the upper text box and type <#1 AND #2> (assuming the numbers are 1 and 2), omitting the angled brackets. This creates a new search that matches only protein sequences that are from humans and that are products of the HSF1 gene; i.e., the new search is the intersection of the previous two. Again, one protein should be left.
  10. Note the RefSeq accession number starting with "NP" and use the mouse links to display the sequence in FASTA format. "NP" identifies the sequence as a curated protein sequence. The sequence may then be copied and pasted into the page of a simple text editor and saved as a local computer file.
  11. While on the page with the target sequence, click on Links and choose the Nucleotide option. The mRNA and genome sequences corresponding to the protein should now become available. Note that the RefSeq numbers start with NM for the annotated mRNA sequence and NT for the annotated genome/chromosome 8 sequence. There are also links to a display of the genome/chromosome map location of the gene and other useful information to explore at leisure.
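
The Boolean narrowing in steps 7 to 9 can also be written as a single query string. The Python sketch below only assembles the query text and a URL for NCBI's E-utilities esearch service; it does not contact NCBI. The field tags ([Organism], [Gene Name], [Filter]) and the refseq[Filter] term are assumptions about present-day Entrez syntax, which differs from the Web interface described in the steps above, and the function names are hypothetical.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search service (assumed current endpoint).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def and_terms(*terms):
    """Join search terms with AND, the Boolean intersection used in step 9."""
    return " AND ".join(f"({t})" for t in terms)


def esearch_url(db, term):
    """Build (but do not fetch) an esearch URL for a database and query term."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


# Search 1: all human proteins in RefSeq (step 7).
human_refseq = and_terms("human[Organism]", "refseq[Filter]")
# Search 2: all HSF1 proteins in RefSeq, from any organism (step 8).
hsf1_refseq = and_terms("HSF1[Gene Name]", "refseq[Filter]")
# Step 9: the intersection, equivalent to typing "#1 AND #2".
combined = and_terms(human_refseq, hsf1_refseq)

print(combined)
print(esearch_url("protein", combined))
```

Keeping the two searches as separate strings mirrors the History lines in Entrez: each can be inspected on its own before the two are intersected.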

2. Another useful NCBI search tool is LocusLink, which can be used to search for information on genes and proteins based on the known location of the genes on chromosomes that have been sequenced. Eventually Entrez and LocusLink will probably be combined at NCBI to create an even more powerful search machine. We will retrieve information about the HSF1 protein using LocusLink.
  1. Go to the NCBI LocusLink address given above. In the first box choose LocusLink, then Brief in the second, and Human as the search organism in the third. Then enter HSF1 as the query and click GO. A small number of entries match the query and one of them should be HSF1. The position column shows the relative numbered position on the long arm (q denotes the long arm) of chromosome 8. The colored boxes provide sequences of the gene with direct links to RefSeq. Clicking on the green P gives the protein sequence entries, including the RefSeq sequence labeled NP.
  2. Click on the empty box beside the sequence and then click View to produce a page with a great deal of information about the gene, including gene structure, genome location, RefSeq protein and nucleic acid sequence identifiers, and the evidence on which the gene sequence is based. Click on OMIM (Online Mendelian Inheritance in Man) to see a biological summary of the HSF1 gene functions.
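
The position notation in the LocusLink report (chromosome number, then p or q for the arm, then a band number) can be taken apart mechanically. The following Python sketch is one way to do it; the function name is hypothetical, the pattern covers only simple positions, and the input "8q24.3" is just an example of the format, so read the actual value from the position column.

```python
import re

# Pattern for a simple cytogenetic position such as "8q24.3":
# chromosome (one or two digits, X or Y), arm (p or q), then the band.
_POSITION = re.compile(r"^(\d{1,2}|X|Y)([pq])(\d+(?:\.\d+)?)$")

ARM_NAMES = {"p": "short arm", "q": "long arm"}


def parse_position(position):
    """Split a position string into chromosome, arm name, and band."""
    match = _POSITION.match(position.strip())
    if match is None:
        raise ValueError(f"unrecognized position: {position!r}")
    chromosome, arm, band = match.groups()
    return {"chromosome": chromosome, "arm": ARM_NAMES[arm], "band": band}


print(parse_position("8q24.3"))
```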

3. Visit the Saccharomyces cerevisiae (budding yeast) genome database (SGD) Web site to learn about the yeast transcription factor HSF1. Go to SGD and look up the following information using the global gene hunter. Enter the name of the gene, limit the search by unclicking boxes as needed, and click Submit. Use the links on the following page to answer these next questions. (Note: There is a large group of Ph.D. fellows who scan the literature frequently and add the information to the SGD database.)
  1. On what chromosome does the gene reside?
  2. What is the mature length of the protein?
  3. What are the SwissProt and PIR accession numbers?
  4. Is the gene found in other species (not at all, in one or two, or in many)? If so, give an example of the name of the similar gene in another species.

4. Using any accession number found above, retrieve the sequence in FASTA format from SwissProt and save the file on your PC, where XXXX is the gene name. Note that, traditionally, SwissProt only includes proteins for which there is physical evidence that they exist; e.g., they can be seen as a spot or a band on a gel.
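
Once a FASTA file has been saved, a quick check is to read it back and count the residues. The Python sketch below parses FASTA text into header and sequence pairs; the function name is hypothetical and the record shown is made up purely to illustrate the format (a header line starting with ">" followed by sequence lines).

```python
def read_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up record used only to show the format; it is not a real protein.
example = """>example_protein hypothetical entry
MKTAYIAKQR
QISFVKSHFS
"""

for header, sequence in read_fasta(example):
    print(header, len(sequence))
```

To check a saved file, pass its contents to the same function, for example read_fasta(open(path).read()), where path is wherever the file was saved.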

5. READSEQ is a very useful utility for converting among sequence formats. Read through the online help file before continuing.
  1. Retrieve the mRNA sequence of the yeast SNF2 gene from GenBank (the GenBank locus name is YSCSNF2A, but try using other fields).
  2. Now go to a Web-based READSEQ conversion page and copy and paste the GenBank sequence into the sequence input box and choose Pearson/FASTA format as the output format. Click Perform Conversion and a new box will appear with the sequence in FASTA format. Copy and paste the sequence into the text editor and save as a file called snf2mRNA.seq on your computer.
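
To see what such a conversion involves, here is a minimal Python sketch that turns one GenBank flat-file record into FASTA text. It is not READSEQ and handles far fewer cases: it assumes a single record with a one-line DEFINITION, an ORIGIN section, and a closing "//" line. The function name and the toy record are made up for illustration.

```python
def genbank_to_fasta(text, width=60):
    """Convert one GenBank flat-file record to FASTA text."""
    name, definition = "", ""
    in_origin = False
    residues = []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            name = line.split()[1]
        elif line.startswith("DEFINITION"):
            definition = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Sequence lines look like "    1 atgc atgc"; keep letters only.
            residues.append("".join(ch for ch in line if ch.isalpha()))
    sequence = "".join(residues)
    header = f">{name} {definition}".rstrip()
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + rows) + "\n"


# Toy record, made up for illustration; not a real GenBank entry.
record = """\
LOCUS       TOYSEQ                    24 bp    mRNA    linear
DEFINITION  Made-up record for illustration.
ORIGIN
        1 atggctagct agctagctag ctaa
//
"""

print(genbank_to_fasta(record), end="")
```

The position numbers and spaces in the ORIGIN block are layout only, which is why the sketch keeps letters and discards everything else on those lines.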

6. In addition to individual genes, whole genomes of organisms are becoming available, including many prokaryotes, organelles, and viruses. One good way to retrieve these genomic sequences is through the NCBI Entrez page for genomes and the taxonomy browser.
  1. Go to the NCBI Entrez page and then to the genome page (on the bar at the top of the Entrez page). Enter "Homo sapiens mitochondrion" and click on the entry that appears for the human mitochondrion. Note that the RefSeq accession number starts with NC (nucleotide sequence of chromosome). Examine the sequence of the Homo sapiens mitochondrion. What is the length? Roughly outline the genes that are present. Click on the map to see the genes that are present and then on the gene blocks to see the sequences.
  2. Another resource for microbial genomes is The Institute for Genomic Research. Go to the Comprehensive Microbial Resource page, choose genomes, and click on the genome name of Synechocystis sp. PCC 6803 under the group Cyanobacteria. This organism belongs to an ancient group that produces oxygen by photosynthesis and releases it into the atmosphere. What is the size of the genome and how many proteins are encoded? What does the color code of the genome represent?
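
The RefSeq accession prefixes met in these problems (NP, NM, NT, and NC) can be collected in a small lookup table. The Python sketch below uses the descriptions given in the problems above; the function name is hypothetical, and the accession strings in the loop are placeholders chosen only for their prefixes, not real entries to look up.

```python
# Prefix meanings as described in the problems above.
REFSEQ_PREFIXES = {
    "NP": "curated protein sequence",
    "NM": "annotated mRNA sequence",
    "NT": "annotated genome/chromosome sequence",
    "NC": "nucleotide sequence of chromosome",
}


def describe_accession(accession):
    """Return the description for a RefSeq-style accession such as 'NP_...'."""
    prefix = accession.strip().split("_", 1)[0].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown prefix")


# Placeholder strings: only the two-letter prefix matters here.
for accession in ["NP_000000", "NM_000000", "NT_000000", "NC_000000"]:
    print(accession, "->", describe_accession(accession))
```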


© 2004 by Cold Spring Harbor Laboratory Press. All rights reserved.
No part of these pages, either text or image, may be used for any purpose other than personal use. Therefore, reproduction, modification, storage in a retrieval system, or retransmission, in any form or by any means, electronic, mechanical, or otherwise, for reasons other than personal use, is strictly prohibited without prior written permission.

