1. This problem provides practice in using the Entrez search program at the National Center for Biotechnology Information (NCBI) to find the amino acid sequence of the human heat shock factor HSF1. Such searches normally return a large number of matches. We will use the Entrez Boolean search features, which restrict the reported matches to those that meet a series of required conditions. These features allow us to narrow the search to the sequence that we want.

The SRS Web site given above also provides powerful database search routines especially designed for the retrieval of large data sets. The student is encouraged to repeat some of the following exercises on that site.

  1. Go to the Entrez Web site and choose Protein from the drop-down window in the upper left.
  2. Enter the terms <heat shock factor> (without the angled brackets) in the search window and click GO. This search finds every sequence entry in the available protein sequence databases that has these three words anywhere in its text. Click History to see how many matches (hits) were found.
  3. Now reduce the search by entering the same terms enclosed in quotation marks: "heat shock factor". The matches must now include this exact phrase. This time click Preview to go directly to the number of hits in the protein database. What is the number now?
  4. Now limit the search to human proteins. Click Preview/Index and go to the add terms section. Choose organism in the first box, type human in the second, click AND, and then click Preview. The history will now show the results of a search for database entries that contain the term "heat shock factor" AND have human as the organism. How many hits are there now?
  5. We can limit the hits to matches in RefSeq, NCBI's curated reference sequence database, which gives a best representative sequence entry for each protein. Click Limits and, in the Limited To section of the page, ignore the boxes on the left and choose RefSeq in the right box. Then click GO and History. Now we have all human heat shock factors in RefSeq.
  6. The gene of interest is HSF1. Click clear in the text entry box at the top of the page, type HSF1, and click Preview. There should now be one entry left in History. Clicking on the number 1 provides the sequence.
  7. There are other ways of arriving at this final sequence. As another example, pull out all human protein sequences in RefSeq and all HSF1 sequences in all organisms and then select the human one using another Boolean search feature of Entrez. First clear History, clear the upper text box, and reselect Limits, or else just reload the Entrez page and choose Protein in the upper left box. Enter human in the text box at the top, click Limits, and then in the Limited To area, choose Organism in the upper left box and RefSeq in the right box. Click GO and then History. Now we have a complete list of all human proteins in RefSeq.
  8. Now replace human with HSF1 in the upper text field, click Limits, and in the Limited To area, choose gene name in the upper left box and RefSeq in the right box. Click GO and then History. The result should be a small number of HSF1 proteins.
  9. Finally, in History, note the numbers on the two lines produced by the last two searches; each number is preceded by a pound sign (#). Go to the upper text box and type <#1 AND #2> (assuming the numbers are 1 and 2), omitting the angled brackets. This creates a new search that matches only protein sequences that are from humans and that are products of the HSF1 gene, i.e., the new search is the intersection of the previous two. Again, 1 protein should be left.
  10. Note the RefSeq accession number starting with "NP" and use the display links on the page to show the sequence in FASTA format. "NP" identifies the sequence as a curated protein sequence. The sequence may then be copied, pasted into a simple text editor, and saved as a local computer file.
  11. While on the page with the target sequence, click on Links and choose the Nucleotide option. The mRNA and genome sequences corresponding to the protein should now become available. Note that the RefSeq accession numbers start with NM for the annotated mRNA sequence and NT for the annotated genome/chromosome 8 sequence. There are also links to a display of the genome/chromosome map location of the gene and other useful information to explore at leisure.
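
The Boolean searches in the steps above can also be written as single Entrez query strings. The following Python sketch is a hedged illustration only, not part of the original exercise: it builds the query text and a request URL for the NCBI E-utilities ESearch service without contacting the server. The helper names (and_terms, group, esearch_url) are hypothetical, and the field tags [Organism], [Gene Name], and especially the refseq[Filter] term are assumptions about how the Limits choices map onto query syntax; check them against the current Entrez help before relying on them.

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def and_terms(*terms):
    """Join query terms with the Entrez Boolean operator AND (hypothetical helper)."""
    return " AND ".join(terms)


def group(term):
    """Wrap a sub-query in parentheses so it is evaluated as one unit."""
    return f"({term})"


def esearch_url(term, db="protein"):
    """Build (but do not send) an ESearch request URL for a query string."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


# Building blocks; the field tags are assumptions about Entrez query syntax.
phrase = '"heat shock factor"'   # step 3: exact phrase
human = "human[Organism]"        # step 4: organism restriction
refseq = "refseq[Filter]"        # step 5: assumed RefSeq filter term
hsf1 = "HSF1[Gene Name]"         # steps 6 and 8: gene name restriction

# Steps 3-6: one progressively narrowed search.
step6 = and_terms(phrase, human, refseq, hsf1)

# Steps 7-9: two independent searches, then their intersection (#1 AND #2).
search1 = and_terms(human, refseq)
search2 = and_terms(hsf1, refseq)
step9 = and_terms(group(search1), group(search2))

if __name__ == "__main__":
    print(step6)
    print(step9)
    print(esearch_url(step9))
```

Writing step 9 as the AND of two parenthesized sub-queries mirrors the History approach: each group plays the role of one numbered History line, and the final query is their intersection.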


