2. Retrieve the protein sequence of the E. coli RecA protein from SwissProt or Entrez and then submit the sequence to the University of Virginia FASTA server. The PIR identifier of the query sequence is RQECA and the GenBank index is 72985 (search for gi72985 in Entrez). An identifier number or else the sequence itself in FASTA format may be pasted into the sequence entry window using the "Enter query sequence" dropdown window. Search the database described as the NCBI Human proteins library and use the default search parameters provided by the program.
Answer the following questions:
 Identify the name and gi (GenBank index) of the highest scoring sequence.
 How many standard deviations above the mean is this score? Note that z´ is a normalized score, calculated as z´ = 50 + 10z, where z is the raw z score. This raw z score represents the number of standard deviations that a given score s lies from the mean, calculated by z = (s − m)/σ, where m is the mean and σ is the standard deviation of the scores.
 Using Equation 1 relating z score to probability of such a score between unrelated sequences, what is the probability of an alignment between unrelated sequences achieving this high a z score?
 How many database sequences were searched? What is the expect value (E) for a search of this many sequences achieving a score as high as z?
 By looking at the scores and E values from this search, what is the approximate value of z´ (z´ = 50 + 10z) that corresponds to an expect value of 0.02 (an approximate cutoff for significance)? How many sequences reached this high a score?
 Is the alignment of the highest scoring sequence with RecA protein significant and why? How could the significance be further tested?
 What biological information (protein structure and function) does this match suggest about the bacterial RecA protein and the human protein?
 What was the lowest reported score in this search, and is this score significant?
 What scoring matrix and gap penalties were used as default values by the FASTA program?
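
The arithmetic behind the questions on z scores, probability, and expect value can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the FASTA program's own code. The function names are hypothetical. Equation 1 is not reproduced in this problem, so the sketch assumes it is the extreme value distribution form P(Z > z) = 1 − exp(−e^(−1.2825z − 0.5772)); check this against Equation 1 in the text before relying on it. The expect value is taken as E = D × P, where D is the number of database sequences searched.

```python
import math

def z_from_normalized(z_prime):
    """Invert z' = 50 + 10z to recover the raw z score."""
    return (z_prime - 50.0) / 10.0

def z_from_score(s, mean, sd):
    """Raw z score: standard deviations that score s lies from the mean."""
    return (s - mean) / sd

def evd_tail_probability(z):
    """ASSUMED form of Equation 1 (extreme value distribution):
    P(Z > z) = 1 - exp(-exp(-(pi / sqrt(6)) * z - 0.5772)).
    expm1 keeps precision when the probability is very small."""
    exponent = -(math.pi / math.sqrt(6.0)) * z - 0.5772156649
    return -math.expm1(-math.exp(exponent))

def expect_value(p, n_sequences):
    """E = D * P: expected number of unrelated sequences reaching the score."""
    return p * n_sequences

if __name__ == "__main__":
    # Hypothetical inputs for illustration only; substitute the values
    # reported by your own search.
    z = z_from_normalized(120.0)
    p = evd_tail_probability(z)
    print("z =", z)
    print("P =", p)
    print("E =", expect_value(p, 40000))
```

Replace the hypothetical z´ of 120 and database size of 40000 with the values from the search output, then compare the resulting E with the value reported by the program.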


