Bioinformatics Online Home

Chapter 13 Problems

1. Below is a small hypothetical data set of microarray expression values. The rows are genes A, B, C, and D on the microarray and the columns are observed variations in the transcription of these genes; e.g., the columns could represent the gene expression profile of a type of cancer cell, genes expressed in one part of the cell cycle, or genes expressed at different times after a particular treatment given to cells. Assume that these data have been corrected for sources of variation in the data as we have described in the chapter. To simplify the work, we will let the numbers be the ratio of expression under a given condition to that under a control condition as in a spotted cDNA microarray experiment.

  1. Convert the odds scores in the table above to base-2 log odds scores and enter them in a new table below.

    Log odds scores

  2. Calculate the Pearson correlation coefficient (see Table 13.7) between all gene combinations and between all sample combinations and enter the values into the following tables except where an x is shown. Note that the between-genes analysis is asking, "Are these genes following a similar expression pattern in the samples?" and the between-samples analysis is asking, "For these two samples, is the gene expression pattern similar?"

    Pearson correlation coefficients

  3. Calculate the distances between genes and samples using the absolute value of the correlation coefficient as shown in Table 13.7.

    Distances based on correlation coefficient

  4. Which gene pairs and sample pairs show the greatest similarity on the basis of the above distance scores?
  5. Calculate the Euclidean distance between all gene combinations and all sample combinations based on the log odds scores given above. The calculation is an extension of calculating the hypotenuse of a right-angled triangle. For each pair of corresponding measurements, one value is subtracted from the other and the difference is squared. The squares are added, and the square root of the sum is the Euclidean distance.

    Euclidean distance

  6. Which gene and sample combinations have the smallest Euclidean distances?
  7. How do the closest gene and sample combinations in part 6 compare with those found in part 4 using the absolute value of the correlation coefficient to calculate a distance?
  8. Produce a hierarchical clustering diagram of the samples only based on these Euclidean distances. Use the same distance method that was used for producing a phylogenetic tree in Problem 1 of Chapter 7.
  9. Suppose that samples I and II are from diseased tissue and that III and IV are from normal tissue (i.e., that the samples are from known classes), and that we want to find a test or classifier that will tell whether a new sample "X" belongs to one class (Diseased) or the other (Normal). Compare the samples on a plot of log odds scores of each combination of gene pairs and make a judgment call as to which gene pair provides the best classifier. For example, plot the log odds scores of the four tissue samples on the following graph that shows the gene A value on the vertical axis and the gene B value on the horizontal axis. Draw another graph for the A, C gene combination as shown. On the basis of these two graphs, which gene combination best discriminates between diseased and normal samples?

  10. What approximate values should unclassified sample "X" have to be classified as normal or diseased?
  11. Are any other gene pairs useful for classifying "X"?
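
The calculations in parts 1, 2, 3, and 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ratios below are made-up placeholder values, not the values from the problem's table, and the correlation-based distance is taken to be 1 minus the absolute value of the Pearson coefficient, matching the wording of Problem 4 (check Table 13.7 for the form used in the chapter).

```python
import math

# Hypothetical expression ratios (NOT the problem's table):
# rows are genes A-D, columns are samples I-IV.
ratios = {
    "A": [2.0, 4.0, 0.5, 0.25],
    "B": [4.0, 8.0, 1.0, 0.5],
    "C": [0.5, 0.25, 2.0, 4.0],
    "D": [1.0, 2.0, 1.0, 2.0],
}

# Part 1: base-2 log of every ratio.
log_ratios = {gene: [math.log2(v) for v in row] for gene, row in ratios.items()}

# Column view of the same data, for the between-samples comparisons.
samples = dict(zip(["I", "II", "III", "IV"], zip(*log_ratios.values())))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_distance(x, y):
    """Assumed form of the distance: 1 - |r|."""
    return 1 - abs(pearson(x, y))


def euclidean(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


for g1, g2 in [("A", "B"), ("A", "C"), ("A", "D")]:
    x, y = log_ratios[g1], log_ratios[g2]
    print(g1, g2, round(pearson(x, y), 3),
          round(correlation_distance(x, y), 3), round(euclidean(x, y), 3))

print("I vs II:", round(euclidean(samples["I"], samples["II"]), 3))
```

With these placeholder values, genes A and C are perfectly anticorrelated, so their correlation-based distance is 0 while their Euclidean distance is large. That contrast is the kind of difference part 7 asks about.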

Problems 2-4 require that BioConductor be set up by the instructor or a computer support person. The 95-gene set used for many examples in the chapter may be downloaded from the book Web site, and sample scripts are shown in the chapter (Tables 13.5 and 13.8).

2. Use the marray package in BioConductor to create an M/A plot with the maPlot function for the small 95-gene data set. Do so first without background adjustment and then with background adjustment. What impact does background adjustment have on the variation? Is the impact constant with the level of expression?
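
As a reminder of what an M/A plot displays, the sketch below computes M (the log ratio) and A (the average log intensity) for a single two-color spot, with and without background subtraction. It is a plain Python illustration, not the marray implementation; the function name `ma_values` and the intensities are hypothetical.

```python
import math


def ma_values(red, green, red_bg=0.0, green_bg=0.0):
    """Return (M, A) for one spot, or None if a corrected intensity is not positive.

    M = log2(R / G) and A = 0.5 * log2(R * G), where R and G are the
    foreground intensities minus the (optional) background intensities.
    """
    r = red - red_bg
    g = green - green_bg
    if r <= 0 or g <= 0:
        return None  # the logarithm is undefined for this spot
    return math.log2(r / g), 0.5 * math.log2(r * g)


# Hypothetical spots with the same raw ratio but different brightness.
dim_raw = ma_values(240, 200)
dim_adj = ma_values(240, 200, red_bg=100, green_bg=100)
bright_raw = ma_values(12000, 10000)
bright_adj = ma_values(12000, 10000, red_bg=100, green_bg=100)

for label, value in [("dim raw", dim_raw), ("dim adjusted", dim_adj),
                     ("bright raw", bright_raw), ("bright adjusted", bright_adj)]:
    print(label, [round(v, 3) for v in value])
```

Comparing how far M moves for the dim spot and for the bright spot gives a starting point for the last two questions of the problem.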

3. Use the LIMMA package in BioConductor to identify a list of genes that are differentially expressed by a significant amount. Is this list different from the one obtained using only fold change as a measure of difference? If so, explain why.
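
To see why a significance-based list can differ from a fold-change list, the sketch below ranks two genes by mean log ratio and by an ordinary one-sample t-statistic. This is a stand-in only: LIMMA computes a moderated t-statistic with empirical Bayes shrinkage of the variances, which is not reproduced here, and the replicate values and gene names are hypothetical.

```python
import math
import statistics

# Hypothetical replicate log2 ratios for two genes (four arrays each).
replicates = {
    "gene_x": [2.0, -0.5, 2.5, 0.0],   # large average change, very noisy
    "gene_y": [0.6, 0.5, 0.7, 0.6],    # smaller average change, very consistent
}


def mean_log_ratio(values):
    """Average log2 fold change across replicates."""
    return statistics.mean(values)


def t_statistic(values):
    """Ordinary one-sample t-statistic against a true mean of zero."""
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return statistics.mean(values) / standard_error


for gene, values in replicates.items():
    print(gene, round(mean_log_ratio(values), 3), round(t_statistic(values), 3))
```

Ranking by fold change alone puts gene_x first, while ranking by the t-statistic puts gene_y first, because the statistic also accounts for the variation among replicates.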

4. Create three distance matrices using the dist function in R for the 95-gene data set. The first should be Euclidean distance, the second 1 minus the Pearson correlation coefficient, and the third 1 minus the absolute value of the Pearson correlation coefficient. Next, create a dendrogram using average linkage clustering for each distance measure.
  1. Are the dendrograms different and, if so, why are they different?
  2. Would a dendrogram created using single linkage clustering instead of average linkage clustering be different? Why would the dendrogram be different?
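
The difference between the two linkage rules can be illustrated with a minimal agglomerative clustering sketch in Python. It is not the R implementation the problem asks for (there, hclust accepts method = "average" or method = "single"); the function `cluster` and the four one-dimensional points are hypothetical and chosen only so that the two rules disagree.

```python
import math
from itertools import combinations


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cluster(points, linkage="average"):
    """Agglomerative clustering of named points.

    Returns the merges in order as (cluster_1, cluster_2, distance) tuples.
    "single" uses the smallest pairwise distance between two clusters;
    "average" uses the mean of all pairwise distances between them.
    """
    def between(c1, c2):
        dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
        return min(dists) if linkage == "single" else sum(dists) / len(dists)

    clusters = [(name,) for name in points]
    merges = []
    while len(clusters) > 1:
        # Join the two clusters that are closest under the chosen linkage.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: between(*pair))
        merges.append((c1, c2, between(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical one-dimensional profiles.
points = {"s1": (0.0,), "s2": (1.0,), "s3": (2.2,), "s4": (3.7,)}

single = cluster(points, linkage="single")
average = cluster(points, linkage="average")
for name, merges in [("single", single), ("average", average)]:
    print(name, [(a, b, round(d, 2)) for a, b, d in merges])
```

Both rules join s1 and s2 first. Single linkage then attaches s3 to that pair, because s3 is close to its nearest member, whereas average linkage joins s3 with s4, so the two dendrograms have different shapes.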


© 2004 by Cold Spring Harbor Laboratory Press. All rights reserved.
No part of these pages, either text or image, may be used for any purpose other than personal use. Therefore, reproduction, modification, storage in a retrieval system, or retransmission, in any form or by any means, electronic, mechanical, or otherwise, for reasons other than personal use, is strictly prohibited without prior written permission.

