index_01crop.gif index_02crop.gif
index_456.jpg
spacer.gif


Using Our Discipline to Enhance Human Welfare
August 7 - 11, 2005

Lab 6 - Clustering and Classification

Jean Yang. August 6-7, 2005.

1. Software required for this lab.

You will need R 2.1.0 or 2.1.1 (http://www.R-Project.org) for this lab. This section lists all of the R packages you need to have installed and also lists some additional R packages which are recommended.

1.1 Required R packages

It is highly desirable to have these R packages in a directory in which you have write permission. You can use .libPaths("C:/Custom/R/library/directory") or .libPaths("C:\\Custom\\R\\library\\directory") before you run install.packages() (or equivalent) to install the package(s) in a customized directory location.  Packages required to run vExplorer  are Biobase, reposTools, DynDoc, tkWidgets, and widgetTools.  Other packages are shown in table below:

Package URL
cluster http://cran.at.r-project.org/src/contrib/Descriptions/cluster.html
ipred http://cran.at.r-project.org/src/contrib/Descriptions/ipred.html
som http://cran.at.r-project.org/src/contrib/Descriptions/som.html
rpart http://cran.at.r-project.org/src/contrib/Descriptions/rpart.html
sma http://cran.at.r-project.org/src/contrib/Descriptions/sma.html

1.2 Required data and introduction

The purpose of this illustration is two-fold. The first is to perform some basic clustering and examine the effect of different cluster algorithms. The second aim is to build different classification procedures for tumor samples using gene expression data and to assess the accuracy of these classification procedures. We will use the dataset presented in van't Veer et al.  (2002) which is available at http://www.rii.com/publications/2002/vantveer.htm. These data come from a study of gene expression in 91 breast cancer node-negative tumors. The training samples consisted of 78 tumors, 44 of which did not recur within 5 years of diagnosis and 34 did. Among the test samples, 7 have not recurred within 5 years and 12 did. The data were collected on two color Agilant oligo arrays containing about 24K genes.

You will work with data which has been filtered using procedures described in the original manuscript. There, only genes that show 2-fold differential expression and p-value for a gene being expressed < 0.01 in more than 5 samples are retained. There are 4,348 such genes. Additionally the missing values were imputed using k-nearest neighbors imputation procedure (Troyanskaya, et al, 2001). There, for each gene containing at leats one missing value we find 5 genes most highly correlated with it and take average of their value for the sample in which a value for a given gene is missing. The missing value is replaced with the average.

The data provided here has been filtered using procedures described in the original manuscript which resulted in 4,348 genes. The missing values were imputed using k-nearest neighbors imputation procedure (Troyanskaya, et al, 2001). The R data file “JSMLab6.RData” contains gene expression levels and gene names and various functions required for this lab. The filtered gene expression levels  are stored in a 4348 × 97 matrix named bcdata with rows corresponding to genes and columns to mRNA samples. Additionally, the labels are contained in the 97-element vector surv.resp with 0 indicating good outcome (no recurrence within 5 years after diagnosis) and 1 indicating bad outcome (recurrence within 5 years after diagnosis).

The data should be downloaded from http://bioinf.wehi.edu.au/marray/jsm2005/lab6/JSMLab6.RData before the lab.

To load the .RData file, use

 attach("JSMLab6.RData")

To easily access training and test datasets later, initialize the relevant indices:

 index.train <- 1:78
 index.test <- 79:97
 trainData <- bcdata[,index.train]
 testData <- bcdata[,index.test]

2. Clustering

2.1 Hierarchical clustering

Cluster the mRNA samples using the hclust function and vary the number of genes. Do the different survival outcome samples cluster together? Try different between clusters dissimilarity measures by modifying the “method” argument in hclust. Also try different dissimilarities metric by modifying the “method” argument in dist or come up with your own.  Below are some questions to help you get started.

Q1. Perform a hierarchical clustering on the mRNA samples using correlation as similarity function and complete linkage agglomeration.

 library(stats)
 clust.cor <- hclust(as.dist(1 - cor(bcdata)), method = "complete")
 plot(clust.cor, cex = 0.6)

Q2. Perform a hierarchical clustering on the mRNA samples using Euclidean distance and average linkage agglomeration. 

 clust.euclid <- hclust(dist(t(bcdata)), method = "average")
 plot(clust.euclid, cex = 0.6)

Q3. You might also like to create a cluster image using the function heatmap in the package stats. We will create a clustering image that is commonly seen in many microarray literature.  By default, this function performs hierarchical clustering on both genes and samples and the function will slow down considerably if the number of genes are too large. For illustration purposes, we have selected 100 variable genes.

 library(sma)
 bcdatavar <- apply(bcdata, 1, var, na.rm = TRUE)
 top100 <- stat.gnames(bcdatavar, 1:length(bcdatavar), crit = 100)$gnames
 heatmap(bcdata[top100, ])

2.2 Partition methods

Next, we will examine the different partition methods such as “k-means”, and “PAM”. 

Q4. Perform k-means clustering on the mRNA samples using correlation as the similarity function.  

 clust.kmeans <- kmeans(as.dist(1-cor(bcdata)),2)
 names(clust.kmeans$cluster) <- colnames(bcdata)
 clust.kmeans$cluster[1:10]

Q5. Perform “Partition Around Medoids” clustering using the function PAM in the cluster library.

 library(cluster)
 clust.pam <- pam(as.dist(1-cor(bcdata)), 2, diss =TRUE)
 clusplot(clust.pam, labels = 2, col.p = clust.pam$clustering)

Q6. Use ``Self Organizing Maps'' (SOM) methods from the som library with a 2 by 2 grid size.

Q7. Finally, we can also select the top 100 genes based on variance and perform the various clustering methods described above. Notice that we did not use any information associated with the samples during the gene selection process here. The function stat.gnames sorts genes according to the value of a statistic and in this case the statistics of interest is “variance”.

 golubSub <- bcdata[top100, ]
 par(mfrow = c(2, 2))
 plot(hclust(as.dist(1 - cor(bcdata)), method = "complete"), cex = 0.6)
 plot(hclust(dist(t(bcdata)), method = "average"), cex = 0.6)
 clust.pam <- pam(as.dist(1 - cor(bcdata)), 2, diss = TRUE)
 clusplot(clust.pam, labels = 2, col.p = clust.pam$clustering)
 par(mfrow = c(1, 1))

3. Discrimination

3.1. Feature selection

To develop a sense for why feature selection is a useful tool for classification problems, visualize the strength of correlation between samples belonging to the same outcome group with and without feature selection. A useful function for visualizing the strength of correlations is plot.cor function. Below are some questions to help you get started. Note that for this task we will be using training samples only.

Q8.   It is reasonable to expect that more variable genes carry more signal than the ones that do not vary across samples. Identify 100 most variable genes using robust measure of variability, MAD (median absolute deviation). Additionally, the labels are not used in this ordering, i.e. results will not be biased towards class discrimination.

 library(sma)
 rosetta.mad <- apply(bcdata[, index.train], 1, mad, na.rm = TRUE)
 top100.mad <- stat.gnames(rosetta.mad, 1:length(rosetta.mad), crit = 100)$gnames

Q9.   We can also investigate how filitering based on the statistic of gene association with an outcome will strengthen apparent relationship within groups. Identify the top 100 genes with the highest t-statistic with pooled variance.

 rosetta.tpool <- apply(bcdata[, index.train], 1, t.func,
                          surv.resp[index.train], var.equal = TRUE)
 top100.tpool <- stat.gnames(abs(rosetta.tpool),1:length(rosetta.tpool), crit = 100)$gnames

Q10.  To strengthen visual interpretation, reorder samples within each group based on their hierarchical clustering using correlation as a measure of similarity and ward method.  You can try the function plotCorOrder provided with this lab to do this.   Examine the correlation matrix separately for each set of genes (all, 100 most variable, 100 with the highest association with an outcome). Then compute and visualize the correlation matrix using plot.cor.  Note how color intensity changes across the 3 images.

 labcols <- c(rep("green", length(which(surv.lab[index.train] == "good"))),
                rep("red", length(which(surv.lab[index.train] == "bad"))))

First, the correlation plots using all genes:

 dt <- plotCorOrder()
 plot.cor(cor.na(dt), new = F, nrgcols = 50,
          labels = colnames(trainData), labcols = labcols,
          title = "All genes", zlim = c(-1,1))

Secondly, using only the 100 most variable genes:

 dt <- plotCorOrder(top100.mad)
 plot.cor(cor.na(dt), new = F, nrgcols = 50,
          labels = colnames(trainData), labcols = labcols,
          title = "100 most variable genes", zlim = c(-1, 1))

Thirdly, using only the 100 genes with the highest t-statistics:

 dt <- plotCorOrder(top100.tpool)
 plot.cor(cor.na(dt), new = F, nrgcols = 50, labels =
          colnames(trainData), labcols = labcols,
          title = "Top 100 genes with the highest association",  zlim = c(-1, 1))

In the next few sub-sections of this lab, we are interested in building classification rules to discriminate between the patients who are going to recur and those who are not. For ease of processing we transposed the learning and test sets (so samples are now in rows and genes are in columns).  Thus,LS and TS are training and test set data matrices (100 × 78 and 100 × 19 respectively) and LS.resp and TS.resp are the binary response vectors of length 78 and 19 respectively.  The matrices LS and TS were obtained with the following commands:

 ## LS <- t(bcdata[top100.tpool, index.train])
 ## TS <- t(bcdata[top100.tpool, index.test])
 ## LS.resp <- surv.resp[index.train]
 ## TS.resp <- surv.resp[index.test]

3.2. KNN

Q11.  Let's begin by examining the K Nearest Neighbor (KNN) classifier. We will use the function knn from the class package; use ?knn to see what options are available for this function. We want to see what happens with the different values of k. Is there any advantage to using one k over another based on these results? Note that with small number of samples, it is unwise to go beyond 1 or 3 neighbors even if classification rate seems to be lower.

library(class)
 for (k in 4:7) {
   disc.knn <- as.vector(knn(LS, TS, cl = LS.resp, k))
   print(paste("K = ", k))
   print(table(disc.knn, TS.resp)); cat("\n\n")
 }

Q12. If you are interested in producing similar pictures as shown in the lecture, you can try the functions geneTS and plot.class2 provided with the lab. At the moment, this works only for bivariate data. A simple way is to choose two genes that are most correlated with the outcome (i.e. the first two rows in our data matrix as it was constructed); alternatively, one can use the first and second principal components.

 plotLS <- LS[, 1:2]
 plotTS <- geneTS(plotLS)
 res.knn <- knn(plotLS, plotTS, cl = LS.resp, k = 2)
 plot.class2(plotLS, plotTS, as.vector(res.knn), cl = LS.resp)

3.3. Maximum likelihood discriminant rules

Q13. Use stat.diag.da from the sma package to build DLDA and DQDA classification procedures for tumor samples.

 disc.dqda <- stat.diag.da(LS, TS, cl = LS.resp, pool = 0)$pred
 table(disc.dqda, TS.resp)

 disc.dlda <- stat.diag.da(LS, TS, cl = LS.resp, pool = 1)$pred
 table(disc.dlda, TS.resp)

Use the function plot.class2 to visualize how the classifiers partition the space, type the following commands:

 res.dlda <- stat.diag.da(plotLS, plotTS, cl=LS.resp, pool=1)$pred
 plot.class2(plotLS, plotTS, res.dlda, cl=LS.resp)

3.4 Logistic regression

Q14. Use glm with ”binomial” family option to build a logistic model. Verify that the logistic model produces similar partition to DLDA model. Note that logistic regression outputs probabilities which need to be converted into the class.

 fit.logit <- glm(LS.resp ~., data = as.data.frame(LS),
                     family = "binomial")
 pred.logit <- predict(fit.logit, newdata = as.data.frame(TS),
                     type = "response")

 ## converting output probabilities into classes
 pred.logit.type <- rep(0,length(pred.logit))
 pred.logit.type[pred.logit>= 0.5] <- 1
 pred.logit.type[pred.logit< 0.5] <- 0
 table(pred.logit.type, TS.resp)

3.5. Trees

Q15. Use the function rpart from the rpart package to build a classification tree. Compare the results from different classification rules.

 library(rpart)
 disc.rpart <- dorpart(LS, TS, cl = as.numeric(LS.resp))
 table(disc.rpart, TS.resp)

Q16. Use the function bagging from the ipred package to build an aggregate classification tree.

 library(ipred)
 fit.bagging <- bagging(as.factor(LS.resp) ~ .,
                           data = as.data.frame(LS), nbagg = 100, coob = TRUE)
 pred.bagging <- predict(fit.bagging, newdata = as.data.frame(TS))
 table(pred.bagging, TS.resp)

3.6. Error rate

Q17. Let us reconsider the K-nearest neighbor classifier.  The following code demonstrates how to calculate the re-substitution error rate for a nearest neighbor using all the genes and k equal to 3.

 pred.LS <- as.vector(knn(LS, LS, cl = LS.resp, k = 3))
 table(pred.LS, LS.resp)
 sum(pred.LS != LS.resp)

To calculate the test set error rate for a nearest neighbour using all the genes and k equal to 3.

 pred.TS <- as.vector(knn(LS, TS, cl = LS.resp, k = 3))
 table(pred.TS, TS.resp)
 sum(pred.TS != TS.resp)

Let's consider another nearest neighbour classification rule using top 5 genes from top100.tpool and k equal to 2. Calculate and compare the resubstitution and testset error. Notice how in this case, the test set error is much higher than the re-substitution error. 

 LStop5 <- stat.gnames(top100.tpool, 1:length(top100.tpool), crit = 5)$gnames
 pred.LS <- as.vector(knn(LS[, LStop5], LS[, LStop5], cl = LS.resp, k = 2))
 table(pred.LS, LS.resp)
 sum(pred.LS != LS.resp)

 pred.TS <- as.vector(knn(LS[, LStop5], TS[, LStop5], cl = LS.resp, k = 2))
 table(pred.TS, TS.resp)
 sum(pred.TS != TS.resp)

Q18.  Lastly, we will provide some code to build a nearest neighbor classifier with the parameter k selected by cross-validation. This is an example function built on the function knn.cv.

 knn.cvk <- function(LS, cl, k = 1:5, ...) {
   cv.err <- cl.pred <- c()
   for (i in k) {
     newpre <- as.vector(knn.cv(LS, cl, i))
     cl.pred <- cbind(cl.pred, newpre)
     cv.err <- c(cv.err, sum(cl != newpre))
     }
   k0 <- k[which.min(cv.err)]
   return(list(k = k0, pred = cl.pred[,which.min(cv.err)]))
 }


 knn.cvk(LS, cl = LS.resp, k = 3:7)

Finally, the function knn.var from the knnTree package provides a starting point to build a nearest neighbour classifier with the number of features selected by cross-validation.

4. Acknowledgements

This lab was modified from previous labs developed by Jane Fridlyand and Sandrine Dudoit.

5. References

  1. Golub et. al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.
  2. van 't Veer et. al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Vol. 415, pp. 531-537.
  3. Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, Vol. 97, No. 457, p. 77-87.

Valid HTML 4.0!