Lab 4 - Clustering and Classification

Jean Yang. July 1, 2004.

Contents:

  1. Software and data required for this lab
    1. Required R packages
    2. Required data and introduction
  2. Clustering
    1. Hierarchical clustering
    2. Partition methods
  3. Discrimination
    1. KNN
    2. Maximum likelihood discriminant rules
    3. Trees
    4. Error rate
  4. Optional: Additional data
  5. R and Bioconductor WWW resources 
  6. Acknowledgements
  7. References

1. Software required for this lab.

You will need R 1.9.0 or 1.9.1 (http://www.R-Project.org) for this lab. This section lists all of the R packages you need to have installed and also lists some additional R packages which are recommended.

1.1 Required R packages

It is highly desirable to have these R packages in a directory in which you have write permission. You can use .libPaths("C:/Custom/R/library/directory") or .libPaths("C:\\Custom\\R\\library\\directory") before you run install.packages() (or equivalent) to install the package(s) in a customized directory location.  Packages required to run vExplorer  are Biobase, reposTools, DynDoc, tkWidgets, and widgetTools.  Other packages are shown in table below:

PackageURL
cluster http://cran.at.r-project.org/src/contrib/Descriptions/cluster.html
ipred http://cran.at.r-project.org/src/contrib/Descriptions/ipred.html
som http://cran.at.r-project.org/src/contrib/Descriptions/som.html
rpart http://cran.at.r-project.org/src/contrib/Descriptions/rpart.html
smahttp://cran.at.r-project.org/src/contrib/Descriptions/sma.html
IBCLab4 http://bioinf.wehi.edu.au/marray/ibc2004/Rpackages/IBCLab4.html

1.2 Required data and introduction

The purpose of this illustration is two-fold. The first is to perform some basic clustering and examine the effect of different cluster algorithms. The second aim is to build different classification procedures for tumor samples using gene expression data and to assess the accuracy of these classification procedures. We will use the data set presented in Golub et al. (1999) and available at http://www.genome.wi.mit.edu/MPR. This data comes from a study of gene expression in three types of acute leukemias: B-cell acute lymphoblastic leukemia (B-ALL), T-cell acute lymphoblastic leukemia (T-ALL) and acute myeloid leukemia (AML). Gene expression levels were measured for 38 B-cell ALL, 9 T-cell ALL and 25 AML tumor samples with Affymetrix high-density oligonucleotide arrays, using the hgu68a chip containing p = 6817 human genes.

You will work with data that has been pre-processed using procedures similar to the one described in Golub et al. (1999) and Dudoit et al. (2000). These steps are:

1        thresholding: floor of 100 and ceiling of 16,000;

2        filtering: exclusion of genes with  max / min ≤ 5 or (max - min) ≤ 500  where max and min refer respectively to the maximum and minimum intensities for a particular gene across the mRNA samples;

3        base 2 logarithmic transformation (Golub et al. uses log 10 logarithmic transformation).

The R data file GolubData.RData contains gene expression levels and gene names. The filtered gene expression levels are stored in a 3571 × 72 matrix named golub with rows corresponding to genes and columns to mRNA samples. You can proceed with this lab in two ways. For users who are familiar with R or S-plus language, you may prefer to work through this lab by writing your own code. For other users less familiar with R, you can use the vExplorer function from the tkWidgets package to step through this lab interactively. The vExplorer function provides a graphical interface for viewing and executing code chunks from the lab. To begin, start R and load the tutorial by typing the commands:

library(IBCLab4)
vExplorer()

Next, select the IBCLab4 package using the widget. A basic overview of R and Bioconductor WWW resources are provided at the end of this tutorial. In addition, you can find other Bioconductor tutorials and labs on this data set at http://www.bioconductor.org.

To load the data packages:

library(IBCLab4)
data(GolubData)

2. Clustering

2.1 Hierarchical clustering

Cluster the leukemia mRNA samples using the hclust function and vary the number of genes. Do the T-ALL, B-ALL and AML samples cluster together? Try different between clusters dissimilarity measures by modifying the “method” argument in hclust. Also try different dissimilarities metric by modifying the “method” argument in dist or come up with your own.  Below are some questions to help you get started.

Q1. Perform a hierarchical clustering on the mRNA samples using correlation as similarity function and complete linkage agglomeration.

library(stats)
clust.cor <- hclust(as.dist(1 - cor(golub)), method = "complete")
plot(clust.cor, cex = 0.6)

Q2. Perform a hierarchical clustering on the mRNA samples using Euclidean distance and average linkage agglomeration. 

clust.euclid <- hclust(dist(t(golub)), method = "average")
plot(clust.euclid, cex = 0.6)

Q3. You might also like to create a cluster image using the function heatmap in the package stats. We will create a clustering image that is commonly seen in many microarray literature.  By default, this function performs hierarchical clustering on both genes and samples and the function will slow down considerably if the number of genes are too large. For illustration purposes, we have selected 100 variable genes.

library(sma)
golubvar <- apply(golub, 1, var, na.rm = TRUE)
top100 <- stat.gnames(golubvar, 1:length(golubvar), crit = 100)$gnames
heatmap(golub[top100, ])

2.2 Partition methods

Next, we will examine the different partition methods such as “k-means”, and “PMA”. 

Q4. Perform k-means clustering on the mRNA samples using correlation as the similarity function.  

clust.kmeans <- kmeans(as.dist(1-cor(golub)),3)
names(clust.kmeans$cluster) <- colnames(golub)
clust.kmeans$cluster[1:10]

Q5. Perform “Partition Around Medoids” clustering using the function PAM in the cluster library.

library(cluster)
clust.pam <- pam(as.dist(1-cor(golub)), 3, diss =TRUE)
clusplot(clust.pam, labels = 3, col.p = clust.pam$clustering)

Q6. Use ``Self Organizing Maps'' (SOM) methods from the som library using a 2 by 2 grid size.

Q7. Finally, we can also select the top 100 genes based on variance and perform the various clustering methods described above. Notice that we did not use any information associated with the samples during the gene selection process here. The function stat.gnames sorts genes according to the value of a statistic and in this case the statistics of interest is “variance”.

golubSub <- golub[top100, ]
par(mfrow = c(2, 2))
plot(hclust(as.dist(1 - cor(golubSub)), method = "complete"), cex = 0.6)
plot(hclust(dist(t(golubSub)), method = "average"), cex = 0.6)
clust.pam <- pam(as.dist(1 - cor(golubSub)), 3, diss = TRUE)
clusplot(clust.pam, labels = 3, col.p = clust.pam$clustering)
par(mfrow = c(1, 1))

3. Discrimination

In the second part of this lab, we are interested in building classification rules to discriminate between the three tumor groups, B-ALL, T-ALL and AML. For ease of processing, the golub matrix was further separated into expression data for the learning and test set in matrix objects named LS and TS respectively. Different from the golub matrix, the matrices LS and TS are 38 × 3517 and 34 × 3517 respectively, with rows corresponding to tumor samples and columns corresponding to genes. The tumor class labels for the learning and test sets are given in vector objects named LS.resp and TS.resp respectively. 

3.1. KNN

Q8.  Let's begin by examining the K Nearest neighbor classifer. We will use the function knn from the class package; use ?knn to see what options are available for this function. We want to see what happens with the different values of k. Using a simple feature selection procedure, we have selected the top 100 genes based on the between to within group sum of squares (BWSS) using the function stat.bwss in the sma package.

library(class)
LS.bw <- stat.bwss(t(LS), as.integer(LS.resp))$bw
LStop100 <- stat.gnames(LS.bw, 1:length(LS.bw), crit = 100)$gnames
for (k in 4:6) {
  disc.knn <- as.vector(knn(LS[,LStop100],TS[,LStop100], cl = LS.resp, k))
  print(paste("K = ", k))
  print(table(disc.knn, TS.resp))
}

Q9. If you are interested in producing similar pictures as shown in the lecture, you can try the functions geneTS and plot.class2 provided with the lab. At the moment, this works only for bivariate data. A simple way is to choose two genes from the vector LStop100, alternatively, one can use the first and second principal components.

newLS <- LS[, LStop100[2:3]]
newTS <- geneTS(newLS)
res.knn <- knn(newLS, newTS, cl = as.numeric(LS.resp), k = 3)
plot.class2(newLS, newTS, res.knn, cl = LS.resp)

3.2. Maximum likelihood discriminant rules

Q10. Use stat.diag.da from sma packages to build DLDA and DQDA classification procedures for tumor samples.

disc.dqda <- stat.diag.da(LS[, LStop100], TS[, LStop100],cl =
  as.integer(LS.resp), pool = 0)[[1]]
table(disc.dqda, TS.resp)

disc.dlda <- stat.diag.da(LS[, LStop100], TS[, LStop100],cl =
  as.integer(LS.resp), pool = 1)[[1]]
table(disc.dlda, TS.resp)

To use the function plot.class2 to visualize how the classifiers partition the space, type the following commands:

res.dlda <- stat.diag.da(newLS, newTS, cl=as.integer(LS.resp), pool=1)$pred
plot.class2(newLS, newTS, res.dlda, cl=LS.resp)

3.3. Trees

Use the function rpart from the rpart package to build a classification tree. Compare the results from different classification rules.

library(rpart)
disc.rpart <- dorpart(LS = LS[, LStop100], TS = TS[,LStop100], cl =
  as.numeric(LS.resp))
table(disc.rpart, TS.resp)

Use the function ipredbagg.factor from the ipred package to build an aggregate classification tree.

library(ipred)
fit.bagtree <- ipredbagg.factor(y=as.factor(LS.resp), X=as.data.frame(LS[,LStop100]),
  nbagg=100, coob=TRUE)
pred.bagtree <- predict(fit.bagtree, newdata= as.data.frame(TS[,LStop100]))
table(pred.bagtree, TS.resp)

3.4. Error rate

Let us reconsider the K-nearest neighbor classifer.  The following code demonstrates how to calculate the resubstitution error rate for a nearest neighbor using all the genes and k equal to 3.

pred.LS <- as.vector(knn(LS, LS, cl = LS.resp, k = 3))
table(pred.LS, LS.resp)
sum(pred.LS != LS.resp)

To calculate the test set error rate for a nearest neighbor using all the genes and k equal to 3.

pred.TS <- as.vector(knn(LS, TS, cl = LS.resp, k = 3))
table(pred.TS, TS.resp)
sum(pred.TS != TS.resp)

Let's consider another nearest neighbour classification rule using top 5 BWSS genes and k equal to 2. Calculate and compare the resubstitution and testset error. Notice how in this case, the test set error is much higher than the resubsitution error. 

LStop5 <- stat.gnames(LS.bw, 1:length(LS.bw), crit = 5)$gnames
pred.LS <- as.vector(knn(LS[, LStop5], LS[, LStop5], cl = LS.resp, k = 2))
table(pred.LS, LS.resp)
sum(pred.LS != LS.resp)

pred.TS <- as.vector(knn(LS[, LStop5], TS[, LStop5], cl = LS.resp, k = 2))
table(pred.TS, TS.resp)
sum(pred.TS != TS.resp)

Lastly, we will provide some code to build a nearest neighbor classifier with the parameter k selected by cross-validation. This is an example function build on the function knn.cv.

knn.cvk <- function(LS, cl, k = 1:5, ...) {
  cv.err <- cl.pred <- c()
  for (i in k) {
    newpre <- as.vector(knn.cv(LS, cl, i))
    cl.pred <- cbind(cl.pred, newpre)
    cv.err <- c(cv.err, sum(cl != newpre))
  }
  k0 <- k[which.min(cv.err)]
  return(list(k = k0, pred = cl.pred[,which.min(cv.err)]))
}
knn.cvk(LS, cl = LS.resp, k = 3:7)

Finally, the function knn.var from the knnTree package provide a starting point to build a nearest neighbour classifier with the number of features selected by cross-validation.

4. Optional: Additional data                                

We have also included the data set presented in van't Veer et al. (2002) which is available at http://www.rii.com/publications/2002/vantveer.htm. This data comes from a study of gene expression in 91 breast cancer node-negative tumors. The training samples consisted of 78 tumors, 44 of which did not recur within 5 years of diagnosis and 34 did. Among the test samples, 7 have not recurred within 5 years and 12 did. The data were collected on two color Agilent oligo arrays containing about 24K genes.

The data provided here has been filtered using procedures described in the original manuscript which resulted in 4,348 genes. The missing values were imputed using k-nearest neighbors imputation procedure (Troyanskaya, et al, 2001). The R data file “RosettaBreastData.RData” contains gene expression levels and gene names. The filtered gene expression levels  are stored in a 4348 × 97 matrix named rosetta.data.imp with rows corresponding to genes and columns to mRNA samples. Additionally, the labels are contained in the 97-element vector surv.resp with 0 indicating good outcome (no recurrence within 5 years after diagnosis) and 1 indicating bad outcome (recurrence within 5 years after diagnosis).

data(RosettaBreastData)
rindex.train <- 1:78
rindex.test <- 79:97

5. R and Bioconductor WWW resources                                 

For software and documentation consult:

·         The main R project website: http://www.R-project.org.

·         The Comprehensive R Archive Network (CRAN): http://cran.r-project.org. This site contains the base system and contributed packages for different platforms (Linux, MacOS, Windows), manuals, tutorials and news-letters.  For Windows, there is an installer for the main R system and  contributed packages from CRAN can be installed using the “Install package from CRAN ...” item on the Packages menu.  For other packages, you can use the “Install package from local zip file ...” item.

·         The Bioconductor website: http://www.bioconductor.org.  This site contains software packages, vignettes, data sets, short course materials. To install Bioconductor packages for windows, use the “Install package(s) from Bioconductor ...” item on the Packages menu.

6. Acknowledgements

This lab was modified from previous labs developed by Sandrine Dudoit and Jane Fridlyand.


7. References

  1. Golub et. al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.
  2. van 't Veeret. al. (2002) Gene expression profiling predicts clinical outcome of breast cancer.. Nature, Vol. 415, pp. 531-537.
  3. Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, Vol. 97, No. 457, p. 77-87.

Valid HTML 4.0!