You will need R 1.9.0 or 1.9.1 (http://www.R-Project.org) for this lab. This section lists all of the R packages you need to have installed and also lists some additional R packages which are recommended.
It is highly desirable to have these R packages in a directory in which you have write permission. You can use
.libPaths("C:/Custom/R/library/directory") or .libPaths("C:\\Custom\\R\\library\\directory")
before you run install.packages() (or equivalent) to install the package(s) in a customized directory location.
Packages required to run vExplorer are Biobase, reposTools, DynDoc,
tkWidgets, and widgetTools. Other packages are shown in table below:
The purpose of this illustration is two-fold. The first is to perform some basic clustering and examine the effect of different cluster algorithms. The second aim is to build different classification procedures for tumor samples using gene expression data and to assess the accuracy of these classification procedures. We will use the data set presented in Golub et al. (1999) and available at http://www.genome.wi.mit.edu/MPR. This data comes from a study of gene expression in three types of acute leukemias: B-cell acute lymphoblastic leukemia (B-ALL), T-cell acute lymphoblastic leukemia (T-ALL) and acute myeloid leukemia (AML). Gene expression levels were measured for 38 B-cell ALL, 9 T-cell ALL and 25 AML tumor samples with Affymetrix high-density oligonucleotide arrays, using the hgu68a chip containing p = 6817 human genes.
You will work with data that has been pre-processed using procedures similar to the one described in Golub et al. (1999) and Dudoit et al. (2000). These steps are:
1 thresholding: floor of 100 and ceiling of 16,000;
2
filtering: exclusion of genes with max / min ≤ 5 or (max - min) ≤ 500 where max
and min
refer respectively to the maximum and minimum
intensities for a particular gene across the mRNA samples;
3 base 2 logarithmic transformation (Golub et al. uses log 10 logarithmic transformation).
The R data file
GolubData.RData
contains gene expression levels and gene names. The filtered gene
expression levels are stored in a 3571
× 72 matrix named golub
with rows corresponding to genes and columns to mRNA samples. You can proceed
with this lab in two ways. For users who are familiar with
R or
S-plus
language, you may prefer to work through this lab by writing your own
code. For other users less familiar with R,
you can use the vExplorer function
from the tkWidgets
package to step through this lab interactively. The
vExplorer
function provides a graphical interface for viewing and executing code
chunks from the lab. To begin, start R and load the tutorial by typing the
commands:
library(IBCLab4) vExplorer()
Next, select the IBCLab4 package using the widget. A basic overview of R and Bioconductor WWW resources are provided at the end of this tutorial. In addition, you can find other Bioconductor tutorials and labs on this data set at http://www.bioconductor.org.
To load the data packages:
library(IBCLab4) data(GolubData)
Cluster the leukemia mRNA samples using the hclust function and
vary the number of genes. Do the T-ALL, B-ALL and AML samples cluster together?
Try different between clusters dissimilarity measures by modifying the “method”
argument in hclust. Also
try different dissimilarities metric by
modifying the “method” argument in dist
or come up with your own. Below are
some questions to help you get started.
Q1. Perform a hierarchical clustering on the mRNA samples using correlation as similarity function and complete linkage agglomeration.
library(stats) clust.cor <- hclust(as.dist(1 - cor(golub)), method = "complete") plot(clust.cor, cex = 0.6)
Q2. Perform a hierarchical clustering on the mRNA samples using Euclidean distance and average linkage agglomeration.
clust.euclid <- hclust(dist(t(golub)), method = "average") plot(clust.euclid, cex = 0.6)
Q3. You might also like to create a cluster image using
the function heatmap in the
package stats. We will create a
clustering image that is commonly seen in many microarray literature. By
default, this function performs hierarchical clustering on both genes and
samples and the function will slow down considerably if the number of genes are
too large. For illustration purposes, we have selected 100 variable genes.
library(sma) golubvar <- apply(golub, 1, var, na.rm = TRUE) top100 <- stat.gnames(golubvar, 1:length(golubvar), crit = 100)$gnames heatmap(golub[top100, ])
Next, we will examine the different partition methods such as “k-means”, and “PMA”.
Q4. Perform k-means clustering on the mRNA samples using correlation as the similarity function.
clust.kmeans <- kmeans(as.dist(1-cor(golub)),3) names(clust.kmeans$cluster) <- colnames(golub) clust.kmeans$cluster[1:10]
Q5. Perform “Partition Around Medoids” clustering using
the function PAM in the
cluster library.
library(cluster) clust.pam <- pam(as.dist(1-cor(golub)), 3, diss =TRUE) clusplot(clust.pam, labels = 3, col.p = clust.pam$clustering)
Q6. Use ``Self Organizing Maps'' (SOM) methods from the som library using a 2 by 2 grid size.
Q7. Finally, we can also select the top 100 genes based on variance
and perform the various clustering methods described above. Notice that we did
not use any information associated with the samples during the gene selection
process here. The function stat.gnames
sorts genes according to the value of a statistic and in this case the
statistics of interest is “variance”.
golubSub <- golub[top100, ] par(mfrow = c(2, 2)) plot(hclust(as.dist(1 - cor(golubSub)), method = "complete"), cex = 0.6) plot(hclust(dist(t(golubSub)), method = "average"), cex = 0.6) clust.pam <- pam(as.dist(1 - cor(golubSub)), 3, diss = TRUE) clusplot(clust.pam, labels = 3, col.p = clust.pam$clustering) par(mfrow = c(1, 1))
In the second part of this lab, we are interested in building
classification rules to discriminate between the three tumor groups, B-ALL,
T-ALL and AML. For ease of processing, the
golub matrix was further separated
into expression data for the learning and test set in matrix objects named
LS and
TS respectively. Different from
the golub matrix, the matrices
LS and
TS are
38 × 3517
and
34 × 3517
respectively, with rows corresponding to tumor samples and
columns corresponding to genes. The tumor class labels for the learning and test
sets are given in vector objects named
LS.resp and TS.resp
respectively.
Q8. Let's begin by examining the K Nearest neighbor
classifer. We will use the function knn
from the class package; use
?knn to see what options are
available for this function. We want to see what happens with the different
values of k. Using a simple feature selection procedure, we have selected the
top 100 genes based on the between to within group sum of squares (BWSS) using
the function stat.bwss in the
sma package.
library(class)
LS.bw <- stat.bwss(t(LS), as.integer(LS.resp))$bw
LStop100 <- stat.gnames(LS.bw, 1:length(LS.bw), crit = 100)$gnames
for (k in 4:6) {
disc.knn <- as.vector(knn(LS[,LStop100],TS[,LStop100], cl = LS.resp, k))
print(paste("K = ", k))
print(table(disc.knn, TS.resp))
}
Q9. If you are interested in producing similar pictures
as shown in the lecture, you can try the functions
geneTS and
plot.class2 provided with the lab.
At the moment, this works only for bivariate data. A simple way is to choose two
genes from the vector LStop100,
alternatively, one can use the first and second principal components.
newLS <- LS[, LStop100[2:3]] newTS <- geneTS(newLS) res.knn <- knn(newLS, newTS, cl = as.numeric(LS.resp), k = 3) plot.class2(newLS, newTS, res.knn, cl = LS.resp)
Q10. Use
stat.diag.da from sma
packages to build DLDA and
DQDA classification procedures for
tumor samples.
disc.dqda <- stat.diag.da(LS[, LStop100], TS[, LStop100],cl = as.integer(LS.resp), pool = 0)[[1]] table(disc.dqda, TS.resp) disc.dlda <- stat.diag.da(LS[, LStop100], TS[, LStop100],cl = as.integer(LS.resp), pool = 1)[[1]] table(disc.dlda, TS.resp)
To use the function plot.class2 to visualize how the classifiers partition the space, type the following commands:
res.dlda <- stat.diag.da(newLS, newTS, cl=as.integer(LS.resp), pool=1)$pred plot.class2(newLS, newTS, res.dlda, cl=LS.resp)
Use the function rpart
from the rpart package to build a
classification tree. Compare the results from different classification rules.
library(rpart) disc.rpart <- dorpart(LS = LS[, LStop100], TS = TS[,LStop100], cl = as.numeric(LS.resp)) table(disc.rpart, TS.resp)
Use the function ipredbagg.factor from the ipred
package to build an aggregate classification tree.
library(ipred) fit.bagtree <- ipredbagg.factor(y=as.factor(LS.resp), X=as.data.frame(LS[,LStop100]), nbagg=100, coob=TRUE) pred.bagtree <- predict(fit.bagtree, newdata= as.data.frame(TS[,LStop100])) table(pred.bagtree, TS.resp)
Let us reconsider the K-nearest neighbor classifer. The following code demonstrates how to calculate the resubstitution error rate for a nearest neighbor using all the genes and k equal to 3.
pred.LS <- as.vector(knn(LS, LS, cl = LS.resp, k = 3)) table(pred.LS, LS.resp) sum(pred.LS != LS.resp)
To calculate the test set error rate for a nearest neighbor using all the genes and k equal to 3.
pred.TS <- as.vector(knn(LS, TS, cl = LS.resp, k = 3)) table(pred.TS, TS.resp) sum(pred.TS != TS.resp)
Let's consider another nearest neighbour classification rule using top 5 BWSS genes and k equal to 2. Calculate and compare the resubstitution and testset error. Notice how in this case, the test set error is much higher than the resubsitution error.
LStop5 <- stat.gnames(LS.bw, 1:length(LS.bw), crit = 5)$gnames pred.LS <- as.vector(knn(LS[, LStop5], LS[, LStop5], cl = LS.resp, k = 2)) table(pred.LS, LS.resp) sum(pred.LS != LS.resp) pred.TS <- as.vector(knn(LS[, LStop5], TS[, LStop5], cl = LS.resp, k = 2)) table(pred.TS, TS.resp) sum(pred.TS != TS.resp)
Lastly, we will provide some code to build a nearest neighbor
classifier with the parameter k selected by cross-validation. This is an
example function build on the function
knn.cv.
knn.cvk <- function(LS, cl, k = 1:5, ...) {
cv.err <- cl.pred <- c()
for (i in k) {
newpre <- as.vector(knn.cv(LS, cl, i))
cl.pred <- cbind(cl.pred, newpre)
cv.err <- c(cv.err, sum(cl != newpre))
}
k0 <- k[which.min(cv.err)]
return(list(k = k0, pred = cl.pred[,which.min(cv.err)]))
}
knn.cvk(LS, cl = LS.resp, k = 3:7)
Finally, the function
knn.var from the knnTree
package provide a starting point to build a nearest neighbour classifier with
the number of features selected by cross-validation.
We have also included the data set presented in van't Veer et al. (2002)
which is available at
http://www.rii.com/publications/2002/vantveer.htm. This data comes from a
study of gene expression in 91 breast cancer node-negative tumors. The training
samples consisted of 78 tumors, 44 of which did not recur within 5 years of
diagnosis and 34 did. Among the test samples, 7 have not recurred within 5 years
and 12 did. The data were collected on two color Agilent oligo arrays containing
about 24K genes.
The data provided here has been filtered using procedures described in the
original manuscript which resulted in 4,348 genes. The missing values were
imputed using k-nearest neighbors imputation procedure (Troyanskaya, et al,
2001). The R data file “RosettaBreastData.RData” contains gene
expression levels and gene names. The filtered gene expression levels are
stored in a 4348 × 97 matrix named rosetta.data.imp with rows
corresponding to genes and columns to mRNA samples. Additionally, the labels are
contained in the 97-element vector surv.resp with 0 indicating good
outcome (no recurrence within 5 years after diagnosis) and 1 indicating bad
outcome (recurrence within 5 years after diagnosis).
data(RosettaBreastData) rindex.train <- 1:78 rindex.test <- 79:97
For software and documentation consult:
· The main R project website: http://www.R-project.org.
· The Comprehensive R Archive Network (CRAN): http://cran.r-project.org. This site contains the base system and contributed packages for different platforms (Linux, MacOS, Windows), manuals, tutorials and news-letters. For Windows, there is an installer for the main R system and contributed packages from CRAN can be installed using the “Install package from CRAN ...” item on the Packages menu. For other packages, you can use the “Install package from local zip file ...” item.
· The Bioconductor website: http://www.bioconductor.org. This site contains software packages, vignettes, data sets, short course materials. To install Bioconductor packages for windows, use the “Install package(s) from Bioconductor ...” item on the Packages menu.
This lab was modified from previous labs developed by Sandrine Dudoit and Jane Fridlyand.