You will need R 2.1.0 or 2.1.1 (http://www.R-Project.org) for this lab. This section lists all of the R packages you need to have installed and also lists some additional R packages which are recommended.
It is highly desirable to have these R packages in a directory in
which you have write permission. You can use
.libPaths("C:/Custom/R/library/directory") or .libPaths("C:\\Custom\\R\\library\\directory")
before you run install.packages() (or equivalent) to
install the package(s) in a customized directory location.
Packages required to run vExplorer are Biobase,
reposTools, DynDoc,
tkWidgets, and widgetTools. Other packages are shown in table
below:
The purpose of this illustration is two-fold. The first
is to
perform some basic clustering and examine the effect of different
cluster
algorithms. The second aim is to build different classification
procedures for
tumor samples using gene expression data and to assess the accuracy of
these
classification procedures. We will use the dataset presented in van't
Veer et al. (2002)
which is available at http://www.rii.com/publications/2002/vantveer.htm.
These data come from a study of gene expression in 91 breast cancer
node-negative tumors. The training samples consisted of 78 tumors, 44
of which did not recur within 5 years of diagnosis and 34 did. Among
the test samples, 7 have not recurred within 5 years and 12 did. The
data were collected on two color Agilant oligo arrays containing about
24K genes.
You will work with data which has been filtered using procedures described in the original manuscript. There, only genes that show 2-fold differential expression and p-value for a gene being expressed < 0.01 in more than 5 samples are retained. There are 4,348 such genes. Additionally the missing values were imputed using k-nearest neighbors imputation procedure (Troyanskaya, et al, 2001). There, for each gene containing at leats one missing value we find 5 genes most highly correlated with it and take average of their value for the sample in which a value for a given gene is missing. The missing value is replaced with the average.
The data provided here has been filtered using procedures described in
the
original manuscript which resulted in 4,348 genes. The missing values
were
imputed using k-nearest neighbors imputation procedure (Troyanskaya, et
al,
2001). The R data file “JSMLab6.RData”
contains gene
expression levels and gene names and various functions required for
this lab. The filtered gene expression
levels are
stored in a 4348 × 97 matrix named bcdata with
rows
corresponding to genes and columns to mRNA samples. Additionally, the
labels are
contained in the 97-element vector surv.resp with 0 indicating
good
outcome (no recurrence within 5 years after diagnosis) and 1 indicating
bad
outcome (recurrence within 5 years after diagnosis).
The data should be downloaded from http://bioinf.wehi.edu.au/marray/jsm2005/lab6/JSMLab6.RData before the lab.
To load the .RData file, use
attach("JSMLab6.RData")
To easily access training and test datasets later, initialize the
relevant indices:
index.train <- 1:78 index.test <- 79:97 trainData <- bcdata[,index.train] testData <- bcdata[,index.test]
Cluster the mRNA samples using the hclust
function and
vary the number of genes. Do the different survival outcome samples
cluster
together?
Try different between clusters dissimilarity measures by modifying the “method”
argument in hclust. Also
try different dissimilarities metric by
modifying the “method” argument in dist
or come up with your own. Below are
some questions to help you get started.
Q1. Perform a hierarchical clustering on the mRNA samples using correlation as similarity function and complete linkage agglomeration.
library(stats) clust.cor <- hclust(as.dist(1 - cor(bcdata)), method = "complete") plot(clust.cor, cex = 0.6)
Q2. Perform a hierarchical clustering on the mRNA samples using Euclidean distance and average linkage agglomeration.
clust.euclid <- hclust(dist(t(bcdata)), method = "average") plot(clust.euclid, cex = 0.6)
Q3. You might also like to create a cluster
image using
the function heatmap in the
package stats.
We will create a
clustering image that is commonly seen in many microarray
literature. By
default, this function performs hierarchical clustering on both genes
and
samples and the function will slow down considerably if the number of
genes are
too large. For illustration purposes, we have selected 100 variable
genes.
library(sma) bcdatavar <- apply(bcdata, 1, var, na.rm = TRUE) top100 <- stat.gnames(bcdatavar, 1:length(bcdatavar), crit = 100)$gnames heatmap(bcdata[top100, ])
Next, we will examine the different partition methods such as “k-means”, and “PAM”.
Q4. Perform k-means clustering on the mRNA samples using correlation as the similarity function.
clust.kmeans <- kmeans(as.dist(1-cor(bcdata)),2) names(clust.kmeans$cluster) <- colnames(bcdata) clust.kmeans$cluster[1:10]
Q5. Perform “Partition Around Medoids”
clustering using
the function PAM in the
cluster library.
library(cluster) clust.pam <- pam(as.dist(1-cor(bcdata)), 2, diss =TRUE) clusplot(clust.pam, labels = 2, col.p = clust.pam$clustering)
Q6. Use ``Self Organizing Maps'' (SOM) methods from the som library with a 2 by 2 grid size.
Q7. Finally, we can also select the top 100 genes based on
variance and perform the various clustering methods described above.
Notice that we did not use any information associated with the samples
during the gene selection process here. The function stat.gnames
sorts genes according to the value of a statistic and in this case the
statistics of interest is “variance”.
golubSub <- bcdata[top100, ] par(mfrow = c(2, 2)) plot(hclust(as.dist(1 - cor(bcdata)), method = "complete"), cex = 0.6) plot(hclust(dist(t(bcdata)), method = "average"), cex = 0.6) clust.pam <- pam(as.dist(1 - cor(bcdata)), 2, diss = TRUE) clusplot(clust.pam, labels = 2, col.p = clust.pam$clustering) par(mfrow = c(1, 1))
To develop a sense for why feature selection is a useful
tool for classification problems, visualize the strength of correlation
between samples belonging to the same outcome group with and without
feature selection. A useful function for visualizing the strength of
correlations is plot.cor
function. Below are some questions to help you get
started. Note that for this task we will be using training samples only.
Q8. It is reasonable to expect that more variable genes carry more signal than the ones that do not vary across samples. Identify 100 most variable genes using robust measure of variability, MAD (median absolute deviation). Additionally, the labels are not used in this ordering, i.e. results will not be biased towards class discrimination.
library(sma) rosetta.mad <- apply(bcdata[, index.train], 1, mad, na.rm = TRUE) top100.mad <- stat.gnames(rosetta.mad, 1:length(rosetta.mad), crit = 100)$gnames
Q9. We can also investigate how filitering based on the statistic of gene association with an outcome will strengthen apparent relationship within groups. Identify the top 100 genes with the highest t-statistic with pooled variance.
rosetta.tpool <- apply(bcdata[, index.train], 1, t.func,
surv.resp[index.train], var.equal = TRUE)
top100.tpool <- stat.gnames(abs(rosetta.tpool),1:length(rosetta.tpool), crit = 100)$gnames
Q10.
To strengthen visual interpretation, reorder samples
within each group based on their hierarchical clustering using
correlation as a measure of similarity and ward method. You
can try the function
plotCorOrder provided with this lab to
do this. Examine the correlation matrix separately for each
set of genes (all, 100 most variable, 100 with the highest association
with an outcome). Then compute and visualize the correlation matrix
using plot.cor.
Note how color intensity changes across the 3 images.
labcols <- c(rep("green", length(which(surv.lab[index.train] == "good"))),
rep("red", length(which(surv.lab[index.train] == "bad"))))
First, the correlation plots using all genes:
dt <- plotCorOrder()
plot.cor(cor.na(dt), new = F, nrgcols = 50,
labels = colnames(trainData), labcols = labcols,
title = "All genes", zlim = c(-1,1))
Secondly, using only the 100 most variable genes:
dt <- plotCorOrder(top100.mad)
plot.cor(cor.na(dt), new = F, nrgcols = 50,
labels = colnames(trainData), labcols = labcols,
title = "100 most variable genes", zlim = c(-1, 1))
Thirdly, using only the 100 genes with the highest t-statistics:
dt <- plotCorOrder(top100.tpool)
plot.cor(cor.na(dt), new = F, nrgcols = 50, labels =
colnames(trainData), labcols = labcols,
title = "Top 100 genes with the highest association", zlim = c(-1, 1))
In the next few sub-sections of this lab, we are interested in building
classification rules to discriminate between the patients who are going to
recur and those who are not. For ease of processing we transposed the
learning and test sets (so samples are now in rows and genes are in columns).
Thus,LS and TS are training and test set data matrices
(100 × 78 and 100 × 19 respectively) and LS.resp
and TS.resp are the binary response vectors of length 78
and 19 respectively. The matrices LS and
TS were obtained with the following commands:
## LS <- t(bcdata[top100.tpool, index.train]) ## TS <- t(bcdata[top100.tpool, index.test]) ## LS.resp <- surv.resp[index.train] ## TS.resp <- surv.resp[index.test]
Q11. Let's begin by examining the K Nearest Neighbor (KNN)
classifier. We will use the function knn
from the class package;
use
?knn to see what options are
available for this function. We want to see what happens with the
different
values of k. Is
there any advantage to using one k over another based on these results?
Note that with small number of samples, it is unwise to go beyond 1 or
3 neighbors even if classification rate seems to be lower.
library(class)
for (k in 4:7) {
disc.knn <- as.vector(knn(LS, TS, cl = LS.resp, k))
print(paste("K = ", k))
print(table(disc.knn, TS.resp)); cat("\n\n")
}
Q12. If you are interested in producing similar
pictures
as shown in the lecture, you can try the functions
geneTS and
plot.class2 provided with the lab.
At the moment, this works only for bivariate data. A simple way is to
choose two genes that are most correlated with the
outcome (i.e. the first two rows in our data matrix as it was
constructed);
alternatively, one can use the first and second principal components.
plotLS <- LS[, 1:2] plotTS <- geneTS(plotLS) res.knn <- knn(plotLS, plotTS, cl = LS.resp, k = 2) plot.class2(plotLS, plotTS, as.vector(res.knn), cl = LS.resp)
Q13. Use
stat.diag.da from the sma
package to build DLDA and
DQDA classification procedures for
tumor samples.
disc.dqda <- stat.diag.da(LS, TS, cl = LS.resp, pool = 0)$pred table(disc.dqda, TS.resp) disc.dlda <- stat.diag.da(LS, TS, cl = LS.resp, pool = 1)$pred table(disc.dlda, TS.resp)
Use the function plot.class2 to visualize how the classifiers partition the space, type the following commands:
res.dlda <- stat.diag.da(plotLS, plotTS, cl=LS.resp, pool=1)$pred plot.class2(plotLS, plotTS, res.dlda, cl=LS.resp)
Q14. Use glm with ”binomial” family option to build a logistic model. Verify that the logistic model produces similar partition to DLDA model. Note that logistic regression outputs probabilities which need to be converted into the class.
fit.logit <- glm(LS.resp ~., data = as.data.frame(LS),
family = "binomial")
pred.logit <- predict(fit.logit, newdata = as.data.frame(TS),
type = "response")
## converting output probabilities into classes
pred.logit.type <- rep(0,length(pred.logit))
pred.logit.type[pred.logit>= 0.5] <- 1
pred.logit.type[pred.logit< 0.5] <- 0
table(pred.logit.type, TS.resp)
Q15. Use the function rpart
from the rpart package to build a
classification tree. Compare the results from different classification
rules.
library(rpart) disc.rpart <- dorpart(LS, TS, cl = as.numeric(LS.resp)) table(disc.rpart, TS.resp)
Q16. Use the function bagging from the ipred
package to build an aggregate classification tree.
library(ipred)
fit.bagging <- bagging(as.factor(LS.resp) ~ .,
data = as.data.frame(LS), nbagg = 100, coob = TRUE)
pred.bagging <- predict(fit.bagging, newdata = as.data.frame(TS))
table(pred.bagging, TS.resp)
Q17. Let us reconsider the K-nearest neighbor classifier. The following code demonstrates how to calculate the re-substitution error rate for a nearest neighbor using all the genes and k equal to 3.
pred.LS <- as.vector(knn(LS, LS, cl = LS.resp, k = 3)) table(pred.LS, LS.resp) sum(pred.LS != LS.resp)
To calculate the test set error rate for a nearest neighbour using all the genes and k equal to 3.
pred.TS <- as.vector(knn(LS, TS, cl = LS.resp, k = 3)) table(pred.TS, TS.resp) sum(pred.TS != TS.resp)
Let's consider another nearest neighbour classification rule
using top 5 genes from top100.tpool and k equal to 2. Calculate and
compare the resubstitution and testset error. Notice how in this case, the test set
error is much higher than the re-substitution error.
LStop5 <- stat.gnames(top100.tpool, 1:length(top100.tpool), crit = 5)$gnames pred.LS <- as.vector(knn(LS[, LStop5], LS[, LStop5], cl = LS.resp, k = 2)) table(pred.LS, LS.resp) sum(pred.LS != LS.resp) pred.TS <- as.vector(knn(LS[, LStop5], TS[, LStop5], cl = LS.resp, k = 2)) table(pred.TS, TS.resp) sum(pred.TS != TS.resp)
Q18. Lastly, we will provide some code to
build a nearest neighbor
classifier with the parameter k selected by cross-validation.
This is an example function built on the function knn.cv.
knn.cvk <- function(LS, cl, k = 1:5, ...) {
cv.err <- cl.pred <- c()
for (i in k) {
newpre <- as.vector(knn.cv(LS, cl, i))
cl.pred <- cbind(cl.pred, newpre)
cv.err <- c(cv.err, sum(cl != newpre))
}
k0 <- k[which.min(cv.err)]
return(list(k = k0, pred = cl.pred[,which.min(cv.err)]))
}
knn.cvk(LS, cl = LS.resp, k = 3:7)
Finally, the function knn.var from the knnTree
package provides a starting point to build a nearest neighbour
classifier with the number of features selected by cross-validation.
This lab was modified from previous labs developed by Jane Fridlyand and Sandrine Dudoit.