You will need R 1.9.0 or 1.9.1
(http://www.R-Project.org)
for this lab. This section lists all of the R packages
you need to have installed and also lists some additional R packages which are recommended.
Note that most of the R packages can be installed from the Bioconductor site automatically using:
source("http://www.bioconductor.org/getBioC.R")
getBioC()
but you are expected to use a more recent version of the limmaGUI, limma, marray and arrayQuality packages for this workshop,
which is the reason for the alternate (non-Bioconductor) URLs for these packages below.
It is highly desirable to have these R packages in a directory in which you have write permission. You can use
.libPaths("C:/Custom/R/library/directory") or .libPaths("C:\\Custom\\R\\library\\directory")
before you run install.packages() (or equivalent) to install the package(s) in a customized directory location.
The cell line and Swirl datasets are required for this lab, and can be downloaded from the following URLs:
http://bioinf.wehi.edu.au/marray/ibc2004/CellLine.zip
http://bioinf.wehi.edu.au/marray/ibc2004/swirl.zip.
After conducting a microarray experiment on one or more arrays printed with a particular library of probes, the arrays are scanned to produce TIFF images, one for each channel (Cy3 and Cy5). The TIFF images are processed using an image analysis program such as ArrayVision, ImaGene, GenePix, QuantArray or SPOT to acquire the red and green foreground and background intensities for each spot, along with other measurements. The spot intensities are then exported from the image analysis program into a series of text files. There should be one file for each array or, in the case of ImaGene, two files for each array.
To analyse microarray data, we require (i) a file which describes the probes, often a GenePix Array List (GAL) file, and (ii) the image analysis output files. In most cases it is also desirable to have a Targets File, describing which RNA sample was hybridized to each channel of each array. A further optional file is the Spot Types file (STF) which identifies special probes such as control spots.
The Targets FileThe Targets File is normally in tab-delimited text format. It should contain a row for each microarray in your experiment. It should contain a FileName column, giving the file from image-analysis containing raw foreground and background intensities for each slide, a Cy3 column giving the RNA type reverse transcribed and labelled with Cy3 dye for that slide (e.g. Wild Type) and a Cy5 column giving the RNA type reverse transcribed and labelled with Cy5 dye for that slide. For ImaGene files, the FileName column is split into a FileNameCy3 column and a FileNameCy5. As well as the essential columns, you can have a Name column giving an alternative slide name to the default name, "Slide n", where n is the SlideNumber and you can have a Date column, listing the date of the hybridization. Additional columns are allowed, provided that the column names are unique. Targets Files can be created in excel or a text editor, and should be saved in Text (Tab Delimited) .txt format.
The Spot Types FileThe Spot Types File (STF) is a tab-delimited text file which allows you to identify different types of spots from the gene list. The STF is typically used to distinguish control spots from those corresponding to genes of interest, to distinguish positive from negative controls, ratio from calibration controls and so on. In the first column of this file (named SpotType), names for each class of spot (eg gene, control) on the array should be specified. One or more other columns should have the same names as columns in the gene list file and should contain patterns or regular expressions sufficient to identify the spot-type. Asterisks are wildcards which can represent anything. Be careful to use upper or lower case as appropriate and don't insert any extra spaces. Any other columns are assumed to contain plotting parameters, such as colors (column name Color) or plotting characters (column name cex) to be associated with the different types of points. STF can be created in excel or a text editor, and should be saved in Text (Tab Delimited) .txt format.
The STF uses simplified regular expressions to match patterns. For example, 'AA*'
means any string starting with 'AA', '*AA' means any
code ending with 'AA', 'AA' means exactly these two
letters, '*AA*' means any string containing 'AA', 'AA.'
means 'AA' followed by exactly one other character and 'AA\.'
means exactly 'AA' followed by a period and no other characters.
For those familiar with regular expressions, any other regular expressions are
allowed but the codes ^ for beginning of string and $
for end of string should be excluded. Note that the patterns are matched
sequentially from first to last, so more general patterns should be included
first. For example, it is often a good idea to include a default spot-type as
the first line in the STF with pattern '*' for all the
pattern-matching columns and with default plotting parameters.
Background. The experiment was carried out using zebrafish as a model organism to study the early development in vertebrates. Swirl is a point mutant in the BMP2 gene that affects the dorsal/ventral body axis. The main goal of the Swirl experiment is to identify genes with altered expression in the Swirl mutant compared to wild-type zebrafish.
The hybridizations. Two sets of dye-swap experiments were performed making a total of four replicate hybridizations. Each of the arrays compares RNA from swirl fish with RNA from normal ("wild type") fish. The experimenters have prepared a tab-delimited targets file called "SwirlSample.txt" which describes the four hybridizations:

On slides 81 and 93, swirl RNA was labelled with green (Cy3) dye and wild type RNA was labelled with red (Cy5) dye. On slides 82 and 94, the labelling was the reversed.
Each of the four hybridized arrays was scanned on an Axon scanner to produce a TIFF image, which was then processed using the image analysis software SPOT. The data from the arrays are stored in the four output files listed under FileName.
The arrays. The microarrays used in this experiment were printed with 8448 probes (spots), including 768 control spots. The array printer uses a print head with a 4x4 arrangement of print-tips and so the microarrays are partitioned into a 4x4 grid of tip groups. Each grid consists of 22x24 spots that were printed with a single print-tip. The gene name associated with each spot is recorded in a GenePix Array List (GAL) file named "fish.gal".
For this example we assume that these files, along with a Targets file ("SwirlSample.txt") and STF ("SpotTypes.txt") are available in the same directory.
Before loading limmaGUI, be advised that if you are a Windows user, it is
best to run Rgui in Single Document Interface (SDI) mode, otherwise,
Rgui often "steals" the focus from limmaGUI.
This can be done by selecting "GUI preferences" from the "Edit" menu,
selecting "SDI" saving preferences and restarting R. (Or alternatively, you
can edit the file rw1091\etc\Rconsole.)
The limmaGUI library needs to be loaded by typing
library(limmaGUI)
and selecting Yes. The main GUI screen should be displayed. If the GUI window is closed, the command
limmaGUI()
will re-open it.
Reading the dataFrom the File menu, select "New".

You will be asked to choose a working directory. Select the directory containing the Swirl dataset and click OK.

Now you can open a GAL (GenePix Array List) file, an RNA Targets file (listing the hybridizations), and a Spot Types file.

Clicking on the "Select GAL File" button gives the following dialog. Open "fish.gal".

Now click on the "Select Targets" file button to open the RNA Targets file.

Finally, the Spot Types file, which for this experiment is called "SpotTypes.txt" and has the following format

is selected.

Once the GAL, Targets and Spot Types files have been selected, click OK.
Now select the type of image-processing file listed in the RNA Targets file ("Spot") and click OK.

For Spot files, using background correction is highly recommended (choose Yes).

A number of background correction methods are available. Subtract will be used for this dataset.

For the Swirl data set, we will not use any spot quality weighting, so click No.

When prompted for a name for this data set, type in "Swirl".

Once the data set has been loaded, its name is displayed on top of the left status window. The status window shows that Red and Green background-corrected intensities (R and G) have been loaded, and that there is no spot-quality weighting. The data set name can be later modified with the "Data Set Name" button. The data set name is not the same as the file name, displayed in the title bar. For example, the same data set (Swirl) could be saved at two different stages of the analysis, e.g. SwirlArraysLoaded.lma and SwirlLinearModelComputed.lma.

Now we can check that the Targets have been read in correctly. From the RNA Targets menu, click on "RNA Targets" to display the information in a table. An Edit menu is provided to allow the user to copy the Targets table to the clipboard.

The Spot Types table can viewed from within limmaGUI in a similar manner. Unlike the
RNA Targets table, the Spot Types table is actually editable within limmaGUI. You can
change the default colors associated with each spot type and you can even create new
spot types and save the table to a tab-delimited text file. The arrow keys can be used
to select the active cell in the table, while holding down Control and using the arrow
keys will move the cursor within the text in one cell. The Rows menu can be used to
add or delete spot types.

The layout should also be checked by selecting the "Layout Parameters" menu item from the Layout menu. A dialogue box with the number of rows and columns of blocks on an array and rows and columns of spots in each block will be displayed. If these values are appropriate, click OK.

Once the data has been loaded, various diagnostic plots can be generated by choosing an appropriate option from the Plot menu.

Image plots. It can be interesting to look at the variation of foreground and background values over an array. Consider an image plot of the red foreground for the first array, by selecting:
Plot > Image Array Plot > Choose a slide: "Slide81" > OK > Choose which variable to plot: "R" > OK > Plot title: "Image array plot of R for Slide 81" > OK.
An image plot should be displayed in another window. The top left of the array is on the bottom left of the plot, which represents a counter-clockwise rotation of 90 degrees. We can see a bright streak across the middle two grids of the 3rd row caused by a scratch or dust on the array. Spots which are affected by this artifact will have suspect M-values. The streak also shows up as brighter regions in the image plots for the background. Clicking on a particular spot on this image will bring up a window displaying its ID information retrieved from the GAL file. Other variables which may be plotted in this way include "Mraw" and "Araw", for un-normalized M and A values, "G" for green foreground and "Rb" and "Gb" for red and green background.
MA-plots. An MA-plot plots the log-ratio of R vs G against the overall intensity of each spot. The log-ratio is represented by the M-value, M = log2(R)-log2(G), and the overall intensity by the A-value, A = (log2(R)+log2(G))/2. To get an MA-plot of the un-normalized values for the first array, try the following:
Plot > M A Plot with lowess curves (for one slide) > Choose a slide: "Slide81" > OK > Normalization Within A Single Array: No > Lowess Curve(s) Options: "Print-Tip Group Lowess Curves" > OK > Plot title: "M A Plot for slide 81 with no normalization" > OK.
A different coloured curve is displayed for each print-tip group. To see individual MA-plots for each of the print-tip groups on this array, with lowess curves, try the following:
Plot > Print-Tip Group M A Plot (for one slide) > Choose a slide: "Slide81" > OK.
You should be able to notice the points which make up the red streak (Note: this plot is not rotated). The affected spots are in grids 10 and 11, and have very large positive M values at high intensities.
Normalization. Several normalization options are available in the Normalization menu. By choosing the "Select Within-Array Normalization Method" item from the menu,

a dialogue box with various normalization options is displayed, as below.

Print-tip group loess normalization is the default method, and will be used for this data. Click OK.
Next choose the "Normalize / Update M and A" item from the Normalization menu

Dialogue boxes which ask whether you'd like to "Normalize Within Arrays" (choose Yes) and "Normalize Between Arrays" (choose No) should follow. Now check the status window to see that "Within-Array Normalized" M and A values are available. Try generating an MA plot of the normalized values, with the control spots highlighted (Hint: Plot > Colour-Coded M A Plot (for one slide)).
Saving and Exiting. To save the data generated during your limmaGUI session, choose "Save" from the File menu, and enter a filename such as "Swirl.lma". When you have finished, you can quit limmaGUI by going to the File menu and choosing "Exit".
You may prefer to load your data using the command line functions available in limma. Ensure that the R working directory is set to the directory containing the Swirl files (using setwd(), or from the R Console in Windows, select File > Change dir... > Change working directory to: "<directory>"). To load the limma library, type
library(limma)
at the R command prompt. Next read the targets file using the command:
targets <- readTargets("SwirlSample.txt")
To read in the intensity data, the function read.maimages is used.
RG <- read.maimages(targets$FileName, source="spot") Read swirl.1.spot Read swirl.2.spot Read swirl.3.spot Read swirl.4.spot
The default for SPOT output is that Rmean and Gmean columns of each file
are used as foreground intensities and morphR and morphG are used as
background intensities. The object RG is an RGList object.
RG
An object of class "RGList"
$R
swirl.1 swirl.2 swirl.3 swirl.4
[1,] 19538.470 16138.720 2895.1600 14054.5400
[2,] 23619.820 17247.670 2976.6230 20112.2600
[3,] 21579.950 17317.150 2735.6190 12945.8500
[4,] 8905.143 6794.381 318.9524 524.0476
[5,] 8676.095 6043.542 780.6667 304.6190
8443 more rows ...
$G
swirl.1 swirl.2 swirl.3 swirl.4
[1,] 22028.260 19278.770 2727.5600 19930.6500
[2,] 25613.200 21438.960 2787.0330 25426.5800
[3,] 22652.390 20386.470 2419.8810 16225.9500
[4,] 8929.286 6677.619 383.2381 786.9048
[5,] 8746.476 6576.292 901.0000 468.0476
8443 more rows ...
$Rb
swirl.1 swirl.2 swirl.3 swirl.4
[1,] 174 136 82 48
[2,] 174 133 82 48
[3,] 174 133 76 48
[4,] 163 105 61 48
[5,] 140 105 61 49
8443 more rows ...
$Gb
swirl.1 swirl.2 swirl.3 swirl.4
[1,] 182 175 86 97
[2,] 171 183 86 85
[3,] 153 183 86 85
[4,] 153 142 71 87
[5,] 153 142 71 87
8443 more rows ...
To read in the .gal file, infer the slide layout, and assign this information to the RGList object, use the commands
RG$genes <- readGAL("fish.gal")
and
RG$printer <- getLayout(RG$genes)
Image plots. Consider image plots of the red and green background for the first array:
imageplot(log2(RG$Rb[,1]), RG$printer, low="white", high="red") imageplot(log2(RG$Gb[,1]), RG$printer, low="white", high="green")
MA-plots. The M and A values can be calculated using the function normalizeWithinArrays.
The option method="none" calculates raw (un-normalized) M and A values.
MA <- normalizeWithinArrays(RG, method="none")
or equivalently
MA <- MA.RG(RG)
Note that 'subtraction' of the background from the foreground is the default background correction method
used to construct these log-ratios. Other options are possible using the backgroundCorrect
function on the RGList. For instance, if you don't want to background correct the data, you
would use the following
RGnobg <- backgroundCorrect(RG, method="none")
and then proceed as before to calculate the M and A values, using RGnobg instead of RG.
To plot the raw M and A values for the first array, use the following command
plotMA(MA, array=1)
By incrementing the array argument (eg array=2), MA plots for other slides can be generated.
Now plot the individual MA-plots for each of the print-tip groups on this array, together with the loess curves which will be used for normalization:
plotPrintTipLoess(MA)
Normalization. For print-tip loess normalization, use the command:
MA <- normalizeWithinArrays(RG)
The is the default normalization in normalizeWithinArrays, however
other options are possible, and are specified by the method argument (for example use method="loess" for global intensity based loess normalization of each slide). To plot the normalized M versus A values by print-tip group, type:
plotPrintTipLoess(MA)
Exiting. Once you are finished, type q() at the R prompt, or from the R console (Windows users) choose File > Exit to quit from your R session.
In this exercise we consider an experiment where two RNA sources are compared directly on 6 replicate arrays.
Background. This data is taken from a set of quality control hybridizations, which were performed to assess how well the spots were printed on the arrays, rather than a question of biological interest. The RNA compared was from two very different cell lines, which ensured many of the genes were differentially expressed at varying degrees. Data for each slide is available from two image analysis packages (GenePix and SPOT), and will be used to highlight some of the differences between the programs, particularly in terms of the background levels measured for each spot.
The hybridizations. Six microarrays were each hybridized with RNA from cell line A (Cy5) and cell line B (Cy3).
The arrays. A total of 10944 spots were printed on each array. They were arranged in 6x4 print-tip grids, each containing 19x24 rows and columns of spots. The Lucidea Universal ScoreCard controls (Samartzidou et al, 2001) were printed on the arrays, and spiked into the RNA mixes of each experiment. This control series is made up of artificial genes, some of which are added in equal quantities, or at 3 and 10 times as much in one sample compared to the other to give known fold-changes for particular spots.
For this example, you will need the GAL file called "genelist.gal" and the raw SPOT (.spot) and GenePix (.gpr) output files in the current working directory.
For the cell line comparison arrays, a targets file would look like

for the Spot data, and

for the GenePix data. Files named "TargetsSpot.txt" and "TargetsGenePix.txt"
are included in the zip file for the cell line data.
An appropriate STF for these arrays which contain Lucidea Universal ScoreCard controls is:
A STF named "SpotTypes.txt" is included in the zip file for the cell line data.
To load the limma library, type
library(limma)
To read in the STF and targets information, use the following commands
SpotTypes <- readSpotTypes()
and
TargetsSpot <- readTargets("TargetsSpot.txt")
The filenames stored in the targets file can be used to read in the intensity information contained in each of the image analysis output files with the command
RGspot <- read.maimages(TargetsSpot$FileName, source="spot")
The object RGspot is an RGList object which contains the following items
names(RGspot) [1] "R" "G" "Rb" "Gb" "genes" "printer"
The foreground intensities are stored in RGspot$R (Cy5) and RGspot$G (Cy3) and the
backgrounds are stored in RGspot$Rb (Cy5) and RGspot$Gb (Cy3).
Other useful information can also be added to the RGList, such as the probe ID information
RGspot$genes <- readGAL()
This information is read in from a file in the current working directory with an extension .gal. If no such file, or multiple files with this extension exist, an error message will be generated.
TheSpotTypes information can now be used to determine the status of each spot (ie control or gene), using the commandRGspot$genes$Status <- controlStatus(SpotTypes, RGspot$genes)
This information will be used later to highlight the control spots on an MA plot. The printer information is also needed if intensity based print-tip normlization is to be applied to the data. The easiest way to acquire this information is from the GAL file, ie
RGspot$printer <- getLayout(RGspot$genes)
Now all of the necessary information is stored in the RGList.
RGspot
An object of class "RGList"
$R
slide1 slide2 slide3 slide4 slide5 slide6
[1,] 187.1304 341.8478 235.5345 1039.7429 584.4063 367.1930
[2,] 195.9623 433.5333 209.1333 757.4167 1033.5319 800.4595
[3,] 182.9615 198.7879 197.3830 532.4074 357.3704 199.1791
[4,] 1355.4242 3353.5395 690.7414 5495.8267 3031.9355 2726.9020
[5,] 1344.0921 2706.9730 726.3710 5458.9130 5778.7037 3558.3462
10939 more rows ...
$G
slide1 slide2 slide3 slide4 slide5 slide6
[1,] 867.3261 588.5217 566.0345 524.0286 690.1563 663.2982
[2,] 497.4906 542.1000 396.3667 625.5000 1211.8298 1550.7027
[3,] 635.6538 366.6061 261.0213 210.2963 524.8148 431.5672
[4,] 1487.6818 1754.9079 883.5172 3970.1333 1318.7581 1641.5882
[5,] 1625.0526 2032.5676 654.3548 4214.9420 2731.8025 2531.5385
10939 more rows ...
$Rb
slide1 slide2 slide3 slide4 slide5 slide6
[1,] 58 62 53 64 74 62
[2,] 60 62 50 60 71 62
[3,] 60 62 54 59 71 62
[4,] 58 62 54 59 71 61
[5,] 59 62 54 65 70 63
10939 more rows ...
$Gb
slide1 slide2 slide3 slide4 slide5 slide6
[1,] 57 59 55 55 61 63
[2,] 59 58 55 55 61 62
[3,] 59 56 55 53 60 61
[4,] 67 56 54 52 59 61
[5,] 67 57 51 51 59 59
10939 more rows ...
$genes
Block Column Row ID Name Status
1 1 1 1 AA234982 AA234982 Gene
2 1 2 1 T69271 T69271 Gene
3 1 3 1 AA425102 AA425102 Gene
4 1 4 1 AA447684 AA447684 Gene
5 1 5 1 N59426 N59426 Gene
10939 more rows ...
$printer
$ngrid.r
[1] 6
$ngrid.c
[1] 4
$nspot.r
[1] 19
$nspot.c
[1] 24
attr(,"class")
[1] "PrintLayout"
This process can be repeated for the GenePix data:
TargetsGenePix <- readTargets("TargetsGenePix.txt")
RGgpr <- read.maimages(TargetsGenePix$FileName, source="genepix")
RGgpr$genes <- readGAL()
RGgpr$printer <- getLayout(RGgpr$genes)
RGgpr$genes$Status <- controlStatus(SpotTypes, RGgpr$genes)
The default for GenePix output is that the F635 Mean (Cy5) and F532 Mean
(Cy3) columns of each .gpr file are used as foreground intensities and B635 Median (Cy5) and
B532 Median (Cy3) are used as background intensities.
An alternative way of reading data from the image analysis output files without using
the Targets file, is
spotFiles <- dir(pattern=".spot") RGspot <- read.maimages(spotFiles, source="spot")
for SPOT data and
gprFiles <- dir(pattern=".gpr") RGgpr <- read.maimages(gprFiles, source="genepix")
for GenePix data. In this case, the order of the files in the working directory (which depends on their names) will determine the order they are read in. A benefit of using a targets file is that the order is controlled by the user. The targets file can also be used later in the linear modelling analysis to set up the design matrix. Assigning the gene ID's, control status of the spots and the printer layout will need to be repeated as before.
Un-normalized M and A values for each spot can be constructed using the function MA.RG.
MAspot <- MA.RG(RGspot) MAgpr <- MA.RG(RGgpr)
MA-plots. The MA-plot of the un-normalized values for the first array is obtained using:
plotMA(MAspot, array=1) plotMA(MAgpr, array=1)
Notice the automatic highlighting of the control spots specified in the STF.
To do the same plot for other arrays, the array argument can be changed.
Normalization. For print-tip group normalization, try
MAspot <- normalizeWithinArrays(RGspot) MAgpr <- normalizeWithinArrays(RGgpr)
To do MA plots, 6 to a page and save the output in .png format, the commands
plotMA3by2(MAspot, prefix="MAspot") plotMA3by2(MAgpr, prefix="MAgpr")
can be used. Comparing the MA plots for a given slide, you should notice that the dynamic range of the A values is less for SPOT data compared to GenePix. There is also some fanning of the M values at lower intensities for the GenePix plots. This is a result of the different background estimates used in the two packages. The morphological background in SPOT tends to give a low, stable estimate of slide background, whereas the median of the background pixels used in GenePix is an over-estimate (see Yang et al (2002)), resulting in numerical instability for many of the low intensity log-ratios. Not background correcting GenePix data may be a good option in some instances, ie
RGgprnobg <- backgroundCorrect(RGgpr, method="none")
then proceed as before to normalize the data, using RGgprnobg instead of RGgpr.
For some other diagnostic plots, try the following functions from the arrayQuality library.
library(arrayQuality)
controlCode <- read.table("controlCode.txt", header=TRUE, sep="\t")
rawdata <- gpQuality(TargetsGenePix$FileName, compBoxplot=FALSE)
For each slide, a summary diagnostic plot "diagPlot.<slidename>.png" will be saved in the working directory. Each slide summary shows MA plots, image plots of M (before and after normalization) and A values, dot plots of the M and A values for each class of control spot (provided controlCode is set correctly), as well as single channel intensity plots. These plots allow the overall quality of a slide to be assessed visually. The arrayQuality package can also provide a more quantitative measurement of slide quality by comparing each slide to a set of reference slides, where available (not applicable for this example).
Exiting. Once you are finished, type q() at the R prompt, or from the R console (Windows users) choose File > Exit to quit from your R session.
Thanks to Gordon Smyth and James Wettenhall for allowing use of their documentation and worked example
analyses and to Jean Yang and Agnes Paquet who assisted greatly with the arrayQuality example.