Using Our Discipline to Enhance Human Welfare
August 7 - 11, 2005

Lab 1 - Loading, pre-processing and normalizing two-color microarray data (part I)

James Wettenhall. August 6-7, 2005

1. Software required for this lab.

You will need R 2.1.0 or 2.1.1 (http://www.R-Project.org) for this lab. This section lists all of the R packages you need to have installed and also lists some additional R packages which are recommended. Note that most of the R packages can be installed from the Bioconductor site automatically using:
source("http://www.bioconductor.org/getBioC.R")
getBioC()
However, Bioconductor releases occur only once every six months, and the authors of these labs typically use more recent versions of some packages (e.g. limma). It is therefore advisable to check the R package version numbers below and, if your installation is out of date, to install the packages from the links provided.

1.1 Required R packages

It is highly desirable to have these R packages in a directory in which you have write permission. You can use .libPaths("C:/Custom/R/library/directory") or .libPaths("C:\\Custom\\R\\library\\directory") before you run install.packages() (or equivalent) to install the package(s) in a customized directory location.

Package                         Windows                 Mac OS X                  Source
limma_2.0.2                     limma_2.0.2.zip         limma_2.0.2.tar.gz        limma_2.0.2.tar.gz
limmaGUI_1.3.9                  limmaGUI_1.3.9.zip      limmaGUI_1.3.9.tar.gz     limmaGUI_1.3.9.tar.gz
R2HTML_1.54                     R2HTML_1.54.zip         R2HTML_1.54.tar.gz        R2HTML_1.54.tar.gz
sma_0.5.14                      sma_0.5.14.zip          sma_0.5.14.tgz            sma_0.5.14.tar.gz
statmod_1.1.1 or statmod_1.2.0  statmod_1.2.0.zip       statmod_1.2.0.tar.gz      statmod_1.2.0.tar.gz
tkrplot_0.0-12                  tkrplot_0.0-12.zip      tkrplot_0.0-12.tar.gz     tkrplot_0.0-12.tar.gz
xtable_1.2-5                    xtable_1.2-5.zip        xtable_1.2-5.tgz          xtable_1.2-5.tar.gz

As there is currently no Mac binary version of the tkrplot package available on CRAN, it may be difficult to install for Mac users. tkrplot is not essential - the regular R graphics device can be used instead.
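
One way to act on the advice to check version numbers is to compare what is installed on your machine with the table above. The following is a minimal sketch in base R. check_versions() is a hypothetical helper written for this illustration (it is not part of limma or Bioconductor), and the minimum versions are simply copied from the table.

```r
# Minimum versions copied from the table in section 1.1.
required <- c(limma = "2.0.2", limmaGUI = "1.3.9", R2HTML = "1.54",
              sma = "0.5.14", statmod = "1.1.1", tkrplot = "0.0-12",
              xtable = "1.2-5")

# check_versions() is a hypothetical helper, not part of any package.
# 'installed' is a named character vector of version strings.
check_versions <- function(installed, required) {
  vapply(names(required), function(pkg) {
    if (!pkg %in% names(installed)) {
      "missing"
    } else if (utils::compareVersion(installed[[pkg]], required[[pkg]]) < 0) {
      "outdated"
    } else {
      "ok"
    }
  }, character(1))
}

# Report the status of the packages visible in the current library paths.
installed <- installed.packages()[, "Version"]
print(check_versions(installed, required))
```

Any package reported as "missing" or "outdated" can then be installed from the links in the table.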

1.2 Required data

The Swirl Zebrafish dataset is required for this lab, and can be downloaded from the following URL:
http://bioinf.wehi.edu.au/marray/jsm2005/swirl.zip.

1.3 Details on the files used

After conducting a microarray experiment on one or more arrays printed with a particular library of probes, the arrays are scanned to produce TIFF images, one for each channel (Cy3 and Cy5). The TIFF images are processed using an image analysis program such as ArrayVision, ImaGene, GenePix, QuantArray or SPOT to acquire the red and green foreground and background intensities for each spot, along with other measurements. The spot intensities are then exported from the image analysis program into a series of text files. There should be one file for each array or, in the case of ImaGene, two files for each array.

To analyze microarray data, we require (i) a file which describes the probes, often a GenePix Array List (GAL) file, and (ii) the image analysis output files. In most cases it is also desirable to have a Targets File, describing which RNA sample was hybridized to each channel of each array. A further optional file is the Spot Types file (STF) which identifies special probes such as control spots.

The Targets File

The Targets File is normally in tab-delimited text format and should contain one row for each microarray in your experiment. It should contain a FileName column, giving the image-analysis output file containing the raw foreground and background intensities for each slide; a Cy3 column, giving the RNA type reverse transcribed and labelled with Cy3 dye for that slide (e.g. Wild Type); and a Cy5 column, giving the RNA type reverse transcribed and labelled with Cy5 dye for that slide. For ImaGene files, the FileName column is split into a FileNameCy3 column and a FileNameCy5 column. As well as the essential columns, you can have a Name column, giving an alternative slide name to the default name "Slide n" (where n is the SlideNumber), and a Date column, listing the date of the hybridization. Additional columns are allowed, provided that the column names are unique. Targets Files can be created in Excel or a text editor, and should be saved in Text (Tab Delimited) .txt format.
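
As a rough illustration of the format described above, the sketch below builds a small Targets File in base R, writes it as tab-delimited text and reads it back. The slide numbers, file names and RNA labels are hypothetical placeholders, not the files used in this lab.

```r
# Illustrative targets table for a four-array experiment.
# All values below are hypothetical placeholders.
targets <- data.frame(
  SlideNumber = 1:4,
  FileName    = c("array1.spot", "array2.spot", "array3.spot", "array4.spot"),
  Cy3         = c("Mutant", "Wild Type", "Mutant", "Wild Type"),
  Cy5         = c("Wild Type", "Mutant", "Wild Type", "Mutant"),
  stringsAsFactors = FALSE
)

# Write as tab-delimited text, without row names or quotes.
targets_file <- file.path(tempdir(), "ExampleTargets.txt")
write.table(targets, targets_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back and confirm that the essential columns are present.
targets_back <- read.delim(targets_file, stringsAsFactors = FALSE)
essential <- c("FileName", "Cy3", "Cy5")
print(all(essential %in% names(targets_back)))
print(targets_back)
```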

The Spot Types File

The Spot Types File (STF) is a tab-delimited text file which allows you to identify different types of spots from the gene list. The STF is typically used to distinguish control spots from those corresponding to genes of interest, to distinguish positive from negative controls, ratio from calibration controls and so on. The first column of this file (named SpotType) should give a name for each class of spot on the array (e.g. gene, control). One or more other columns should have the same names as columns in the gene list file and should contain patterns or regular expressions sufficient to identify the spot-type. Asterisks are wildcards which can represent anything. Be careful to use upper or lower case as appropriate and don't insert any extra spaces. Any other columns are assumed to contain plotting parameters, such as colors (column name Color) or plotting symbol sizes (column name cex), to be associated with the different types of points. The STF can be created in Excel or a text editor, and should be saved in Text (Tab Delimited) .txt format.

The STF uses simplified regular expressions to match patterns. For example, 'AA*' means any string starting with 'AA', '*AA' means any string ending with 'AA', 'AA' means exactly these two letters, '*AA*' means any string containing 'AA', 'AA.' means 'AA' followed by exactly one other character and 'AA\.' means exactly 'AA' followed by a period and no other characters. For those familiar with regular expressions, other regular expression syntax is also allowed, but the codes ^ for beginning of string and $ for end of string should not be used. Note that the patterns are matched sequentially from first to last, so later matches override earlier ones and more general patterns should be listed first. For example, it is often a good idea to include a default spot-type as the first line in the STF with pattern '*' for all the pattern-matching columns and with default plotting parameters.
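
The pattern rules above can be mimicked in a few lines of base R. This is only a sketch of the idea, assuming that '*' is expanded to '.*' and that the whole string must match; stf_to_regex(), stf_match() and assign_types() are hypothetical helpers, the probe IDs are made up, and limma's own matching code may differ in detail.

```r
# stf_to_regex(), stf_match() and assign_types() are hypothetical helpers.
stf_to_regex <- function(pattern) {
  # '*' becomes '.*'; anchors are added so the whole string must match.
  paste0("^", gsub("*", ".*", pattern, fixed = TRUE), "$")
}

stf_match <- function(pattern, x) grepl(stf_to_regex(pattern), x)

# Patterns are applied in order, so later rows override earlier ones.
assign_types <- function(ids, types, patterns) {
  out <- rep(NA_character_, length(ids))
  for (i in seq_along(patterns)) {
    out[stf_match(patterns[i], ids)] <- types[i]
  }
  out
}

ids <- c("geneA", "control1", "control2", "geneB")   # made-up probe IDs
print(stf_match("AA*", c("AA", "AAB", "BAA")))
print(assign_types(ids, types = c("gene", "control"),
                   patterns = c("*", "control*")))
```

Listing the default '*' row first means every spot starts as "gene" and only the spots matching the more specific pattern are relabelled.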

2. Swirl Experiment

Background. The experiment was carried out using zebrafish as a model organism to study the early development in vertebrates. Swirl is a point mutant in the BMP2 gene that affects the dorsal/ventral body axis. The main goal of the Swirl experiment is to identify genes with altered expression in the Swirl mutant compared to wild-type zebrafish.

The hybridizations. Two sets of dye-swap experiments were performed making a total of four replicate hybridizations. Each of the arrays compares RNA from swirl fish with RNA from normal ("wild type") fish. The experimenters have prepared a tab-delimited targets file called "SwirlSample.txt" which describes the four hybridizations:

images/SwirlSampleInExcel.png

On slides 81 and 93, swirl RNA was labelled with green (Cy3) dye and wild type RNA was labelled with red (Cy5) dye. On slides 82 and 94, the labelling was reversed.

Each of the four hybridized arrays was scanned on an Axon scanner to produce a TIFF image, which was then processed using the image analysis software SPOT. The data from the arrays are stored in the four output files listed under FileName.

The arrays. The microarrays used in this experiment were printed with 8448 probes (spots), including 768 control spots. The array printer uses a print head with a 4x4 arrangement of print-tips and so the microarrays are partitioned into a 4x4 grid of tip groups. Each grid consists of 22x24 spots that were printed with a single print-tip. The gene name associated with each spot is recorded in a GenePix Array List (GAL) file named "fish.gal".
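
The layout arithmetic above can be checked directly. The sketch below stores the layout in a plain list whose component names (ngrid.r, ngrid.c, nspot.r, nspot.c) follow the printer layout convention used by limma, and it assumes that spots are listed one print-tip group after another.

```r
# Layout of the swirl arrays as described above.
layout <- list(ngrid.r = 4, ngrid.c = 4, nspot.r = 22, nspot.c = 24)

spots_per_group <- layout$nspot.r * layout$nspot.c   # spots printed by one tip
n_groups        <- layout$ngrid.r * layout$ngrid.c   # number of print-tip groups
n_spots         <- spots_per_group * n_groups        # total spots on the array

# Print-tip group index for every spot, assuming spots are listed group by group.
tip_group <- rep(seq_len(n_groups), each = spots_per_group)

print(c(spots_per_group = spots_per_group, n_groups = n_groups, n_spots = n_spots))
```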

For this example we assume that these files, along with a Targets file ("SwirlSample.txt") and STF ("SpotTypes.txt") are available in the same directory.

2.1 Reading the data using limmaGUI

Before loading limmaGUI, be advised that if you are a Windows user, it is best to run Rgui in Single Document Interface (SDI) mode; otherwise, Rgui often "steals" the focus from limmaGUI. This can be done by selecting "GUI preferences" from the "Edit" menu, selecting "SDI", saving the preferences and restarting R. (Alternatively, you can edit the file rw2011\etc\Rconsole.)

The limmaGUI library needs to be loaded by typing

 library(limmaGUI)

and selecting Yes. The main GUI screen should be displayed. If the GUI window is closed, the command

 limmaGUI()

will re-open it.

Reading the data

From the File menu, select "New".

images/limmaGUINewFileSelected.png

You will be asked to choose a working directory. Select the directory containing the Swirl dataset and click OK.

images/ChooseSwirlWorkingDir.png

Now you can open a GAL (GenePix Array List) file, an RNA Targets file (listing the hybridizations), and a Spot Types file.

images/OpenGALandTargetsandSpotTypesFiles.png

Clicking on the "Select GAL File" button gives the following dialog. Open "fish.gal".

images/OpeningFishGAL.png

Now click on the "Select Targets File" button to open the RNA Targets file.

images/OpeningSwirlTargets.png

Finally, select the Spot Types file, which for this experiment is called "SpotTypes.txt" and has the following format:

images/SwirlSpotTypesInExcel.png

images/OpeningSwirlSpotTypes.png

Once the GAL, Targets and Spot Types files have been selected, click OK.

images/GALTargetsandSpotTypesFilesSelected.png

Now select the type of image-processing file listed in the RNA Targets file ("Spot") and click OK.

images/ImageProcessingFileType.png

For Spot files, using background correction is highly recommended (choose Yes).

images/UseBackgroundCorrection.png

A number of background correction methods are available. Subtract will be used for this dataset.

images/BackgroundCorrectionMethod.png

For the Swirl data set, we will not use any spot quality weighting, so click No.

images/SpotQualityWeighting.png

When prompted for a name for this data set, type in "Swirl".

images/DataSetNameSwirl.png

Once the data set has been loaded, its name is displayed on top of the left status window. The status window shows that Red and Green background-corrected intensities (R and G) have been loaded, and that there is no spot-quality weighting. The data set name can be later modified with the "Data Set Name" button. The data set name is not the same as the file name, displayed in the title bar. For example, the same data set (Swirl) could be saved at two different stages of the analysis, e.g. SwirlArraysLoaded.lma and SwirlLinearModelComputed.lma.

images/limmaGUImainWindowSwirlArraysLoaded.png

Now we can check that the Targets have been read in correctly. From the RNA Targets menu, click on "RNA Targets" to display the information in a table. An Edit menu is provided to allow the user to copy the Targets table to the clipboard.

images/SwirlRNATargets.png

The Spot Types table can be viewed from within limmaGUI in a similar manner. Unlike the RNA Targets table, the Spot Types table is editable within limmaGUI. You can change the default colors associated with each spot type, create new spot types and save the table to a tab-delimited text file. The arrow keys can be used to select the active cell in the table, while holding down Control and using the arrow keys will move the cursor within the text in one cell. The Rows menu can be used to add or delete spot types.

images/SwirlTypesTable.png

The layout should also be checked by selecting the "Layout Parameters" menu item from the Layout menu. A dialog box with the number of rows and columns of blocks on an array and rows and columns of spots in each block will be displayed. If these values are appropriate, click OK.

images/limmaGUISwirlLayout.png

2.2 Diagnostic plots and normalization using limmaGUI

Once the data has been loaded, various diagnostic plots can be generated by choosing an appropriate option from the Plot menu.

images/ChoosePlotMenu.png

Image plots. It can be interesting to look at the variation of foreground and background values over an array. Consider an image plot of the red foreground for the first array, by selecting:

Plot > Image Array Plot > Choose a slide: "Slide81" > OK > Choose which variable to plot: "R" > OK > Plot title: "Image array plot of R for Slide 81" > OK.

An image plot should be displayed in another window. The top left of the array is on the bottom left of the plot, which represents a counter-clockwise rotation of 90 degrees. We can see a bright streak across the middle two grids of the 3rd row caused by a scratch or dust on the array. Spots which are affected by this artifact will have suspect M-values. The streak also shows up as brighter regions in the image plots for the background. Clicking on a particular spot on this image will bring up a window displaying its ID information retrieved from the GAL file. Other variables which may be plotted in this way include "Mraw" and "Araw", for un-normalized M and A values, "G" for green foreground and "Rb" and "Gb" for red and green background.

MA-plots. An MA-plot plots the log-ratio of R vs G against the overall intensity of each spot. The log-ratio is represented by the M-value, M = log2(R)-log2(G), and the overall intensity by the A-value, A = (log2(R)+log2(G))/2. To get an MA-plot of the un-normalized values for the first array, try the following:
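
As a small worked example of these two formulas, the sketch below computes M and A in base R for three made-up spots. The intensities are chosen as powers of two so that the results can be checked by hand.

```r
# Made-up red and green intensities for three spots.
R <- c(1024, 256, 64)
G <- c(256, 256, 256)

M <- log2(R) - log2(G)         # log-ratio of red to green
A <- (log2(R) + log2(G)) / 2   # average log-intensity

print(data.frame(R, G, M, A))
```

A spot with equal red and green intensity has M = 0, while a four-fold excess of red gives M = 2.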

Plot > M A Plot with lowess curves (for one slide) > Choose a slide: "Slide81" > OK > Normalization Within A Single Array: No > Lowess Curve(s) Options: "Print-Tip Group Lowess Curves" > OK > Plot title: "M A Plot for slide 81 with no normalization" > OK.

A different colored curve is displayed for each print-tip group. To see individual MA-plots for each of the print-tip groups on this array, with lowess curves, try the following:

Plot > Print-Tip Group M A Plot (for one slide) > Choose a slide: "Slide81" > OK.

You should be able to notice the points which make up the red streak (Note: this plot is not rotated). The affected spots are in grids 10 and 11, and have very large positive M values at high intensities.

Normalization. Several normalization options are available in the Normalization menu. By choosing the "Select Within-Array Normalization Method" item from the menu,

images/ChooseNormalizationMenu.png

a dialog box with various normalization options is displayed, as below.

images/ChooseNormalizationOption.png

Print-tip group loess normalization is the default method, and will be used for this data. Click OK.

Next choose the "Normalize / Update M and A" item from the Normalization menu

images/ChooseNormalizeUpdateMandA.png

Dialog boxes which ask whether you'd like to "Normalize Within Arrays" (choose Yes) and "Normalize Between Arrays" (choose No) should follow. Now check the status window to see that "Within-Array Normalized" M and A values are available. Try generating an MA plot of the normalized values, with the control spots highlighted (Hint: Plot > Color-Coded M A Plot (for one slide)).

Saving and Exiting. To save the data generated during your limmaGUI session, choose "Save" from the File menu, and enter a filename such as "Swirl.lma". When you have finished, you can quit limmaGUI by going to the File menu and choosing "Exit".

2.3 Reading the data using limma

You may prefer to load your data using the command line functions available in limma. Ensure that the R working directory is set to the directory containing the Swirl files (using setwd(), or from the R Console in Windows, select File > Change dir... > Change working directory to: "<directory>"). To load the limma library, type

 library(limma)

at the R command prompt. Next read the targets file using the command:

 targets <- readTargets("SwirlSample.txt")

To read in the intensity data, the function read.maimages is used.

 RG <- read.maimages(targets$FileName, source="spot")

The default for SPOT output is that Rmean and Gmean columns of each file are used as foreground intensities and morphR and morphG are used as background intensities. The object RG is an RGList object which contains a foreground and background intensity for each of the red and green channels for every gene (spot) on every array. To see a summary of the contents of this object, simply type its name and press enter.

 RG

To read in the .gal file, infer the slide layout, and assign this information to the RGList object, use the commands

 RG$genes <- readGAL("fish.gal")

and

 RG$printer <- getLayout(RG$genes)

2.4 Diagnostic plots and normalization using limma

Image plots. Consider image plots of the red and green background for the first array:

 imageplot(log2(RG$Rb[,1]), RG$printer, low="white", high="red")
 imageplot(log2(RG$Gb[,1]), RG$printer, low="white", high="green")

MA-plots. The M and A values can be calculated using the function normalizeWithinArrays. The option method="none" calculates raw (un-normalized) M and A values.

 MA <- normalizeWithinArrays(RG, method="none")

or equivalently

 MA <- MA.RG(RG)

Note that 'subtraction' of the background from the foreground is the default background correction method used to construct these log-ratios. Other options are possible using the backgroundCorrect function on the RGList. For example, if you don't want to background correct the data, you would use the following

 RGnobg <- backgroundCorrect(RG, method="none")

and then proceed as before to calculate the M and A values, using RGnobg instead of RG. To plot the raw M and A values for the first array, use the following command

 plotMA(MA, array=1)

By incrementing the array argument (e.g. array=2), MA plots for other slides can be generated.
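
To make the background-correction choice discussed above more concrete, the following base R sketch contrasts the "subtract" and "none" options on three made-up spots. It only illustrates the arithmetic and is not limma's implementation. Note that a spot whose background exceeds its foreground gives a negative corrected intensity, for which the logarithm is undefined.

```r
# Made-up foreground and background intensities for one channel.
fg <- c(1200, 300, 95)
bg <- c(200, 100, 100)

corrected_subtract <- fg - bg   # background subtracted from foreground
corrected_none     <- fg        # foreground used as-is

# The third spot is negative after subtraction, so its log2 value is NaN.
log_subtract <- suppressWarnings(log2(corrected_subtract))
log_none     <- log2(corrected_none)

print(data.frame(fg, bg, corrected_subtract, log_subtract, log_none))
```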

Now plot the individual MA-plots for each of the print-tip groups on this array, together with the loess curves which will be used for normalization:

 plotPrintTipLoess(MA)

Normalization. For print-tip loess normalization, use the command:

 MA <- normalizeWithinArrays(RG)

Print-tip loess is the default normalization method in normalizeWithinArrays; however, other options are possible and are specified by the method argument (for example, use method="loess" for global intensity-based loess normalization of each slide). To plot the normalized M versus A values by print-tip group, type:

 plotPrintTipLoess(MA)
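
The idea behind intensity-based loess normalization can be sketched in base R on simulated data: fit a smooth curve of M against A and subtract it, so that the normalized M-values no longer trend with intensity. This is an illustration only. It uses stats::loess with arbitrarily chosen span and degree rather than limma's own fitting routine, and the simulated dye bias is an assumption. Print-tip loess applies the same kind of fit separately within each print-tip group.

```r
set.seed(1)
n <- 500

# Simulated A-values and M-values with an intensity-dependent dye bias.
A <- runif(n, min = 6, max = 14)
M <- 0.3 * (A - 10) + rnorm(n, sd = 0.2)

# Fit a loess curve of M on A and subtract it from the raw M-values.
fit    <- loess(M ~ A, span = 0.3, degree = 1)
M_norm <- M - fitted(fit)

# Correlation between M and A before and after normalization.
print(round(c(before = cor(A, M), after = cor(A, M_norm)), 3))
```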

Exiting. Once you are finished, type q() at the R prompt, or from the R console (Windows users) choose File > Exit to quit from your R session.

Analysis of two-color cDNA microarray data will be continued in lab 2, using a different data set.

3. Acknowledgements

Thanks to Matt Ritchie and Gordon Smyth for allowing the use of material from previous microarray workshops, and to Gordon Smyth for allowing the use of his limma documentation and worked example analyses.

4. References

  1. Samartzidou, H., Turner, L., Houts, T., Frome, M., Worley, J., and Albertsen, H. (2001). Lucidea Microarray ScoreCard: An integrated analysis tool for microarray experiments. Life Science News.
  2. Smyth, G. K., Thorne, N. P., and Wettenhall, J. (2004). limma: Linear Models for Microarray Data User's Guide. The Walter and Eliza Hall Institute of Medical Research.
  3. Wettenhall, J. M., and Smyth, G. K. (2004). limmaGUI: a graphical user interface for linear modeling of microarray data. Bioinformatics 20, 3705-3706.
  4. Yang, Y. H., Buckley, M. J., Dudoit, S., and Speed, T. P. (2002). Comparison of methods for image analysis on cDNA microarray data. Journal of Computational and Graphical Statistics, 11 (1), 108-136.
