Supplementary Information for
voom: precision weights unlock linear model analysis tools for RNA-seq read counts

Charity W Law^1,2, Yunshun Chen^1,2, Wei Shi^1,3 and Gordon K. Smyth^1,4

1. Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research.
2. Department of Medical Biology,
3. Department of Computing and Information Systems and
4. Department of Mathematics and Statistics, The University of Melbourne.

Genome Biology 2014, 15:R29
http://genomebiology.com/2014/15/2/R29
Preprint with corrections (6 June 2016)

Summarized counts for SEQC Pilot Data

This page provides the SEQC summarized count data used for the evaluations in the article by Law et al (http://genomebiology.com/2014/15/2/R29). The data used in the paper are from an early pilot part of the SEQC project. The full SEQC data set, including this pilot data, will eventually become publicly available as GEO series GSE47792.

The pilot SEQC dataset consisted of 16 RNA-seq libraries in four groups, labeled A-D. A total of 92 artificial control transcripts were spiked in at different concentrations in such a way that three quarters of the transcripts were truly DE and the remaining quarter were not. An Illumina HiSeq 2000 was used to create a FastQ file of paired-end sequence reads for each sample. The library size for each sample varied from 5.4 to 8.0 million read pairs.

To obtain gene-level counts, fragments were mapped to NCBI Build 37.2 of the human genome using the Subread aligner, and fragment counts were summarized by Entrez Gene ID using the featureCounts function of the Bioconductor package Rsubread:

Read counts summarized by Entrez Gene ID (1.52Mb, comma separated text file)

To obtain read counts for the spike-in control transcripts, reads were mapped to the control sequences:

Read counts for ERCC spike-in transcripts (5.29Kb, comma separated text file)
Annotation for ERCC spike-in transcripts (3.45Kb, comma separated text file)
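
As a rough illustration of how a comma separated count file like those listed above might be loaded in R, the sketch below writes a small toy file and reads it back. The file contents, the column names (A1, A2, B1, B2) and the assumption that the first column holds the gene identifiers are all hypothetical; the actual downloads may be laid out differently.

```r
# Toy stand-in for a count file. The layout is an assumption, not the
# documented format of the SEQC downloads.
toy_file <- tempfile(fileext = ".csv")
writeLines(c(
  "EntrezID,A1,A2,B1,B2",
  "1,10,12,30,28",
  "2,0,1,0,2",
  "9,150,140,160,155"
), toy_file)

# Read the table, using the first column as row names and keeping the
# sample names exactly as written in the header.
counts_df <- read.csv(toy_file, row.names = 1, check.names = FALSE)
counts <- as.matrix(counts_df)

# Library size per sample is the column total of the count matrix.
lib_size <- colSums(counts)

# Derive the group label from the first character of each sample name
# (assumes names such as "A1"; adjust for the real column names).
group <- factor(substr(colnames(counts), 1, 1))
```

With a real file, replacing toy_file by the path of the downloaded file should be the only change needed, provided the layout matches the assumptions above.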

Summarized counts for the Drosophila melanogaster Data

This page provides the summarized Drosophila melanogaster count data used for the evaluations in the article by Law et al (http://genomebiology.com/2014/15/2/R29). The original data were from Graveley et al (https://www.nature.com/articles/nature09715). Technical replicates were pooled together using the ReCount online resource http://bowtie-bio.sourceforge.net/recount.

Only the 12 pooled RNA-seq libraries from the embryonic stages are used in the paper. The table of read counts and the sample information are available below:

Read counts and sample information (760Kb, tar.gz file)

Simulation R code

The simulation R code is provided below. This code was used to generate Figures 3, 4 and 5 in the paper. By modifying the settings at the beginning of the code, users can specify the number of simulations, whether library sizes are equal or unequal, and whether dispersions follow an inverse chi-square or log-normal distribution.
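
To give a feel for what settings of this kind control, here is a minimal base R sketch that simulates negative binomial counts under each option. It is not taken from Simulation_Simple.R or Simulation_Full.R: the setting names (n_sim, equal_libsize, dispersion_type), the library sizes, the baseline expression distribution and the dispersion parameters are all assumptions chosen for illustration.

```r
# Hypothetical settings, loosely mirroring the options described above.
n_sim <- 2                      # number of simulated data sets
equal_libsize <- FALSE          # equal or unequal library sizes
dispersion_type <- "invchisq"   # "invchisq" or "lognormal"

n_genes <- 1000
n_samples <- 6
set.seed(2014)

simulate_counts <- function(n_genes, n_samples, equal_libsize, dispersion_type) {
  # Library sizes: constant, or alternating small and large (assumed values).
  lib_size <- if (equal_libsize) {
    rep(11e6, n_samples)
  } else {
    rep(c(2e6, 20e6), length.out = n_samples)
  }

  # Baseline expression proportions, normalised to sum to 1 (assumed distribution).
  prop <- rexp(n_genes)
  prop <- prop / sum(prop)
  mu <- outer(prop, lib_size)   # expected counts, genes x samples

  # Gene-wise dispersion multiplier with mean close to 1 (assumed parameters).
  df <- 40
  mult <- switch(dispersion_type,
    invchisq  = df / rchisq(n_genes, df = df),
    lognormal = rlnorm(n_genes, meanlog = 0, sdlog = 0.25),
    stop("unknown dispersion_type")
  )
  dispersion <- 0.04 * mult     # baseline dispersion 0.04, i.e. BCV 0.2 (assumed)

  # Negative binomial counts; rnbinom's size parameter is 1 / dispersion.
  # The gene-wise size vector recycles down each column of mu.
  counts <- matrix(
    rnbinom(n_genes * n_samples, mu = mu, size = 1 / dispersion),
    nrow = n_genes, ncol = n_samples
  )
  list(counts = counts, lib_size = lib_size, dispersion = dispersion)
}

# One simulated data set per requested simulation.
sims <- lapply(seq_len(n_sim), function(i) {
  simulate_counts(n_genes, n_samples, equal_libsize, dispersion_type)
})
```

Switching dispersion_type to "lognormal" or equal_libsize to TRUE changes only the corresponding branch, which is the kind of toggle the settings block in the provided scripts is described as offering.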

Note that some methods take a long time to run. Hence, we provide a shorter version of the simulation code that runs a smaller number of simulations and includes the voom, limma and edgeR methods only.

To obtain a simple version of the simulation code:

Simulation_Simple.R

To obtain the full version of the simulation code:

Simulation_Full.R

Comments/Questions? Contact smyth@wehi.edu.au.
Last modified: 16 February 2014
