Preliminaries

Upstream analysis

This workflow assumes that upstream processing of FASTQ files has been performed using scPipe. The result of the upstream processing should be a gene_count.csv file containing gene counts and a stat folder containing various experiment summary statistics. The data in this workflow is generated from 5 types of blood cells using the CEL-Seq2 protocol, this makes use of unique molecular identifiers (UMIs) on transcript fragments and known cellular barcodes for each cell.

Read in data

We read in our data with create_sce_by_dir(), this will search the provided directory for the count matrix and scPipe statistics to generate a SingleCellExperiments object. The SingleCellExperiments object is the standard bioconductor object for storing single cell experiments. The main field accessors are counts(sce) for the raw count values, colData(sce) for the sample data and rowData(sce) for the gene data. After reading in the data we calculate additional QC metrics required by scPipe using calculate_QC_metrics(sce).

# read in data from working directory
sce <- create_sce_by_dir(
    ".",
    organism = "mmusculus_gene_ensembl",
    gene_id_type = "ensembl_gene_id"
)

# calculate additional quality control metrics
sce <- calculate_QC_metrics(sce)

## cannot connect to the ensembl database. ERROR:
## Error in textConnection(attrfilt): invalid 'text' argument
## [1] "Try to use org.Mm.eg.db."
## cannot connect to the ensembl database. ERROR:
## Error in textConnection(attrfilt): invalid 'text' argument
## [1] "Try to use org.Mm.eg.db."

# dimension of sce object (genes x samples)
dim(sce)

## [1] 19903   383

We see that there are 19903 genes detected in 383 cells in this experiment.

We also need sample annotations to give context our cells.

# read in sample annotations
sample_anno <- read.csv("sample_anno.csv")
dim(sample_anno)

## [1] 375   2

head(sample_anno)

Barcode demultiplexing statistics

It’s useful to check the barcode de-multiplexing statistics to get an initial idea of the quality of the sample.

# plot barcode demultiplexing statistics
plot_demultiplex(sce)

Here we see that around 65% of the reads matched to a barcode. We also see that around 23% of samples could be aligned but did not have a barcode match. These values can help identify sequencing failures, contamination and other problems with the experiment.

Plot mapping statistics

We check the mapping statistics to see the profiles of the individual cells in the experiment. In scPipe aligned refers to successful alignment to reference genome while mapped refers to successful mapping to an annotated feature. Therefore aligned_unmapped refers to reads that can be aligned to a position on the genome but the feature annotations do not specify a feature at the aligned position.

plot_mapping(sce, percentage = TRUE, dataname = "sample data")

Filter unannotated samples

We filter out samples that do not have any annotation and remove annotation entries that do not match any detected cells.

# keep only annotated samples by matching sample_ids
annotated_samples <- colnames(sce) %in% sample_anno$cell_id
sce <- sce[, annotated_samples]

# keep only annotation for samples captured
samples_captured <- sample_anno$cell_id %in% colnames(sce)
sample_anno <- sample_anno[samples_captured, ]

Outlier sample detection

With the large number of cells sequenced, it is inevitable that some degraded cells are sequenced. In most situations these cells are not of primary interest and should be filtered out prior to further analysis. It is possible to manually set thresholds for QC metrics to exclude samples. scPipe offers a detect_outlier() function to automatically detect outliers based on QC metrics previously calculated with calculate_QC_metrics(). using the type = "low" option labels the outliers with low QC metrics for exclusion.

In this experiment there are multiple cell types which may have different QC metric profiles. So to avoid losing an entire cell type to automatic filtering we can filter within batches but assigning the colData(sce)$batch values and setting batch = TRUE in detect_outlier. In this case we have prior knowledge of the cell types so we can use that as our batching group, otherwise some preliminary clustering can be done, for example using the quickCluster() function provided by scran can be used.

colData(sce)$batch <- factor(sample_anno$cell_type)
# automatic outlier detection based on QC statistics
sce <- detect_outlier(
    sce,
    type = "low",
    batch = TRUE
)

table(QC_metrics(sce)$outliers)

## 
## FALSE  TRUE 
##   359    15

We see that the automatic algorithm has identified 16 outliers. Below we inspect the difference between QC metrics between the cells labelled as outliers and the remaining cells. The emerald coloured values represent the outliers and the salmon coloured values represent the samples to be kept. It is important to inspect these plots as an overview of the automatic outlier detection. Across the the diagonal are density plots, on the upper triangle are correlation values and on the lower triangle are classic pairs plots. The right column shows box-plots of the QC metrics.

# plot summary statistics of outlier detection
plot_QC_pairs(sce)

The two most useful metrics are genes detected and total counts. We see that both these values are visibly lower in the outlier marked samples. As we are happy with these filtering results we can later remove these samples with remove_outliers(). Before we remove the outliers we will attach our sample annotations to the colData().

# add sample annotation to colData(sce)
# colData(sce) requires DataFrame objects but dplyr and tibble operations only
# work on data.frame objects, some conversion is required
colData(sce) <- colData(sce) %>%
    data.frame %>%
    rownames_to_column("cell_id") %>%
    DataFrame

colData(sce) <- inner_join(
    colData(sce) %>% data.frame, 
    sample_anno %>% data.frame
) %>% DataFrame

# ensure rownames of colData is the same as colnames of sce
rownames(colData(sce)) <- colData(sce)$cell_id
row_ord <- match(colnames(sce), rownames(colData(sce)))
colData(sce) <- colData(sce)[row_ord, ]

# remove outlier samples
sce <- remove_outliers(sce)
dim(sce)

## [1] 19903   359

At this point we have just under 20,000 genes and 358 cells.

Gene filtering

Up to this point our genes have been encoded in EMSEMBL IDs, gene symbols are usually more informative for researchers and so we convert the rownames of our counts matrix into gene symbols. We then use calculateQCMetrics() to compute scater specific QC metrics.

# convert rownames from ENSEMBL IDs to gene symbols
sce <- convert_geneid(sce)

## cannot connect to the ensembl database. ERROR:
## Error in textConnection(attrfilt): invalid 'text' argument
## [1] "Try to use org.Mm.eg.db."
## [1] "Number of NA in new gene id: 7096. Duplicated id: 8.5"
## [1] "First 5 duplicated:"
##      description    ensembl_gene_id external_gene_name
## 505           NA ENSMUSG00000057130             Txnl4a
## 1626          NA ENSMUSG00000090093            Gm14391
## 3892          NA ENSMUSG00000036199            Ndufa13
## 5255          NA ENSMUSG00000092335             Zfp976
## 5354          NA ENSMUSG00000078903            Gm14391
## 6521          NA ENSMUSG00000062593            Lilrb4a

# calculate scater QC metrics
sce <- calculateQCMetrics(
    sce,
    feature_controls = list(ERCC = str_detect(rownames(sce), "^ERCC"))
)

It’s often informative to check the most highly expressed genes, this can be easily compared with reference datasets and provide additional diagnostic insight. Below we see high expression of genes related to red blood cells.

# plot 20 highest expression genes
plotQC(sce, type = "highest-expression", n = 20)

We filter by the number of samples a gene can be detected in and the average counts of the gene within samples where it is expressed. Taking the average only within expressed samples is important as not to remove informative genes that are well expressed only in a small population. Here we have set the thresholds

Greater than 1 average count within expressed samples
Greater than 2 samples expressed in

Read depth in this dataset is not ideal so we try to set less stringent filters. These filter values should be reconsidered on an experiment by experiment basis.

# take average of non-zero counts
avg_within_expressed <- apply(counts(sce), 1, function(x) mean(x[x > 0]))
# take the number of samples gene is expressed in
expressed_samples <- rowSums(counts(sce) > 0)

# filter out lowly expressed genes
keep <- (avg_within_expressed > 1) & (expressed_samples > 2)
sce <- sce[keep, ]
dim(sce)

## [1] 11459   359

Data normalisation

Data normalisation is necessary to make meaningful statistical comparisons. A number of technical factors introduce uninteresting variation into the data, one of the primary factors being sequencing depth.

Computing size factors

We use a scaling approach to normalise our libraries, size factors are calculated for quantile normalisation, this is more robust than simple library size normalisation. Since spike-in molecules have different behaviour to endogenous genes, they have a separate set of size factors which are not for general use.

# compute size factors separately for endegenous features and spike-ins
sce <- computeSumFactors(sce)
sce <- computeSpikeFactors(sce, general.use = FALSE)

# assign normalised expression to exprs(sce)
sce <- normalize(sce)
dim(exprs(sce))

## [1] 11459   359

Analysis of highly variable genes

In an exploratory analysis we are usually interested in highly variable genes within the whole dataset. These are the genes that vary greatly between cells, some of these will be genes that are naturally highly variable, but some of these genes will be variable because they are differentially expressed between cell types.

Detect highly variable genes

We detect highly variable genes using the trendVar() and decomposeVar() functions from scran. This estimates variances for each gene then decomposes that variance into technical and biological components. The biological component is the one of primary interest.

# estimate variance then decompose into technical and biological components
fit <- trendVar(sce, parametric = TRUE)
decomp <- decomposeVar(sce, fit) %>%
    data.frame() %>%
    rownames_to_column("GeneID") %>%
    arrange(desc(bio))

Our table of decomposed variances contains an FDR for the test that the biological variance is 0. We choose the highly variable genes as those that are statistically significant at the 0.05 level and have biological variance greater than or equal to 0.5. We plot the mean log expression against the variance, with blue points indicating those chosen as highly variable and a red trend line representing the technical variation. The red points are the ERCC spike-ins.

# plot all genes in black
plot(
    decomp$mean,
    decomp$total,
    xlab = "Mean log-expression",
    ylab = "Variance",
    pch = 20,
    cex = 0.4
)
o <- order(decomp$mean)
hvg <- which(decomp$FDR <= 0.05 & decomp$bio >= 0.5)
# plot highly variable genes in blue
points(decomp$mean[hvg], decomp$total[hvg], col = "blue", pch = 20, cex = 0.8)
# plot spike-ins in red and draw trend
lines(decomp$mean[o], decomp$tech[o], col = "red", lwd = 1)
points(fit$mean, fit$var, col = "red", pch = 20, cex = 0.8)

The expression of ERCC spike-ins should be quite consistent across the samples, this is reflected in the plot as the red points have lower variance than endogenous genes.

Heatmap of most variable genes

Heat-maps are useful ways to visualise patterns in gene expression across many samples. Genes are hierarchically clustered along the rows and samples clustered across the columns. Blocks of co-expressed genes and samples with similar expression can be identified and help infer the type of cells in the experiment.

# plot heatmap of most variable genes
hvg_ids <- decomp$GeneID[hvg]
gene_exp <- exprs(sce)
gene_exp <- gene_exp[hvg_ids,]
hc_rows <- hclust(dist(gene_exp))
hc_cols <- hclust(dist(t(gene_exp)))
gene_exp <- gene_exp[hc_rows$order, hc_cols$order]

margin <- list(
  l = 100,
  r = 40,
  b = 10,
  t = 10,
  pad = 0
)

plot_ly(
    x = colnames(gene_exp),
    y = rownames(gene_exp),
    z = gene_exp,
    type = "heatmap"
) %>%
    layout(autosize = F, margin = margin)

PCA using highly variable genes

The goal of principal component analysis (PCA) plots is to transform the high-dimensional data (each gene is another dimension) into a low number of dimensions (two in this example) that contain as much of the variation as possible. In PCA plots the distances between samples are preserved. For multiple cell types it is common to see a few cell types fail to separate out in the first two dimensions and require investigation of the higher PCA dimensions. We plot our PCA plots using only the highly variable genes, as they are the most likely to behave differentially across cell types.

plotPCA(
    sce,
    colour_by = "cell_type",
    run_args = list(
        exprs_values = "logcounts",
        feature_set = hvg_ids
    )
)

tSNE using highly variable genes

t-Distributed Stochastic Neighbour Embedding (tSNE) plots are a very popular method for dimensionality reduction of single cell data. It is a dimensionality reduction technique that preserves neighbourhoods rather than distances. tSNE plots are able to fully reduce the high dimensional data into two dimensions, such that all significant differences are represented in the clustering of points. However tSNE plots are pseudo-random and also make use of a perplexity parameter. For reproducibility, set.seed() should always be run before tSNE to set the random seed. Multiple tSNE plots should also be generated over different perplexities in order to observe the stability of the clusters. Once again we run this using only our highly variable genes.

set.seed(100)
out5 <- plotTSNE(
    sce,
    colour_by = "cell_type",
    run_args = list(
        exprs_values = "logcounts",
        perplexity = 5,
        feature_set = hvg_ids
    )
) + ggtitle("Perplexity = 5")

out10 <- plotTSNE(
    sce,
    colour_by = "cell_type",
    run_args = list(
        exprs_values = "logcounts",
        perplexity = 5,
        feature_set = hvg_ids
    )
) + ggtitle("Perplexity = 10")

out20 <- plotTSNE(
    sce,
    colour_by = "cell_type",
    run_args = list(
        exprs_values = "logcounts",
        perplexity = 5,
        feature_set = hvg_ids
    )
) + ggtitle("Perplexity = 20")

out30 <- plotTSNE(
    sce,
    colour_by = "cell_type",
    run_args = list(
        exprs_values = "logcounts",
        perplexity = 5,
        feature_set = hvg_ids
    )
) + ggtitle("Perplexity = 30")

multiplot(out5, out10, out20, out30, cols = 2)

We see that tSNE does a very good job are separating out our known cell types.

Clustering using SC3

Often we do not know our cell types a priori, or we find that one of our cell types has split into multiple clusters with distinct genetic signatures. We would want to investigate the differences between these clusters. We can computationally assign clusters using SC3 which uses consensus of k-means clusters. Since k-means is pseudo-random, we set the seed before running the algorithm.

set.seed(100)
rowData(sce)$feature_symbol <- rownames(sce)
sce <- sc3(sce, 5, n_cores = 4)

We visualise the results of the clustering by colouring the tSNE plot based on the 5 cluster assignments made by SC3.

set.seed(100)
p1 <- plotTSNE(
    sce,
    colour_by = "sc3_5_clusters",
    run_args = list(
        exprs_values = "logcounts",
        perplexity = 10,
        feature_set = hvg_ids
    )
) + ggtitle("tSNE Coloured by SC3 Clusters")

set.seed(100)
p2 <- plotTSNE(
    sce,
    colour_by = "cell_type",
    run_args = list(
        exprs_values = "logcounts",
        perplexity = 10,
        feature_set = hvg_ids
    )
) + ggtitle("tSNE Coloured by Known Cell Type")

multiplot(p1, p2, cols = 1)

We can see that the SC3 assigned clustering matches up with the known cell types. These assigned cluster indices help with downstream differential expression analysis.

Differential Expression

For differential expression we will use edgeR’s glmFit() to fit a generalised linear model. The design matrix can be generated from the SC3 clusters.

edgeR does not use SingleCellExperiment but requires a DGEList. scran provides a convertTo() function for converting the formats. Make sure to set row.fields = TRUE and col.fields = TRUE such that the colData and rowData are transferred into the dge$samples and dge$genes fields respectively.

# convert SingleCellExperiment object to DGEList for edgeR modelling
dge <- convertTo(sce, type = "edgeR", row.fields = TRUE, col.fields = TRUE)

Comparing between clusters

We proceed as if we did not know about our cell types. Our clusters have identified samples that have similar expression profiles, we should then compare these clusters for differential expression to identify marker genes that will help us identify cell identities and/or discover new cell types.

# use no-intercept design
design_matrix <- model.matrix(~0 + cell_type, data = dge$samples)
colnames(design_matrix) %<>% str_remove("cell_type")

head(design_matrix)

##     B_Lymph EryB Grans LSK T_Lymph
## A10       1    0     0   0       0
## A11       1    0     0   0       0
## A12       1    0     0   0       0
## A13       1    0     0   0       0
## A14       1    0     0   0       0
## A15       1    0     0   0       0

In contrast to a standard edgeR pipeline we do not run calcNormFactors as we have already calculated size factors from scran which were transferred when running convertTo. The scran method for size factors is more robust to gene drop-out.

# fit glm using edgeR
# calcNormFactors is skipped because we already have size factors from scran
dge <- estimateDisp(dge, design = design_matrix, robust = TRUE)
fit <- glmFit(dge, design = design_matrix)

For the contrast we have a few choices, most commonly we could perform pair-wise comparisons between our clusters, alternatively one could compare one cluster to the average of the remaining clusters. Here for demonstration purposes we simply compare clusters 4 and 5 of the SC3 clusters.

# compare cluster 5 vs cluster 4 with likelihood ratio test
contrast <- makeContrasts("B_Lymph - T_Lymph", 
                          levels = design_matrix)
lrt <- glmLRT(fit, contrast = contrast)

plotMD(lrt, main = "B_Lymph vs T_Lymph")

# summarise test results and sort by FDR
top <- topTags(lrt, n = Inf)
# move Symbol from rownames to a column
top$table <- top$table %>%
    data.frame() %>%
    rownames_to_column("Symbol")

# print top 1000 most DE genes sorted by FDR
top$table %>% 
    select(Symbol, logFC, logCPM, LR, PValue, FDR) %>%
    slice(1:1000)

With the SingleCellExperiment and DGEList objects a variety of other analyses can be performed.

Appendix

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.3
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2              plotly_4.7.1               
##  [3] tibble_1.4.2                dplyr_0.7.4                
##  [5] Glimma_1.8.0                edgeR_3.22.0               
##  [7] limma_3.36.0                SC3_1.8.0                  
##  [9] scran_1.8.0                 scater_1.8.0               
## [11] scPipe_1.3.3                SingleCellExperiment_1.2.0 
## [13] SummarizedExperiment_1.10.0 DelayedArray_0.6.0         
## [15] BiocParallel_1.14.0         matrixStats_0.53.1         
## [17] Biobase_2.40.0              GenomicRanges_1.32.0       
## [19] GenomeInfoDb_1.16.0         IRanges_2.14.1             
## [21] S4Vectors_0.18.1            BiocGenerics_0.26.0        
## [23] ggplot2_2.2.1               stringr_1.3.0              
## [25] magrittr_1.5               
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.1.2          plyr_1.8.4              
##   [3] igraph_1.2.1             lazyeval_0.2.1          
##   [5] splines_3.5.0            shinydashboard_0.7.0    
##   [7] crosstalk_1.0.0          digest_0.6.15           
##   [9] foreach_1.4.4            htmltools_0.3.6         
##  [11] viridis_0.5.1            gdata_2.18.0            
##  [13] memoise_1.1.0            cluster_2.0.7-1         
##  [15] doParallel_1.0.11        ROCR_1.0-7              
##  [17] prettyunits_1.0.2        colorspace_1.3-2        
##  [19] blob_1.1.1               rrcov_1.4-3             
##  [21] WriteXLS_4.0.0           RCurl_1.95-4.10         
##  [23] jsonlite_1.5             tximport_1.8.0          
##  [25] org.Mm.eg.db_3.6.0       bindr_0.1.1             
##  [27] iterators_1.0.9          glue_1.2.0              
##  [29] registry_0.5             gtable_0.2.0            
##  [31] zlibbioc_1.26.0          XVector_0.20.0          
##  [33] Rhdf5lib_1.2.0           DEoptimR_1.0-8          
##  [35] scales_0.5.0             pheatmap_1.0.8          
##  [37] mvtnorm_1.0-7            DBI_1.0.0               
##  [39] GGally_1.3.2             rngtools_1.2.4          
##  [41] Rcpp_0.12.16             viridisLite_0.3.0       
##  [43] xtable_1.8-2             progress_1.1.2          
##  [45] bit_1.1-12               mclust_5.4              
##  [47] DT_0.4                   htmlwidgets_1.2         
##  [49] httr_1.3.1               FNN_1.1                 
##  [51] gplots_3.0.1             RColorBrewer_1.1-2      
##  [53] pkgconfig_2.0.1          reshape_0.8.7           
##  [55] XML_3.98-1.11            locfit_1.5-9.1          
##  [57] dynamicTreeCut_1.63-1    labeling_0.3            
##  [59] rlang_0.2.0              reshape2_1.4.3          
##  [61] later_0.7.2              AnnotationDbi_1.42.0    
##  [63] munsell_0.4.3            tools_3.5.0             
##  [65] RSQLite_2.1.0            evaluate_0.10.1         
##  [67] yaml_2.1.19              org.Hs.eg.db_3.6.0      
##  [69] knitr_1.20               bit64_0.9-7             
##  [71] robustbase_0.93-0        caTools_1.17.1          
##  [73] purrr_0.2.4              doRNG_1.6.6             
##  [75] mime_0.5                 biomaRt_2.36.0          
##  [77] compiler_3.5.0           beeswarm_0.2.3          
##  [79] Rhtslib_1.12.0           e1071_1.6-8             
##  [81] statmod_1.4.30           pcaPP_1.9-73            
##  [83] stringi_1.2.2            lattice_0.20-35         
##  [85] Matrix_1.2-14            pillar_1.2.2            
##  [87] data.table_1.11.0        bitops_1.0-6            
##  [89] httpuv_1.4.1             R6_2.2.2                
##  [91] promises_1.0.1           KernSmooth_2.23-15      
##  [93] gridExtra_2.3            vipor_0.4.5             
##  [95] codetools_0.2-15         MASS_7.3-50             
##  [97] gtools_3.5.0             assertthat_0.2.0        
##  [99] rhdf5_2.24.0             pkgmaker_0.22           
## [101] rprojroot_1.3-2          rjson_0.2.15            
## [103] GenomeInfoDbData_1.1.0   grid_3.5.0              
## [105] tidyr_0.8.0              class_7.3-14            
## [107] rmarkdown_1.9            DelayedMatrixStats_1.2.0
## [109] Rtsne_0.13               shiny_1.0.5             
## [111] ggbeeswarm_0.6.0

Single Cell Workflow

Shian Su

16 May 2018