Single Cell Atlas Spleen Data

After running scPipe and Rsubread we get the gene count matrix gene_counts.csv and mapping statistics folder stat. By putting these two into a folder we can construct a SingleCellExperiment using create_sce_by_dir() from scPipe.

# read in data from working directory
sce <- create_sce_by_dir(
    "data",
    organism = "hsapiens_gene_ensembl",
    gene_id_type = "ensembl_gene_id"
)

# calculate additional quality control metrics
sce <- calculate_QC_metrics(sce)

## [1] "no ERCC spike-in. Skip `non_ERCC_percent`"

# dimension of sce object (genes x samples)
dim(sce)

## [1] 28719  2273

We can quickly visualised some mapping statistics with regards to barcode matches and feature alignment rates.

# plot barcode demultiplexing statistics
plot_demultiplex(sce)

plot_mapping(sce, percentage = FALSE, dataname = "Spleen data")

plot_mapping(sce, percentage = TRUE, dataname = "Spleen data")

Low quality cells can be removed using detect_outlier(), this clusters the samples based on various mapping and alignment statistics, then removing the cluster(s) with undesireable qualities.

# automatic outlier detection based on QC statistics
sce <- detect_outlier(
    sce,
    comp = 2,
    type = "low"
)

cat("Samples detected as outlier:\n")

## Samples detected as outlier:

# check how many are outliers
table(QC_metrics(sce)$outliers)

## 
## FALSE  TRUE 
##  1347   926

We can use plot_QC_pairs() to summarise the distribution of quality control metrics between outliers and non-outliers. After that remove_outliers() removes the detected outliers from our SingleCellExperiment() object from downstream analysis.

plot_QC_pairs(sce)

# remove outlier samples
sce <- remove_outliers(sce)
dim(sce)

## [1] 28719  1347

Because SingleCellExperiments is the standard Bioconductor object for storing single-cell sequencing output, it is used by other useful analysis packages designed for single cells. Here we pass our SingleCellExperiment object sce into scater for calculation of more quality control metrics. From scater we can plot the most expressed genes to get a quick overview of our data.

# convert rownames from ENSEMBL IDs to gene symbols
rowData(sce)$GeneID <- rownames(sce)
sce <- convert_geneid(sce)

## [1] "Number of NA in new gene id: 60. Duplicated id: 172"
## [1] "First 5 duplicated:"
##     ensembl_gene_id external_gene_name description
## 260 ENSG00000238560            RF00019            
## 107 ENSG00000200448            RF00019            
## 450 ENSG00000275504            RF00066            
## 311 ENSG00000253060            RF00601            
## 140 ENSG00000212458            RF00554            
## 102 ENSG00000199461            RF00019

# calculate scater QC metrics
sce <- calculateQCMetrics(
    sce,
    feature_controls = list(ERCC = str_detect(rownames(sce), "^ERCC"))
)

# plot 20 highest expression genes
plotQC(sce, type = "highest-expression", n = 20)

We filter out some lowly expressed genes before moving onto further analysis. The sce object can then be passed into computeSumFactors from scran to calculated normalisation factors so that normalised expression values can be computed.

# take the number of samples gene is expressed in
expressed_samples <- rowSums(counts(sce) > 0)

# filter out lowly expressed genes
keep <- expressed_samples > 30
sce <- sce[keep, ]
dim(sce)

## [1] 12190  1347

# compute size factors
sce <- computeSumFactors(sce, positive = TRUE)

# assign normalised expression to exprs(sce)
sce <- normalize(sce)

We can also use variance modelling from scran to estimate biological and technical variances of each gene. Here we will plot the trend of biological variance and highlight genes that have high biological variance. We also plot a heatmap of the highly variable genes across the samples.

# estimate variance then decompose into technical and biological components
fit <- trendVar(sce, parametric = TRUE, use.spikes = FALSE)
decomp <- decomposeVar(sce, fit) %>%
    data.frame() %>%
    rownames_to_column("GeneID") %>%
    arrange(desc(bio))

# plot all genes in black
plot(
    decomp$mean,
    decomp$total,
    xlab = "Mean log-expression",
    ylab = "Variance",
    pch = 20,
    cex = 0.4
)
o <- order(decomp$mean)
hvg <- which(decomp$FDR <= 0.05 & decomp$bio >= 0.5)

# plot highly variable genes in blue
points(decomp$mean[hvg], decomp$total[hvg], col = "blue", pch = 20, cex = 0.8)

# plot spike-ins in red and draw trend
lines(decomp$mean[o], decomp$tech[o], col = "red", lwd = 1)

# plot heatmap of most variable genes
hvg_ids <- decomp$GeneID[hvg]
gene_exp <- exprs(sce)
gene_exp <- gene_exp[hvg_ids,]
hc_rows <- hclust(dist(gene_exp))
hc_cols <- hclust(dist(t(gene_exp)))
gene_exp <- gene_exp[hc_rows$order, hc_cols$order]

margin <- list(
  l = 100,
  r = 40,
  b = 10,
  t = 10,
  pad = 0
)

plot_ly(
    x = colnames(gene_exp),
    y = rownames(gene_exp),
    z = gene_exp,
    type = "heatmap"
) %>%
    layout(autosize = F, margin = margin)

We can use plotting functions from scater to produce dimensionality reduction visualisations of the data. This gives a rough idea of the grouping of the heterogenous groups of cells in the sample.

# plot PCA of data
plotPCA(
    sce,
    ncomponents = 3,
    run_args = list(
        exprs_values = "logcounts"
    )
)

# plot tSNE of data
set.seed(100)
plotTSNE(
    sce,
    ncomponents = 3,
    run_args = list(
        exprs_values = "logcounts",
        perplexity = 30
    )
) + ggtitle("Perplexity = 30")

We can assign clusters to the samples using SC3 which also makes use of the SingleCellExperiment object. This estimates optimal cluster number and performs consensus based clustering to assign a cluster id to each sample.

set.seed(100)
rowData(sce)$feature_symbol <- rownames(sce)
sce <- sc3_estimate_k(sce)
k_estimate <- metadata(sce)$sc3$k_estimation
sc3_cluster_str <- sprintf("sc3_%d_clusters", k_estimate)
sce <- sc3(
    sce,
    ks = k_estimate,
    n_cores = 1
)

There currently lacks a robust computational method for inferring the identity of the sequenced cells, here we shall use an ad-hoc method to infer cellular identity by correlation against a microarray dataset that profiled a range of immunological lineages. We assign a cell to the immunological lineage with the greatest correlation with the cells. Two approaches are used, one to correlate cells individually and one to correlate the mean expression of SC3 assigned clusters.

immune_lineage_exprs <- read.csv(
    "data/dmap_immune_signatures/dmap_immune_lineage_exprs.csv",
    row.names = 1
)
lineages <- colnames(immune_lineage_exprs)

# predict individual cell types
common_genes <- intersect(
    rowData(sce)$GeneID,
    rownames(immune_lineage_exprs)
)
gene_ids <- rowData(sce)$GeneID
sc_exprs <- exprs(sce)[gene_ids %in% common_genes, ]
gmap_exprs <- immune_lineage_exprs[common_genes, ]
signature_cor <- cor(sc_exprs, gmap_exprs, method = "spearman")

sc_type_pred <- apply(signature_cor, 1,
    function(x) { lineages[x == max(x)] }
)

# predict cluster_type
cluster_ids <- colData(sce)[, sc3_cluster_str] %>% unique %>% sort()
counts <- counts(sce)
rownames(counts) <- rowData(sce)$GeneID
cluster_exprs <- lapply(cluster_ids,
    function(x) { counts[, colData(sce)[[sc3_cluster_str]] == x, drop = FALSE] })

avg_clust_exprs <- lapply(cluster_exprs,
    function(x) { rowMeans(x) }) %>%
    do.call(cbind, .)
colnames(avg_clust_exprs) <- cluster_ids

common_genes <- intersect(
    rownames(avg_clust_exprs), 
    rownames(immune_lineage_exprs)
)

signature_cor <- cor(
    avg_clust_exprs[common_genes, ], 
    immune_lineage_exprs[common_genes, ], 
    method = "spearman"
)

lineages <- colnames(signature_cor)
clust_type_pred <- apply(signature_cor, 1,
    function(x) { lineages[x == max(x)] }
)

clust_type_pred <- clust_type_pred[colData(sce)[[sc3_cluster_str]]]

colData(sce)$sc_cell_type <- sc_type_pred
colData(sce)$clust_cell_type <- clust_type_pred

set.seed(100)

plotTSNE(
    sce,
    ncomponents = 2,
    colour_by = sc3_cluster_str
)

set.seed(100)
tsne_out <- Rtsne(t(logcounts(sce)))

sc_cell_type <- factor(colData(sce)$sc_cell_type, levels = lineages)
clust_cell_type <- factor(colData(sce)$clust_cell_type, levels = lineages)
manual_cols <- c(
    RColorBrewer::brewer.pal(8, "Set2"),
    RColorBrewer::brewer.pal(4, "Set1") 
)
names(manual_cols) <- lineages

plot_data <- data.frame(
        dim1 = tsne_out$Y[, 1],
        dim2 = tsne_out$Y[, 2],
        sc_cell_type = sc_cell_type,
        clust_cell_type = clust_cell_type
    )

p <- ggplot(plot_data, aes(x = dim1, y = dim2, col = sc_cell_type)) +
    geom_point() +
    scale_colour_manual(values = manual_cols) + 
    theme_bw() +
    coord_fixed(ratio = 3/4) +
    ggtitle("Single Cell Type Assignment")
print(p)

p <- ggplot(plot_data, aes(x = dim1, y = dim2, col = clust_cell_type)) +
    geom_point() +
    scale_colour_manual(values = manual_cols) + 
    theme_bw() + 
    coord_fixed(ratio = 3/4) +
    ggtitle("Whole Cluster Type Assignment")
print(p)

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.3
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2              plotly_4.7.1               
##  [3] tibble_1.4.2                Rtsne_0.13                 
##  [5] dplyr_0.7.4                 Glimma_1.8.0               
##  [7] edgeR_3.22.0                limma_3.36.0               
##  [9] SC3_1.8.0                   scran_1.8.0                
## [11] scater_1.8.0                scPipe_1.3.3               
## [13] SingleCellExperiment_1.2.0  SummarizedExperiment_1.10.0
## [15] DelayedArray_0.6.0          BiocParallel_1.14.0        
## [17] matrixStats_0.53.1          Biobase_2.40.0             
## [19] GenomicRanges_1.32.0        GenomeInfoDb_1.16.0        
## [21] IRanges_2.14.1              S4Vectors_0.18.1           
## [23] BiocGenerics_0.26.0         ggplot2_2.2.1              
## [25] stringr_1.3.0               magrittr_1.5               
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.1.2          plyr_1.8.4              
##   [3] igraph_1.2.1             lazyeval_0.2.1          
##   [5] shinydashboard_0.7.0     crosstalk_1.0.0         
##   [7] digest_0.6.15            foreach_1.4.4           
##   [9] htmltools_0.3.6          viridis_0.5.1           
##  [11] gdata_2.18.0             memoise_1.1.0           
##  [13] cluster_2.0.7-1          doParallel_1.0.11       
##  [15] ROCR_1.0-7               limSolve_1.5.5.3        
##  [17] lpSolve_5.6.13           prettyunits_1.0.2       
##  [19] colorspace_1.3-2         blob_1.1.1              
##  [21] rrcov_1.4-3              WriteXLS_4.0.0          
##  [23] RCurl_1.95-4.10          jsonlite_1.5            
##  [25] tximport_1.8.0           org.Mm.eg.db_3.6.0      
##  [27] bindr_0.1.1              iterators_1.0.9         
##  [29] glue_1.2.0               registry_0.5            
##  [31] gtable_0.2.0             zlibbioc_1.26.0         
##  [33] XVector_0.20.0           Rhdf5lib_1.2.0          
##  [35] DEoptimR_1.0-8           scales_0.5.0            
##  [37] pheatmap_1.0.8           mvtnorm_1.0-7           
##  [39] DBI_1.0.0                GGally_1.3.2            
##  [41] rngtools_1.2.4           Rcpp_0.12.16            
##  [43] viridisLite_0.3.0        xtable_1.8-2            
##  [45] progress_1.1.2           bit_1.1-12              
##  [47] mclust_5.4               DT_0.4                  
##  [49] htmlwidgets_1.2          httr_1.3.1              
##  [51] FNN_1.1                  gplots_3.0.1            
##  [53] RColorBrewer_1.1-2       pkgconfig_2.0.1         
##  [55] reshape_0.8.7            XML_3.98-1.11           
##  [57] locfit_1.5-9.1           dynamicTreeCut_1.63-1   
##  [59] labeling_0.3             rlang_0.2.0             
##  [61] reshape2_1.4.3           later_0.7.2             
##  [63] AnnotationDbi_1.42.0     munsell_0.4.3           
##  [65] tools_3.5.0              RSQLite_2.1.0           
##  [67] evaluate_0.10.1          yaml_2.1.19             
##  [69] org.Hs.eg.db_3.6.0       knitr_1.20              
##  [71] bit64_0.9-7              robustbase_0.93-0       
##  [73] caTools_1.17.1           purrr_0.2.4             
##  [75] doRNG_1.6.6              mime_0.5                
##  [77] biomaRt_2.36.0           compiler_3.5.0          
##  [79] beeswarm_0.2.3           Rhtslib_1.12.0          
##  [81] curl_3.2                 e1071_1.6-8             
##  [83] statmod_1.4.30           pcaPP_1.9-73            
##  [85] stringi_1.2.2            lattice_0.20-35         
##  [87] Matrix_1.2-14            pillar_1.2.2            
##  [89] data.table_1.11.0        bitops_1.0-6            
##  [91] httpuv_1.4.1             R6_2.2.2                
##  [93] promises_1.0.1           KernSmooth_2.23-15      
##  [95] gridExtra_2.3            vipor_0.4.5             
##  [97] codetools_0.2-15         MASS_7.3-50             
##  [99] gtools_3.5.0             assertthat_0.2.0        
## [101] rhdf5_2.24.0             pkgmaker_0.22           
## [103] rprojroot_1.3-2          rjson_0.2.15            
## [105] GenomeInfoDbData_1.1.0   quadprog_1.5-5          
## [107] grid_3.5.0               tidyr_0.8.0             
## [109] class_7.3-14             rmarkdown_1.9           
## [111] DelayedMatrixStats_1.2.0 shiny_1.0.5             
## [113] ggbeeswarm_0.6.0

Single Cell Atlas Spleen Data

Shian Su

18/05/2018