After running scPipe
and Rsubread
we get the gene count matrix gene_counts.csv
and mapping statistics folder stat
. By putting these two into a folder we can construct a SingleCellExperiment using create_sce_by_dir()
from scPipe
.
# read in data from working directory
sce <- create_sce_by_dir(
"data",
organism = "hsapiens_gene_ensembl",
gene_id_type = "ensembl_gene_id"
)
# calculate additional quality control metrics
sce <- calculate_QC_metrics(sce)
## [1] "no ERCC spike-in. Skip `non_ERCC_percent`"
# dimension of sce object (genes x samples)
dim(sce)
## [1] 28719 2273
We can quickly visualised some mapping statistics with regards to barcode matches and feature alignment rates.
# plot barcode demultiplexing statistics
plot_demultiplex(sce)
plot_mapping(sce, percentage = FALSE, dataname = "Spleen data")
plot_mapping(sce, percentage = TRUE, dataname = "Spleen data")
Low quality cells can be removed using detect_outlier()
, this clusters the samples based on various mapping and alignment statistics, then removing the cluster(s) with undesireable qualities.
# automatic outlier detection based on QC statistics
sce <- detect_outlier(
sce,
comp = 2,
type = "low"
)
cat("Samples detected as outlier:\n")
## Samples detected as outlier:
# check how many are outliers
table(QC_metrics(sce)$outliers)
##
## FALSE TRUE
## 1347 926
We can use plot_QC_pairs()
to summarise the distribution of quality control metrics between outliers and non-outliers. After that remove_outliers()
removes the detected outliers from our SingleCellExperiment()
object from downstream analysis.
plot_QC_pairs(sce)
# remove outlier samples
sce <- remove_outliers(sce)
dim(sce)
## [1] 28719 1347
Because SingleCellExperiments
is the standard Bioconductor
object for storing single-cell sequencing output, it is used by other useful analysis packages designed for single cells. Here we pass our SingleCellExperiment
object sce
into scater
for calculation of more quality control metrics. From scater
we can plot the most expressed genes to get a quick overview of our data.
# convert rownames from ENSEMBL IDs to gene symbols
rowData(sce)$GeneID <- rownames(sce)
sce <- convert_geneid(sce)
## [1] "Number of NA in new gene id: 60. Duplicated id: 172"
## [1] "First 5 duplicated:"
## ensembl_gene_id external_gene_name description
## 260 ENSG00000238560 RF00019
## 107 ENSG00000200448 RF00019
## 450 ENSG00000275504 RF00066
## 311 ENSG00000253060 RF00601
## 140 ENSG00000212458 RF00554
## 102 ENSG00000199461 RF00019
# calculate scater QC metrics
sce <- calculateQCMetrics(
sce,
feature_controls = list(ERCC = str_detect(rownames(sce), "^ERCC"))
)
# plot 20 highest expression genes
plotQC(sce, type = "highest-expression", n = 20)
We filter out some lowly expressed genes before moving onto further analysis. The sce
object can then be passed into computeSumFactors
from scran
to calculated normalisation factors so that normalised expression values can be computed.
# take the number of samples gene is expressed in
expressed_samples <- rowSums(counts(sce) > 0)
# filter out lowly expressed genes
keep <- expressed_samples > 30
sce <- sce[keep, ]
dim(sce)
## [1] 12190 1347
# compute size factors
sce <- computeSumFactors(sce, positive = TRUE)
# assign normalised expression to exprs(sce)
sce <- normalize(sce)
We can also use variance modelling from scran
to estimate biological and technical variances of each gene. Here we will plot the trend of biological variance and highlight genes that have high biological variance. We also plot a heatmap of the highly variable genes across the samples.
# estimate variance then decompose into technical and biological components
fit <- trendVar(sce, parametric = TRUE, use.spikes = FALSE)
decomp <- decomposeVar(sce, fit) %>%
data.frame() %>%
rownames_to_column("GeneID") %>%
arrange(desc(bio))
# plot all genes in black
plot(
decomp$mean,
decomp$total,
xlab = "Mean log-expression",
ylab = "Variance",
pch = 20,
cex = 0.4
)
o <- order(decomp$mean)
hvg <- which(decomp$FDR <= 0.05 & decomp$bio >= 0.5)
# plot highly variable genes in blue
points(decomp$mean[hvg], decomp$total[hvg], col = "blue", pch = 20, cex = 0.8)
# plot spike-ins in red and draw trend
lines(decomp$mean[o], decomp$tech[o], col = "red", lwd = 1)
# plot heatmap of most variable genes
hvg_ids <- decomp$GeneID[hvg]
gene_exp <- exprs(sce)
gene_exp <- gene_exp[hvg_ids,]
hc_rows <- hclust(dist(gene_exp))
hc_cols <- hclust(dist(t(gene_exp)))
gene_exp <- gene_exp[hc_rows$order, hc_cols$order]
margin <- list(
l = 100,
r = 40,
b = 10,
t = 10,
pad = 0
)
plot_ly(
x = colnames(gene_exp),
y = rownames(gene_exp),
z = gene_exp,
type = "heatmap"
) %>%
layout(autosize = F, margin = margin)
We can use plotting functions from scater
to produce dimensionality reduction visualisations of the data. This gives a rough idea of the grouping of the heterogenous groups of cells in the sample.
# plot PCA of data
plotPCA(
sce,
ncomponents = 3,
run_args = list(
exprs_values = "logcounts"
)
)
# plot tSNE of data
set.seed(100)
plotTSNE(
sce,
ncomponents = 3,
run_args = list(
exprs_values = "logcounts",
perplexity = 30
)
) + ggtitle("Perplexity = 30")
We can assign clusters to the samples using SC3
which also makes use of the SingleCellExperiment
object. This estimates optimal cluster number and performs consensus based clustering to assign a cluster id to each sample.
set.seed(100)
rowData(sce)$feature_symbol <- rownames(sce)
sce <- sc3_estimate_k(sce)
k_estimate <- metadata(sce)$sc3$k_estimation
sc3_cluster_str <- sprintf("sc3_%d_clusters", k_estimate)
sce <- sc3(
sce,
ks = k_estimate,
n_cores = 1
)
There currently lacks a robust computational method for inferring the identity of the sequenced cells, here we shall use an ad-hoc method to infer cellular identity by correlation against a microarray dataset that profiled a range of immunological lineages. We assign a cell to the immunological lineage with the greatest correlation with the cells. Two approaches are used, one to correlate cells individually and one to correlate the mean expression of SC3 assigned clusters.
immune_lineage_exprs <- read.csv(
"data/dmap_immune_signatures/dmap_immune_lineage_exprs.csv",
row.names = 1
)
lineages <- colnames(immune_lineage_exprs)
# predict individual cell types
common_genes <- intersect(
rowData(sce)$GeneID,
rownames(immune_lineage_exprs)
)
gene_ids <- rowData(sce)$GeneID
sc_exprs <- exprs(sce)[gene_ids %in% common_genes, ]
gmap_exprs <- immune_lineage_exprs[common_genes, ]
signature_cor <- cor(sc_exprs, gmap_exprs, method = "spearman")
sc_type_pred <- apply(signature_cor, 1,
function(x) { lineages[x == max(x)] }
)
# predict cluster_type
cluster_ids <- colData(sce)[, sc3_cluster_str] %>% unique %>% sort()
counts <- counts(sce)
rownames(counts) <- rowData(sce)$GeneID
cluster_exprs <- lapply(cluster_ids,
function(x) { counts[, colData(sce)[[sc3_cluster_str]] == x, drop = FALSE] })
avg_clust_exprs <- lapply(cluster_exprs,
function(x) { rowMeans(x) }) %>%
do.call(cbind, .)
colnames(avg_clust_exprs) <- cluster_ids
common_genes <- intersect(
rownames(avg_clust_exprs),
rownames(immune_lineage_exprs)
)
signature_cor <- cor(
avg_clust_exprs[common_genes, ],
immune_lineage_exprs[common_genes, ],
method = "spearman"
)
lineages <- colnames(signature_cor)
clust_type_pred <- apply(signature_cor, 1,
function(x) { lineages[x == max(x)] }
)
clust_type_pred <- clust_type_pred[colData(sce)[[sc3_cluster_str]]]
colData(sce)$sc_cell_type <- sc_type_pred
colData(sce)$clust_cell_type <- clust_type_pred
set.seed(100)
plotTSNE(
sce,
ncomponents = 2,
colour_by = sc3_cluster_str
)
set.seed(100)
tsne_out <- Rtsne(t(logcounts(sce)))
sc_cell_type <- factor(colData(sce)$sc_cell_type, levels = lineages)
clust_cell_type <- factor(colData(sce)$clust_cell_type, levels = lineages)
manual_cols <- c(
RColorBrewer::brewer.pal(8, "Set2"),
RColorBrewer::brewer.pal(4, "Set1")
)
names(manual_cols) <- lineages
plot_data <- data.frame(
dim1 = tsne_out$Y[, 1],
dim2 = tsne_out$Y[, 2],
sc_cell_type = sc_cell_type,
clust_cell_type = clust_cell_type
)
p <- ggplot(plot_data, aes(x = dim1, y = dim2, col = sc_cell_type)) +
geom_point() +
scale_colour_manual(values = manual_cols) +
theme_bw() +
coord_fixed(ratio = 3/4) +
ggtitle("Single Cell Type Assignment")
print(p)
p <- ggplot(plot_data, aes(x = dim1, y = dim2, col = clust_cell_type)) +
geom_point() +
scale_colour_manual(values = manual_cols) +
theme_bw() +
coord_fixed(ratio = 3/4) +
ggtitle("Whole Cluster Type Assignment")
print(p)
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] bindrcpp_0.2.2 plotly_4.7.1
## [3] tibble_1.4.2 Rtsne_0.13
## [5] dplyr_0.7.4 Glimma_1.8.0
## [7] edgeR_3.22.0 limma_3.36.0
## [9] SC3_1.8.0 scran_1.8.0
## [11] scater_1.8.0 scPipe_1.3.3
## [13] SingleCellExperiment_1.2.0 SummarizedExperiment_1.10.0
## [15] DelayedArray_0.6.0 BiocParallel_1.14.0
## [17] matrixStats_0.53.1 Biobase_2.40.0
## [19] GenomicRanges_1.32.0 GenomeInfoDb_1.16.0
## [21] IRanges_2.14.1 S4Vectors_0.18.1
## [23] BiocGenerics_0.26.0 ggplot2_2.2.1
## [25] stringr_1.3.0 magrittr_1.5
##
## loaded via a namespace (and not attached):
## [1] backports_1.1.2 plyr_1.8.4
## [3] igraph_1.2.1 lazyeval_0.2.1
## [5] shinydashboard_0.7.0 crosstalk_1.0.0
## [7] digest_0.6.15 foreach_1.4.4
## [9] htmltools_0.3.6 viridis_0.5.1
## [11] gdata_2.18.0 memoise_1.1.0
## [13] cluster_2.0.7-1 doParallel_1.0.11
## [15] ROCR_1.0-7 limSolve_1.5.5.3
## [17] lpSolve_5.6.13 prettyunits_1.0.2
## [19] colorspace_1.3-2 blob_1.1.1
## [21] rrcov_1.4-3 WriteXLS_4.0.0
## [23] RCurl_1.95-4.10 jsonlite_1.5
## [25] tximport_1.8.0 org.Mm.eg.db_3.6.0
## [27] bindr_0.1.1 iterators_1.0.9
## [29] glue_1.2.0 registry_0.5
## [31] gtable_0.2.0 zlibbioc_1.26.0
## [33] XVector_0.20.0 Rhdf5lib_1.2.0
## [35] DEoptimR_1.0-8 scales_0.5.0
## [37] pheatmap_1.0.8 mvtnorm_1.0-7
## [39] DBI_1.0.0 GGally_1.3.2
## [41] rngtools_1.2.4 Rcpp_0.12.16
## [43] viridisLite_0.3.0 xtable_1.8-2
## [45] progress_1.1.2 bit_1.1-12
## [47] mclust_5.4 DT_0.4
## [49] htmlwidgets_1.2 httr_1.3.1
## [51] FNN_1.1 gplots_3.0.1
## [53] RColorBrewer_1.1-2 pkgconfig_2.0.1
## [55] reshape_0.8.7 XML_3.98-1.11
## [57] locfit_1.5-9.1 dynamicTreeCut_1.63-1
## [59] labeling_0.3 rlang_0.2.0
## [61] reshape2_1.4.3 later_0.7.2
## [63] AnnotationDbi_1.42.0 munsell_0.4.3
## [65] tools_3.5.0 RSQLite_2.1.0
## [67] evaluate_0.10.1 yaml_2.1.19
## [69] org.Hs.eg.db_3.6.0 knitr_1.20
## [71] bit64_0.9-7 robustbase_0.93-0
## [73] caTools_1.17.1 purrr_0.2.4
## [75] doRNG_1.6.6 mime_0.5
## [77] biomaRt_2.36.0 compiler_3.5.0
## [79] beeswarm_0.2.3 Rhtslib_1.12.0
## [81] curl_3.2 e1071_1.6-8
## [83] statmod_1.4.30 pcaPP_1.9-73
## [85] stringi_1.2.2 lattice_0.20-35
## [87] Matrix_1.2-14 pillar_1.2.2
## [89] data.table_1.11.0 bitops_1.0-6
## [91] httpuv_1.4.1 R6_2.2.2
## [93] promises_1.0.1 KernSmooth_2.23-15
## [95] gridExtra_2.3 vipor_0.4.5
## [97] codetools_0.2-15 MASS_7.3-50
## [99] gtools_3.5.0 assertthat_0.2.0
## [101] rhdf5_2.24.0 pkgmaker_0.22
## [103] rprojroot_1.3-2 rjson_0.2.15
## [105] GenomeInfoDbData_1.1.0 quadprog_1.5-5
## [107] grid_3.5.0 tidyr_0.8.0
## [109] class_7.3-14 rmarkdown_1.9
## [111] DelayedMatrixStats_1.2.0 shiny_1.0.5
## [113] ggbeeswarm_0.6.0