2D Expression PCAs colored by different features
Feature correlation plots
Quantification and quality-control pipeline
The quality of each sample was independently assessed using FastQC, Preseq, Picard tools, RSeQC, SAMtools, and QualiMap. FastQ Screen and Kraken + Krona were used to screen for various sources of contamination.
Adapter sequences were removed using Cutadapt prior to mapping to hg38 reference genome. STAR was run in two-pass mode where splice-junctions are collected and aggregated across all samples and provided to the second-pass of STAR. Gene expression levels were quantified using RSEM. The expected counts from RSEM are merged across samples to create a counts matrix for downstream analysis. RSeQC tin.py was used to calculate transcript integrity numbers for all canonical protein-coding transcripts.
Downstream Analysis
The expected counts from RSEM were filtered to remove lowly expressed genes using edgeR's filterByExpr() function. The following critea were selected for filtering: genes must have 10 reads in >= 70% samples. After filtering, we are left with 18787 genes. Trimmed mean of M-values (TMM) was performed using the calcNormFactors() function in edgeR. The normalisation factors calculated here are used as a scaling factor for the library sizes. Using the voom() function in limma, the counts are converted log2-counts-per-million (logCPM) and quantile normalized.
voom is an acronym for mean-variance modelling at the observational level. The key concern is to estimate the mean-variance relationship in the data, then use this to compute appropriate weights for each observation. Count data almost show non-trivial mean-variance relationships. Raw counts show increasing variance with increasing count size, while log-counts typically show a decreasing mean-variance trend. This function estimates the mean-variance trend for log-counts, then assigns a weight to each observation based on its predicted variance. The weights are then used in the linear modelling process to adjust for heteroscedasticity.
| Tool | Version | Notes |
|---|---|---|
| FastQC2 | 0.11.5 | Quality-control step to assess sequencing quality, run before and after adapter trimming |
| Cutadapt3 | 1.18 | Data processing step to remove adapter sequences and perform quality trimming |
| Kraken14 | 1.1 | Quality-control step to assess microbial taxonomic composition |
| KronaTools15 | 2.7 | Quality-control step to visualize kraken output |
| FastQ Screen17 | 0.9.3 | Quality-control step to assess contamination; additional dependencies: bowtie2/2.3.4, perl/5.24.3 |
| STAR4 | 2.7.0f | Data processing step to align reads against reference genome (using its two-pass mode) |
| QualiMap16 | 2.2.1 | Quality-control step to assess various alignment metrics, also calculates insert_size |
| Picard10 | 2.17.11 | Quality-control step to run MarkDuplicates, CollectRnaSeqMetrics and AddOrReplaceReadGroups |
| Preseq1 | 2.0.3 | Quality-control step to estimate library complexity |
| SAMtools13 | 1.6 | Quality-control step to run flagstat to calculate alignment statistics |
| RSeQC9 | 2.6.4 | Quality-control step to infer stranded-ness, TIN19, and read distributions over specific genomic features |
| RSEM5 | 1.3.0 | Data processing step to quantify gene and isoform counts |
| MultiQC11 | 1.4 | Reporting step to aggregate sample statistics and quality-control information across all sample |
General Recommendations
Here is a set of generalized guidelines for different QC metrics. Some of these metrics will vary genome-to-genome depending on the quality of the assembly and annotation but that has been taken into consideration for our set of supported reference genomes.
| Metric | Guideline |
|---|---|
| medTIN | > 65 |
| Trimmed Reads | > 10000000 |
| % Aligned to Reference | > 65% |
| % Duplicates | < 65 % |
| % rRNA | < 10% |
| % Coding | Coding > 35% |
References
1. Daley, T. and A.D. Smith, Predicting the molecular complexity of sequencing libraries. Nat Methods, 2013. 10(4): p. 325-7.
2. Andrews, S. (2010). FastQC: a quality control tool for high throughput sequence data.
3. Martin, M. (2011). "Cutadapt removes adapter sequences from high-throughput sequencing reads." EMBnet 17(1): 10-12.
4. Dobin, A., et al., STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 2013. 29(1): p. 15-21.
5. Li, B. and C.N. Dewey, RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 2011. 12: p. 323.
6. Harrow, J., et al., GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res, 2012. 22(9): p. 1760-74.
7. Law, C.W., et al., voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol, 2014. 15(2): p. R29.
8. Smyth, G.K., Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol, 2004. 3: p. Article3.
9. Wang, L., et al. (2012). "RSeQC: quality control of RNA-seq experiments." Bioinformatics 28(16): 2184-2185.
10. The Picard toolkit. https://broadinstitute.github.io/picard/.
11. Ewels, P., et al. (2016). "MultiQC: summarize analysis results for multiple tools and samples in a single report." Bioinformatics 32(19): 3047-3048.
12. R Core Team (2018). R: A Language and Environment for Statistical Computing. Vienna, Austria, R Foundation for Statistical Computing.
13. Li, H., et al. (2009). "The Sequence Alignment/Map format and SAMtools." Bioinformatics 25(16): 2078-2079.
14. Wood, D. E. and S. L. Salzberg (2014). "Kraken: ultrafast metagenomic sequence classification using exact alignments." Genome Biol 15(3): R46.
15. Ondov, B. D., et al. (2011). "Interactive metagenomic visualization in a Web browser." BMC Bioinformatics 12(1): 385.
16. Okonechnikov, K., et al. (2015). "Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data." Bioinformatics 32(2): 292-294.
17. Wingett, S. and S. Andrews (2018). "FastQ Screen: A tool for multi-genome mapping and quality control." F1000Research 7(2): 1338.
18. Robinson, M. D., et al. (2009). "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics 26(1): 139-140.
19. Wang, L., et al. (2016). "Measure transcript integrity using RNA-seq data." BMC Bioinformatics 17(1): 58.
R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] grid stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] circlize_0.4.10 ComplexHeatmap_2.5.5 reshape2_1.4.3
[4] DT_0.15 crosstalk_1.1.0.1 gridExtra_2.3
[7] RColorBrewer_1.1-2 edgeR_3.24.3 limma_3.38.3
[10] plotly_4.9.2.9000 ggplot2_3.3.2 plyr_1.8.6
[13] argparse_2.0.1
loaded via a namespace (and not attached):
[1] viridis_0.5.1 httr_1.4.2 tidyr_1.0.2
[4] jsonlite_1.7.1 viridisLite_0.3.0 shiny_1.5.0
[7] assertthat_0.2.1 stats4_3.5.2 yaml_2.2.1
[10] pillar_1.4.6 lattice_0.20-40 glue_1.3.2
[13] digest_0.6.25 promises_1.1.0 colorspace_1.4-1
[16] htmltools_0.5.0 httpuv_1.5.2 pkgconfig_2.0.3
[19] GetoptLong_1.0.2 flexdashboard_0.5.2 purrr_0.3.3
[22] xtable_1.8-4 scales_1.1.1 later_1.0.0
[25] tibble_2.1.3 farver_2.0.3 IRanges_2.16.0
[28] ellipsis_0.3.0 withr_2.3.0 BiocGenerics_0.28.0
[31] lazyeval_0.2.2 magrittr_1.5 crayon_1.3.4
[34] mime_0.9 evaluate_0.14 tools_3.5.2
[37] data.table_1.12.8 GlobalOptions_0.1.2 lifecycle_0.2.0
[40] matrixStats_0.56.0 stringr_1.4.0 S4Vectors_0.20.1
[43] findpython_1.0.5 munsell_0.5.0 locfit_1.5-9.1
[46] cluster_2.1.0 compiler_3.5.2 rlang_0.4.7
[49] rjson_0.2.20 htmlwidgets_1.5.1.9002 labeling_0.3
[52] rmarkdown_2.3 gtable_0.3.0 R6_2.4.1
[55] knitr_1.25 dplyr_0.8.5 fastmap_1.0.1
[58] clue_0.3-57 shape_1.4.5 stringi_1.4.6
[61] parallel_3.5.2 Rcpp_1.0.4 vctrs_0.2.4
[64] png_0.1-7 tidyselect_1.0.0 xfun_0.12
---
title: "report Not Applicable"
author: "Assess the quality of large RNA-seq cohorts"
params:
raw: "/path/to/raw/counts/matrix.tsv"
tin: "/path/to/tin/counts/martrix.tsv"
qc: "/path/to/qc/table.tsv"
wdir: "/path/to/output/directory"
annot: "FALSE"
output:
flexdashboard::flex_dashboard:
orientation: rows
navbar:
- { title: "Pipeline Documentation", href: "https://ccbr.github.io/pipeliner-docs/RNA-seq/Theory-and-practical-guide-for-RNA-seq/", align: right }
source_code: "embed"
---
```{r setup, include=FALSE}
# Clear R environment
# rm(list = ls())
set.seed(42)
# Set the working directory
knitr::opts_knit$set(root.dir = params$wdir)
knitr::opts_chunk$set(echo = FALSE, warning=FALSE, dev="png", fig.path=file.path(params$wdir, "figs/"))
```
```{r global, include=FALSE}
# Library imports
suppressMessages(library(plyr))
suppressMessages(library(plotly))
suppressMessages(library(ggplot2))
suppressMessages(library(limma))
suppressMessages(library(edgeR))
suppressMessages(library(RColorBrewer))
suppressMessages(library(gridExtra))
suppressMessages(library(crosstalk))
suppressMessages(library(DT))
suppressMessages(library(reshape2))
suppressMessages(library(ComplexHeatmap))
suppressMessages(library(circlize))
# Reading in raw counts matrix, TIN matrix, and QC metadata
rawcounts = read.table(file = params$raw, sep = '\t', header = TRUE, row.names = 1)
#rawcounts = read.table(file = 'data/Test_Raw_RSEM_Genes_Dataset.txt', sep = '\t', header = TRUE, row.names = 1)
tincounts = read.table(file = params$tin, sep = '\t', header = TRUE, row.names = 1)
#tincounts = read.table(file = 'data/Test_TIN_Dataset.txt', sep = '\t', header = TRUE, row.names = 1)
# Remove zero variance rows prior to PC
tincounts = tincounts[apply(tincounts, 1, var) != 0, ]
multiQC = read.table(file = params$qc, sep = '\t', header = TRUE)
rownames(multiQC) = make.names(multiQC$Sample)
# Create DGEList
deg = edgeR::DGEList(counts = rawcounts)
# Filter lowly expressed genes
keep_genes = edgeR::filterByExpr(deg) # Using default: Gene must have 10 reads in >= 70% samples
deg = deg[keep_genes,,keep.lib.sizes=FALSE] # Recaluate new lib.sizes after filtering
# edgeR TMM normalization
deg = calcNormFactors(deg, method = "TMM") # calculate scaling norm.factors
# limma voom normalization
deg_voom = voom(deg, normalize="quantile", plot = TRUE, save.plot = TRUE)
# Order genes by MAD
deg_voom$E <- deg_voom$E[order(apply(deg_voom$E, 1, mad), decreasing = T),]
# Remove zero variance rows prior to PC
deg_voom$E <- deg_voom$E[apply(deg_voom$E, 1, var) != 0, ]
# Principal Components Analysis
pca_exp = prcomp(t(as.matrix(deg_voom$E)), scale.=T)$x[,1:3] # Expression PC Analysis
pca_tin = prcomp(t(as.matrix(tincounts)), scale.=T)$x[,1:3] # Transcript Integrity Number PC Analysis
colnames(pca_tin) = c("PC1_tin", "PC2_tin", "PC3_tin") # Renaming PC cols to avoid collision with gene expression PCs
# Merge both dataframes on rowname
multiQC = transform(merge(multiQC, as.data.frame(pca_exp), by='row.names', all=TRUE), row.names=Row.names, Row.names=NULL)
multiQC = transform(merge(multiQC, as.data.frame(pca_tin), by='row.names', all=TRUE), row.names=Row.names, Row.names=NULL)
# Crosstalk object (inter-widget connectivity)
shared_metadata = SharedData$new(multiQC)
```
Interactives {data-icon="ion-android-options"}
=====================================
Inputs {.sidebar}
-------------------------------------
### Filters
```{r filters}
# Extracted Tissue
filter_checkbox(
id = "TissueType",
label = "Tissue",
sharedData = shared_metadata,
group = ~TissueType
)
# Tissue Source Site
filter_checkbox(
id = "TSS_Site",
label = "Tissue Source",
sharedData = shared_metadata,
group = ~TSS_Site
)
# Sequence Ranges
filter_select(
id = "sequence_length",
label = "Sequence Ranges",
sharedData = shared_metadata,
group = ~sequence_length
)
# Median TIN
filter_slider(
id = "median_tin",
label = "medTIN",
sharedData = shared_metadata,
column = ~median_tin,
step = 5,
round = TRUE,
sep = "",
ticks = TRUE,
min = 0,
max = 100
)
# Trimmed Reads
filter_slider(
id = "trimmed_read_pairs",
label = "Trimmed Reads",
sharedData = shared_metadata,
column = ~trimmed_read_pairs,
step = 5000000,
round = TRUE,
sep = "",
ticks = TRUE,
min = 0
)
# % Duplicates
filter_slider(
id = "percent_duplication",
label = "% Duplicates",
sharedData = shared_metadata,
column = ~percent_duplication,
step = 5,
round = TRUE,
sep = "",
ticks = TRUE,
min = 0,
max = 100
)
# % Aligned
filter_slider(
id = "percent_aligned",
label = "% Aligned",
sharedData = shared_metadata,
column = ~percent_aligned,
step = 5,
round = TRUE,
sep = "",
ticks = TRUE,
min = 0,
max = 100
)
# % Coding
filter_slider(
id = "pct_coding_bases",
label = "% Coding",
sharedData = shared_metadata,
column = ~pct_coding_bases,
step = 5,
round = TRUE,
sep = "",
ticks = TRUE,
min = 0,
max = 100
)
# % Intronic
filter_slider(
id = "pct_intronic_bases",
label = "% Intronic",
sharedData = shared_metadata,
column = ~pct_intronic_bases,
step = 5,
sep = "",
ticks = TRUE,
min = 0,
max = 100
)
# % rRNA
filter_slider(
id = "rRNA_percent_aligned",
label = "% rRNA",
sharedData = shared_metadata,
column = ~rRNA_percent_aligned,
step = 5,
sep = "",
round = TRUE,
ticks = TRUE,
min = 0,
max = 100
)
# GC content
filter_slider(
id = "gc_content",
label = "GC content",
sharedData = shared_metadata,
column = ~gc_content,
sep = "",
ticks = TRUE
)
# CV Coverage
filter_slider(
id = "median_cv_coverage",
label = "CV Coverage",
sharedData = shared_metadata,
column = ~median_cv_coverage,
sep = "",
ticks = TRUE
)
# Inner Distance Maxima
filter_slider(
id = "inner_distance_maxima",
label = "Inner Distance Maxima",
sharedData = shared_metadata,
column = ~inner_distance_maxima,
step = 10,
sep = "",
ticks = TRUE,
round = TRUE
)
# Insert Size
filter_slider(
id = "median_insert_size",
label = "Insert Size",
sharedData = shared_metadata,
column = ~median_insert_size,
step = 5,
sep = "",
ticks = TRUE
)
```
Row {data-height=400}
-------------------------------------
### Metadata
```{r datatable}
shared_metadata %>%
DT::datatable(
# selection = 'none', # disable datatable row selection
# filter = "top", # allows filtering on each column
extensions = c(
"Buttons", # add download buttons
"Scroller" # for scrolling instead of pagination
),
rownames = FALSE, # remove rownames
style = "bootstrap",
class = "compact",
width = "100%",
options = list(
dom = "Blrtip", # specify content (search box, etc)
deferRender = TRUE,
scrollY = 300,
scroller = TRUE,
columnDefs = list(
list(
visible = FALSE,
targets = c(1, 11, 14, 15, 18, 19, 21:23, 27, 31, 32, 33, 34, 35, 36) # hide columes
)
),
buttons = list(
I("colvis"), # turn columns on and off
"csv", # download as .csv
"excel" # download as .xlsx
)
),
colnames = c(
"Sample ID" = "Sample",
"Total Reads" = "total_read_pairs",
"Trimmed Reads" = "trimmed_read_pairs",
"Avg Seq Length" = "avg_sequence_length",
"Seq Range" = "sequence_length",
"GC" = "gc_content",
"% Dup" = "percent_duplication",
"% Aligned " = "percent_aligned",
"Inner Distance Maxima" = "inner_distance_maxima",
"Insert Size " = "median_insert_size",
"Avg MapQ" = "mean_mapping_quality",
"Coverage" = "mean_coverage",
"% Coding" = "pct_coding_bases",
"% Intronic" = "pct_intronic_bases",
"CV Coverage" = "median_cv_coverage",
"% rRNA" = "rRNA_percent_aligned" ,
"% UniVec" = "uni_vec_percent_aligned",
"Tissue" = "TissueType",
"% Anti-sense" = "percent_antisense_strand",
"medTIN" = "median_tin",
"Tissue Source" = "TSS_Site"
)
)
```
Row {data-height=600}
-------------------------------------
### Expression PCA
```{r 3d-expression-pca}
# Principal Components Analysis
pca = prcomp(t(as.matrix(deg_voom$E)), scale.=T)
# Variance explained for PCs: 1, 2, 3
pc1 = round(pca$sdev[1]^2/sum(pca$sdev^2)*100,2)
pc2 = round(pca$sdev[2]^2/sum(pca$sdev^2)*100,2)
pc3 = round(pca$sdev[3]^2/sum(pca$sdev^2)*100,2)
cgroups = as.factor(multiQC$TissueType)
cgroups = addNA(cgroups)
cpalette <- brewer.pal(nlevels(cgroups), "Paired")
p <- plot_ly(shared_metadata, x = ~PC1, y = ~PC2, z = ~PC3, color=cgroups, colors=cpalette, hoverinfo="text", marker=list(size = 8),
text = ~paste(' Sample: ', Sample, ' Tissue Source: ', TSS_Site, ' % Aligned: ', percent_aligned,
' % Dup: ', percent_duplication, ' % Coding: ', pct_coding_bases,
' % Intronic: ', pct_intronic_bases, ' medTIN: ', median_tin, ' Sequence Range: ', sequence_length,
' Inner Distance Maxima: ', inner_distance_maxima, ' Insert Size: ', median_insert_size) ) %>%
add_markers() %>% layout(scene = list(xaxis = list(title = paste0("PC1 (",pc1,"%)")),
yaxis = list(title = paste0("PC2 (",pc2,"%)")),
zaxis = list(title = paste0("PC3 (",pc3,"%)"))))
# Important: disable onclick() events
plotly::highlight(p, on = NULL) # fixes unexpected behavior when multiple plots via crosstalk
```
### TIN PCA
```{r 3d-tin-pca}
# Principal Components Analysis
pca = prcomp(t(as.matrix(tincounts)), scale.=T)
# Variance explained for PCs: 1, 2, 3
pc1 = round(pca$sdev[1]^2/sum(pca$sdev^2)*100,2)
pc2 = round(pca$sdev[2]^2/sum(pca$sdev^2)*100,2)
pc3 = round(pca$sdev[3]^2/sum(pca$sdev^2)*100,2)
cgroups = as.factor(multiQC$TissueType)
cgroups = addNA(cgroups)
cpalette <- brewer.pal(nlevels(cgroups), "Paired")
p <- plot_ly(shared_metadata, x = ~PC1_tin, y = ~PC2_tin, z = ~PC3_tin, color=cgroups, colors=cpalette, hoverinfo="text", marker=list(size = 8),
text = ~paste(' Sample: ', Sample, ' Tissue Source: ', TSS_Site, ' % Aligned: ', percent_aligned,
' % Dup: ', percent_duplication, ' % Coding: ', pct_coding_bases,
' % Intronic: ', pct_intronic_bases, ' medTIN: ', median_tin, ' Sequence Range: ', sequence_length,
' Inner Distance Maxima: ', inner_distance_maxima, ' Insert Size: ', median_insert_size) ) %>%
add_markers() %>% layout(scene = list(xaxis = list(title = paste0("PC1 (",pc1,"%)")),
yaxis = list(title = paste0("PC2 (",pc2,"%)")),
zaxis = list(title = paste0("PC3 (",pc3,"%)"))))
# Important: disable onclick() events
plotly::highlight(p, on = NULL) # fixes unexpected behavior when multiple plots via crosstalk
```
Subplots {data-icon="ion-grid"}
=====================================
**2D Expression PCAs colored by different features**
Row
-------------------------------------
```{r pca-initialize}
# Principal Components Analysis
pca = prcomp(t(as.matrix(deg_voom$E)), scale.=T)
# Variance explained for PCs: 1, 2, 3
pc1 = round(pca$sdev[1]^2/sum(pca$sdev^2)*100,2)
pc2 = round(pca$sdev[2]^2/sum(pca$sdev^2)*100,2)
```
### Tissue Type
```{r colored-by-tissue-type}
# Gene Expression PCA colored by Tissue Type
g <- ggplot(multiQC, aes(PC1, PC2, color = TissueType), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() +
labs(color = "Tissue Type", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)"))
g
```
### medTIN
```{r colored-by-tin}
# Gene Expression PCA colored by TIN
g <- ggplot(multiQC, aes(PC1, PC2, color = median_tin), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() +
labs(color = "medTIN", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) +
scale_colour_gradientn(colours = viridis::viridis(100))
g
```
### % Dups
```{r colored-by-dups}
# Gene Expression PCA colored by % Duplicates
g <- ggplot(multiQC, aes(PC1, PC2, color = percent_duplication), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() +
labs(color = "% Dups", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) +
scale_colour_gradientn(colours = viridis::viridis(100))
g
```
### % Aligned
```{r colored-by-alignment}
# Gene Expression PCA colored by % Aligned
g <- ggplot(multiQC, aes(PC1, PC2, color = percent_aligned), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() +
labs(color = "% Aligned", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) +
scale_colour_gradientn(colours = viridis::viridis(100))
g
```
Row
-------------------------------------
### CV Coverage
```{r colored-by-cv-coverage}
# Gene Expression PCA colored by CV Coverage
g <- ggplot(multiQC, aes(PC1, PC2, color = median_cv_coverage), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() +
labs(color = "CV Coverage", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) +
scale_colour_gradientn(colours = viridis::viridis(100))
g
```
### 3' Prime Bias
```{r colored-by-3-prime-coverage}
# Gene Expression PCA colored by 3' Prime Coverage
g <- ggplot(multiQC, aes(PC1, PC2, color = median_3prime_bias), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() +
labs(color = "3' Prime Coverage", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) +
scale_colour_gradientn(colours = viridis::viridis(100))
g
```
### Insert Size
```{r colored-by-insert-size}
# Gene Expression PCA colored by Insert Size
g <- ggplot(multiQC, aes(PC1, PC2, color = median_insert_size), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() +
labs(color = "Insert Size", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) +
scale_colour_gradientn(colours = viridis::viridis(100))
g
```
### Inner Distance Maxima
```{r colored-by-inner-distance}
# Gene Expression PCA colored by Inner Distance
g <- ggplot(multiQC, aes(PC1, PC2, color = inner_distance_maxima), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() +
labs(color = "Inner Distance Maxima", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) +
scale_colour_gradientn(colours = viridis::viridis(100))
g
```
Row
-------------------------------------
### % Coding
```{r colored-by-coding}
# Gene Expression PCA colored by % Coding
g <- ggplot(multiQC, aes(PC1, PC2, color = pct_coding_bases), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() +
labs(color = "% Coding", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) +
scale_colour_gradientn(colours = viridis::viridis(100))
g
```
### % rRNA
```{r colored-by-rrna}
# Gene Expression PCA colored by % rRNA
g <- ggplot(multiQC, aes(PC1, PC2, color = rRNA_percent_aligned), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() +
labs(color = "% rRNA", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) +
scale_colour_gradientn(colours = viridis::viridis(100))
g
```
### % Anti-sense
```{r colored-by-anti-sense}
# Gene Expression PCA colored by % Anti-sense
g <- ggplot(multiQC, aes(PC1, PC2, color = percent_antisense_strand), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() +
labs(color = "% Anti-sense", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) +
scale_colour_gradientn(colours = viridis::viridis(100))
g
```
Corr plots {data-orientation=columns data-icon="ion-stats-bars"}
=====================================
**Feature correlation plots**
Column {data-width=500}
-------------------------------------
### Heirarchical clustering of pairwise spearman correlation coefficients
```{r heirarchical-correlation-matrix, dpi=300}
# Helper Function
reorder_cormat <- function(cormat){
# Use correlation between variables as distance for heirarchincal clustering
dd <- as.dist((1-cormat)/2)
hc <- hclust(dd)
cormat <-cormat[hc$order, hc$order]
}
# Remove all columns that are categorical
numericQC = multiQC[,-which(sapply(multiQC, class) == "factor")]
# Additional columns to remove
additional_remove <- names(numericQC) %in% c("total_read_pairs", "mean_insert_size", "avg_aligned_read_length", "pct_mrna_bases",
"pct_utr_bases", "pct_intergenic_bases", "median_5prime_to_3prime_bias", "median_5prime_bias",
"median_3prime_bias", "percent_sense_strand", "PC1_tin", "PC2_tin", "PC3_tin")
# Cleaned numerical QC dataframe
numericQC = numericQC[!additional_remove]
# Remove zero-variance columns to prevent any hlclust() errors
#numericQC = numericQC[,-which(apply(numericQC, 2, var) == 0)]
numericQC = numericQC[, apply(numericQC, 2, var) != 0]
# Pair-wise spearman correlation matrix
cormatrix = round(cor(numericQC, method = "spearman"),2)
# Reorder the correlation matrix based on hierarchical clustering of the correlation coeff
cormat <- reorder_cormat(cormatrix)
# Get upper triangle of the correlation matrix
cormat[lower.tri(cormat)] <- NA
# Remove lower trinagle and reshape from wide to long format
cormat <- melt(cormat, na.rm = TRUE)
# Correlation ggheatmap
ggheatmap <- ggplot(cormat, aes(Var2, Var1, fill = value)) + geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), space = "Lab", name="Spearman\nCorrelation") +
theme_minimal() + theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 9, hjust = 1)) + coord_fixed() +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal") +
guides(fill = guide_colorbar(barwidth = 8, barheight = 1, title.position = "top", title.hjust = 0.5))
ggheatmap
```
Column {data-width=500}
-------------------------------------
### Complete linkage clustering of PC loadings with QC annotations
```{r loadings-heatmap, dpi=300, fig.height=6}
# Principal Components Analysis: 5 PCs as heatmap input
pca_exp = prcomp(t(as.matrix(deg_voom$E)), scale.=T)$x[,1:5] # Expression PC Analysis
hm_data <- as.matrix(t(pca_exp)) # Input for heatmap
# Additional columns to remove
additional_remove <- names(numericQC) %in% c("PC1", "PC2", "PC3")
# Cleaned numerical QC dataframe with matched rownames
numericQC = numericQC[match(colnames(hm_data), rownames(numericQC), nomatch=0), !additional_remove]
column_annotations = HeatmapAnnotation(df = numericQC)
if (params$annot){
cheatmap <- ComplexHeatmap::Heatmap(hm_data,
col=colorRamp2(seq(-max(abs(pca_exp), na.rm = T), max(abs(pca_exp), na.rm = T), length.out = 20),
rev(colorRampPalette(brewer.pal(9, "PuOr"))(20))),
bottom_annotation = column_annotations,
show_column_names=T,
column_names_rot = 45,
show_heatmap_legend = F) # Turning off to control the placement
} else{
cheatmap <- ComplexHeatmap::Heatmap(hm_data,
col=colorRamp2(seq(-max(abs(pca_exp), na.rm = T), max(abs(pca_exp), na.rm = T), length.out = 20),
rev(colorRampPalette(brewer.pal(9, "PuOr"))(20))),
bottom_annotation = column_annotations,
show_column_names=F,
show_heatmap_legend = F) # Turning off to control the placement
}
draw(cheatmap, show_annotation_legend = FALSE)
```
Information {data-orientation=columns data-icon="fa-info-circle"}
=====================================
Column {data-width=600}
-------------------------------------
### Overview
**Quantification and quality-control pipeline**
The quality of each sample was independently assessed using FastQC, Preseq, Picard tools, RSeQC, SAMtools, and QualiMap. FastQ Screen and Kraken + Krona were used to screen for various sources of contamination.
Adapter sequences were removed using Cutadapt prior to mapping to hg38 reference genome. STAR was run in _two-pass_ mode where splice-junctions are collected and aggregated across all samples and provided to the second-pass of STAR. Gene expression levels were quantified using RSEM. The expected counts from RSEM are merged across samples to create a counts matrix for downstream analysis. RSeQC `tin.py` was used to calculate transcript integrity numbers for all canonical protein-coding transcripts.
**Downstream Analysis**
The expected counts from RSEM were filtered to remove lowly expressed genes using edgeR's `filterByExpr()` function. The following critea were selected for filtering: genes must have 10 reads in >= 70% samples. After filtering, we are left with `r dim(deg)[1]` genes. Trimmed mean of M-values (TMM) was performed using the `calcNormFactors()` function in edgeR. The normalisation factors calculated here are used as a scaling factor for the library sizes. Using the `voom()` function in limma, the counts are converted log2-counts-per-million (logCPM) and quantile normalized.
_voom_ is an acronym for mean-variance modelling at the observational level. The key concern is to estimate the mean-variance relationship in the data, then use this to compute appropriate weights for each observation. Count data almost show non-trivial mean-variance relationships. Raw counts show increasing variance with increasing count size, while log-counts typically show a decreasing mean-variance trend. This function estimates the mean-variance trend for log-counts, then assigns a weight to each observation based on its predicted variance. The weights are then used in the linear modelling process to adjust for heteroscedasticity.
| Tool | Version | Notes |
|----------------|:-------:|---------------------------------------------------------------------------------------------------------------|
| FastQC2 | 0.11.5 | **Quality-control step** to assess sequencing quality, run before and after adapter trimming |
| Cutadapt3 | 1.18 | **Data processing step** to remove adapter sequences and perform quality trimming |
| Kraken14 | 1.1 | **Quality-control step** to assess microbial taxonomic composition |
| KronaTools15 | 2.7 | **Quality-control step** to visualize kraken output |
| FastQ Screen17 | 0.9.3 | **Quality-control step** to assess contamination; additional dependencies: `bowtie2/2.3.4`, `perl/5.24.3` |
| STAR4 | 2.7.0f | **Data processing step** to align reads against reference genome (using its two-pass mode) |
| QualiMap16 | 2.2.1 | **Quality-control step** to assess various alignment metrics, also calculates insert_size |
| Picard10 | 2.17.11 | **Quality-control step** to run `MarkDuplicates`, `CollectRnaSeqMetrics` and `AddOrReplaceReadGroups` |
| Preseq1 | 2.0.3 | **Quality-control step** to estimate library complexity |
| SAMtools13 | 1.6 | **Quality-control step** to run `flagstat` to calculate alignment statistics |
| RSeQC9 | 2.6.4 | **Quality-control step** to infer stranded-ness, TIN19, and read distributions over specific genomic features |
| RSEM5 | 1.3.0 | **Data processing step** to quantify gene and isoform counts |
| MultiQC11 | 1.4 | **Reporting step** to aggregate sample statistics and quality-control information across all sample |
**General Recommendations**
Here is a set of generalized guidelines for different QC metrics. Some of these metrics will vary genome-to-genome depending on the quality of the assembly and annotation but that has been taken into consideration for our set of supported reference genomes.
| Metric | Guideline |
|-----------------------------|:-------------------------------:|
| *medTIN* | > 65 |
| *Trimmed Reads* | > 10000000 |
| *% Aligned to Reference* | > 65% |
| *% Duplicates* | < 65 % |
| *% rRNA* | < 10% |
| *% Coding* | Coding > 35% |
**References**
**1.** Daley, T. and A.D. Smith, Predicting the molecular complexity of sequencing libraries. Nat Methods, 2013. 10(4): p. 325-7.
**2.** Andrews, S. (2010). FastQC: a quality control tool for high throughput sequence data.
**3.** Martin, M. (2011). "Cutadapt removes adapter sequences from high-throughput sequencing reads." EMBnet 17(1): 10-12.
**4.** Dobin, A., et al., STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 2013. 29(1): p. 15-21.
**5.** Li, B. and C.N. Dewey, RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 2011. 12: p. 323.
**6.** Harrow, J., et al., GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res, 2012. 22(9): p. 1760-74.
**7.** Law, C.W., et al., voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol, 2014. 15(2): p. R29.
**8.** Smyth, G.K., Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol, 2004. 3: p. Article3.
**9.** Wang, L., et al. (2012). "RSeQC: quality control of RNA-seq experiments." Bioinformatics 28(16): 2184-2185.
**10.** The Picard toolkit. https://broadinstitute.github.io/picard/.
**11.** Ewels, P., et al. (2016). "MultiQC: summarize analysis results for multiple tools and samples in a single report." Bioinformatics 32(19): 3047-3048.
**12.** R Core Team (2018). R: A Language and Environment for Statistical Computing. Vienna, Austria, R Foundation for Statistical Computing.
**13.** Li, H., et al. (2009). "The Sequence Alignment/Map format and SAMtools." Bioinformatics 25(16): 2078-2079.
**14.** Wood, D. E. and S. L. Salzberg (2014). "Kraken: ultrafast metagenomic sequence classification using exact alignments." Genome Biol 15(3): R46.
**15.** Ondov, B. D., et al. (2011). "Interactive metagenomic visualization in a Web browser." BMC Bioinformatics 12(1): 385.
**16.** Okonechnikov, K., et al. (2015). "Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data." Bioinformatics 32(2): 292-294.
**17.** Wingett, S. and S. Andrews (2018). "FastQ Screen: A tool for multi-genome mapping and quality control." F1000Research 7(2): 1338.
**18.** Robinson, M. D., et al. (2009). "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics 26(1): 139-140.
**19.** Wang, L., et al. (2016). "Measure transcript integrity using RNA-seq data." BMC Bioinformatics 17(1): 58.
Column {data-width=400}
-------------------------------------
### Session Information
```{r session-info}
sessionInfo()
```