Interactives

Row

Metadata

Row

Expression PCA

TIN PCA

Subplots

2D Expression PCAs colored by different features

Row

Tissue

GC Content

Median TIN

% Dups

% Aligned

Row

Histology

CV Coverage

3' Prime Bias

Insert Size

Inner Distance Maxima

Row

Tumor Grade

Biological Sex

% Coding

% rRNA

% Anti-sense

Corr plots

Feature correlation plots

Column

Heirarchical clustering of pairwise spearman correlation coefficients

Column

Complete linkage clustering of PC loadings with QC annotations

Information

Column

Overview

Quantification and quality-control pipeline

The quality of each sample was independently assessed using FastQC, Preseq, Picard tools, RSeQC, SAMtools, and QualiMap. FastQ Screen and Kraken + Krona were used to screen for various sources of contamination.

Adapter sequences were removed using Cutadapt prior to mapping to hg38 reference genome. STAR was run in two-pass mode where splice-junctions are collected and aggregated across all samples and provided to the second-pass of STAR. Gene expression levels were quantified using RSEM. The expected counts from RSEM are merged across samples to create a counts matrix for downstream analysis. RSeQC tin.py was used to calculate transcript integrity numbers for all canonical protein-coding transcripts.

Downstream Analysis

The expected counts from RSEM were filtered to remove lowly expressed genes using edgeR's filterByExpr() function. The following critea were selected for filtering: genes must have 10 reads in >= 70% samples. After filtering, we are left with 19439 genes. Trimmed mean of M-values (TMM) was performed using the calcNormFactors() function in edgeR. The normalisation factors calculated here are used as a scaling factor for the library sizes. Using the voom() function in limma, the counts are converted log2-counts-per-million (logCPM) and quantile normalized.

voom is an acronym for mean-variance modelling at the observational level. The key concern is to estimate the mean-variance relationship in the data, then use this to compute appropriate weights for each observation. Count data almost show non-trivial mean-variance relationships. Raw counts show increasing variance with increasing count size, while log-counts typically show a decreasing mean-variance trend. This function estimates the mean-variance trend for log-counts, then assigns a weight to each observation based on its predicted variance. The weights are then used in the linear modelling process to adjust for heteroscedasticity.

Tool Version Notes
FastQC2 0.11.5 Quality-control step to assess sequencing quality, run before and after adapter trimming
Cutadapt3 1.18 Data processing step to remove adapter sequences and perform quality trimming
Kraken14 1.1 Quality-control step to assess microbial taxonomic composition
KronaTools15 2.7 Quality-control step to visualize kraken output
FastQ Screen17 0.9.3 Quality-control step to assess contamination; additional dependencies: bowtie2/2.3.4, perl/5.24.3
STAR4 2.7.0f Data processing step to align reads against reference genome (using its two-pass mode)
QualiMap16 2.2.1 Quality-control step to assess various alignment metrics, also calculates insert_size
Picard10 2.17.11 Quality-control step to run MarkDuplicates, CollectRnaSeqMetrics and AddOrReplaceReadGroups
Preseq1 2.0.3 Quality-control step to estimate library complexity
SAMtools13 1.6 Quality-control step to run flagstat to calculate alignment statistics
RSeQC9 2.6.4 Quality-control step to infer stranded-ness, TIN19, and read distributions over specific genomic features
RSEM5 1.3.0 Data processing step to quantify gene and isoform counts
MultiQC11 1.4 Reporting step to aggregate sample statistics and quality-control information across all sample

General Recommendations

Here is a set of generalized guidelines for different QC metrics. Some of these metrics will vary genome-to-genome depending on the quality of the assembly and annotation but that has been taken into consideration for our set of supported reference genomes.

Metric Guideline
medTIN > 65
Trimmed Reads > 10000000
% Aligned to Reference > 65%
% Duplicates < 65 %
% rRNA < 10%
% Coding Coding > 35%

References

1. Daley, T. and A.D. Smith, Predicting the molecular complexity of sequencing libraries. Nat Methods, 2013. 10(4): p. 325-7.
2. Andrews, S. (2010). FastQC: a quality control tool for high throughput sequence data.
3. Martin, M. (2011). "Cutadapt removes adapter sequences from high-throughput sequencing reads." EMBnet 17(1): 10-12.
4. Dobin, A., et al., STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 2013. 29(1): p. 15-21.
5. Li, B. and C.N. Dewey, RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 2011. 12: p. 323.
6. Harrow, J., et al., GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res, 2012. 22(9): p. 1760-74.
7. Law, C.W., et al., voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol, 2014. 15(2): p. R29.
8. Smyth, G.K., Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol, 2004. 3: p. Article3.
9. Wang, L., et al. (2012). "RSeQC: quality control of RNA-seq experiments." Bioinformatics 28(16): 2184-2185.
10. The Picard toolkit. https://broadinstitute.github.io/picard/.
11. Ewels, P., et al. (2016). "MultiQC: summarize analysis results for multiple tools and samples in a single report." Bioinformatics 32(19): 3047-3048.
12. R Core Team (2018). R: A Language and Environment for Statistical Computing. Vienna, Austria, R Foundation for Statistical Computing.
13. Li, H., et al. (2009). "The Sequence Alignment/Map format and SAMtools." Bioinformatics 25(16): 2078-2079.
14. Wood, D. E. and S. L. Salzberg (2014). "Kraken: ultrafast metagenomic sequence classification using exact alignments." Genome Biol 15(3): R46.
15. Ondov, B. D., et al. (2011). "Interactive metagenomic visualization in a Web browser." BMC Bioinformatics 12(1): 385.
16. Okonechnikov, K., et al. (2015). "Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data." Bioinformatics 32(2): 292-294.
17. Wingett, S. and S. Andrews (2018). "FastQ Screen: A tool for multi-genome mapping and quality control." F1000Research 7(2): 1338.
18. Robinson, M. D., et al. (2009). "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics 26(1): 139-140.
19. Wang, L., et al. (2016). "Measure transcript integrity using RNA-seq data." BMC Bioinformatics 17(1): 58.

Column

Session Information

R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] circlize_0.4.10      ComplexHeatmap_2.5.5 reshape2_1.4.3      
 [4] DT_0.15              crosstalk_1.1.0.1    gridExtra_2.3       
 [7] RColorBrewer_1.1-2   edgeR_3.24.3         limma_3.38.3        
[10] plotly_4.9.2.9000    ggplot2_3.3.2        plyr_1.8.6          
[13] argparse_2.0.1      

loaded via a namespace (and not attached):
 [1] viridis_0.5.1          httr_1.4.2             tidyr_1.0.2           
 [4] jsonlite_1.7.1         viridisLite_0.3.0      shiny_1.5.0           
 [7] assertthat_0.2.1       stats4_3.5.2           yaml_2.2.1            
[10] pillar_1.4.6           lattice_0.20-40        glue_1.3.2            
[13] digest_0.6.25          promises_1.1.0         colorspace_1.4-1      
[16] htmltools_0.5.0        httpuv_1.5.2           pkgconfig_2.0.3       
[19] GetoptLong_1.0.2       flexdashboard_0.5.2    purrr_0.3.3           
[22] xtable_1.8-4           scales_1.1.1           later_1.0.0           
[25] tibble_2.1.3           farver_2.0.3           IRanges_2.16.0        
[28] ellipsis_0.3.0         withr_2.3.0            BiocGenerics_0.28.0   
[31] lazyeval_0.2.2         magrittr_1.5           crayon_1.3.4          
[34] mime_0.9               evaluate_0.14          tools_3.5.2           
[37] data.table_1.12.8      GlobalOptions_0.1.2    lifecycle_0.2.0       
[40] matrixStats_0.56.0     stringr_1.4.0          S4Vectors_0.20.1      
[43] findpython_1.0.5       munsell_0.5.0          locfit_1.5-9.1        
[46] cluster_2.1.0          compiler_3.5.2         rlang_0.4.7           
[49] rjson_0.2.20           htmlwidgets_1.5.1.9002 labeling_0.3          
[52] rmarkdown_2.3          gtable_0.3.0           R6_2.4.1              
[55] knitr_1.25             dplyr_0.8.5            fastmap_1.0.1         
[58] clue_0.3-57            shape_1.4.5            stringi_1.4.6         
[61] parallel_3.5.2         Rcpp_1.0.4             vctrs_0.2.4           
[64] png_0.1-7              tidyselect_1.0.0       xfun_0.12             
---
title: "RNA Report"
author: "TCGA LGG RNA-seq cohort"
params:
  raw: "/path/to/raw/counts/matrix.tsv"
  tin: "/path/to/tin/counts/martrix.tsv"
  qc: "/path/to/qc/table.tsv"
  wdir: "/path/to/output/directory"
  annot: "FALSE"
output:
  flexdashboard::flex_dashboard:
    orientation: rows
    navbar:
      - { title: "Pipeline Documentation", href: "https://ccbr.github.io/pipeliner-docs/RNA-seq/Theory-and-practical-guide-for-RNA-seq/", align: right }
    source_code: "embed"
---


```{r setup, include=FALSE}
# Clear R environment
# rm(list = ls())
set.seed(42)

# Set the working directory
knitr::opts_knit$set(root.dir = params$wdir)
knitr::opts_chunk$set(echo = FALSE, warning=FALSE, dev="png", fig.path=file.path(params$wdir, "figs/"))
```


```{r global, include=FALSE}
# Library imports
suppressMessages(library(plyr))
suppressMessages(library(plotly))
suppressMessages(library(ggplot2))
suppressMessages(library(limma))
suppressMessages(library(edgeR))
suppressMessages(library(RColorBrewer))
suppressMessages(library(gridExtra))
suppressMessages(library(crosstalk))
suppressMessages(library(DT))
suppressMessages(library(reshape2))
suppressMessages(library(ComplexHeatmap))
suppressMessages(library(circlize))

# Reading in raw counts matrix, TIN matrix, and QC metadata
rawcounts = read.table(file = params$raw, sep = '\t', header = TRUE, row.names = 1)
#rawcounts = read.table(file = 'data/Test_Raw_RSEM_Genes_Dataset.txt', sep = '\t', header = TRUE, row.names = 1)

tincounts = read.table(file = params$tin, sep = '\t', header = TRUE, row.names = 1)
#tincounts = read.table(file = 'data/Test_TIN_Dataset.txt', sep = '\t', header = TRUE, row.names = 1)

# Remove zero variance rows prior to PC
tincounts = tincounts[apply(tincounts, 1, var) != 0, ]

multiQC = read.table(file = params$qc, sep = '\t', header = TRUE)
rownames(multiQC) = make.names(multiQC$Sample)

# Create DGEList
deg = edgeR::DGEList(counts = rawcounts)

# Filter lowly expressed genes
keep_genes = edgeR::filterByExpr(deg)        # Using default: Gene must have 10 reads in >= 70% samples
deg = deg[keep_genes,,keep.lib.sizes=FALSE]  # Recaluate new lib.sizes after filtering

# edgeR TMM normalization
deg = calcNormFactors(deg, method = "TMM")   # calculate scaling norm.factors

# limma voom normalization
deg_voom = voom(deg, normalize="quantile", plot = TRUE, save.plot = TRUE)

# Order genes by MAD
deg_voom$E <- deg_voom$E[order(apply(deg_voom$E, 1, mad), decreasing = T),]

# Remove zero variance rows prior to PC
deg_voom$E <- deg_voom$E[apply(deg_voom$E, 1, var) != 0, ]

# Principal Components Analysis
pca_exp = prcomp(t(as.matrix(deg_voom$E)), scale.=T)$x[,1:3] # Expression PC Analysis
pca_tin = prcomp(t(as.matrix(tincounts)), scale.=T)$x[,1:3]  # Transcript Integrity Number PC Analysis
colnames(pca_tin) = c("PC1_tin", "PC2_tin", "PC3_tin")       # Renaming PC cols to avoid collision with gene expression PCs

# Merge both dataframes on rowname
multiQC = transform(merge(multiQC, as.data.frame(pca_exp), by='row.names', all=TRUE), row.names=Row.names, Row.names=NULL)
multiQC = transform(merge(multiQC, as.data.frame(pca_tin), by='row.names', all=TRUE), row.names=Row.names, Row.names=NULL)

# Crosstalk object (inter-widget connectivity)
shared_metadata = SharedData$new(multiQC)
```


Interactives {data-icon="ion-android-options"}
=====================================  

Inputs {.sidebar}
-------------------------------------

### Filters

```{r filters}

# Extracted Tissue
filter_checkbox(
  id = "TissueType",
  label = "Tissue",
  sharedData = shared_metadata,
  group = ~TissueType
)

# Histology
 filter_checkbox(
   id = "histology",
   label = "Histology",
   sharedData = shared_metadata,
   group = ~histology
)

# Tumor Grade
 filter_select(
   id = "grade",
   label = "Tumor Grade",
   sharedData = shared_metadata,
   group = ~grade
)

# Tissue Source Site
filter_select(
  id = "TSS_Site",
  label = "Tissue Source",
  sharedData = shared_metadata,
  group = ~TSS_Site
)

# Biological Sex
filter_select(
  id = "sex",
  label = "Biological Sex",
  sharedData = shared_metadata,
  group = ~sex
)

# Flowcell
 filter_select(
   id = "flowcell_ids",
   label = "Flowcell",
   sharedData = shared_metadata,
   group = ~flowcell_ids
)

# Sequence Ranges
filter_select(
  id = "sequence_length",
  label = "Sequence Ranges",
  sharedData = shared_metadata,
  group = ~sequence_length
)

# Median TIN
filter_slider(
  id = "median_tin",
  label = "Median TIN",
  sharedData = shared_metadata,
  column = ~median_tin,
  step = 5,
  round = TRUE,
  sep = "",
  ticks = TRUE,
  min = 0,
  max = 100
)

# Trimmed Reads
filter_slider(
  id = "trimmed_read_pairs",
  label = "Trimmed Reads",
  sharedData = shared_metadata,
  column = ~trimmed_read_pairs,
  step = 5000000,
  round = TRUE,
  sep = "",
  ticks = TRUE,
  min = 0
)

# % Duplicates
filter_slider(
  id = "percent_duplication",
  label = "% Duplicates",
  sharedData = shared_metadata,
  column = ~percent_duplication,
  step = 5,
  round = TRUE,
  sep = "",
  ticks = TRUE,
  min = 0,
  max = 100
)

# % Aligned
filter_slider(
  id = "percent_aligned",
  label = "% Aligned",
  sharedData = shared_metadata,
  column = ~percent_aligned,
  step = 5,
  round = TRUE,
  sep = "",
  ticks = TRUE,
  min = 0,
  max = 100
)

# % Coding
filter_slider(
  id = "pct_coding_bases",
  label = "% Coding",
  sharedData = shared_metadata,
  column = ~pct_coding_bases,
  step = 5,
  round = TRUE,
  sep = "",
  ticks = TRUE,
  min = 0,
  max = 100
)

# % Intronic
filter_slider(
  id = "pct_intronic_bases",
  label = "% Intronic",
  sharedData = shared_metadata,
  column = ~pct_intronic_bases,
  step = 5,
  sep = "",
  ticks = TRUE,
  min = 0,
  max = 100
)

# % rRNA
filter_slider(
  id = "rRNA_percent_aligned",
  label = "% rRNA",
  sharedData = shared_metadata,
  column = ~rRNA_percent_aligned,
  step = 5,
  sep = "",
  round = TRUE,
  ticks = TRUE,
  min = 0,
  max = 100
)

# GC content
filter_slider(
  id = "gc_content",
  label = "GC content",
  sharedData = shared_metadata,
  column = ~gc_content,
  sep = "",
  ticks = TRUE
)

# CV Coverage
filter_slider(
  id = "median_cv_coverage",
  label = "CV Coverage",
  sharedData = shared_metadata,
  column = ~median_cv_coverage,
  sep = "",
  ticks = TRUE
)

# Inner Distance Maxima
filter_slider(
  id = "inner_distance_maxima",
  label = "Inner Distance Maxima",
  sharedData = shared_metadata,
  column = ~inner_distance_maxima,
  step = 10,
  sep = "",
  ticks = TRUE,
  round = TRUE
)

# Insert Size
filter_slider(
  id = "median_insert_size",
  label = "Insert Size",
  sharedData = shared_metadata,
  column = ~median_insert_size,
  step = 5,
  sep = "",
  ticks = TRUE
)

```


Row {data-height=400}
-------------------------------------

### Metadata
```{r datatable}
shared_metadata %>%
  DT::datatable(
    # selection = 'none', # disable datatable row selection
    # filter = "top",     # allows filtering on each column
    extensions = c(
      "Buttons",      # add download buttons
      "Scroller"      # for scrolling instead of pagination
    ),
    rownames = FALSE,  # remove rownames
    style = "bootstrap",
    class = "compact",
    width = "100%",
    options = list(
      dom = "Blrtip",  # specify content (search box, etc)
      deferRender = TRUE,
      scrollY = 300,
      scroller = TRUE,
      columnDefs = list(
        list(
          visible = FALSE,
          targets = c(1, 11, 14, 15, 18, 19, 21:23, 27, 30, 36:41)   # hide columns, last index range is for PCAs (tin/exp)
        )
      ),
      buttons = list(
        I("colvis"),  # turn columns on and off
        "csv",  # download as .csv
        "excel"  # download as .xlsx
      )
    ),
    colnames = c(
      "Sample ID" = "Sample",
      "Total Reads" = "total_read_pairs",
      "Trimmed Reads" = "trimmed_read_pairs",
      "Avg Seq Length" = "avg_sequence_length",
      "Seq Range" = "sequence_length",
      "GC" = "gc_content",
      "% Dup" = "percent_duplication",
      "% Aligned " = "percent_aligned",
      "Inner Distance Maxima" = "inner_distance_maxima",
      "Insert Size " = "median_insert_size",
      "Avg MapQ" = "mean_mapping_quality",
      "Coverage" = "mean_coverage",
      "% Coding" = "pct_coding_bases",
      "% Intronic" = "pct_intronic_bases",
      "CV Coverage" = "median_cv_coverage",
      "% rRNA" = "rRNA_percent_aligned" ,
      "% UniVec" = "uni_vec_percent_aligned",
      "% Anti-sense" = "percent_antisense_strand",
      "Median TIN" = "median_tin",
      "Flowcell" = "flowcell_ids",
      "Biological Sex" = "sex",
      "Tumor Grade" = "grade",
      "Histology" = "histology",
      "Tissue Source" = "TSS_Site",
      "Tissue" = "TissueType"
    )
  )
```


Row {data-height=600}
-------------------------------------

### Expression PCA

```{r 3d-expression-pca}
# Principal Components Analysis
pca = prcomp(t(as.matrix(deg_voom$E)), scale.=T)

# Variance explained for PCs: 1, 2, 3
pc1 = round(pca$sdev[1]^2/sum(pca$sdev^2)*100,2)
pc2 = round(pca$sdev[2]^2/sum(pca$sdev^2)*100,2)
pc3 = round(pca$sdev[3]^2/sum(pca$sdev^2)*100,2)

cgroups = as.factor(multiQC$histology)
cgroups = addNA(cgroups)
cpalette <- brewer.pal(nlevels(cgroups), "Paired")

p <- plot_ly(shared_metadata, x = ~PC1, y = ~PC2, z = ~PC3, color=cgroups, colors=cpalette, hoverinfo="text", marker=list(size  = 8),
            text = ~paste('
Sample: ', Sample, '
Tissue: ', TissueType, '
Histology: ', histology, '
Grade: ', grade, '
Flowcell: ', flowcell_ids, '


% Aligned: ', percent_aligned, '
% Dup: ', percent_duplication, '
% Coding: ', pct_coding_bases, '
% Intronic: ', pct_intronic_bases, '

medTIN: ', median_tin, '
Sequence Range: ', sequence_length, '
Inner Distance Maxima: ', inner_distance_maxima, '
Insert Size: ', median_insert_size) ) %>% add_markers() %>% layout(scene = list(xaxis = list(title = paste0("PC1 (",pc1,"%)")), yaxis = list(title = paste0("PC2 (",pc2,"%)")), zaxis = list(title = paste0("PC3 (",pc3,"%)")))) # Important: disable onclick() events plotly::highlight(p, on = NULL) # fixes unexpected behavior when multiple plots via crosstalk ``` ### TIN PCA ```{r 3d-tin-pca} # Principal Components Analysis pca = prcomp(t(as.matrix(tincounts)), scale.=T) # Variance explained for PCs: 1, 2, 3 pc1 = round(pca$sdev[1]^2/sum(pca$sdev^2)*100,2) pc2 = round(pca$sdev[2]^2/sum(pca$sdev^2)*100,2) pc3 = round(pca$sdev[3]^2/sum(pca$sdev^2)*100,2) cgroups = as.factor(multiQC$histology) cgroups = addNA(cgroups) cpalette <- brewer.pal(nlevels(cgroups), "Paired") p <- plot_ly(shared_metadata, x = ~PC1_tin, y = ~PC2_tin, z = ~PC3_tin, color=cgroups, colors=cpalette, hoverinfo="text", marker=list(size = 8), text = ~paste('
Sample: ', Sample, '
Tissue: ', TissueType, '
Histology: ', histology, '
Grade: ', grade, '
Flowcell: ', flowcell_ids, '


% Aligned: ', percent_aligned, '
% Dup: ', percent_duplication, '
% Coding: ', pct_coding_bases, '
% Intronic: ', pct_intronic_bases, '

medTIN: ', median_tin, '
Sequence Range: ', sequence_length, '
Inner Distance Maxima: ', inner_distance_maxima, '
Insert Size: ', median_insert_size) ) %>% add_markers() %>% layout(scene = list(xaxis = list(title = paste0("PC1 (",pc1,"%)")), yaxis = list(title = paste0("PC2 (",pc2,"%)")), zaxis = list(title = paste0("PC3 (",pc3,"%)")))) # Important: disable onclick() events plotly::highlight(p, on = NULL) # fixes unexpected behavior when multiple plots via crosstalk ``` Subplots {data-icon="ion-grid"} ===================================== **2D Expression PCAs colored by different features** Row ------------------------------------- ```{r pca-initialize} # Principal Components Analysis pca = prcomp(t(as.matrix(deg_voom$E)), scale.=T) # Variance explained for PCs: 1, 2, 3 pc1 = round(pca$sdev[1]^2/sum(pca$sdev^2)*100,2) pc2 = round(pca$sdev[2]^2/sum(pca$sdev^2)*100,2) ``` ### Tissue ```{r colored-by-tissue-type} # Gene Expression PCA colored by Tissue Type g <- ggplot(multiQC, aes(PC1, PC2, color = TissueType), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() + labs(color = "Tissue", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) g ``` ### GC Content ```{r colored-by-gc-content} # Gene Expression PCA colored by GC Content g <- ggplot(multiQC, aes(PC1, PC2, color = gc_content), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() + labs(color = "GC", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) + scale_colour_gradientn(colours = viridis::viridis(100)) g ``` ### Median TIN ```{r colored-by-tin} # Gene Expression PCA colored by TIN g <- ggplot(multiQC, aes(PC1, PC2, color = median_tin), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() + labs(color = "medTIN", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) + scale_colour_gradientn(colours = viridis::viridis(100)) g ``` ### % Dups ```{r colored-by-dups} # Gene Expression PCA colored by % Duplicates g <- ggplot(multiQC, aes(PC1, PC2, color = percent_duplication), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() + labs(color = "% Dups", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) + scale_colour_gradientn(colours = viridis::viridis(100)) g ``` ### % Aligned ```{r colored-by-alignment} # Gene Expression PCA colored by % Aligned g <- ggplot(multiQC, aes(PC1, PC2, color = percent_aligned), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() + labs(color = "% Aligned", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) + scale_colour_gradientn(colours = viridis::viridis(100)) g ``` Row ------------------------------------- ### Histology ```{r colored-by-histology-type} # Gene Expression PCA colored by Histology g <- ggplot(multiQC, aes(PC1, PC2, color = histology), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() + labs(color = "Histology", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) g ``` ### CV Coverage ```{r colored-by-cv-coverage} # Gene Expression PCA colored by CV Coverage g <- ggplot(multiQC, aes(PC1, PC2, color = median_cv_coverage), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() + labs(color = "CV Coverage", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) + scale_colour_gradientn(colours = viridis::viridis(100)) g ``` ### 3' Prime Bias ```{r colored-by-3-prime-coverage} # Gene Expression PCA colored by 3' Prime Coverage g <- ggplot(multiQC, aes(PC1, PC2, color = median_3prime_bias), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() + labs(color = "3' Prime Coverage", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) + scale_colour_gradientn(colours = viridis::viridis(100)) g ``` ### Insert Size ```{r colored-by-insert-size} # Gene Expression PCA colored by Insert Size g <- ggplot(multiQC, aes(PC1, PC2, color = median_insert_size), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() + labs(color = "Insert Size", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) + scale_colour_gradientn(colours = viridis::viridis(100)) g ``` ### Inner Distance Maxima ```{r colored-by-inner-distance} # Gene Expression PCA colored by Inner Distance g <- ggplot(multiQC, aes(PC1, PC2, color = inner_distance_maxima), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() + labs(color = "Inner Distance Maxima", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) + scale_colour_gradientn(colours = viridis::viridis(100)) g ``` Row ------------------------------------- ### Tumor Grade ```{r colored-by-tumor-grade} # Gene Expression PCA colored by Tumor Grade g <- ggplot(multiQC, aes(PC1, PC2, color = grade), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() + labs(color = "Tumor Grade", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) g ``` ### Biological Sex ```{r colored-by-biological-sex} # Gene Expression PCA colored by Sex g <- ggplot(multiQC, aes(PC1, PC2, color = sex), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() + labs(color = "Sex", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) g ``` ### % Coding ```{r colored-by-coding} # Gene Expression PCA colored by % Coding g <- ggplot(multiQC, aes(PC1, PC2, color = pct_coding_bases), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() + labs(color = "% Coding", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) + scale_colour_gradientn(colours = viridis::viridis(100)) g ``` ### % rRNA ```{r colored-by-rrna} # Gene Expression PCA colored by % rRNA g <- ggplot(multiQC, aes(PC1, PC2, color = rRNA_percent_aligned), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() + labs(color = "% rRNA", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) + scale_colour_gradientn(colours = viridis::viridis(100)) g ``` ### % Anti-sense ```{r colored-by-anti-sense} # Gene Expression PCA colored by % Anti-sense g <- ggplot(multiQC, aes(PC1, PC2, color = percent_antisense_strand), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() + labs(color = "% Anti-sense", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) + scale_colour_gradientn(colours = viridis::viridis(100)) g ``` Corr plots {data-orientation=columns data-icon="ion-stats-bars"} ===================================== **Feature correlation plots** Column {data-width=500} ------------------------------------- ### Heirarchical clustering of pairwise spearman correlation coefficients ```{r heirarchical-correlation-matrix, dpi=300} # Helper Function reorder_cormat <- function(cormat){ # Use correlation between variables as distance for hierarchical clustering dd <- as.dist((1-cormat)/2) hc <- hclust(dd) cormat <-cormat[hc$order, hc$order] } # Remove all columns that are categorical numericQC = multiQC[,-which(sapply(multiQC, class) == "factor")] # Additional columns to remove additional_remove <- names(numericQC) %in% c("total_read_pairs", "mean_insert_size", "avg_aligned_read_length", "pct_mrna_bases", "pct_utr_bases", "pct_intergenic_bases", "median_5prime_to_3prime_bias", "median_5prime_bias", "mean_mapping_quality", "median_3prime_bias", "percent_sense_strand", "PC1_tin", "PC2_tin", "PC3_tin") # Cleaned numerical QC dataframe numericQC = numericQC[!additional_remove] # Remove zero-variance columns to prevent any hlclust() errors #numericQC = numericQC[,-which(apply(numericQC, 2, var) == 0)] numericQC = numericQC[, apply(numericQC, 2, var) != 0] # Pair-wise spearman correlation matrix cormatrix = round(cor(numericQC, method = "spearman"),2) # Reorder the correlation matrix based on hierarchical clustering of the correlation coeff cormat <- reorder_cormat(cormatrix) # Get upper triangle of the correlation matrix cormat[lower.tri(cormat)] <- NA # Remove lower trinagle and reshape from wide to long format cormat <- melt(cormat, na.rm = TRUE) # Correlation ggheatmap ggheatmap <- ggplot(cormat, aes(Var2, Var1, fill = value)) + geom_tile(color = "white") + scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), space = "Lab", name="Spearman\nCorrelation") + theme_minimal() + theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 9, hjust = 1)) + coord_fixed() + theme(axis.title.x = element_blank(), axis.title.y = element_blank(), panel.grid.major = element_blank(), panel.border = element_blank(), panel.background = element_blank(), axis.ticks = element_blank(), legend.justification = c(1, 0), legend.position = c(0.6, 0.7), legend.direction = "horizontal") + guides(fill = guide_colorbar(barwidth = 8, barheight = 1, title.position = "top", title.hjust = 0.5)) ggheatmap ``` Column {data-width=500} ------------------------------------- ### Complete linkage clustering of PC loadings with QC annotations ```{r loadings-heatmap, dpi=300, fig.height=6} # Principal Components Analysis: 5 PCs as heatmap input pca_exp = prcomp(t(as.matrix(deg_voom$E)), scale.=T)$x[,1:5] # Expression PC Analysis hm_data <- as.matrix(t(pca_exp)) # Input for heatmap # Additional columns to remove additional_remove <- names(numericQC) %in% c("PC1", "PC2", "PC3") # Cleaned numerical QC dataframe with matched rownames numericQC = numericQC[match(colnames(hm_data), rownames(numericQC), nomatch=0), !additional_remove] column_annotations = HeatmapAnnotation(df = numericQC) if (params$annot){ cheatmap <- ComplexHeatmap::Heatmap(hm_data, col=colorRamp2(seq(-max(abs(pca_exp), na.rm = T), max(abs(pca_exp), na.rm = T), length.out = 20), rev(colorRampPalette(brewer.pal(9, "PuOr"))(20))), bottom_annotation = column_annotations, show_column_names=T, column_names_rot = 45, show_heatmap_legend = F) # Turning off to control the placement } else{ cheatmap <- ComplexHeatmap::Heatmap(hm_data, col=colorRamp2(seq(-max(abs(pca_exp), na.rm = T), max(abs(pca_exp), na.rm = T), length.out = 20), rev(colorRampPalette(brewer.pal(9, "PuOr"))(20))), bottom_annotation = column_annotations, show_column_names=F, show_heatmap_legend = F) # Turning off to control the placement } draw(cheatmap, show_annotation_legend = FALSE) ``` Information {data-orientation=columns data-icon="fa-info-circle"} ===================================== Column {data-width=600} ------------------------------------- ### Overview **Quantification and quality-control pipeline** The quality of each sample was independently assessed using FastQC, Preseq, Picard tools, RSeQC, SAMtools, and QualiMap. FastQ Screen and Kraken + Krona were used to screen for various sources of contamination. Adapter sequences were removed using Cutadapt prior to mapping to hg38 reference genome. STAR was run in _two-pass_ mode where splice-junctions are collected and aggregated across all samples and provided to the second-pass of STAR. Gene expression levels were quantified using RSEM. The expected counts from RSEM are merged across samples to create a counts matrix for downstream analysis. RSeQC `tin.py` was used to calculate transcript integrity numbers for all canonical protein-coding transcripts. **Downstream Analysis** The expected counts from RSEM were filtered to remove lowly expressed genes using edgeR's `filterByExpr()` function. The following critea were selected for filtering: genes must have 10 reads in >= 70% samples. After filtering, we are left with `r dim(deg)[1]` genes. Trimmed mean of M-values (TMM) was performed using the `calcNormFactors()` function in edgeR. The normalisation factors calculated here are used as a scaling factor for the library sizes. Using the `voom()` function in limma, the counts are converted log2-counts-per-million (logCPM) and quantile normalized. _voom_ is an acronym for mean-variance modelling at the observational level. The key concern is to estimate the mean-variance relationship in the data, then use this to compute appropriate weights for each observation. Count data almost show non-trivial mean-variance relationships. Raw counts show increasing variance with increasing count size, while log-counts typically show a decreasing mean-variance trend. This function estimates the mean-variance trend for log-counts, then assigns a weight to each observation based on its predicted variance. The weights are then used in the linear modelling process to adjust for heteroscedasticity. | Tool | Version | Notes | |----------------|:-------:|---------------------------------------------------------------------------------------------------------------| | FastQC2 | 0.11.5 | **Quality-control step** to assess sequencing quality, run before and after adapter trimming | | Cutadapt3 | 1.18 | **Data processing step** to remove adapter sequences and perform quality trimming | | Kraken14 | 1.1 | **Quality-control step** to assess microbial taxonomic composition | | KronaTools15 | 2.7 | **Quality-control step** to visualize kraken output | | FastQ Screen17 | 0.9.3 | **Quality-control step** to assess contamination; additional dependencies: `bowtie2/2.3.4`, `perl/5.24.3` | | STAR4 | 2.7.0f | **Data processing step** to align reads against reference genome (using its two-pass mode) | | QualiMap16 | 2.2.1 | **Quality-control step** to assess various alignment metrics, also calculates insert_size | | Picard10 | 2.17.11 | **Quality-control step** to run `MarkDuplicates`, `CollectRnaSeqMetrics` and `AddOrReplaceReadGroups` | | Preseq1 | 2.0.3 | **Quality-control step** to estimate library complexity | | SAMtools13 | 1.6 | **Quality-control step** to run `flagstat` to calculate alignment statistics | | RSeQC9 | 2.6.4 | **Quality-control step** to infer stranded-ness, TIN19, and read distributions over specific genomic features | | RSEM5 | 1.3.0 | **Data processing step** to quantify gene and isoform counts | | MultiQC11 | 1.4 | **Reporting step** to aggregate sample statistics and quality-control information across all sample | **General Recommendations** Here is a set of generalized guidelines for different QC metrics. Some of these metrics will vary genome-to-genome depending on the quality of the assembly and annotation but that has been taken into consideration for our set of supported reference genomes. | Metric | Guideline | |-----------------------------|:-------------------------------:| | *medTIN* | > 65 | | *Trimmed Reads* | > 10000000 | | *% Aligned to Reference* | > 65% | | *% Duplicates* | < 65 % | | *% rRNA* | < 10% | | *% Coding* | Coding > 35% | **References** **1.** Daley, T. and A.D. Smith, Predicting the molecular complexity of sequencing libraries. Nat Methods, 2013. 10(4): p. 325-7. **2.** Andrews, S. (2010). FastQC: a quality control tool for high throughput sequence data. **3.** Martin, M. (2011). "Cutadapt removes adapter sequences from high-throughput sequencing reads." EMBnet 17(1): 10-12. **4.** Dobin, A., et al., STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 2013. 29(1): p. 15-21. **5.** Li, B. and C.N. Dewey, RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 2011. 12: p. 323. **6.** Harrow, J., et al., GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res, 2012. 22(9): p. 1760-74. **7.** Law, C.W., et al., voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol, 2014. 15(2): p. R29. **8.** Smyth, G.K., Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol, 2004. 3: p. Article3. **9.** Wang, L., et al. (2012). "RSeQC: quality control of RNA-seq experiments." Bioinformatics 28(16): 2184-2185. **10.** The Picard toolkit. https://broadinstitute.github.io/picard/. **11.** Ewels, P., et al. (2016). "MultiQC: summarize analysis results for multiple tools and samples in a single report." Bioinformatics 32(19): 3047-3048. **12.** R Core Team (2018). R: A Language and Environment for Statistical Computing. Vienna, Austria, R Foundation for Statistical Computing. **13.** Li, H., et al. (2009). "The Sequence Alignment/Map format and SAMtools." Bioinformatics 25(16): 2078-2079. **14.** Wood, D. E. and S. L. Salzberg (2014). "Kraken: ultrafast metagenomic sequence classification using exact alignments." Genome Biol 15(3): R46. **15.** Ondov, B. D., et al. (2011). "Interactive metagenomic visualization in a Web browser." BMC Bioinformatics 12(1): 385. **16.** Okonechnikov, K., et al. (2015). "Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data." Bioinformatics 32(2): 292-294. **17.** Wingett, S. and S. Andrews (2018). "FastQ Screen: A tool for multi-genome mapping and quality control." F1000Research 7(2): 1338. **18.** Robinson, M. D., et al. (2009). "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics 26(1): 139-140. **19.** Wang, L., et al. (2016). "Measure transcript integrity using RNA-seq data." BMC Bioinformatics 17(1): 58. Column {data-width=400} ------------------------------------- ### Session Information ```{r} sessionInfo() ```