report Not Applicable

Tool	Version	Notes
FastQC²	0.11.5	Quality-control step to assess sequencing quality, run before and after adapter trimming
Cutadapt³	1.18	Data processing step to remove adapter sequences and perform quality trimming
Kraken¹⁴	1.1	Quality-control step to assess microbial taxonomic composition
KronaTools¹⁵	2.7	Quality-control step to visualize kraken output
FastQ Screen¹⁷	0.9.3	Quality-control step to assess contamination; additional dependencies: `bowtie2/2.3.4`, `perl/5.24.3`
STAR⁴	2.7.0f	Data processing step to align reads against reference genome (using its two-pass mode)
QualiMap¹⁶	2.2.1	Quality-control step to assess various alignment metrics, also calculates insert_size
Picard¹⁰	2.17.11	Quality-control step to run `MarkDuplicates`, `CollectRnaSeqMetrics` and `AddOrReplaceReadGroups`
Preseq¹	2.0.3	Quality-control step to estimate library complexity
SAMtools¹³	1.6	Quality-control step to run `flagstat` to calculate alignment statistics
RSeQC⁹	2.6.4	Quality-control step to infer stranded-ness, TIN¹⁹, and read distributions over specific genomic features
RSEM⁵	1.3.0	Data processing step to quantify gene and isoform counts
MultiQC¹¹	1.4	Reporting step to aggregate sample statistics and quality-control information across all sample
Metric	Guideline
medTIN	> 65
Trimmed Reads	> 10000000
% Aligned to Reference	> 65%
% Duplicates	< 65 %
% rRNA	< 10%
% Coding	Coding > 35%
---
title: "report Not Applicable"
author: "Assess the quality of large RNA-seq cohorts"
params:
  raw: "/path/to/raw/counts/matrix.tsv"
  tin: "/path/to/tin/counts/martrix.tsv"
  qc: "/path/to/qc/table.tsv"
  wdir: "/path/to/output/directory"
  annot: "FALSE"
output:
  flexdashboard::flex_dashboard:
    orientation: rows
    navbar:
      - { title: "Pipeline Documentation", href: "https://ccbr.github.io/pipeliner-docs/RNA-seq/Theory-and-practical-guide-for-RNA-seq/", align: right }
    source_code: "embed"
---


```{r setup, include=FALSE}
# Clear R environment
# rm(list = ls())
set.seed(42)

# Set the working directory
knitr::opts_knit$set(root.dir = params$wdir)
knitr::opts_chunk$set(echo = FALSE, warning=FALSE, dev="png", fig.path=file.path(params$wdir, "figs/"))
```


```{r global, include=FALSE}
# Library imports
suppressMessages(library(plyr))
suppressMessages(library(plotly))
suppressMessages(library(ggplot2))
suppressMessages(library(limma))
suppressMessages(library(edgeR))
suppressMessages(library(RColorBrewer))
suppressMessages(library(gridExtra))
suppressMessages(library(crosstalk))
suppressMessages(library(DT))
suppressMessages(library(reshape2))
suppressMessages(library(ComplexHeatmap))
suppressMessages(library(circlize))

# Reading in raw counts matrix, TIN matrix, and QC metadata
rawcounts = read.table(file = params$raw, sep = '\t', header = TRUE, row.names = 1)
#rawcounts = read.table(file = 'data/Test_Raw_RSEM_Genes_Dataset.txt', sep = '\t', header = TRUE, row.names = 1)

tincounts = read.table(file = params$tin, sep = '\t', header = TRUE, row.names = 1)
#tincounts = read.table(file = 'data/Test_TIN_Dataset.txt', sep = '\t', header = TRUE, row.names = 1)

# Remove zero variance rows prior to PC
tincounts = tincounts[apply(tincounts, 1, var) != 0, ]

multiQC = read.table(file = params$qc, sep = '\t', header = TRUE)
rownames(multiQC) = make.names(multiQC$Sample)

# Create DGEList
deg = edgeR::DGEList(counts = rawcounts)

# Filter lowly expressed genes
keep_genes = edgeR::filterByExpr(deg)        # Using default: Gene must have 10 reads in >= 70% samples
deg = deg[keep_genes,,keep.lib.sizes=FALSE]  # Recaluate new lib.sizes after filtering

# edgeR TMM normalization
deg = calcNormFactors(deg, method = "TMM")   # calculate scaling norm.factors

# limma voom normalization
deg_voom = voom(deg, normalize="quantile", plot = TRUE, save.plot = TRUE)

# Order genes by MAD
deg_voom$E <- deg_voom$E[order(apply(deg_voom$E, 1, mad), decreasing = T),]

# Remove zero variance rows prior to PC
deg_voom$E <- deg_voom$E[apply(deg_voom$E, 1, var) != 0, ]

# Principal Components Analysis
pca_exp = prcomp(t(as.matrix(deg_voom$E)), scale.=T)$x[,1:3] # Expression PC Analysis
pca_tin = prcomp(t(as.matrix(tincounts)), scale.=T)$x[,1:3]  # Transcript Integrity Number PC Analysis
colnames(pca_tin) = c("PC1_tin", "PC2_tin", "PC3_tin")       # Renaming PC cols to avoid collision with gene expression PCs

# Merge both dataframes on rowname
multiQC = transform(merge(multiQC, as.data.frame(pca_exp), by='row.names', all=TRUE), row.names=Row.names, Row.names=NULL)
multiQC = transform(merge(multiQC, as.data.frame(pca_tin), by='row.names', all=TRUE), row.names=Row.names, Row.names=NULL)

# Crosstalk object (inter-widget connectivity)
shared_metadata = SharedData$new(multiQC)
```


Interactives {data-icon="ion-android-options"}
=====================================  

Inputs {.sidebar}
-------------------------------------

### Filters

```{r filters}

# Extracted Tissue
filter_checkbox(
  id = "TissueType",
  label = "Tissue",
  sharedData = shared_metadata,
  group = ~TissueType
)


# Tissue Source Site
filter_checkbox(
  id = "TSS_Site",
  label = "Tissue Source",
  sharedData = shared_metadata,
  group = ~TSS_Site
)


# Sequence Ranges
filter_select(
  id = "sequence_length",
  label = "Sequence Ranges",
  sharedData = shared_metadata,
  group = ~sequence_length
)


# Median TIN
filter_slider(
  id = "median_tin",
  label = "medTIN",
  sharedData = shared_metadata,
  column = ~median_tin,
  step = 5,
  round = TRUE,
  sep = "",
  ticks = TRUE,
  min = 0,
  max = 100
)


# Trimmed Reads
filter_slider(
  id = "trimmed_read_pairs",
  label = "Trimmed Reads",
  sharedData = shared_metadata,
  column = ~trimmed_read_pairs,
  step = 5000000,
  round = TRUE,
  sep = "",
  ticks = TRUE,
  min = 0
)


# % Duplicates
filter_slider(
  id = "percent_duplication",
  label = "% Duplicates",
  sharedData = shared_metadata,
  column = ~percent_duplication,
  step = 5,
  round = TRUE,
  sep = "",
  ticks = TRUE,
  min = 0,
  max = 100
)


# % Aligned
filter_slider(
  id = "percent_aligned",
  label = "% Aligned",
  sharedData = shared_metadata,
  column = ~percent_aligned,
  step = 5,
  round = TRUE,
  sep = "",
  ticks = TRUE,
  min = 0,
  max = 100
)


# % Coding
filter_slider(
  id = "pct_coding_bases",
  label = "% Coding",
  sharedData = shared_metadata,
  column = ~pct_coding_bases,
  step = 5,
  round = TRUE,
  sep = "",
  ticks = TRUE,
  min = 0,
  max = 100
)


# % Intronic
filter_slider(
  id = "pct_intronic_bases",
  label = "% Intronic",
  sharedData = shared_metadata,
  column = ~pct_intronic_bases,
  step = 5,
  sep = "",
  ticks = TRUE,
  min = 0,
  max = 100
)


# % rRNA
filter_slider(
  id = "rRNA_percent_aligned",
  label = "% rRNA",
  sharedData = shared_metadata,
  column = ~rRNA_percent_aligned,
  step = 5,
  sep = "",
  round = TRUE,
  ticks = TRUE,
  min = 0,
  max = 100
)


# GC content
filter_slider(
  id = "gc_content",
  label = "GC content",
  sharedData = shared_metadata,
  column = ~gc_content,
  sep = "",
  ticks = TRUE
)


# CV Coverage
filter_slider(
  id = "median_cv_coverage",
  label = "CV Coverage",
  sharedData = shared_metadata,
  column = ~median_cv_coverage,
  sep = "",
  ticks = TRUE
)


# Inner Distance Maxima
filter_slider(
  id = "inner_distance_maxima",
  label = "Inner Distance Maxima",
  sharedData = shared_metadata,
  column = ~inner_distance_maxima,
  step = 10,
  sep = "",
  ticks = TRUE,
  round = TRUE
)

# Insert Size
filter_slider(
  id = "median_insert_size",
  label = "Insert Size",
  sharedData = shared_metadata,
  column = ~median_insert_size,
  step = 5,
  sep = "",
  ticks = TRUE
)

```


Row {data-height=400}
-------------------------------------

### Metadata
```{r datatable}
shared_metadata %>%
  DT::datatable(
    # selection = 'none', # disable datatable row selection
    # filter = "top",     # allows filtering on each column
    extensions = c(
      "Buttons",      # add download buttons
      "Scroller"      # for scrolling instead of pagination
    ),
    rownames = FALSE,  # remove rownames
    style = "bootstrap",
    class = "compact",
    width = "100%",
    options = list(
      dom = "Blrtip",  # specify content (search box, etc)
      deferRender = TRUE,
      scrollY = 300,
      scroller = TRUE,
      columnDefs = list(
        list(
          visible = FALSE,
          targets = c(1, 11, 14, 15, 18, 19, 21:23, 27, 31, 32, 33, 34, 35, 36)   # hide columes
        )
      ),
      buttons = list(
        I("colvis"),  # turn columns on and off
        "csv",  # download as .csv
        "excel"  # download as .xlsx
      )
    ),
    colnames = c(
      "Sample ID" = "Sample",
      "Total Reads" = "total_read_pairs",
      "Trimmed Reads" = "trimmed_read_pairs",
      "Avg Seq Length" = "avg_sequence_length",
      "Seq Range" = "sequence_length",
      "GC" = "gc_content",
      "% Dup" = "percent_duplication",
      "% Aligned " = "percent_aligned",
      "Inner Distance Maxima" = "inner_distance_maxima",
      "Insert Size " = "median_insert_size",
      "Avg MapQ" = "mean_mapping_quality",
      "Coverage" = "mean_coverage",
      "% Coding" = "pct_coding_bases",
      "% Intronic" = "pct_intronic_bases",
      "CV Coverage" = "median_cv_coverage",
      "% rRNA" = "rRNA_percent_aligned" ,
      "% UniVec" = "uni_vec_percent_aligned",
      "Tissue" = "TissueType",
      "% Anti-sense" = "percent_antisense_strand",
      "medTIN" = "median_tin",
      "Tissue Source" = "TSS_Site"
    )
  )
```


Row {data-height=600}
-------------------------------------

### Expression PCA

```{r 3d-expression-pca}
# Principal Components Analysis
pca = prcomp(t(as.matrix(deg_voom$E)), scale.=T)

# Variance explained for PCs: 1, 2, 3
pc1 = round(pca$sdev[1]^2/sum(pca$sdev^2)*100,2)
pc2 = round(pca$sdev[2]^2/sum(pca$sdev^2)*100,2)
pc3 = round(pca$sdev[3]^2/sum(pca$sdev^2)*100,2)

cgroups = as.factor(multiQC$TissueType)
cgroups = addNA(cgroups)
cpalette <- brewer.pal(nlevels(cgroups), "Paired")

p <- plot_ly(shared_metadata, x = ~PC1, y = ~PC2, z = ~PC3, color=cgroups, colors=cpalette, hoverinfo="text", marker=list(size  = 8),
            text = ~paste('
 Sample: ', Sample, '
 Tissue Source: ', TSS_Site, '

 % Aligned: ', percent_aligned,
                     '
 % Dup: ', percent_duplication, '
 % Coding: ', pct_coding_bases,
                     '
 % Intronic: ', pct_intronic_bases, '

 medTIN: ', median_tin, '
 Sequence Range: ', sequence_length,
                     '
 Inner Distance Maxima: ', inner_distance_maxima, '
 Insert Size: ', median_insert_size) ) %>%
      add_markers() %>%  layout(scene = list(xaxis = list(title = paste0("PC1 (",pc1,"%)")),
                                              yaxis = list(title = paste0("PC2 (",pc2,"%)")),
                                              zaxis = list(title = paste0("PC3 (",pc3,"%)"))))
# Important: disable onclick() events
plotly::highlight(p, on = NULL) # fixes unexpected behavior when multiple plots via crosstalk
```


### TIN PCA

```{r 3d-tin-pca}
# Principal Components Analysis
pca = prcomp(t(as.matrix(tincounts)), scale.=T)

# Variance explained for PCs: 1, 2, 3
pc1 = round(pca$sdev[1]^2/sum(pca$sdev^2)*100,2)
pc2 = round(pca$sdev[2]^2/sum(pca$sdev^2)*100,2)
pc3 = round(pca$sdev[3]^2/sum(pca$sdev^2)*100,2)

cgroups = as.factor(multiQC$TissueType)
cgroups = addNA(cgroups)
cpalette <- brewer.pal(nlevels(cgroups), "Paired")

p <- plot_ly(shared_metadata, x = ~PC1_tin, y = ~PC2_tin, z = ~PC3_tin, color=cgroups, colors=cpalette, hoverinfo="text", marker=list(size  = 8),
            text = ~paste('
 Sample: ', Sample, '
 Tissue Source: ', TSS_Site, '

 % Aligned: ', percent_aligned,
                     '
 % Dup: ', percent_duplication, '
 % Coding: ', pct_coding_bases,
                     '
 % Intronic: ', pct_intronic_bases, '

 medTIN: ', median_tin, '
 Sequence Range: ', sequence_length,
                     '
 Inner Distance Maxima: ', inner_distance_maxima, '
 Insert Size: ', median_insert_size) ) %>%
      add_markers() %>%  layout(scene = list(xaxis = list(title = paste0("PC1 (",pc1,"%)")),
                                              yaxis = list(title = paste0("PC2 (",pc2,"%)")),
                                              zaxis = list(title = paste0("PC3 (",pc3,"%)"))))

# Important: disable onclick() events
plotly::highlight(p, on = NULL) # fixes unexpected behavior when multiple plots via crosstalk
```

Subplots {data-icon="ion-grid"}
=====================================

**2D Expression PCAs colored by different features**

Row
-------------------------------------

```{r pca-initialize}
# Principal Components Analysis
pca = prcomp(t(as.matrix(deg_voom$E)), scale.=T)

# Variance explained for PCs: 1, 2, 3
pc1 = round(pca$sdev[1]^2/sum(pca$sdev^2)*100,2)
pc2 = round(pca$sdev[2]^2/sum(pca$sdev^2)*100,2)

```

### Tissue Type

```{r colored-by-tissue-type}
# Gene Expression PCA colored by Tissue Type
g <- ggplot(multiQC, aes(PC1, PC2, color = TissueType), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() +
        labs(color = "Tissue Type", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)"))

g
```   

### medTIN

```{r colored-by-tin}
# Gene Expression PCA colored by TIN
g <- ggplot(multiQC, aes(PC1, PC2, color = median_tin), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() +
        labs(color = "medTIN", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) +
          scale_colour_gradientn(colours = viridis::viridis(100))

g
```   

### % Dups

```{r colored-by-dups}
# Gene Expression PCA colored by % Duplicates
g <- ggplot(multiQC, aes(PC1, PC2, color = percent_duplication), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() +
        labs(color = "% Dups", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) +
          scale_colour_gradientn(colours = viridis::viridis(100))

g
```

### % Aligned

```{r colored-by-alignment}
# Gene Expression PCA colored by % Aligned
g <- ggplot(multiQC, aes(PC1, PC2, color = percent_aligned), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() +
        labs(color = "% Aligned", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) +
          scale_colour_gradientn(colours = viridis::viridis(100))

g
```

Row
-------------------------------------

### CV Coverage

```{r colored-by-cv-coverage}
# Gene Expression PCA colored by CV Coverage
g <- ggplot(multiQC, aes(PC1, PC2, color = median_cv_coverage), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() +
        labs(color = "CV Coverage", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) +
        scale_colour_gradientn(colours = viridis::viridis(100))

g
```   

### 3' Prime Bias

```{r colored-by-3-prime-coverage}
# Gene Expression PCA colored by 3' Prime Coverage
g <- ggplot(multiQC, aes(PC1, PC2, color = median_3prime_bias), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() +
        labs(color = "3' Prime Coverage", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) +
          scale_colour_gradientn(colours = viridis::viridis(100))

g
```

### Insert Size

```{r colored-by-insert-size}
# Gene Expression PCA colored by Insert Size
g <- ggplot(multiQC, aes(PC1, PC2, color = median_insert_size), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() +
        labs(color = "Insert Size", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) +
          scale_colour_gradientn(colours = viridis::viridis(100))

g
```

### Inner Distance Maxima

```{r colored-by-inner-distance}
# Gene Expression PCA colored by Inner Distance
g <- ggplot(multiQC, aes(PC1, PC2, color = inner_distance_maxima), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() +
        labs(color = "Inner Distance Maxima", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) +
          scale_colour_gradientn(colours = viridis::viridis(100))

g
```

Row
-------------------------------------

### % Coding

```{r colored-by-coding}
# Gene Expression PCA colored by % Coding
g <- ggplot(multiQC, aes(PC1, PC2, color = pct_coding_bases), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() +
        labs(color = "% Coding", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) +
          scale_colour_gradientn(colours = viridis::viridis(100))

g
```   

### % rRNA

```{r colored-by-rrna}
# Gene Expression PCA colored by % rRNA
g <- ggplot(multiQC, aes(PC1, PC2, color = rRNA_percent_aligned), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() +
        labs(color = "% rRNA", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) +
          scale_colour_gradientn(colours = viridis::viridis(100))

g
```

### % Anti-sense

```{r colored-by-anti-sense}
# Gene Expression PCA colored by % Anti-sense
g <- ggplot(multiQC, aes(PC1, PC2, color = percent_antisense_strand), xlab) + geom_point(size = multiQC$TissueType) + theme_minimal() +
        labs(color = "% Anti-sense", x = paste0("PC1 (",pc1,"%)"), y = paste0("PC2 (",pc2,"%)")) +
          scale_colour_gradientn(colours = viridis::viridis(100))

g
```


Corr plots {data-orientation=columns data-icon="ion-stats-bars"}
=====================================

**Feature correlation plots**

Column {data-width=500}
-------------------------------------

### Heirarchical clustering of pairwise spearman correlation coefficients

```{r heirarchical-correlation-matrix, dpi=300}

# Helper Function
reorder_cormat <- function(cormat){
  # Use correlation between variables as distance for heirarchincal clustering
  dd <- as.dist((1-cormat)/2)
  hc <- hclust(dd)
  cormat <-cormat[hc$order, hc$order]
}

# Remove all columns that are categorical
numericQC = multiQC[,-which(sapply(multiQC, class) == "factor")]

# Additional columns to remove
additional_remove <- names(numericQC) %in% c("total_read_pairs", "mean_insert_size", "avg_aligned_read_length", "pct_mrna_bases",
                                             "pct_utr_bases", "pct_intergenic_bases", "median_5prime_to_3prime_bias", "median_5prime_bias",
                                             "median_3prime_bias", "percent_sense_strand", "PC1_tin", "PC2_tin", "PC3_tin")
# Cleaned numerical QC dataframe
numericQC = numericQC[!additional_remove]

# Remove zero-variance columns to prevent any hlclust() errors
#numericQC = numericQC[,-which(apply(numericQC, 2, var) == 0)]
numericQC = numericQC[, apply(numericQC, 2, var) != 0]

# Pair-wise spearman correlation matrix
cormatrix = round(cor(numericQC, method = "spearman"),2)

# Reorder the correlation matrix based on hierarchical clustering of the correlation coeff
cormat <- reorder_cormat(cormatrix)

# Get upper triangle of the correlation matrix
cormat[lower.tri(cormat)] <- NA

# Remove lower trinagle and reshape from wide to long format
cormat <- melt(cormat, na.rm = TRUE)

# Correlation ggheatmap
ggheatmap <- ggplot(cormat, aes(Var2, Var1, fill = value)) + geom_tile(color = "white") +
                scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), space = "Lab", name="Spearman\nCorrelation") +
                theme_minimal() + theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 9, hjust = 1)) + coord_fixed() +
                theme(axis.title.x = element_blank(),
                      axis.title.y = element_blank(),
                      panel.grid.major = element_blank(),
                      panel.border = element_blank(),
                      panel.background = element_blank(),
                      axis.ticks = element_blank(),
                      legend.justification = c(1, 0),
                      legend.position = c(0.6, 0.7),
                      legend.direction = "horizontal") +
                guides(fill = guide_colorbar(barwidth = 8, barheight = 1, title.position = "top", title.hjust = 0.5))

ggheatmap
```

Column {data-width=500}
-------------------------------------

### Complete linkage clustering of PC loadings with QC annotations

```{r loadings-heatmap, dpi=300, fig.height=6}

# Principal Components Analysis: 5 PCs as heatmap input
pca_exp = prcomp(t(as.matrix(deg_voom$E)), scale.=T)$x[,1:5] # Expression PC Analysis
hm_data <- as.matrix(t(pca_exp))                             # Input for heatmap

# Additional columns to remove
additional_remove <- names(numericQC) %in% c("PC1", "PC2", "PC3")

# Cleaned numerical QC dataframe with matched rownames
numericQC = numericQC[match(colnames(hm_data), rownames(numericQC), nomatch=0), !additional_remove]

column_annotations = HeatmapAnnotation(df = numericQC)

if (params$annot){
	cheatmap <- ComplexHeatmap::Heatmap(hm_data,
                    col=colorRamp2(seq(-max(abs(pca_exp), na.rm = T), max(abs(pca_exp), na.rm = T), length.out = 20),
                                   rev(colorRampPalette(brewer.pal(9, "PuOr"))(20))),
                    bottom_annotation = column_annotations,
                    show_column_names=T,
                    column_names_rot = 45,
                    show_heatmap_legend = F) # Turning off to control the placement
} else{
	cheatmap <- ComplexHeatmap::Heatmap(hm_data,
                    col=colorRamp2(seq(-max(abs(pca_exp), na.rm = T), max(abs(pca_exp), na.rm = T), length.out = 20),
                                   rev(colorRampPalette(brewer.pal(9, "PuOr"))(20))),
                    bottom_annotation = column_annotations,
                    show_column_names=F,
                    show_heatmap_legend = F) # Turning off to control the placement
}

draw(cheatmap, show_annotation_legend = FALSE)
```

Information {data-orientation=columns data-icon="fa-info-circle"}
=====================================

Column {data-width=600}
-------------------------------------

### Overview

**Quantification and quality-control pipeline**

The quality of each sample was independently assessed using FastQC, Preseq, Picard tools, RSeQC, SAMtools, and QualiMap. FastQ Screen and Kraken + Krona were used to screen for various sources of contamination.

Adapter sequences were removed using Cutadapt prior to mapping to hg38 reference genome. STAR was run in _two-pass_ mode where splice-junctions are collected and aggregated across all samples and provided to the second-pass of STAR. Gene expression levels were quantified using RSEM. The expected counts from RSEM are merged across samples to create a counts matrix for downstream analysis. RSeQC `tin.py` was used to calculate transcript integrity numbers for all canonical protein-coding transcripts.

**Downstream Analysis**  

The expected counts from RSEM were filtered to remove lowly expressed genes using edgeR's `filterByExpr()` function. The following critea were selected for filtering: genes must have 10 reads in >= 70% samples. After filtering, we are left with `r dim(deg)[1]` genes. Trimmed mean of M-values (TMM) was performed using the `calcNormFactors()` function in edgeR. The normalisation factors calculated here are used as a scaling factor for the library sizes. Using the `voom()` function in limma, the counts are converted log2-counts-per-million (logCPM) and quantile normalized.

_voom_ is an acronym for mean-variance modelling at the observational level. The key concern is to estimate the mean-variance relationship in the data, then use this to compute appropriate weights for each observation. Count data almost show non-trivial mean-variance relationships. Raw counts show increasing variance with increasing count size, while log-counts typically show a decreasing mean-variance trend. This function estimates the mean-variance trend for log-counts, then assigns a weight to each observation based on its predicted variance. The weights are then used in the linear modelling process to adjust for heteroscedasticity.


| Tool           | Version | Notes                                                                                                         |
|----------------|:-------:|---------------------------------------------------------------------------------------------------------------|
| FastQC²         |  0.11.5 | **Quality-control step** to assess sequencing quality, run before and after adapter trimming               |
| Cutadapt³       |   1.18  | **Data processing step** to remove adapter sequences and perform quality trimming                             |
| Kraken¹⁴         |   1.1   | **Quality-control step** to assess microbial taxonomic composition                                            |
| KronaTools¹⁵     |   2.7   | **Quality-control step** to visualize kraken output                                                           |
| FastQ Screen¹⁷   |  0.9.3  | **Quality-control step** to assess contamination; additional dependencies: `bowtie2/2.3.4`, `perl/5.24.3`     |
| STAR⁴           |  2.7.0f | **Data processing step** to align reads against reference genome (using its two-pass mode)                    |
| QualiMap¹⁶       |  2.2.1  | **Quality-control step** to assess various alignment metrics, also calculates insert_size                     |
| Picard¹⁰         | 2.17.11 | **Quality-control step** to run `MarkDuplicates`, `CollectRnaSeqMetrics` and `AddOrReplaceReadGroups`         |
| Preseq¹         |  2.0.3  | **Quality-control step** to estimate library complexity                                                       |
| SAMtools¹³       |   1.6   | **Quality-control step** to run `flagstat` to calculate alignment statistics                                  |
| RSeQC⁹          | 2.6.4   | **Quality-control step** to infer stranded-ness, TIN¹⁹, and read distributions over specific genomic features    |
| RSEM⁵           | 1.3.0   | **Data processing step** to quantify gene and isoform counts                                                  |
| MultiQC¹¹        | 1.4     | **Reporting step** to aggregate sample statistics and quality-control information across all sample           |


**General Recommendations**

Here is a set of generalized guidelines for different QC metrics. Some of these metrics will vary genome-to-genome depending on the quality of the assembly and annotation but that has been taken into consideration for our set of supported reference genomes.

| Metric                      |             Guideline           |
|-----------------------------|:-------------------------------:|
| *medTIN*                    |             > 65                |
| *Trimmed Reads*             |           > 10000000            |
| *% Aligned to Reference*    |             > 65%               |
| *% Duplicates*              |             < 65 %              |
| *% rRNA*                    |             < 10%               |
| *% Coding*                  |          Coding > 35%           |


**References**

^{**1.**	Daley, T. and A.D. Smith, Predicting the molecular complexity of sequencing libraries. Nat Methods, 2013. 10(4): p. 325-7.}  
^{**2.** Andrews, S. (2010). FastQC: a quality control tool for high throughput sequence data.}  
^{**3.**	Martin, M. (2011). "Cutadapt removes adapter sequences from high-throughput sequencing reads." EMBnet 17(1): 10-12.}  
^{**4.**	Dobin, A., et al., STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 2013. 29(1): p. 15-21.}  
^{**5.**	Li, B. and C.N. Dewey, RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 2011. 12: p. 323.}  
^{**6.**	Harrow, J., et al., GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res, 2012. 22(9): p. 1760-74.}  
^{**7.**	Law, C.W., et al., voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol, 2014. 15(2): p. R29.}  
^{**8.**	Smyth, G.K., Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol, 2004. 3: p. Article3.}  
^{**9.**    Wang, L., et al. (2012). "RSeQC: quality control of RNA-seq experiments." Bioinformatics 28(16): 2184-2185.}  
^{**10.**    The Picard toolkit. https://broadinstitute.github.io/picard/.}  
^{**11.**    Ewels, P., et al. (2016). "MultiQC: summarize analysis results for multiple tools and samples in a single report." Bioinformatics 32(19): 3047-3048.}  
^{**12.**    R Core Team (2018). R: A Language and Environment for Statistical Computing. Vienna, Austria, R Foundation for Statistical Computing.}  
^{**13.**    Li, H., et al. (2009). "The Sequence Alignment/Map format and SAMtools." Bioinformatics 25(16): 2078-2079.}  
^{**14.**    Wood, D. E. and S. L. Salzberg (2014). "Kraken: ultrafast metagenomic sequence classification using exact alignments." Genome Biol 15(3): R46.}  
^{**15.**    Ondov, B. D., et al. (2011). "Interactive metagenomic visualization in a Web browser." BMC Bioinformatics 12(1): 385.}  
^{**16.**    Okonechnikov, K., et al. (2015). "Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data." Bioinformatics 32(2): 292-294.}  
^{**17.**    Wingett, S. and S. Andrews (2018). "FastQ Screen: A tool for multi-genome mapping and quality control." F1000Research 7(2): 1338.}  
^{**18.**    Robinson, M. D., et al. (2009). "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics 26(1): 139-140.}  
^{**19.**    Wang, L., et al. (2016). "Measure transcript integrity using RNA-seq data." BMC Bioinformatics 17(1): 58.}


Column {data-width=400}
-------------------------------------

### Session Information

```{r session-info}
sessionInfo()
```