Resources
Reference genomes
Warning: This section contains FTP links for downloading each reference file.
The quality-control and differential expression pipelines support the following genomes:
GenomeName | Species | Annotation Version | Comments |
---|---|---|---|
hg19 | Homo sapiens (human) | Gencode Release 19 | GRCh37, Release date: 07/2013 |
hg38 | Homo sapiens (human) | Gencode Release 28 | GRCh38, Annotation Release date: 11/2017 |
hg38_30 | Homo sapiens (human) | Gencode Release 30 | GRCh38, Annotation Release date: 11/2018 |
hs37d5 | Homo sapiens (human) | Gencode Release 19 | hg19 + decoy sequences |
hs38d1 | Homo sapiens (human) | Gencode Release 28 | hg38 + decoy sequences |
hg38_30_KSHV | Homo sapiens + KSHV | Gencode Release 30 (hg38) + 06/2019 (KSHV) | hg38 + NC_009333.1. Annotation Release dates: 11/2018 (human) + 06/2019 (KSHV) |
hg38_HPV16 | Homo sapiens + HPV16 | Gencode Release 28 (hg38) + 03/2019 (HPV16 custom annotation from Zheng lab) | hg38 + HPV16 custom sequence based on KU298885.1, with custom annotation |
mm9 | Mus musculus (house mouse) | M1 | NCBIM37, Annotation Release date: 12/2011 |
mm10 | Mus musculus (house mouse) | M18 | GRCm38, Annotation Release date: 07/2018 |
mm10_M21 | Mus musculus (house mouse) | M21 | GRCm38, Annotation Release date: 04/2019 |
canFam3 | Canis lupus familiaris (dog) | Ensembl Release 94 | CanFam3.1 |
Mmul_8.0.1 | Macaca mulatta (Rhesus monkey or macaque) | Ensembl Release 97 | Mmul_8.0.1 (rheMac8) |
Please note: If you are looking for a reference genome and/or annotation that is not currently available, it can be generated using Pipeliner Index Maker (PIM). Given a reference FASTA file (ref.fa) and a GTF file (genes.gtf), PIM will create all of the reference files required to run the RNA-seq pipeline on Biowulf.
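Before building a custom reference, it can help to confirm that the FASTA and GTF files use the same sequence names (for example, chr1 versus 1). The following is a minimal sketch of such a check, not part of PIM itself; the function names and the tiny in-memory inputs are hypothetical and only illustrate the idea.

```python
import io


def fasta_sequence_names(handle):
    """Return the set of sequence names (first word after '>') in a FASTA stream."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_sequence_names(handle):
    """Return the set of sequence names (column 1) used in a GTF stream."""
    names = set()
    for line in handle:
        # Skip comment/header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def missing_from_fasta(fasta_handle, gtf_handle):
    """List GTF sequence names that have no matching FASTA record."""
    return sorted(gtf_sequence_names(gtf_handle) - fasta_sequence_names(fasta_handle))


# Made-up example inputs (hypothetical, for illustration only).
fasta_text = ">chr1 example\nACGT\n>chr2\nGGCC\n"
gtf_text = (
    "#!example header\n"
    'chr1\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\n'
    'chrM\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g2";\n'
)

missing = missing_from_fasta(io.StringIO(fasta_text), io.StringIO(gtf_text))
print(missing)
```

With real files, the two `io.StringIO` objects would be replaced by open file handles for ref.fa and genes.gtf. An empty result suggests the naming is consistent; any names listed are annotated in the GTF but absent from the FASTA.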
Tools and versions
Quality-control pipeline
Raw data > Adapter Trimming > Alignment > Quantification (genes and isoforms)
Tool | Version | Notes |
---|---|---|
FastQC [2] | 0.11.5 | Quality-control step to assess sequencing quality, run before and after adapter trimming |
Cutadapt [3] | 1.18 | Data processing step to remove adapter sequences and perform quality trimming |
Kraken [18] | 1.1 | Quality-control step to assess microbial taxonomic composition |
KronaTools [19] | 2.7 | Quality-control step to visualize Kraken output |
FastQ Screen [21] | 0.9.3 | Quality-control step to assess contamination; additional dependencies: bowtie2/2.3.4, perl/5.24.3 |
STAR [4] | 2.7.0f | Data processing step to align reads against the reference genome (using its two-pass mode) |
QualiMap [20] | 2.2.1 | Quality-control step to assess various alignment metrics, also calculates insert_size |
Picard [12] | 2.17.11 | Quality-control step to run MarkDuplicates, CollectRnaSeqMetrics and AddOrReplaceReadGroups |
Preseq [1] | 2.0.3 | Quality-control step to estimate library complexity |
SAMtools [17] | 1.6 | Quality-control step to run flagstat to calculate alignment statistics |
bam2strandedbw | custom | Summarization step to convert a STAR-aligned paired-end BAM file into forward and reverse strand bigWigs suitable for a genomic track viewer like IGV |
RSeQC [11] | 2.6.4 | Quality-control step to infer strandedness and read distributions over specific genomic features |
RSEM [5] | 1.3.0 | Data processing step to quantify gene and isoform counts |
Subread [14] | 1.5.2 | Data processing step to run featureCounts, an alternative quantification method to RSEM |
PCA Report [16] | custom | Summarization step to identify outliers prior to DE, contains pre- and post-normalization plots |
MultiQC [15] | 1.4 | Reporting step to aggregate sample statistics and quality-control information across all samples |
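As an illustration of the kind of alignment statistic the flagstat step reports, the sketch below parses flagstat-style text and computes the fraction of mapped records. It is a hedged example rather than the pipeline's own code: the function name is hypothetical, the input text is made up, and the parser assumes the plain-text "N + M label" layout that SAMtools flagstat prints by default.

```python
import re

# Each flagstat line looks like "<QC-passed> + <QC-failed> <label>".
FLAGSTAT_LINE = re.compile(r"^(\d+) \+ (\d+) (.+)$")


def mapped_fraction(flagstat_text):
    """Return mapped / total for QC-passed records, or None if either is missing."""
    total = None
    mapped = None
    for line in flagstat_text.splitlines():
        match = FLAGSTAT_LINE.match(line.strip())
        if not match:
            continue
        passed = int(match.group(1))
        label = match.group(3)
        if label.startswith("in total"):
            total = passed
        elif label.startswith("mapped ("):
            mapped = passed
    if not total or mapped is None:
        return None
    return mapped / total


# Made-up flagstat output (hypothetical numbers, for illustration only).
example = """\
1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
50 + 0 duplicates
950 + 0 mapped (95.00% : N/A)
"""

fraction = mapped_fraction(example)
print(fraction)
```

Note that the totals reported by flagstat count alignment records, so secondary and supplementary alignments, when present, are included in both the numerator and the denominator.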
Differential expression pipeline
Raw counts matrix > Normalization > Differential Expression Analysis > Functional Impact
Tool | Version | Notes |
---|---|---|
filtersamples [16] | custom | Data processing step to remove low-CPM genes prior to differential expression analysis |
PCAReport [16] | custom | Summarization step to identify outliers prior to DE, contains pre- and post-normalization plots |
EBSeq [22] | 1.2.0 | Data processing step to find differentially expressed isoforms; additional dependencies: rsem/1.3.0 |
edgeR [23] | 3.24.3 | Data processing step to find differentially expressed genes. Counts are modeled using a negative binomial distribution with mean equal to the product of library size and relative abundance, and a quasi-likelihood F-test is used to test for differential expression |
DESeq2 [13] | 1.22.2 | Data processing step to find differentially expressed genes. Counts are modeled using a negative binomial distribution similar to edgeR, and a Wald test is used to test for differential expression |
limma [7, 8] | 3.38.3 | Data processing step to find differentially expressed genes. Log-transformed counts are modeled using linear models, and a moderated t-statistic is used to test for differential expression |
l2p [16] | custom | Summarization step for gene set enrichment analysis |
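The low-CPM filtering step can be sketched as follows. Counts per million (CPM) divide each raw count by the sample's library size and scale by one million; a gene is kept only if enough samples reach a minimum CPM. This is a minimal sketch under stated assumptions, not the pipeline's filtersamples script: the function names, the `min_cpm` and `min_samples` thresholds, and the toy counts are all hypothetical.

```python
def counts_per_million(counts):
    """Convert raw counts (gene -> list of per-sample counts) to CPM values."""
    n_samples = len(next(iter(counts.values())))
    # Library size of a sample = sum of its counts over all genes.
    library_sizes = [sum(row[i] for row in counts.values()) for i in range(n_samples)]
    return {
        gene: [1e6 * c / size if size else 0.0 for c, size in zip(row, library_sizes)]
        for gene, row in counts.items()
    }


def filter_low_cpm(counts, min_cpm=1.0, min_samples=2):
    """Keep genes whose CPM is at least min_cpm in at least min_samples samples."""
    cpm = counts_per_million(counts)
    return {
        gene: row
        for gene, row in counts.items()
        if sum(value >= min_cpm for value in cpm[gene]) >= min_samples
    }


# Toy raw counts for three samples (hypothetical, for illustration only).
raw = {
    "geneA": [500, 450, 520],
    "geneB": [0, 1, 0],
    "geneC": [999500, 999549, 999480],
}

kept = filter_low_cpm(raw, min_cpm=1.0, min_samples=2)
print(sorted(kept))
```

In this toy matrix every library size is exactly one million, so CPM equals the raw count; geneB reaches the threshold in only one sample and is dropped. Appropriate thresholds depend on sequencing depth and group sizes, so the defaults here should not be read as the pipeline's settings.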
References
1. Daley, T. and A. D. Smith (2013). "Predicting the molecular complexity of sequencing libraries." Nat Methods 10(4): 325-327.
2. Andrews, S. (2010). FastQC: a quality control tool for high throughput sequence data.
3. Martin, M. (2011). "Cutadapt removes adapter sequences from high-throughput sequencing reads." EMBnet 17(1): 10-12.
4. Dobin, A., et al. (2013). "STAR: ultrafast universal RNA-seq aligner." Bioinformatics 29(1): 15-21.
5. Li, B. and C. N. Dewey (2011). "RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome." BMC Bioinformatics 12: 323.
6. Harrow, J., et al. (2012). "GENCODE: the reference human genome annotation for The ENCODE Project." Genome Res 22(9): 1760-1774.
7. Law, C. W., et al. (2014). "voom: Precision weights unlock linear model analysis tools for RNA-seq read counts." Genome Biol 15(2): R29.
8. Smyth, G. K. (2004). "Linear models and empirical Bayes methods for assessing differential expression in microarray experiments." Stat Appl Genet Mol Biol 3: Article 3.
9. Fabregat, A., et al. (2018). "The Reactome Pathway Knowledgebase." Nucleic Acids Res 46(D1): D649-D655.
10. Liberzon, A., et al. (2011). "Molecular signatures database (MSigDB) 3.0." Bioinformatics 27(12): 1739-1740.
11. Wang, L., et al. (2012). "RSeQC: quality control of RNA-seq experiments." Bioinformatics 28(16): 2184-2185.
12. The Picard toolkit. https://broadinstitute.github.io/picard/.
13. Love, M. I., et al. (2014). "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2." Genome Biol 15(12): 550.
14. Liao, Y., et al. (2013). "The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote." Nucleic Acids Research 41(10): e108-e108.
15. Ewels, P., et al. (2016). "MultiQC: summarize analysis results for multiple tools and samples in a single report." Bioinformatics 32(19): 3047-3048.
16. R Core Team (2018). R: A Language and Environment for Statistical Computing. Vienna, Austria, R Foundation for Statistical Computing.
17. Li, H., et al. (2009). "The Sequence Alignment/Map format and SAMtools." Bioinformatics 25(16): 2078-2079.
18. Wood, D. E. and S. L. Salzberg (2014). "Kraken: ultrafast metagenomic sequence classification using exact alignments." Genome Biol 15(3): R46.
19. Ondov, B. D., et al. (2011). "Interactive metagenomic visualization in a Web browser." BMC Bioinformatics 12(1): 385.
20. Okonechnikov, K., et al. (2015). "Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data." Bioinformatics 32(2): 292-294.
21. Wingett, S. and S. Andrews (2018). "FastQ Screen: A tool for multi-genome mapping and quality control." F1000Research 7(2): 1338.
22. Leng, N., et al. (2013). "EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments." Bioinformatics 29(8): 1035-1043.
23. Robinson, M. D., et al. (2009). "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics 26(1): 139-140.