Output from the whole genome and exome pipelines
The output files and their locations are broken down here per pipeline type. All file locations are relative to the working directory specified for the Pipeliner run.
Initial QC
The Initial QC pipeline is the first step for both exome and genome workflows. It implements alignment and pre-processing according to best practices for GATK 3.6.
Here's a table of the most important output files from this pipeline.
| Pipeline | Output Type | Tool(s) | File Location |
|---|---|---|---|
| Initial QC | | | |
| | Original fastqs (symlinked) | -- | [sample].R[1,2].fastq.gz |
| | Recalibrated (BQSR) BAMs | GATK 3.6 | [sample].recal.bam |
| | QC Report | multiqc, qualimap, fastqc, fastq_screen | multiqc_report.html |
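As a quick sanity check after a run, the table above can be turned into a list of expected files. The following is a minimal sketch, assuming the listed files sit directly under the working directory and that `[sample].R[1,2]` expands to `R1` and `R2`; the function names are hypothetical and not part of Pipeliner.

```python
from pathlib import Path


def expected_initial_qc_outputs(samples):
    """List the Initial QC files from the table, relative to the working directory."""
    files = ["multiqc_report.html"]
    for sample in samples:
        files += [
            f"{sample}.R1.fastq.gz",  # symlinked original fastq, read 1
            f"{sample}.R2.fastq.gz",  # symlinked original fastq, read 2
            f"{sample}.recal.bam",    # recalibrated (BQSR) BAM
        ]
    return files


def missing_outputs(workdir, samples):
    """Return the expected files that are absent (broken symlinks count as absent)."""
    root = Path(workdir)
    return sorted(
        name for name in expected_initial_qc_outputs(samples)
        if not (root / name).exists()
    )
```

Calling `missing_outputs("/path/to/workdir", ["sampleA", "sampleB"])` returns an empty list when every file in the table is present for both samples.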
Germline
This pipeline essentially follows the GATK Best Practices, with a few alterations detailed below. Briefly, joint SNP and INDEL variant detection is conducted across all samples included in a pipeline run using the GATK HaplotypeCaller under default settings. This produces the 'combined.vcf' call file, which is subsequently filtered at two levels of stringency based on several GATK annotations:
- A strict set of criteria (QD < 2.0, FS > 60.0, MQ < 40.0, MQRankSum < -12.5, ReadPosRankSum < -8.0 for SNPs; QD < 2.0, FS > 200.0, ReadPosRankSum < -20.0 for INDELs) generates the 'combined.strictFilter.vcf' file. This call set is highly stringent, minimizing the false positive rate at the expense of an elevated false negative rate. It is intended only for population-genetic-scale analyses (e.g., burden tests, admixture, linkage/pedigree-based analysis) where false positives can be significantly confounding.
- A relaxed set of criteria (QD < 2.0, FS > 60.0 for SNPs; QD < 2.0, FS > 200.0 for INDELs) generates the 'combined.relaxedFilter.vcf' file. This call set attempts to balance false positives against false negatives, and is generally suitable for all discovery applications. Unless you have strong justification otherwise, the 'combined.relaxedFilter.vcf' file should be used for all downstream analyses.
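To make the two filter levels concrete, here is a minimal sketch that encodes the thresholds listed above and reports which ones a single variant trips. It assumes each expression describes a failing condition (the usual hard-filter reading) and that a missing annotation does not trigger a filter; both are assumptions, and the pipeline itself applies these filters with GATK, not with this code. All names are illustrative.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt}

# Thresholds copied from the two bullet points above.
STRICT = {
    "SNP": [("QD", "<", 2.0), ("FS", ">", 60.0), ("MQ", "<", 40.0),
            ("MQRankSum", "<", -12.5), ("ReadPosRankSum", "<", -8.0)],
    "INDEL": [("QD", "<", 2.0), ("FS", ">", 200.0),
              ("ReadPosRankSum", "<", -20.0)],
}
RELAXED = {
    "SNP": [("QD", "<", 2.0), ("FS", ">", 60.0)],
    "INDEL": [("QD", "<", 2.0), ("FS", ">", 200.0)],
}


def failed_criteria(annotations, variant_type, criteria):
    """Return the filter expressions that this variant's annotations trip."""
    failed = []
    for name, op, threshold in criteria[variant_type]:
        value = annotations.get(name)
        # Assumption: an absent annotation never fails a criterion.
        if value is not None and OPS[op](value, threshold):
            failed.append(f"{name} {op} {threshold}")
    return failed


snp = {"QD": 5.0, "FS": 10.0, "MQ": 35.0, "MQRankSum": 0.1, "ReadPosRankSum": 0.5}
strict_hits = failed_criteria(snp, "SNP", STRICT)    # trips only the MQ criterion
relaxed_hits = failed_criteria(snp, "SNP", RELAXED)  # trips nothing
```

The example SNP illustrates the difference between the two call sets: its low mapping quality removes it under the strict criteria, while the relaxed criteria retain it.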
In addition, we provide structural variants called using Manta v1.2.0 and SvABA. We also provide copy number calling using Control-FREEC and Sequenza, as well as Canvas for WGS data. Finally, a basic analysis of sample relatedness and ancestry (e.g., % European, African, etc.) is performed and displayed as a network tree.
| Pipeline | Output Type | Tool(s) | File Location |
|---|---|---|---|
| Germline | | | |
| | SNVs | | |
| | | Strict Filter | exome.strictFilter.vcf |
| | | Relaxed Filter | exome.relaxedFilter.vcf |
| | Admixture and PLINK | | |
| | | | admixture_out/admixture_mqc.png |
| | | | admixture_out/admixture_table.tsv |
| | | | admixture_out/samples_and_knowns_filtered_recode* |
| | Structural Variants | | |
| | | Manta | manta_out/[pair]/results/variants/diploidSV.vcf.gz |
| | | SvABA | svaba_out/* |
| | Copy Number Variants | | |
| | | Canvas (WGS only) | canvas_out/* |
Tumor-Normal
This workflow calls somatic SNPs and INDELs using three variant detection algorithms. For each of these tools, variants are called in a paired tumor-normal fashion with default settings. For each sample, the resulting VCF is fully annotated using VEP v92 and converted to a MAF file using the vcf2maf tool. The resulting MAF files are found in the oncotator_out directory within each caller's results directory (e.g., mutect2_out/oncotator_out/NORMAL+TUMOR.maf). Individual sample MAF files are then merged within the oncotator_out directory for each caller (e.g., mutect2_out/oncotator_out/mutect2_merged.maf), and MutSigCV is run for each caller separately (e.g., mutect2_out/mutsigCV_out/). In addition, within each caller's output directory, an oncoplot for the top 30 non-silent mutated genes and a general MAF summary are generated.
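For readers who want to inspect a merged MAF directly, the following minimal sketch ranks genes by the number of samples carrying a non-silent mutation. It relies on the standard MAF columns `Hugo_Symbol`, `Variant_Classification`, and `Tumor_Sample_Barcode`. The set of classifications treated as non-silent and the choice to count samples instead of individual mutations are assumptions; the pipeline's own oncoplot may define these differently.

```python
import csv
from collections import defaultdict

# Assumption: classifications counted as non-silent for this illustration.
NON_SILENT = {
    "Frame_Shift_Del", "Frame_Shift_Ins", "In_Frame_Del", "In_Frame_Ins",
    "Missense_Mutation", "Nonsense_Mutation", "Nonstop_Mutation",
    "Splice_Site", "Translation_Start_Site",
}


def top_mutated_genes(maf_lines, n=30):
    """Rank genes by the number of distinct samples with a non-silent mutation."""
    # Skip leading comment lines such as "#version 2.4".
    data_lines = (line for line in maf_lines if not line.startswith("#"))
    samples_by_gene = defaultdict(set)
    for row in csv.DictReader(data_lines, delimiter="\t"):
        if row["Variant_Classification"] in NON_SILENT:
            samples_by_gene[row["Hugo_Symbol"]].add(row["Tumor_Sample_Barcode"])
    # Sort by descending sample count, then gene name for a stable order.
    ranked = sorted(samples_by_gene.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(gene, len(samples)) for gene, samples in ranked[:n]]
```

A typical call would pass an open file handle, for example `with open("mutect2_out/oncotator_out/mutect2_merged.maf") as fh: top_mutated_genes(fh)`.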
For Copy Number Variants (CNVs), two tools are employed in tandem. First, Control-FREEC is run with default parameters. This generates pileup files that can be used by Sequenza, primarily for jointly estimating contamination and ploidy. These values are then used to run Control-FREEC a second time for improved performance.
Sample pairing must be provided as shown in the Quick Start guide. Individual germline samples can be used multiple times (e.g., for multiple tumors from the same patient), as long as one of the two samples in the pair is unique. You cannot run the exact same pair twice in a single run.
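The pairing rule can be checked before launching a run. This is a minimal sketch, assuming the pairs have already been read into a list of `(normal, tumor)` tuples; the actual pairing file format is described in the Quick Start guide and is not parsed here, and the function name is hypothetical.

```python
from collections import Counter


def duplicate_pairs(pairs):
    """Return every (normal, tumor) pair that appears more than once."""
    counts = Counter(pairs)
    return sorted(pair for pair, count in counts.items() if count > 1)


pairs = [
    ("normal1", "tumorA"),
    ("normal1", "tumorB"),  # reusing a normal with a different tumor is allowed
    ("normal1", "tumorA"),  # the exact same pair again is not allowed
]
repeated = duplicate_pairs(pairs)
```

Here `repeated` contains only `("normal1", "tumorA")`, the pair that would need to be removed before the run.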
For Mutect2, we use a panel of normals (PON) developed from the ExAC (excluding TCGA) dataset, filtered for variants with a frequency below 0.001 in the general population, and also including an in-house set of blacklisted recurrent germline variants that are not found in any population databases.
Finally, germline analysis is also performed (see above for output details) with the Tumor-Normal pipeline.
| Pipeline | Output Type | Tool(s) | File Location |
|---|---|---|---|
| Somatic Tumor-Normal | | | |
| | SNVs | | |
| | | Mutect VCF | mutect_out/[pair].FINAL.vcf |
| | | Mutect MAF and Summaries | |
| | | | mutect_out/oncotator_out/final_filtered.maf |
| | | | mutect_out/oncotator_out/variants_fixed.maf |
| | | | mutect_out/oncotator_out/tcga_comparison.pdf |
| | | | mutect_out/oncotator_out/genes_by_VAF.pdf |
| | | Mutect2 MAF and Summaries | |
| | | | mutect2_out/oncotator_out/final_filtered.maf |
| | | | mutect2_out/oncotator_out/variants_fixed.maf |
| | | | mutect2_out/oncotator_out/tcga_comparison.pdf |
| | | | mutect2_out/oncotator_out/genes_by_VAF.pdf |
| | | VarDict MAF and Summaries | |
| | | | vardict_out/oncotator_out/final_filtered.maf |
| | | | vardict_out/oncotator_out/variants_fixed.maf |
| | | | vardict_out/oncotator_out/tcga_comparison.pdf |
| | | | vardict_out/oncotator_out/genes_by_VAF.pdf |
| | | Strelka MAF and Summaries | |
| | | | strelka_out/oncotator_out/final_filtered.maf |
| | | | strelka_out/oncotator_out/variants_fixed.maf |
| | | | strelka_out/oncotator_out/tcga_comparison.pdf |
| | | | strelka_out/oncotator_out/genes_by_VAF.pdf |
| | | Merged Somatic Variants | |
| | | | merged_somatic_variants/oncotator_out/final_filtered.maf |
| | | | merged_somatic_variants/oncotator_out/variants_fixed.maf |
| | | | merged_somatic_variants/oncotator_out/tcga_comparison.pdf |
| | | | merged_somatic_variants/oncotator_out/genes_by_VAF.pdf |
| | Structural Variants | | |
| | | Manta | manta_out/[pair]/results/variants/diploidSV.vcf.gz |
| | Copy Number Variants | | |
| | | Control-FREEC (Pass 1) | freec_out/pass1/[pair].recal.bam_CNVs.p.value.txt |
| | | Control-FREEC (Pass 2) | freec_out/pass2/[pair].recal.bam_CNVs.p.value.txt |
| | | Sequenza | sequenza_out/[tumor-sample]/* |
Tumor-Only
In general, the tumor-only pipeline is a stripped-down version of the tumor-normal pipeline. We run only Mutect2, Mutect, and VarDict for somatic variant detection, with the same PON and filtering as described above for the tumor-normal pipeline.