Output from the whole genome and exome pipelines

The output files and their locations are broken down here per pipeline type. All file locations are relative to the working directory specified for the Pipeliner run.

Initial QC

The Initial QC pipeline is the first step for both exome and genome workflows. It implements alignment and pre-processing according to best practices for GATK 3.6.

Here's a table of the most important output files from this pipeline.

| Pipeline | Output Type | Tool(s) | File Location |
|---|---|---|---|
| Initial QC | Original fastqs (symlinked) | -- | [sample].R[1,2].fastq.gz |
| Initial QC | Recalibrated (BQSR) BAMs | GATK 3.6 | [sample].recal.bam |
| Initial QC | QC Report | multiqc, qualimap, fastqc, fastq_screen | multiqc_report.html |
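To confirm that an Initial QC run finished, the file names in the table can be expanded per sample and checked against the working directory. The sketch below is only an illustration: the function names and the sample list are hypothetical, and it assumes that `[sample].R[1,2].fastq.gz` expands to one R1 and one R2 file per sample.

```python
from pathlib import Path


def expected_initial_qc_outputs(workdir, samples):
    """Build the Initial QC output paths from the table, relative to workdir.

    Assumption: [sample].R[1,2].fastq.gz means one R1 and one R2 file per sample.
    """
    root = Path(workdir)
    paths = []
    for sample in samples:
        paths.append(root / f"{sample}.R1.fastq.gz")
        paths.append(root / f"{sample}.R2.fastq.gz")
        paths.append(root / f"{sample}.recal.bam")
    # A single MultiQC report is listed for the whole run.
    paths.append(root / "multiqc_report.html")
    return paths


def missing_outputs(workdir, samples):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_initial_qc_outputs(workdir, samples) if not p.exists()]
```

Calling `missing_outputs("/path/to/workdir", ["sampleA", "sampleB"])` returns an empty list when every listed file is present.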

Germline

This pipeline is essentially the GATK Best Practices with a few alterations detailed below. Briefly, joint SNP and INDEL variant detection is conducted across all samples included in a pipeline run using the GATK HaplotypeCaller under default settings. This produces the 'combined.vcf' call file. This file is subsequently filtered at two levels of stringency based on several GATK annotations:

  1. A strict set of criteria (QD < 2.0, FS > 60.0, MQ < 40.0, MQRankSum < -12.5, ReadPosRankSum < -8.0 for SNPs; QD < 2.0, FS > 200.0, ReadPosRankSum < -20.0 for INDELs) generates the 'combined.strictFilter.vcf' file. This call set is highly stringent, minimizing false positives at the expense of an elevated false negative rate. It is intended only for general population-genetic-scale analyses (e.g., burden tests, admixture, linkage/pedigree-based analysis, etc.) where false positives can be significantly confounding.
  2. A relaxed set of criteria (QD < 2.0, FS > 60.0 for SNPs; QD < 2.0, FS > 200.0 for INDELs) generates the 'combined.relaxedFilter.vcf' file. This call set attempts to balance false positives against false negatives, and is generally suitable for all discovery applications. Unless you have strong justification otherwise, the 'combined.relaxedFilter.vcf' file should be used for all downstream analyses.
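The two filter levels above can be written out as threshold tables to make the difference between them concrete. This is a minimal sketch of the filtering logic only, not the pipeline's actual implementation (which applies the criteria through GATK). The names `STRICT`, `RELAXED`, and `fails_filter` are hypothetical, and treating a missing annotation as "does not trigger the filter" is an assumption.

```python
# Each criterion is (annotation, comparison, threshold); a variant is
# filtered out if ANY criterion for its type is met.
STRICT = {
    "SNP": [("QD", "<", 2.0), ("FS", ">", 60.0), ("MQ", "<", 40.0),
            ("MQRankSum", "<", -12.5), ("ReadPosRankSum", "<", -8.0)],
    "INDEL": [("QD", "<", 2.0), ("FS", ">", 200.0),
              ("ReadPosRankSum", "<", -20.0)],
}
RELAXED = {
    "SNP": [("QD", "<", 2.0), ("FS", ">", 60.0)],
    "INDEL": [("QD", "<", 2.0), ("FS", ">", 200.0)],
}


def fails_filter(annotations, variant_type, criteria):
    """Return True if the variant meets any filtering criterion."""
    for name, op, threshold in criteria[variant_type]:
        value = annotations.get(name)
        if value is None:
            # Assumption: a missing annotation does not trigger the filter.
            continue
        if op == "<" and value < threshold:
            return True
        if op == ">" and value > threshold:
            return True
    return False


# A SNP with low mapping quality is removed by the strict set only.
snp = {"QD": 12.0, "FS": 3.1, "MQ": 35.0}
strict_result = fails_filter(snp, "SNP", STRICT)
relaxed_result = fails_filter(snp, "SNP", RELAXED)
```

In this example the SNP fails the strict filter because MQ < 40.0, but passes the relaxed filter, which does not consider MQ.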

In addition, we provide structural variants called using Manta v1.2.0 and SvABA. We also provide copy number calling using Control-FREEC and Sequenza, as well as Canvas for WGS data. Finally, a basic analysis of sample relatedness and ancestry (e.g., % European, African, etc.) is performed and displayed as a network tree.

Pipeline Output Type Tool(s) File Location
Germline
SNVs
Strict Filter exome.strictFilter.vcf
Relaxed Filter exome.relaxedFilter.vcf
Admixture and PLINK
admixture_out/admixture_mqc.png
admixture_out/admixture_table.tsv
admixture_out/samples_and_knowns_filtered_recode*
Structural Variants
Manta manta_out/[pair]/results/variants/diploidSV.vcf.gz
SvABA svaba_out/*
Copy Number Variants
Canvas (WGS only) canvas_out/*

Tumor-Normal

This workflow calls somatic SNPs and INDELs using three variant detection algorithms. For each of these tools, variants are called in a paired tumor-normal fashion with default settings. For each sample, the resulting VCF is fully annotated using VEP v92 and converted to a MAF file using the vcf2maf tool. The resulting MAF files are found in the oncotator_out directory within each caller's results directory (e.g., mutect2_out/oncotator_out/NORMAL+TUMOR.maf). Individual sample MAF files are then merged within the oncotator_out directory for each caller (e.g., mutect2_out/oncotator_out/mutect2_merged.maf), and MutSigCV is run for each caller separately (e.g., mutect2_out/mutsigCV_out/). In addition, within each caller's output directory, an oncoplot for the top 30 non-silent mutated genes and a general MAF summary are generated.
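The path examples in the paragraph above suggest a regular layout, which can be captured in a pair of small helpers. This is a sketch under the assumption that every caller follows the same pattern as the mutect2 examples; the function names are hypothetical.

```python
from pathlib import PurePosixPath


def sample_maf_path(caller, normal, tumor):
    """Per-pair MAF, e.g. mutect2_out/oncotator_out/NORMAL+TUMOR.maf."""
    return PurePosixPath(f"{caller}_out") / "oncotator_out" / f"{normal}+{tumor}.maf"


def merged_maf_path(caller):
    """Merged MAF, e.g. mutect2_out/oncotator_out/mutect2_merged.maf."""
    return PurePosixPath(f"{caller}_out") / "oncotator_out" / f"{caller}_merged.maf"


# Reproduces the two example paths given in the text.
pair_maf = str(sample_maf_path("mutect2", "NORMAL", "TUMOR"))
merged_maf = str(merged_maf_path("mutect2"))
```

Both paths are relative to the working directory of the run.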

For Copy Number Variants (CNVs), two tools are employed in tandem. First, Control-FREEC is run with default parameters. This generates pileup files that can be used by Sequenza, primarily for jointly estimating contamination and ploidy. These values are then used to run Control-FREEC a second time for improved performance.

Sample pairing must be provided as shown in the Quick Start guide. Individual germline samples can be used multiple times (e.g., for multiple tumors from the same patient), as long as one of the two samples in the pair is unique. You cannot run the exact same pair in duplicate in the same run.
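The pairing rule can be checked before launching a run. The sketch below only illustrates the rule as stated (a normal may be reused, but an exact normal/tumor pair may not repeat). It assumes the pairs have already been read into a list of `(normal, tumor)` tuples; the actual pairing file format is described in the Quick Start guide, and `duplicate_pairs` is a hypothetical helper name.

```python
from collections import Counter


def duplicate_pairs(pairs):
    """Return each (normal, tumor) pair that appears more than once."""
    counts = Counter(tuple(pair) for pair in pairs)
    return [pair for pair, n in counts.items() if n > 1]


# N1 is reused for two different tumors (allowed),
# but the pair (N1, T1) is listed twice (not allowed).
pairs = [("N1", "T1"), ("N1", "T2"), ("N1", "T1")]
repeated = duplicate_pairs(pairs)
```

Here `repeated` contains only `("N1", "T1")`; an empty result means the pairing satisfies the rule.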

For Mutect2, we use a panel of normals (PON) developed from the ExAC (excluding TCGA) dataset, filtered for variants with a frequency below 0.001 in the general population, and also including an in-house set of blacklisted recurrent germline variants that are not found in any population databases.

Finally, germline analysis is also performed (see above for output details) with the Tumor-Normal pipeline.

Pipeline Output Type Tool(s) File Location
Somatic Tumor-Normal
SNVs
mutect VCF mutect_out/[pair].FINAL.vcf
Mutect MAF and Summaries
mutect_out/oncotator_out/final_filtered.maf
mutect_out/oncotator_out/variants_fixed.maf
mutect_out/oncotator_out/tcga_comparison.pdf
mutect_out/oncotator_out/genes_by_VAF.pdf
Mutect2 MAF and Summaries
mutect2_out/oncotator_out/final_filtered.maf
mutect2_out/oncotator_out/variants_fixed.maf
mutect2_out/oncotator_out/tcga_comparison.pdf
mutect2_out/oncotator_out/genes_by_VAF.pdf
VarDict MAF and Summaries
vardict_out/oncotator_out/final_filtered.maf
vardict_out/oncotator_out/variants_fixed.maf
vardict_out/oncotator_out/tcga_comparison.pdf
vardict_out/oncotator_out/genes_by_VAF.pdf
Strelka MAF and Summaries
strelka_out/oncotator_out/final_filtered.maf
strelka_out/oncotator_out/variants_fixed.maf
strelka_out/oncotator_out/tcga_comparison.pdf
strelka_out/oncotator_out/genes_by_VAF.pdf
Merged Somatic Variants
merged_somatic_variants/oncotator_out/final_filtered.maf
merged_somatic_variants/oncotator_out/variants_fixed.maf
merged_somatic_variants/oncotator_out/tcga_comparison.pdf
merged_somatic_variants/oncotator_out/genes_by_VAF.pdf
Structural Variants
Manta manta_out/[pair]/results/variants/diploidSV.vcf.gz
Copy Number Variants
Control-FREEC (Pass 1) freec_out/pass1/[pair].recal.bam_CNVs.p.value.txt
Control-FREEC (Pass 2) freec_out/pass2/[pair].recal.bam_CNVs.p.value.txt
Sequenza sequenza_out/[tumor-sample]/*

Tumor-Only

In general, the tumor-only pipeline is a stripped-down version of the tumor-normal pipeline. We only run Mutect2, Mutect, and VarDict for somatic variant detection, with the same PON and filtering as described above for the tumor-normal pipeline.


Last update: 2022-11-04