ASPEN Outputs
Workdir
The workdir, which is supplied as -w while running the aspen init, dryrun, and run commands, will contain the following files:
WORKDIR
├── cluster.json
├── config.yaml
├── contrasts.tsv
├── dryrun_git_commit.txt
├── dryrun.log
├── fastqs
├── logs
├── results
├── runinfo.yaml
├── sampleinfo.txt
├── samples.tsv
├── scripts
├── slurm-48768339.out
├── snakemake.log
├── snakemake.stats
├── stats
├── submit_script.sbatch
└── tools.yaml
Here are more details about these files:
File | File Type | Mode (-m) When This File Is Created/Overwritten | Description |
---|---|---|---|
cluster.json | JSON | init | Defines cluster resources per snakemake rule |
config.yaml | YAML | init; can be edited later | Configurable parameters for this specific run |
contrasts.tsv | TSV | Must be added manually after init | List of contrasts to run, one per line; has no header |
dryrun_git_commit.txt | TXT | dryrun | The git commit hash of the version of ASPEN used at dryrun |
dryrun.log | TXT | dryrun | Log from -m=dryrun |
fastqs | FOLDER | dryrun | Folder containing symlinks to raw data |
logs | FOLDER | dryrun | Folder containing all logs including Slurm .out and .err files |
results | FOLDER | Created at dryrun but populated during run | Main outputs folder |
runinfo.yaml | YAML | After completion of run | Metadata about the run executor, etc. |
sampleinfo.txt | TXT | dryrun, run | Tab-delimited mappings between replicateNames and sampleNames |
samples.tsv | TSV | init; can be edited later | Tab-delimited manifest with the columns replicateName, sampleName, path_to_R1_fastq, and path_to_R2_fastq. This file has a header. |
scripts | FOLDER | init | Folder keeps local copy of scripts called by various rules |
slurm-<jobid>.out | TXT | run | Slurm .out file for the master job; the numeric job ID differs per run (e.g., slurm-48768339.out) |
snakemake.log | TXT | run | Snakemake .log file for the master job; older copies timestamped and moved into logs folder |
stats | FOLDER | Created at dryrun but populated during run | Contains older timestamped runinfo.yaml files |
submit_script.sbatch | TXT | run | Slurm script to kickstart the main Snakemake job |
tools.yaml | YAML | run | YAML containing the version of tools used in the pipeline (obsolete; was used to load specific module versions prior to moving over to Docker/Singularity containers) |
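As an illustration of the samples.tsv layout described in the table, here is a minimal Python sketch that groups replicates by sample. It assumes only the four header columns listed above; the function name replicates_by_sample is hypothetical and is not part of ASPEN.

```python
import csv

# Column names as listed for samples.tsv in the table above.
EXPECTED_COLUMNS = ["replicateName", "sampleName", "path_to_R1_fastq", "path_to_R2_fastq"]


def replicates_by_sample(manifest_path):
    """Return {sampleName: [replicateName, ...]} from a samples.tsv manifest.

    Hypothetical helper for illustration; ASPEN itself does not ship it.
    """
    groups = {}
    with open(manifest_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = reader.fieldnames or []
        missing = [name for name in EXPECTED_COLUMNS if name not in header]
        if missing:
            raise ValueError(f"samples.tsv is missing columns: {missing}")
        for row in reader:
            groups.setdefault(row["sampleName"], []).append(row["replicateName"])
    return groups
```

Calling replicates_by_sample("WORKDIR/samples.tsv") before a dryrun is one way to confirm that every sample has the replicates you expect.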
Resultsdir
The results directory contains the actual output files. Below are the folders that you may find within it.
WORKDIR
├── results
    ├── dedupBam
    ├── peaks
    ├── QC
    ├── qsortedBam
    ├── tagAlign
    └── tmp
Content details:
Folder | Description |
---|---|
dedupBam | Deduplicated filtered BAM files; can be used for visualization. |
peaks | Genrich/MACS2 peak calls (raw, consensus, fixed-width); also contains ROI files with Diff-ATAC results if contrasts.tsv is provided; motif enrichments using HOMER and AME; bigwigs for visualization. |
QC | Flagstats; dupmetrics; read counts; motif enrichments; FLD stats; Fqscreen; FRiP; ChIPseeker results; TSS enrichments; Preseq; MultiQC. |
qsortedBam | Query name sorted BAM files; used for Genrich peak calling (includes multimappers). |
tagAlign | tagAlign.gz files; deduplicated; used for MACS2 peak calling. |
tmp | Can be deleted; blacklist index; intermediate FASTQs; Genrich output reads. |
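A quick way to see which of these folders a given run actually produced is sketched below. The folder names come from the table above; the helper name missing_result_folders is hypothetical. Since these are folders you may find, a missing entry is not necessarily an error.

```python
from pathlib import Path

# Subfolder names taken from the table above.
EXPECTED_SUBFOLDERS = ["dedupBam", "peaks", "QC", "qsortedBam", "tagAlign", "tmp"]


def missing_result_folders(workdir):
    """List expected subfolders that are absent from WORKDIR/results."""
    results = Path(workdir) / "results"
    return [name for name in EXPECTED_SUBFOLDERS if not (results / name).is_dir()]
```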
The QC folder contains the multiqc_report.html file, which summarizes the quality control metrics across all samples, including read quality, duplication rates, and other relevant statistics. This report aggregates results from QC tools such as FastQC, FastQ Screen, FLD, TSS enrichment, peak annotations, and others, and presents them as interactive plots and tables. It helps you quickly spot problems with the sequencing data and confirm that the data quality is sufficient for downstream analysis.
Note
BAM files from dedupBam can be used for downstream footprinting analysis using the CCBR_TOBIAS pipeline.
Note
bamCompare from deepTools can be run on the BAMs in dedupBam for comprehensive BAM comparisons.
Note
BAM files from dedupBam can also be converted to BED format and processed with chromVAR to identify variability in motif accessibility across samples and to assess differentially active transcription factors from the JASPAR database.
Most of the above folders are self-explanatory. The peaks folder has this hierarchy:
WORKDIR
├── results
    ├── peaks
        ├── genrich
        │   ├── <replicateName>.genrich.narrowPeak
        │   ├── <replicateName>.genrich.narrowPeak.annotated
        │   ├── <replicateName>.genrich.narrowPeak.genelist
        │   ├── <replicateName>.genrich.narrowPeak.annotation_summary
        │   ├── <replicateName>.genrich.narrowPeak.annotation_distribution
        │   ├── <sampleName>.genrich.pooled.narrowPeak
        │   ├── <sampleName>.genrich.consensus.bed
        │   ├── ROI.counts.tsv
        │   ├── bigwig
        │   ├── <replicateName>.genrich.narrowPeak_motif_enrichment
        │   │   └── knownResults
        │   ├── DiffATAC
        │   ├── <replicateName>.genrich.consensus.bed_motif_enrichment
        │   └── tn5nicks
        └── macs2
            ├── <replicateName>.macs2.narrowPeak
            ├── <replicateName>.macs2.narrowPeak.annotated
            ├── <replicateName>.macs2.narrowPeak.genelist
            ├── <replicateName>.macs2.narrowPeak.annotation_summary
            ├── <replicateName>.macs2.narrowPeak.annotation_distribution
            ├── <sampleName>.macs2.pooled.narrowPeak
            ├── <sampleName>.macs2.consensus.bed
            ├── ROI.counts.tsv
            ├── bigwig
            ├── <replicateName>.macs2.narrowPeak_motif_enrichment
            │   └── knownResults
            ├── DiffATAC
            ├── <replicateName>.macs2.consensus.bed_motif_enrichment
            ├── fixed_width
            └── tn5nicks
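The naming pattern in this hierarchy makes it straightforward to collect the per-replicate peak files for one peak caller. The sketch below counts peaks per replicate, assuming the <replicateName>.<caller>.narrowPeak naming shown above and one peak per non-empty line; count_peaks is a hypothetical helper, not an ASPEN command.

```python
from pathlib import Path


def count_peaks(workdir, caller):
    """Return {replicateName: number of peaks} for caller 'genrich' or 'macs2'.

    Only files named <replicateName>.<caller>.narrowPeak are matched, so
    pooled peak files and .annotated/.genelist companions are skipped.
    """
    peak_dir = Path(workdir) / "results" / "peaks" / caller
    suffix = f".{caller}.narrowPeak"
    counts = {}
    for path in sorted(peak_dir.glob(f"*{suffix}")):
        if not path.is_file():
            continue
        replicate = path.name[: -len(suffix)]
        with open(path) as handle:
            # One peak per non-empty line.
            counts[replicate] = sum(1 for line in handle if line.strip())
    return counts
```

For example, count_peaks("WORKDIR", "macs2") returns a dictionary keyed by replicate name, which is handy for spotting replicates with unusually few peaks.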
Some of the important folders and files are highlighted below:
Folders
bigwig:
For easy visualization, the data are converted to bigWig format and saved in the respective bigwig folders. The bigWig files can be loaded directly into the UCSC Genome Browser or IGV.
tn5nicks:
This folder hosts the per-replicate BAM files containing the Tn5 nicking sites in the Genrich or MACS2 "peakcalling" reads, respectively.
DiffATAC:
Contains the DESeq2 differential accessibility results, both per contrast and aggregated across all contrasts in contrasts.tsv. These results are based solely on Tn5 nick counts.
fixed_width:
This folder contains fixed-width consensus peaks across replicates and samples, represented in the "Regions-Of-Interest" (ROI) files. The ROI.bed file lists the genomic regions where chromatin accessibility is analyzed using DESeq2, with results stored in the DiffATAC folder.
<replicateName>.macs2.narrowPeak_motif_enrichment; <replicateName>.genrich.narrowPeak_motif_enrichment; <replicateName>.macs2.consensus.bed_motif_enrichment; <replicateName>.genrich.consensus.bed_motif_enrichment:
These folders contain the motif enrichments calculated using HOMER and AME for the peaks called for each replicate and for the sample consensus peaks, using both MACS2 and Genrich. Two types of motif enrichment are performed:
- Enrichment of known HOCOMOCO (version 11) motifs for HUMAN, MOUSE, or BOTH using HOMER. See the file knownResults.html.
- De novo motif enrichment using AME from the MEME Suite. See the file ame_results.txt. Custom parallelization is used to optimize the AME-based enrichment analysis.
Files
*.narrowPeak:
Called peaks from Genrich or MACS2.
- Annotated peak files:
Peaks are annotated with ChIPseeker and the results are saved in the following files:
.annotated
Tab-delimited text file with the following columns:
Column Number | Field Name | Description |
---|---|---|
1 | #peakID | Peak identifier |
2 | chrom | Peak chromosome |
3 | chromStart | Peak start coordinate |
4 | chromEnd | Peak end coordinate |
5 | width | Peak width |
6 | annotation | Peak annotation (Promoter; 3' or 5' UTR; Distal; Downstream; Exon; Intron) |
7 | geneChr | Gene chromosome |
8 | geneStart | Gene start coordinate |
9 | geneEnd | Gene end coordinate |
10 | geneLength | Gene length (including introns) |
11 | geneStrand | Gene strand |
12 | geneId | Gene identifier |
13 | transcriptId | Transcript identifier |
14 | distanceToTSS | Distance of peak from the Transcription Start Site |
15 | ENSEMBL | Gene Ensembl ID |
16 | SYMBOL | Gene symbol |
17 | GENENAME | Gene description |
18 | score | Score from .narrowPeak file |
19 | signalValue | Signal from .narrowPeak file |
20 | pValue | p-value from .narrowPeak file |
21 | qValue | q-value from .narrowPeak file |
22 | peak | Distance of peak summit from peak start coordinate |
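To show how the column layout above can be used, here is a minimal Python sketch that tallies peaks per annotation category from a .annotated file. It assumes the first line is the header with an annotation column as listed in the table, and that any parenthetical detail after the category name (for example a transcript or intron label) can be dropped; that second point is an assumption about the annotation strings, and annotation_counts is a hypothetical helper.

```python
import csv
from collections import Counter


def annotation_counts(annotated_path):
    """Count peaks per annotation category in a .annotated file."""
    counts = Counter()
    with open(annotated_path, newline="") as handle:
        # QUOTE_NONE keeps quote characters in free-text fields literal.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        column = header.index("annotation")  # column 6 in the table above
        for row in reader:
            if len(row) <= column:
                continue  # skip blank or truncated lines
            # Assumption: drop any parenthetical detail after the category.
            category = row[column].split("(")[0].strip()
            counts[category] += 1
    return dict(counts)
```

The resulting dictionary gives the same kind of per-category tally that the .annotation_summary file reports, which makes it a convenient cross-check.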
.genelist
This is a tab-delimited file with the names (Ensembl ID, gene symbol) of genes that have ATAC-seq peaks in their promoter regions. This file can be used downstream for gene enrichment analysis (ORA, or over-representation analysis).
.annotation_summary; .annotation_distribution
Tab-delimited files that provide statistics on peak annotations, quantifying the number of peaks found in promoters, exonic regions, distal intergenic regions, etc. The .annotation_distribution file is used to visualize these annotation distributions in the MultiQC report.
ROI.counts.tsv
This file contains the read counts for each Region-Of-Interest (ROI) across all replicates of all samples. It is a tab-delimited file with the following columns:
Column Number | Field Name | Description |
---|---|---|
1 | Geneid | Region-Of-Interest identifier |
2 | Chr | Chromosome of the ROI |
3 | Start | Start coordinate of the ROI |
4 | End | End coordinate of the ROI |
5 | Strand | "." |
6 | Length | Length of the ROI |
7 | sample1_replicate1 | Tn5 nicking site counts in this ROI for replicate1 of sample1 |
8 | sample1_replicate2 | Tn5 nicking site counts in this ROI for replicate2 of sample1 |
... | ... | ... |
n | sampleN_replicateM | Tn5 nicking site counts in this ROI for replicateM of sampleN |
Each row represents one ROI, and columns 7 onward hold the counts for each replicate, which allows differential accessibility analysis.
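As a sanity check before differential analysis, the per-replicate totals can be computed directly from ROI.counts.tsv. The sketch below assumes the six fixed columns listed in the table, followed by one count column per replicate. It also skips any leading lines that start with "#", which is an assumption about the file and not something stated above. The helper name replicate_totals is hypothetical.

```python
import csv

# The six fixed leading columns listed in the table above.
FIXED_COLUMNS = ["Geneid", "Chr", "Start", "End", "Strand", "Length"]


def replicate_totals(counts_path):
    """Sum the counts over all ROIs for each replicate column."""
    with open(counts_path, newline="") as handle:
        # Assumption: lines starting with '#' are comments, not data.
        lines = (line for line in handle if not line.startswith("#"))
        reader = csv.reader(lines, delimiter="\t")
        header = next(reader)
        if header[: len(FIXED_COLUMNS)] != FIXED_COLUMNS:
            raise ValueError(f"unexpected leading columns: {header[:6]}")
        names = header[len(FIXED_COLUMNS):]
        totals = [0.0] * len(names)
        for row in reader:
            for index, value in enumerate(row[len(FIXED_COLUMNS):]):
                totals[index] += float(value)
    return dict(zip(names, totals))
```

Replicates whose totals differ greatly from the others may deserve a closer look in the MultiQC report before the DiffATAC results are interpreted.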
Warning
DISCLAIMER: This folder hierarchy is specific to v1.0.6 and may change in other versions.