Preparing Files¶
The CARLISLE pipeline is configured and controlled through a set of editable configuration and manifest files. After running carlisle --runmode=init --workdir=/path/to/workdir (see Running the Pipeline), default templates for these files are automatically generated under WORKDIR/config/.
Configuration Files¶
CARLISLE’s configuration system is modular and designed for both flexibility and transparency. The main configuration files include:
- config/config.yaml – global pipeline settings and user parameters.
- resources/cluster.yaml – cluster resource specifications for Biowulf or other SLURM-based systems.
- resources/tools.yaml – software versions, tool paths, and binary locations.
Cluster Configuration (cluster.yaml)¶
The cluster configuration file defines computational resources such as memory, CPU cores, and runtime limits for each Snakemake rule. Parameters can be adjusted globally or per rule. Edits should be made with caution, as inappropriate resource settings may cause job failures or queuing delays.
Tools Configuration (tools.yaml)¶
This file specifies which versions of each tool are used during execution. When running on Biowulf, tools are automatically loaded from environment modules, ensuring consistency across users. Once CARLISLE transitions to containers, these version pins will map to container image tags instead of module versions, guaranteeing strict reproducibility.
Primary Configuration (config.yaml)¶
The main configuration file (config.yaml) contains parameters grouped into logical sections:
- Folders and Paths: defines input/output directories and manifest file locations.
- User Parameters: controls feature-level behavior (e.g., thresholds, normalization methods, peak calling options).
- References: specifies genome assemblies, index paths, spike-in references, and species annotations.
⚠️ Important: Always verify that reference genome paths and spike-in references correspond to accessible Biowulf or shared filesystem locations.
User Parameters¶
Run Contrasts¶
run_contrasts: Set to true to enable DESeq2 differential analysis between conditions defined in the contrasts manifest. Set to false to skip differential analysis and only produce peaks, QC, and annotation outputs.
run_contrasts: true
ℹ️ Note: Differential analysis requires at least two biological replicates per condition in the contrasts manifest. If you have only one replicate per condition, set run_contrasts: false.
Spike-in Controls¶
CARLISLE supports spike-in normalization using reference genomes such as E. coli or Drosophila melanogaster. The parameter spikein_genome defines the spike-in species, and spikein_reference provides the corresponding FASTA path.
Example for E. coli spike-in:
run_contrasts: true
norm_method: "spikein"
spikein_genome: "ecoli"
spikein_reference:
  ecoli:
    fa: "PIPELINE_HOME/resources/spikein/Ecoli_GCF_000005845.2_ASM584v2_genomic.fna"
Example for Drosophila spike-in:
run_contrasts: true
norm_method: "spikein"
spikein_genome: "drosophila"
spikein_reference:
  drosophila:
    fa: "/fdb/igenomes/Drosophila_melanogaster/UCSC/dm6/Sequence/WholeGenomeFasta/genome.fa"
Example for Saccharomyces cerevisiae spike-in:
norm_method: "spikein"
spikein_genome: "saccharomyces"
spikein_reference:
  saccharomyces:
    fa: "$PIPELINE_HOME/resources/spikein/S_cer_S288C_R64.fna"
If spike-ins are unavailable or insufficient, normalization can alternatively be performed based on library size. Recommended workflow:
- Run CARLISLE with norm_method: spikein for an initial QC assessment.
- Evaluate spike-in alignment statistics (found in alignment_stats/alignment_stats.tsv in your results directory).
- Change norm_method to library in your config.yaml.
- Re-run CARLISLE; pooled control outputs are automatically regenerated when norm_method changes.
ℹ️ Don’t have spike-in samples? That is fine: spike-in normalization is optional. If your experiment did not include spike-in DNA (e.g., E. coli or Drosophila chromatin), simply set norm_method: "library" from the start and omit the spikein_genome and spikein_reference parameters entirely. Library-size normalization is a valid and commonly used alternative.
ℹ️ Normalization change behavior: Pooled control fragment and bedgraph filenames encode the normalization method (e.g., *.SPIKEIN.bedgraph). Changing norm_method in an existing results directory causes Snakemake to detect stale targets and regenerate them automatically; no manual deletion of intermediate files is required.
ℹ️ Note: alignment_stats.tsv is generated automatically by the pipeline and does not need to be specified in your configuration.
Duplication Status¶
Control deduplication behavior using the dupstatus parameter:
dupstatus: "dedup, no_dedup"
✅ Recommendation: Keep this setting unchanged. Let CARLISLE run with both the dedup and no_dedup options, then choose which peak sets to use later.
🧬 Note: Linear deduplication is essential for CUT&RUN and CUT&Tag datasets to avoid PCR bias and ensure accurate read quantification.
Peak Callers¶
CARLISLE supports three major peak callers, configurable via the peaktype parameter:
- MACS2 – supports narrowPeak and broadPeak modes.
- SEACR – supports stringent and relaxed thresholds, for both normalized and non-normalized datasets.
- GoPeaks – optimized for CUT&RUN and CUT&Tag data; recommended for most applications.
✅ Recommendation: Use GoPeaks for its superior signal detection in sparse chromatin accessibility datasets.
All valid peaktype values:
| Value | Description |
|---|---|
| macs2_narrow | MACS2 narrow peaks (recommended for TFs and sharp histone marks like H3K4me3) |
| macs2_broad | MACS2 broad peaks (H3K27me3, H3K9me3). Note: DESeq2 differential analysis often fails on broad peaks due to excessive peak counts |
| seacr_stringent | SEACR stringent threshold (lower sensitivity, higher specificity) |
| seacr_relaxed | SEACR relaxed threshold (higher sensitivity) |
| gopeaks_narrow | GoPeaks narrow peaks (TFs, sharp marks) |
| gopeaks_broad | GoPeaks broad peaks (broad histone marks) |
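A peaktype string can be checked against the table above before launching a run. The following is a minimal Python sketch and not part of CARLISLE itself: the helper name `parse_peaktype` is hypothetical, and the set of valid values is simply copied from the table.

```python
# Valid values copied from the peaktype table above.
VALID_PEAKTYPES = {
    "macs2_narrow", "macs2_broad",
    "seacr_stringent", "seacr_relaxed",
    "gopeaks_narrow", "gopeaks_broad",
}


def parse_peaktype(value: str) -> list[str]:
    """Split a comma-separated peaktype string and reject unknown entries.

    Hypothetical helper for illustration; CARLISLE's own parsing may differ.
    """
    items = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [item for item in items if item not in VALID_PEAKTYPES]
    if unknown:
        raise ValueError(f"Unknown peaktype value(s): {', '.join(unknown)}")
    return items


print(parse_peaktype("macs2_narrow, gopeaks_narrow, seacr_stringent"))
```

A typo such as `macs2` or `gopeaks` (without the `_narrow` or `_broad` suffix) raises a `ValueError` here, before any cluster time is spent.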
You can run any combination in a single pipeline execution by listing them comma-separated:
peaktype: "macs2_narrow, gopeaks_narrow, seacr_stringent"
MACS2 Control Option¶
Enable control sample usage for MACS2 to improve specificity:
macs2_control: "Y"
Optional Analysis Steps¶
Control execution of computationally intensive annotation steps:
run_rose: false # ROSE super-enhancer analysis (set to true to enable)
run_go_enrichment: false # ChIP-Enrich GO enrichment (set to true to enable)
run_motif_enrichment_called_peaks: false # HOMER motif discovery on all called peaks
run_motif_enrichment_deg_peaks: false # HOMER + AME motif enrichment on DEG peaks only
⏱️ Performance Note: ROSE, GO enrichment, and motif enrichment are disabled by default due to their computational requirements. Enable them when you need super-enhancer identification, pathway enrichment, or motif discovery.
- run_motif_enrichment_called_peaks: When true, runs HOMER findMotif on the full set of called peaks for each sample/condition.
- run_motif_enrichment_deg_peaks: When true, runs both HOMER motif discovery and AME (Analysis of Motif Enrichment) against the HOCOMOCO v14 CORE motif database on up-regulated peak BED files (up_group1.bed, up_group2.bed) from each contrast. Both tools must be enabled together for full DEG motif enrichment output.
When run_go_enrichment: true, additional parameters control the enrichment methods and gene sets used:
go_enrichment_methods: "chipenrich" # options: chipenrich, polyenrich, hybridenrich
geneset_id: "GOBP,GOCC,GOMF,kegg_pathway,reactome"
Available geneset_id values include: biocarta_pathway, ctd, cytoband, drug_bank, GOBP, GOCC, GOMF, hallmark, immunologic, kegg_pathway, mesh, metabolite, microrna, oncogenic, panther_pathway, pfam, protein_interaction_biogrid, reactome, transcription_factors.
⚠️ Performance: hybridenrich is significantly slower than chipenrich or polyenrich. Only add it when its model is specifically required. GO enrichment is only supported for hg19 and hg38 samples.
Pooled Controls¶
Control whether the pipeline pools control replicates for peak calling:
pool_controls: true
When enabled (true), CARLISLE runs peak calling in both modes:
- Individual mode – Each treatment replicate is paired with its individual control replicate
- Pooled mode – Each treatment replicate is compared against merged high-depth controls from all control replicates
This dual-mode analysis enables comparison of replicate-specific vs merged control strategies. Results are organized in separate individual/ and pooled/ subdirectories within peak calling outputs.
💡 Use Case: Pooled controls provide increased depth and reduced noise but may miss replicate-specific artifacts. Running both modes allows downstream selection of the most appropriate strategy.
⚠️ Note: If controls have no replicates to pool (each control has only 1 replicate), pooling will have no effect. Consider setting pool_controls: false in such cases.
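Whether pooling can have any effect is easy to read off the sample manifest. The sketch below is an illustration only, not pipeline code: the helper names are hypothetical, the manifest is assumed to be tab-separated with the sampleName and isControl columns described in the Sample Manifest section, and the sample names are made up.

```python
import csv
import io
from collections import Counter


def control_replicate_counts(manifest_tsv: str) -> dict[str, int]:
    """Count manifest rows per control sampleName (rows where isControl is 'Y')."""
    reader = csv.DictReader(io.StringIO(manifest_tsv), delimiter="\t")
    return dict(Counter(row["sampleName"] for row in reader if row["isControl"] == "Y"))


def pooling_has_effect(manifest_tsv: str) -> bool:
    """Pooling only changes anything if some control has two or more replicates."""
    return any(n >= 2 for n in control_replicate_counts(manifest_tsv).values())


# Hypothetical manifest, reduced to the columns this check needs.
example = "\n".join([
    "sampleName\treplicateNumber\tisControl",
    "H3K4me3_treated\t1\tN",
    "H3K4me3_treated\t2\tN",
    "IgG_control\t1\tY",
])
print(pooling_has_effect(example))  # False: the only control has a single replicate
```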
Singularity Cache Directory¶
CARLISLE uses Singularity/Apptainer containers for R-based steps (DESeq2, GO enrichment, ROSE). Most users do not need to configure this. Loading the ccbrpipeliner module automatically sets SIFCACHE to the shared CCBR container cache, so pre-pulled images are used immediately:
module load ccbrpipeliner
# SIFCACHE is now set to the shared cache; no further action required
If you need to override the cache location (e.g., for a custom or updated image), you can do so in two ways:
# Option 1: Pass at runtime (takes precedence over everything)
carlisle --runmode=run --workdir=/path/to/workdir --singcache=/path/to/sif/cache
# Option 2: Set as an environment variable before running
export SIFCACHE=/path/to/sif/cache
carlisle --runmode=run --workdir=/path/to/workdir
If neither --singcache nor $SIFCACHE is set, CARLISLE resolves the cache directory in the following order:
- /data/${USER}/.singularity, if /data/${USER}/ exists on the filesystem (standard on Biowulf)
- ${WORKDIR}/.singularity, as the fallback when /data/${USER}/ is not available
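The precedence described in this section (runtime flag, then environment variable, then the two filesystem fallbacks) can be summarized in a few lines. This is a hedged sketch of the documented order only, not CARLISLE's actual implementation: `resolve_sif_cache` and its `data_root` parameter are hypothetical and exist so the logic can be exercised outside Biowulf.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def resolve_sif_cache(
    workdir: str,
    singcache: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    data_root: str = "/data",
) -> Path:
    """Return the cache directory following the documented precedence."""
    env = os.environ if env is None else env
    if singcache:                        # 1. --singcache passed at runtime
        return Path(singcache)
    if env.get("SIFCACHE"):              # 2. $SIFCACHE environment variable
        return Path(env["SIFCACHE"])
    user = env.get("USER", "")
    user_dir = Path(data_root) / user
    if user and user_dir.is_dir():       # 3. /data/${USER}/.singularity
        return user_dir / ".singularity"
    return Path(workdir) / ".singularity"  # 4. ${WORKDIR}/.singularity


# Demonstration with a temporary directory standing in for /data.
with tempfile.TemporaryDirectory() as fake_data:
    (Path(fake_data) / "alice").mkdir()
    print(resolve_sif_cache("/scratch/run1", env={"USER": "alice"}, data_root=fake_data))
    print(resolve_sif_cache("/scratch/run1", env={"USER": "bob"}, data_root=fake_data))
```

In the demonstration, the first call resolves to the `alice/.singularity` directory under the stand-in data root, while the second falls back to `/scratch/run1/.singularity` because no directory exists for `bob`.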
⚠️ Warning: Pointing to a cache directory that does not already contain the required .sif files will cause Singularity to pull all container images from Docker Hub. This can take significant time (depending on network conditions) and consume several gigabytes of disk space. Use the shared CCBR cache whenever possible.
Control Sample Requirements¶
By default, CARLISLE requires a control sample (e.g., IgG, input DNA) paired with every treatment sample. Each non-control row in the sample manifest must have controlName and controlReplicateNumber filled in.
If you do not have control samples, control-free mode is supported — see Control-Free Mode below.
💡 No controls? Options include:
Control-Free Mode¶
When no IgG or antibody control samples are available, set run_without_controls: true to run all peak callers without a control:
run_without_controls: true
quality_thresholds: "0.01" # SEACR uses these numeric threshold value(s) in control-free mode
When enabled:
- Peaks are called for all treatment samples without a background control.
- macs2_control is automatically forced to "N" (no-control MACS2 mode).
- pool_controls is automatically forced to false.
- SEACR uses each quality_thresholds value instead of a control bedgraph. Each value represents the fraction of the signal distribution used as the peak-calling threshold (e.g., 0.01 = top 1%).
- GoPeaks runs without the -c control BAM flag.
- The sample manifest does not require controlName or controlReplicateNumber columns to be filled in.
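The two forced settings above can be pictured as a small override step applied to the user's configuration. The sketch below only mirrors the documented behavior and is not taken from the pipeline source; `apply_control_free_overrides` is a hypothetical name, and the config is assumed to be a plain dictionary loaded from config.yaml.

```python
def apply_control_free_overrides(config: dict) -> dict:
    """Return a copy of the config with control-free mode overrides applied.

    Mirrors the documented behavior: when run_without_controls is true,
    macs2_control is forced to "N" and pool_controls is forced to False.
    """
    effective = dict(config)
    if effective.get("run_without_controls", False):
        effective["macs2_control"] = "N"
        effective["pool_controls"] = False
    return effective


user_config = {
    "run_without_controls": True,
    "macs2_control": "Y",      # ignored in control-free mode
    "pool_controls": True,     # ignored in control-free mode
    "quality_thresholds": "0.01",
}
print(apply_control_free_overrides(user_config))
```

The practical consequence is that values left over from a previous run with controls do not need to be edited by hand when switching to control-free mode.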
Example manifest for control-free mode (all samples are treatments; no control rows):
| sampleName | replicateNumber | isControl | controlName | controlReplicateNumber | path_to_R1 | path_to_R2 |
|---|---|---|---|---|---|---|
| H3K4me3_treated | 1 | N | | | /path/to/H3K4me3_rep1.R1.fastq.gz | /path/to/H3K4me3_rep1.R2.fastq.gz |
| H3K4me3_treated | 2 | N | | | /path/to/H3K4me3_rep2.R1.fastq.gz | /path/to/H3K4me3_rep2.R2.fastq.gz |
⚠️ Caution: Control-free peak calling will yield higher false-positive rates. Results should be interpreted with care and ideally validated by comparing to matched control experiments.
deepTools Annotation Tracks¶
Control which annotation BED files are used to build deepTools coverage heatmaps and profiles:
deeptools_bedtypes: "geneinfo,protein_coding,ca_ctcf,ca_h3k4me3,ca_tf,pls,pels"
Available options (comma-separated, no spaces):
cCRE bedtypes (pls, pels, dels, ca_ctcf, ca_h3k4me3, ca_tf) are sourced from the ENCODE SCREEN database.
| Bedtype | Description |
|---|---|
| geneinfo | All genes: gene bodies, promoters, intergenic regions |
| protein_coding | Protein-coding genes only |
| pls | ENCODE SCREEN promoter-like signatures |
| pels | ENCODE SCREEN proximal enhancer-like signatures |
| dels | ENCODE SCREEN distal enhancer-like signatures |
| ca_ctcf | CTCF-bound chromatin accessibility regions (ENCODE SCREEN) |
| ca_h3k4me3 | H3K4me3-marked chromatin accessibility / active promoters (ENCODE SCREEN) |
| ca_tf | Transcription factor-bound chromatin accessibility (ENCODE SCREEN) |
⚠️ Memory Warning: The dels BED file (ENCODE dELS) is very large. Including dels in deeptools_bedtypes requires >=240g memory for the deeptools_mat and deeptools_heatmap rules. Update the corresponding entries in cluster.yaml before enabling it.
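A deeptools_bedtypes value can be sanity-checked for the "no spaces" rule and for the memory-hungry dels entry before submission. This is an illustrative sketch with hypothetical helper names; the valid set is copied from the table above, and the pipeline's own validation may differ.

```python
# Valid values copied from the bedtype table above.
VALID_BEDTYPES = {
    "geneinfo", "protein_coding",
    "pls", "pels", "dels",
    "ca_ctcf", "ca_h3k4me3", "ca_tf",
}


def parse_bedtypes(value: str) -> list[str]:
    """Split a comma-separated bedtype string, enforcing the no-spaces rule."""
    if " " in value:
        raise ValueError("deeptools_bedtypes must be comma-separated with no spaces")
    items = value.split(",")
    unknown = [item for item in items if item not in VALID_BEDTYPES]
    if unknown:
        raise ValueError(f"Unknown bedtype(s): {', '.join(unknown)}")
    return items


def needs_high_memory(bedtypes: list[str]) -> bool:
    """dels is the entry that calls for raising memory in cluster.yaml."""
    return "dels" in bedtypes


selected = parse_bedtypes("geneinfo,protein_coding,ca_ctcf,ca_h3k4me3,ca_tf,pls,pels")
print(needs_high_memory(selected))  # False: dels is not in the default list
```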
Quality Thresholds¶
Set peak-calling quality thresholds using the quality_thresholds parameter:
quality_thresholds: "0.1, 0.05, 0.01"
Refer to tool-specific defaults:
MACS2 Broad Peak Threshold¶
For MACS2 broad peak calling (macs2_broad), an additional p-value threshold is applied independently of the global quality_thresholds:
macs2_broad_peak_threshold: "0.01"
This maps to the --broad-cutoff MACS2 argument and controls the significance cutoff for the broad-region merging step. The default of 0.01 is generally appropriate; however, reduce it (e.g., 0.001) if broad peaks are excessively numerous or fragmented.
ℹ️ Note: DESeq2 differential analysis frequently fails for broadPeak outputs due to excessive peak counts. For differential analysis, macs2_narrow or gopeaks_narrow are recommended.
Differential Analysis Thresholds¶
DESeq2 significance cutoffs for contrast-based differential enrichment:
contrasts_fdr_cutoff: 0.05 # Benjamini-Hochberg adjusted p-value (FDR) threshold
contrasts_lfc_cutoff: 0.59 # log2 fold-change threshold (~1.5-fold change)
Both thresholds are applied simultaneously: a peak is considered differentially enriched only if it passes both FDR and log2FC filters. Adjust contrasts_lfc_cutoff to 1.0 (2-fold) for more conservative enrichment calls, or lower both thresholds if the experiment has high biological variability.
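To make the combined filter concrete, the sketch below applies both cutoffs to a single peak and shows how a log2 cutoff translates into a fold change. It is an illustration under stated assumptions, not the pipeline's DESeq2 code: the function name is hypothetical, and the use of the absolute log2 fold change and of a strict `<` comparison for the FDR are assumptions about how the filter is applied.

```python
def is_differentially_enriched(
    padj: float,
    log2fc: float,
    fdr_cutoff: float = 0.05,
    lfc_cutoff: float = 0.59,
) -> bool:
    """A peak passes only if it clears both the FDR and the log2FC cutoff.

    Assumption: the log2FC filter is applied to the absolute value, so
    changes in either direction count.
    """
    return padj < fdr_cutoff and abs(log2fc) >= lfc_cutoff


# A log2 cutoff converts to a linear fold change as 2 ** cutoff.
print(round(2 ** 0.59, 2))  # 1.51, i.e. roughly 1.5-fold
print(2 ** 1.0)             # 2.0, i.e. 2-fold

print(is_differentially_enriched(padj=0.01, log2fc=0.8))   # True: passes both
print(is_differentially_enriched(padj=0.01, log2fc=0.3))   # False: fold change too small
print(is_differentially_enriched(padj=0.20, log2fc=2.0))   # False: not significant
```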
Reference Files¶
CARLISLE includes comprehensive reference annotations for supported genomes:
Built-in Annotations¶
For each genome (hg38, hg19, hs1/T2T, mm10), the pipeline provides:
- Gene annotations: TSS, gene bodies, promoters, intergenic regions (protein-coding and all genes)
- Blacklisted regions: ENCODE DAC blacklists for artifact exclusion
- cCREs (candidate cis-Regulatory Elements): From the ENCODE SCREEN database
  - PLS – Promoter-like signatures
  - pELS – Proximal enhancer-like signatures
  - dELS – Distal enhancer-like signatures
  - CA-CTCF – CTCF-bound chromatin accessibility regions
  - CA-H3K4me3 – H3K4me3-marked chromatin accessibility (active promoters)
  - CA-TF – Transcription factor-bound chromatin accessibility
  - CA – General chromatin accessibility
  - TF – Transcription factor binding sites
These annotations are automatically used by HOMER, GO enrichment, and other annotation tools.
Custom Genomes¶
Additional reference genomes can be integrated by defining:
species_name:
  fa: "/path/to/species.fa"
  blacklist: "/path/to/blacklistbed/species.bed.gz"
  regions: "chr1 chr2 chr3"
  macs2_g: "hs" # genome shorthand for MACS2
  tss_bed: "/path/to/tss.bed.gz"
  # Add cCRE annotations if available
  ca_pls_bed: "/path/to/cCREs.PLS.bed.gz"
  ca_pels_bed: "/path/to/cCREs.pELS.bed.gz"
  ca_dels_bed: "/path/to/cCREs.dELS.bed.gz"
🧭 Best Practice: Store reference paths under a centralized /fdb or /data location on Biowulf to ensure accessibility and consistency across users.
Preparing Manifests¶
CARLISLE uses two manifests:
- samplemanifest – required for all analyses.
- contrasts – optional, required only for differential analysis with DESeq2.
Sample Manifest (Required)¶
Defines sample-level metadata, including sample names, controls, and FASTQ paths.
📄 File format: The sample manifest is a tab-separated values (TSV) file. Do not use commas or spaces as delimiters. The header row is required and column names must match exactly.
⚠️ Paired-end only: CARLISLE requires paired-end sequencing data. Single-end data is not supported. Both path_to_R1 and path_to_R2 must point to valid FASTQ files.
Column descriptions:
| Column | Description |
|---|---|
| sampleName | Unique name for the sample (shared across replicates). Must not be a substring of another sampleName. |
| replicateNumber | Positive integer (starting from 1) identifying each replicate within a sampleName. Must be unique per sampleName. Sequential numbering recommended. |
| isControl | Y if this row is a control sample (e.g., IgG), N for treatment samples. |
| controlName | For treatment rows (isControl: N): the sampleName of the paired control. Must be an exact string match to a sampleName where isControl: Y. Leave blank for control rows. |
| controlReplicateNumber | The replicateNumber of the control replicate to pair with this treatment. Leave blank for control rows. |
| path_to_R1 | Absolute path to the R1 (forward) FASTQ file. |
| path_to_R2 | Absolute path to the R2 (reverse) FASTQ file. |
| sampleName | replicateNumber | isControl | controlName | controlReplicateNumber | path_to_R1 | path_to_R2 |
|---|---|---|---|---|---|---|
| 53_H3K4me3 | 1 | N | HN6_IgG_rabbit_negative_control | 1 | | |
| 54_H3K4me3 | 2 | N | HN6_IgG_rabbit_negative_control | 1 | | |
| HN6_IgG_rabbit_negative_control | 1 | Y | | | | |
ℹ️ Note: controlName and controlReplicateNumber are required for non-control samples in normal mode. In control-free mode (run_without_controls: true), leave these columns blank for all samples and omit control rows entirely.
⚠️ Exact match required: The controlName value must exactly match a sampleName in the same manifest where isControl: Y. Spelling differences, extra spaces, or case mismatches will cause the pipeline to fail.
⚠️ Sample name uniqueness: Sample names must not be substrings of each other (e.g., having both H3K4me3 and H3K4me3_rep1 as sampleName values will cause incorrect sample matching). Use fully distinct names for all samples.
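Most of the manifest rules in this section can be checked mechanically before a run. The following Python sketch is one possible pre-flight check for normal (with-control) mode; it is not shipped with CARLISLE, `validate_manifest` is a hypothetical name, and the sample names and paths in the example are made up.

```python
import csv
import io

REQUIRED_COLUMNS = [
    "sampleName", "replicateNumber", "isControl", "controlName",
    "controlReplicateNumber", "path_to_R1", "path_to_R2",
]


def validate_manifest(tsv_text: str) -> list[str]:
    """Return a list of problems found; an empty list means none were detected."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", restval="")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    rows = list(reader)
    errors = []

    # replicateNumber must be unique within each sampleName.
    seen = set()
    for row in rows:
        key = (row["sampleName"], row["replicateNumber"])
        if key in seen:
            errors.append(f"duplicate replicate {key[1]} for sample '{key[0]}'")
        seen.add(key)

    # Each treatment row must reference an existing control replicate exactly.
    controls = {(r["sampleName"], r["replicateNumber"]) for r in rows if r["isControl"] == "Y"}
    for row in rows:
        if row["isControl"] == "N":
            target = (row["controlName"], row["controlReplicateNumber"])
            if target not in controls:
                errors.append(
                    f"sample '{row['sampleName']}': no control row matches "
                    f"controlName='{target[0]}', controlReplicateNumber='{target[1]}'"
                )

    # No sampleName may be a substring of another sampleName.
    names = sorted({r["sampleName"] for r in rows})
    for a in names:
        for b in names:
            if a != b and a in b:
                errors.append(f"sampleName '{a}' is a substring of '{b}'")
    return errors


header = "\t".join(REQUIRED_COLUMNS)
manifest = "\n".join([
    header,
    "H3K4me3_treated\t1\tN\tigg_control\t1\t/path/to/t1.R1.fastq.gz\t/path/to/t1.R2.fastq.gz",
    "IgG_control\t1\tY\t\t\t/path/to/c1.R1.fastq.gz\t/path/to/c1.R2.fastq.gz",
])
for problem in validate_manifest(manifest):
    print(problem)
```

The example deliberately contains a case mismatch (`igg_control` versus `IgG_control`), which is reported as a missing control. The check does not verify that the FASTQ paths exist.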
Contrast Manifest (Optional)¶
Specifies conditions for differential analysis:
| condition1 | condition2 |
|---|---|
| MOC1_siSmyd3_2m_25_HCHO | MOC1_siNC_2m_25_HCHO |
📊 Requirement: Each condition must have at least two biological replicates to perform DESeq2-based differential analysis.
ℹ️ How conditions map to samples:
- Values in condition1 and condition2 must exactly match sampleName values in the sample manifest.
- All replicates with that sampleName are included automatically; do not list individual replicates.
- condition2 is the reference group (denominator). A positive log2FoldChange in results means higher enrichment in condition1. If unsure, put the control or untreated condition in condition2.
ℹ️ Multiple contrasts: You can include multiple rows to test several comparisons in one run. Each row is an independent DESeq2 comparison:
| condition1 | condition2 |
|---|---|
| treated_H3K4me3 | untreated_H3K4me3 |
| treated_H3K27me3 | untreated_H3K27me3 |
| treated_H3K4me3 | treated_H3K27me3 |
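The two contrast requirements (every condition matches a sampleName, and every condition has at least two replicates) can be verified together. The sketch below is a hypothetical helper, not pipeline code; it takes the (sampleName, replicateNumber) pairs from the sample manifest and the (condition1, condition2) pairs from the contrasts manifest as plain Python lists, and the replicate counts in the example are invented.

```python
from collections import Counter


def check_contrasts(sample_rows, contrasts, min_replicates=2):
    """Return problems for conditions that are unknown or under-replicated.

    sample_rows: iterable of (sampleName, replicateNumber) from the sample manifest.
    contrasts:   iterable of (condition1, condition2) from the contrasts manifest.
    """
    counts = Counter(name for name, _replicate in sample_rows)
    problems = []
    for condition1, condition2 in contrasts:
        for condition in (condition1, condition2):
            if condition not in counts:
                problems.append(f"{condition}: no matching sampleName")
            elif counts[condition] < min_replicates:
                problems.append(f"{condition}: only {counts[condition]} replicate(s)")
    return problems


# Hypothetical inputs for illustration.
samples = [
    ("treated_H3K4me3", 1), ("treated_H3K4me3", 2),
    ("untreated_H3K4me3", 1),
]
contrasts = [
    ("treated_H3K4me3", "untreated_H3K4me3"),
    ("treated_H3K4me3", "treated_H3K27me3"),
]
for problem in check_contrasts(samples, contrasts):
    print(problem)
```

If this check reports an under-replicated condition, either add replicates or set run_contrasts: false, as described under Run Contrasts.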