Preparing Files

The CARLISLE pipeline is configured and controlled through a set of editable configuration and manifest files. Upon initialization, default templates for these files are automatically generated under the /WORKDIR/config and /WORKDIR/manifest directories.

⚙️ Technical Note: CARLISLE follows a Snakemake-driven workflow architecture where all configuration parameters are read dynamically at runtime. Users are encouraged to version-control configuration files (e.g., via Git) to ensure reproducibility across runs.

🚀 Future Development: While dependencies are currently module-loaded on the Biowulf HPC environment, future releases will adopt containerization using Singularity/Apptainer and Docker. This shift will provide complete environment encapsulation, allowing consistent execution across HPC and cloud environments.


Configuration Files

CARLISLE’s configuration system is modular and designed for both flexibility and transparency. The main configuration files include:

  • config/config.yaml – global pipeline settings and user parameters.
  • resources/cluster.yaml – cluster resource specifications for Biowulf or other SLURM-based systems.
  • resources/tools.yaml – software versions, tool paths, and binary locations.

Cluster Configuration (cluster.yaml)

The cluster configuration file defines computational resources such as memory, CPU cores, and runtime limits for each Snakemake rule. Parameters can be adjusted globally or per rule. Edits should be made with caution, as inappropriate resource settings may cause job failures or queuing delays.
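
As a rough illustration of how global and per-rule settings might combine, the sketch below merges a default block with a rule-specific block. It assumes the Snakemake convention of a __default__ key for global values; the rule names (align, trim) and resource keys (mem, threads, time) are hypothetical and may not match CARLISLE's actual cluster.yaml.

```python
def resolve_resources(cluster_cfg, rule_name):
    """Merge global defaults with any per-rule overrides."""
    # "__default__" is the conventional global key in Snakemake cluster configs;
    # whether CARLISLE's cluster.yaml uses it is an assumption.
    resources = dict(cluster_cfg.get("__default__", {}))
    # Per-rule values replace the defaults key by key.
    resources.update(cluster_cfg.get(rule_name, {}))
    return resources


# Hypothetical contents of a parsed cluster.yaml (keys and values are illustrative).
cluster_cfg = {
    "__default__": {"mem": "40g", "threads": 2, "time": "04:00:00"},
    "align": {"mem": "100g", "threads": 32},
}

print(resolve_resources(cluster_cfg, "align"))  # overrides mem and threads
print(resolve_resources(cluster_cfg, "trim"))   # falls back to the defaults
```

A rule without its own block inherits every default, so a single global edit affects all such rules at once.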

Tools Configuration (tools.yaml)

This file specifies which versions of each tool are used during execution. When running on Biowulf, tools are automatically loaded from environment modules, ensuring consistency across users. Once CARLISLE transitions to containers, these version pins will map to container image tags instead of module versions, guaranteeing strict reproducibility.

Primary Configuration (config.yaml)

The main configuration file (config.yaml) contains parameters grouped into logical sections:

  • Folders and Paths: defines input/output directories and manifest file locations.
  • User Parameters: controls feature-level behavior (e.g., thresholds, normalization methods, peak calling options).
  • References: specifies genome assemblies, index paths, spike-in references, and species annotations.

⚠️ Important: Always verify that reference genome paths and spike-in references correspond to accessible Biowulf or shared filesystem locations.
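
One way to perform this check before launching a run is sketched below. It walks an already-parsed configuration (a nested dictionary) and reports string values that look like absolute paths but do not exist on the current filesystem. The function name find_missing_paths is hypothetical, and values that begin with a placeholder such as PIPELINE_HOME are skipped because they are not absolute paths.

```python
import os


def find_missing_paths(node, prefix=""):
    """Recursively collect config values that look like absolute paths but do not exist."""
    missing = []
    if isinstance(node, dict):
        for key, value in node.items():
            missing.extend(find_missing_paths(value, f"{prefix}{key}."))
    elif isinstance(node, str) and node.startswith("/") and not os.path.exists(node):
        # Record the dotted key alongside the path that could not be found.
        missing.append((prefix.rstrip("."), node))
    return missing


# Example input mirroring the Drosophila spike-in entry shown in this guide.
references = {
    "spikein_reference": {
        "drosophila": {
            "fa": "/fdb/igenomes/Drosophila_melanogaster/UCSC/dm6/Sequence/WholeGenomeFasta/genome.fa"
        }
    }
}

for key, path in find_missing_paths(references):
    print(f"not accessible: {key} -> {path}")
```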


User Parameters

Spike-in Controls

CARLISLE supports spike-in normalization using reference genomes such as E. coli or Drosophila melanogaster. The parameter spikein_genome defines the spike-in species, and spikein_reference provides the corresponding FASTA path.

Example for E. coli spike-in:

run_contrasts: true
norm_method: "spikein"
spikein_genome: "ecoli"
spikein_reference:
  ecoli:
    fa: "PIPELINE_HOME/resources/spikein/Ecoli_GCF_000005845.2_ASM584v2_genomic.fna"

Example for Drosophila spike-in:

run_contrasts: true
norm_method: "spikein"
spikein_genome: "drosophila"
spikein_reference:
  drosophila:
    fa: "/fdb/igenomes/Drosophila_melanogaster/UCSC/dm6/Sequence/WholeGenomeFasta/genome.fa"

If spike-ins are unavailable or insufficient, normalization can alternatively be performed based on library size. Recommended workflow:

  1. Run CARLISLE with norm_method: spikein for an initial QC assessment.
  2. Evaluate spike-in alignment statistics.
  3. Add alignment_stats to your configuration.
  4. Re-run CARLISLE using library-size normalization.
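
Step 2 of this workflow can be sketched as a small summary over per-sample read counts. The example below is only illustrative: the counts are made up, the function name spikein_summary is hypothetical, and the min_spikein_reads cutoff is an assumed value that should be chosen for your own experiment rather than taken as a CARLISLE default.

```python
def spikein_summary(counts, min_spikein_reads=1000):
    """Summarize spike-in alignment per sample.

    counts maps sample name -> (genome_reads, spikein_reads).
    min_spikein_reads is a user-chosen cutoff (assumption, not a pipeline default).
    """
    summary = {}
    for sample, (genome_reads, spikein_reads) in counts.items():
        total = genome_reads + spikein_reads
        fraction = spikein_reads / total if total else 0.0
        summary[sample] = {
            "spikein_fraction": fraction,
            "sufficient": spikein_reads >= min_spikein_reads,
        }
    return summary


# Made-up alignment counts for two samples.
counts = {
    "53_H3K4me3_1": (9_000_000, 1_000_000),
    "54_H3K4me3_2": (5_000_000, 200),
}

summary = spikein_summary(counts)
# If any sample falls below the cutoff, library-size normalization may be the safer choice.
use_library_size = not all(s["sufficient"] for s in summary.values())
print(summary)
print("consider library-size normalization:", use_library_size)
```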

Duplication Status

Control deduplication behavior using the dupstatus parameter:

dupstatus: "dedup, no_dedup"

✅ Recommendation: Keep this setting unchanged so that CARLISLE runs with both the dedup and no_dedup options; you can then choose which peak sets to use.

🧬 Note: Linear deduplication is essential for CUT&RUN and CUT&Tag datasets to avoid PCR bias and ensure accurate read quantification.

Peak Callers

CARLISLE supports three major peak callers, configurable via the peaktype parameter:

  1. MACS2 – supports narrowPeak and broadPeak modes.
  2. SEACR – supports stringent and relaxed thresholds, for both normalized and non-normalized datasets.
  3. GoPeaks – optimized for CUT&RUN and CUT&Tag data; recommended for most applications.

✅ Recommendation: Use GoPeaks for its superior signal detection in sparse chromatin accessibility datasets.

Example configuration:

peaktype: "macs2_narrow, gopeaks_narrow"

MACS2 Control Option

Enable control sample usage for MACS2 to improve specificity:

macs2_control: "Y"

Optional Analysis Steps

Control execution of computationally intensive annotation steps:

run_rose: false              # ROSE super-enhancer analysis (set to true to enable)
run_go_enrichment: false     # ChIP-Enrich GO enrichment (set to true to enable)

⏱️ Performance Note: ROSE and GO enrichment are disabled by default due to their computational requirements. Enable them when you need super-enhancer identification or pathway enrichment analysis.

Pooled Controls

Control whether the pipeline pools control replicates for peak calling:

pool_controls: true

When enabled (true), CARLISLE runs peak calling in both modes:

  • Individual mode – Each treatment replicate is paired with its individual control replicate
  • Pooled mode – Each treatment replicate is compared against merged high-depth controls from all control replicates

This dual-mode analysis enables comparison of replicate-specific vs merged control strategies. Results are organized in separate individual/ and pooled/ subdirectories within peak calling outputs.

💡 Use Case: Pooled controls provide increased depth and reduced noise but may miss replicate-specific artifacts. Running both modes allows downstream selection of the most appropriate strategy.

⚠️ Note: If controls have no replicates to pool (each control has only 1 replicate), pooling will have no effect. Consider setting pool_controls: false in such cases.
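
To decide whether pooling can have any effect for a given dataset, you could count control replicates in the sample manifest. The sketch below assumes the manifest rows have already been read into dictionaries keyed by the manifest column names; the function name controls_worth_pooling is hypothetical.

```python
from collections import defaultdict


def controls_worth_pooling(rows):
    """Map each control name to True if it has more than one replicate."""
    replicates = defaultdict(set)
    for row in rows:
        if row["isControl"] == "Y":
            replicates[row["sampleName"]].add(row["replicateNumber"])
    return {name: len(reps) > 1 for name, reps in replicates.items()}


# Rows reduced to the columns this check needs.
rows = [
    {"sampleName": "53_H3K4me3", "replicateNumber": "1", "isControl": "N"},
    {"sampleName": "54_H3K4me3", "replicateNumber": "2", "isControl": "N"},
    {"sampleName": "HN6_IgG_rabbit_negative_control", "replicateNumber": "1", "isControl": "Y"},
]

poolable = controls_worth_pooling(rows)
print(poolable)
if not any(poolable.values()):
    print("No control has multiple replicates; pool_controls: false may be appropriate.")
```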

Quality Thresholds

Set peak-calling quality thresholds using the quality_thresholds parameter:

quality_thresholds: "0.1, 0.05, 0.01"

For reference, the tool-specific default thresholds are:

  • MACS2 q-value: 0.05
  • GoPeaks p-value: 0.05
  • SEACR FDR threshold: 1.0
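
Because dupstatus, peaktype, and quality_thresholds each accept a comma-separated list, the number of peak sets can grow quickly. The sketch below parses the three example values shown in this section and enumerates their combinations. It assumes that every combination is run, which is a simplification: the pipeline may not apply each threshold to each caller in the same way.

```python
from itertools import product


def split_list(value):
    """Split a comma-separated config string into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


# Example values copied from this section.
config = {
    "dupstatus": "dedup, no_dedup",
    "peaktype": "macs2_narrow, gopeaks_narrow",
    "quality_thresholds": "0.1, 0.05, 0.01",
}

combos = list(
    product(
        split_list(config["dupstatus"]),
        split_list(config["peaktype"]),
        split_list(config["quality_thresholds"]),
    )
)

print(f"{len(combos)} combinations per sample")
for dup, caller, threshold in combos:
    print(dup, caller, threshold)
```

With the values above this yields 2 x 2 x 3 = 12 combinations per sample, which is worth keeping in mind when estimating runtime and disk usage.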

Reference Files

CARLISLE includes comprehensive reference annotations for supported genomes:

Built-in Annotations

For each genome (hg38, hg19, hs1/T2T, mm10), the pipeline provides:

  • Gene annotations: TSS, gene bodies, promoters, intergenic regions (protein-coding and all genes)
  • Blacklisted regions: ENCODE DAC blacklists for artifact exclusion
  • cCREs (candidate cis-Regulatory Elements): From ENCODE SCREEN database
      • PLS – Promoter-like signatures
      • pELS – Proximal enhancer-like signatures
      • dELS – Distal enhancer-like signatures
      • CA-CTCF – CTCF-bound chromatin accessibility regions
      • CA-H3K4me3 – H3K4me3-marked chromatin accessibility (active promoters)
      • CA-TF – Transcription factor-bound chromatin accessibility
      • CA – General chromatin accessibility
      • TF – Transcription factor binding sites

These annotations are automatically used by HOMER, GO enrichment, and other annotation tools.

Custom Genomes

Additional reference genomes can be integrated by defining:

species_name:
  fa: "/path/to/species.fa"
  blacklist: "/path/to/blacklistbed/species.bed.gz"
  regions: "chr1 chr2 chr3"
  macs2_g: "hs" # genome shorthand for MACS2
  tss_bed: "/path/to/tss.bed.gz"
  # Add cCRE annotations if available
  ca_pls_bed: "/path/to/cCREs.PLS.bed.gz"
  ca_pels_bed: "/path/to/cCREs.pELS.bed.gz"
  ca_dels_bed: "/path/to/cCREs.dELS.bed.gz"

🧭 Best Practice: Store reference paths under a centralized /fdb or /data location on Biowulf to ensure accessibility and consistency across users.
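
Before adding a custom genome, a quick structural check of the entry can catch typos in key names. The sketch below works on an already-parsed entry (a dictionary). Treating fa, blacklist, regions, macs2_g, and tss_bed as required and the cCRE keys as optional is an assumption based on the template above, not a documented rule, and check_genome_entry is a hypothetical helper.

```python
# Assumed split between required and optional keys (not confirmed by the pipeline).
REQUIRED_KEYS = ("fa", "blacklist", "regions", "macs2_g", "tss_bed")
OPTIONAL_KEYS = ("ca_pls_bed", "ca_pels_bed", "ca_dels_bed")


def check_genome_entry(entry):
    """Return a list of problems found in one custom genome entry."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if not entry.get(key)]
    unknown = set(entry) - set(REQUIRED_KEYS) - set(OPTIONAL_KEYS)
    problems += [f"unrecognized key: {key}" for key in sorted(unknown)]
    return problems


# Entry mirroring the template above.
species_entry = {
    "fa": "/path/to/species.fa",
    "blacklist": "/path/to/blacklistbed/species.bed.gz",
    "regions": "chr1 chr2 chr3",
    "macs2_g": "hs",
    "tss_bed": "/path/to/tss.bed.gz",
    "ca_pls_bed": "/path/to/cCREs.PLS.bed.gz",
}

print(check_genome_entry(species_entry))
```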


Preparing Manifests

CARLISLE uses two manifests:

  • samplemanifest – required for all analyses.
  • contrasts – optional; needed only for differential analysis with DESeq2.

Sample Manifest (Required)

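The sample manifest defines sample-level metadata: sample names, replicate numbers, control assignments, and FASTQ paths. In the example below, the control row (isControl = Y) leaves controlName and controlReplicateNumber empty.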

sampleName                       replicateNumber  isControl  controlName                      controlReplicateNumber  path_to_R1                                      path_to_R2
53_H3K4me3                       1                N          HN6_IgG_rabbit_negative_control  1                       /53_H3K4me3_1.R1.fastq.gz                       /53_H3K4me3_1.R2.fastq.gz
54_H3K4me3                       2                N          HN6_IgG_rabbit_negative_control  1                       /54_H3K4me3_1.R1.fastq.gz                       /54_H3K4me3_1.R2.fastq.gz
HN6_IgG_rabbit_negative_control  1                Y                                                                   /HN6_IgG_rabbit_negative_control_1.R1.fastq.gz  /HN6_IgG_rabbit_negative_control_2.R2.fastq.gz
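
A manifest can be sanity-checked before a run with a short script. The sketch below assumes the manifest is tab-delimited with exactly the seven columns shown above, and it verifies that every non-control sample points at a control row that actually exists. The helper name check_manifest and the specific checks are illustrative and are not part of CARLISLE.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "sampleName", "replicateNumber", "isControl", "controlName",
    "controlReplicateNumber", "path_to_R1", "path_to_R2",
]


def check_manifest(handle):
    """Return a list of problems found in a tab-delimited sample manifest."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]
    rows = list(reader)
    # Controls are identified by (sampleName, replicateNumber).
    controls = {(r["sampleName"], r["replicateNumber"]) for r in rows if r["isControl"] == "Y"}
    problems = []
    for r in rows:
        label = f'{r["sampleName"]} rep {r["replicateNumber"]}'
        if r["isControl"] not in ("Y", "N"):
            problems.append(f"{label}: isControl must be Y or N")
        elif r["isControl"] == "N" and (r["controlName"], r["controlReplicateNumber"]) not in controls:
            problems.append(f"{label}: control not found in manifest")
        if not r["path_to_R1"]:
            problems.append(f"{label}: path_to_R1 is empty")
    return problems


# Two rows from the example above, joined with tabs (assumed delimiter).
manifest_text = "\n".join([
    "\t".join(EXPECTED_COLUMNS),
    "\t".join(["53_H3K4me3", "1", "N", "HN6_IgG_rabbit_negative_control", "1",
               "/53_H3K4me3_1.R1.fastq.gz", "/53_H3K4me3_1.R2.fastq.gz"]),
    "\t".join(["HN6_IgG_rabbit_negative_control", "1", "Y", "", "",
               "/HN6_IgG_rabbit_negative_control_1.R1.fastq.gz",
               "/HN6_IgG_rabbit_negative_control_2.R2.fastq.gz"]),
])

print(check_manifest(io.StringIO(manifest_text)))
```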

Contrast Manifest (Optional)

Specifies conditions for differential analysis:

condition1 condition2
MOC1_siSmyd3_2m_25_HCHO MOC1_siNC_2m_25_HCHO

📊 Requirement: Each condition must have at least two biological replicates to perform DESeq2-based differential analysis.
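
The replicate requirement can be checked ahead of time with a sketch like the one below. It assumes that the condition names in the contrast manifest match sampleName values in the sample manifest, which is an assumption about how the two files relate; the function name and the replicate data are made up for illustration.

```python
from collections import Counter


def conditions_lacking_replicates(samples, contrasts, minimum=2):
    """Return contrast conditions with fewer than `minimum` distinct replicates."""
    # Count distinct replicate numbers per sample name.
    replicate_counts = Counter(name for name, _rep in set(samples))
    conditions = {condition for pair in contrasts for condition in pair}
    return sorted(c for c in conditions if replicate_counts[c] < minimum)


# (sampleName, replicateNumber) pairs; the replicate layout here is hypothetical.
samples = [
    ("MOC1_siSmyd3_2m_25_HCHO", "1"),
    ("MOC1_siSmyd3_2m_25_HCHO", "2"),
    ("MOC1_siNC_2m_25_HCHO", "1"),
]
contrasts = [("MOC1_siSmyd3_2m_25_HCHO", "MOC1_siNC_2m_25_HCHO")]

print(conditions_lacking_replicates(samples, contrasts))
```

Any condition listed in the output would need additional replicates before it can be used in a contrast.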