renee build¶
1. About¶
The renee executable is composed of several inter-related sub commands. Please see renee -h for all available options.
This part of the documentation describes options and concepts for the renee build sub command in more detail. With minimal configuration, the build sub command enables you to build new reference files for the renee run pipeline.
Setting up the RENEE build pipeline is fast and easy! In its most basic form, renee build only has five required inputs.
2. Synopsis¶
$ renee build [--help] \
[--shared-resources SHARED_RESOURCES] [--small-genome] \
[--dry-run] [--singularity-cache SINGULARITY_CACHE] \
[--sif-cache SIF_CACHE] [--tmp-dir TMP_DIR] \
--ref-fa REF_FA \
--ref-name REF_NAME \
--ref-gtf REF_GTF \
--gtf-ver GTF_VER \
--output OUTPUT
The synopsis for each command shows its parameters and their usage. Optional parameters are shown in square brackets.
A user must provide the genomic sequence of the reference's assembly in FASTA format via the --ref-fa argument, an alias for the reference genome via the --ref-name argument, a gene annotation for the reference assembly via the --ref-gtf argument, an alias or version for the gene annotation via the --gtf-ver argument, and an output directory to store the built reference files via the --output argument. If you are running the pipeline outside of Biowulf, you will additionally need to provide the following options: --shared-resources, --tmp-dir. More information about each of these options can be found below.
For human and mouse data, we highly recommend downloading the latest available PRI genome assembly and corresponding gene annotation from GENCODE. These reference files contain chromosome and scaffold sequences.
The build pipeline will generate a JSON file containing key-value pairs that point to the reference files required by the renee run pipeline. This file will be located in the path provided to --output. The name of this JSON file depends on the values provided to --ref-name and --gtf-ver and follows this naming convention: {OUTPUT}/{REF_NAME}_{GTF_VER}.json. Once the build pipeline completes, this reference JSON file can be passed to the --genome option of renee run. This is how new references are built for the RENEE pipeline.
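As a minimal sketch of the naming convention above, the shell function below prints the path where the reference JSON should appear. The function name ref_json_path is hypothetical and is not part of RENEE; only the {OUTPUT}/{REF_NAME}_{GTF_VER}.json pattern comes from this documentation.

```shell
# Print the expected reference JSON path: {OUTPUT}/{REF_NAME}_{GTF_VER}.json
# ref_json_path is a hypothetical helper, not a RENEE command.
ref_json_path() {
    output="${1%/}"    # drop one trailing slash from OUTPUT, if present
    ref_name="$2"
    gtf_ver="$3"
    printf '%s/%s_%s.json\n' "$output" "$ref_name" "$gtf_ver"
}

ref_json_path /data/refs/hg38_36/ hg38 36
# prints /data/refs/hg38_36/hg38_36.json
```

The printed path is the value you would pass to the --genome option of renee run once the build has completed.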
You can always use the -h option for information on a specific command.
2.1 Required Arguments¶
Each of the following arguments is required. Failure to provide a required argument will result in a non-zero exit-code.
--ref-fa REF_FA
Genomic FASTA file of the reference genome.
type: file
This file represents the genome sequence of the reference assembly in FASTA format. If you are downloading this from GENCODE, you should select the PRI genomic FASTA file. This file will contain the primary genomic assembly (contains chromosomes and scaffolds). This input file should not be compressed. Sequence identifiers in this file must match the sequence identifiers in the GTF file provided to --ref-gtf.
Example: --ref-fa GRCh38.primary_assembly.genome.fa
--ref-name REF_NAME
Name of the reference genome.
type: string
Name or alias for the reference genome. This can be the common name for the reference genome. Here is a list of common examples for different model organisms: mm10, hg38, rn6, danRer11, dm6, canFam3, sacCer3, ce11. If the provided value contains one of the following sub-strings (hg19, hs37d, grch37, hg38, hs38d, grch38, mm10, grcm38), then Arriba will run with its corresponding blacklist.
Example: --ref-name hg38
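To see ahead of time whether a chosen alias would select an Arriba blacklist, the sub-string rule above can be sketched as a shell case statement. This is an illustration only: the function name is hypothetical, and the documentation does not say whether RENEE's own matching is case-sensitive, so lowercasing the value first is an assumption of this sketch.

```shell
# Return 0 if the alias contains one of the documented sub-strings.
# matches_arriba_blacklist is a hypothetical helper, not part of RENEE.
# Assumption: the comparison is made on a lowercased copy of the alias.
matches_arriba_blacklist() {
    name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$name" in
        *hg19*|*hs37d*|*grch37*|*hg38*|*hs38d*|*grch38*|*mm10*|*grcm38*)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

if matches_arriba_blacklist "hg38"; then
    echo "hg38: contains a listed sub-string"
fi
if ! matches_arriba_blacklist "mm39"; then
    echo "mm39: contains none of the listed sub-strings"
fi
```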
--ref-gtf REF_GTF
Gene annotation or GTF file for the reference genome.
type: file
This file represents the reference genome's gene annotation in GTF format. If you are downloading this from GENCODE, you should select the 'PRI' GTF file. This file contains gene annotations for the primary assembly (contains chromosomes and scaffolds). This input file should not be compressed. Sequence identifiers (column 1) in this file must match the sequence identifiers in the FASTA file provided to --ref-fa.
Example: --ref-gtf gencode.v36.primary_assembly.annotation.gtf
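Because the sequence identifiers in the FASTA and GTF files must agree, it can be worth comparing them before starting a build. The following is a minimal sketch, not something RENEE runs itself: the function name is hypothetical, and it assumes uncompressed inputs where the FASTA identifier is the text between '>' and the first whitespace.

```shell
# Print every GTF sequence identifier (column 1) that has no matching
# FASTA header. gtf_ids_missing_from_fasta is a hypothetical helper.
# Usage: gtf_ids_missing_from_fasta genome.fa annotation.gtf
gtf_ids_missing_from_fasta() {
    awk -F'\t' '
        FILENAME == ARGV[1] {
            # FASTA pass: remember each identifier from the header lines
            if (/^>/) {
                id = substr($0, 2)
                sub(/[ \t].*/, "", id)
                seen[id] = 1
            }
            next
        }
        /^#/ || NF == 0 { next }                   # skip GTF comments and blank lines
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$1" "$2"
}
```

An empty result means every identifier used in the GTF was found in the FASTA file. Any identifier that is printed would need to be renamed or removed before running renee build.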
--gtf-ver GTF_VER
Version of the gene annotation or GTF file provided.
type: string or int
This is the version of the supplied gene annotation or GTF file. If you are using a GTF file from GENCODE, use the release number or version (e.g. M25 for mouse or 37 for human). Visit gencodegenes.org for more details.
Example: --gtf-ver 36
--output OUTPUT
Path to an output directory.
type: path
This location is where the build pipeline will create all of its output files. If the user-provided working directory has not been initialized, it will automatically be created. Note: by default, any files in config, resources, or workflow in the output directory may be overwritten by renee build.
Example: --output /data/$USER/refs/hg38_36/
2.2 Build Options¶
Each of the following arguments is optional and does not need to be provided. If you are running the pipeline outside of Biowulf, the --shared-resources option needs to be provided to the build sub command at least once. This will ensure reference files that are shared across different genomes are downloaded locally.
--shared-resources SHARED_RESOURCES
Local path to shared resources.
type: path
The pipeline uses a set of shared reference files that can be re-used across reference genomes. These currently include reference files for kraken and FQScreen. These reference files can be downloaded with the build sub command's --shared-resources option. These files only need to be downloaded once. We recommend storing these files in a shared location on the filesystem that other people can access. If you are running the pipeline on Biowulf, you do NOT need to download these reference files! They already exist on the filesystem in a location that anyone can access; however, if you are running the pipeline on another cluster or target system, you will need to download the shared resources with the build sub command, and you will need to provide this option every time you run the pipeline. Please provide the same path that was provided to the build sub command's --shared-resources option. Again, if you are running the pipeline on Biowulf, you do NOT need to provide this option. For more information about how to download shared resources, please reference the build sub command's --shared-resources option.
Example: --shared-resources /data/shared/renee
--small-genome
Builds a small genome index.
type: boolean
For small genomes, it is recommended to run STAR with a scaled down --genomeSAindexNbases value. This option runs the build pipeline in a mode where it dynamically finds the optimal value for this option using the following formula: min(14, log2(GenomeSize)/2 - 1). Generally speaking, this option is not really applicable for most mammalian reference genomes, i.e. human and mouse; however, researchers working with very small reference genomes, like S. cerevisiae (~12 Mb), should provide this option. When in doubt, feel free to provide this option, as the optimal value will be found based on your input.
Example: --small-genome
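To get a feel for the formula min(14, log2(GenomeSize)/2 - 1), the sketch below evaluates it with awk. Both function names are hypothetical, and rounding the result down to an integer is an assumption of this sketch; the documentation does not state how RENEE rounds the value internally.

```shell
# Count sequence bases in an uncompressed FASTA file (header lines excluded).
# genome_size is a hypothetical helper, not part of RENEE.
genome_size() {
    awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}

# Evaluate min(14, log2(GenomeSize)/2 - 1) for a genome size in bases.
# Assumption: the result is truncated to an integer.
sa_index_nbases() {
    awk -v n="$1" 'BEGIN {
        if (n < 1) exit 1                  # log() is undefined for empty genomes
        v = int(log(n) / log(2) / 2 - 1)   # log2(n) computed as ln(n) / ln(2)
        if (v > 14) v = 14
        print v
    }'
}

sa_index_nbases 12000000      # S. cerevisiae-sized genome (~12 Mb), prints 10
sa_index_nbases 3100000000    # human-sized genome (~3.1 Gb), prints 14
```

For a real reference, the two helpers can be combined as sa_index_nbases "$(genome_size genome.fa)", where genome.fa is a placeholder file name.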
2.3 Orchestration Options¶
--dry-run
Dry run the build pipeline.
type: boolean
Displays what steps in the build pipeline remain or will be run. Does not execute anything!
Example: --dry-run
--singularity-cache SINGULARITY_CACHE
Overrides the $SINGULARITY_CACHEDIR environment variable.
type: path
default: --output OUTPUT/.singularity
Singularity will cache image layers pulled from remote registries. This ultimately speeds up the process of pulling an image from DockerHub if an image layer already exists in the singularity cache directory. By default, the cache is set to the value provided to the --output argument. Please note that this cache cannot be shared across users. Singularity strictly enforces that you own the cache directory and will return a non-zero exit code if you do not own the cache directory! See the --sif-cache option to create a shareable resource.
Example: --singularity-cache /data/$USER/.singularity
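Since Singularity requires that you own the cache directory, a quick ownership check before launching a build can save a failed run. This is a minimal sketch with a hypothetical function name; it assumes a shell whose test builtin supports the -O operator, as bash does.

```shell
# Return 0 only if DIR exists, is owned by the current user, and is writable.
# cache_is_usable is a hypothetical helper, not part of RENEE or Singularity.
cache_is_usable() {
    dir="$1"
    [ -d "$dir" ] && [ -O "$dir" ] && [ -w "$dir" ]
}

# Hypothetical usage:
#   cache_is_usable "/data/$USER/.singularity" || echo "pick another cache path"
```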
--sif-cache SIF_CACHE
Path where a local cache of SIFs is stored.
type: path
Uses a local cache of SIFs on the filesystem. This SIF cache can be shared across users if permissions are set correctly. If a SIF does not exist in the SIF cache, the image will be pulled from DockerHub and a warning message will be displayed. The renee cache subcommand can be used to create a local SIF cache. Please see renee cache for more information. This command is extremely useful for avoiding DockerHub pull rate limits. It also removes any potential errors that could occur due to network issues or DockerHub being temporarily unavailable. We recommend running RENEE with this option whenever possible.
Example: --sif-cache /data/$USER/SIFs
--tmp-dir TMP_DIR
Path on the file system for writing temporary files.
type: path
default: /lscratch/$SLURM_JOBID
Path on the file system for writing temporary output files. By default, the temporary directory is set to '/lscratch/$SLURM_JOBID' on NIH's Biowulf cluster and 'OUTPUT' on the FRCE cluster. However, if you are running the pipeline on another cluster, this option will need to be specified. Ideally, this path should point to a dedicated location on the filesystem for writing tmp files. On many systems, this location is set to somewhere in /scratch. If you need to inject a variable into this string that should NOT be expanded, please quote this option's value in single quotes.
Example: --tmp-dir /cluster_scratch/$USER/
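The difference between single and double quotes can be seen in a short shell demonstration. The job ID value below is made up for illustration; on a cluster, the scheduler sets it inside the job.

```shell
# Single quotes keep $SLURM_JOBID literal; double quotes expand it at once.
SLURM_JOBID=12345                    # hypothetical value, for demonstration only
literal='/lscratch/$SLURM_JOBID'     # stays unexpanded in the string
expanded="/lscratch/$SLURM_JOBID"    # expanded by the current shell

printf '%s\n' "$literal"     # prints /lscratch/$SLURM_JOBID
printf '%s\n' "$expanded"    # prints /lscratch/12345
```

Passing the single-quoted form, for example --tmp-dir '/lscratch/$SLURM_JOBID', hands the unexpanded string to the pipeline instead of the value the variable has in your current shell.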
2.4 Misc Options¶
Each of the following arguments is optional and does not need to be provided.
-h, --help
Display Help.
type: boolean
Shows the command's synopsis, help message, and an example command.
Example: --help
3. Hybrid Genomes¶
If you have two GTF files, e.g. hybrid genomes (host + virus), then you need to create one genomic FASTA file and one GTF file for the hybrid genome prior to running the renee build command.
We recommend creating an artificial chromosome for the non-host sequence. The sequence identifier in the FASTA file must match the sequence identifier in the GTF file (column 1). Generally speaking, since the host annotation is usually downloaded from Ensembl or GENCODE, it will be correctly formatted; however, that may not be the case for the non-host sequence!
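One way to prepare the combined inputs is to concatenate the host and non-host files. The sketch below is an illustration with hypothetical function and file names. It uses awk '1' instead of cat so that a file missing its final newline cannot run into the first line of the next file.

```shell
# Concatenate text files line by line into OUT.
# Usage: concat_records OUT FILE...
# concat_records is a hypothetical helper, not part of RENEE.
concat_records() {
    out="$1"
    shift
    # awk '1' prints every input line followed by a newline
    awk '1' "$@" > "$out"
}

# Hypothetical usage with host and non-host inputs:
#   concat_records hybrid.fa  host.fa  probe.fa
#   concat_records hybrid.gtf host.gtf probe.gtf
```

The resulting hybrid FASTA and GTF files are then the values given to --ref-fa and --ref-gtf.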
Please ensure the non-host annotation contains the following features and/or constraints:
- for a given gene feature
  - each gene entry has at least one transcript feature
  - and each transcript entry has at least one exon feature
  - gene_id, gene_name and gene_biotype are required
- for a given transcript feature
  - along with gene_id, gene_name and gene_biotype, transcript_id is also required
- for a given exon feature
  - gene_id, gene_name, gene_biotype and transcript_id are required
If not, the GTF file may need to be manually curated until these conditions are satisfied.
Here is an example feature from a hand-curated Biotyn_probe GTF file:
Biot1 BiotynProbe gene 1 21 0.000000 + . gene_id "Biot1"; gene_name "Biot1"; gene_biotype "biotynlated_probe_control";
Biot1 BiotynProbe transcript 1 21 0.000000 + . gene_id "Biot1"; gene_name "Biot1"; gene_biotype "biotynlated_probe_control"; transcript_id "Biot1"; transcript_name "Biot1"; transcript_type "biotynlated_probe_control";
Biot1 BiotynProbe exon 1 21 0.000000 + . gene_id "Biot1"; gene_biotype "biotynlated_probe_control"; transcript_id "Biot1"; transcript_type "biotynlated_probe_control";
In the tab-delimited example above,
- line 1: the gene feature has 3 required attributes in column 9: gene_id, gene_name and gene_biotype
- line 2: the transcript entry for the above gene repeats the same attributes with the following required fields: transcript_id and transcript_name
  - Please note: transcript_type is optional
- line 3: the exon entry for the above transcript has 3 required attributes: gene_id, transcript_id and gene_biotype
  - Please note: transcript_type is optional
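A hand-curated GTF can be screened for missing attributes with a short awk check. This is a sketch under stated assumptions, not RENEE's own validation, and the function name is hypothetical. It expects a tab-delimited file with attributes written as key "value"; in column 9. It does not verify that every gene has a transcript or that every transcript has an exon. For exon rows it checks gene_id, gene_biotype and transcript_id only, since the example exon row carries no gene_name; add gene_name to the exon rule if your annotation should follow the stricter list.

```shell
# Report gene, transcript and exon rows that lack a required attribute.
# Returns non-zero if any row fails. check_gtf_attributes is hypothetical.
# Usage: check_gtf_attributes annotation.gtf
check_gtf_attributes() {
    awk -F'\t' '
        # Flag the current row if column 9 has no attribute named key
        function need(key) {
            if ($9 !~ ("(^|[; ])" key " \"")) {
                printf "line %d: %s feature is missing %s\n", FNR, $3, key
                bad = 1
            }
        }
        /^#/ || NF < 9 { next }     # skip comments and malformed rows
        $3 == "gene" {
            need("gene_id"); need("gene_name"); need("gene_biotype")
        }
        $3 == "transcript" {
            need("gene_id"); need("gene_name"); need("gene_biotype")
            need("transcript_id")
        }
        $3 == "exon" {
            need("gene_id"); need("gene_biotype"); need("transcript_id")
        }
        END { exit bad + 0 }
    ' "$1"
}
```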
For a given gene, the combination of the gene_id AND gene_name should form a unique string. There should be no instances where two different genes share the same gene_id AND gene_name.
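The uniqueness rule can also be checked mechanically. The sketch below is an illustration with a hypothetical function name. It assumes a tab-delimited GTF with attributes written as key "value"; and treats each row whose feature type is gene as one gene. It reports every gene_id and gene_name pair that appears on more than one gene row.

```shell
# Report gene_id/gene_name pairs shared by more than one gene row.
# Returns non-zero if any duplicate is found. find_duplicate_genes is hypothetical.
# Usage: find_duplicate_genes annotation.gtf
find_duplicate_genes() {
    awk -F'\t' '
        # Return the quoted value of attribute key from column 9, or ""
        function attr(key,    m) {
            if (!match($9, key " \"[^\"]*\"")) return ""
            m = substr($9, RSTART, RLENGTH)
            sub(/^[^"]*"/, "", m)    # drop the key and opening quote
            sub(/"$/, "", m)         # drop the closing quote
            return m
        }
        /^#/ || NF < 9 || $3 != "gene" { next }
        {
            id = attr("gene_id")
            name = attr("gene_name")
            # Report each duplicated pair once, on its second occurrence
            if (++count[id SUBSEP name] == 2) {
                printf "duplicate gene_id/gene_name pair: %s / %s\n", id, name
                dup = 1
            }
        }
        END { exit dup + 0 }
    ' "$1"
}
```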
4. Convert NCBI GFF3 to GTF format¶
It is worth noting that RENEE comes bundled with a script to convert GFF3 files downloaded from NCBI to GTF file format. This convenience script is useful as the renee build sub command takes a GTF file as one of its inputs.
Please note that this script has only been tested with GFF3 files downloaded from NCBI, and it is not recommended to use with GFF3 files originating from other sources like Ensembl or GENCODE. If you are selecting an annotation from Ensembl or GENCODE, please download the GTF file option.
The only dependency of the script is the python package argparse, which comes bundled with the following python2/3 distributions: python>=2.7.18 or python>=3.2. If argparse is not installed, it can be downloaded with pip by running the following command:
pip install --upgrade pip
pip install argparse
For more information about the script and its usage, please run:
./resources/gff3togtf.py -h
5. Example¶
5.1 Biowulf¶
On Biowulf, getting started with the pipeline is fast and easy! In this example, we build a mouse reference genome.
# Step 0.) Grab an interactive node (do not run on head node)
srun -N 1 -n 1 --time=2:00:00 -p interactive --mem=8gb --cpus-per-task=4 --pty bash
module purge
module load ccbrpipeliner
# Step 1.) Dry run the Build pipeline
renee build --ref-fa GRCm39.primary_assembly.genome.fa \
--ref-name mm39 \
--ref-gtf gencode.vM26.annotation.gtf \
--gtf-ver M26 \
--output /data/$USER/refs/mm39_M26 \
--sif-cache /data/CCBR_Pipeliner/SIFs/ \
--dry-run
# Step 2.) Build new RENEE reference files
renee build --ref-fa GRCm39.primary_assembly.genome.fa \
--ref-name mm39 \
--ref-gtf gencode.vM26.annotation.gtf \
--gtf-ver M26 \
--output /data/$USER/refs/mm39_M26 \
--sif-cache /data/CCBR_Pipeliner/SIFs/
5.2 Generic SLURM Cluster¶
Running the pipeline outside of Biowulf is easy; however, there are a few extra options you must provide. Please note when running the build sub command for the first time, you will also need to provide the --shared-resources option. This option will download our kraken2 database and bowtie2 indices for FastQ Screen. The path provided to this option should be provided to the --shared-resources option of the run sub command. Next, you will also need to provide a path to write temporary output files via the --tmp-dir option. We also recommend providing a path to a SIF cache. You can cache software containers locally with the cache sub command.
# Step 0.) Grab an interactive node (do not run on head node)
srun -N 1 -n 1 --time=2:00:00 -p interactive --mem=8gb --cpus-per-task=4 --pty bash
# Add snakemake and singularity to $PATH,
# This step may vary across clusters, you
# can reach out to a sys admin if snakemake
# and singularity are not installed.
module purge
# Replace the following:
# module load ccbrpipeliner
# with module load statements that load
# python >= 3.7,
# snakemake, and
# singularity
# before running renee
# Also, ensure that the `renee` executable is in PATH
# Step 1.) Dry run the Build pipeline
renee build --ref-fa GRCm39.primary_assembly.genome.fa \
--ref-name mm39 \
--ref-gtf gencode.vM26.annotation.gtf \
--gtf-ver M26 \
--output /data/$USER/refs/mm39_M26 \
--shared-resources /data/shared/renee \
--tmp-dir /cluster_scratch/$USER/ \
--sif-cache /data/$USER/cache \
--dry-run
# Step 2.) Build new RENEE reference files
renee build --ref-fa GRCm39.primary_assembly.genome.fa \
--ref-name mm39 \
--ref-gtf gencode.vM26.annotation.gtf \
--gtf-ver M26 \
--output /data/$USER/refs/mm39_M26 \
--shared-resources /data/shared/renee \
--tmp-dir /cluster_scratch/$USER/ \
--sif-cache /data/$USER/cache