pipeline.util

Pipeline utility functions

Functions

Name Description
chmod_bins_exec Ensure that all files in bin/ are executable.
err Prints any provided args to standard error.
exists Checks if file exists on the local filesystem.
fatal Prints any provided args to standard error and exits with an exit code of 1.
get_genomes_dict Get dictionary of genome annotation versions and the paths to the corresponding JSON files.
get_genomes_list Get list of genome annotations available for the current platform
get_hpcname Get the HPC name using scontrol
get_tmp_dir Get default temporary directory for biowulf and frce. Allow user override.
git_commit_hash Gets the git commit hash of the RNA-seek repo.
join_jsons Joins multiple JSON files into one data structure.
ln Creates symlinks for files to an output directory.
md5sum Gets md5checksum of a file in memory-safe manner.
permissions Checks permissions using os.access() to see whether the user is authorized to access a file/directory.
rename Dynamically renames FastQ file to have one of the following extensions: .R1.fastq.gz, .R2.fastq.gz
require Enforces an executable is in $PATH
safe_copy Private function: Given a list of paths, recursively copies each to the target location.
scontrol_show Run scontrol show config and parse the output as a dictionary
standard_input Checks for standard input when provided or permissions using permissions().
which Checks if an executable is in $PATH

chmod_bins_exec

pipeline.util.chmod_bins_exec(repo_base=repo_base)

Ensure that all files in bin/ are executable.

It appears that setuptools strips executable permissions from package_data files, and post-install scripts are not possible with the pyproject.toml format. Without this workaround, Nextflow processes that call scripts in bin/ fail.

https://stackoverflow.com/questions/18409296/package-data-files-with-executable-permissions
https://github.com/pypa/setuptools/issues/2041
https://stackoverflow.com/questions/76320274/post-install-script-for-pyproject-toml-projects
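The workaround described above amounts to adding execute bits to every file in bin/. A minimal sketch of that idea follows; `make_executable` is a hypothetical helper written for illustration, not the function's actual implementation, and it assumes a POSIX filesystem.

```python
import stat
from pathlib import Path


def make_executable(bin_dir):
    """Add execute permission bits to every regular file in bin_dir.

    Hypothetical helper for illustration only.
    """
    changed = []
    for path in sorted(Path(bin_dir).iterdir()):
        if path.is_file():
            mode = path.stat().st_mode
            # Keep the existing bits and add execute for user, group and other
            path.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
            changed.append(path)
    return changed
```

Using a bitwise OR preserves whatever read and write bits the files already carry, so the call is safe to repeat.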

err

pipeline.util.err(*message, **kwargs)

Prints any provided args to standard error. Keyword arguments can be provided to modify the print function's behavior.
@param message : Values printed to standard error
@param kwargs <print()>: Keywords that modify the print function's behavior

exists

pipeline.util.exists(testpath)

Checks if a file or directory exists on the local filesystem.
@param testpath : Name of file/directory to check
@return does_exist : True when the file/directory exists, False when it does not

fatal

pipeline.util.fatal(*message, **kwargs)

Prints any provided args to standard error and exits with an exit code of 1.
@param message : Values printed to standard error
@param kwargs <print()>: Keywords that modify the print function's behavior
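The relationship between err and fatal can be sketched as follows. This is an assumption about how the two fit together based on the descriptions above, not the package's source: both forward their arguments to print(), and fatal additionally exits with status 1.

```python
import sys


def err(*message, **kwargs):
    """Print to standard error; kwargs are forwarded to print()."""
    print(*message, file=sys.stderr, **kwargs)


def fatal(*message, **kwargs):
    """Print to standard error, then exit with status 1."""
    err(*message, **kwargs)
    sys.exit(1)
```

Because sys.exit() raises SystemExit, callers can still intercept a fatal error in tests with a try/except block.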

get_genomes_dict

pipeline.util.get_genomes_dict(
    repo_base,
    hpcname=get_hpcname(),
    error_on_warnings=False,
)

Get dictionary of genome annotation versions and the paths to the corresponding JSON files.

Parameters

Name Type Description Default
repo_base function Function for getting the base directory of the repository. required
hpcname str Name of the HPC. Defaults to the value returned by get_hpcname(). get_hpcname()
error_on_warnings bool Flag to indicate whether to raise warnings as errors. Defaults to False. False

Returns: genomes_dict (dict): A dictionary containing genome names as keys and corresponding JSON file paths as values. { genome_name: json_file_path }

get_genomes_list

pipeline.util.get_genomes_list(
    repo_base,
    hpcname=get_hpcname(),
    error_on_warnings=False,
)

Get list of genome annotations available for the current platform

Parameters

Name Type Description Default
repo_base str The base directory of the repository required
hpcname str The name of the HPC. Defaults to the value returned by get_hpcname(). get_hpcname()
error_on_warnings bool Whether to raise an error on warnings. Defaults to False. False

Returns: genomes (list): A sorted list of genome annotations available for the current platform

get_hpcname

pipeline.util.get_hpcname()

Get the HPC name using scontrol

Returns

Name Type Description
hpcname str The HPC name (biowulf, frce, or an empty string)

get_tmp_dir

pipeline.util.get_tmp_dir(tmp_dir, outdir, hpc=get_hpcname())

Get default temporary directory for biowulf and frce. Allow user override.

Parameters

Name Type Description Default
tmp_dir str User-defined temporary directory path. If provided, this path will be used as the temporary directory. required
outdir str Output directory path. required
hpc str HPC name. Defaults to the value returned by get_hpcname(). get_hpcname()

Returns: tmp_dir (str): The default temporary directory path based on the HPC name and user-defined path.

git_commit_hash

pipeline.util.git_commit_hash(repo_path)

Gets the git commit hash of the RNA-seek repo.
@param repo_path : Path to RNA-seek git repo
@return githash : Latest git commit hash

join_jsons

pipeline.util.join_jsons(templates)

Joins multiple JSON files into one data structure. Used to join multiple template JSON files to create a global config dictionary.
@param templates <list[str]>: List of template JSON files to join together
@return aggregated : Dictionary containing the contents of all the input JSON files
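A minimal sketch of this kind of aggregation is shown below. `join_json_files` is a hypothetical name, and the sketch assumes that each file holds a JSON object and that later files win when keys clash; the real function may resolve clashes differently.

```python
import json


def join_json_files(paths):
    """Merge the top-level keys of several JSON files into one dict.

    Assumption: each file contains a JSON object, and later files
    override earlier ones on duplicate keys.
    """
    aggregated = {}
    for path in paths:
        with open(path) as fh:
            aggregated.update(json.load(fh))
    return aggregated
```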

ln

pipeline.util.ln(files, outdir)

Creates symlinks for files in an output directory.
@param files list[]: List of filenames
@param outdir : Output directory in which to create the symlinks
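The symlinking step can be sketched with the standard library as follows. `symlink_into` is a hypothetical helper; creating the output directory on demand and skipping links that already exist are assumptions, not documented behavior.

```python
import os


def symlink_into(files, outdir):
    """Create a symlink in outdir for each file, named after its basename."""
    os.makedirs(outdir, exist_ok=True)
    links = []
    for f in files:
        src = os.path.abspath(f)
        dst = os.path.join(outdir, os.path.basename(f))
        # lexists() is also True for dangling links, so nothing is clobbered
        if not os.path.lexists(dst):
            os.symlink(src, dst)
        links.append(dst)
    return links
```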

md5sum

pipeline.util.md5sum(filename, first_block_only=False, blocksize=65536)

Gets the MD5 checksum of a file in a memory-safe manner. The file is read in blocks/chunks defined by the blocksize parameter. This is a safer alternative to reading the entire file into memory when the file is very large.
@param filename : Input file on the local filesystem to checksum
@param first_block_only : Calculate the MD5 checksum of the first block/chunk only
@param blocksize : Number of bytes read per chunk, to limit memory use
@return hasher.hexdigest() : MD5 checksum of the file's contents
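Block-wise hashing as described above can be sketched with hashlib. `md5_in_blocks` is a hypothetical name that mirrors the documented parameters; it is an illustration of the technique, not the function's source.

```python
import hashlib


def md5_in_blocks(filename, first_block_only=False, blocksize=65536):
    """Hash a file chunk by chunk so only one block is in memory at a time."""
    hasher = hashlib.md5()
    with open(filename, "rb") as fh:
        # iter() with a sentinel keeps calling read() until it returns b""
        for block in iter(lambda: fh.read(blocksize), b""):
            hasher.update(block)
            if first_block_only:
                break
    return hasher.hexdigest()
```

Hashing only the first block is a cheap way to fingerprint very large files, at the cost of not detecting differences beyond the first blocksize bytes.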

permissions

pipeline.util.permissions(parser, path, *args, **kwargs)

Checks permissions using os.access() to see whether the user is authorized to access a file/directory. Checks for existence, readability, writability and executability via: os.F_OK (tests existence), os.R_OK (tests read), os.W_OK (tests write), os.X_OK (tests exec).
@param parser <argparse.ArgumentParser() object>: Argparse parser object
@param path : Name of path to check
@return path : Absolute path, returned if it exists and permissions are correct
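The os.access() checks listed above can be sketched as follows. `check_access` is a hypothetical helper that raises exceptions on failure; how the real function reports failures through the argparse parser is not shown here, so that part is an assumption.

```python
import os


def check_access(path, mode=os.R_OK):
    """Return the absolute path if it exists and passes the access check.

    mode can combine flags, for example os.R_OK | os.W_OK.
    """
    if not os.access(path, os.F_OK):
        raise FileNotFoundError(f"{path} does not exist")
    if not os.access(path, mode):
        raise PermissionError(f"{path} is not accessible with mode {mode}")
    return os.path.abspath(path)
```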

rename

pipeline.util.rename(filename)

Dynamically renames a FastQ file to have one of the following extensions: .R1.fastq.gz, .R2.fastq.gz. To automatically rename the FastQ files, a few assumptions are made. If the extension of the FastQ file cannot be inferred, an exception is raised telling the user to fix the filename of the FastQ files.
@param filename : Original name of file to be renamed
@return filename : A renamed FastQ filename
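One way to approach this kind of extension normalization is with a short list of regular expressions. The patterns below are hypothetical examples of common FastQ naming schemes; the assumptions the real function makes are not listed in this reference, so treat this as an illustration only.

```python
import re

# Hypothetical patterns, chosen for illustration; not the package's list
PATTERNS = [
    # sample_R1_001.fastq.gz, sample.R1.fastq.gz, sample_R2.fq.gz
    (re.compile(r"[._]R([12])(_\d+)?\.f(ast)?q\.gz$"), r".R\1.fastq.gz"),
    # sample_1.fastq.gz, sample_2.fq.gz
    (re.compile(r"_([12])\.f(ast)?q\.gz$"), r".R\1.fastq.gz"),
]


def normalize_fastq_name(filename):
    """Return filename with a .R1.fastq.gz or .R2.fastq.gz extension."""
    for pattern, replacement in PATTERNS:
        if pattern.search(filename):
            return pattern.sub(replacement, filename)
    raise ValueError(f"Cannot infer the read number from {filename!r}")
```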

require

pipeline.util.require(cmds, suggestions, path=None)

Enforces that each executable is in $PATH.
@param cmds list[]: List of executable names to check
@param suggestions list[]: Name of the module to suggest loading for the executable at the same index in cmds
@param path list[]: Optional list of PATHs to check [default: $PATH]
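The pairing of cmds and suggestions by index can be sketched as below. `missing_executables` is a hypothetical helper that returns the missing tools alongside their suggested modules; what the real function does when something is missing (for example, exiting with an error) is not specified here.

```python
import os
import shutil


def missing_executables(cmds, suggestions, path=None):
    """Return (cmd, suggested_module) pairs for executables not found."""
    # shutil.which() expects a single os.pathsep-separated string
    search = os.pathsep.join(path) if path else None
    missing = []
    for cmd, module in zip(cmds, suggestions):
        if shutil.which(cmd, path=search) is None:
            missing.append((cmd, module))
    return missing
```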

safe_copy

pipeline.util.safe_copy(source, target, resources=[])

Private function: Given a list of paths, recursively copies each to the target location. If a target path already exists, its data is NOT overwritten.
@param source : Prefix path added to each resource
@param target : Target path to copy templates and required resources to
@param resources <list[str]>: List of paths to copy over to the target location
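The copy-without-overwrite behavior can be sketched as follows. `copy_without_overwrite` is a hypothetical helper; skipping at the level of each listed resource, rather than merging directory contents, is an assumption.

```python
import os
import shutil


def copy_without_overwrite(source, target, resources=()):
    """Copy each resource from source into target unless it already exists."""
    copied = []
    for resource in resources:
        src = os.path.join(source, resource)
        dst = os.path.join(target, resource)
        if os.path.exists(dst):
            continue  # never clobber existing data
        if os.path.isdir(src):
            shutil.copytree(src, dst)
        else:
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)
        copied.append(dst)
    return copied
```

The sketch uses an immutable tuple as the default for resources, which avoids the shared mutable default that a list literal would create.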

scontrol_show

pipeline.util.scontrol_show()

Run scontrol show config and parse the output as a dictionary

Returns

Name Type Description
scontrol_dict dict dictionary containing the output of scontrol show config
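Turning the command's output into a dictionary can be sketched as below. `parse_scontrol_config` is a hypothetical helper that works on text already captured from `scontrol show config`, and it assumes the usual `Key = Value` line layout; running the command itself requires a Slurm installation and is left out.

```python
def parse_scontrol_config(text):
    """Parse 'Key = Value' lines into a dict, ignoring lines without '='."""
    config = {}
    for line in text.splitlines():
        # Split on the first '=' only, so values may themselves contain '='
        key, sep, value = line.partition("=")
        if sep:
            config[key.strip()] = value.strip()
    return config
```

With a dictionary like this, a cluster name lookup reduces to something like `config.get("ClusterName", "")`, although which key get_hpcname reads is an assumption here.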

standard_input

pipeline.util.standard_input(parser, path, *args, **kwargs)

Checks for standard input when provided or permissions using permissions().
@param parser <argparse.ArgumentParser() object>: Argparse parser object
@param path : Name of path to check
@return path : Returned if the path exists and the user can read from it

which

pipeline.util.which(cmd, path=None)

Checks if an executable is in $PATH.
@param cmd : Name of executable to check
@param path : Optional list of PATHs to check [default: $PATH]
@return : True if the executable is in PATH, False if it is not
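A PATH lookup that returns a boolean can be sketched by hand as follows. `in_path` is a hypothetical name, and the sketch assumes POSIX semantics (no Windows PATHEXT handling); the standard library's shutil.which() covers the same ground and returns the resolved path instead.

```python
import os


def in_path(cmd, path=None):
    """Return True if cmd is an executable file in one of the PATH dirs."""
    if path is None:
        path = os.environ.get("PATH", "").split(os.pathsep)
    for directory in path:
        candidate = os.path.join(directory, cmd)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return True
    return False
```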