pipeline.util

Pipeline utility functions

Functions

Name Description
check_python_version Check if the current Python version meets the minimum required version.
chmod_bins_exec Ensure that all files in bin/ are executable.
copy_config Copy default config files to the current working directory.
err Prints any provided args to standard error.
exists Checks if file exists on the local filesystem.
fatal Prints any provided args to standard error and exits with an exit code of 1.
get_genomes_dict Get dictionary of genome annotation versions and the paths to the corresponding JSON files.
get_genomes_list Get list of genome annotations available for the current platform
get_tmp_dir Get default temporary directory for biowulf and frce. Allow user override.
git_commit_hash Gets the git commit hash of the RNA-seek repo.
join_jsons Joins multiple JSON files into one data structure.
ln Creates symlinks for files to an output directory.
md5sum Gets the MD5 checksum of a file in a memory-safe manner.
permissions Checks permissions using os.access() to see if the user is authorized to access a file/directory.
read_config_yml Reads a YAML configuration file and returns its contents as a dictionary.
rename Dynamically renames FastQ file to have one of the following extensions: .R1.fastq.gz, .R2.fastq.gz
require Enforces an executable is in $PATH
safe_copy Private function: Given a list of paths, recursively copies each to the target location.
standard_input Checks for standard input when provided or permissions using permissions().
which Checks if an executable is in $PATH
write_config_yml Writes a configuration dictionary to a YAML file.

check_python_version

pipeline.util.check_python_version(MIN_PYTHON=(3, 11))

Check if the current Python version meets the minimum required version.

Parameters

Name Type Description Default
MIN_PYTHON tuple Minimum required Python version as a tuple (major, minor). (3, 11)
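
The comparison behind a check like this can be sketched with plain tuple ordering. This is a minimal illustration, not the module's implementation: the helper name `meets_min_python` and its `current` argument are hypothetical, and the sketch returns a boolean because this page does not say what happens when the check fails.

```python
import sys


def meets_min_python(min_python=(3, 11), current=None):
    """Return True if the interpreter version is at least min_python.

    Hypothetical helper: `current` exists only so the comparison can be
    demonstrated without depending on the running interpreter.
    """
    if current is None:
        current = sys.version_info[:2]
    # Tuples compare element by element: major first, then minor.
    return tuple(current) >= tuple(min_python)
```

For example, `meets_min_python((3, 11), current=(3, 9))` evaluates to `False`, while a `current` of `(3, 12)` passes.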

chmod_bins_exec

pipeline.util.chmod_bins_exec(repo_base=repo_base)

Ensure that all files in bin/ are executable.

It appears that setuptools strips executable permissions from package_data files, yet post-install scripts are not possible with the pyproject.toml format. Without this hack, nextflow processes that call scripts in bin/ fail.

See Also

https://stackoverflow.com/questions/18409296/package-data-files-with-executable-permissions
https://github.com/pypa/setuptools/issues/2041
https://stackoverflow.com/questions/76320274/post-install-script-for-pyproject-toml-projects
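
One way to restore execute permissions on a directory of scripts is sketched below. It assumes the goal is to add the execute bits while leaving the other permission bits alone. The helper name `make_executable` and its `directory` argument are hypothetical; the real function locates bin/ through `repo_base` and may set a fixed mode instead.

```python
import os
import stat


def make_executable(directory):
    """Add execute bits to every regular file directly inside `directory`.

    Hypothetical sketch: returns the list of paths that were updated.
    """
    changed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories and other non-regular entries
        mode = os.stat(path).st_mode
        # Keep the existing bits and add execute for user, group, and other.
        os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
        changed.append(path)
    return changed
```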

copy_config

pipeline.util.copy_config(config_paths, overwrite=True, repo_base=repo_base)

Copy default config files to the current working directory.

Parameters

Name Type Description Default
config_paths list[str] List of configuration paths to copy. required
overwrite bool Whether to overwrite existing files. Defaults to True. True
repo_base function Function to get the base directory of the repository. repo_base

err

pipeline.util.err(*message, **kwargs)

Prints any provided args to standard error. kwargs can be provided to modify print function’s behavior.

Parameters

Name Type Description Default
message any Values printed to standard error. ()
kwargs dict Key words to modify print function behavior. {}

exists

pipeline.util.exists(testpath)

Checks if file exists on the local filesystem.

Parameters

Name Type Description Default
testpath str Name of file/directory to check. required

Returns

Name Type Description
bool True when file/directory exists, False when file/directory does not exist.

fatal

pipeline.util.fatal(*message, **kwargs)

Prints any provided args to standard error and exits with an exit code of 1.

Parameters

Name Type Description Default
message any Values printed to standard error. ()
kwargs dict Key words to modify print function behavior. {}
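
The behavior described for err and fatal can be sketched in a few lines. This is a minimal illustration that assumes kwargs are forwarded unchanged to the built-in print function. It is not necessarily how the module implements them.

```python
import sys


def err(*message, **kwargs):
    """Print the given values to standard error; kwargs go to print()."""
    print(*message, file=sys.stderr, **kwargs)


def fatal(*message, **kwargs):
    """Print the given values to standard error, then exit with status 1."""
    err(*message, **kwargs)
    sys.exit(1)
```

With these definitions, `err("a", "b", sep="-")` writes `a-b` to standard error, and `fatal("missing input")` raises `SystemExit` with code 1 after printing.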

get_genomes_dict

pipeline.util.get_genomes_dict(
    repo_base,
    hpcname=get_hpcname(),
    error_on_warnings=False,
)

Get dictionary of genome annotation versions and the paths to the corresponding JSON files.

Parameters

Name Type Description Default
repo_base function Function for getting the base directory of the repository. required
hpcname str Name of the HPC. Defaults to the value returned by get_hpcname(). get_hpcname()
error_on_warnings bool Flag to indicate whether to raise warnings as errors. Defaults to False. False

Returns

Name Type Description
genomes_dict dict A dictionary containing genome names as keys and corresponding JSON file paths as values: { genome_name: json_file_path }

get_genomes_list

pipeline.util.get_genomes_list(
    repo_base,
    hpcname=get_hpcname(),
    error_on_warnings=False,
)

Get list of genome annotations available for the current platform.

Parameters

Name Type Description Default
repo_base function Function for getting the base directory of the repository. required
hpcname str The name of the HPC. Defaults to the value returned by get_hpcname(). get_hpcname()
error_on_warnings bool Whether to raise an error on warnings. Defaults to False. False

Returns

Name Type Description
genomes list A sorted list of genome annotations available for the current platform.

get_tmp_dir

pipeline.util.get_tmp_dir(tmp_dir, outdir, hpc=get_hpcname())

Get default temporary directory for biowulf and frce. Allow user override.

Parameters

Name Type Description Default
tmp_dir str User-defined temporary directory path. If provided, this path will be used as the temporary directory. required
outdir str Output directory path. required
hpc str HPC name. Defaults to the value returned by get_hpcname(). get_hpcname()

Returns

Name Type Description
tmp_dir str The default temporary directory path based on the HPC name and user-defined path.

git_commit_hash

pipeline.util.git_commit_hash(repo_path)

Gets the git commit hash of the RNA-seek repo.

Parameters

Name Type Description Default
repo_path str Path to RNA-seek git repo. required

Returns

Name Type Description
str Latest git commit hash.

join_jsons

pipeline.util.join_jsons(templates)

Joins multiple JSON files into one data structure. Used to join multiple template JSON files to create a global config dictionary.

Parameters

Name Type Description Default
templates list[str] List of template JSON files to join together. required

Returns

Name Type Description
dict Dictionary containing the contents of all the input JSON files.
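
Joining several JSON templates into one dictionary can be sketched as below. Two assumptions are made here: every file holds a JSON object at the top level, and when two files share a key the later file wins. This page does not state how collisions are handled, and the name `join_jsons_sketch` is hypothetical.

```python
import json


def join_jsons_sketch(templates):
    """Merge the top-level keys of several JSON files into one dict."""
    merged = {}
    for path in templates:
        with open(path, "r", encoding="utf-8") as handle:
            # Assumption: on duplicate keys, later files overwrite earlier ones.
            merged.update(json.load(handle))
    return merged
```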

ln

pipeline.util.ln(files, outdir)

Creates symlinks for files to an output directory.

Parameters

Name Type Description Default
files list[str] List of filenames. required
outdir str Destination or output directory to create symlinks. required
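
Symlinking a list of files into an output directory might look like the sketch below. It assumes each link keeps the basename of its source, that existing links are left untouched, and that the output directory should be created if it is missing. None of these details are confirmed on this page, and the name `ln_sketch` is hypothetical.

```python
import os


def ln_sketch(files, outdir):
    """Create a symlink in `outdir` for each file; return the link paths."""
    os.makedirs(outdir, exist_ok=True)
    links = []
    for file in files:
        link = os.path.join(outdir, os.path.basename(file))
        # lexists() is True even for a broken symlink, so nothing is clobbered.
        if not os.path.lexists(link):
            os.symlink(os.path.abspath(os.path.realpath(file)), link)
        links.append(link)
    return links
```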

md5sum

pipeline.util.md5sum(filename, first_block_only=False, blocksize=65536)

Gets the MD5 checksum of a file in a memory-safe manner. The file is read in blocks/chunks whose size is set by the blocksize parameter. This is safer than reading the entire file into memory when the file is very large.

Parameters

Name Type Description Default
filename str Input file on local filesystem to find md5 checksum. required
first_block_only bool Calculate md5 checksum of the first block/chunk only. False
blocksize int Size of each block/chunk read from the file; reading in blocks keeps memory use low. 65536

Returns

Name Type Description
str MD5 checksum of the file’s contents.
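
The block-wise hashing described above can be sketched with the standard library's hashlib. This is a minimal illustration of the technique using the documented parameters, not necessarily the module's exact code. The name `md5sum_sketch` is hypothetical.

```python
import hashlib


def md5sum_sketch(filename, first_block_only=False, blocksize=65536):
    """Return the hex MD5 digest of a file, reading `blocksize` bytes at a time."""
    hasher = hashlib.md5()
    with open(filename, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for block in iter(lambda: handle.read(blocksize), b""):
            hasher.update(block)
            if first_block_only:
                break  # hash only the first block
    return hasher.hexdigest()
```

Only one block is held in memory at a time, so peak memory use stays near `blocksize` regardless of file size. With `first_block_only=True` the digest covers only the first block, which makes it a quick fingerprint and not a checksum of the whole file.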

permissions

pipeline.util.permissions(parser, path, *args, **kwargs)

Checks permissions using os.access() to see if the user is authorized to access a file/directory. Checks for existence, readability, writability, and executability via: os.F_OK (tests existence), os.R_OK (tests read), os.W_OK (tests write), os.X_OK (tests exec).

Parameters

Name Type Description Default
parser argparse.ArgumentParser Argparse parser object. required
path str Name of the path to check. required

Returns

Name Type Description
str Returns absolute path if it exists and permissions are correct.
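
A check of this kind can be sketched with os.access() and argparse's own error reporting. The sketch assumes a single access flag is passed explicitly (the real signature takes `*args`, presumably for such flags) and that failures are reported through `parser.error()`, which prints a usage message and exits with status 2. The name `check_access` and the error wording are hypothetical.

```python
import argparse
import os


def check_access(parser, path, access=os.R_OK):
    """Return the absolute path if it exists and passes the access test."""
    if not os.access(path, os.F_OK):
        parser.error(f"Path '{path}' does not exist.")
    if not os.access(path, access):
        parser.error(f"Path '{path}' exists but the requested access is denied.")
    return os.path.abspath(path)
```

A function shaped like this is commonly used as an argparse `type=` callable, for example `type=lambda p: check_access(parser, p, os.R_OK)`, so that invalid paths are rejected while arguments are parsed.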

read_config_yml

pipeline.util.read_config_yml(file)

Reads a YAML configuration file and returns its contents as a dictionary.

Parameters

Name Type Description Default
file str The path to the YAML file to be read. required

Returns

Name Type Description
dict The contents of the YAML file as a dictionary.

rename

pipeline.util.rename(filename)

Dynamically renames a FastQ file to have one of the following extensions: .R1.fastq.gz, .R2.fastq.gz. To automatically rename the FastQ files, a few assumptions are made. If the extension of the FastQ file cannot be inferred, an exception is raised telling the user to fix the filename of the FastQ files.

Parameters

Name Type Description Default
filename str Original name of file to be renamed. required

Returns

Name Type Description
str A renamed FastQ filename.
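
The sketch below illustrates extension-based renaming. The filename patterns are hypothetical examples of common FastQ naming schemes, not the list the function actually recognizes, and the exception type is likewise an assumption. The name `rename_sketch` is hypothetical.

```python
import re

# Hypothetical patterns: each maps a recognized suffix to a normalized extension.
_PATTERNS = [
    (re.compile(r"[._]R1(_001)?\.f(ast)?q\.gz$"), ".R1.fastq.gz"),
    (re.compile(r"[._]R2(_001)?\.f(ast)?q\.gz$"), ".R2.fastq.gz"),
    (re.compile(r"_1\.f(ast)?q\.gz$"), ".R1.fastq.gz"),
    (re.compile(r"_2\.f(ast)?q\.gz$"), ".R2.fastq.gz"),
]


def rename_sketch(filename):
    """Return `filename` with its suffix normalized to .R1/.R2.fastq.gz."""
    for pattern, extension in _PATTERNS:
        if pattern.search(filename):
            return pattern.sub(extension, filename)
    # No pattern matched: ask the user to fix the filename.
    raise ValueError(f"Cannot infer read extension from '{filename}'.")
```

Under these assumed patterns, `rename_sketch("sample_R1_001.fastq.gz")` returns `sample.R1.fastq.gz`, and a name such as `sample.txt` raises an exception.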

require

pipeline.util.require(cmds, suggestions, path=None)

Enforces that an executable is in $PATH.

Parameters

Name Type Description Default
cmds list[str] List of executable names to check. required
suggestions list[str] Module name to suggest loading for the executable at the same index in cmds. required
path list[str] Optional list of PATHs to check. Defaults to $PATH. None
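
The pairing of cmds and suggestions by index can be sketched as follows. The sketch uses the standard library's `shutil.which` as a stand-in lookup and only reports what is missing. How the real function reacts to a missing executable (for example, by exiting) is not stated on this page, and the name `missing_executables` is hypothetical.

```python
import os
import shutil


def missing_executables(cmds, suggestions, path=None):
    """Return (cmd, suggestion) pairs for executables that cannot be found."""
    # shutil.which expects one os.pathsep-joined string, or None for $PATH.
    search_path = os.pathsep.join(path) if path else None
    missing = []
    for cmd, suggestion in zip(cmds, suggestions):
        if shutil.which(cmd, path=search_path) is None:
            missing.append((cmd, suggestion))
    return missing
```

A caller could then print a hint such as "load module X" for each returned pair before stopping.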

safe_copy

pipeline.util.safe_copy(source, target, resources=[])

Private function: Given a list of paths, it recursively copies each one to the target location. If a target path already exists, it will NOT overwrite the existing path's data.

Parameters

Name Type Description Default
resources list[str] List of paths to copy over to target location. []
source str Path prefix prepended to each resource. required
target str Target path to copy templates and required resources. required
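
The no-overwrite copy described above can be sketched with shutil. The sketch assumes resources are paths relative to `source` and may be either files or directories. The name `safe_copy_sketch` and the returned list of copied paths are hypothetical additions for illustration.

```python
import os
import shutil


def safe_copy_sketch(source, target, resources=()):
    """Copy each resource from `source` into `target` unless it already exists."""
    copied = []
    for resource in resources:
        src = os.path.join(source, resource)
        dst = os.path.join(target, resource)
        if os.path.exists(dst):
            continue  # never overwrite existing data
        if os.path.isdir(src):
            shutil.copytree(src, dst)  # recursive copy of a directory
        else:
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)
        copied.append(dst)
    return copied
```

Calling it twice with the same arguments copies nothing the second time, which is what makes the operation safe to repeat.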

standard_input

pipeline.util.standard_input(parser, path, *args, **kwargs)

Checks for standard input when it is provided; otherwise checks permissions using permissions().

Parameters

Name Type Description Default
parser argparse.ArgumentParser Argparse parser object. required
path str Name of the path to check. required

Returns

Name Type Description
str The path, if it exists and the user can read from it.

which

pipeline.util.which(cmd, path=None)

Checks if an executable is in $PATH.

Parameters

Name Type Description Default
cmd str Name of the executable to check. required
path list Optional list of PATHs to check. Defaults to $PATH. None

Returns

Name Type Description
bool True if the executable is in PATH, False otherwise.
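
A manual search of the PATH entries can be sketched as below. It assumes a command counts as found when a regular file with that name exists in one of the directories and is executable by the current user. The name `which_sketch` is hypothetical, and the real function may differ in details.

```python
import os


def which_sketch(cmd, path=None):
    """Return True if `cmd` is an executable file in one of the `path` dirs."""
    if path is None:
        # Fall back to the directories listed in $PATH.
        path = os.environ.get("PATH", os.defpath).split(os.pathsep)
    for prefix in path:
        candidate = os.path.join(prefix, cmd)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return True
    return False
```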

write_config_yml

pipeline.util.write_config_yml(_config, file)

Writes a configuration dictionary to a YAML file.

Parameters

Name Type Description Default
_config dict The configuration dictionary to write to the file. required
file str The path to the file where the configuration will be written. required