# :rocket: spacesavers2 :rocket:

## Background
spacesavers2
- crawls through the provided folder (and its subfolders),
- gathers stats for each file like size, inode, user/group information, etc.,
- calculates unique hashes for each file,
- using the gathered information, determines "duplicates",
- reports "high-value" duplicates, i.e., the ones that will give back the most disk space if deleted, and
- makes a "counts-matrix" style matrix with folders as row names and users as column names, where each cell represents duplicate bytes (the core duplicate-detection idea is sketched below).
spacesavers2 is a new, improved parallel implementation of spacesavers. spacesavers is soon to be decommissioned!

Note: spacesavers2 requires Python version 3.11 or later and the xxhash library. These dependencies are already installed on biowulf (as a conda environment). The environment for running spacesavers2 can be set up using the shared conda environment on biowulf or frce.

on biowulf:

```bash
. "/data/CCBR_Pipeliner/db/PipeDB/Conda/etc/profile.d/conda.sh" && \
conda activate /data/CCBR_Pipeliner/db/PipeDB/Conda/envs/py311
```

on frce:

```bash
. "/mnt/projects/CCBR-Pipelines/resources/miniconda3/etc/profile.d/conda.sh" && \
conda activate /mnt/projects/CCBR-Pipelines/resources/miniconda3/envs/py311
```
## Commands
spacesavers2 has the following basic commands:
- spacesavers2_catalog
- spacesavers2_mimeo
- spacesavers2_grubbers
- spacesavers2_usurp
- spacesavers2_e2e
- spacesavers2_pdq
## Use case
A typical use case is monitoring the per-user digital footprint on shared data drives such as /data/CCBR on biowulf. Running spacesavers2_e2e as a weekly cronjob automates this task. A slurm_job script is also provided as a template for submitting the run through the HPC job scheduler (possibly from a cronjob).
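
As an illustration, a weekly cron entry could submit the provided slurm_job template through sbatch. The script path and schedule below are placeholders; adjust them to wherever your copy of the template lives:

```bash
# Hypothetical crontab entry: every Monday at 03:00, submit the
# spacesavers2_e2e SLURM job (the script path is a placeholder).
0 3 * * 1  sbatch /path/to/spacesavers2/slurm_job
```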