
spacesavers2_grubbers

spacesavers2_grubbers takes the mimeo.files.gz file generated by spacesavers2_mimeo and processes it to:

sort duplicates by total size
report the "high-value" duplicates

Deleting these high-value duplicates first will have the biggest impact on the user's overall digital footprint.
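
The ranking idea can be sketched in a few lines of Python. This is only an illustration, not the tool's actual implementation: the `DuplicateSet` record, its field names, and the sample sizes are all hypothetical.

```python
from typing import NamedTuple


class DuplicateSet(NamedTuple):
    # Hypothetical record; field names are illustrative, not the tool's internals.
    combined_hash: str
    n_duplicates: int
    total_bytes: int  # total size of all duplicates in this set


def rank_by_total_size(sets):
    """Return duplicate sets ordered from largest to smallest total size."""
    return sorted(sets, key=lambda s: s.total_bytes, reverse=True)


GIB = 1024 ** 3
# Made-up sample data for illustration only.
sets = [
    DuplicateSet("aaa", 2, 10 * GIB),
    DuplicateSet("bbb", 40, 2 * GIB),
    DuplicateSet("ccc", 3, 75 * GIB),
]
ranked = rank_by_total_size(sets)
print([s.combined_hash for s in ranked])
```

Note that the set with the most copies ("bbb") ranks last here: what matters is the total space occupied, not the number of duplicates.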

Inputs

--filesgz output file from spacesavers2_mimeo.
--limit lower cut-off for output display (default 5 GiB). Duplicates with a total size of less than 5 GiB are not displayed. Set to 0 to report all duplicates.
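
As a rough sketch of the --limit semantics described above (assuming the cut-off is compared against the total size in bytes, which is a guess about the internals), the hypothetical helper `passes_limit` below treats 0 as "no cut-off":

```python
GIB = 1024 ** 3  # bytes per GiB


def passes_limit(total_bytes: int, limit_gib: float = 5.0) -> bool:
    """Return True if a duplicate set is large enough to be reported.

    Hypothetical helper: a limit of 0 disables the cut-off entirely;
    otherwise sets smaller than the limit are hidden.
    """
    if limit_gib == 0:
        return True
    return total_bytes >= limit_gib * GIB


print(passes_limit(6 * GIB))       # above the default 5 GiB cut-off
print(passes_limit(1 * GIB))       # below the default cut-off
print(passes_limit(1 * GIB, 0))    # limit 0 reports everything
```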

> spacesavers2_grubbers --help
spacesavers2_grubbers:00000.00s:version: v0.10.2-dev
usage: spacesavers2_grubbers [-h] -f FILESGZ [-l LIMIT] [-o OUTFILE] [-v]

spacesavers2_grubbers: get list of large duplicates sorted by total size

options:
  -h, --help            show this help message and exit
  -f FILESGZ, --filesgz FILESGZ
                        spacesavers2_mimeo prefix.<user>.mimeo.files.gz file
  -l LIMIT, --limit LIMIT
                        stop showing duplicates with total size smaller than (5 default) GiB. Set 0 for unlimited.
  -o OUTFILE, --outfile OUTFILE
                        output tab-delimited file (default STDOUT)
  -v, --version         show program's version number and exit

Version:
    v0.10.2-dev
Example:
    > spacesavers2_grubbers -f /output/from/spacesavers2_finddup/prefix.files.gz

Outputs

By default the output is written to STDOUT (use --outfile to write it to a file instead). It is tab-delimited with these columns:

Column	Description
1	combined hash
2	number of duplicates found
3	total size of all duplicates (human readable)
4	size of each duplicate (human readable)
5	original file
6	";"-separated list of duplicate files

Here is an example output line:

183e9dc341073d9b75c817f5ed07b9ac#183e9dc341073d9b75c817f5ed07b9ac   5   0.07 KiB    0.01 KiB    "/data/CCBR/abdelmaksoudaa/test/a"  "/data/CCBR/abdelmaksoudaa/test/b";"/data/CCBR/abdelmaksoudaa/test/c";"/data/CCBR/abdelmaksoudaa/test/d";"/data/CCBR/abdelmaksoudaa/test/e";"/data/CCBR/abdelmaksoudaa/test/f"
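
For downstream scripting, a line in this format could be parsed roughly as follows. This is a minimal sketch under a few assumptions: the six columns are separated by tabs, paths are wrapped in double quotes as in the example above, and no path contains the sequence `";"`. The function name `parse_grubbers_line` and the sample line (with shortened, made-up hash and paths) are hypothetical.

```python
def parse_grubbers_line(line: str) -> dict:
    """Split one tab-delimited output line into its six columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated columns, got {len(fields)}")
    combined_hash, count, total, each, original, dups = fields
    # Duplicates look like "path1";"path2"; split on the quoted separator,
    # then drop the remaining outer quotes.
    duplicates = [p.strip('"') for p in dups.split('";"') if p.strip('"')]
    return {
        "combined_hash": combined_hash,
        "n_duplicates": int(count),
        "total_size": total,  # human-readable string, e.g. "0.07 KiB"
        "size_each": each,    # human-readable string
        "original": original.strip('"'),
        "duplicates": duplicates,
    }


# Made-up sample line for illustration only.
sample = 'abc#abc\t2\t0.03 KiB\t0.01 KiB\t"/tmp/a"\t"/tmp/b";"/tmp/c"'
record = parse_grubbers_line(sample)
print(record["original"], record["duplicates"])
```

Splitting on `";"` rather than a bare `;` keeps paths that happen to contain a semicolon intact.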

spacesavers2_grubbers is typically used to find the "low-hanging fruit", that is, the "high-value" duplicates that should be deleted first to quickly have the biggest impact on the user's overall digital footprint.