
spacesavers2_grubbers

spacesavers2_grubbers takes in the mimeo.files.gz file generated by spacesavers2_mimeo and processes it to:

  • sort duplicates by total size
  • report the "high-value" duplicates

Deleting these high-value duplicates first will have the biggest impact on the user's overall digital footprint, as sketched below.
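One way to picture the ranking is the minimal Python sketch below. The group tuples are hypothetical stand-ins for the duplicate groups mimeo reports; the real mimeo.files.gz format is not reproduced here.

    # Hypothetical duplicate groups: (combined hash, per-file size in
    # bytes, number of duplicates found). Not the real mimeo format.
    duplicate_groups = [
        ("hash_a#hash_a", 2 * 2**30, 4),    # 4 duplicates of a 2 GiB file
        ("hash_b#hash_b", 512 * 2**20, 2),  # 2 duplicates of a 512 MiB file
    ]

    def total_duplicate_size(group):
        _combined_hash, per_file_size, n_duplicates = group
        # Total space reclaimable by deleting every duplicate copy.
        return per_file_size * n_duplicates

    # "High-value" groups first: largest total duplicate size on top.
    for group in sorted(duplicate_groups, key=total_duplicate_size, reverse=True):
        print(group[0], total_duplicate_size(group))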

Inputs

  • --filesgz output file from spacesavers2_mimeo.
  • --limit lower cut-off for output display (default 5 GiB). Duplicates with an overall size of less than 5 GiB will not be displayed. Set 0 to report all (see the sketch below).
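The cutoff itself is simple byte arithmetic: 5 GiB is 5 * 2**30 bytes. A self-contained sketch of the filter, using hypothetical group totals:

    # Hypothetical sketch of the --limit cutoff (default 5 GiB).
    limit_gib = 5
    limit_bytes = limit_gib * 2**30  # 5 GiB = 5,368,709,120 bytes

    # Hypothetical total duplicate sizes (in bytes) for three groups.
    group_totals = [6 * 2**30, 3 * 2**30, 10 * 2**20]
    shown = [t for t in group_totals if limit_bytes == 0 or t >= limit_bytes]
    print(shown)  # only the 6 GiB group clears the default cutoff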
╰─○ spacesavers2_grubbers --help
spacesavers2_grubbers:00000.00s:version: v0.10.2-dev
usage: spacesavers2_grubbers [-h] -f FILESGZ [-l LIMIT] [-o OUTFILE] [-v]

spacesavers2_grubbers: get list of large duplicates sorted by total size

options:
  -h, --help            show this help message and exit
  -f FILESGZ, --filesgz FILESGZ
                        spacesavers2_mimeo prefix.<user>.mimeo.files.gz file
  -l LIMIT, --limit LIMIT
                        stop showing duplicates with total size smaller than (5 default) GiB. Set 0 for unlimited.
  -o OUTFILE, --outfile OUTFILE
                        output tab-delimited file (default STDOUT)
  -v, --version         show program's version number and exit

Version:
    v0.10.2-dev
Example:
    > spacesavers2_grubbers -f /output/from/spacesavers2_finddup/prefix.files.gz

Outputs

The output is written to STDOUT (or to the file given with --outfile) and is tab-delimited with these columns:

Column  Description
------  -----------
1       combined hash
2       number of duplicates found
3       total size of all duplicates (human-readable)
4       size of each duplicate (human-readable)
5       original file
6       ";"-separated list of duplicate files

Here is an example output line:

183e9dc341073d9b75c817f5ed07b9ac#183e9dc341073d9b75c817f5ed07b9ac   5   0.07 KiB    0.01 KiB    "/data/CCBR/abdelmaksoudaa/test/a"  "/data/CCBR/abdelmaksoudaa/test/b";"/data/CCBR/abdelmaksoudaa/test/c";"/data/CCBR/abdelmaksoudaa/test/d";"/data/CCBR/abdelmaksoudaa/test/e";"/data/CCBR/abdelmaksoudaa/test/f"
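Because the columns are fixed and tab-delimited, the output is easy to post-process. Below is a minimal Python sketch that parses grubbers output from STDIN; the field names are our own labels for the documented columns, not identifiers from the tool.

    import sys

    # Parse grubbers' tab-delimited output: one duplicate group per line,
    # with the six columns documented above.
    for line in sys.stdin:
        (combined_hash, n_duplicates, total_size,
         per_file_size, original, duplicates) = line.rstrip("\n").split("\t")
        # Column 6 is a ";"-separated list of quoted duplicate paths.
        duplicate_paths = [p.strip('"') for p in duplicates.split(";")]
        print(combined_hash, n_duplicates, len(duplicate_paths))

For example, the tool's output could be piped straight into such a script (the script name here is hypothetical): spacesavers2_grubbers -f prefix.files.gz | python parse_grubbers.py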

spacesavers2_grubbers is typically used to find the "low-hanging fruit", i.e., the "high-value" duplicates that should be deleted first to quickly make the biggest impact on the user's overall digital footprint.