spacesavers2_grubbers
This takes in the mimeo.files.gz file generated by spacesavers2_mimeo and processes it to:
- sort duplicates by total size
- report the "high-value" duplicates
Deleting these high-value duplicates first will have the biggest impact on the user's overall digital footprint.
Inputs
- --filesgz: output file from spacesavers2_mimeo.
- --limit: lower cut-off for output display (default 5 GiB). Duplicates with an overall size of less than 5 GiB are not displayed. Set to 0 to report all.
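The effect of --limit can be pictured with a short Python sketch. It is not the tool's actual code: the DupSet record, the high_value function, and the assumption that a set's total size is the number of duplicates multiplied by the size of each file are all illustrative.

```python
from typing import List, NamedTuple

GIB = 1024 ** 3  # bytes in one GiB


class DupSet(NamedTuple):
    """Hypothetical record for one set of duplicate files."""
    name: str
    ndup: int        # number of duplicates found
    size_bytes: int  # size of each duplicate

    @property
    def total_bytes(self) -> int:
        # Assumption: total size = number of duplicates x size of each file.
        return self.ndup * self.size_bytes


def high_value(dupsets: List[DupSet], limit_gib: float = 5.0) -> List[DupSet]:
    """Drop sets whose total size is below the limit, largest total first.

    A limit of 0 keeps everything, because every total is >= 0.
    """
    limit_bytes = limit_gib * GIB
    kept = [d for d in dupsets if d.total_bytes >= limit_bytes]
    return sorted(kept, key=lambda d: d.total_bytes, reverse=True)
```

With the default limit, a set of two 1 GiB files (2 GiB in total) would be hidden, while passing a limit of 0 would list it last.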
╰─○ spacesavers2_grubbers --help
spacesavers2_grubbers:00000.00s:version: v0.10.2-dev
usage: spacesavers2_grubbers [-h] -f FILESGZ [-l LIMIT] [-o OUTFILE] [-v]
spacesavers2_grubbers: get list of large duplicates sorted by total size
options:
-h, --help show this help message and exit
-f FILESGZ, --filesgz FILESGZ
spacesavers2_mimeo prefix.<user>.mimeo.files.gz file
-l LIMIT, --limit LIMIT
stop showing duplicates with total size smaller than (5 default) GiB. Set 0 for unlimited.
-o OUTFILE, --outfile OUTFILE
output tab-delimited file (default STDOUT)
-v, --version show program's version number and exit
Version:
v0.10.2-dev
Example:
> spacesavers2_grubbers -f /output/from/spacesavers2_finddup/prefix.files.gz
Outputs
By default the output is written to STDOUT (or to the file given with -o/--outfile) and is tab-delimited with these columns:
Column | Description |
---|---|
1 | combined hash |
2 | number of duplicates found |
3 | total size of all duplicates (human readable) |
4 | size of each duplicate (human readable) |
5 | original file |
6 | ";"-separated list of duplicate files |
Here is an example output line:
183e9dc341073d9b75c817f5ed07b9ac#183e9dc341073d9b75c817f5ed07b9ac 5 0.07 KiB 0.01 KiB "/data/CCBR/abdelmaksoudaa/test/a" "/data/CCBR/abdelmaksoudaa/test/b";"/data/CCBR/abdelmaksoudaa/test/c";"/data/CCBR/abdelmaksoudaa/test/d";"/data/CCBR/abdelmaksoudaa/test/e";"/data/CCBR/abdelmaksoudaa/test/f"
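Because the output is tab-delimited, a line can be split into its six columns with a few lines of Python. The sketch below assumes that every path is wrapped in double quotes and that duplicates are joined by ";" exactly as in the example line above. The function name parse_grubbers_line and the dictionary keys are hypothetical and not part of spacesavers2.

```python
from typing import Dict, List, Union


def parse_grubbers_line(line: str) -> Dict[str, Union[str, int, List[str]]]:
    """Split one tab-delimited output line into named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-delimited columns, got {len(fields)}")
    combined_hash, ndup, total_size, size_each, original, duplicates = fields
    return {
        "hash": combined_hash,
        "ndup": int(ndup),
        "total_size": total_size,  # human-readable string, e.g. "0.07 KiB"
        "size_each": size_each,    # human-readable string, e.g. "0.01 KiB"
        "original": original.strip('"'),
        # Split on the quoted separator so that a ";" inside a path
        # is less likely to break the list (assumption about quoting).
        "duplicates": [p.strip('"') for p in duplicates.split('";"')],
    }
```

Keeping the sizes as strings avoids guessing how the tool rounds them. The duplicates list is what would typically be reviewed for deletion, while the original file is kept.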
spacesavers2_grubbers is typically used to find the "low-hanging fruit", that is, the "high-value" duplicates that should be deleted first to quickly have the biggest impact on the user's overall digital footprint.