spacesavers2_mimeo
This takes in the catalog file generated by spacesavers2_catalog and processes it to:
- find duplicates
- create summary reports per user (and for all users combined).
Inputs
- --catalog: the output file from spacesavers2_catalog. Thus, spacesavers2_catalog needs to be run before running spacesavers2_mimeo.
- --maxdepth: maximum folder depth up to which reports are aggregated.
- --outdir: path to the output folder.
- --prefix: prefix to be added to the output file names, e.g. the date.
- --duplicatesonly: only report duplicates in the .files.gz output files. This saves a lot of disk space. (Highly recommended!)
- --quota: the size of the overall file mount (e.g. 200 TB for /data/CCBR on BIOWULF). OccScore depends on this value, so it should be provided appropriately for accurate results.
% spacesavers2_mimeo --help
usage: spacesavers2_mimeo [-h] -f CATALOG [-d MAXDEPTH] [-o OUTDIR] [-p PREFIX] [-q QUOTA] [-z | --duplicatesonly | --no-duplicatesonly] [-k | --kronaplot | --no-kronaplot] [-v]
spacesavers2_mimeo: find duplicates
options:
-h, --help show this help message and exit
-f CATALOG, --catalog CATALOG
spacesavers2_catalog output from STDIN or from catalog file
-d MAXDEPTH, --maxdepth MAXDEPTH
folder max. depth upto which reports are aggregated ... absolute path is used to calculate depth (Default: 10)
-o OUTDIR, --outdir OUTDIR
output folder
-p PREFIX, --prefix PREFIX
prefix for all output files
-q QUOTA, --quota QUOTA
total quota of the mount eg. 200 TB for /data/CCBR
-z, --duplicatesonly, --no-duplicatesonly
Print only duplicates to per user output file.
-k, --kronaplot, --no-kronaplot
Make kronaplots for duplicates.(ktImportText must be in PATH!)
-v, --version show program's version number and exit
Version:
v0.10.2-dev
Example:
> spacesavers2_mimeo -f /output/from/spacesavers2_catalog -o /path/to/output/folder -d 7 -q 10 -k
Outputs
After the run completes, spacesavers2_mimeo creates *.mimeo.files.gz (list of files per user, plus one "allusers" file) and *.mimeo.summary.txt (overall stats at various depths) files in the provided output folder. If -k is provided (and ktImportText from KronaTools is in PATH), then Krona-specific TSV and HTML pages are also generated. It also generates a blamematrix.tsv file with folders on rows and users on columns, holding duplicate bytes per folder per user. This file can be used to create a "heatmap" to pinpoint the folders with the most duplicates, overall as well as on a per-user basis.
Here are the details:
Duplicates
spacesavers2_mimeo uses the following logic to find duplicates:
- Bin files by their top (and bottom) xxHashes irrespective of user id (allusers mode)
- Check whether all files in a bin have the same size. If a bin holds more than one size, then it needs to be binned further: sometimes the xxHashes of the top and bottom chunks give the same combination of hashes for differing files. These files will have different sizes, so re-bin them by size.
- If the sizes match, then check inodes. If all files in the same bin have the same inode, then these are just hard links. But if there are multiple inodes, then we have duplicates!
- If we have duplicates, then spacesavers2_mimeo keeps track of the number of duplicates per bin. The number of duplicates is equal to the number of inodes in each bin minus one.
- If we have duplicates, then the oldest file is identified and considered to be the original file. All other files are marked duplicate, irrespective of user id.
- Duplicate files are reported in gzip format with the following columns, for all users and on a per-user basis.
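The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the spacesavers2_mimeo implementation: the `FileRecord` type, its field names, and the use of modification time to decide which file is "oldest" are all assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class FileRecord(NamedTuple):
    # Hypothetical record; field names are illustrative, not the catalog's own.
    path: str
    top_hash: str
    bottom_hash: str
    size: int
    inode: int
    mtime: float  # assumption: modification time decides which file is oldest


def find_duplicates(records):
    """Return (original, duplicates, n_duplicate_inodes) for each duplicate bin."""
    # Bin by top and bottom hash, then by size, irrespective of user.
    # Keying on all three at once has the same effect as re-binning by size.
    bins = defaultdict(list)
    for rec in records:
        bins[(rec.top_hash, rec.bottom_hash, rec.size)].append(rec)

    results = []
    for group in bins.values():
        inodes = {rec.inode for rec in group}
        if len(inodes) < 2:
            continue  # one inode only: these are hard links, not duplicates
        ordered = sorted(group, key=lambda rec: rec.mtime)
        original, duplicates = ordered[0], ordered[1:]
        # Number of duplicates = number of inodes in the bin minus one.
        results.append((original, duplicates, len(inodes) - 1))
    return results
```

Keying the bins on (top hash, bottom hash, size) in one pass is a simplification of the two-stage binning described above; the resulting groups are the same.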
Here is what the .files.gz file columns (space-separated) represent:
| Column | Description |
|---|---|
| 1 | top chunk and bottom chunk hashes separated by "#" |
| 2 | separator ":" |
| 3 | Number of duplicate files (not duplicate inodes) |
| 4 | Size of each file |
| 5 | List of the users' duplicate files, separated by "##" |
NOTE: The number of duplicate files can be greater than the number of duplicate inodes, as each file can already have multiple hard links. Hence, total duplicate bytes are calculated as (total_number_of_unique_inodes_per_group_of_duplicate_files - 1) x size_of_each_file. The "minus 1" excludes the size of the original file.
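As a small worked example of the formula in the note, the sketch below computes duplicate bytes for one group of duplicate files. The function name and its arguments are hypothetical and chosen only for illustration.

```python
def duplicate_bytes(inodes, size_of_each_file):
    """Duplicate bytes for one group: (unique inodes - 1) x file size."""
    unique_inodes = len(set(inodes))
    # The original file's inode is not counted as duplicate space.
    return max(unique_inodes - 1, 0) * size_of_each_file


# Five file entries that share only three inodes, each file 1000 bytes:
# two of the entries are hard links and take up no extra space.
print(duplicate_bytes([11, 11, 12, 13, 13], 1000))  # 2000
```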
Each file entry in column 5 is ";"-separated and carries the same 13 items as described in the catalog file. The only difference is that the username and groupname are now appended to each file entry.
Along with creating one .mimeo.files.gz and one .mimeo.summary.txt file per user encountered, spacesavers2_mimeo also generates an allusers.mimeo.files.gz file for all users combined. This file is later used by spacesavers2_blamematrix as input.
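For downstream scripting, a .mimeo.files.gz file could be read roughly as follows. This is a hedged sketch based only on the column description above: the function names are hypothetical, and it assumes that the first four columns contain no spaces, that "##" and ";" do not occur inside file paths, and that a trailing ";" (if any) can be ignored.

```python
import gzip


def parse_mimeo_line(line):
    """Split one line of a .files.gz file into its five columns."""
    # maxsplit=4 keeps column 5 intact even if file paths contain spaces.
    hashes, _sep, n_dups, size, files = line.rstrip("\n").split(" ", 4)
    top_hash, bottom_hash = hashes.split("#", 1)
    entries = []
    for entry in files.split("##"):
        if not entry:
            continue
        # Assumption: tolerate a trailing ";" on each entry if one is present.
        fields = entry.rstrip(";").split(";")
        entries.append(
            {
                "catalog_fields": fields[:-2],  # the catalog items, unparsed
                "username": fields[-2],
                "groupname": fields[-1],
            }
        )
    return {
        "top_hash": top_hash,
        "bottom_hash": bottom_hash,
        "n_duplicate_files": int(n_dups),
        "size": int(size),
        "files": entries,
    }


def read_mimeo_files(path):
    """Yield one parsed record per non-empty line of a gzipped mimeo file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip():
                yield parse_mimeo_line(line)
```

The catalog items are left as raw strings here; interpreting them requires the column layout documented for spacesavers2_catalog.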
Summaries
Summaries (files ending with .mimeo.summary.txt) are collected and reported for all users (allusers.mimeo.summary.txt) and on a per-user basis (USERNAME.mimeo.summary.txt) for the user-defined depth (and beyond). The columns (tab-delimited) in the summary file are:
| Column | Description |
|---|---|
| 1 | absolute path |
| 2 | total bytes |
| 3 | duplicate bytes |
| 4 | percent duplicate bytes |
| 5 | total files |
| 6 | duplicate files |
| 7 | percent duplicate files |
| 8 | average file age of all files (days) |
| 9 | average file age of duplicates (days) |
| 10 | AgeScore |
| 11 | DupScore |
| 12 | OccScore |
| 13 | OverallScore |
For columns 10 through 13, the same logic is used as in spacesavers.
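To rank folders by duplicate bytes, a summary file can be loaded with the standard csv module. The sketch below is illustrative only: the column names are labels invented here from the table above, duplicate bytes are assumed to be written as integers, and any line that does not fit (for example a header line, if the file has one) is skipped.

```python
import csv

# Hypothetical labels for the 13 tab-delimited columns listed above.
SUMMARY_COLUMNS = [
    "absolute_path", "total_bytes", "duplicate_bytes", "percent_duplicate_bytes",
    "total_files", "duplicate_files", "percent_duplicate_files",
    "avg_age_all_days", "avg_age_duplicates_days",
    "AgeScore", "DupScore", "OccScore", "OverallScore",
]


def parse_summary(lines):
    """Parse summary lines (any iterable of strings) into a list of dicts."""
    rows = []
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != len(SUMMARY_COLUMNS):
            continue  # not a 13-column data row
        row = dict(zip(SUMMARY_COLUMNS, fields))
        try:
            row["duplicate_bytes"] = int(row["duplicate_bytes"])
        except ValueError:
            continue  # e.g. a header line
        rows.append(row)
    return rows


def top_duplicate_folders(rows, n=5):
    """Return the n rows with the most duplicate bytes, largest first."""
    return sorted(rows, key=lambda r: r["duplicate_bytes"], reverse=True)[:n]
```

Because `parse_summary` accepts any iterable of lines, it can be called with an open file handle, e.g. `parse_summary(open("allusers.mimeo.summary.txt"))`.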
KronaTSV and KronaHTML
- KronaTSV is tab-delimited, with the first column showing the number of duplicate bytes and every subsequent column giving one folder level of the path.
- ktImportText is then used to convert the KronaTSV to KronaHTML, which can be shared easily and only needs an HTML5-capable browser for viewing.
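The KronaTSV layout can be illustrated with a short sketch that turns per-folder duplicate bytes into ktImportText-style rows (a quantity followed by one column per hierarchy level). The function name and the input dictionary are hypothetical; how spacesavers2_mimeo builds these rows internally is not shown here.

```python
def krona_rows(dup_bytes_by_folder):
    """Yield tab-delimited rows: duplicate bytes, then one column per folder level."""
    for folder, nbytes in dup_bytes_by_folder.items():
        # Drop empty components produced by the leading "/".
        levels = [part for part in folder.split("/") if part]
        yield "\t".join([str(nbytes), *levels])


for row in krona_rows({"/data/CCBR/projects": 2048}):
    print(row)  # 2048, data, CCBR, projects separated by tabs
```

A file made of such rows can then be passed to ktImportText, which writes the HTML page to the file given with its -o option.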
Blamematrix
- rows are folders 1 level deeper than the "mindepth"
- columns are all individual usernames, plus an "allusers" column
- only duplicate-bytes are reported
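The shape of the blamematrix can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: the input records, the function names, and the way a folder is cut at a fixed `depth` are hypothetical, and every path is assumed to have at least `depth` folder levels.

```python
from collections import defaultdict


def blame_matrix(records, depth):
    """Sum duplicate bytes per (folder, user), plus an "allusers" column.

    records: iterable of (absolute_path, username, duplicate_bytes).
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for path, user, nbytes in records:
        levels = [part for part in path.split("/") if part]
        folder = "/" + "/".join(levels[:depth])  # keep the first `depth` levels
        matrix[folder][user] += nbytes
        matrix[folder]["allusers"] += nbytes
    return matrix


def to_tsv(matrix):
    """Render the matrix with folders on rows and users on columns."""
    users = sorted({u for row in matrix.values() for u in row if u != "allusers"})
    columns = [*users, "allusers"]
    lines = ["\t".join(["folder", *columns])]
    for folder in sorted(matrix):
        row = matrix[folder]
        lines.append("\t".join([folder, *(str(row.get(u, 0)) for u in columns)]))
    return "\n".join(lines)
```

A table in this form can be fed to any plotting tool to draw the heatmap mentioned in the Outputs section.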