spacesavers2_mimeo¶
This takes in the catalog file generated by spacesavers2_catalog and processes it to:
- find duplicates
- create summary reports per user and for all users combined.
Inputs¶
- --catalog: the output file from spacesavers2_catalog. Thus, spacesavers2_catalog needs to be run before running spacesavers2_mimeo.
- --maxdepth: maximum folder depth up to which reports are aggregated.
- --outdir: path to the output folder.
- --prefix: prefix to be added to the output file names, e.g. the date.
- --duplicatesonly: only report duplicates in the .files.gz output files. This saves a lot of disk space. (Highly recommended!)
- --quota: the size of the overall file mount (e.g. 200 TB for /data/CCBR on BIOWULF). OccScore depends on this value, so it should be set correctly for accurate results.
% spacesavers2_mimeo --help
usage: spacesavers2_mimeo [-h] -f CATALOG [-d MAXDEPTH] [-o OUTDIR] [-p PREFIX] [-q QUOTA] [-z | --duplicatesonly | --no-duplicatesonly] [-k | --kronaplot | --no-kronaplot] [-v]
spacesavers2_mimeo: find duplicates
options:
-h, --help show this help message and exit
-f CATALOG, --catalog CATALOG
spacesavers2_catalog output from STDIN or from catalog file
-d MAXDEPTH, --maxdepth MAXDEPTH
folder max. depth upto which reports are aggregated ... absolute path is used to calculate depth (Default: 10)
-o OUTDIR, --outdir OUTDIR
output folder
-p PREFIX, --prefix PREFIX
prefix for all output files
-q QUOTA, --quota QUOTA
total quota of the mount eg. 200 TB for /data/CCBR
-z, --duplicatesonly, --no-duplicatesonly
Print only duplicates to per user output file.
-k, --kronaplot, --no-kronaplot
Make kronaplots for duplicates.(ktImportText must be in PATH!)
-v, --version show program's version number and exit
Version:
v0.10.2-dev
Example:
> spacesavers2_mimeo -f /output/from/spacesavers2_catalog -o /path/to/output/folder -d 7 -q 10 -k
Outputs¶
After the run completes, spacesavers2_mimeo creates *.mimeo.files.gz (list of files per user, plus one "allusers" file) and *.mimeo.summary.txt (overall stats at various depths) files in the provided output folder. If -k is provided (and ktImportText from kronatools is in PATH), Krona-specific TSV and HTML pages are also generated. It also generates a blamematrix.tsv file with folders as rows and users as columns, reporting duplicate bytes per folder per user. This file can be used to create a "heatmap" to pinpoint the folders with the most duplicates, both overall and on a per-user basis.
Here are the details:
Duplicates¶
spacesavers2_mimeo uses the following logic to find duplicates:
- Bin files by their top (and bottom) xxHashes, irrespective of user id (allusers mode).
- Check whether all files in a bin have the same size. Occasionally, differing files yield the same combination of top and bottom chunk xxHashes; such files will have different sizes. If a bin holds more than one size, re-bin its files by size.
- Within a bin of same-sized files, check inodes. If all files share the same inode, they are just hard links. If there are multiple inodes, we have duplicates!
- If we have duplicates, spacesavers2_mimeo keeps track of the number of duplicates per bin. The number of duplicates equals the number of inodes in the bin minus one.
- If we have duplicates, the oldest file is identified and considered to be the original file. All other files are marked as duplicates, irrespective of user id.
- Duplicate files are reported in gzip format, for all users combined and on a per-user basis, with the columns described below.
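The binning steps above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: `FileRecord`, `find_duplicates`, and the use of modification time to decide which file is "oldest" are all assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class FileRecord(NamedTuple):
    # Hypothetical record; the real catalog carries more fields per file.
    path: str
    top_hash: str
    bottom_hash: str
    size: int
    inode: int
    mtime: float  # assumption: modification time decides which file is oldest


def find_duplicates(records):
    """Bin records by (hashes, size) and report bins holding more than one inode."""
    bins = defaultdict(list)
    for rec in records:
        # Including the size in the key splits bins whose hashes collide
        # for files of different sizes.
        bins[(rec.top_hash, rec.bottom_hash, rec.size)].append(rec)

    groups = []
    for (top, bottom, size), files in bins.items():
        inodes = {f.inode for f in files}
        if len(inodes) < 2:
            continue  # a single inode means hard links only, not duplicates
        original = min(files, key=lambda f: f.mtime)
        groups.append({
            "hash": f"{top}#{bottom}",
            "size": size,
            "original": original.path,
            "n_duplicates": len(inodes) - 1,
            # Every file other than the original is marked as a duplicate.
            "duplicate_paths": sorted(f.path for f in files if f.path != original.path),
        })
    return groups


records = [
    FileRecord("/a/x", "h1", "h2", 100, 1, 10.0),
    FileRecord("/b/x", "h1", "h2", 100, 2, 20.0),
    FileRecord("/b/x_link", "h1", "h2", 100, 2, 20.0),  # hard link to /b/x
    FileRecord("/c/y", "h1", "h2", 200, 3, 5.0),        # same hashes, other size
    FileRecord("/d/z", "h3", "h4", 50, 4, 1.0),
    FileRecord("/d/z_link", "h3", "h4", 50, 4, 1.0),    # hard links only
]
result = find_duplicates(records)
```

In this made-up input only the 100-byte bin is reported: it holds two inodes, so it counts one duplicate, while the `/d/z` pair shares a single inode and is skipped.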
Here is what the .files.gz file columns (space-separated) represent:
Column | Description |
---|---|
1 | top chunk and bottom chunk hashes separated by "#" |
2 | separator ":" |
3 | Number of duplicate files (not duplicate inodes) |
4 | Size of each file |
5 | List of the duplicate files, separated by "##" |
NOTE: The number of duplicate files can be greater than the number of duplicate inodes, as each file can already have multiple hard links. Hence, total duplicate bytes are calculated as (total_number_of_unique_inodes_per_group_of_duplicate_files - 1) × size_of_each_file. The "minus 1" excludes the size of the original file.
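As a worked example of the formula in the note, here is a small sketch. The function name `duplicate_bytes` and the inode numbers are made up for illustration.

```python
def duplicate_bytes(inodes, file_size):
    """Return (unique inodes - 1) * file_size for one group of duplicate files."""
    unique_inodes = len(set(inodes))
    return max(unique_inodes - 1, 0) * file_size


# Five paths, but only three distinct inodes (two pairs are hard links).
group_inodes = [101, 101, 202, 303, 303]
total = duplicate_bytes(group_inodes, 1_000_000)
print(total)  # prints 2000000
```

Five files are listed in the group, yet only two extra copies occupy disk space, so two file sizes are counted rather than four.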
Each file entry in the last column is ";"-separated and carries the same 13 items as described for the catalog file. The only difference is that the username and groupname are now appended to each file entry.
Along with creating one .mimeo.files.gz and one .mimeo.summary.txt file per user encountered, spacesavers2_mimeo also generates an allusers.mimeo.files.gz file for all users combined. This file is later used by spacesavers2_blamematrix as input.
Summaries¶
Summaries (files ending with .mimeo.summary.txt) are collected and reported for all users (allusers.mimeo.summary.txt) and on a per-user basis (USERNAME.mimeo.summary.txt) for the user-defined depth (and beyond). The columns (tab-delimited) in the summary file are:
Column | Description |
---|---|
1 | absolute path |
2 | total bytes |
3 | duplicate bytes |
4 | percent duplicate bytes |
5 | total files |
6 | duplicate files |
7 | percent duplicate files |
8 | average file age of all files (days) |
9 | average file age of duplicates (days) |
10 | AgeScore |
11 | DupScore |
12 | OccScore |
13 | OverallScore |
For columns 10 through 13, the same logic is used as in spacesavers.
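When post-processing a summary file, it can help to attach names to the 13 positional columns. The sketch below is only an illustration: the key names in `SUMMARY_COLUMNS`, the helper `parse_summary_line`, and the sample line are all made up, and whether the real file starts with a header row is not assumed either way. Values are kept as strings so no types are guessed.

```python
SUMMARY_COLUMNS = [
    "path", "total_bytes", "duplicate_bytes", "percent_duplicate_bytes",
    "total_files", "duplicate_files", "percent_duplicate_files",
    "mean_age_all_days", "mean_age_duplicates_days",
    "AgeScore", "DupScore", "OccScore", "OverallScore",
]


def parse_summary_line(line):
    """Map one tab-delimited summary line onto the 13 column names."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(SUMMARY_COLUMNS):
        raise ValueError(f"expected {len(SUMMARY_COLUMNS)} fields, got {len(fields)}")
    return dict(zip(SUMMARY_COLUMNS, fields))


# Made-up line for illustration only; the values are not real output.
sample = "\t".join(
    ["/data/CCBR/projects", "1000", "250", "25.0", "40", "10", "25.0",
     "300", "450", "1", "2", "3", "6"]
)
row = parse_summary_line(sample)
```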
KronaTSV and KronaHTML¶
- KronaTSV is tab-delimited, with the first column showing the number of duplicate bytes and each subsequent column giving successive folder depths.
- ktImportText is then used to convert the KronaTSV to KronaHTML, which can be shared easily and only needs an HTML5-capable browser for viewing.
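A row in this layout can be sketched as below: the quantity comes first, followed by one column per folder level. The helper name `krona_row` and the example path and byte count are hypothetical, and the real tool may build its rows differently.

```python
from pathlib import PurePosixPath


def krona_row(folder, duplicate_bytes):
    """Build one tab-delimited row: duplicate bytes, then one column per folder level."""
    levels = [part for part in PurePosixPath(folder).parts if part != "/"]
    return "\t".join([str(duplicate_bytes), *levels])


row = krona_row("/data/CCBR/projects", 1024)
print(row)
```

Here the row consists of `1024` followed by `data`, `CCBR` and `projects`, each in its own tab-separated column.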
Blamematrix¶
- rows are folders one level deeper than the "mindepth"
- columns are all individual usernames, plus an "allusers" column
- only duplicate-bytes are reported
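The layout described above amounts to a pivot of (folder, user, duplicate bytes) triples. The sketch below is illustrative only: `blame_matrix_lines`, the header label `folder`, and the input values are assumptions, and the "allusers" figures are taken as given in the input rather than derived from the per-user values.

```python
from collections import defaultdict


def blame_matrix_lines(entries):
    """Pivot (folder, user, duplicate_bytes) triples into tab-delimited lines."""
    matrix = defaultdict(dict)
    users = set()
    for folder, user, nbytes in entries:
        matrix[folder][user] = matrix[folder].get(user, 0) + nbytes
        users.add(user)
    # Individual usernames first (sorted), with the "allusers" column last.
    columns = sorted(users - {"allusers"})
    if "allusers" in users:
        columns.append("allusers")
    lines = ["\t".join(["folder", *columns])]
    for folder in sorted(matrix):
        cells = [str(matrix[folder].get(user, 0)) for user in columns]
        lines.append("\t".join([folder, *cells]))
    return lines


# Made-up values for illustration only.
entries = [
    ("/data/CCBR/a", "alice", 100),
    ("/data/CCBR/a", "bob", 50),
    ("/data/CCBR/a", "allusers", 150),
    ("/data/CCBR/b", "bob", 30),
    ("/data/CCBR/b", "allusers", 30),
]
lines = blame_matrix_lines(entries)
```

A table in this shape can be loaded into any plotting tool to draw the heatmap of duplicate bytes per folder and user.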