
spacesavers2_mimeo¶

spacesavers2_mimeo takes the catalog file generated by spacesavers2_catalog and processes it to:

find duplicates
create a summary report for each user (and one for all users combined).

Inputs¶

--catalog is the output file from spacesavers2_catalog. Thus, spacesavers2_catalog needs to be run before running spacesavers2_mimeo.
--maxdepth maximum folder depth up to which reports are aggregated
--outdir path to the output folder
--prefix prefix to be added to the output file names, e.g. a date
--duplicatesonly only report duplicates in the .files.gz output files. This saves a lot of disk space. (Highly recommended!)
--quota defines the size of the overall file mount (e.g. 200 TB for /data/CCBR on BIOWULF). OccScore depends on this value, so it should be set correctly for accurate results.

% spacesavers2_mimeo --help
usage: spacesavers2_mimeo [-h] -f CATALOG [-d MAXDEPTH] [-o OUTDIR] [-p PREFIX] [-q QUOTA] [-z | --duplicatesonly | --no-duplicatesonly] [-k | --kronaplot | --no-kronaplot] [-v]

spacesavers2_mimeo: find duplicates

options:
  -h, --help            show this help message and exit
  -f CATALOG, --catalog CATALOG
                        spacesavers2_catalog output from STDIN or from catalog file
  -d MAXDEPTH, --maxdepth MAXDEPTH
                        folder max. depth upto which reports are aggregated ... absolute path is used to calculate depth (Default: 10)
  -o OUTDIR, --outdir OUTDIR
                        output folder
  -p PREFIX, --prefix PREFIX
                        prefix for all output files
  -q QUOTA, --quota QUOTA
                        total quota of the mount eg. 200 TB for /data/CCBR
  -z, --duplicatesonly, --no-duplicatesonly
                        Print only duplicates to per user output file.
  -k, --kronaplot, --no-kronaplot
                        Make kronaplots for duplicates.(ktImportText must be in PATH!)
  -v, --version         show program's version number and exit

Version:
    v0.10.2-dev
Example:
    > spacesavers2_mimeo -f /output/from/spacesavers2_catalog -o /path/to/output/folder -d 7 -q 10 -k

Outputs¶

After the run completes, spacesavers2_mimeo creates *.mimeo.files.gz files (a list of files per user, plus one "allusers" file) and *.mimeo.summary.txt files (overall stats at various depths) in the provided output folder. If -k is provided (and ktImportText from kronatools is in PATH), Krona-specific TSV and HTML pages are also generated. It also generates a blamematrix.tsv file with folders as rows and users as columns, reporting duplicate bytes per folder per user. This file can be used to create a "heatmap" that pinpoints the folders with the most duplicates, overall as well as on a per-user basis.

Here are the details:

Duplicates¶

spacesavers2_mimeo uses the following logic to find duplicates:

Bin files by their top (and bottom) xxHashes, irrespective of user id (allusers mode).
Check whether all files in a bin have the same size. If a bin contains more than one size, it needs to be binned further: occasionally, the xxHashes of the top and bottom chunks give the same combination for differing files. Such files will have different sizes, so they are re-binned by size.
If the sizes match, check the inodes. If all files in a bin share the same inode, they are just hard links. If there are multiple inodes, we have duplicates!
For bins with duplicates, spacesavers2_mimeo keeps track of the number of duplicates per bin, which equals the number of inodes in the bin minus one.
For bins with duplicates, the oldest file is identified and considered to be the original file. All other files are marked as duplicates, irrespective of user id.
Duplicate files are reported in gzip format, for all users and on a per-user basis, with the columns described below.
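
The binning logic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual implementation: the FileRecord type, its field names, and the use of modification time to decide which file is "oldest" are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class FileRecord(NamedTuple):
    # Hypothetical record; field names are illustrative, not the tool's own.
    path: str
    top_hash: str
    bottom_hash: str
    size: int
    inode: int
    mtime: float  # assumed age criterion: modification time (epoch seconds)


def find_duplicates(records):
    """Group records into duplicate sets following the logic described above."""
    # Bin by (top hash, bottom hash); adding size to the key re-bins
    # hash collisions between files of different sizes.
    bins = defaultdict(list)
    for rec in records:
        bins[(rec.top_hash, rec.bottom_hash, rec.size)].append(rec)

    results = []
    for key, files in bins.items():
        inodes = {f.inode for f in files}
        # A single inode means the files are only hard links of each other.
        if len(inodes) < 2:
            continue
        # The oldest file is treated as the original.
        original = min(files, key=lambda f: f.mtime)
        results.append({
            "key": key,
            "original": original,
            "duplicates": [f for f in files if f is not original],
            # Number of duplicates = number of inodes in the bin minus one.
            "n_duplicates": len(inodes) - 1,
        })
    return results
```

Folding the size into the bin key is a shortcut for the "bin, then re-bin by size" description; it produces the same groups in one pass.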

Here is what the .files.gz file columns (space-separated) represent:

Column	Description
1	top chunk and bottom chunk hashes separated by "#"
2	separator ":"
3	Number of duplicate files (not duplicate inodes)
4	Size of each file
5	List of duplicate files, separated by "##"

NOTE: The number of duplicate files can be greater than the number of duplicate inodes, as each file may already have multiple hard links. Hence, total duplicate bytes are calculated as (total_number_of_unique_inodes_per_group_of_duplicate_files - 1) × size_of_each_file. The "minus 1" excludes the size of the original file from the count.
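
As a worked example of this formula, here is a minimal sketch. The function name and the sample inode numbers are made up for illustration.

```python
def duplicate_bytes(inodes, size_of_each_file):
    """Return (unique inodes - 1) * file size for one group of duplicate files."""
    unique_inodes = len(set(inodes))
    # One inode belongs to the original file and is not counted;
    # max() guards against an empty group.
    return max(unique_inodes - 1, 0) * size_of_each_file


# Five paths, but only three distinct inodes (two pairs are hard links).
group_inodes = [101, 101, 202, 303, 303]
print(duplicate_bytes(group_inodes, 1_000_000))  # 2000000
```

Counting paths instead of inodes would give 4,000,000 bytes here, overstating the space that could actually be recovered.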

Each file entry in the last column is a ";"-separated list of the same 13 items described for the catalog file. The only difference is that the username and group name are now appended to each file entry.
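
A line of a .files.gz file could be parsed along the following lines. This is a sketch based only on the column description above: the function names are hypothetical, the size is assumed to be an integer byte count, and paths are assumed not to contain ";" or "##".

```python
import gzip


def parse_mimeo_line(line):
    """Split one line of a .files.gz file into its five columns."""
    # maxsplit=4 keeps the fifth column intact even if paths contain spaces.
    hashes, _sep, n_dup, size, files = line.rstrip("\n").split(" ", 4)
    top_hash, bottom_hash = hashes.split("#", 1)
    return {
        "top_hash": top_hash,
        "bottom_hash": bottom_hash,
        "n_duplicate_files": int(n_dup),
        "size": int(size),  # assumption: size is stored as an integer
        # Each entry is a ";"-separated list (catalog items, then user and group).
        "files": [entry.split(";") for entry in files.split("##")],
    }


def read_mimeo_files(path):
    """Yield one parsed record per non-empty line of a gzip-compressed file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip():
                yield parse_mimeo_line(line)
```

Under these assumptions, the last two items of each entry in "files" would be the username and group name.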

Along with creating one .mimeo.files.gz and one .mimeo.summary.txt file per user encountered, spacesavers2_mimeo also generates an allusers.mimeo.files.gz file for all users combined. This file is later used by spacesavers2_blamematrix as input.

Summaries¶

Summaries (files ending with .mimeo.summary.txt) are generated for all users combined (allusers.mimeo.summary.txt) and for each user (USERNAME.mimeo.summary.txt), for the user-defined depth (and beyond). The columns (tab-delimited) in the summary file are:

Column	Description
1	absolute path
2	total bytes
3	duplicate bytes
4	percent duplicate bytes
5	total files
6	duplicate files
7	percent duplicate files
8	average file age of all files (days)
9	average file age of duplicates (days)
10	AgeScore
11	DupScore
12	OccScore
13	OverallScore

For columns 10 through 13, the same logic is used as in spacesavers.
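
To find the folders with the most duplicate bytes, the summary file can be loaded by column position. The sketch below makes several assumptions: the column names are hypothetical labels for the 13 positions listed above, duplicate bytes are assumed to be stored as integers, and any line that does not fit (such as a header, if the file has one) is skipped.

```python
import csv

# Hypothetical labels for the 13 tab-delimited columns listed above.
SUMMARY_COLUMNS = [
    "path", "total_bytes", "duplicate_bytes", "percent_duplicate_bytes",
    "total_files", "duplicate_files", "percent_duplicate_files",
    "mean_age_all_days", "mean_age_duplicates_days",
    "AgeScore", "DupScore", "OccScore", "OverallScore",
]


def parse_summary(lines):
    """Turn tab-delimited summary lines into a list of dicts."""
    rows = []
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != len(SUMMARY_COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(SUMMARY_COLUMNS, fields))
        try:
            row["duplicate_bytes"] = int(row["duplicate_bytes"])
        except ValueError:
            continue  # e.g. a header line, if the file has one
        rows.append(row)
    return rows


def largest_duplicate_folders(rows, n=5):
    """Return the n rows with the most duplicate bytes."""
    return sorted(rows, key=lambda r: r["duplicate_bytes"], reverse=True)[:n]
```

Because parse_summary accepts any iterable of lines, an open file handle for allusers.mimeo.summary.txt can be passed to it directly.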

KronaTSV and KronaHTML¶

KronaTSV is tab-delimited, with the first column showing the number of duplicate bytes and each subsequent column giving one folder level of the path.
ktImportText is then used to convert the KronaTSV to KronaHTML, which can be shared easily and only needs an HTML5-capable browser for viewing.
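
The KronaTSV layout can be illustrated with a small helper that turns a folder path and its duplicate bytes into one row. The function name and the example path are hypothetical, and the tool's real output may differ in detail.

```python
def krona_row(path, duplicate_bytes):
    """Format one KronaTSV line: bytes first, then one column per folder level."""
    parts = [p for p in path.split("/") if p]
    return "\t".join([str(duplicate_bytes), *parts])


# Prints the fields 2048, data, CCBR, projects separated by tabs.
print(krona_row("/data/CCBR/projects", 2048))
```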

Blamematrix¶

Rows are folders 1 level deeper than the "mindepth".
Columns are all individual usernames, plus an "allusers" column.
Only duplicate bytes are reported.