|Find directories by name, path, or contents|
|E.g. Find directories which have 'accounts' anywhere in the directory name or path|
|E.g. Find directories which contain a particular file|
|E.g. Find directories which contain two particular files|
|E.g. Find biggest directories by total file size|
|E.g. Find biggest directories by total number of files|
|E.g. Find the directories that have the most Excel files|
|E.g. Find duplicate directories|
Find directories by name, path, or contents
The simplest way to search for directories is by matching text in their name or path, by querying the files table. This table has an entry for every file and directory scanned, and includes these text fields:
name - file or directory name, e.g. 'mydirectory'
parentpath - path of the parent directory e.g. '/Users/johnsmith/documents/'
fullpath - parentpath plus directory or file name e.g '/Users/johnsmith/documents/mydirectory/'
type - 'f' for files, 'd' for directories
See a full list of fields by typing ?files at the Crab command line, or see the online Documentation for "Tables"
SELECT parentpath FROM files
WHERE name = 'myfile1.txt' and parentpath in (select parentpath from files where name='myfile2.txt');
SELECT parentpath FROM files f1 join files f2 on f1.parentpath = f2.parentpath
WHERE f1.name = 'myfile1.txt' and f2.name = 'myfile2.txt';
This query finds candidate duplicate directories by looking for directories that contain the same total file size, and same number of files.
SELECT p1.pp, p2.pp, p1.size/1e9
(SELECT parentpath as pp, sum(bytes) ||':'|| count(*) as sig, sum(bytes) as size FROM files
GROUP BY parentpath ) AS p1
(SELECT parentpath as pp, sum(bytes) ||':'|| count(*) as sig FROM files
GROUP BY parentpath ) AS p2
ON p1.sig = p2.sig and p1.pp < p2.pp
ORDER BY p1.size DESC