The simplest way to search for directories is by matching text in their name or path, by querying the files table. The files table has an entry for every file and directory scanned, and includes these text fields:

    name - file or directory name, e.g. 'mydirectory'
    parentpath - path of the parent directory e.g. 'C:\Users\johnsmith\documents\'
    fullpath - parentpath plus directory or file name e.g 'C:\Users\johnsmith\documents\mydirectory\'
    type - 'f' for files, 'd' for directories


See a full list of fields by typing ?files at the Crab command line, or see the online Documentation for "Tables"

E.g. Find directories which have 'accounts' anywhere in the directory name or path


SELECT fullpath FROM files
WHERE fullpath like '%accounts%' and type = 'd';

E.g. Find directories which contain a particular file


SELECT parentpath FROM files
WHERE name = 'myfile.txt';

E.g. Find directories which contain two particular files


SELECT parentpath FROM files
WHERE name = 'myfile1.txt' and parentpath in (select parentpath from files where name='myfile2.txt');

An alternative query which uses a join rather than a subselect:

SELECT parentpath FROM files f1 join files f2 on f1.parentpath = f2.parentpath
WHERE f1.name = 'myfile1.txt' and f2.name = 'myfile2.txt';

E.g. Find biggest directories by total file size


SELECT parentpath, sum(bytes)/1e9 as GB FROM files
GROUP BY parentpath ORDER BY sum(bytes) DESC LIMIT 5;

E.g. Find biggest directories by total number of files


SELECT parentpath, count(*) as num_files FROM files
GROUP BY parentpath ORDER BY num_files DESC LIMIT 5;

E.g. Find the directories that have the most Excel files


SELECT parentpath, count(*) as num_files FROM files
WHERE extension like '.xls%'
GROUP BY parentpath ORDER BY num_file DESC LIMIT 5;

E.g. Find duplicate directories


This query finds candidate duplicate directories by looking for directories that contain the same total file size, and number of files.

SELECT p1.pp, p2.pp, p1.size/1e9 
FROM
(SELECT parentpath as pp, sum(bytes) ||':'|| count(*) as sig, sum(bytes) as size FROM files
GROUP BY parentpath ) AS p1
JOIN
(SELECT parentpath as pp, sum(bytes) ||':'|| count(*) as sig FROM files
GROUP BY parentpath ) AS p2
ON p1.sig = p2.sig and p1.pp < p2.pp
ORDER BY p1.size DESC
LIMIT 10;

The sig field is the 'signature' for each directory, calculated from its total file size, sum(bytes), and number of files, count(*)

© 2017 Etia UK