To compare directory contents you need a signature for each file - i.e. criteria to use when deciding whether files are the same or different. E.g. will you consider files to match just because they have the same name, or do they need to be identical?

File signatures


  • If you only care about whether a file is present, use each file's name as its signature.


  • If you want different versions of the same file to count as different files, a quick and dirty approach is to concatenate the filename with its size to make the signature, ie name || bytes.


  • If you want to be sure that different versions of a file will count as different files, at the expense of having to scan each file's contents at run time, concatenate the filename with its SHA-1 checksum, ie name || sha1(fullpath)


  • If you care about the location of each file in a directory tree, you can include part of the path in the signature: Trim the root directory of the tree from the front of each fullpath, and treat this as the signature. (Crab's below() function is provided to facilitate this). To this truncated path you can optionally concatenate the file's size in bytes or SHA-1 checksum.

Files in directory 'dirA' that are missing from 'dirB', by name


This query recursively identifies file names below directory 'dirA' that aren't found anywhere in the tree below 'dirB'

SELECT fullpath FROM files 
WHERE fullpath like '%\dirA\%'
and type = 'f'
and name not in (SELECT name FROM files
WHERE fullpath like '%\dirB\%');

N.B. This assumes you've only got one 'dirA' and one 'dirB' directory in your scan data, otherwise you'll need to write out more of the path, e.g. replace

WHERE fullpath like '%\dirA\%'

with

WHERE fullpath like 'C:\somespecificpath\dirA\%'

Files in directory 'dirA' that are missing from 'dirB', by name and size


This query recursively identifies files below directory 'dirA' that aren't found anywhere in the tree below 'dirB', using file name and size as a signature for each file.

SELECT fullpath FROM files 
WHERE fullpath like '%\dirA\%'
and type = 'f'
and name||bytes not in (SELECT name||bytes FROM files
WHERE fullpath like '%\dirB\%');

N.B. This assumes you've only got one 'dirA' and one 'dirB' directory in your scan data, otherwise you'll need to write out more of the path

Files in directory 'dirA' that are missing from 'dirB', by name and SHA-1


This query recursively identifies files below directory 'dirA' that aren't found anywhere in the tree below 'dirB', using file name and SHA-1 as a signature for each file.

SELECT fullpath FROM files 
WHERE fullpath like '%\dirA\%'
and type = 'f'
and name||sha1(fullpath) not in (SELECT name||sha1(fullpath) FROM files
WHERE fullpath like '%\dirB\%');

N.B. This assumes you've only got one 'dirA' and one 'dirB' directory in your scan data, otherwise you'll need to write out more of the path

Recursively find files in the 'Music' directory on drive C: that are missing from the 'Music' directory on F:


First scan the relevant directories on both drives

C:>  crab  C:\Users\johnsmith\Music  F:\Music

In this query we use name || bytes as the file signature.

SELECT name||bytes FROM files
WHERE parentpath like 'C:\%\Music\%' and type = 'f'
and name||bytes not in (SELECT name||bytes FROM files WHERE parentpath like 'F:\Music\%');

The wildcards and like operators make this query recursive.

Files in dirA or dirB, but not both (recursive)


This query classifies paths as containing dirA or dirB, and uses that to count how many paths each file signature appears on. We're using name || bytes as the file signature.

SELECT 
max(fullpath),
name||bytes,
count(distinct case when fullpath like '%\dirA\%' then 'A'
when fullpath like '%\dirB\%' then 'B' end) dirCount
FROM files
WHERE (fullpath like '%\dirA\%' or fullpath like '%\dirB\%') and type = 'f'
GROUP BY name||bytes
HAVING dirCount < 2;

N.B. This assumes you've only got one 'dirA' and one 'dirB' directory in your scan data, otherwise you'll need to write out more of the path

Files that are in both dirA and dirB


This query classifies paths as containing dirA or dirB, and uses this to count how many paths each file signature appears on. We're using name || bytes as the file signature.

SELECT 
max(fullpath),
min(fullpath),
name||bytes,
count(distinct case when fullpath like '%\dirA\%' then 'A'
when fullpath like '%\dirB\%' then 'B' end) dirCount
FROM files
WHERE (fullpath like '%\dirA\%' or fullpath like '%\dirB\%') and type = 'f'
GROUP BY name||bytes
HAVING dirCount = 2;

N.B. This assumes you've only got one 'dirA' and one 'dirB' directory in your scan data, otherwise you'll need to write out more of the path

Duplicate directories


This query finds possible duplicate directories by looking for directories that contain the same total file size, and same number of files.


SELECT p1.pp, p2.pp, p1.size/1e9
FROM
(SELECT parentpath as pp, sum(bytes) ||':'|| count(*) as sig, sum(bytes) as size FROM files
GROUP BY parentpath ) AS p1
JOIN
(SELECT parentpath as pp, sum(bytes) ||':'|| count(*) as sig FROM files
GROUP BY parentpath ) AS p2
ON p1.sig = p2.sig and p1.pp < p2.pp
ORDER BY p1.size DESC
LIMIT 10;

Results are sorted by directory size, and the ten largest are returned.

© 2017 Etia UK