Find files by name, path or metadata


You can only find files and directories that have been indexed during a scan. Project directory trees can generally be scanned in a second or two; a complete hard drive takes around 30 minutes. Most people scan project directories as needed, and the whole drive once a week or so.

The simplest way to search for files is by matching text in their name or path, by querying the files table. The files table has an entry for every file and directory scanned, and includes these text fields:

    name - file or directory name, e.g. 'somefile.txt'
    extension - extension part of the file name e.g. '.txt'
    parentpath - path of the parent directory e.g. '/Users/johnsmith/documents/'
    fullpath - parentpath plus filename e.g '/Users/johnsmith/documents/somefile.txt'
    type - 'f' for files, 'd' for directories

See a full list of fields by typing ?files at the Crab command line, or see the online Documentation for "Data Tables"

E.g. Find files or directories which have 'accounts' anywhere in the name or path


SELECT fullpath FROM files
WHERE fullpath like '%accounts%';

The like operator provides wildcard pattern matching, and ignores case, so this query will match 'Accounts' and 'ACCOUNTS' too. % is a wildcard that matches any string of zero or more characters.


Add
type = 'f' to the WHERE clause to exclude directories from the results

SELECT fullpath FROM files
WHERE fullpath like '%accounts%'
and type = 'f';


Use a LIMIT clause if the query returns too many results.

SELECT fullpath FROM files
WHERE fullpath like '%accounts%'
LIMIT 20;

E.g. Find Excel files which have 'accounts' anywhere in the filename or the path


SELECT fullpath FROM files
WHERE fullpath like '%accounts%' and extension like '.xls%';

E.g. Find Word files in the 'Documentation' directory


SELECT fullpath FROM files
WHERE parentpath = '/somepath/Documentation/'
and extension in ('.doc', '.docx');

This query searches for files with a particular parentpath, no like operator or wildcards, so it isn't a recursive search. It matches only direct children of the 'Documentation' directory.

E.g. Recursively list all Word files below the 'Documentation' directory


SELECT fullpath FROM files
WHERE parentpath like '/somepath/Documentation/%'
and extension in ('.doc', '.docx');

This query searches for files whose parentpath starts with /somepath/Documentation/, so this is a recursive search. It matches files in the Documentation directory itself, or in any subdirectories.

E.g. Recursively find files in 'This' directory missing from 'That' directory


SELECT name FROM files
WHERE fullpath like '/somepath/This/%'
and name not in (SELECT name FROM files WHERE fullpath like '/somepath2/That/%');


Only filenames are compared

The query reports files
and directories that are missing because there is no filter on type.

The wildcards and
like operators make this query recursive.

E.g. Find files with a % symbol in the name


SELECT fullpath FROM files WHERE name like '%^%%' escape '^';

Because % is a SQL wildcard it has to be escaped. Here ^ is nominated as the escape character, and it's then used to specify that the middle % symbol is to be interpreted as a literal character instead of a wildcard.

E.g. Recursively find disk image files bigger than 100MB in the 'Downloads' directory


SELECT fullpath FROM files
WHERE parentpath like '/somepath/Downloads/%'
and extension = '.dmg' and bytes > 100e6;

E.g. Find duplicate files bigger than 100MB


SELECT f1.bytes/1e6 as MB, f1.fullpath, f2.fullpath
FROM files f1 join files f2 on f1.fileid > f2.fileid
and MB > 100
and f1.name = f2.name and sha1(f1.fullpath) = sha1(f2.fullpath);

fileid is a unique id for each file scanned, it's the rownumber in the files table.

The query does a self-join from the files table to the files table but only joins to files with a larger fileid. This is to avoid a file matching itself, and to avoid file1 matching file2, and then file2 also matching file1. The query identifies files that are duplicates by matching names, and if the names match then it compares SHA-1 values.
sha1() is a Crab function that calculates the SHA-1 checksum for a file given its fullpath. SHA-1 is a kind of fingerprint for files, and matching SHA-1 is a very good indicator that files have the same contents. File sizes are converted from bytes to megabytes by dividing by 1e6.

E.g. Case sensitive: Find files which have 'Accounts' anywhere in the filename or the path


The glob operator provides case-sensitive pattern matching. Use asterisk wildcard * to match zero or more characters and a dot . to match any single character


SELECT fullpath FROM files
WHERE fullpath glob '*Accounts*';

E.g. Find files whose names consist of between 5 and 10 hexadecimal digits


SELECT fullpath FROM files
WHERE upper(name) match '^[A-F0-9]{5,10}$' and type = 'f';


This query uses the type field to restrict results to files (type = 'f').

The
match operator provides case sensitive matching using regex patterns. Here '^ and '$' are used to anchor the start of the pattern to the start of the matched string, and the end of the pattern to the end of the matched string - i.e. to match names that consist of nothing more than five to ten hexadecimal digits.

The
upper() function converts filenames to uppercase before testing, so that lowercase hex digits will also be matched.

E.g. Find any pdf files that aren't also somewhere on Dropbox


SELECT fullpath FROM files
WHERE extension = '.pdf' and parentpath not like '%/Dropbox/%'
and name||bytes not in (SELECT name||bytes FROM files WHERE extension = '.pdf' and parentpath like '%/Dropbox/%');


|| is the string concatenation operator. Here we're concatenating the filename and size to make a signature for each .pdf file found, and comparing these with signatures of .pdf files found recursively below the 'Dropbox' directory

E.g. Date functions: Find python files modified since midnight yesterday


SELECT fullpath FROM files
WHERE extension = '.py'
and modified >= date('now', 'start of day', '-1 day');

No restriction on path here, modified date is tested for all Python files that were scanned.

Find files with specific contents


To query file contents use the fileslines table, which has one row for each line of text in the files scanned. It has the same fields as the files table, but also a data field which contains the text, and a linenumber field.

This is a virtual table, so it reads contents of files at query run time, and does not replicate file contents in the Crab database. File details in the WHERE clause determine which files are opened and read.

Set the right encoding

Crab's default settings are configured for contents of text files with UTF-8 or ASCII characters. To change %encoding for files with different encoding, or occasional exotic characters see the Use Case "Read and Write File Contents"

E.g. List every Python file in directory 'MyProject' that has a line containing 'FIXME'


SELECT DISTINCT fullpath FROM fileslines
WHERE parentpath = '/somepath/MyProject/' and extension = '.py'
and data like '%FIXME%';


Not recursive.

We could take this further and use the MATCH operator with regex to check for a comment line, but the simple version works well enough.

E.g. Recursively count 'FIXME' lines in each Python file below directory 'MyProject'


SELECT fullpath, count(*) FROM fileslines
WHERE parentpath like '/somepath/MyProject/%' and extension = '.py'
and data like '%FIXME%'
GROUP BY fullpath ORDER BY count(*) DESC;

E.g. Total of Python 'FIXME' lines for every directory below the 'anaconda' directory


SELECT parentpath, count(*) FROM fileslines
WHERE parentpath like '/somepath/anaconda/%' and extension = '.py'
and data like '%FIXME%'
GROUP BY parentpath ORDER BY count(*) DESC;

Tips


  • Path abbreviations such as ~ . and .. aren't useful in describing paths here, because we're simply matching text. Use SQL string pattern matching instead.


  • Use ctrl+C to cancel a running query. Sometimes this will exit Crab, in which case just press up arrow at the command line to recall the command you previously used to start Crab. Edit the line to remove any scan path, Crab will restart, loading the data you were working with. If you made changes to settings such as %mode and %encoding, you'll have to redo these as they are not saved across sessions.


  • To find files by name or path use wildcard string matching with like, glob or match on the filename, fullpath or parentpath

    • like is not case sensitive. Use wildcard % to match zero or more characters, wildcard _ to match one character

    • glob is case sensitive. Use wildcard * to match zero or more characters, wildcard . to match one character.

    • match is case sensitive. Use with regex patterns.


  • Queries will run faster if you avoid wildcards on the left hand side of search patterns as this allows the SQL engine to use indexes

© 2017 Etia UK