Do I have to scan before running a query?

  • Crab populates the files table in its SQLite database by scanning the filesystem, to query this data you have to scan first. File contents in the fileslines table are read directly from file at query run time.

  • You can provide one or more paths at start up, for Crab to scan; or you can start Crab without a scan path to query data from an earlier scan.

  • You can specify a scan data file with the -db switch when starting Crab. To store scan data there, launch Crab with a path to scan. To query the scan data it contains, launch Crab without providing a path to scan

Why doesn't Crab use the system index?

  • This way you have control over exactly what is scanned and when

  • No risk that files will be missed

  • Can turn off system indexing: No risk that system will be unexpectedly slowed by system indexing process

  • Can avoid the need to specify path in the WHERE clause, by scanning only relevant directories.

Does Crab's index track file changes?

  • No, but the filesystem is scanned at thousands of files / second. So to rescan the relevant directories is generally just a moments work

  • A pathexists() function is available to check that files and directories are present at query run time, if querying an old scan.

  • Most people scan project directories on-demand in the default, unnamed database; with a whole drive scan once a week, stored in a named file.

What are Crab's processing limits? How big, how many?

  • The total number of commands that can be executed, or rows written, by a query is limited by memory constraints.

  • On macOS we've tested with hundreds of millions of rows without any problems.

  • On Windows the limit is currently much lower, around one million rows: On that platform Crab is currently a 32bit app.

Do I need a commercial license?

  • If you are being paid for the time you spending using Crab, you need a commercial license. If not, you don't.


Path abbreviations such as ~ . and .. won't match paths in Crab

Path abbreviations such as ~ . and .. aren't useful in describing paths in Crab because we're simply matching text. Use SQL string pattern matching with wildcards instead.

Use ctrl+C to cancel a running query

Sometimes this will exit Crab, in which case just press up arrow at the command line to recall the command you previously used to start Crab. Edit the line to remove any scan path, Crab will restart loading the data you were working with. If you made changes to settings such as %mode and %encoding, you'll have to redo these as they are not saved across sessions.

To find files by name or path use wildcard string matching

To find files by name or path use wildcard string matching with like, glob or match on the filename, fullpath or parentpath

  • LIKE is not case sensitive. Use wildcard % to match zero or more characters, wildcard _ to match one character

  • GLOB is case sensitive. Use wildcard * to match zero or more characters, wildcard . to match one character.

  • MATCH is case sensitive. Use with regex patterns.

Soft links aren't followed when scanning

Soft links are files which act as a placeholder for a file or directory elsewhere on the disk. This allows the same directory to appear in multiple places without taking up extra disk space.

On Mac, when Crab scans a soft link it records the soft link but not the contents of the target. This is different from Crab's behaviour on Windows where it follows the link.

You can identify soft links from the mode field in the files table . For directories the mode starts with a lowercase L, 'l', directories have a mode that starts with a 'd', and files have a mode that starts with a hyphen '-'.

Mounted disks are not scanned by default

A recursive scan will not include mounted filesystems by default. To scan these too, use the ‑mount option when launching Crab.

If you do it often, make a Crab script

If you have a set of queries you run often, put them together into a Crab script and run it with the %read command, or the -init start up option.

Crab scripts are just text files containing one or more queries (each terminated by a semicolon), and/or Crab commands (no semicolons). We recommend a .crab extension, but it's not compulsory.

Use appropriate encoding and error handler when querying filelines table

Crab's default settings are configured for UTF-8 and ASCII text files, any non UTF-8 character will cause your query to skip the rest of the file, so as to exclude contents of binary files from query results.

To search files that have occasional exotic characters change the %encoding setting from utf8:skipfile to utf8:ignore (filters out invalid characters) or utf8:replace (replaces invalid characters by ) e.g.

    %encoding utf8:ignore

For further info see the encoding command reference in the "Commands and settings" page of the "Documentation" menu

Change layout of query results

Default layout of query results from the interactive Crab prompt is dict. This gives results as a dictionary of fieldname:value pairs. When running batch queries list mode is the default format (comma delimited lists). You can choose the query result layout you prefer:

  • Each field on a line of its own

    %mode line

  • One column per field truncated to 10 chars; use the %width command to change number of chars.

    %mode column

  • Comma delimited output fields; use the %separator command to change the delimiter.

    %mode list

To return to the default:

  • Dictionary output mode

    %mode dict

column and list modes have a header row that can be toggled on or off with the %header command

Suppress query output to screen

By default exec(), write() and writeln() send lots of output to the screen. In the case of exec() it outputs every command executed along with the command's output. In the case of write and writeln() they copy to the screen what is written to file. If you are running these over many rows, you may prefer to turn off screen output, or redirect to file.

To discard the output use the following command before running the query:

%output /dev/null

Any error messages will still go to the Terminal window, as will subsequent CRAB> prompts

To switch output back to the screen do this:

%output stdout

If you want to keep a log of the output rather than discard it, use a filename in place of /dev/null e.g.

%output '/somepath/MyOutputLog.txt'

Can specify multiple scan paths

When launching Crab you can use specify multiple files and/or directories to scan, e.g.

$ crab dir1 dir2 dir3

Will recursively scan all files and directories in dir1, dir2 and dir3. This allows you to run queries that compare files on separate drives or disjoint areas of the filesystem.

Similarly you can specify multiple files to scan; then any query on the fileslines table will run over their combined contents


$ crab errlog1.txt errlog2.txt

Can use wildcards to specify scan paths

When launching Crab you can use wildcards to specify the files and/or directories to scan. The wildcard filters aren't recursive; the wildcards are unpacked and then recursive scans are started on the objects implied, and all objects inside them.


$ crab a*

Will recursively scan all files and directories in the current directory whose names begin with the letter a. By scanning only the relevant files you may be able to avoid the need for additional filters on filename or path in the WHERE clause of queries you run.

E.g to analyze all .log files in the current directory

$ crab *.log

Now any queries will search only the content of .log files in the current directory and you won't need a restriction on filename or path in your WHERE clause.

SELECT name, matchcount(data,',') numCommas, count(*) numLines
FROM fileslines
GROUP BY name, numCommas;

N.B. On Windows wildcards aren't currently available, but you can provide a list of files or directories to scan.

Test exec() queries by running them with the echo command

Prior to running a query that will modify the filesystem you should prefix your exec() function call with an echo command to check exactly which commands will be run on which files. E.g.

SELECT exec('echo', 'unzip', '-n', fullpath,'-d','/Users/johnsmith/Target Dir/')
FROM files
WHERE parentpath like '/Users/johnsmith/Source Dir/%'
and extension = '.zip';

Only when you're happy that the right files are being processed should you remove the 'echo':

SELECT exec('unzip', '-n', fullpath,'-d','/Users/johnsmith/Target Dir/')
FROM files
WHERE parentpath like '/Users/johnsmith/Source Dir/%'
and extension = '.zip';

Check if files are still present at query runtime with pathexists()

Crab doesn't track files that are moved or deleted after a scan. To check that a file or directory is still present at query run time use the pathexists() function.

E.g. to unzip a bunch of files when some have been moved or deleted since the last scan:

SELECT exec('unzip', '-n', fullpath,'-d','/Users/johnsmith/Target Dir/')
FROM files
WHERE parentpath like '/Users/johnsmith/Source Dir/%'
and extension = '.zip'
and pathexists(fullpath);

Use pipes or I/O redirection with exec()

If you need to use pipes or I/O redirection on macOS, use exec() with only one argument. exec() commands with one argument are run in a shell. In this case you will have to concatenate together the command, its options, its arguments and everything else into one string. You will also need to embed single quotes around names and paths to avoid problems with any that contain spaces, e.g.

SELECT exec("ls '" || fullpath || "' >> DirMyFiles.txt")
FROM files;

To make this a little easier to read, Crab has a function called q() which puts single quotes around its argument.

SELECT exec('ls ' || q(fullpath) || ' >> DirMyFiles.txt')
FROM files;

Prefer single quotes to double quotes in queries

When you specify a string literal in a query, use single quote marks wherever possible; if double quotes are used to delimit a field name, it will be evaluated.

It's convenient to use double quotes to enclose strings that contain single quotes, e.g. when a single quote is used as an apostrophe.

WHERE name LIKE "%Peters-MacBook-Air's conflicted copy 2014-05-19%"

In general this won't cause problems, but be aware that SQLite will first try to interpret a string in double quotes as a field name, a table name, or a keyword. If the string contents match one of these you will get unexpected results. For example "name" would be interpreted as a field in the file table, not a literal string.

A better way to enter a single quote mark is by doubling them up

WHERE name LIKE '%Peters-MacBook-Air''s conflicted copy 2014-05-19%'

Prefer single quotes to double quotes when specifying paths in Bash

Bash command line: Prefer single quotes to double quotes when specifying scan path arguments to crab, as strings in double quotes which contain Bash shell variables will be expanded before being passed to Crab.

Escape LIKE operator wildcards to match underscore and percent symbols

The _ and % symbols are wildcards for the LIKE operator. To search for a string that includes an exact match for either of them, you have to use an escape character; to do this you have to tell SQLite which character to use as an escape character. E.g. to match an underscore:

SELECT name, linenumber,data FROM fileslines
WHERE extension = '.py' AND data LIKE '%command\_%' ESCAPE '\';

Copy text files with write and writeln

• If you copy a non-native file using writeln, the copy will have native line endings for the platform you're working on,
ie Mac+Unix: ( LF)   Windows: (CR + LF)
• If you copy a text file using writeln, a newline will be added at the end of the last line of data even if there wasn't one there before.
• If you copy a text file using write, all the newlines will be eaten

Faster queries

Avoid using wildcards on the left side of a string that you're matching, for example in a path. '%/myproject/%' is much slower to match than '/Users/johnsmith/%'
This is because wildcards on the left of a string prevent the SQLite engine from using indexes.

Queries against fileslines can be sped up significantly by putting a restriction on the linenumber field: once the maximum linenumber specified has been reached, further reading of the file is not necessary.


SELECT name, upper(groupn(data, 'encoding:\s*(\S.*)\s*',1)) enc FROM fileslines
WHERE type = 'f' AND data MATCH 'encoding' AND linenumber < 50;

Joins on linenumber are slow

Joins on the fileslines table to itself - from one file to another - on linenumber are currently slow, avoid them if you can.

Crab script file comments

To enter a comment in a Crab script file, use the prefix %--, the remainder of the line will be ignored.

Here is an example Crab script that makes use of comments:


%-- change output format to list output with no headers and a pipe separator
%mode list
%header off
%separator '|'

%-- remove old log file
! rm -f importlog.txt

%-- send query output to log file
%output importlog.txt

%-- find the files with errors
SELECT fullpath,linenumber,data FROM fileslines
WHERE parentpath = '/Users/peterdalloz/Dropbox/PYTHONDROPBOX/sqlsys/'
AND data LIKE '%error%' AND EXTENSION NOT IN ('.crdb','.c')
AND pathexists(fullpath);

%-- reset output to screen
%output stdout

© 2017 Etia UK