FAQ

Do I have to scan before running a query?

  • Crab populates the files table in its SQLite database by scanning the filesystem, to query this data you have to scan first. File contents in the fileslines table are read directly from file at query run time.

  • You can provide one or more paths at start up, for Crab to scan; or you can start Crab without a scan path to query data from an earlier scan.

  • You can specify a scan data file with the -db switch when starting Crab. To store scan data there, launch Crab with a path to scan. To query the scan data it contains, launch Crab without providing a path to scan


Why doesn't Crab use the system index?

  • This way you have control over exactly what is scanned and when

  • No risk that files will be missed

  • Can turn off system indexing: No risk that system will be unexpectedly slowed by system indexing process

  • Can avoid the need to specify path in the WHERE clause, by scanning only relevant directories.



Does Crab's index track file changes?

  • No, but the filesystem is scanned at thousands of files / second. So to rescan the relevant directories is generally just a moments work

  • A pathexists() function is available to check that files and directories are present at query run time, if querying an old scan.

  • Most people scan project directories on-demand in the default, unnamed database; with a whole drive scan once a week, stored in a named file.



What are Crab's processing limits? How big, how many?

  • The total number of commands that queries can execute or rows they can write, is limited by memory constraints.

  • On macOS we've tested with hundreds of millions of rows without any problems.

  • On Windows the limit is currently around one million rows: Crab for Windows is currently a 32bit app.



Do I need a commercial license?

  • If you are being paid for the time you spending using Crab, you need a commercial license. If not, you don't.

Tips

Path abbreviations such as . and .. won't match paths in Crab

Path abbreviations such as . and .. aren't useful in describing paths in Crab because we're simply matching text. Use SQL string pattern matching with wildcards instead.


Use ctrl+C or ctrl+Break to cancel a running scan or query

Sometimes this will exit Crab, in which case just press up arrow at the command line to recall the command you previously used to start Crab. Edit the line to remove any scan path, Crab will restart loading the data you were working with. If you made changes to settings such as %mode and %encoding, you'll have to redo these as they are not saved across sessions.


To find files by name or path use wildcard string matching

To find files by name or path use wildcard string matching with like, glob or match on the filename, fullpath or parentpath

  • LIKE is not case sensitive. Use wildcard % to match zero or more characters, wildcard _ to match one character

  • GLOB is case sensitive. Use wildcard * to match zero or more characters, wildcard . to match one character.

  • MATCH is case sensitive. Use with regex patterns.


Soft links are followed when scanning

Soft links (also known as "symbolic links") are files which act as a placeholder for a file or directory elsewhere on the disk. This allows the same directory to appear in multiple places without taking up extra disk space.

On Windows, when Crab scans a soft link, it records the contents of the linked object. This is different from Mac where soft links are not followed; the link object itself is scanned.


Mounted folders are not scanned by default

A recursive scan will not include mounted folders by default. To scan these too, use the ‑mount option when launching Crab.


If you do it often, make a Crab script

If you have a set of queries you run often, put them together into a Crab script and run it with the %read command, or the -init start up option.

Crab scripts are just text files containing one or more queries (each terminated by a semicolon), and/or Crab commands (no semicolons). We recommend a .crab extension, but it's not compulsory.


Use appropriate encoding and error handler when querying filelines table

Crab's default settings are configured for UTF-8 and ASCII text files, any non UTF-8 character will cause your query to skip the rest of the file, so as to exclude contents of binary files from query results.

To search files that have occasional exotic characters change the %encoding setting from utf8:skipfile to utf8:ignore (filters out invalid characters) or utf8:replace (replaces invalid characters by ) e.g.

    %encoding utf8:ignore

For further info see the encoding command reference in the "Commands and settings" page of the "Documentation" menu


Change layout of query results

Default layout of query results from the interactive Crab prompt is dict. This gives results as a dictionary of fieldname:value pairs. When running batch queries list mode is the default format (comma delimited lists). You can choose the query result layout you prefer:

  • Each field on a line of its own

    %mode line

  • One column per field truncated to 10 chars; use the %width command to change number of chars.

    %mode column

  • Comma delimited output fields; use the %separator command to change the delimiter.

    %mode list

To return to the default:

  • Dictionary output mode

    %mode dict

column and list modes have a header row that can be toggled on or off with the %header command


Suppress query output to screen

By default exec(), write() and writeln() send lots of output to the screen. In the case of exec() it outputs every command executed along with the command's output. In the case of write and writeln() they copy to the screen what is written to file. If you are running these over many rows, you may prefer to turn off screen output, or redirect to file.

To discard the output use the following command before running the query:

%output NUL


Any error messages will still go to the Terminal window, as will subsequent CRAB> prompts

To switch output back to the screen do this:

%output stdout


If you want to keep a log of the output rather than discard it, use a filename in place of /dev/null e.g.

%output 'C:\somepath\MyLogFile.txt'



Can specify multiple scan paths

When launching Crab you can use specify multiple files and/or directories to scan, e.g.


C:\> crab E:\dir1 F:\dir2 .\dir3


Will recursively scan all files and directories in dir1, dir2 and dir3. This allows you to run queries that compare files on separate drives or disjoint areas of the filesystem.

Similarly you can specify multiple files to scan; then any query on the fileslines table will run over their combined contents, e.g.


C:\> crab errlog1.txt errolog2.txt


N.B. On Windows, Crab doesn't yet support wildcards to specify scan paths.


Test exec() queries by running them with the echo command

Prior to running a query that will modify the filesystem you should prefix your exec() function call with an echo command to check exactly which commands will be run on which files. E.g.

SELECT exec('echo', '7za.exe', 'e', '-y', fullpath,'‑oC:\Users\johnsmith\Target Dir\')
FROM files
WHERE parentpath like 'C:\Users\johnsmith\Source Dir\%'
and extension = '.zip';


Only when you're happy that the right files are being processed should you remove the 'echo'

SELECT exec('7za.exe', 'e', '-y', fullpath,'‑oC:\Users\johnsmith\Target Dir\')
FROM files
WHERE parentpath like 'C:\Users\johnsmith\Source Dir\%'
and extension = '.zip';



Check if files are still present at query runtime with pathexists()

Crab doesn't track files that are moved or deleted after a scan. To check that a file or directory is still present at query run time use the pathexists() function.

E.g. to unzip a bunch of files when some have been moved or deleted since the last scan:

SELECT exec('7za.exe', 'e', '-y', fullpath,'‑oC:\Users\johnsmith\Target Dir\')
FROM files
WHERE parentpath like 'C:\Users\johnsmith\Source Dir\%'
and extension = '.zip'
and pathexists(fullpath);



Use pipes or I/O redirection with exec()

Because exec() always runs in a shell on Windows, you can use pipes and I/O redirection. If you build strings to pass to the shell you must concatenate double quotes around names and paths to avoid problems with any that contain spaces. E.g.

SELECT exec('dir "'|| fullpath || '" >> DirMyFiles.txt')
FROM files;



Prefer single quotes to double quotes in queries

When you specify a string literal in a query, use single quote marks wherever possible; if double quotes are used to delimit a field name, it will be evaluated.
Double quotes are sometimes used for strings that contain single quotes, e.g. when a single quote is used as an apostrophe

SELECT * FROM files
WHERE name LIKE "%Peter's conflicted copy 2014-05-19%"

But tread carefully - SQLite will first try to interpret a string in double quotes as a field name, a table name, or a keyword. If the string contents match one of these you will get unexpected results. For example "name" would be interpreted as a field in the file table, not a literal string.

A better way to enter a single quote mark is by doubling them up

SELECT * FROM files
WHERE name LIKE '%Peter''s conflicted copy 2014-05-19%'



Use double quotes to specify paths at Windows command line

Windows command line: If paths contain spaces use double quotes when specifying scan path arguments to Crab, e.g.

  C:\>  crab   "C:\Users\johnsmith\My Project"




Escape LIKE operator wildcards to match underscore and percent symbols

The _ and % symbols are wildcards for the LIKE operator. To search for a string that includes an exact match for either of them, you have to use an escape character; to do this you have to tell SQLite which character to use as an escape character. E.g. to match an underscore:

SELECT name, linenumber,data FROM fileslines
WHERE extension = '.py' AND data LIKE '%command\_%' ESCAPE '\';



Copy text files with write and writeln

• If you copy a non-native file using writeln, the copy will have native line endings for the platform you're working on,
ie Mac+Unix: ( LF)   Windows: (CR + LF)
• If you copy a text file using writeln, a newline will be added at the end of the last line of data even if there wasn't one there before.
• If you copy a text file using write, all the newlines will be eaten


Faster queries

Avoid using wildcards on the left side of a string that you're matching. E.g. in a path; '%\myproject\%' is much slower to match than 'C:\Users\myproject\%' because the SQLite engine cannot use any indexes.

Queries against fileslines can be sped up significantly by putting a restriction on the linenumber field: once the maximum linenumber specified has been reached, further reading of the file is not necessary.

E.g.

SELECT name, upper(groupn(data, 'encoding:\s*(\S.*)\s*',1)) enc FROM fileslines
WHERE type = 'f' AND data MATCH 'encoding' AND linenumber < 50;



Joins on linenumber are slow

Joins on the fileslines table to itself - from one file to another - on linenumber are currently slow, avoid them if you can.


Crab script file comments

To enter a comment in a Crab script file, use the prefix %--, the remainder of the line will be ignored.

Here is an example Crab script that makes use of comments:


%-- IMPORT CHECK

%-- change output format to list output with no headers and a pipe separator
%mode list
%header off
%separator '|'

%-- remove old log file
! del /q importlog.txt

%-- send query output to log file
%output importlog.txt

%-- find the files with errors
SELECT fullpath,linenumber,data FROM fileslines
WHERE parentpath = 'C:\Users\peterdalloz\Dropbox\PYTHONDROPBOX\sqlsys\'
AND data LIKE '%error%' AND EXTENSION NOT IN ('.crdb','.c')
AND pathexists(fullpath);

%-- reset output to screen
%output stdout

© 2017 Etia UK