Crab - Filesystem SQL for macOS and Windows

Query and process files with SQL, just like data in a database

Shell programming languages are unfit for processing files.


Unless all your files are in one directory, Bash, Windows cmd, and PowerShell programming are likely the wrong tool for the job. Looping through files - or piping find and xarg results - forces you to rescan the filesystem for each pass, has problems for many corner cases (including spaces in filenames and premature wildcard expansion), runs slowly because they don't use indexes, is tortuous when combining multiple search criteria, basic arithmetic on results is painful, and comparing one set of files with another is a headache.


Meet Crab...

  • SQL's set based logic: Ideal for selecting groups of files for processing, for comparing one set of files with another, and for analyzing file stats

  • Use the LIKE operator: Find files with simple text pattern matching. No weird corner cases due to spaces in filenames, shell special characters, overlong path names, argument list too long errors, control characters, ...

  • Run Operating System commands on query results. No need to pipe anything, it's a SQL function. Use ORDER BY to specify the order files are processed.

  • Use SQL in Windows batch files, or run SQL scripts for repeatable queries and file processing

  • Write results to file or (if you have to..) pipe output to other commands

  • Combine search criteria using AND, OR and NOT. This is the 21st century!

  • No new language to learn or support, it's SQL

  • No need to install, no need for admin rights, just unzip and run

Move the shell sideways

Type 'crab' at the command line to get a Crab prompt powered by SQL. Crab scans the directories you specify and gives you an indexed table, files, that holds information about the files and directories, and a virtual table, fileslines, that gives you SQL access to file contents.

Scans run at several thousand files per second, so scanning a bunch of project or data directories usually takes no more than a second or two. The whole of a spinning disk 1TB drive takes about 30 minutes, SSDs are much faster. Most people scan project directories on demand, and the whole disk once a week or so - you can have multiple scan databases.

Just try these in Bash or PowerShell!

Find files where the filename or path includes 'sublime' but not 'backup'

select fullpath from files 
where fullpath like '%sublime%' and fullpath not like '%backup%';

Explanation: The files table has data for all files and directories scanned. The fullpath column is the absolute path including the filename, e.g '/Library/Fonts/Hei.ttf'. The LIKE operator matches strings, and the wildcard '%' matches anything including a zero length string. SQLite queries end with a semicolon.

Find files in directory dirA that are not in directory dirB

select name from files 
where fullpath like '%/dirA/%'
and name not in (select name from files where fullpath like '%/dirB/%');

Explanation: The subquery returns names of files that have directory dirB on the path. The main query returns files that have directory dirA on the path and whose names are not in the subquery results.

Find the five biggest files

select fullpath, bytes from files order by bytes desc limit 5;

Explanation: The bytes column is the file size in bytes. ORDER BY sorts the files by size in descending order (DESC), so the largest files are listed first. LIMIT 5 restricts results to the first five rows. If you come from SQL Server, this is equivalent to SELECT TOP 5. In general SQLite SQL is very similar to other flavors of SQL, including SQLServer.

Find the five biggest directories below the project folder, by total file size in GB

select parentpath, sum(bytes)/1e9 as GB from files 
where parentpath like '%/project/%'
group by parentpath order by sum(bytes) desc limit 5;

Explanation: The parentpath column records the parent directory for each file, e.g. the file '/Library/Fonts/Hei.ttf' has parentpath '/Library/Fonts/'. This query includes only files in directories that are below the project directory by string matching the parentpath. The query adds up the bytes for each file and groups the results to give a total for each parent directory. Results are sorted to give the largest parent directories first, and the results are limited to five rows. The total bytes for each directory is divided by 1e9 to convert from bytes to gigabytes.

List duplicate files bigger than 100MB

select f1.bytes/1e6 as MB, f1.fullpath, f2.fullpath 
from files f1 join files f2 on f1.fileid > f2.fileid
and MB > 100
and f1.name = f2.name and sha1(f1.fullpath) = sha1(f2.fullpath);

Explanation: The name column is the filename, e.g the file '/Library/Fonts/Hei.ttf' has name 'Hei.ttf'. fileid is the rownumber of the file in the files table, so it's a unique id for each file scanned.

The query does a self-join from the files table to the files table but only joins to files with a larger fileid. This is to avoid a file matching itself, and to avoid file1 matching file2, and then file2 also matching file1. The query identifies files that are duplicates by matching names, and if the names match then it compares SHA-1 values. sha1 is a Crab function that calculates the SHA-1 checksum for a file given its fullpath. SHA-1 is a kind of fingerprint for files, and matching SHA1 is a very good indicator that files have the same contents. File sizes are converted from bytes to megabytes by dividing by 1e6.

Open in Finder the directory that has the most Excel spreadsheets

select exec('open', parentpath), count(*) from files 
where extension like '.xls%'
group by parentpath order by count(*) desc limit 1;

Explanation: Crab's exec function executes operating system commands on data returned in query results, once for each result row. The extension column records the extension of each file or directory e.g the file '/Library/Fonts/Hei.ttf' has extension '.ttf'.

The query counts the files where the extension starts with '.xls' (so it matches '.xls' and '.xlsx' files), and groups the count by parent directory. Then it sorts the parent directories by this count and returns just one row - the parent directory with the most Excel files. Finally the query runs the 'open' command on this directory, which opens it in Finder.

Remove spaces from all directory names below a specific directory

select exec('mv', fullpath, parentpath||replace(name,' ','')) from files 
where fullpath like '/Users/OneDirection/%'
and type = 'd'
and not pathexists(parentpath||replace(name,' ',''))
order by depth desc;

Explanation: The type column has a 'd' for directories and an 'f' for files. This query uses pattern matching to pick up only directories that are below /Users/OneDirection/. It uses Crab's pathexists function to check that if we remove spaces from the directory name we will not collide with an object that already has that name. Finally it uses exec to invoke the OS X mv command to rename the file at fullpath with the name we provide. The new name is constructed using the SQLite string concatenation operator, ||, and replace, the SQLite string function. The order by clause forces subdirectories to be processed before their parents.

Back up any python files modified since midnight yesterday

select exec('cp', fullpath, '/Users/DennisNedry/ParkLife/Backups/WK42/'||name||'-'||fileid)
from files where extension = '.py'
and DATE(modified) >= DATE('now', 'start of day', '-1 day')

Explanation: This query filters files by their extension, and their modified date using the SQLite DATE function. Any file meeting the criteria for backup is returned as a result row, and the exec function operates on these using the OS X cp command to copy it from its current location, fullpath, to its destination. The destination path is made by string concantenation: the target directory plus the filename plus a suffix made from the fileid. We add this suffix to avoid problems if backing up two files with the same name.

Play a random Jack Johnson song

select exec('open', fullpath) from files 
where fullpath like '%jack%' and fullpath like '%jo%son%' and extension = '.mp3'
order by random() limit 1;

Explanation: This query returns .mp3 files which have jack and johnson or jonson somewhere in the filename or the directory name. The results are sorted randomly and the first row is returned. The exec function runs the OS X open command on the fullpath, which opens the file and plays it in iTunes, or whichever app you have associated with .mp3 files.

Report TODO and FIXME in any .c files

select fullpath, linenumber, data from fileslines 
where parentpath like '/Users/GN/HL3/%' and extension = '.c'
and (data like '%TODO%' or data like '%FIXME%');

Explanation: The fileslines table gives searchable access to current contents of files that have been scanned. Its columns are data, where each entry holds the text from one line of the file, and linenumber, which is a counter that starts at 1 for the first line of each file. It's a virtual table and doesn't duplicate or index file contents. fileslines also has columns from the files table, and you can use these to filter which files to search inside.

In this example we are filtering by parentpath and extension. Only files below the HL3 directory with a .c extension will have their contents searched. Lines matching 'TODO' or 'FIXME' will be returned, together with their linenumber, and the fullpath of the file.

List the even numbered lines of a file

select data from fileslines 
where fullpath = '/Users/Nathan/Ava.lisp' and linenumber % 2 = 0;

Explanation: This query filters fileslines by linenumber. Here we report lines where the linenumber gives zero remainder when divided by two.

Copy files that contain specific text

select exec('cp', fullpath, '/Users/petergibbons/tpsreports/'||name||'-'||fileid) from fileslines 
where parentpath = '/Users/petergibbons/LR427/' and data like '%cents%'
group by fullpath;

Explanation: This query searches files in directory LR427 for lines that contain the text 'cents'. Each line that matches will return the fullpath of the file that contains it. We group by fullpath so we have only one result row for each file that matches, then Crab's exec function copies these files to the tpsreports directory, appending the fileid to avoid name collisions.

Report any lines in newfile.py that aren't in oldfile.py

select f1.linenumber, f1.data from fileslines f1 
where f1.name = 'newfile.py'
and f1.data not in (select f2.data from fileslines f2 where f2.name = 'oldfile.py');

Explanation: This query uses a subselect to return lines from file oldfile.py, then it reports lines from newfile.py which aren't in there.

Sort the lines of a file and write them to a new file

select writeln('/Users/SJohnson/dictionary2.txt', data) 
from fileslines where fullpath = '/Users/SJohnson/dictionary.txt'
order by data;

Explanation: This query picks up all lines from file dictionary.txt and sorts them by the text of each line, data. Then it uses Crab's writeln function to write them to file dictionary2.txt. The writeln function writes one line for each result row.

Copy a file, removing non-ASCII characters

%encoding ascii:ignore

select writeln('funnyCharsGone.py',data) from fileslines
where fullpath = '/Users/Kerabatsos/funnyChars.py';

Explanation: The command %encoding ascii:ignore tells Crab to use ASCII encoding when reading or writing data, and to silently ignore any characters that don't match. The query reads lines of data from the file funnyChars.py using the encoding rules to filter out non ASCII characters . Then it writes the filtered data to file funnyCharsGone.py.

Count the words in a file using a regular expression

select sum(matchcount(groupn(data,"(\s|^|\')+(\w+)",2))) from fileslines 
where fullpath = '/Users/JG/files/164/164.txt';

Explanation: The Crab function groupn returns a list of all the matches for a given regular expression group. A group is a bracketed part of a pattern. In this example groupn is told to return the results for group 2, which is the second bracket in the pattern, ie (\w+) This has the effect of returning a list of the words found in data. Crab's matchcount function counts how many matches are in the list, so here it returns the word count for the line. The query returns the sum of all these word counts.

Add Crab to your toolkit!

Free for non commercial use

Any questions? sales@etia.co.uk