Crab - Filesystem SQL for macOS and Windows

Query and process files, like data in a database


FROM

  ls | awk -F . '{print $NF}' | sort | uniq -c | awk '{print $2,$1}'

TO

SELECT extension, count(*) FROM files GROUP BY extension;

Crab is a command line tool for analyzing and processing files with SQL.
It's for Bash and PowerShell users who need a tool that is strong in set-based logic and in grouping, summing, and counting data.
It's also for SQL users who don't want to learn a domain-specific language just for file data.

  • SQL's set-based logic: Ideal for selecting groups of files for processing, for comparing one set of files with another, and for analyzing file stats

  • Use the LIKE operator: Find files with simple text pattern matching. No corner cases caused by spaces in filenames, shell special characters, overlong path names, 'argument list too long' errors, ...

  • Run operating system commands on files listed in query results. No need to pipe anything; it's a SQL function. Use ORDER BY to specify the order in which files are processed.

  • Use SQL in Bash scripts, or run SQL scripts for repeatable queries and file processing

  • Write query results to file or pipe output to other commands

  • Combine search criteria using AND, OR and NOT

  • No new language to learn or support, it's SQL

  • No need to install, no need for admin rights, just unzip and run

Move the shell sideways

Type 'crab' at the command line to get a Crab prompt powered by SQL. Crab scans the directories you specify and gives you an indexed table, files, that holds information about the files and directories, and a virtual table, fileslines, that gives you SQL access to file contents.

Scans run at several thousand files per second, so scanning a bunch of project or data directories usually takes no more than a second or two. Scanning a whole 1TB spinning disk takes about 30 minutes; SSDs are much faster. Most people scan project directories on demand, and the whole disk once a week or so - you can have multiple scan databases.

Try these in Bash or PowerShell!

Find files where the filename or path includes 'sublime' but not 'backup'

select fullpath from files 
where fullpath like '%sublime%' and fullpath not like '%backup%';

Explanation: The files table has data for all files and directories scanned. The fullpath column is the absolute path including the filename, e.g. '/Library/Fonts/Hei.ttf'. The LIKE operator matches strings, and the wildcard '%' matches anything, including a zero-length string. SQLite queries end with a semicolon.
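The LIKE behavior here is plain SQLite, so you can experiment with it outside Crab. A minimal sketch using Python's sqlite3 module and a toy stand-in for the files table (the paths are invented for illustration):

```python
import sqlite3

# Toy stand-in for Crab's indexed files table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (fullpath TEXT)")
con.executemany("INSERT INTO files VALUES (?)", [
    ("/Users/a/sublime_prefs.json",),
    ("/Users/a/backup/sublime_prefs.json",),
    ("/Users/a/notes.txt",),
])

# '%' matches any run of characters, including the empty string
rows = con.execute(
    "SELECT fullpath FROM files "
    "WHERE fullpath LIKE '%sublime%' AND fullpath NOT LIKE '%backup%'"
).fetchall()
print(rows)  # [('/Users/a/sublime_prefs.json',)]
```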

Find files in directory dirA that are not in directory dirB

select name from files 
where fullpath like '%/dirA/%'
and name not in (select name from files where fullpath like '%/dirB/%');

Explanation: The subquery returns names of files that have directory dirB on the path. The main query returns files that have directory dirA on the path and whose names are not in the subquery results.

Find the five biggest files

select fullpath, bytes from files order by bytes desc limit 5;

Explanation: The bytes column is the file size in bytes. ORDER BY sorts the files by size in descending order (DESC), so the largest files are listed first. LIMIT 5 restricts results to the first five rows. If you come from SQL Server, this is equivalent to SELECT TOP 5. In general SQLite's SQL is very similar to other flavors of SQL, including SQL Server's.
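ORDER BY ... DESC LIMIT is also standard SQLite, easy to try with Python's sqlite3 module on a toy table (paths and sizes invented for illustration):

```python
import sqlite3

# Toy files table with invented sizes
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (fullpath TEXT, bytes INTEGER)")
con.executemany("INSERT INTO files VALUES (?, ?)", [
    ("/a", 10), ("/b", 300), ("/c", 50), ("/d", 200),
])

# DESC + LIMIT 2 is SQLite's spelling of SQL Server's SELECT TOP 2
top2 = con.execute(
    "SELECT fullpath, bytes FROM files ORDER BY bytes DESC LIMIT 2"
).fetchall()
print(top2)  # [('/b', 300), ('/d', 200)]
```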

Find the five biggest directories below the project folder, by total file size in GB

select parentpath, sum(bytes)/1e9 as GB from files 
where parentpath like '%/project/%'
group by parentpath order by sum(bytes) desc limit 5;

Explanation: The parentpath column records the parent directory for each file, e.g. the file '/Library/Fonts/Hei.ttf' has parentpath '/Library/Fonts/'. This query includes only files in directories that are below the project directory by string matching the parentpath. The query adds up the bytes for each file and groups the results to give a total for each parent directory. Results are sorted to give the largest parent directories first, and the results are limited to five rows. The total bytes for each directory is divided by 1e9 to convert from bytes to gigabytes.

List duplicate files bigger than 100MB

select f1.bytes/1e6 as MB, f1.fullpath, f2.fullpath 
from files f1 join files f2 on f1.fileid > f2.fileid
and MB > 100
and f1.name = f2.name and sha1(f1.fullpath) = sha1(f2.fullpath);

Explanation: The name column is the filename, e.g. the file '/Library/Fonts/Hei.ttf' has name 'Hei.ttf'. fileid is the row number of the file in the files table, so it's a unique id for each file scanned.

The query does a self-join from the files table to itself, but only joins to files with a larger fileid. This avoids a file matching itself, and avoids file1 matching file2 and then file2 also matching file1. The query identifies duplicates by matching names, and if the names match it compares SHA-1 values. sha1 is a Crab function that calculates the SHA-1 checksum of a file given its fullpath. SHA-1 is a kind of fingerprint for file contents, and matching SHA-1 values is a very strong indicator that files have the same contents. File sizes are converted from bytes to megabytes by dividing by 1e6.
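sha1 is built into Crab, but its effect can be approximated with Python's hashlib, assuming Crab hashes the raw bytes of the file. A sketch (the function name and chunked reading below are illustrative, not Crab's implementation):

```python
import hashlib

def sha1_of_file(fullpath):
    """Hex SHA-1 digest of a file's contents, read in chunks so
    large files don't have to fit in memory."""
    h = hashlib.sha1()
    with open(fullpath, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()
```

Two files with byte-identical contents always produce the same digest, which is what makes the self-join's SHA-1 comparison a reliable duplicate test.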

Open in Finder the directory that has the most Excel spreadsheets

select exec('open', parentpath), count(*) from files 
where extension like '.xls%'
group by parentpath order by count(*) desc limit 1;

Explanation: Crab's exec function executes operating system commands on data returned in query results, once for each result row. The extension column records the extension of each file or directory, e.g. the file '/Library/Fonts/Hei.ttf' has extension '.ttf'.

The query counts the files where the extension starts with '.xls' (so it matches '.xls' and '.xlsx' files), and groups the count by parent directory. Then it sorts the parent directories by this count and returns just one row - the parent directory with the most Excel files. Finally the query runs the 'open' command on this directory, which opens it in Finder.

Remove spaces from all directory names below a specific directory

select exec('mv', fullpath, parentpath||replace(name,' ','')) from files 
where fullpath like '/Users/OneDirection/%'
and type = 'd'
and not pathexists(parentpath||replace(name,' ',''))
order by depth desc;

Explanation: The type column has a 'd' for directories and an 'f' for files. This query uses pattern matching to pick up only directories below /Users/OneDirection/. It uses Crab's pathexists function to check that removing spaces from the directory name will not collide with an object that already has that name. Finally it uses exec to invoke the mv command to rename the directory at fullpath to the new name we provide. The new name is constructed with ||, the SQLite string concatenation operator, and replace, the SQLite string function. The ORDER BY clause forces subdirectories to be processed before their parents, so that renaming a directory never invalidates the path of a deeper directory still waiting to be renamed.
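The target-name construction itself is plain SQLite string handling. A sketch with Python's sqlite3 module showing || and replace building the space-free path (example values invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# parentpath || replace(name, ' ', '') -> target path with spaces removed
newpath = con.execute(
    "SELECT ? || replace(?, ' ', '')",
    ("/Users/OneDirection/", "My Old Demos"),
).fetchone()[0]
print(newpath)  # /Users/OneDirection/MyOldDemos
```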

Back up any python files modified since midnight yesterday

select exec('cp', fullpath, '/Users/DennisNedry/ParkLife/Backups/WK42/'||name||'-'||fileid)
from files where extension = '.py'
and modified >= date('now', 'start of day', '-1 day');

Explanation: This query filters files by their extension, and their modified date using the SQLite date() function. Any file meeting the criteria for backup is returned as a result row, and the exec function operates on these using the cp command to copy it from its current location, fullpath, to its destination. The destination path is made by string concatenation: the target directory plus the filename plus a suffix made from the fileid. We add this suffix to avoid problems if backing up two files with the same name.

Note that the modification date is picked up at scan time, so this query must be run on a recent scan.
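The date('now', 'start of day', '-1 day') expression is standard SQLite and can be checked outside Crab; note that SQLite's 'now' is in UTC. A sketch with Python's sqlite3 module:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Modifiers apply left to right: now (UTC) -> today 00:00 -> minus one day
midnight_yesterday = con.execute(
    "SELECT date('now', 'start of day', '-1 day')"
).fetchone()[0]
print(midnight_yesterday)  # e.g. '2017-10-23'
```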

Report TODO and FIXME in any .c files

select fullpath, linenumber, data from fileslines 
where parentpath like '/Users/GN/HL3/%' and extension = '.c'
and (data like '%TODO%' or data like '%FIXME%');

Explanation: The fileslines table gives searchable access to current contents of files that have been scanned. Its columns are data, where each entry holds the text from one line of the file, and linenumber, which is a counter that starts at 1 for the first line of each file. It's a virtual table which doesn't duplicate or index file contents. fileslines also has columns from the files table, and you can use these to filter which files to search inside.

In this example we are filtering by parentpath and extension. Only files below the HL3 directory with a .c extension will have their contents searched. Lines matching 'TODO' or 'FIXME' will be returned, together with their linenumber, and the fullpath of the file.

List the even numbered lines of a file

select data from fileslines 
where fullpath = '/Users/Nathan/Ava.lisp' and linenumber % 2 = 0;

Explanation: This query filters fileslines by linenumber. Here we report lines where the linenumber gives zero remainder when divided by two.

Copy files that contain specific text

select exec('cp', fullpath, '/Users/petergibbons/tpsreports/'||name||'-'||fileid) from fileslines 
where parentpath = '/Users/petergibbons/LR427/' and data like '%cents%'
group by fullpath;

Explanation: This query searches files in directory LR427 for lines that contain the text 'cents'. Each line that matches will return the fullpath of the file that contains it. We group by fullpath so we have only one result row for each file that matches, then Crab's exec function copies these files to the tpsreports directory, appending the fileid to avoid name collisions.

Report any lines in newfile.py that aren't in oldfile.py

select f1.linenumber, f1.data from fileslines f1 
where f1.name = 'newfile.py'
and f1.data not in (select f2.data from fileslines f2 where f2.name = 'oldfile.py');

Explanation: This query uses a subquery to return the lines of oldfile.py, then reports the lines of newfile.py that aren't among them.

Sort the lines of a file and write them to a new file

select writeln('/Users/SJohnson/dictionary2.txt', data) 
from fileslines where fullpath = '/Users/SJohnson/dictionary.txt'
order by data;

Explanation: This query picks up all lines from file dictionary.txt and sorts them by the text of each line, data. Then it uses Crab's writeln function to write them to file dictionary2.txt. The writeln function writes one line for each result row.

Recursively copy every .csv file, dropping the first line of each

select writeln(parentpath || basename || '-Headless' || extension, data) 
from fileslines
where parentpath like '/somepath/%'
and extension = '.csv' and linenumber > 1;

Explanation: This query searches the contents of all '.csv' files below the directory '/somepath/'. The data for lines with linenumber > 1 is written to file by Crab's writeln function, so each file's first line is dropped. Each new file is written to the same directory as the original, with '-Headless' appended to its basename (the filename without its extension).

Copy a file, removing non-ASCII characters

%encoding ascii:ignore

select writeln('funnyCharsGone.py',data) from fileslines
where fullpath = '/Users/Kerabatsos/funnyChars.py';

Explanation: The command %encoding ascii:ignore tells Crab to use ASCII encoding when reading or writing data, and to silently ignore any characters that don't match. The query reads lines of data from the file funnyChars.py, using the encoding rules to filter out non-ASCII characters. Then it writes the filtered data to the file funnyCharsGone.py.

Count the words in a file using a regular expression

select sum(matchcount(groupn(data,"(\s|^|\')+(\w+)",2))) from fileslines 
where fullpath = '/Users/JG/files/164/164.txt';

Explanation: The Crab function groupn returns a list of all the matches for a given regular expression group. A group is a bracketed part of a pattern. In this example groupn is told to return the results for group 2, which is the second bracket in the pattern, i.e. (\w+). This has the effect of returning a list of the words found in data. Crab's matchcount function counts how many matches are in the list, so here it returns the word count for the line. The query returns the sum of these word counts over all lines of the file.
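groupn and matchcount are Crab functions, but the regular expression behaves as it would in any engine. A Python sketch of the same group-2 extraction, with re.findall standing in for groupn (an analogy, not Crab's implementation):

```python
import re

data = "It's a small, small world"

# findall returns one tuple per match; index 1 is group 2, the (\w+) word.
# Note the \' alternative splits "It's" into two "words", It and s.
matches = re.findall(r"(\s|^|\')+(\w+)", data)
words = [m[1] for m in matches]
print(words)       # ['It', 's', 'a', 'small', 'small', 'world']
print(len(words))  # 6
```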

Add Crab to your toolkit

Free for non commercial use

© 2017 Etia UK