Query your filesystem with SQL

  • Find files

    select fullpath from files 
    where (fullpath like '%color%' or fullpath like '%colour%')
    and extension = '.json';

  • Run a command on any set of files

    select exec('unzip',fullpath,'-d','/Users/Etia/Target Directory/') 
    from files
    extension = '.zip';

  • Search inside files

    select name, linenumber, data from fileslines 
    where fullpath like '/Users/Etia/logs/daily/%'
    data like '%Error%404%';

  • Analyze files

    select fullpath, bytes from files 
    where extension = '.data'
    order by bytes desc limit 5;

  • Leverage your SQL skills on the filesystem

  • Easy pattern matching on filenames, paths and file contents with LIKE.

  • Combine search criteria using AND, OR and NOT. This is the 21st century!

  • Execute OS commands on query results.

  • Install not needed, just unzip and run.

My Image

For Developers

  • Find files by half-remembered text in the filename, the path, or the content.

  • Search inside source and data files with LIKE, AND, OR, and NOT.

  • Bulk delete files by pattern matching.

  • Fix Dropbox conflicted copies.

  • Strip non ASCII characters from files.

  • Find duplicate files, or directories with duplicate contents

My Image

For Sys Admins

  • Measure filesystem growth, plan archive strategies, identify hotspots.

  • Compare filesystem trees, diff directories, resume an interrupted filesystem tree copy.

  • Fix unprintable characters in filenames.

  • Overcome shell command line length limitations of globbing.

  • Collect and bulk move files by simple pattern matching, or transforming file names.

My Image

For Data Scientists

  • Track ETL import progress, debug broken flows.

  • Transform text files with SQL string functions and regex.

  • Analyze text datasets and csv files without needing to import.

  • Pattern matching for a wide range of character encodings. Match different representations of the same character.

  • Data science at the command line.

Move the shell sideways

Type 'crab' in Terminal to get a Crab command line powered by SQL.

Crab scans the filesystem and gives you a physical table, files, that holds information about the files and directories, and a virtual table, fileslines, that gives you searchable access to file contents.

Scan your project directory when you want to query fast moving files - Crab scans most project directories instantaneously - and run an occasional overnight or lunchtime scan for the whole drive.

From the Crab command line you can simply use SQL string matching to find the files you want, and Crab's exec function to run commands on them

Just try these in Bash or PowerShell!

Find files where the filename or path includes 'sublime' and 'settings' but not 'backup'

select fullpath from files 
where fullpath like '%sublime%' and fullpath like '%settings%'
and fullpath not like '%backup%';

Explanation: The files table has data for all files and directories scanned. The fullpath column is the absolute path including the filename, e.g '/Library/Fonts/Hei.ttf'. The LIKE operator matches strings, and the wildcard '%' matches anything including a zero length string. SQLite queries end with a semicolon.

Find the five biggest files

select fullpath, bytes from files order by bytes desc limit 5;

Explanation: The bytes column is the file size in bytes. ORDER BY sorts the files by size in descending order (DESC), so the largest files are listed first. LIMIT 5 restricts results to the first five rows. If you come from SQL Server, this is equivalent to SELECT TOP 5. In general SQLite SQL is very similar to other flavors of SQL, including SQLServer.

Find the five biggest directories below the project folder, by total file size in GB

select parentpath, sum(bytes)/1e9 as GB from files 
where parentpath like '%/project/%'
group by parentpath order by sum(bytes) desc limit 5;

Explanation: The parentpath column records the parent directory for each file, e.g. the file '/Library/Fonts/Hei.ttf' has parentpath '/Library/Fonts/'. This query includes only files in directories that are below the project directory by string matching the parentpath. The query adds up the bytes for each file and groups the results to give a total for each parent directory. Results are sorted to give the largest parent directories first, and the results are limited to five rows. The total bytes for each directory is divided by 1e9 to convert from bytes to gigabytes.

List duplicate files bigger than 100MB

select f1.bytes/1e6 as MB, f1.fullpath, f2.fullpath 
from files f1 join files f2 on f1.fileid > f2.fileid
and MB > 100
and f1.name = f2.name and sha1(f1.fullpath) = sha1(f2.fullpath);

Explanation: The name column is the filename, e.g the file '/Library/Fonts/Hei.ttf' has name 'Hei.ttf'. fileid is the rownumber of the file in the files table, so it's a unique id for each file scanned.

The query does a self-join from the files table to the files table but only joins to files with a larger fileid. This is to avoid a file matching itself, and to avoid file1 matching file2, and then file2 also matching file1. The query identifies files that are duplicates by matching names, and if the names match then it compares SHA-1 values. sha1 is a Crab function that calculates the SHA-1 checksum for a file given its fullpath. SHA-1 is a kind of fingerprint for files, and matching SHA1 is a very good indicator that files have the same contents. File sizes are converted from bytes to megabytes by dividing by 1e6.

Open in finder the directory that has the most Excel spreadsheets

select exec('open', parentpath), count(*) from files 
where extension like '.xls%'
group by parentpath order by count(*) desc limit 1;

Explanation: Crab's exec function executes operating system commands on data returned in query results, once for each result row. The extension column records the extension of each file or directory e.g the file '/Library/Fonts/Hei.ttf' has extension '.ttf'.

The query counts the files where the extension starts with '.xls' (so it matches '.xls' and '.xlsx' files), and groups the count by parent directory. Then it sorts the parent directories by this count and returns just one row - the parent directory with the most excel files. Finally the query runs the 'open' command on this directory, which opens it in finder.

Remove spaces from all directory names below a specific directory

select exec('mv', fullpath, parentpath||replace(name,' ','')) from files 
where fullpath like '/Users/OneDirection/%'
and type = 'd'
and not pathexists(parentpath||replace(name,' ',''));

Explanation: The type column has a 'd' for directories and an 'f' for files. This query uses pattern matching to pick up only directories that are below /Users/OneDirection/. It uses Crab's pathexists function to check that if we remove spaces from the directory name we will not collide with an object that already has that name. Finally it uses exec to invoke the OS X mv command which in this case will rename the file at fullpath to give it the name we provide. The new name is constructed using the SQLite string concatenation operator, ||, and replace, the SQLite string function.

Back up any python files modified since midnight yesterday

select exec('cp', fullpath, '/Users/DennisNedry/ParkLife/Backups/WK42/'||name||'-'||fileid)
from files where extension = '.py'
and DATE(modified) >= DATE('now', 'start of day', '-1 day')

Explanation: This query filters files by their extension, and their modified date using the SQLite DATE function. Any file meeting the criteria for backup is returned as a result row, and the exec function operates on these using the OS X cp command to copy it from its current location, fullpath, to its destination. The destination path is made by string concantenation: the target directory plus the filename plus a suffix made from the fileid. We add this suffix to avoid problems if backing up two files with the same name.

Play a random Jack Johnson song

select exec('open', fullpath) from files 
where fullpath like '%jack%' and fullpath like '%jo%son%' and extension = '.mp3'
order by random() limit 1;

Explanation: This query returns .mp3 files which have jack and johnson or jonson somewhere in the filename or the directory name. The results are sorted randomly and the first row is returned. The exec function runs the OS X open command on the fullpath, which opens the file and plays it in iTunes, or whichever app you have associated with .mp3 files.

Report TODO and FIXME in any .c files

select fullpath, linenumber, data from fileslines 
where parentpath like '/Users/GN/HL3/%' and extension = '.c'
and (data like '%TODO%' or data like '%FIXME%');

Explanation: The fileslines table gives searchable access to current contents of files that have been scanned. Its columns are data, where each entry holds the text from one line of the file, and linenumber, which is a counter that starts at 1 for the first line of each file. It's a virtual table and doesn't duplicate or index file contents. fileslines also has columns from the files table, and you can use these to filter which files to search inside.

In this example we are filtering by parentpath and extension. Only files below the HL3 directory with a .c extension will have their contents searched. Lines matching 'TODO' or 'FIXME' will be returned, together with their linenumber, and the fullpath of the file.

List the first few lines of a file

select data from fileslines 
where fullpath = '/Users/Nathan/Ava.lisp' and linenumber > 1 and linenumber <= 10;

Explanation: This query filters fileslines by linenumber. Here we report lines 2 through 10 of the file.

Copy files that contain specific text

select exec('cp', fullpath, '/Users/petergibbons/tpsreports/'||name||'-'||fileid) from fileslines 
where parentpath = '/Users/petergibbons/LR427/' and data like '%cents%'
group by fullpath;

Explanation: This query searches files in directory LR427 for lines that contain the text 'cents'. Each line that matches will return the fullpath of the file that contains it. We group by fullpath so we have only one result row for each file that matches, then Crab's exec function copies these files to the tpsreports directory, appending the fileid to avoid name collisions.

Report any lines in newfile.py that aren't in oldfile.py

select f1.linenumber, f1.data from fileslines f1 
where f1.name = 'newfile.py'
and f1.data not in (select f2.data from fileslines f2 where f2.name = 'oldfile.py');

Explanation: This query uses a subselect to return lines from file oldfile.py, then it reports lines from newfile.py which aren't in there.

Sort the lines of a file and write them to a new file

select writeln('/Users/SJohnson/dictionary2.txt', data) 
from fileslines where fullpath = '/Users/SJohnson/dictionary.txt'
order by data;

Explanation: This query picks up all lines from file dictionary.txt and sorts them by the text of each line, data. Then it uses Crab's writeln function to write them to file dictionary2.txt. The writeln function writes one line for each result row.

Copy a file, removing non-ASCII characters

%encoding ascii:ignore

select writeln('funnyCharsGone.py',data) from fileslines
where fullpath = '/Users/Kerabatsos/funnyChars.py';

Explanation: The command %encoding ascii:ignore tells Crab to use ASCII encoding when reading or writing data, and to silently ignore any characters that don't match. The query reads lines of data from the file funnyChars.py using the encoding rules to filter out non ASCII characters . Then it writes the filtered data to file funnyCharsGone.py.

Count the words in a file using a regular expression

select sum(matchcount(groupn(data,"(\s|^|\')+(\w+)",2))) from fileslines 
where fullpath = '/Users/JG/files/164/164.txt';

Explanation: The Crab function groupn returns a list of all the matches for a given regular expression group. A group is a bracketed part of a pattern. In this example groupn is told to return the results for group 2, which is the second bracket in the pattern, ie (\w+) This has the effect of returning a list of the words found in data. Crab's matchcount function counts how many matches are in the list, so here it returns the word count for the line. The query returns the sum of all these word counts.

Transpose a CSV file and drop the header row

select writeln('payer_payee.csv', group_concat(itemn(data, i, ','))) 
from fileslines join iterator on i <= itemcount(data, ',')
where name = 'payee_payer.csv' and linenumber > 1
group by i;

Explanation: Crab's iterator table has one field i that goes from 1 up to a number we specify. It's useful for stepping through the items in a list, to return each of them on a separate row. Here the length of the list is itemcount(data, ','), and item i from the list is itemn(data, i, ',').

Joining from fileslines to iterator like this gives us every entry in the original CSV file on a separate row. But we want all the items from the first column to be on the first row, and all the items from the second column to be on the second row and so on. To do this we must group together all the items with the same value of i. In fact we want to concatenate them together into one row, with commas between them - and this is exactly what the SQLite function group_concat does. Finally writeln outputs the results to a new file, payer_payee.csv

Show differences between two directory trees

select  below(fullpath, '/projtree/', '/projtreebackup/') as path
from files
where fullpath like '%/projtree/%' or fullpath like '%/projtreebackup/%'
group by path having count(*) !=2;

Explanation: Crab's below function looks inside fullpath for the directories listed, and if it finds one it trims fullpath to give just the portion of the path below that directory. For example below('users/peterdalloz/projtreebackup/source/main.c' , '/projtree/', '/projtreebackup/') gives 'source/main.c'

By trimming the fullpath at these directories we get a set of relative paths for all the files below them. Any relative path that doesn't appear twice is a file or directory missing from one of the directory trees.

Any questions? sales@etia.co.uk