Functions


SQLite standard functions are
documented here

Crab functions are documented below, with the top six functions in bold

above(path, dir, offset)Returns the part of path that is above dir or dir's children, to support grouping and counting files by their ancestor directories
below(path, dir1 [,dir2]...)Returns the part of path that is below directory dir1, so you can easily compare or move directory trees
bytestr(s [,format])Returns the string or blob s as a printable byte sequence
chopstr(s, n)Chops n characters off the beginning of string s, returning the part of s that remains. If n is negative, chops characters from the end of s
codestr(s [, format])Returns the string or blob s as a sequence of unicode code points
eval(cmd [,arg1] [,arg2] ...)Runs and returns the output of operating system commands and other programs
exec(cmd [,arg1] [,arg2] ...)Executes operating system commands and other programs using data returned in query results
groupn(string, pattern, i)Returns all matches for group i when matching pattern to string as a regular expression
inlist(stringList, stringItem, sep)Returns list index of first occurrence of stringItem as an element in stringList, or 0 if not found
itemcount(stringList, sep)Returns length of stringList, counted as number of separators + 1
itemn(stringList, i, sep)Returns the i'th element from stringList, treating sep as list separator
string match patternReturns 1 (True) if string matches the regex pattern, 0 (False) if it does not
matchcount(string, pattern) or matchcount(groupn(string, pattern, j))Returns how many matches were found when matching regex pattern to string, or were returned by groupn
matchn(string, pattern, i) or matchn(groupn(string, pattern, j), i)Returns the i'th match of regex pattern in string, or the i'th element from a list of groupn match results
metadata(path, metaDataFieldName)Returns the value of some metadata attribute for the filesystem object. Currently macOS only.
normstr(string [,norm])Converts string to normal form, by expressing each unicode glyph as code points in a standard order
notbs(string)Removes the last character from string if it is a backslash, as required for directory path arguments to some Windows commands
pathexists(path)Tests whether a path - file or directory - exists at query run time. Returns 1 (True) if the path exists, 0 (False) if it does not
short(s, n)Restricts restricts string s to n characters by removing characters from the middle of the string, and replacing them with '~'
shortpath(path)Turns a normal Windows path into a path made of short DOS-style 8.3 abbreviated names
sublist(stringList, startPosition [,len], sep)Returns a sublist from stringList starting at index startPosition, of length len
u(string)Converts unicode escape sequences and c-style escape sequences in string to unicode characters
write(filename, [,data1] [,data2] ...)Writes query results to file
writeln(filename, [,data1] [,data2] ...)Writes query results to file, with newlines

above(path, dir, offset)

Returns the part of path that is above dir or dir's children, to support grouping and counting files by their ancestor directories.


Examples

E.g.

CRAB> select above('C:\Users\johnsmith\Dropbox\tests\test1.c','\johnsmith\', 0);
C:\users\johnsmith\
CRAB> select above('C:\Users\johnsmith\Dropbox\tests\test1.c','\johnsmith\', 1);
C:\Users\johnsmith\Dropbox\
CRAB> select above('C:\Users\johnsmith\Dropbox\tests\test1.c','\johnsmith\', 2);
C:\Users\johnsmith\Dropbox\tests\

E.g. Recursively count contents of the child directories of   \Dropbox\  

SELECT above(fullpath,'\Dropbox\',1) child, count(*) FROM files
WHERE fullpath LIKE 'C:\Users\johnsmith\Dropbox\%'
GROUP BY child ORDER BY count(*) DESC LIMIT 3;

CRAB> select above(fullpath,'\Dropbox\',1) child, count(*) from files where fullpath like 'C:\Users\johnsmith\Dropbox\%'group by child order by count(*) desc limit 3;
{child: C:\Users\johnsmith\Dropbox\Writing\, count(*): 118999}
{child: C:\Users\johnsmith\Dropbox\Gutenberg\, count(*): 33689}
{child: C:\Users\johnsmith\Dropbox\utilities\, count(*): 10569}

Parameters

path
A path to truncate.

dir
The directory where to start truncation.

offset
A depth offset to move the truncation start point up or down. A positive integer to move down, a negative integer to move up. E.g 1 for dir's child directories, 2 for dir's grandchild directories.


You can specify the truncation directory using an absolute path

SELECT above(fullpath,'C:\Users\johnsmith\Dropbox\',1) child, count(*)
FROM files
WHERE fullpath LIKE 'C:\Users\johnsmith\Dropbox\%'
GROUP BY child;

You can recursively count contents of the grandchildren by using an offset of 2 instead of 1

SELECT above(fullpath,'C:\Users\johnsmith\Dropbox\',2) gchild, count(*)
FROM files
WHERE fullpath LIKE 'C:\Users\johnsmith\Dropbox\%'
GROUP BY gchild ORDER BY count(*) DESC LIMIT 10;


below(path, dir1 [,dir2]...)

Returns the part of path that is below directory dir1, so you can easily compare or move directory trees

  • path is checked against the directories dir1, dir2,... in turn. When below finds a match, it returns the part of path below dirn. If none of the dirn are in path, NULL is returned.

Examples

E.g.

CRAB> select below('C:\Users\sysops\logfiles\serverlogs\2015\log20150801.txt', '\serverlogs\');
2015\log20150801.txt
CRAB> select below('C:\Users\sysops\logfiles\serverlogs\2015\log20150801.txt', '\applicationlogs\');

CRAB> select below('C:\Users\sysops\logfiles\serverlogs\2015\log20150801.txt', '\applicationlogs\', '\serverlogs\');
2015\log20150801.txt
CRAB>

E.g. Show differences between two directory trees


SELECT below(fullpath, '\projtree\', '\projtreebackup\') AS path, count(*)
FROM files
WHERE fullpath LIKE '%\projtree\%' OR fullpath LIKE '%\projtreebackup\%'
GROUP BY path HAVING count(*) !=2;

Parameters

path
A path string to be trimmed, typically fullpath or parentpath

dir1, dir2, ...
The directories below which you want the paths to start. The dirn should be specified as directory names delimited by '\', e.g. '\Dropbox\'

You can specify an absolute path for any of the dir's.  

SELECT below('C:\Users\sysops\logfiles\serverlogs\2015\log20150801.txt',
'C:\Users\sysops\logfiles\serverlogs\');

CRAB> select below('C:\Users\sysops\logfiles\serverlogs\2015\log20150801.txt', 'C:\Users\sysops\logfiles\serverlogs\');
2015\log20150801.txt


bytestr(s [,format])

Returns the string or blob s as a printable byte sequence.

  • Each byte is represented as a printable ASCII character if possible, otherwise as \xhh where hh are two hexidecimal digits giving the value of the byte.
  • Use this to see what is inside a binary file using binary encoding, or to see control codes in a text file, or to see what bytes are being used to encode Crab's utf-8 strings.
  • If you want to pattern match a glyph, you can use bytestr to see the sequence of bytes that make up the utf-8 representation of it, then use the X prefix to build a blob to match the bytes

E.g. to report which python files contain tab characters

SELECT fullpath, linenumber, bytestr(data) FROM fileslines
WHERE extension = '.py' AND data MATCH X'09';


format
You can supply a second argument, a format specifier, which dictates how the byte is printed when it's not a printable ASCII character. This follows the python integer formatting conventions. By default it is '\x%02X' which prints \xhh. But for example you can use '.'


E.g. to get hex dump type output for a file

  
%encoding binary:40
%mode list
%headers off

SELECT linenumber, hex(data), bytestr(data,".") FROM fileslines
WHERE fullpath = 'C:\Users\johnsmith\Dropbox\PS_Store'
AND data NOT MATCH u('^\x00*$') LIMIT 10;

CRAB> select linenumber, hex(data), bytestr(data,".") from fileslines where fullpath = 'C:\Users\johnsmith\Dropbox\PS_Store' and data not match u('^\x00*$') limit 10;
1,00000001427564310000980000000800000098000000200C0000100B0000300C0000400C00000000,....Bud1.............. .......0...@.....
2,000000000000080000000800000000000000000000000000000000000000000B00000002000000F8,........................................
3,0000000B00001000004A004500430054004900560000000000000000000000000000000000000000,.........J.E.C.T.I.V....................
26,0000000000000000000000000000000000000000000000000000000000000003000000010000000C,........................................
27,0000000A004F0042004A00450043005400490056004500436C737650626C6F62000003A162706C69,.....O.B.J.E.C.T.I.V.E.ClsvPblob....bpli
28,73743030DA0102030405060708090A0B0C0D0E5E5F60140C625F1012766965774F7074696F6E7356,st00...............^_`..b_..viewOptionsV
29,657273696F6E5F100F73686F7749636F6E507265766965775F101163616C63756C617465416C6C53,ersion_..showIconPreview_..calculateAllS
30,697A657357636F6C756D6E735F100F7363726F6C6C506F736974696F6E59587465787453697A655F,izesWcolumns_..scrollPositionYXtextSize_
31,100F7363726F6C6C506F736974696F6E585A736F7274436F6C756D6E5F101075736552656C617469,..scrollPositionXZsortColumn_..useRelati
32,766544617465735869636F6E53697A6510010908AA0F1820272F373F474F57D41011121314150C0C,veDatesXiconSize....... '/7?GOW.........

E.g. to look at utf8 encoding of korean characters

bytestr returns the bytes making up the Crab's internal utf-8 encoding of the korean characters
codestr returns the unicode codepoints of the korean characters

CRAB> %encoding euc-kr:replace
CRAB>
select linenumber, data, codestr(data),bytestr(data) from fileslines where fullpath like '%qqtest\ked%' and linenumber = 1302;
   linenumber = 
1302
         data = 
부르- sing (a song) 
codestr(data) = 
\uBD80\uB974- sing (a song) 
bytestr(data) = 
\xEB\xB6\x80\xEB\xA5\xB4- sing (a song) 

See Crab's codestr function
To return a sequence of unicode code points

See
SQLite hex function
To return a sequence of hex digits

See %encoding command
To change encoding


chopstr(s, n)

Chops n characters off the beginning of string s, returning the part of s that remains. If n is negative, chops characters from the end of s.


E.g. to trim the last character off every line of a CSV file to remove trailing commas

SELECT writeln('trim.csv',chopstr(trim(data),-1))
FROM fileslines WHERE name = 'csvc.csv';




codestr(s [, format])

Returns the string or blob
s as a sequence of unicode code points.

  • Where the code point is a printable ASCII character the character is returned, otherwise \uhhhh is returned, where hhhh are four hexadecimal digits specifying the number of the unicode character.
  • Use this to discover which code points make up a glyph.

format
You can supply a second argument, a format specifier, which dictates how the code is printed when it's not a printable ascii character. This follows the python integer formatting conventions. By default it is "\u%04X" which prints \uhhhh. But for example you can use '.' (see Crab's bytestr() function entry for an example)


Example: Select files with non ASCII characters in the name

To select filenames with non ASCII characters you can type the characters directly

SELECT fullpath FROM files WHERE name LIKE '%café%'

or you can use Crab's u() function to turn unicode escape strings into text

SELECT fullpath FROM files WHERE name LIKE u('%cafe\u0301%');

However there is often more than one way to enter the same glyph, e.g. 'é' can be 'e\u0301' or '\u00e9'. The one you type at the keyboard may not match the unicode sequence in the filename, even if they look the same. To see the code sequence inside a string, use the codestr() function.

SELECT codestr(name) FROM files WHERE name LIKE '%caf%';

To match 'é' whichever sequence has been used, Crab provides the normstr() function, which normalizes a code sequence to the Windows filename standard

CRAB> SELECT name, codestr(name) FROM files WHERE name LIKE normstr('%café%');
{name: café, codestr(name): cafe\u0301}
CRAB> SELECT name FROM files WHERE name LIKE u('%caf\u00e9%');
CRAB> SELECT name FROM files WHERE name LIKE normstr(u('%caf\u00e9%'));
{name: café}

See Crab's bytestr function
To return a byte sequence

See Crab's normstr function
To normalize the encoding of a string

See Crab's u function
To convert a string with unicode escapes to text

See SQlite hex function
To get all bytes in a string as a sequence of hex digits

See %encoding command
To change encoding


eval(cmd [,arg1] [,arg2] ...)

Runs and returns the output of operating system commands and other programs.

  • Use eval when you want to test the output of some command in a query's WHERE clause, or to return command output as part of a query's result set.
  • eval may execute multiple times for each row, and even for rows that don’t appear in the final results, depending on the whim of the query optimizer, so eval should only be used with commands that don't change filesystem data.

Parameters

The first argument should be the command name and later ones the command's arguments. Arguments are automatically escaped, so you don't need to worry about paths that contain spaces.


Example

E.g. Use
eval with the 'dir' command's '/q' option to find files with a particular owner

SELECT
fullpath,
eval('dir', '/q', fullpath) owner
FROM files
WHERE fullpath LIKE 'C:\Users\johnsmith\%' AND pathexists(fullpath) AND owner LIKE '%peterdalloz%';


Tips

  • eval returns a string.
  • Trailing newlines are stripped, other newlines appear as '\x0A'


See Crab's exec function
To execute a command once for each query result row


exec(cmd [,arg1] [,arg2] ...)

Executes operating system commands and other programs using data returned in query results.

  • Use exec when you want to run a command once for each query result row.
  • Typically used to run commands on files, using fullpath as one of the arguments.

Parameters

The first argument should be the command name and later ones the command's arguments. Arguments are automatically escaped, so you don't need to worry about paths that contain spaces.


Examples

E.g. Use
exec with the COPY command to back up any file under the documents directory that was modified today

SELECT exec('copy', fullpath, 'C:\Users\johnsmith\backups\'||name||'.'||fileid)
FROM files
WHERE fullpath LIKE 'C:\Users\johnsmith\documents\%'
AND modified >= date('now','start of day')

Explanation: The exec function is evaluated for every file meeting the criteria of the WHERE clause - every recently modified file. For each of these files the copy command is executed, with the first argument (copy from) being the fullpath of the file, and the second argument (copy to) being a string concatenated from the target directory plus the filename plus a dot plus its unique fileid number. We use the fileid number to avoid name collisions in case any files have the same name.


E.g. Copy files that contain specific text

  
SELECT exec('copy', fullpath, 'C:\Users\petergibbons\tpsreports\'||name||'-'||fileid)
FROM fileslines
WHERE parentpath = 'C:\Users\petergibbons\LR427\' AND data LIKE '%cents%'
GROUP BY fullpath;

Explanation: Here the query is against fileslines, so the fullpath is returned for every matching line in each file. But we only want to copy each file once: GROUP BY fullpath achieves this.


Restrictions

exec is a delayed evaluation function. It sends output to the screen, but does not return results you can use in other parts of your query because the rest of the query has already finished evaluation when it executes.

For this reason it is only supported in the SELECT clause of a top level query, you can't use exec in subqueries, WHERE clauses, ORDER BY clauses, inside other functions and so on.

If you want to process the output of operating system commands, for example to test in your query's WHERE clause, see Crab's
eval function.


Tips

  • Safety first: before using exec to change data it's good practice to try your query with the 'echo' command so you can check that the commands that will be executed are what you expect, e.g.
    SELECT exec('echo', 'copy', fullpath, 'C:\Users\johnsmith\backups\'||name||'.'||fileid) 
FROM files
WHERE fullpath LIKE 'C:\Users\johnsmith\documents\%'
AND modified>=date('now','start of day')

  • Remember that any changes you make to the filesystem - files you delete or add - won't be reflected in the scan data until you scan again.
  • DISTINCT doesn't work with exec, because exec has to be evaluated before DISTINCT can check its output. Use GROUP BY instead.


See Crab's eval function
To use the value returned by a command


groupn(string, pattern, i)

Returns all matches for group i when matching pattern to string as a regular expression. 

  • A group is a bracketed part of a pattern that does not start with a question mark.
  • By default the matches are returned in one string separated by '|', but you can change the separator with the %matchsep command
  • Use matchn or itemn to pick out one of the matches

Examples

E.g. return all the matches for group 2

select groupn("« Ce qui nous paraît ici mériter l'intérêt","(?u)(\s|'|^)+(\w+)",2);

CRAB>  select  groupn("«  Ce  qui  nous  paraît  ici  mériter  l'intérêt","(?u)(\s|'|^)+(\w+)",2);
Ce|qui|nous|paraît|ici|mériter|l|intérêt

Explanation: The pattern "(?u)(\s|'|^)+(\w+)" has two groups: because the first bracket starts with ? it isn't a capturing group. The number 2 tells groupn to return matches for the second group. The (?u) element tells the regex engine to include unicode letters in \w matches


E.g. return match number 3 for group 2


select matchn(groupn("« Ce qui nous paraît ici mériter l'intérêt","(?u)(\s|'|^)+(\w+)",2),3);

CRAB> select matchn(groupn("« Ce qui nous paraît ici mériter l'intérêt","(?u)(\s|'|^)+(\w+)",2),3);
nous


E.g. use the iterator table to return all the matches on separate rows

  select matchn(groupn("« Ce qui nous paraît ici mériter l'intérêt","(?u)(\s|'|^)+(\w+)",2),i) 
from iterator
where i <= matchcount(groupn("« Ce qui nous paraît ici mériter l'intérêt","(?u)(\s|'|^)+(\w+)",2));

CRAB> select matchn(groupn("« Ce qui nous paraît ici mériter l'intérêt","(?u)(\s|'|^)+(\w+)",2),i) from iterator where i <= matchcount(groupn("« Ce qui nous paraît ici mériter l'intérêt","(?u)(\s|'|^)+(\w+)",2));
Ce
qui
nous
paraît
ici
mériter
l
disait


E.g. Use the same logic to extract and count the words for all the lines in a file

  select matchn(groupn(data,"(?u)(\s|^|')+(\w+)",2),i) word, count(*)
from fileslines join iterator on i<=matchcount(groupn(data,"(?u)(\s|^|')+(\w+)",2))
where fullpath = 'C:\Users\johnsmith\12246.txt'
group by word order by count(*) desc limit 30;

CRAB> select matchn(groupn(data,"(?u)(\s|^|')+(\w+)",2),i) word, count(*) from fileslines join iterator on i<=matchcount(groupn(data,"(?u)(\s|^|')+(\w+)",2)) where fullpath = 'C:\Users\johnsmith\12246.txt' group by word order by count(*) desc limit 5;
{word: de, count(*): 1663}
{word: a, count(*): 1005}
{word: et, count(*): 919}
{word: en, count(*): 822}
{word: le, count(*): 758}

pattern
Regex pattern uses python regex syntax, see https://docs.python.org/2/library/re.html
See Crab's match function entry for more syntax


See Crab's other regex matching functions: match, matchn, matchcount

See %matchsep command
To change the separator in a groupn match result list

See How to Regex
A short tutorial on regular expressions with Crab


inlist(stringList, stringItem, sep)

Returns list index of first occurrence of stringItem as an element in stringList, or 0 if not found.

  • Use when processing a string value as a list, e.g. to treat a path as a '\'-delimited list, to process a list of regular expression match results, or a row of csv data from fileslines.
  • SQLite treats all non zero values as True, so inlist can be used to test if stringItem is contained in stringList.

Syntax and logic are modelled on the
SQLite instr function

  • List index of the first element is 1.
  • Finds the left-most match.
  • Matches whole single list elements only, so inlist('ABC\DEF', 'B','\') will not find a match and inlist('A\B\C\D\E', 'B\C','\') will also not find a match.
  • List elements can contain whitespace, for an item to match it must have the same whitespace.

Examples

SELECT inlist('C:\Users\johnsmith\Dropbox\PYTHONDROPBOX\sqlsys','PYTHONDROPBOX','\');
5

SELECT inlist('C:\Users\johnsmith\Dropbox\PYTHONDROPBOX\sqlsys','C:','\');
1

SELECT inlist('\b\a\d\a\b\o\o\m','a','\');
3

SELECT inlist('The quick brown fox jumps over the lazy dog', 'fox', ' ');
4


Parse csv files

select linenumber, data from fileslines
where fullpath like '%TSP4\santa_cities.csv' and inlist(data,'12751',',');

linenumber data
37 35,12751,2804
9841 9839,12751,797
12753 12751,14860,6014
12864 12862,12751,1173
13016 13014,16310,12751



Parameters

stringList
The string to be searched, interpreted as a list

stringItem
The element to find

sep
List separator. If stringList is a path use '\'. If stringList is a list of regular expression match results use '|', unless the default separator has been changed with the %matchsep command
Multi character separators are allowed.
Blob separators are supported, such as
X'09' to represent tab.



See Crab's other list-processing functions: itemcount, itemn, sublist


itemcount(stringList, sep)

Returns length of stringList, counted as number of separators + 1.

  • Use when processing a string value as a list, e.g. to treat a path as a '\'-delimited list, to process a list of regular expression match results, or a row of csv data from fileslines.
  • Blank list elements are included in the count.

Parameters

stringList

The string to be searched, interpreted as a list

sep
List separator. If stringList is a path use '\'. If stringList is a list of regular expression match results use '|', unless the default separator has been changed with the %matchsep command
Multi character separators are allowed.
Blob separators are supported, such as
X'09' to represent tab.


Examples

E.g.

CRAB> SELECT itemcount('P\QPR\T','\');
{itemcount('P\QPR\T','\'): 3}
CRAB> SELECT itemcount('C:\Users\johnsmith\Dropbox','\');
{itemcount('C:\Users\johnsmith\Dropbox','\'): 4}
CRAB> SELECT itemcount('C:\Users\johnsmith\Dropbox\','\');
{itemcount('C:\Users\johnsmith\Dropbox\','\'): 5}
CRAB> SELECT itemcount('','\');
{itemcount('','\'): 1}
CRAB> SELECT itemcount(NULL,'\');
{itemcount(NULL,'\'): 0}

E.g. Check the number of columns for each row of a tsv file

SELECT itemcount(data,X'09') AS ncols, count(*) FROM fileslines
WHERE name = 'ufo_awesome.tsv'
GROUP BY ncols
ORDER BY count(*) desc;

CRAB> SELECT itemcount(data,X'09') AS ncols, count(*) FROM fileslines WHERE name = 'ufo_awesome.tsv' GROUP BY ncols ORDER BY count(*) desc;
{ncols: 6, count(*): 61067}
{ncols: 7, count(*): 134}
{ncols: 8, count(*): 52}
{ncols: 10, count(*): 37}
{ncols: 9, count(*): 31}
{ncols: 13, count(*): 13}
{ncols: 11, count(*): 12}
{ncols: 12, count(*): 9}
{ncols: 14, count(*): 6}
{ncols: 15, count(*): 4}
{ncols: 21, count(*): 3}
{ncols: 56, count(*): 3}
{ncols: 4, count(*): 2}
{ncols: 16, count(*): 2}
{ncols: 18, count(*): 2}
{ncols: 28, count(*): 2}
{ncols: 17, count(*): 1}
{ncols: 19, count(*): 1}
{ncols: 23, count(*): 1}
{ncols: 26, count(*): 1}
{ncols: 29, count(*): 1}
{ncols: 33, count(*): 1}
{ncols: 34, count(*): 1}
{ncols: 36, count(*): 1}
{ncols: 38, count(*): 1}
{ncols: 46, count(*): 1}
{ncols: 58, count(*): 1}
{ncols: 76, count(*): 1}
{ncols: 118, count(*): 1}
{ncols: 123, count(*): 1}

See Crab's other list-processing functions: inlist, itemn, sublist


itemn(stringList, i, sep)

Returns the i'th element from stringList, treating sep as list separator

  • Use when processing a string as a list, e.g. to treat a path as a '\'-delimited list, to process a list of regular expression match results, or a row of csv data from fileslines.

Examples

E.g.

CRAB> SELECT itemn('P\QPR\T',2,'\');
{itemn('P\QPR\T',2,'\'): QPR}
CRAB> SELECT itemn('C:\Users\johnsmith\Dropbox\',1,'\');
{itemn('C:\Users\johnsmith\Dropbox\',1,'\'): C:}
CRAB> SELECT itemn('C:\Users\johnsmith\Dropbox\',-2,'\');
{itemn('C:\Users\johnsmith\Dropbox\',-2,'\'): Dropbox}
CRAB> SELECT itemn('',1,'\');
{itemn('',1,'\'): }
CRAB> SELECT itemn(NULL,1,'\');
{itemn(NULL,1,'\'): }

E.g Total of column 1 and column 3 for every csv file in \myproject\, skipping header rows

SELECT sum(itemn(data,1,',') + itemn(data,3,',')) FROM fileslines
WHERE extension = '.csv' and parentpath = 'C:\Users\johnsmith\myproject\'
and linenumber > 1



Parameters

stringList
The string to be sliced, interpreted as a list.

i
Index of the required element.
The first item of
stringList has position 1.
If
startPosition is negative, count backwards from the end of the list, e.g. last list element has position -1.
If you select a list element that is beyond the end of the list, NULL is returned

sep
List separator. If stringList is a path use '\'. If stringList is a list of regular expression match results use '|', unless the default separator has been changed with the %matchsep command
Multi character separators are allowed.
Blob separators are supported, such as
X'09' to represent tab.


Tips

  • itemn(X,i,sep) is equivalent to sublist(X,i,1,sep)
  • Use itemcount(X,sep) to find list length
  • Use Crab's iterator table to process multiple list items from a row of data
  • Paths that end in a directory have an empty string last element. E.g. 'C:\a\b\c\' is a list of length 5 where item 1 = 'C:', item 2 = 'a' and item 5 = ''


See Crab's other list-processing functions: inlist, itemcount, sublist


string  match  pattern

Returns 1 (True) if
string matches the regex pattern, 0 (False) if it does not.

  • Pattern matching isn't anchored, so can be matched anywhere in string, not just at the beginning.
  • By default match is case sensitive

Examples

E.g. list functions defined with upper case names in demo.py

SELECT data FROM fileslines WHERE name = 'demo.py' AND data MATCH 'def\s+[A-Z]+\(';

E.g. list all the classes defined in python files in the Dropbox folder

SELECT distinct trim(data) FROM fileslines
WHERE extension = '.py' AND fullpath LIKE '%\Dropbox\%' AND data MATCH '^\s*class\s+.*';


pattern
Use python regular expression syntax, see https://docs.python.org/2/library/re.html

By default regex matching is case sensitive, use (?i) at the start of your pattern to turn off case sensitivity.
Use
(?u) at the start of your pattern for Unicode compatibility, e.g. for \w to match non ASCII word characters.

Matching isn't anchored to the start of string, specify ^ in your pattern to match the start of the string, and $ to match the end

. matches any character

E.g. 'a.b' matches a followed by any character followed by b


* requires the preceeding pattern to match 0 or more times
+ requires the preceeding pattern to match 1 or more times
? requires the preceeding pattern to match 0 or 1 times

E.g.
'a*b' matches any number of a's, followed by b
'a.*b' matches a, followed by any number of characters (possibly none), followed by b
'a+' matches one or more a's
'ab?c' matches 'abc' or 'ac'


By default
* + and ? match as much text as possible, but tacking a '?' on the end makes them non-greedy. So *? +? and ?? are non greedy

\s matches any whitespace character
\S matches any non-whitespace character
\d matches any digit [0-9]
\D matches any non digit
\w matches any word character (alphanumeric plus underscore)
\W matches any non-word character (alphanumeric plus underscore)

\t matches tab

\ also escapes special characters, so \* matches asterisk, \+ matches plus sign, \? matches question mark, \\ matches backslash

[] matches any character inside the brackets, e.g. [A-Z0-9] matches uppercase A thru Z and 0 thru 9
    If
^ is the first character inside the bracket it makes the bracket match anything except the characters listed
    So e.g.
[^A-Z] matches anything except upper case A thru Z



See Crab's other regex matching functions: groupn, matchn, matchcount

See How to Regex
A short tutorial on regular expressions with Crab


matchcount(string, pattern) or matchcount(groupn(string, pattern, j))

Returns how many matches were found when matching regex
pattern to string, or were returned by groupn

  • Pattern matching isn't anchored, so can be matched anywhere in string, not just at the beginning.
  • By default regex matching is case sensitive.
  • Often used with the iterator table to return all the matches on separate rows

E.g. here is a group that matches 8 words

SELECT groupn("« Ce qui nous paraît ici mériter l'intérêt","(?u)(\s|^|')+(\w+)",2);

CRAB> select groupn("« Ce qui nous paraît ici mériter l'intérêt", "(?u)(\s|^|')+(\w+)",2);
Ce|qui|nous|paraît|ici|mériter|l|intérêt

Here is how to use matchcount to count the matches

SELECT matchcount(groupn("« Ce qui nous paraît ici mériter l'intérêt","(?u)(\s|^|')+(\w+)",2));

CRAB> select matchcount(groupn("« Ce qui nous paraît ici mériter l'intérêt", "(?u)(\s|^|')+(\w+)",2));
8

Here is how to use matchcount with the iterator table to return the matches on separate rows

  SELECT matchn(groupn("« Ce qui nous paraît ici mériter l'intérêt","(?u)(\s|^|')+(\w+)",2),i) 
FROM iterator
WHERE i <= matchcount(groupn("« Ce qui nous paraît ici mériter l'intérêt","(?u)(\s|^|')+(\w+)",2));


CRAB> select matchn(groupn("« Ce qui nous paraît ici mériter l'intérêt", "(?u)(\s|^|')+(\w+)",2),i) from iterator where i <= matchcount(groupn("« Ce qui nous paraît ici mériter l'intérêt", "(?u)(\s|^|')+(\w+)",2));
Ce
qui
nous
paraît
ici
mériter
l
intérêt

Here's how to count word frequencies in a whole file

  SELECT matchn(data,"\w+\'*\w*",i) wrd, count(*)
FROM fileslines JOIN iterator ON i <= matchcount(data,"\w+'*\w*")
WHERE fullpath LIKE '%sqlsys\qqtest\76.txt'group by wrd
ORDER BY count(*) DESC LIMIT 5;

CRAB> select matchn(data,"\w+'*\w*",i) word, count(*) from fileslines join iterator on i <= matchcount(data,"\w+'*\w*") where fullpath like '%sqlsys\qqtest\76.txt' group by word order by count(*) desc limit 5;
and         6209      
the         4736      
I           3277      
a           3177      
to          3010  


See How to Regex
A short tutorial on regular expressions with Crab

See Crab's other regex matching functions: groupn, match, matchn


matchn(string, pattern, i)    or    matchn(groupn(string, pattern, j), i)

Returns the i'th match of regex pattern in string, or the i'th element from a list of groupn match results

  • Pattern matching isn't anchored, so can be matched anywhere in string, not just at the beginning.
  • By default regex matching is case sensitive.

Examples

E.g. Count word frequencies

SELECT matchn(data,"\w+'*\w*",i) word, count(*)
FROM fileslines JOIN iterator ON i <= matchcount(data,"\w+'*\w*")
WHERE fullpath LIKE '%sqlsys\qqtest\76.txt' GROUP BY word
ORDER BY count(*) DESC LIMIT 50;

E.g. Report email addresses from any file below project folder

SELECT fullpath, matchn(data, '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}',i) emailAddy
FROM fileslines JOIN iterator ON i<=matchcount(data, '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}')
WHERE fullpath LIKE 'C:\Users\johnsmith\myproject\%';



pattern
Regex pattern uses python regex syntax, see https://docs.python.org/2/library/re.html
See Crab's
match function entry for more regex pattern info



See How to Regex
A short tutorial on regular expressions with Crab

See Crab's other regex matching functions: groupn, match, matchcount


metadata(path, metaDataFieldName)

Returns the value of some metadata attribute for the filesystem object

Currently macOS only


normstr(string [,norm])

Converts string to normal form, by expressing each unicode glyph as code points in a standard order.

  • Use it to match non ASCII characters in filenames or file contents

Example

CRAB> SELECT name, codestr(name) FROM files WHERE name LIKE '%café%';
CRAB> SELECT name, codestr(name) FROM files WHERE name LIKE normstr('%café%');
{name: Café de Flore, codestr(name): Cafe\u0301 de Flore}
CRAB> select codestr('café');
{codestr('café'): caf\u00E9}

norm

If you want to specify the normalization, normstr can take a second argument . This can be 'NFC', 'NFD', 'NFKC' or 'NFKD'.

On Windows the normalization of filenames typed at the keyboard is usually 'NFC', on Mac the normalization of filenames is always 'NFD'. On Unix there is no standard.


The default value of norm is 'NFD'


See Crab's codestr function
To look at the code points which make up a string


pathexists(path)

Tests whether a path - file or directory - exists at query run time. Returns 1 (True) if the path exists, 0 (False) if it does not.

  • Use it to avoid errors when running commands or fileslines queries against files that might have been moved or deleted since the scan.
  • Use it to check that you're not going to clobber anything before moving or renaming a file.

Examples

-- avoid error due to sha1 function on non existent file
SELECT f1.bytes/1e9 as GB, f1.fullpath, f2.fullpath
FROM files f1 JOIN files f2 ON f1.bytes>1e8 AND f1.fileid>f2.fileid
AND f1.name=f2.name
AND f1.bytes=f2.bytes AND sha1(f1.fullpath) = sha1(f2.fullpath)
AND pathexists(f1.fullpath) AND pathexists(f2.fullpath);



-- avoid error due to query against fileslines for file that no longer exists
SELECT fullpath, count(*) FROM fileslines
WHERE parentpath LIKE 'C:\Users\johnsmith\tempfiles\%' AND pathexists(fullpath)
GROUP BY fullpath;



-- avoid name collision when renaming dictories by removing spaces
SELECT exec('rename', fullpath, replace(name,' ','')) FROM files
WHERE fullpath LIKE 'C:\Users\OneDirection\%' AND type = 'd'
AND NOT pathexists(parentpath||replace(name,' ',''));


short(s, n)

Restricts restricts string s to n characters by removing characters from the middle of the string, and replacing them with '~'

  • Use it to shorten wide strings in column output mode, when the most interesting information is at the start and the end of the string, e.g. paths and filenames

E.g.

CRAB> %mode column
CRAB> select short(name,20), bytes FROM files limit 10;
short(name,20)        bytes       
myproject             1972        
.DS_Store             43012       
DriftOffBo~singV3.py  12661       
ZZZBookCho~troller.h  283         
ZZZBookCho~troller.m  7007        
ZZZDetailN~troller.h  253         
ZZZDetailN~troller.m  1268        
ZZZDevView~troller.h  290         
ZZZDevView~troller.m  4303        
ZZZDriftOf~elegate.h  618          

Tips

  • Crab's short function will cause a column to display with width n, even if you used the %width command to assign column widths.


See %mode command
To change to column output mode


sublist(stringList, startPosition [,len], sep)

Returns a sublist from stringList starting at index startPosition, of length len.

  • Use when processing a string value as a list, e.g. to treat a path as a '\'-delimited list, to process a list of regular expression match results, or a row of csv data from fileslines.
  • Syntax and logic are modelled on the SQLite substr() function.

Examples

CRAB> SELECT sublist("a\b\32\74", 3, 2, '\');
{sublist("a\b\32\74", 3, 2, '\'): 32\74}
CRAB> SELECT sublist("a\b\32\74", -3, 2, '\');
{sublist("a\b\32\74", -3, 2, '\'): b\32}
CRAB> SELECT SUBLIST("a\b\32\74", -2, -2, '\');
{SUBLIST("a\b\32\74", -2, -2, '\'): a\b}
CRAB> SELECT SUBLIST("a\b\32\74", 2, 1, '\');
{SUBLIST("a\b\32\74", 2, 1, '\'): b}

Recursively count files beneath each directory at some level in the filesystem


SELECT sublist(fullpath,1,5,'\') AS dir, count(*) FROM files
WHERE fullpath LIKE 'C:\Users\johnsmith\Dropbox\%'
GROUP BY dir;


Extract columns 2 through 4 from a csv file

SELECT sublist(data,2,2,',')
FROM fileslines
WHERE fullpath = 'C:\Users\johnsmith\anaconda\lib\python2.7\site-packages\statsmodels\tsa\tests\results\arima111_forecasts.csv';


stringList
The string to be sliced, interpreted as a list.

startPosition
Starting position for the sublist. This is the first element of the returned list unless len is negative.
The first item of
stringList has position 1.
If
startPosition is negative, count backwards from the end of the list, e.g. last list element has position -1.

len
If optional argument len is given, then sublist returned is len items long.
If
len is positive, returns len items from stringList, starting from and including startPosition
If
len is negative, returns the abs(len) items from stringList, preceeding and not including startPosition
If
len is omitted then sublist(stringList, startPosition, sep) returns all items through the end of stringList beginning with startPosition.

sep
List separator. If stringList is a path use '\'. If stringList is a list of regular expression match results use '|', unless the default separator has been changed with the %matchsep command
Multi character separators are allowed.
Blob separators are supported, such as
X'09' to represent tab.


Tips
  • The convenience function itemn(X,i,sep) is equivalent to sublist(X,i,1,sep)
  • Use itemcount(X,sep) to find list length
  • Use Crab's iterator table to process multiple list items from a row of data
  • Paths that end in a directory have an empty string last element. E.g. 'C:\a\b\c\' is a list of length 5 where item 1 = 'C:', item 2 = 'a' and item 5 = ''



See Crab's other list-processing functions: inlist, itemcount, itemn

See Crab's above and below functions
Alternative function for truncating and trimming paths


u(string)

Converts unicode escape sequences and c-style escape sequences in string to unicode characters.

  • \uhhhh are converted to unicode code points with hex code hhhh
  • \xhh are converted to unicode code points with hex code 00hh.
  • u() is the inverse function to codestr()

Examples

CRAB> select u('"Ah. Philosophy," said \u0950');
"Ah. Philosophy," said   

Two ways to get unicode char U#00E1

CRAB> select u('\xe1');
á
CRAB> select u('\u00e1');
á

What bytes make up this character?

CRAB> select bytestr(u('\xe1'));
\xC3\xA1
CRAB>
select bytestr(u('\u00e1'));
\xC3\xA1

Applying u() to \xhh gives unicode character hh, not byte hh

CRAB> select u('\xc3\xa1');
á

To make byte strings use the SQLite blob prefix X with a string of hex digits

CRAB> select cast(X'c3a1' as text);
á

Tips

  • Crab displays newline characters in query results as \x0A, so as not to mess up the row-oriented output.


See also Crab's bytestr and codestr functions, and the %encoding command


write(filename, [,data1] [,data2] ...)

Writes query results to file

  • Data is appended to the file, one write operation for each result row. If the file doesn't exist, it is created. If you want a newline at the end of each row of data, use the writeln function instead.
  • Data is written using the encoding specified with the %encoding command. The default is utf8.

filename
The first argument is the name or full path of the file to write to

data1, ...
The following arguments, are the data you want written. These can be strings, numbers or blobs. Numbers and blobs will be converted to strings,. If you use binary encoding blob data will be written unchanged.


Examples

E.g. To copy a binary file, 40 bytes at a time


%encoding binary:40

SELECT write('C:\Users\johnsmith\myproject\cp001.tif',data) FROM fileslines
WHERE fullpath LIKE '%\myproject\001.tif' "



Restrictions

write is a delayed evaluation function like exec. It sends output to the screen, but does not return results you can use in other parts of your query because the rest of the query has already finished evaluation when it executes.

For this reason it is only supported in the SELECT clause of a top level query, not in subqueries, WHERE clauses, ORDER BY clauses, inside other functions and so on.


Tips

  • Remember that any changes you make to the filesystem - files you delete or add - won't be reflected in the files table until you scan the changed directories.


See Crab's writeln function
To write data one line at a time

See %encoding command
To write binary files, or different character sets


writeln(filename, [,data1] [,data2] ...)

Writes query results to file with newlines

  • Data is appended to the file, one line for each result row. If the file doesn't exist, it is created. If you don't want a newline at the end of each row of data, use the write function instead.
  • Data is written using the encoding specified with the %encoding command. The default is utf8.

filename
The first argument is the name or full path of the file to write to

data1, ...
The following arguments, are the data you want written. These can be strings, numbers or blobs. Numbers and blobs will be converted to strings,. If you use binary encoding blob data will be written unchanged.


Examples

E.g. To search and replace text in a set of files: This example reads data from every .c file in directory \HL2\, and writes the substituted text to a file that is given the suffix .003


SELECT writeln(fullpath||'.003', replace(data,'Kanal','Canal')) FROM fileslines
WHERE extension = '.c' AND parentpath LIKE '%\HL2\' ORDER BY fullpath, linenumber;



E.g. To strip all non ascii characters from a file


%encoding ascii:ignore

SELECT writeln(fullpath||'.clean', data) FROM fileslines
WHERE fullpath = 'C:\Users\gn\HL3\menuOptions.py';



Restrictions

writeln is a delayed evaluation function like exec. It sends output to the screen, but does not return results you can use in other parts of your query because the rest of your query has already finished evaluation when it executes.

For this reason it is only supported in the SELECT clause of a top level query, not in subqueries, WHERE clauses, ORDER BY clauses, inside other functions and so on.


Tips

  • Remember that any changes you make to the filesystem - files you delete or add - won't be reflected in the files table until you scan the changed directories.


See Crab's write function
To write data without the newlines

See %encoding command
To write binary files, or different character sets


© 2017 Etia UK