To query file contents, select from the fileslines table. This has one row for each line of text in every file scanned. It has the same fields as the files table, but also a data field which contains the text, and a linenumber field. It is a virtual table, so it reads contents of files at query run time, and does not replicate file contents in the Crab database. File details in the WHERE clause determine which files are opened and read.

WARNING: When querying file contents, set the right encoding and error handling, or you may miss data.
(See the box below titled "Set the right encoding")

List processing functions for delimited files

List processing functions are available to extend standard SQL when processing delimited file contents.
They treat the text of each line of the file as a list, stringList, whose elements are separated by the delimiter sep.

inlist(stringList, stringItem, sep) - Returns the list index of the first occurrence of stringItem as an element of stringList, or 0 if not found
itemcount(stringList, sep) - Returns the length of stringList, counted as number of separators + 1
itemn(stringList, i, sep) - Returns the i'th element of stringList, treating sep as the list separator
sublist(stringList, startPosition [,len], sep) - Returns a sublist from stringList starting at index startPosition, of length len
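For intuition, the documented behaviour of these functions can be sketched in Python. These are illustrative re-implementations, not Crab's own code; in particular, the 1-based indexing is an assumption based on the "or 0 if not found" convention above.

```python
# Illustrative Python sketches of Crab's list processing functions.
# Not Crab's code; 1-based indexing is assumed from "0 if not found".

def itemcount(string_list, sep):
    # Length of the list: number of separators + 1
    return string_list.count(sep) + 1

def itemn(string_list, i, sep):
    # The i'th element (1-based)
    return string_list.split(sep)[i - 1]

def inlist(string_list, string_item, sep):
    # 1-based index of the first occurrence, or 0 if not found
    items = string_list.split(sep)
    return items.index(string_item) + 1 if string_item in items else 0

def sublist(string_list, start, length, sep):
    # Sublist of `length` elements starting at 1-based index `start`
    items = string_list.split(sep)
    return sep.join(items[start - 1:start - 1 + length])

line = 'a,b,c,d'
print(itemcount(line, ','))      # 4
print(itemn(line, 2, ','))       # b
print(inlist(line, 'c', ','))    # 3
print(sublist(line, 2, 2, ','))  # b,c
```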

Regex functions

As well as the MATCH operator, a number of regex pattern processing functions are available:

groupn(string, pattern, i) - Returns all matches for group i when matching pattern to string as a regular expression
string MATCH pattern - Returns 1 (True) if string matches the regex pattern, 0 (False) if it does not
matchcount(string, pattern) or matchcount(groupn(string, pattern, j)) - Returns how many matches were found when matching regex pattern to string, or how many results were returned by groupn
matchn(string, pattern, i) or matchn(groupn(string, pattern, j), i) - Returns the i'th match of regex pattern in string, or the i'th element from the list of groupn match results
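The documented semantics can again be sketched in Python using the standard re module. These are illustrations, not Crab's implementation, and the 1-based match index is an assumption.

```python
import re

# Illustrative Python sketches of Crab's regex functions (not Crab's code;
# 1-based match indices are assumed).

def matchcount(string, pattern):
    # Number of non-overlapping matches of the regex pattern in string
    return len(re.findall(pattern, string))

def matchn(string, pattern, i):
    # The i'th match (1-based)
    return re.findall(pattern, string)[i - 1]

def groupn(string, pattern, i):
    # All matches for capture group i
    return [m.group(i) for m in re.finditer(pattern, string)]

line = 'id=7, id=42'
print(matchcount(line, r'id=\d+'))   # 2
print(matchn(line, r'id=\d+', 2))    # id=42
print(groupn(line, r'id=(\d+)', 1))  # ['7', '42']
```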

Extra string functions

Crab extends standard SQL with extra string functions, particularly useful for processing files.

bytestr(s [,format]) - Returns the string or BLOB s as a printable byte sequence
chopstr(s, n) - Chops n characters off the beginning of string s, returning the part of s that remains. If n is negative, chops characters from the end of s
codestr(s [,format]) - Returns the string or BLOB s as a sequence of unicode code points
normstr(string [,norm]) - Converts string to normal form, by expressing each unicode glyph as code points in a standard order
notbs(string) - Removes the last character from string if it is a backslash, as required for directory path arguments to some Windows commands
short(s, n) - Restricts string s to n characters in width, by removing characters from the middle of the string and replacing them with '~'
shortpath(path) - Turns a normal Windows path string into a path string made of short DOS-style 8.3 abbreviated names
u(string) - Converts unicode escape sequences and C-style escape sequences in string to unicode characters
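As a rough guide, chopstr() and short() behave like the Python sketches below. These follow the descriptions above but are not Crab's code; in particular, exactly where short() splits the string around the '~' is an assumption of this sketch.

```python
# Illustrative sketches of chopstr() and short() semantics (not Crab's code).

def chopstr(s, n):
    # Remove n characters from the start of s, or from the end if n is negative
    return s[n:] if n >= 0 else s[:n]

def short(s, n):
    # Limit s to n characters by replacing the middle with '~'.
    # The exact split point around '~' is this sketch's assumption.
    if len(s) <= n:
        return s
    keep = n - 1                 # characters kept besides the '~'
    left = (keep + 1) // 2       # kept from the start
    right = keep - left          # kept from the end
    return s[:left] + '~' + (s[-right:] if right else '')

print(chopstr('filename.txt', 4))           # name.txt
print(chopstr('filename.txt', -4))          # filename
print(short('averylongfilename.txt', 10))   # avery~.txt (with this sketch's split)
```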

Standard string functions

For completeness, here are the standard SQLite string functions:

hex(b) - Returns the BLOB b as an upper case hex string, two hex characters per byte
instr(s, pattern) - Returns the position of pattern in string s, or 0 if not found
length(s) - Returns the length of string s in characters (not bytes)
lower(s) - Converts string s to lower case (ASCII characters only)
ltrim(s [,chars]) - Removes any and all characters in string chars from the left of string s. If chars is not specified, removes spaces
replace(s, oldstring, newstring) - Replaces every occurrence of oldstring by newstring throughout string s
rtrim(s [,chars]) - Removes any and all characters in string chars from the right of string s. If chars is not specified, removes spaces
substr(s, startposn, len) - Returns the substring of s that begins at startposn (an integer) and is len characters long
trim(s [,chars]) - Removes any and all characters in string chars from both ends of string s. If chars is not specified, removes spaces
unicode(s) - Returns the numeric unicode code point corresponding to the first character of string s
upper(s) - Converts string s to upper case (ASCII characters only)
x'hh' - The prefix x flags a hex character string as a BLOB. You can use this as a string of UTF-8 characters, e.g. x'09' is tab
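Because these are standard SQLite functions, you can experiment with them in any SQLite shell, for example via Python's built-in sqlite3 module:

```python
import sqlite3

# Try the standard SQLite string functions against an in-memory database.
con = sqlite3.connect(':memory:')
row = con.execute("""
    SELECT hex(x'09'),                       -- BLOB as hex
           instr('ufo_awesome.csv', '.'),    -- 1-based position of '.'
           length('café'),                   -- characters, not bytes
           replace('a,b,c', ',', '|'),       -- substitute every comma
           trim('  padded  '),               -- strip spaces from both ends
           substr('ufo_awesome.csv', 1, 3)   -- first three characters
""").fetchone()
print(row)   # ('09', 12, 4, 'a|b|c', 'padded', 'ufo')
```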

Encoding

Set the right encoding

Crab's default settings are configured for UTF-8 and ASCII text files: any non-UTF-8 character causes your query to skip the rest of the file, so that the contents of binary files are excluded from query results.

  • To search files that have occasional exotic characters, change the %encoding setting from utf8:skipfile to utf8:ignore or utf8:replace, e.g.

    %encoding utf8:ignore

    However this will include contents of binary files in queries against fileslines, unless you exclude them with WHERE clause criteria.

  • To search files that are not ASCII or UTF-8, specify the appropriate encoding. E.g. for Korean

    %encoding euc-kr:replace

  • You can check all current settings with the %show command

    %show

Examples

E.g. Report the most common comma-count for lines in a file


SELECT matchcount(data, ',') numCommas, count(*) numLines
FROM fileslines
WHERE name = 'ufo_awesome.csv'
GROUP BY numCommas;

Reports how many lines in file 'ufo_awesome.csv' have 0 commas, 1 comma, 2 commas, 3 commas, ... The most common number of commas is probably the correct data format! If the lines have a highly variable number of commas, or they all have zero, the file may not be comma delimited.

The matchcount() function matches the regex pattern - here just a comma character - to each line of the file, and returns the number of occurrences, i.e. the number of commas in the line. The GROUP BY numCommas clause tells the query to count how many lines have each number of occurrences.

Commas inside quotes are counted the same as other commas; you'll need a more complex regex to treat them differently.
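To illustrate what a more complex regex might look like, the pattern below counts only commas outside double quotes. It is shown here in Python rather than Crab, and it assumes double quotes are balanced on each line.

```python
import re

# A comma matches only if an even number of double quotes follows it on the
# line, i.e. the comma is outside any quoted field (assumes balanced quotes).
PATTERN = r',(?=(?:[^"]*"[^"]*")*[^"]*$)'

line = 'red,"sphere, glowing",1999'
print(len(re.findall(PATTERN, line)))   # 2  (the comma inside quotes is skipped)
print(len(re.findall(PATTERN, 'a,b,c')))  # 2
```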

N.B. This assumes you have scanned only one file called 'ufo_awesome.csv'. If there are more, you'll need to replace the filter on name by a filter on the fullpath of the specific file you want to analyze. Alternatively, group by fullpath as well as numCommas.

E.g. Report the most common tab-count for lines in a file


SELECT matchcount(data, x'09') numTabs, count(*) numLines
FROM fileslines
WHERE name = 'ufo_awesome.csv'
GROUP BY numTabs;

Same as the previous query, but with tab delimiters: Reports how many lines in file 'ufo_awesome.csv' have 0 tabs, 1 tab, 2 tabs, 3 tabs, ...

The matchcount() function matches the regex pattern - just x'09', a tab character - to each line of the file, and returns the number of occurrences, i.e. the number of tabs in the line. The GROUP BY numTabs clause tells the query to count how many lines have each number of occurrences.

N.B. This assumes you have scanned only one file called 'ufo_awesome.csv'. If there are more, you'll need to replace the filter on name by a filter on the fullpath of the specific file you want to analyze. Alternatively, group by fullpath as well as numTabs.

E.g. Keep only lines that have the correct number of tabs


SELECT writeln(parentpath || 'ufo_awesome-cleaned.csv', data)
FROM fileslines
WHERE name = 'ufo_awesome.csv' and matchcount(data,x'09') = 5;

After you've run the previous query to work out how many tab delimiters are supposed to be in the data, you can use this query to keep only lines with the correct number.

N.B. This assumes you have scanned only one file called 'ufo_awesome.csv'. If there are more, you'll need to replace the filter on name by a filter on fullpath

E.g. Filter lines with correct number of tabs to one file, lines with incorrect number to another


SELECT case when matchcount(data, x'09') = 5
then writeln(parentpath || 'ufo_awesome-cleaned.csv', data)
else writeln(parentpath || 'ufo_awesome-badrows.csv', data) end
FROM fileslines
WHERE name = 'ufo_awesome.csv';

This is similar to the previous query, but it additionally filters any lines with an incorrect number of tabs to a separate exceptions file.

N.B. Although writeln() isn't generally allowed in subexpressions or subqueries, it's ok to wrap it in a CASE expression, as here.

N.B. This assumes you have scanned only one file called 'ufo_awesome.csv'. If there are more, you'll need to replace the filter on name by a filter on fullpath, otherwise lines from all of them will be processed.

E.g. Count number of lines that have quote-marks


SELECT sum(matchcount(data,'"|''')>0) numLinesWithQuotes, count(*) numLines
FROM fileslines
WHERE name = 'ufo_awesome.csv';

Reports how many lines in file 'ufo_awesome.csv' contain either a single or a double quote mark.

The matchcount() function is used to count how many matches were found for our regex pattern in each line of data. This particular pattern matches either a double quote or a single quote; in SQLite a single quote character can be entered inside a single-quote delimited string by writing two adjacent single quote characters: '', as we've done here.

The expression    matchcount() > 0    returns 1 (True) if matchcount() returns 1 or more matches, and 0 (False) otherwise. Summing these gives a count of the lines which have a match.
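The summing-a-boolean idiom is plain SQLite, so it can be tried outside Crab too, e.g. with Python's sqlite3 module:

```python
import sqlite3

# In SQLite a comparison yields 1 or 0, so sum(condition) counts matching rows.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE t (n INTEGER)')
con.executemany('INSERT INTO t VALUES (?)', [(0,), (2,), (5,), (0,)])
count, total = con.execute('SELECT sum(n > 0), count(*) FROM t').fetchone()
print(count, total)   # 2 4
```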

N.B. This assumes you have scanned only one file called 'ufo_awesome.csv'. If there are more, you'll need to replace the filter on name by a filter on the fullpath of the specific file you want to analyze. Alternatively, group by fullpath to report the count for each file.

E.g. Copy files trimming white space from each line, and removing blank lines


SELECT writeln(parentpath||basename||'-trimmed'||extension, trim(data))
FROM fileslines
WHERE parentpath = 'C:\somepath\MyProject\' and extension = '.dat' and length(trim(data))>0;


Copies every '.dat' file in the 'MyProject' directory to a 'trimmed' version.

N.B. Because we only pick up files whose parentpath is exactly the 'MyProject' directory, this is not recursive.

E.g. Report a frequency distribution of the values in column 4, for rows with well formed data


SELECT itemn(data,4, x'09') objectSighted, count(*) numEntries
FROM fileslines
WHERE name = 'ufo_awesome.csv' and matchcount(data, x'09') = 5
GROUP BY objectSighted
ORDER BY count(*) DESC;

The query reports a frequency distribution for values in column #4 of a tab-separated text file, including only rows that have 5 tab characters.

The query filters by filename. If you have scanned two files with the same name, data from both files will be included. To restrict results to one of them you'll need to replace the filter on name by a filter on the fullpath of the specific file you want to analyze. Alternatively, calculate the frequency distribution of column #4 over a whole set of files by using a looser filter, e.g. a wildcard on name or parentpath, or a specific extension.

E.g. Sum integers, one per line, from file 'numbers.txt'


SELECT sum(cast(data as integer))
FROM fileslines
WHERE fullpath = 'C:\somepath\numbers.txt';

E.g. Copy every text file in directory 'MyProject', sorting the lines of each file numerically (as integers)


SELECT writeln(parentpath || basename || '-Sorted' || extension, data)
FROM fileslines
WHERE parentpath = 'C:\somepath\MyProject\'
and extension = '.txt'
ORDER BY fullpath, cast(data as integer);

Copied files are saved to the same directory, and have '-Sorted' appended to the original names.

E.g. For each subdirectory of 'MyProject' copy all the text files into one concatenated file per subdirectory


SELECT writeln(parentpath || 'combined.txt', data)
FROM fileslines
WHERE parentpath like 'C:\somepath\MyProject\%'
and extension = '.txt';

Each subdirectory of 'C:\somepath\MyProject\' (including the 'MyProject' directory itself) will get a file called 'combined.txt', with the concatenated contents of all its text files.

E.g. Copy 'SomeFile.txt' to 'SomeFile2.txt' replacing 'this text' by 'that text' throughout


SELECT writeln('C:\somepath2\SomeFile2.txt', replace(data, 'this text', 'that text'))
FROM fileslines
WHERE fullpath = 'C:\somepath\SomeFile.txt';

E.g. Recursively copy every text file in directory 'MyProject', removing non ASCII characters


%encoding ascii:ignore


SELECT writeln(parentpath || basename || '-ASCII' || extension, data)
FROM fileslines
WHERE parentpath like 'C:\somepath\MyProject\%'
and extension = '.txt';

%encoding utf8:skipfile


The first step is to set Crab's encoding to read ASCII characters and silently ignore any others.

Next the target files are all read and copied; in the process each character is passed through this ASCII filter. Copies are saved to the same directory as the original file, with an '-ASCII' suffix.

Finally the encoding is set back to default.

Tip: Encoding error handling


Use the %encoding command to specify the behaviour you require when byte sequences are read that are not valid characters in your chosen encoding, e.g.

    %encoding utf8:ignore

Valid error handlers are:
    ignore - the invalid character is ignored
    replace - the invalid character is replaced by a placeholder symbol
    skipfile - the invalid character is treated as an end-of-file marker
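The ignore and replace behaviours match the standard codec error handlers of the same names, illustrated here with Python's bytes.decode (skipfile is Crab-specific and has no direct equivalent):

```python
# 0xff is not a valid UTF-8 byte, so decoding triggers the error handler.
raw = b'caf\xff025'

print(raw.decode('utf8', errors='ignore'))    # caf025  (bad byte dropped)
print(raw.decode('utf8', errors='replace'))   # the bad byte becomes U+FFFD
```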



Tip: Save typing by scanning only the files you want to analyze


When launching Crab you can specify files to scan rather than directory paths. If you do this, all queries are restricted to the scanned files, so you won't need a filter on filename or path in the WHERE clause.

N.B. On Windows wildcard scan paths aren't yet supported, but you can provide a list of files, or scan just one directory; see the ‑maxdepth option.

E.g. to analyze all .log files in the current directory:

$ crab *.log

Now any queries will search only the content of .log files in the current directory and you won't need a restriction on filename or path in your WHERE clause.

SELECT name, matchcount(data, ',') numCommas, count(*) numLines
FROM fileslines
GROUP BY name, numCommas;



Tip: Memory constraints


The total number of rows that can be written to file by a query is limited by memory constraints.

On macOS we've tested with hundreds of millions of rows without any problems. On Windows the limit is currently much lower, a few million rows, because on that platform Crab is currently a 32-bit app. A 64-bit version is under development.



Tip: Suppress write() and writeln() echo to screen


By default write() and writeln() echo every write they execute to your Command Line window. If you are running a write function over hundreds of thousands or millions of rows this will be slow, and your screen will be a mess.

You can speed up the query and keep the screen clear by discarding the output. Use the following command before running the query:

%output NUL


Any error messages will still go to the Command Line window, as will subsequent CRAB> prompts.

To switch output back to the screen do this:

%output stdout


If you want to keep a log of the output rather than discard it, use a filename in place of NUL, e.g.

%output 'C:\somepath\MyLogFile.txt'




Tip: Alternative output modes


When you query file contents you might prefer a different output format for query results.

  • Each field on a line of its own

    %mode line

  • One column per field, truncated to 10 characters; use the %width command to change the number of characters

    %mode column

  • Comma delimited output fields

    %mode list

  • Dictionary output mode

    %mode dict

To return to the default, set %mode back to the value reported by %show before you changed it.

Column and list modes have a header row that can be toggled on or off with the %header command.

© 2017 Etia UK