Read file contents

To query file contents select from the fileslines table. It has one row for each line of text in every file scanned. It has the same fields as the files table, but also a data field which contains the text, and a linenumber field.

fileslines is a virtual table, it reads contents of files at query run time and does not replicate file contents in the Crab database. File details in the WHERE clause determine which files are opened and read.

Set the right encoding

Crab's default settings are configured for UTF-8 and ASCII text files, any non UTF-8 character will cause your query to skip the rest of the file, so as to exclude contents of binary files from query results.

  • To search files that have occasional exotic characters change the %encoding setting from utf8:skipfile to utf8:ignore or utf8:replace, e.g.

    %encoding utf8:ignore

    This will include contents of binary files in queries against fileslines, unless you exclude them with WHERE clause criteria.

  • To search files that are not ASCII or UTF-8, specify the appropriate encoding. E.g. for Korean

    %encoding euc-kr:replace

  • You can check all current settings with the %show command

    %show

E.g. Display every line of file 'Import.log', with line numbers


SELECT linenumber, data
FROM fileslines
WHERE fullpath = '/somepath/Import.log';

See Tip at the bottom of this page, "Alternative output modes" for various ways to lay out query results

E.g. Recursively find all text files below 'MyProject' directory containing the string 'Error'


SELECT name, linenumber, data
FROM fileslines
WHERE fullpath like '/somepath/MyProject/%.txt'
and data like '%Error%';


The query reports file names, line numbers and line text for the lines that match.

The search is recursive because the wildcard pattern for fullpath will match .txt files below 'MyProject' at any depth.

E.g. Find all .log files in 'MyProject' directory containing lines which match either of two patterns


SELECT DISTINCT fullpath FROM fileslines
WHERE parentpath = '/somepath/MyProject/'
and extension = '.log'
and (data like '%Error% 271 %' or data like '%Error% 314 %');

This query reports the fullpath of each file that matches. We use DISTINCT to get the path once for each file that matches instead of once for each line that matches.

The search isn't recursive, it searches only .log files that are direct children of 'MyProject'.

E.g. Find all .log files in 'MyProject' directory which contain a match for three patterns, possibly on different lines


SELECT
fullpath,
max(case when data like '%Error% 271 %' then 1 else 0 end) as err271,
max(case when data like '%Error% 314 %' then 1 else 0 end) as err314,
max(case when data like '%Error% 1414 %' then 1 else 0 end) as err141
FROM fileslines
WHERE parentpath = '/somepath/MyProject/'
and extension = '.log'
GROUP BY fullpath
HAVING err271 + err314 + err141 = 3;

E.g. Recursively count 'FIXME' lines in each Python file below the 'MyProject' directory


SELECT fullpath, count(*) FROM fileslines
WHERE parentpath like '/somepath/MyProject/%'
and extension = '.py' and data like '%FIXME%'
GROUP BY fullpath ORDER BY count(*) DESC;

E.g. Recursively count total lines of code in all Python files below the 'MyProject' directory


SELECT count(*) FROM fileslines
WHERE parentpath like '/somepath/MyProject/%'
and extension = '.py' and length(trim(data))>0;

This counts all lines that have some non-whitespace characters.

E.g. Recursively find all classes defined in Python files below 'MyProject' directory


SELECT fullpath, linenumber, trim(data)
FROM fileslines
WHERE extension = '.py' and parentpath like '/somepath/MyProject/%'
and data match '^\s*class\s+.*'
ORDER BY trim(data) DESC;

E.g. Sum integers, one per line, from file 'numbers.txt'


SELECT sum(cast(data as integer))
FROM fileslines
WHERE fullpath = '/somepath/numbers.txt';

E.g. Copy every file listed in 'files.txt' to directory 'PublishMe'


SELECT exec('cp', '-f', data, '/somepath2/PublishMe/')
FROM fileslines
WHERE fullpath = '/somepath/files.txt';


'files.txt' must contain a valid file name or full path on each line, because the query is executing 'cp' on each one, and will halt if an error is returned.

If 'files.txt' lines include whitespace, remove it using the trim function: in the query change data to trim(data)

If 'files.txt' contains file names without path prefixes, they will be assumed to be in the directory where you invoked Crab.

Crab needs to have scanned the file 'files.txt' in order for it to be known to fileslines, but does not need to have scanned the files listed there, nor the target directory 'PublishMe'.

The '-f' option means that copied files will overwrite anything with the same name. Use pathexists() if you don't want to overwrite.

Write to files

writeln() and write()

Use Crab functions writeln() and write() to write to one or more files from a query.


SELECT writeln(FileName, DataToWrite) FROM ...

or

SELECT write(FileName, DataToWrite) FROM ...



Specify the file to write to, FileName, with a string giving the file name or full path. This can be calculated at query run time, possibly different for each result row. The string DataToWrite will be appended to the file.

Use the writeln() function for creating line-oriented files, it writes a newline at the end of each string whereas write() does not write any delimiter.

Like exec(), write() and writeln() are only allowed in the SELECT clause at the top level of a query. This is because they are delayed evaluation functions: The rest of the query must have finished evalution before strings can be passed to them.

Make sure Crab's %encoding setting, UTF-8 by default, matches the file contents.

To turn off writeln() and write() echo to screen, see "Tips" at the bottom of this page.

E.g. Copy every text file in a directory, sorting the lines of each file as an integer


SELECT writeln(parentpath || basename || '-Sorted' || extension, data)
FROM fileslines
WHERE parentpath = '/Users/johnsmith/MyProject/'
and extension = '.txt'
ORDER BY fullpath, cast(data as integer);

Copied files are saved to the same directory, and have '-Sorted' appended to the original names.

E.g. Concatenate the first three lines of every text file in directory 'MyProject' into a single file


SELECT writeln('/Users/johnsmith/Heads.txt', data)
FROM fileslines
WHERE parentpath = '/Users/johnsmith/MyProject/'
and extension = '.txt'
and linenumber <= 3
ORDER BY name, linenumber;

The output file is '/Users/johnsmith/Heads.txt'

E.g. Copy every text file in a directory, removing the last line of each


SELECT writeln(parentpath || basename || '-1' || extension, data)
FROM fileslines f1
WHERE parentpath = '/somepath/MyProject/'
and extension = '.txt'
and linenumber < (SELECT max(linenumber) FROM fileslines f2
WHERE f2.fullpath=f1.fullpath);

Copied files have '-1' appended to the original name

E.g. Recursively copy every text file in directory 'MyProject', removing non ASCII characters


%encoding ascii:ignore

SELECT writeln(parentpath || basename || '-ASCII' || extension, data)
FROM fileslines
WHERE fullpath like '/Users/johnsmith/MyProject/%'
and extension = '.txt';

%encoding utf8:skipfile


First step is to set Crab encoding to read ASCII characters and silently ignore any others.

Next the target files are all read and copied, in the process each character is passed through this ASCII filter. Copies are saved to the same directory as the original file, with an -ASCII suffix.

Finally the encoding is set back to default.

E.g. Copy 'SomeFile.txt' to 'SomeFile2.txt' replacing 'this text' by 'that text' throughout



SELECT writeln('/somepath2/SomeFile2.txt', replace(data, 'this text', 'that text'))
FROM fileslines
WHERE fullpath = '/somepath/SomeFile.txt';

Tip: Encoding error handling


Use the %encoding command to specify the behaviour you require when byte sequences are read that are not valid characters in your chosen encoding, e.g.

    %encoding utf8:ignore

Valid error handlers are:
    ignore - the invalid character is ignored
    replace - the invalid character is replaced by a placeholder symbol
    skipfile - the invalid character is treated as end of file marker



Tip: Suppress write() and writeln() echo to screen


By default write() and writeln() output every write command executed to your Command Line window. If you are running a write function over hundreds of thousands or millions of rows this will be slow.

You can speed up the query by discarding the output. Use the following command before running the query:

%output /dev/null

Any error messages will still go to the Command Line window, as will subsequent CRAB> prompts

To switch output back to the screen do this:

%output stdout

If you want to keep a log of the output rather than discard it, use a filename in place of /dev/null, e.g.

%output '/somepath/MyOutputLog.txt'




Tip: Memory constraints


The total number of rows that can be written to file by a query is limited by memory constraints.

On macOS we've tested with hundreds of millions of rows without any problems. On Windows the limit is currently much lower, a few million rows, because on that platform Crab is currently a 32bit app. A 64bit version is under development.



Tip: Alternative output modes


When you query file contents you might prefer a different output format for query results.

  • Each field on a line of its own

    %mode line

  • One column per field truncated to 10 chars, use the %width command to change number of chars.

    %mode column

  • Comma delimited output fields

    %mode list

To return to the default:

  • Dictionary output mode

    %mode dict

column and list modes have a header row that can be toggled on or off with the %header command

© 2017 Etia UK