How to Regex

Let's say we're interested in matching the words in some lines of data. I'm going to use the following line of data as an example: "Les Oeuvres completes de Francois Villon, by Francois Villon"

The regex pattern
\w+ matches one or more word-characters, and the pattern \W+ matches one or more non-word characters. Let's try matching one more non-word characters followed by one or more word-characters


select matchn("Les Oeuvres completes de Francois Villon, by Francois Villon",'\W+\w+',1) as word;

{word: Oeuvres}


The first word we get is Oeuvres. That's because "Les" isn't preceeded by a non-word character. It isn't preceeded by *any* character.

The sixth word matched is


select matchn(groupn("Les Oeuvres completes de Francois Villon, by Francois Villon",'\W+\w+'),6) as word;

{word: , by}


You can see that
matchn returns the word matched by \w+ and the preceeding comma matched by \W+

If we want a list of words, we need the non-word characters in the match to help us identify the words, but we don't want them returned. The way to do this is with a "group". This is a part of the pattern that captures the characters it matches. Use the Crab function
groupn to return the characters captured by a particular group. A group is represented in the pattern by brackets.

Below is the same pattern we had before, but with brackets around the part of the pattern that matches the word-characters.
matchn gives the same result it did before we added the bracket. The pattern matches the same text irrespective of the fact that it has a capturing group


select matchn("Les Oeuvres completes de Francois Villon, by Francois Villon",'\W+(\w+)',6) as word;

{word: , by}


But now we can use the
groupn function just to pick out the part of the match that was captured by the bracket


select groupn("Les Oeuvres completes de Francois Villon, by Francois Villon",'\W+(\w+)') as wordlist;

{wordlist: Oeuvres|completes|de|Francois|Villon|by|Francois|Villon}


groupn has returned all the matches in a list delimited by '|'. In this example there is only one capturing group, so groupn doesn't need an index to tell it which group to return. When you have more than one capturing group you'll need an index.

You can use
matchn to return just one match from the list returned by groupn.

If we pick out the sixth match, now we get the word 'by'


select matchn(groupn("Les Oeuvres completes de Francois Villon, by Francois Villon",'\W+(\w+)'),6) as word;

{word: by}


To pick out all the words we can use the
iterator virtual table and its field i which gives all the integers from 1 up to whichever limit we choose.

Here's how



select matchn(groupn("Les Oeuvres completes de Francois Villon, by Francois Villon",'\W+(\w+)'),i) as word
from iterator where i<=matchcount(groupn("Les Oeuvres completes de Francois Villon, by Francois Villon",'\W+(\w+)'));

{word: Oeuvres}
{word: completes}
{word: de}
{word: Francois}
{word: Villon}
{word: by}
{word: Francois}
{word: Villon}



Note the use of
matchcount to identify how many words we need to extract with the iterator.

Only one job remains, to pick up the first word from the text. To do that we change the pattern so it identifies a word as a sequence of word characters after
either a non-word character, or the start of the line. This means using the regex or operator: '|' (don't be confused with our use of the same character to delimit the list of matches earlier!).

To get the 'or' operator to work correctly we need to add some more brackets to make it clear what it applies to. Now that we have more brackets in the regex we need also to tell
groupn which brackets it's supposed to extract the value from - that's where the number 2 comes from


select matchn(groupn("Les Oeuvres completes de Francois Villon, by Francois Villon",'(^|\W)+(\w+)',2),i)
as word from iterator
where i<=matchcount(groupn("Les Oeuvres completes de Francois Villon, by Francois Villon",'(^|\W)+(\w+)',2));

{word: Les}
{word: Oeuvres}
{word: completes}
{word: de}
{word: Francois}
{word: Villon}
{word: by}
{word: Francois}
{word: Villon}



Exploiting greediness


We've written the query this way to illustrate the behavior of the regex matching functions. But for this example there is a simpler way to write it. That's because pattern matching is greedy, so the pattern
'\w+' will match as many word characters as it can. We don't need to worry about whether it's preceeded by any non word characters, the match will grow to capture as many word characters as it can find.

In fact the simplest version of the query is this:


select matchn("Les Oeuvres completes de Francois Villon, by Francois Villon",'\w+',i)
as word from iterator
where i<=matchcount("Les Oeuvres completes de Francois Villon, by Francois Villon",'\w+');

{word: Les}
{word: Oeuvres}
{word: completes}
{word: de}
{word: Francois}
{word: Villon}
{word: by}
{word: Francois}
{word: Villon}


© 2017 Etia UK