Regular Expressions

WARNING: This is a woefully incomplete overview of regular expressions. It would be absurd to try to fully cover the topic in a short handout like this. Hopefully, this will provide some of the basics to get you started, but to really understand regular expressions, I implore you to read as much of Mastering Regular Expressions by Jeffrey E.F. Friedl as you have time for.

A regular expression is a sequence of characters that describes, or matches, a pattern of text. For example, the sequence bob, considered as a regular expression, would match any occurrence of the characters “bob” inside of another text. The following is a rather rudimentary introduction to the basics of regular expressions. We could spend the entire semester studying regular expressions if we put our minds to it. Nevertheless, we’ll just have a basic introduction to them this week and learn more advanced techniques as we explore different text processing applications over the course of the semester.

A truly wonderful book on the subject is Mastering Regular Expressions by Jeffrey Friedl. Chapter 1, available via the Safari Network (through NYU), can be found here:

http://safari.oreilly.com/0596002890/mastregex2-CHP-1

Regular expressions (sometimes referred to as ‘regex’ for short) have both literal characters and metacharacters. In bob, all three characters are literal, i.e. the ‘b’ wants to match a ‘b’, the ‘o’ an ‘o’, etc. We might also have the regular expression:

^bob

In this case, the ‘^’ is a metacharacter, i.e. it does not want to match the character ‘^’, but instead indicates the “beginning of a line.” In other words, the regex above would find a match in:

bob goes to the park.

but would not find a match in:

jill and bob go to the park.
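
To try this out in Java, here is a minimal sketch. The class name AnchorDemo and the helper found are made up for this handout; Pattern.compile and Matcher.find are the standard java.util.regex calls.

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // Hypothetical helper: true if the regex finds a match anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // '^' anchors the match to the beginning of the line.
        System.out.println(found("^bob", "bob goes to the park."));        // true
        System.out.println(found("^bob", "jill and bob go to the park.")); // false
        // Without the anchor, the literal characters can match anywhere.
        System.out.println(found("bob", "jill and bob go to the park."));  // true
    }
}
```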

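Here are a few common metacharacters to get us started (I’m listing them below as they would appear in a Java regular expression, which may differ slightly from Perl, PHP, .NET, etc.). The backslash is doubled in the lists below because that is how it has to be typed inside a Java string literal; the regular expression itself contains a single backslash, which is how the examples further down are written.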

Position Metacharacters:

^     beginning of line
$     end of line
\\b    word boundary
\\B    non-word boundary

Single Character Metacharacters:

.     any one character
\\d    any digit from 0 to 9
\\w    any word character (a-z, A-Z, 0-9, and the underscore)
\\W    any non-word character
\\s    any whitespace character (space, tab, new line, form feed, carriage return)
\\S    any non-whitespace character
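
Here is a small sketch of a few of these metacharacters in Java. Note how each backslash is typed twice inside the string literal. The class name MetaDemo, the helper found, and the sample strings are all invented for illustration.

```java
import java.util.regex.Pattern;

class MetaDemo {
    // Hypothetical helper: true if the regex finds a match anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found("\\d", "room 101"));          // true: contains a digit
        System.out.println(found("\\d", "no digits here"));    // false
        System.out.println(found("\\bpark\\b", "the park."));  // true: "park" is a whole word
        System.out.println(found("\\bpark\\b", "parking"));    // false: no boundary after "park"
        System.out.println(found("\\s", "nospace"));           // false: no whitespace at all
        System.out.println(found("b.b", "bob"));               // true: '.' matches the 'o'
        System.out.println(found("\\w", "_"));                 // true: underscore is a word character
    }
}
```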

Quantifiers (each one applies to the character, class, or group that precedes it):

?     appearing once or not at all
*     appearing zero or more times
+     appearing one or more times
{min,max} appearing at least min and at most max times
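
A quick sketch of the quantifiers in Java follows. The patterns and test words are made up for this handout. One thing to keep in mind: Pattern.matches asks whether the entire string matches, which is different from finding a match somewhere inside it.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    public static void main(String[] args) {
        // ? : the 'u' may appear once or not at all
        System.out.println(Pattern.matches("colou?r", "color"));   // true
        System.out.println(Pattern.matches("colou?r", "colour"));  // true

        // * : zero or more 'o'
        System.out.println(Pattern.matches("go*gle", "ggle"));     // true
        // + : one or more 'o'
        System.out.println(Pattern.matches("go+gle", "ggle"));     // false
        System.out.println(Pattern.matches("go+gle", "google"));   // true

        // {min,max} : between 3 and 4 digits, and nothing else in the string
        System.out.println(Pattern.matches("\\d{3,4}", "1234"));   // true
        System.out.println(Pattern.matches("\\d{3,4}", "12"));     // false
        System.out.println(Pattern.matches("\\d{3,4}", "12345"));  // false
    }
}
```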

Using the above, we could come up with some quick examples:

^$ -> matches the beginning of a line followed immediately by the end of the line, i.e. matches any blank line!

ing\b -> matches ‘ing’ followed by a word boundary, i.e. any time ‘ing’ appears at the end of a word!
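
Both examples can be tried in Java with something like the sketch below. The class and method names are hypothetical and the sample text is invented. When the input contains several lines, the Pattern.MULTILINE flag is what makes ^ and $ match at each line rather than only at the start and end of the whole input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExamplesDemo {
    // Counts how many times the pattern is found in the text.
    static int count(Pattern p, String text) {
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // ^$ with MULTILINE: an empty line sitting between two line breaks.
    static int countBlankLines(String text) {
        return count(Pattern.compile("^$", Pattern.MULTILINE), text);
    }

    // ing\b: 'ing' at the end of a word (backslash doubled for Java).
    static int countIngEndings(String text) {
        return count(Pattern.compile("ing\\b"), text);
    }

    public static void main(String[] args) {
        System.out.println(countBlankLines("one\n\nthree"));                   // 1
        System.out.println(countIngEndings("singing in the ring of kings"));   // 2
    }
}
```

In the second call, “singing” and “ring” count, while “kings” does not, because the ‘s’ after ‘ing’ means there is no word boundary there.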

Character Classes allow one to do an “or” statement amongst individual characters and are denoted by characters enclosed in brackets, i.e. [aeiou] means match any vowel. A “^” as the first character inside the brackets negates the character class, i.e. [^aeiou] means match any character that is not a vowel (note this isn’t just limited to letters, it really means anything at all that is not an a, e, i, o, or u). A hyphen indicates a range of characters, such as [0-9] or [a-z].
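
One way to see character classes in action is with String.replaceAll, which takes a regular expression as its first argument. This is only a sketch; the class name, method names, and sample strings are invented for illustration.

```java
class CharClassDemo {
    // Removes every lowercase vowel.
    static String stripVowels(String s) {
        return s.replaceAll("[aeiou]", "");
    }

    // Removes everything that is NOT a lowercase vowel (spaces included).
    static String keepVowels(String s) {
        return s.replaceAll("[^aeiou]", "");
    }

    // Replaces each digit in the range 0-9 with '#'.
    static String maskDigits(String s) {
        return s.replaceAll("[0-9]", "#");
    }

    public static void main(String[] args) {
        System.out.println(stripVowels("regular expressions")); // rglr xprssns
        System.out.println(keepVowels("regular expressions"));  // euaeeio
        System.out.println(maskDigits("Room 101"));             // Room ###
    }
}
```

Notice that the negated class also swallowed the space, since a space is “not a vowel” too.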

Another key metacharacter is |, meaning or. This is known as the concept of Alternation.

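John|Jon -> matches “John” or “Jon” (no spaces around the |, since a space inside a regular expression is a literal character that must be matched too)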

note: this regex could also be written as Joh?n, meaning match “Jon” with an optional “h” between the “o” and “n.”
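
A short sketch comparing the two spellings in Java (the class name and the extra name “Joan” are made up for this handout). The last line shows why spaces around the | matter: they are literal characters.

```java
import java.util.regex.Pattern;

class AlternationDemo {
    public static void main(String[] args) {
        // Alternation: either whole alternative may match.
        System.out.println(Pattern.matches("John|Jon", "John")); // true
        System.out.println(Pattern.matches("John|Jon", "Jon"));  // true
        System.out.println(Pattern.matches("John|Jon", "Joan")); // false

        // Same result with an optional 'h'.
        System.out.println(Pattern.matches("Joh?n", "John"));    // true
        System.out.println(Pattern.matches("Joh?n", "Jon"));     // true

        // With spaces in the pattern, the alternatives are "John " and " Jon".
        System.out.println(Pattern.matches("John | Jon", "John")); // false
    }
}
```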

Parentheses can also be used to constrain the alternation, i.e.:

(212|646|917)\d* matches any sequence of zero or more digits preceded by 212, 646, or 917 (presumably to retrieve phone #’s with NYC area codes). Note this regular expression would need to be improved to take into consideration white spaces and/or punctuation.
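
Here is a rough sketch of that expression in Java. The class name, the helper findAll, and the phone numbers are all invented. The second call shows the weakness mentioned above: as soon as there is punctuation, \d* stops matching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AreaCodeDemo {
    // Hypothetical helper: collects every match of the regex in the text.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String regex = "(212|646|917)\\d*";
        System.out.println(findAll(regex, "Call 2125551234 or 9175550000, not 4155551234."));
        // prints [2125551234, 9175550000]
        System.out.println(findAll(regex, "212-555-1234"));
        // prints [212]
    }
}
```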

Parentheses also serve the purpose of capturing groups for back-references. For example, examine the following regular expression:

\b([0-9A-Za-z]+)\s+\1\b

The first part of the expression reads: \b([0-9A-Za-z]+) meaning, starting at a word boundary, match any “word” made up of one or more letters/digits (the parentheses capture whatever was matched). The next part, \s+, means any sequence of at least one whitespace character. The third part, \1, says match whatever you matched that was enclosed inside the first set of parentheses, i.e. ([0-9A-Za-z]+), and the final \b requires a word boundary right after it. So, thinking this over, what will this regular expression match in the following line:

This is really really super super duper duper fun.  Fun!
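
Once you have worked out your answer, one way to check it is to run the expression in Java. The sketch below simply prints every match it finds; the class name BackReferenceDemo and the method repeats are made up for this handout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackReferenceDemo {
    // Returns every piece of text matched by the back-reference expression.
    static List<String> repeats(String text) {
        // Same regex as above, with each backslash doubled for the Java string.
        Pattern p = Pattern.compile("\\b([0-9A-Za-z]+)\\s+\\1\\b");
        Matcher m = p.matcher(text);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "This is really really super super duper duper fun.  Fun!";
        for (String match : repeats(line)) {
            System.out.println(match);
        }
    }
}
```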