WARNING: This is a woefully incomplete overview of regular expressions. It would be absurd to try to fully cover the topic in a short handout like this. Hopefully, this will provide some of the basics to get you started, but to really understand regular expressions, I implore you to read as much of Mastering Regular Expressions by Jeffrey E.F. Friedl as you have time for.
A regular expression is a sequence of characters that describes or matches a given piece of text. For example, the sequence
A truly wonderful book on the subject is Mastering Regular Expressions by Jeffrey Friedl. Chapter 1, available via the Safari Network (through NYU), can be found here:
http://safari.oreilly.com/0596002890/mastregex2-CHP-1
Regular expressions (sometimes referred to as ‘regex’ for short) have both literal characters and meta characters. In
In this case, the ‘^’ is a metacharacter, i.e. it does not match the literal character ‘^’, but instead indicates the “beginning of a line.” In other words, the regex above would find a match in:
bob goes to the park.
but would not find a match in:
jill and bob go to the park.
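The behavior of ‘^’ can be checked with a few lines of Java. This is only a minimal sketch: the pattern ^bob and the names AnchorDemo and startsWithBob are assumptions chosen to be consistent with the two sample sentences above, not necessarily the pattern the handout originally showed. Note that without the MULTILINE flag, Java treats ‘^’ as the beginning of the whole input rather than of each line.

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // Hypothetical pattern: '^' anchors the literal text "bob" to the start of the input.
    static final Pattern STARTS_WITH_BOB = Pattern.compile("^bob");

    // find() searches anywhere in the text, so it is the '^' that does the anchoring.
    static boolean startsWithBob(String text) {
        return STARTS_WITH_BOB.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(startsWithBob("bob goes to the park."));        // true
        System.out.println(startsWithBob("jill and bob go to the park.")); // false
    }
}
```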
Here are a few common metacharacters to get us started (I’m listing them below as they would appear in a Java regular expression, which may differ slightly from Perl, PHP, .NET, etc.):
Position Metacharacters:
^ beginning of line
$ end of line
\\b word boundary
\\B a non-word boundary
Single Character Metacharacters:
. any one character (by default, any character except a line terminator)
\\d any digit from 0 to 9
\\w any word character (a-z, A-Z, 0-9, and the underscore)
\\W any non-word character
\\s any whitespace character (space, tab, new line, form feed, carriage return)
\\S any non-whitespace character
Quantifiers (each applies to the character or group that precedes it):
? appearing once or not at all
* appearing zero or more times
+ appearing one or more times
{min,max} appearing at least min and at most max times
Using the above, we could come up with some quick examples:
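A few illustrative patterns, sketched in Java. The patterns, the sample strings, and the names QuantifierDemo and found are all assumptions made up for this illustration; they are not the handout's original examples.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns true if the pattern matches somewhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // ? : the 'u' may appear once or not at all.
        System.out.println(found("colou?r", "color"));   // true
        System.out.println(found("colou?r", "colour"));  // true

        // ^, $, \\d and {min,max}: the whole input must be two to four digits.
        System.out.println(found("^\\d{2,4}$", "123"));   // true
        System.out.println(found("^\\d{2,4}$", "12345")); // false

        // \\b : "cat" must stand alone as a word.
        System.out.println(found("\\bcat\\b", "the cat sat")); // true
        System.out.println(found("\\bcat\\b", "concatenate")); // false

        // + and \\s : two words separated by one or more whitespace characters.
        System.out.println(found("\\w+\\s+\\w+", "hello   world")); // true
        System.out.println(found("\\w+\\s+\\w+", "hello"));         // false
    }
}
```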
Character classes express an “or” among individual characters and are written as a set of characters enclosed in square brackets, i.e.
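As a rough sketch of a character class in Java (the pattern gr[ae]y and the names CharClassDemo and isGray are hypothetical, picked only for illustration): the class [ae] matches exactly one character, which must be either ‘a’ or ‘e’.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Pattern.matches requires the ENTIRE input to match the pattern.
    static boolean isGray(String word) {
        return Pattern.matches("gr[ae]y", word);
    }

    public static void main(String[] args) {
        System.out.println(isGray("gray"));  // true
        System.out.println(isGray("grey"));  // true
        System.out.println(isGray("groy"));  // false: 'o' is not in the class
        System.out.println(isGray("graey")); // false: the class matches one character only
    }
}
```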
Another key metacharacter is |, meaning “or”. This concept is known as alternation.
note: this regex could also be written as
Parentheses can also be used to constrain the alternation, i.e.:
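To see why the parentheses matter, here is a small Java sketch. The patterns and the names AlternationDemo and found are assumptions for illustration only. Without parentheses, | splits the entire expression into two alternatives; with them, the choice is limited to what sits inside the group.

```java
import java.util.regex.Pattern;

class AlternationDemo {
    // Returns true if the pattern matches somewhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Constrained: the input must start with "bob" or "jill", followed by " goes".
        System.out.println(found("^(bob|jill) goes", "jill goes home")); // true
        System.out.println(found("^(bob|jill) goes", "bob stays home")); // false

        // Unconstrained: means "^bob" OR "jill goes", so "bob stays home" matches.
        System.out.println(found("^bob|jill goes", "bob stays home"));   // true

        // Alternation between single characters behaves like a character class.
        System.out.println(found("gr(a|e)y", "grey"));                   // true
    }
}
```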
Parentheses also serve the purpose of capturing groups for back-references. For example, examine the following regular expression:
The first part of the expression without parentheses would read:
This is really really super super duper duper fun. Fun!
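One way to try a back-reference against a sentence like this is sketched below in Java. The pattern \\b(\\w+)\\s+\\1\\b and the names RepeatedWordDemo and doubledWords are assumptions; the handout's original expression is not shown here, so treat this as one plausible doubled-word finder rather than the original. The parentheses capture a word as group 1, and \\1 then requires that exact same text to appear again.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedWordDemo {
    // (\\w+) captures a word as group 1; \\s+ skips whitespace;
    // \\1 must repeat the captured text exactly (case-sensitive).
    static final Pattern DOUBLED = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // Collects every word that is immediately followed by itself.
    static List<String> doubledWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = DOUBLED.matcher(text);
        while (m.find()) {
            words.add(m.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "This is really really super super duper duper fun. Fun!";
        System.out.println(doubledWords(text)); // prints [really, super, duper]
    }
}
```

In this sketch "fun. Fun!" is not reported: the two words are separated by punctuation rather than whitespace alone, and they differ in case.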