www.cryer.co.uk
Brian Cryer's Web Resources

regular expression

A regular expression is a way of expressing a text pattern for the purpose of matching a string or part of a string. Regular expressions are often used either to extract information from a string or to verify that a string is in the correct format. The term is often abbreviated to "regex".
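
As a rough illustration of those two uses, here is a minimal sketch in Python using its standard `re` module. The date format, the pattern and the function names are assumptions made up for this example; they do not come from this page.

```python
import re

# Hypothetical pattern: a date written as four digits, two digits, two digits.
# The brackets define groups so that each part can be retrieved separately.
DATE_PATTERN = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def is_date(text):
    """Verify: True only if the whole string has the expected format."""
    return DATE_PATTERN.fullmatch(text) is not None


def find_year(text):
    """Extract: return the year of the first date found in text, or None."""
    match = DATE_PATTERN.search(text)
    return match.group(1) if match else None


print(is_date("2024-03-15"))
print(is_date("15 March 2024"))
print(find_year("Released on 2024-03-15."))
```

Verification uses `fullmatch` so that the whole string must fit the pattern, whereas extraction uses `search` so that the pattern may appear anywhere within a longer string.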

The following table summarises the most common elements of regular expression syntax, together with examples of their use:

Regex Meaning
. A dot matches any single character - except a carriage return or line feed.
c.t Matches "cat", "cbt" ... "c1t" ... "c&t" etc. When searching within a longer string, anything may come before the "c" and anything after the "t".
[0-9] Matches anything inside the square brackets against a single character; so in this case any single character in the range 0 to 9.
[09] Matches anything inside the square brackets against a single character; in this case the single character 0 or 9 (but not 1 to 8). This can be useful if you don't know the case of something, because [aA] will match either a or A.
[0-9a-zA-Z] Matches any single character in the range 0 to 9 or a to z or A to Z.
? Matches the preceding element one time or zero times. This is equivalent to {0,1}. It can be thought of as making the previous element optional.
?? Like ?, it matches the previous element 0 or 1 times, but prefers to match it 0 times rather than 1 where the rest of the expression still matches.
* Matches the preceding element zero or more times.
*? Matches the preceding element zero or more times, but unlike * will match against the shortest possible match.
For example "b[an]*a" when applied to the word "banana" will match the entire word, but "b[an]*?a" will match just "ba".
a[0-9]*z Will match against "az", "a1z" ... "a999z" etc.
+ Matches the preceding element one or more times.
+? Matches the preceding element one or more times, but unlike + will match against the shortest possible substring.
a[0-9]+z Will match against "a1z" ... "a999z" etc (but not "az").
{number} Matches the preceding element the specified number of times.
a[0-1]{2}z Will match against "a00z", "a10z", "a01z" and "a11z".
{min,max} Matches the preceding element a minimum of "min" times and at most "max" times.
a[0-1]{1,2}z Will match against "a0z", "a1z", "a00z", "a10z", "a01z" and "a11z".
[^...] Inverts a match - matches anything except.
a[^0-1]z Will match against any three character string that starts with "a", ends with "z" and whose middle character is not "0" or "1" (so not "a0z" or "a1z").
^ Matches at the start of the line.
$ Matches at the end of the line.
\ Escape character. Allows the following character to be treated as a literal rather than having a special meaning.

Thus \+ would match against a literal + rather than + having its normal meaning. This also means that to match a literal backslash you need to use \\ (to escape the backslash).

\A Matches at the start of the string. When dealing with a single line expression \A is equivalent to ^.
Note: \A is not supported on all implementations of regex.
\Z Matches at the end of the string. When dealing with single line expressions \Z is equivalent to $.
Note: \Z is not supported on all implementations of regex.
\xNN Will match against the single character with the hex code 'NN'. So \x09 will match against a tab character and \x20 will match against a space.
See www.cryer.co.uk/brian/misc/ascii_table.htm for a list of hex codes for ASCII characters.
\t Match against the TAB character (same as \x09).
\n Match against a new-line (same as \x0a).
\r Match against a carriage return (same as \x0d).
\f Match against a form-feed (same as \x0c).
\a Match against a bell character (same as \x07).
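\b Match against a word boundary. A word boundary is a position rather than a character. It occurs at any of: (i.) the beginning of the string, if the first character is a word character, (ii.) the end of the string, if the last character is a word character or (iii.) between a word character ([a-zA-Z0-9_]) and a character which is not a word character.
\e Match against an escape character (same as \x1b).
Note: \e is not supported on all implementations of regex.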
\r\n Match against a carriage return line feed combination (in that order).
\w Match against any alphanumeric character (i.e. 0 to 9, a to z, A to Z and the underscore).
\W Match against any non-alphanumeric character (i.e. anything not matched by \w).
\d Match against any digit (0 to 9).
\D Match against any character that is not a digit.
\s Match against any white space character (same as [ \t\n\r\f]).
\S Match against any character that is not white space.
one|two Match everything to the left of the bar ('|') or everything to the right, so in this case "one" or "two".
( … ) Brackets define a group or sub-expression. When a regular expression is used to extract part of a match, the text matched by each group can be retrieved separately.
\b(one|two)\b Match against the whole word "one" or the word "two", specifically it matches against a word boundary, then "one" or "two", followed by another word boundary.
(?<!a)b Negative look behind - matches against "b" but only if the previous character was not "a". So "(?<!a)b" will match against "rubble" but not against "table".
(?<=a)b Positive look behind - matches against "b" but only if the previous character was "a". So "(?<=a)b" will match against "table" but not against "rubble".
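
A few of the entries in the table can be tried out with a short script. The sketch below assumes Python and its standard `re` module; other implementations may behave slightly differently. The variable names, and the sample strings not quoted from the table, are only illustrative.

```python
import re

# Greedy versus lazy: * takes as much as it can, *? as little as it can.
greedy = re.search(r"b[an]*a", "banana").group()
lazy = re.search(r"b[an]*?a", "banana").group()

# * allows zero digits between "a" and "z", + needs at least one.
star_matches_az = re.fullmatch(r"a[0-9]*z", "az") is not None
plus_matches_az = re.fullmatch(r"a[0-9]+z", "az") is not None

# {min,max}: one or two characters, each of which is 0 or 1.
one_or_two = [s for s in ("az", "a0z", "a11z", "a111z")
              if re.fullmatch(r"a[0-1]{1,2}z", s)]

# Word boundaries: the "one" inside "someone" is not a whole word.
whole_words = re.findall(r"\b(one|two)\b", "someone has one or two")

# Look behind, tested against the two words used in the table.
not_after_a = [re.search(r"(?<!a)b", w) is not None for w in ("rubble", "table")]
after_a = [re.search(r"(?<=a)b", w) is not None for w in ("rubble", "table")]

# Escaping: \+ matches a literal plus sign.
literal_plus = re.search(r"\+", "1+1") is not None

print(greedy, lazy)
print(star_matches_az, plus_matches_az)
print(one_or_two)
print(whole_words)
print(not_after_a, after_a)
print(literal_plus)
```

The patterns are written as raw strings (the `r` prefix) so that Python passes each backslash through to the regular expression unchanged.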

For more information see: