www.cryer.co.uk
Brian Cryer's Web Resources

regular expression

A regular expression is a way of expressing a text pattern for the purpose of matching a string or part of a string. Regular expressions are often used either to extract information from a string or to verify that a string is of the correct format. The term is often abbreviated to "regex".
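
The two uses described above, verifying a format and extracting information, can be sketched in Python using its standard re module. This is only an illustrative sketch: the date pattern and the function names is_date and extract_year are assumptions made for the example and are not part of the original text.

```python
import re

# Hypothetical pattern for illustration: a date written as YYYY-MM-DD.
# The brackets define groups so that each part can be retrieved later.
DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def is_date(text):
    """Verify: True only if the whole string has the expected format."""
    return DATE.fullmatch(text) is not None


def extract_year(text):
    """Extract: return the year of the first date found in text, or None."""
    match = DATE.search(text)
    return match.group(1) if match else None


print(is_date("2024-01-31"))
print(is_date("31/01/2024"))
print(extract_year("Released on 2024-01-31."))
```

Here fullmatch is used for verification because the entire string must fit the pattern, whereas search is used for extraction because the date may appear anywhere within a longer string.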

The following table provides a summary of the most common elements of regular expression syntax together with examples of their use:

Regex Meaning
. A dot matches any single character - except a carriage return or line feed.
c.t Matches "cat", "cbt" ... "c1t" ... "c&t" etc (with anything before the "c" and anything after the "t").
[0-9] Matches anything inside the square brackets against a single character; so in this case any single character in the range 0 to 9.
[09] Matches anything inside the square brackets against a single character; in this case the single character 0 or 9 (but not 1 to 8). This can be useful if you don't know the case of something, because [aA] will match either a or A.
[0-9a-zA-Z] Matches any single character in the range 0 to 9 or a to z or A to Z.
? Matches the preceding element one time or zero times. This is equivalent to {0,1}. It can be thought of as making the previous element optional.
?? Like ?, it matches the previous element 0 or 1 times, but weighted towards matching it 0 times rather than 1 if possible.
* Matches the preceding element zero or more times.
*? Matches the preceding element zero or more times, but unlike * will match against the shortest possible substring.

For example "b[an]*a" when applied to the word "banana" will match the entire word, but "b[an]*?a" will match with just "ba".

a[0-9]*z Will match against "az", "a1z" ... "a999z" etc.
+ Matches the preceding element one or more times.
+? Matches the preceding element one or more times, but unlike + will match against the shortest possible substring.

For example "b[an]+a" when applied to the word "banana" will match the entire word, but "b[an]+?a" will match with just "bana".

a[0-9]+z Will match against "a1z" ... "a999z" etc (but not "az").
{number} Matches the preceding element the specified number of times.
a[0-1]{2}z Will match against "a00z", "a10z", "a01z" and "a11z".
{min,max} Matches the preceding element a minimum of "min" times and at most "max" times.
a[0-1]{1,2}z Will match against "a0z", "a1z", "a00z", "a10z", "a01z" and "a11z".
[^...] Inverts a match - matches any single character except those listed inside the square brackets.
a[^0-1]z Will match against any three character string starting with "a" and ending with "z" that is not "a0z" or "a1z".
^ Matches at the start of the line.
$ Matches at the end of the line.
\ Escape character. Allows the following character to be treated as a literal rather than having a special meaning.

Thus \+ would match against + rather than + having its normal meaning. This also means that to include a literal \ (backslash) you would need to use \\ (to escape the backslash).

\A Matches at the start of the string. When dealing with a single line expression \A is equivalent to ^.
Note: \A is not supported on all implementations of regex.
\Z Matches at the end of the string. When dealing with single line expressions \Z is equivalent to $.
Note: \Z is not supported on all implementations of regex.
\xNN Will match against the single character with the hex code 'NN'. So \x09 will match against a tab character and \x20 will match against a space.
See www.cryer.co.uk/brian/misc/ascii_table.htm for a list of hex codes for ASCII characters.
\t Match against the TAB character (same as \x09).
\n Match against a new-line (same as \x0a).
\r Match against a carriage return (same as \x0d).
\f Match against a form-feed (same as \x0c).
\a Match against a bell character (same as \x07).
\b Match against a word boundary. A word boundary is the beginning or end of a sequence matched by \w (i.e. [a-zA-Z0-9_]). It is the boundary which is matched, not the character, so \bone\b will match against the word "one" but not "someone".
\B Match against anything which is not a word boundary.
\d Match against any numeric character, i.e. 0 to 9.
\D Match against any non-numeric character.
\e Match against an escape character (same as \x1b).
\r\n Match against a carriage return line feed combination (in that order).
\w Match against any alphanumeric character, i.e. 0 to 9, a to z, A to Z and the underscore. Notionally this is matching against any word character, hence the "w".
\W Match against any non-alphanumeric character.
\s Any space (same as [ \t\n\r\f]).
\S Match against any non-space.
one|two Match everything to the left of the bar ('|') or everything to the right, so in this case "one" or "two".
( … ) Brackets define a group or sub-expression. When a regular expression is used in a situation where the contents of a match need to be extracted then brackets define the group or sub-expression the matching value of which can be retrieved.
\b(one|two)\b Match against the whole word "one" or the word "two", specifically it matches against a word boundary, then "one" or "two", followed by another word boundary.
(?<!a)b Look behind - matches against "b" but only if the previous character was not "a". So "(?<!a)b" will match against "rubble" but not against "table".
(?<=a)b Look behind - matches against "b" but only if the previous character was "a". So "(?<=a)b" will match against "table" but not against "rubble".
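l(?!oo)..k Negative look ahead. Will match an "l" only if it is not immediately followed by "oo". So "l(?!oo)..k" will match against "leek" but not "look". The look ahead itself consumes no characters, which is why the two dots are needed to match the letters between the "l" and the "k"; "l(?!oo)k" on its own could only ever match "lk".
l(?=oo)..k Positive look ahead. Will match an "l" only if it is immediately followed by "oo". So "l(?=oo)..k" will match against "look" but not "leek". Again the look ahead consumes no characters, so "l(?=oo)k" on its own could never match anything.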
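
Several of the table entries can be tried out with a short Python script using the standard re module. This is a sketch rather than a definitive reference: the helper first_match and the sample sentence are assumptions made for the example, and the look ahead patterns are written in the form l(?!oo)..k so that the characters between the "l" and the "k" are actually matched. Other regex implementations may behave differently.

```python
import re


def first_match(pattern, text):
    """Return the text of the first match of pattern in text, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


# Greedy versus lazy quantifiers applied to the word "banana".
greedy_star = first_match(r"b[an]*a", "banana")   # longest possible match
lazy_star = first_match(r"b[an]*?a", "banana")    # shortest possible match
greedy_plus = first_match(r"b[an]+a", "banana")
lazy_plus = first_match(r"b[an]+?a", "banana")

# Word boundaries with a group: only whole words are found,
# so the "one" inside "someone" is skipped.
whole_words = re.findall(r"\b(one|two)\b", "someone took one or two")

# Look behind: the character before "b" is tested but is not part of the match.
not_after_a = first_match(r"(?<!a)b", "rubble")
after_a = first_match(r"(?<=a)b", "table")

# Look ahead: the characters after "l" are tested but not consumed,
# so ".." is still needed to match them.
negative_ahead = first_match(r"l(?!oo)..k", "leek")
positive_ahead = first_match(r"l(?=oo)..k", "look")

print(greedy_star, lazy_star, greedy_plus, lazy_plus)
print(whole_words)
print(not_after_a, after_a, negative_ahead, positive_ahead)
```

Note that re.findall returns the contents of the group when the pattern contains exactly one pair of brackets, which is why it is convenient for extracting the matched words here.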

Be aware that if a tool or utility claims to support regular expressions then it may support only a subset of the above, or occasionally may use its own unique syntax for custom extensions to the above. This is true of some Microsoft products. For example Microsoft Exchange only supports a small subset of regular expressions (see https://technet.microsoft.com/en-GB/library/aa997187%28v=exchg.141%29.aspx), and Microsoft Expression uses a different syntax for some elements (see https://msdn.microsoft.com/en-us/library/cc295435.aspx).
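
Even where the same syntax is supported, the details can vary between implementations. As one illustration, the following sketch shows how the anchors ^, $, \A and \Z behave in Python's standard re module. The sample text is an assumption made for the example, and the results apply to Python only; in some other implementations \Z will also match just before a final line break.

```python
import re

text = "one\ntwo\n"

# With the MULTILINE flag ^ matches at the start of every line,
# whereas \A still matches only at the start of the whole string.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
string_start = re.findall(r"\A\w+", text, re.MULTILINE)

# Without MULTILINE, Python's $ matches at the end of the string and also
# just before a trailing newline; \Z matches only at the very end.
dollar_matches = re.search(r"two$", text) is not None
slash_z_matches = re.search(r"two\Z", text) is not None

print(line_starts, string_start)
print(dollar_matches, slash_z_matches)
```

When a pattern is moved from one tool to another it is therefore worth checking the anchors against that tool's own documentation.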

For more information see: