Okapi Shared Help - Regular Expressions

Overview
This section provides an overview of the syntax to use when working with regular expressions.
The regular expression language is designed and optimized to manipulate text. The language comprises two basic character types: literal (normal) text characters and meta-characters which are instructions for the regular expression.
For example, the regular expression \scat matches all occurrences of the string "cat" that are preceded by any white-space character, such as a space or a tab. So in the string "A bearcat is bigger than a cat", the match is the final " cat" (including the space before it), but not the "cat" inside "bearcat".
Regular expressions can perform very complex searches, using classes of characters, groupings, back-referencing, zero-width assertions and many different types of conditions and options.
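The \scat example above can be tried out with a short sketch. It assumes Python's standard re module, which is used here only for illustration; the flavor described in this section may differ from Python's in some details. The variable names are hypothetical.

```python
import re

text = "A bearcat is bigger than a cat"

# \s requires a white-space character immediately before "cat",
# so the "cat" inside "bearcat" is not matched.
matches = re.findall(r"\scat", text)

# Without the \s, both occurrences of "cat" are found.
unanchored = re.findall(r"cat", text)

print(matches)
print(unanchored)
```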
Here are a few examples of regular expressions. Each example lists the expression, the options in effect, and sample text containing the matches. All the examples assume no options are set, unless stated otherwise.

Expression: tag1|tag2
Options: None.
Matches: Before <tag1> and <tag2> after (matches "tag1" and "tag2")

Expression: tag\b
Options: None.
Matches: Before tag tagtag after (matches the stand-alone "tag" and the second "tag" of "tagtag")

Expression: <.*>
Options: None.
Matches: Before <tag1> and <tag2> after (matches the single span "<tag1> and <tag2>")

Expression: <.*?>
Options: None.
Matches: Before <tag1> and <tag2> after (matches "<tag1>" and "<tag2>" separately)

Expression: colou?r
Options: None.
Matches: Color, colour, color (matches "colour" and "color", but not "Color")

Expression: (C|c)olou?r
Options: None.
Matches: Color, colour, color (matches all three words)

Expression: (?<grp1>\s\w+)\k<grp1>
Options: None.
Matches: Like the theory or the the theme (matches " the the" twice: once ending inside "theory", and once in "the the")

Expression: (?<grp1>\s\w+)\k<grp1>\b
Options: None.
Matches: Like the theory or the the theme (matches only the " the the" that ends on a word boundary, before " theme")

Expression: <img.*?((alt\s*=\s*(?<q>'|"))(?<text>.*?)\k<q>.*?)?>
Options: Single-line: on, Ignore case: on.
Matches: Click <img src='go.png' alt ='Start Now!'> to start.
Matches: Click <img alt='Start Now!'> to start.
Matches: Click <IMG ALT = "Start Now!" SRC="go.png"> to start.
Matches: Click <img src='go.png'> to start.

Expression: <\w+.*?((?<q>"|')(.*?)\k<q>.*?)?>|</\w+.*?>
Options: Single-line: on, Ignore case: on.
Matches: <P id="1">This is<br> a<a attr="<test>" href='#abc'>link</a>.</P>

Expression: %(([-0+ #]?)[-0+ #]?)((\d\$)?)(([\d\*]*)(\.[\d\*]*)?)[dioxXucsfeEgGpn]
Options: Ignore case: on.
Matches: %d files not found, including %s (%3.2d%% done)
Matches: %1$d files not found, including %2$s (%3$*.*d%% done)
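A few of the simpler examples above can be checked with a small script. This is only a sketch that assumes Python's standard re module; the helper name matched_text is hypothetical, and the examples that use (?<name>...) groups are left out because Python spells named groups differently.

```python
import re

# (expression, sample text) pairs taken from the examples above.
examples = [
    (r"tag1|tag2", "Before <tag1> and <tag2> after"),
    (r"tag\b", "Before tag tagtag after"),
    (r"colou?r", "Color, colour, color"),
    (r"(C|c)olou?r", "Color, colour, color"),
]


def matched_text(expression, text):
    """Return the full text of every match, in order of appearance."""
    return [m.group(0) for m in re.finditer(expression, text)]


results = {expr: matched_text(expr, text) for expr, text in examples}
for expr, found in results.items():
    print(f"{expr!r:20} -> {found}")
```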
Characters other than '.', '$', '^', '{', '[', '(', '|', ')', '*', '+', '?', and '\' match themselves. To match one of these characters as a literal, you must prefix it with a '\'. For example, to match the question mark ('?') character you must use "\?", and to match the backslash ('\') you must use "\\".
A character class is a set of characters that will find a match if any one of the characters included in the set matches. You can specify character classes using the sequences listed in the following table:
Sequence | Description |
---|---|
. | Matches any character except \n. If modified by the single-line option, a period character matches any character. |
[aeiou] | Matches any single character included in the specified set of characters (here, any character in "aeiou"). |
[^aeiou] | Matches any single character not in the specified set of characters (here, anything but a character in "aeiou"). |
[0-9a-fA-F] | A hyphen ('-') specifies a contiguous character range (here, any character in "0123456789abcdefABCDEF"). |
\d | Matches any decimal digit. Equivalent to \p{Nd}, or [0-9] for non-Unicode, ECMAScript behavior. |
\D | Matches any non-digit. Equivalent to \P{Nd}, or [^0-9] for non-Unicode, ECMAScript behavior. |
\p{name} | Matches any character in the named character class specified by {name}. The name must be one of the Unicode groups and block ranges. For example: Ll, Nd, Z, Lu, Lo, Lt, IsGreek, or IsBoxDrawing. |
\P{name} | Matches any character not included in the named character class specified by {name}. |
\s | Matches any white-space character. Equivalent to the Unicode character categories [\f\n\r\t\v\x85\p{Z}]. If the ECMAScript option is set, \s is equivalent to [ \f\n\r\t\v]. |
\S | Matches any non-white-space character. Equivalent to the Unicode character categories [^\f\n\r\t\v\x85\p{Z}]. If the ECMAScript option is set, \S is equivalent to [^ \f\n\r\t\v]. |
\w | Matches any word character. Equivalent to the Unicode character categories [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]. If the ECMAScript option is set, \w is equivalent to [a-zA-Z_0-9]. |
\W | Matches any non-word character. Equivalent to the Unicode character categories [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]. If the ECMAScript option is set, \W is equivalent to [^a-zA-Z_0-9]. |
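The escaping rule and the character classes above can be illustrated with a short sketch. It assumes Python's standard re module, which does not support the \p{name} and \P{name} sequences, so those are not shown. All variable names are hypothetical.

```python
import re

# Escaped meta-characters match themselves literally.
question_marks = re.findall(r"\?", "Why? How?")
backslashes = re.findall(r"\\", "C:\\temp\\log.txt")

# Character sets, negated sets, and ranges.
vowels = re.findall(r"[aeiou]", "regular")
non_vowels = re.findall(r"[^aeiou]", "regular")
hex_runs = re.findall(r"[0-9a-fA-F]+", "0x1F")

# Shorthand classes.
digits = re.findall(r"\d", "a1b22")
words = re.findall(r"\w+", "snake_case and 42!")
gaps = re.findall(r"\s+", "a b\t\nc")

print(vowels, non_vowels, hex_runs, digits, words)
```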
Unicode groups and block ranges (to use with \p{name}):
Character Class | Name | Description |
---|---|---|
Uppercase letter | Lu | Matches any one capital letter. |
Lowercase letter | Ll | Matches any one lower case letter. |
Title case letter | Lt | Matches characters that combine an uppercase letter with a lowercase letter, such as 'Nj' and 'Dz'. |
Modifier letter | Lm | Matches letters or punctuation, such as commas, cross accents, and double prime. These letters are used to indicate modifications to the preceding letter. |
Other letter | Lo | Matches other letters, such as gothic letter ahsa. |
Decimal digit | Nd | Matches decimal digits such as 0-9 (and their full-width equivalents). |
Letter digit | Nl | Matches letter digits such as ideographic number zero or roman numerals. |
Other digit | No | Matches other digits such as old italic number one. |
Open punctuation | Ps | Matches opening punctuation such as open brackets and braces. |
Close punctuation | Pe | Matches closing punctuation such as closing brackets and braces. |
Initial quote punctuation | Pi | Matches initial double quotation marks. |
Final quote punctuation | Pf | Matches single quotation marks and ending double quotation marks. |
Dash punctuation | Pd | Matches the dash mark. |
Connector punctuation | Pc | Matches the underscore or underline mark. |
Other punctuation | Po | Matches other punctuation characters such as commas, colons, semi-colons, or slash. |
Space separator | Zs | Matches blanks. |
Line separator | Zl | Matches the Unicode character U+2028. |
Paragraph separator | Zp | Matches the Unicode character U+2029. |
Non-spacing mark | Mn | Matches non-spacing marks. |
Combining mark | Mc | Matches combining marks. |
Enclosing mark | Me | Matches enclosing marks. |
Math symbol | Sm | Matches '+', '=', '~', '|', '<', and '>'. |
Currency symbol | Sc | Matches currency symbols such as '$'. |
Modifier symbol | Sk | Matches modifier symbols such as circumflex accent, grave accent, and macron. |
Other symbol | So | Matches other symbols, such as the copyright sign, pilcrow sign, or the degree sign. |
Other control | Cc | Matches end of line. |
Other format | Cf | Matches formatting control characters such as the bidirectional control characters. |
Surrogate | Cs | Matches one half of a surrogate pair. |
Other private-use | Co | Matches any character from the private-use area. |
Other not assigned | Cn | Matches characters that do not map to a Unicode character. |
Quantifiers add optional quantity data to a regular expression. A quantifier expression applies to the character, group, or character class that immediately precedes it. You can specify quantifiers using the sequences listed in the following table:
Sequence | Description |
---|---|
* | Specifies zero or more matches. Equivalent to {0,}. |
+ | Specifies one or more matches. Equivalent to {1,}. |
? | Specifies zero or one matches. Equivalent to {0,1}. |
*? | Specifies the first match that consumes as few repeats as possible (equivalent to lazy *). |
+? | Specifies as few repeats as possible, but at least one (equivalent to lazy +). |
?? | Specifies zero repeats if possible, or one (lazy ?). |
{n} | Specifies exactly n matches. |
{n,} | Specifies at least n matches. |
{n,m} | Specifies at least n, but no more than m, matches. |
{n}? | Equivalent to {n} (lazy {n}). |
{n,}? | Specifies as few repeats as possible, but at least n (lazy {n,}). |
{n,m}? | Specifies as few repeats as possible between n and m (lazy {n,m}). |
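The difference between the greedy quantifiers and their lazy counterparts is easiest to see side by side. The following is a minimal sketch that assumes Python's standard re module, which handles these quantifiers in the same way; the variable names are hypothetical.

```python
import re

text = "Before <tag1> and <tag2> after"

# Greedy: .* takes as much as possible, so one match spans both tags.
greedy = re.findall(r"<.*>", text)

# Lazy: .*? takes as little as possible, so each tag is matched separately.
lazy = re.findall(r"<.*?>", text)

# Bounded repetition, greedy and lazy.
numbers = "1 22 333 4444"
bounded = re.findall(r"\d{2,3}", numbers)
bounded_lazy = re.findall(r"\d{2,3}?", numbers)

print(greedy)
print(lazy)
print(bounded, bounded_lazy)
```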
You can apply options to a regular expression to modify the matching behavior. Options can be set outside the expression, using the user interface check boxes, and they can be set within the regular expression pattern itself, using the inline (?imnsx-imnsx:) grouping construct or the (?imnsx-imnsx) miscellaneous construct.
In inline option constructs, a minus sign '-' before an option or set of options turns off those options. For example, the inline construct (?ix-ms) turns on the Ignore Case and Ignore Pattern White Space options and turns off the Multiline and Single-line options.
The options available are the following:
Option | Inline Flag | Description |
---|---|---|
Ignore Case | i | Specifies case-insensitive matching. |
Multiline | m | Changes the meaning of ^ and $ so that they match at the beginning and end of any line, not just the beginning and end of the whole string. |
Explicit Capture | n | Specifies that the only valid captures are explicitly named or numbered groups of the form (?<name>...). This allows parentheses to act as non-capturing groups without the syntactic clumsiness of (?:...). |
Single-line | s | Changes the meaning of the period character '.' so that it matches every character instead of every character except \n. |
Ignore Pattern Whitespace | x | Specifies that un-escaped white space is excluded from the pattern and enables comments following a number sign '#'. Note that white space is never eliminated from within a character class. For a list of escaped white-space characters, see the Escapes section. |
ECMAScript | (N/A) | Enables ECMAScript-compliant behavior for the expression. This option can be used only in conjunction with the Ignore Case and Multiline options. Using other options will generate an error. |
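As a rough illustration of how external and inline options behave, the sketch below assumes Python's standard re module. Python offers the i, m, s, and x flags with the same meaning, but it has no equivalent of the Explicit Capture or ECMAScript options, and a global inline flag such as (?i) must appear at the start of the pattern. The variable names are hypothetical.

```python
import re

# Ignore Case, set outside the expression and inline.
by_flag = re.findall(r"tag", "Tag TAG tag", re.IGNORECASE)
by_inline = re.findall(r"(?i)tag", "Tag TAG tag")

# Option applied to one sub-expression only: just the "t" ignores case.
scoped = re.findall(r"(?i:t)ag", "Tag tag TAG")

# Single-line: '.' also matches \n.
without_s = re.search(r"<.*>", "<a\nb>")
with_s = re.search(r"(?s)<.*>", "<a\nb>")

# Ignore Pattern Whitespace: un-escaped white space and # comments
# are dropped from the pattern.
verbose = re.compile(
    r"""
    \d{4}   # year
    -       # separator
    \d{2}   # month
    """,
    re.VERBOSE,
)
year_month = verbose.search("on 2024-05-01").group()

print(by_flag, scoped, year_month)
```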
Zero-width assertions do not cause the matching engine to advance through the string or consume characters. They only cause a match to succeed or fail depending on the current position in the string. For example, ^ specifies that the current position is at the beginning of a line or string. Therefore, the regular expression ^FTP returns only those occurrences of the character string "FTP" that occur at the beginning of a line.
The assertions are expressed using the following meta-characters:
Assertion | Description |
---|---|
^ | Specifies that the match must occur at the beginning of the text, or of the line if the Multiline option is set. |
$ | Specifies that the match must occur at the end of the text, before \n at the end of the text, or at the end of the line if the Multiline option is set. |
\A | Specifies that the match must occur at the beginning of the text (ignores the Multiline option). |
\Z | Specifies that the match must occur at the end of the text or before \n at the end of the text (ignores the Multiline option). |
\z | Specifies that the match must occur at the end of the text (ignores the Multiline option). |
\G | Specifies that the match must occur at the point where the previous match ended. This allows you to ensure that matches are all contiguous in some cases. |
\b | Specifies that the match must occur on a boundary between \w (alphanumeric) and \W (non-alphanumeric) characters. The match must occur on word boundaries, that is, at the first or last characters in words separated by any non-alphanumeric characters. |
\B | Specifies that the match must not occur on a \b boundary. |
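The ^FTP example and the word-boundary assertions can be sketched as follows, assuming Python's standard re module. Python handles ^, $, \A, \b, and \B as described above; its end-of-text escapes differ from the ones listed here and \G is not available, so those are not shown. The variable names are hypothetical.

```python
import re

text = "FTP server\nuse FTP\nFTP again"

# Without Multiline, ^ matches only at the beginning of the text.
start_only = re.findall(r"^FTP", text)

# With Multiline, ^ matches at the beginning of every line.
each_line = re.findall(r"^FTP", text, re.MULTILINE)

# \A ignores the Multiline option.
text_start = re.findall(r"\AFTP", text, re.MULTILINE)

# $ also matches just before a \n at the end of the text.
before_newline = re.search(r"end$", "the end\n")

# \b requires a word boundary, \B forbids one.
whole_words = re.findall(r"\btag\b", "tag tagtag tag.")
inside_word = [m.start() for m in re.finditer(r"\Btag", "tag tagtag")]

print(len(start_only), len(each_line), whole_words, inside_word)
```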
Grouping constructs allow you to capture groups of sub-expressions and to increase the efficiency of regular expressions with non-capturing look-ahead and look-behind modifiers. The available grouping constructs are listed in the following table:
Construct | Description |
---|---|
( ) | Captures the matched substring (or non-capturing group; see the Explicit Capture option for more information). Captures using () are numbered automatically based on the order of the opening parenthesis, starting from one. The first capture, numbered zero, is the text matched by the whole pattern. |
(?<name> ) | Captures the matched substring into a group name or number name. The string used for name must not contain any punctuation and it cannot begin with a number. You can use single quotes instead of angle brackets; for example, (?'name'). |
(?<name1-name2> ) | Balancing group definition. Deletes the definition of the previously defined group 'name2' and stores in group 'name1' the interval between the previously defined 'name2' group and the current group. If no group 'name2' is defined, the match backtracks. Because deleting the last definition of 'name2' reveals the previous definition of 'name2', this construct allows the stack of captures for group 'name2' to be used as a counter for keeping track of nested constructs such as parentheses. The 'name1' name is optional. You can also use single quotes instead of angle brackets; for example, (?'name1-name2'). |
(?imnsx-imnsx: ) | Applies or disables the specified options within the sub-expression. For example, (?i-s: ) turns on the Ignore Case option and disables the Single-line option. See the Options section for a complete list of the possible options. |
(?: ) | Non-capturing group. |
(?= ) | Zero-width positive look-ahead assertion. Continues match only if the sub-expression matches at this position on the right. For example, \w+(?=\d) matches a word followed by a digit, without matching the digit. This construct does not backtrack. |
(?! ) | Zero-width negative look-ahead assertion. Continues match only if the sub-expression does not match at this position on the right. For example, \b(?!un)\w+\b matches words that do not begin with un. |
(?<= ) | Zero-width positive look-behind assertion. Continues match only if the sub-expression matches at this position on the left. For example, (?<=19)99 matches instances of 99 that follow 19. This construct does not backtrack. |
(?<! ) | Zero-width negative look-behind assertion. Continues match only if the sub-expression does not match at the position on the left. |
(?> ) | Non-backtracking sub-expression (also known as a "greedy" sub-expression). The sub-expression is fully matched once, and then does not participate piecemeal in backtracking. That is, the sub-expression matches only strings that would be matched by the sub-expression alone. |
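The look-ahead and look-behind examples from the table can be tried with the sketch below. It assumes Python's standard re module, where named groups are written (?P<name>...) rather than (?<name>...) and balancing groups are not available. The variable names are hypothetical.

```python
import re

# Numbered captures: group 0 is the whole match, groups count from 1.
m = re.search(r"(\d+)-(\d+)", "range 10-20")
numbered = (m.group(0), m.group(1), m.group(2))

# Named capture, in Python's spelling.
year = re.search(r"(?P<year>\d{4})", "since 1999").group("year")

# Non-capturing group: used for alternation only, captures nothing.
non_capturing = re.search(r"(?:C|c)olou?r", "Colour").groups()

# Positive look-ahead: a word followed by a digit, digit not included.
ahead = re.findall(r"\w+(?=\d)", "abc1 def")

# Negative look-ahead: words that do not begin with "un".
not_un = re.findall(r"\b(?!un)\w+\b", "unhappy happy under over")

# Positive look-behind: "99" only where it follows "19".
behind = [hit.start() for hit in re.finditer(r"(?<=19)99", "1999 and 2099")]

print(numbered, year, ahead, not_un, behind)
```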
Back-references provide a way to find repeating groups of characters. You can think of them as a shorthand instruction to match the same string again. You can use the following constructs:
Construct | Description |
---|---|
\number | Simple back-reference. For example, (\w)\1 finds doubled word-characters. |
\k<name> | Named back-reference. For example, (?<char>\w)\k<char> finds doubled word-characters. The expression (?<43>\w)\43 does the same. You can also use single quotes instead of angle brackets; for example, \k'char'. |
The expressions \1 through \9 always refer to back-references, not octal codes. Multi-digit expressions \11 and up are considered back-references if there is a back-reference corresponding to that number; otherwise, they are interpreted as octal codes (unless the starting digits are 8 or 9, in which case they are treated as literal "8" and "9"). If a regular expression contains a back-reference to an undefined group number, it is considered a parsing error.
If the ambiguity is a problem, you can always use the \k<n> notation, which cannot be confused with octal codes. Similarly, hexadecimal codes such as \xHH are unambiguous and cannot be confused with back-references.
A back-reference refers to the most recent definition of a group (the definition most immediately to the left, when matching left to right). That is, when a group makes multiple captures, a back-reference refers to the most recent capture. For example, (?<1>a)(?<1>\1b)* matches aababb, with the capturing pattern (a)(ab)(abb). Note that looping quantifiers do not clear group definitions.
If a group has not captured any substring, a back-reference to that group is undefined and never matches. For example, the expression \1() never matches anything, but the expression ()\1 matches the empty string.
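The doubled-character examples above can be sketched as follows, assuming Python's standard re module. Numbered back-references work as described, but Python writes a named back-reference as (?P=name) instead of \k<name>, and it rejects a reference to a group that has not been opened yet at compile time. The variable names are hypothetical.

```python
import re

# Simple back-reference: a word character followed by itself.
# findall returns the captured group, that is, the doubled character.
doubled = re.findall(r"(\w)\1", "bookkeeper")

# Named back-reference, in Python's spelling.
first_double = re.search(r"(?P<char>\w)(?P=char)", "bookkeeper").group()

# A repeated whole word: the \b after \1 rejects "the theory".
repeated = re.search(r"\b(\w+)\s+\1\b", "Like the theory or the the theme").group()

# An empty group followed by a reference to it matches the empty string.
empty = re.search(r"()\1", "abc").group()

print(doubled, first_double, repeated)
```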
Conditional constructs let you define expressions with either/or matching. The available constructs are listed in the following table:
Construct | Description |
---|---|
| | Or operator. Matches any one of the terms separated by the | (vertical bar) character; for example, cat|dog|duck. The leftmost successful match wins. |
(?(expression)yes|no) | Matches the 'yes' part if the expression matches at this point; otherwise, matches the 'no' part. The 'no' part is optional. The expression can be any valid sub-expression, but it is turned into a zero-width assertion, so this syntax is equivalent to (?(?=expression)yes|no). Note that if the expression is the name of a named group or a capturing group number, the alternation construct is interpreted as a capture test (see below). To avoid confusion in these cases, you can spell out the inside (?=expression) explicitly. |
(?(name)yes|no) | Matches the 'yes' part if the named capture string has a match; otherwise, matches the 'no' part. The 'no' part is optional. If the given name does not correspond to the name or number of a capturing group used in this expression, the alternation construct is interpreted as an expression test (see above). |
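The capture-test form of the conditional can be sketched with Python's standard re module, which supports (?(name)yes|no) and (?(number)yes|no) but not the expression-test form. The pattern below is a hypothetical example, not taken from the original text: it requires a closing '>' only when an opening '<' was captured.

```python
import re

# Or operator: any one of the alternatives.
animals = re.findall(r"cat|dog|duck", "a dog, a cat, a duck")

# Group 1 optionally captures "<". The conditional (?(1)>) demands ">"
# only if group 1 took part in the match; the 'no' part is omitted.
bracketed = re.compile(r"(<)?\w+(?(1)>)")

with_brackets = bracketed.fullmatch("<tag>")
without_brackets = bracketed.fullmatch("tag")
missing_close = bracketed.fullmatch("<tag")
stray_close = bracketed.fullmatch("tag>")

print(animals)
print(bool(with_brackets), bool(without_brackets),
      bool(missing_close), bool(stray_close))
```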
The following table lists other constructs you can use in a regular expression:
Construct | Description |
---|---|
(?imnsx-imnsx) | Applies or disables the specified options within the sub-expression. For example, (?i-s) turns on the Ignore Case option and disables the Single-line option. See the Options section for a complete list of the possible options. The option changes are effective until the end of the enclosing group. You can also use the grouping construct (?imnsx-imnsx: ), which is a cleaner form. |
(?# ) | Inline comment inserted within a regular expression. The comment terminates at the first closing parenthesis character. |
# end-of-line | X-mode comment. The comment begins at an un-escaped # and continues to the end of the line. The Ignore Pattern Whitespace option must be activated for this type of comment to be recognized. |
Substitutions are allowed only within replacement patterns. For similar functionality within regular expressions, use a back-reference (e.g. \1).
Character escapes and substitutions are the only special constructs recognized in a replacement pattern. All other constructs are not recognized in replacement patterns. For example, the replacement pattern a*${txt}b inserts the string "a*" followed by the substring matched by the "txt" capturing group (if any), followed by the string "b". The character '*' is not recognized as a meta-character within a replacement pattern. Similarly, $ patterns are not recognized within regular expression search patterns (within the search pattern $ designates the end of the string).
You can use the following named and numbered replacement patterns:
You can use the following named and numbered replacement patterns:
Substitution | Description |
---|---|
$number | Substitutes the last substring matched by group number 'number' (decimal value, first group is $1). |
${name} | Substitutes the last substring matched by a (?<name> ) group. |
$$ | Substitutes a single literal '$' character. |
$& | Substitutes a copy of the entire match itself. |
$` | Substitutes all the text of the input string before the match. |
$' | Substitutes all the text of the input string after the match. |
$+ | Substitutes the last group captured. |
$_ | Substitutes the entire input string. |
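The idea of a replacement pattern can be illustrated with the sketch below. It assumes Python's standard re module, whose replacement syntax differs from the $-based syntax documented above: Python writes \1 for $1, \g<name> for ${name}, and \g<0> for $&, and it has no equivalent of $`, $', $+, or $_. The variable names are hypothetical.

```python
import re

# Numbered groups in the replacement (the counterpart of $2 $1).
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

# Named groups in the replacement (the counterpart of ${val}=${key}).
flipped = re.sub(r"(?P<key>\w+)=(?P<val>\w+)", r"\g<val>=\g<key>", "a=1 b=2")

# The entire match (the counterpart of $&).
wrapped = re.sub(r"\d+", r"[\g<0>]", "a1b22")

# Counterpart of the a*${txt}b example: '*' is a plain character
# in a replacement pattern, not a quantifier.
literal_star = re.sub(r"(?P<txt>\d+)", r"a*\g<txt>b", "x42y")

print(swapped, flipped, wrapped, literal_star)
```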