Okapi Help - Regular Expressions

- Overview
- Examples
- Escapes
- Character Classes
- Quantifiers
- Options
- Zero-width Assertions
- Grouping Constructs
- Back-reference Constructs
- Conditional Constructs
- Miscellaneous Constructs
- Substitutions

Overview

This section provides an overview of the syntax to use when working with regular expressions.

The regular expression language is designed and optimized to manipulate text. The language comprises two basic character types: literal (normal) text characters and meta-characters which are instructions for the regular expression.

For example, the regular expression \scat matches all occurrences of the string "cat" that are preceded by any white-space character, such as a space or a tab. So in the string "A bearcat is bigger than a cat", the only match is " cat" (the space followed by the final word "cat"). The "cat" inside "bearcat" does not match, because it is preceded by the letter "r" rather than by white space.
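
The behaviour described above can be checked with a short script. This is only a sketch: it assumes Python's `re` module as the engine, which agrees with the syntax on this page for this simple pattern but is not the engine this help describes.

```python
import re

text = "A bearcat is bigger than a cat"

# \s requires a white-space character immediately before "cat",
# so the "cat" inside "bearcat" is skipped.
matches = [(m.start(), m.group()) for m in re.finditer(r"\scat", text)]
```

Here `matches` holds a single entry whose text is `" cat"`, including the leading space.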

Regular expressions can perform very complex searches, using classes of characters, groupings, back-referencing, zero-width assertions and many different types of conditions and options.

Examples

Here are a few examples of regular expressions. The text matched by the expression is highlighted in yellow. Named groups and their corresponding matches are sometimes highlighted in other colors. All the examples assume no options are set, except where stated otherwise.

Expression: tag1|tag2
   Options: None.
   Matches: Before <tag1> and <tag2> after

Expression: tag\b
   Options: None.
   Matches: Before tag tagtag after

Expression: <.*>
   Options: None.
   Matches: Before <tag1> and <tag2> after

Expression: <.*?>
   Options: None.
   Matches: Before <tag1> and <tag2> after

Expression: colou?r
   Options: None.
   Matches: Color, colour, color

Expression: (C|c)olou?r
   Options: None.
   Matches: Color, colour, color

Expression: (?<grp1>\s\w+)\k<grp1>
   Options: None.
   Matches: Like the theory or the the theme

Expression: (?<grp1>\s\w+)\k<grp1>\b
   Options: None.
   Matches: Like the theory or the the theme

Expression: <img.*?((alt\s*=\s*(?<q>'|"))(?<text>.*?)\k<q>.*?)?>
   Options: Single-line: on, Ignore case: on.
   Matches: Click <img src='go.png' alt ='Start Now!'> to start.
   Matches: Click <img alt='Start Now!'> to start.
   Matches: Click <IMG  ALT  =  "Start Now!" SRC="go.png"> to start.
   Matches: Click <img src='go.png'> to start.

Expression: <\w+.*?((?<q>"|')(.*?)\k<q>.*?)?>|</\w+.*?>
   Options: Single-line: on, Ignore case: on.
   Matches: <P id="1">This is<br> a<a attr="&lt;test>" href='#abc'>link</a>.</P>

Expression: %(([-0+ #]?)[-0+ #]?)((\d\$)?)(([\d\*]*)(\.[\d\*]*)?)[dioxXucsfeEgGpn]
   Options: Ignore case: on
   Matches: %d files not found, including %s (%3.2d%% done)
   Matches: %1$d files not found, including %2$s (%3$*.*d%% done)
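
Several of the simpler examples above can be reproduced in code. The following is a minimal sketch that assumes Python's `re` module, chosen only for illustration; the variable names are hypothetical.

```python
import re

text = "Before <tag1> and <tag2> after"

greedy = re.findall(r"<.*>", text)   # one match spanning both tags
lazy = re.findall(r"<.*?>", text)    # one match per tag

# \b after "tag" rejects a "tag" that is directly followed by a word character.
boundary_spans = [m.span() for m in re.finditer(r"tag\b", "Before tag tagtag after")]

# "u?" makes the "u" optional; the capital "C" needs the (C|c) alternative.
colors = "Color, colour, color"
plain = re.findall(r"colou?r", colors)
either_case = [m.group() for m in re.finditer(r"(C|c)olou?r", colors)]
```

The greedy pattern returns the single string `<tag1> and <tag2>`, while the lazy one returns `<tag1>` and `<tag2>` separately.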

Escapes

Characters other than '.', '$', '^', '{', '[', '(', '|', ')', '*', '+', '?', and '\' match themselves. To match one of the listed characters literally, prefix it with a '\'. For example, to match the question mark ('?') character you must use "\?", and to match the backslash ('\') you must use "\\".

Sequence	Description
`\a`	Matches a bell (U+0007).
`\b`	Matches a backspace (U+0008) if in a `[]` character class; otherwise `\b` denotes a word boundary (between `\w` and `\W` characters).
`\cC`	Matches an ASCII control character. For example, `\cC` matches Control+C (U+0003).
`\e`	Matches an escape (U+001B).
`\f`	Matches a form feed (U+000C).
`\n`	Matches a new line (U+000A).
`\r`	Matches a carriage return (U+000D).
`\t`	Matches a tab (U+0009).
`\uHHHH`	Matches the Unicode character U+HHHH. Use hexadecimal representation of exactly four digits.
`\v`	Matches a vertical tab (U+000B).
`\xHH`	Matches the ASCII character U+00HH. Use hexadecimal representation of exactly two digits.
`\000`	Matches the ASCII character 000 in octal (up to three digits); numbers with no leading zero are back-references if they have only one digit or if they correspond to a capturing group number. For example, the character `\040` represents a space.
`\X`	Matches X, where X is: '`.`', '`$`', '`^`', '`{`', '`[`', '`(`', '`|`', '`)`', '`*`', '`+`', '`?`', or '`\`'. For example, "`\$`" matches '`$`'.
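
A few of these escapes can be tried out as follows. This sketch assumes Python's `re` module, which accepts the same spelling for the escapes shown here; `re.escape` is a Python helper and not part of the syntax described on this page.

```python
import re

# An unescaped "?" is a quantifier; "\?" matches the literal character.
question = re.findall(r"\?", "Why? How?")

# "\\" in the pattern matches one literal backslash.
backslash = re.findall(r"\\", r"C:\temp\log.txt")

# Hexadecimal, Unicode, and octal escapes.
hex_a = re.fullmatch(r"\x41", "A") is not None
unicode_a = re.fullmatch(r"\u0041", "A") is not None
octal_space = re.fullmatch(r"\040", " ") is not None

# re.escape() adds the backslashes needed to match a string literally.
literal = "1+1=2?"
escaped_ok = re.fullmatch(re.escape(literal), literal) is not None
```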

Character Classes

A character class defines a set of characters and matches when the character at the current position is any one member of that set. You can specify character classes using the sequences listed in the following table:

Sequence	Description
`.`	Matches any character except `\n`. If modified by the single-line option, a period character matches any character.
`[aeiou]`	Matches any single character included in the specified set of characters (here, any character in "`aeiou`").
`[^aeiou]`	Matches any single character not in the specified set of characters (here, anything but any character in "`aeiou`").
`[0-9a-fA-F]`	A hyphen '`-`' specifies a contiguous character range (here, any character in "`0123456789abcdefABCDEF`").
`\d`	Matches any decimal digit. Equivalent to `\p{Nd}`, or `[0-9]` for non-Unicode, ECMAScript behavior.
`\D`	Matches any non-digit. Equivalent to `\P{Nd}`, or `[^0-9]` for non-Unicode, ECMAScript behavior.
`\p{name}`	Matches any character in the named character class specified by `{name}`. The name must be one of the Unicode groups and block ranges. For example: `Ll`, `Nd`, `Z`, `Lu`, `Lo`, `Lt`, `IsGreek`, or `IsBoxDrawing`.
`\P{name}`	Matches any character not included in the named character class specified by `{name}`.
`\s`	Matches any white-space character. Equivalent to the Unicode character categories `[\f\n\r\t\v\x85\p{Z}]`. If the ECMAScript option is set, `\s` is equivalent to `[ \f\n\r\t\v]`.
`\S`	Matches any non-white-space character. Equivalent to the Unicode character categories `[^\f\n\r\t\v\x85\p{Z}]`. If the ECMAScript option is set, `\S` is equivalent to `[^ \f\n\r\t\v]`.
`\w`	Matches any word character. Equivalent to the Unicode character categories `[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]`. With the ECMAScript option set, `\w` is equivalent to `[a-zA-Z_0-9]`.
`\W`	Matches any non-word character. Equivalent to the Unicode categories `[^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]`. If the ECMAScript option is set, `\W` is equivalent to `[^a-zA-Z_0-9]`.
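
The most common classes from the table can be sketched as below. The example assumes Python's `re` module; it leaves out `\p{name}`, which that module does not support.

```python
import re

text = "Order 66: color #FFcc00, 3 items"

digits = re.findall(r"\d+", text)                 # runs of decimal digits
hex_runs = re.findall(r"#([0-9a-fA-F]+)", text)   # ranges inside a set

vowels = re.findall(r"[aeiou]", "regular")        # any character in the set
non_vowels = re.findall(r"[^aeiou]", "regular")   # any character not in the set

# "." does not match the newline unless the single-line option is set.
dot = re.findall(r"a.c", "abc a\nc")

# \s matches spaces and tabs alike; \w includes the underscore but not the hyphen.
fields = re.split(r"\s+", "one  two\tthree")
words = re.findall(r"\w+", "snake_case, kebab-case")
```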

Unicode groups and block ranges (to use with \p{name}):

Character Class	Name	Description
Uppercase letter	`Lu`	Matches any one capital letter.
Lowercase letter	`Ll`	Matches any one lower case letter.
Title case letter	`Lt`	Matches characters that combine an uppercase letter with a lowercase letter, such as `'Nj'` and `'Dz'`.
Modifier letter	`Lm`	Matches letters or punctuation, such as commas, cross accents, and double prime. These letters are used to indicate modifications to the preceding letter.
Other letter	`Lo`	Matches other letters, such as gothic letter ahsa.
Decimal digit	`Nd`	Matches decimal digits such as 0-9 (and their full-width equivalents).
Letter digit	`Nl`	Matches letter digits such as ideographic number zero or roman numerals.
Other digit	`No`	Matches other digits such as old italic number one.
Open punctuation	`Ps`	Matches opening punctuation such as open brackets and braces.
Close punctuation	`Pe`	Matches closing punctuation such as closing brackets and braces.
Initial quote punctuation	`Pi`	Matches initial double quotation marks.
Final quote punctuation	`Pf`	Matches single quotation marks and ending double quotation marks.
Dash punctuation	`Pd`	Matches the dash mark.
Connector punctuation	`Pc`	Matches the underscore or underline mark.
Other punctuation	`Po`	Matches other punctuation characters such as commas, colons, semi-colons, or slash.
Space separator	`Zs`	Matches blanks.
Line separator	`Zl`	Matches the Unicode character U+2028.
Paragraph separator	`Zp`	Matches the Unicode character U+2029.
Non-spacing mark	`Mn`	Matches non-spacing marks.
Combining mark	`Mc`	Matches combining marks.
Enclosing mark	`Me`	Matches enclosing marks.
Math symbol	`Sm`	Matches '+', '=', '~', '|', '<', and '>'.
Currency symbol	`Sc`	Matches currency symbols such as '$'.
Modifier symbol	`Sk`	Matches modifier symbols such as circumflex accent, grave accent, and macron.
Other symbol	`So`	Matches other symbols, such as the copyright sign, pilcrow sign, or the degree sign.
Other control	`Cc`	Matches control characters, such as line feed (U+000A) and carriage return (U+000D).
Other format	`Cf`	Matches formatting control characters, such as the bidirectional control characters.
Surrogate	`Cs`	Matches one half of a surrogate pair.
Other private-use	`Co`	Matches any character from the private-use area.
Other not assigned	`Cn`	Matches characters that do not map to a Unicode character.

Quantifiers

Quantifiers add optional quantity data to a regular expression. A quantifier expression applies to the character, group, or character class that immediately precedes it. You can specify quantifiers using the sequences listed in the following table:

Sequence	Description
`*`	Specifies zero or more matches. Equivalent to `{0,}`.
`+`	Specifies one or more matches. Equivalent to `{1,}`.
`?`	Specifies zero or one matches. Equivalent to `{0,1}`.
`*?`	Specifies the first match that consumes as few repeats as possible (equivalent to lazy `*`).
`+?`	Specifies as few repeats as possible, but at least one (equivalent to lazy `+`).
`??`	Specifies zero repeats if possible, or one (lazy `?`).
`{n}`	Specifies exactly `n` matches.
`{n,}`	Specifies at least `n` matches.
`{n,m}`	Specifies at least `n`, but no more than `m`, matches.
`{n}?`	Equivalent to `{n}` (lazy `{n}`).
`{n,}?`	Specifies as few repeats as possible, but at least `n` (lazy `{n,}`).
`{n,m}?`	Specifies as few repeats as possible between `n` and `m` (lazy `{n,m}`).
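
The difference between greedy and lazy quantifiers is easiest to see side by side. The following sketch assumes Python's `re` module, which uses the same quantifier syntax as the table above.

```python
import re

digits = "123456"

greedy = re.findall(r"\d{2,4}", digits)    # takes as many digits as allowed
lazy = re.findall(r"\d{2,4}?", digits)     # stops at the minimum of two
exact = re.findall(r"\d{3}", digits)       # exactly three per match

# On its own, "?" means zero or one occurrence of the preceding item.
optional = re.findall(r"https?", "http https")

# "*" accepts zero repeats of "b"; "+" requires at least one.
star_ok = re.fullmatch(r"ab*", "a") is not None
plus_ok = re.fullmatch(r"ab+", "a") is not None
```

With the greedy form the input splits into `1234` and `56`; with the lazy form it splits into `12`, `34`, and `56`.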

Options

You can apply options to a regular expression to modify the matching behavior. Options can be set outside the expression, using the user interface check boxes, and they can be set within the regular expression pattern itself, using the inline (?imnsx-imnsx:) grouping construct or (?imnsx-imnsx) miscellaneous construct.

In inline option constructs, a minus sign '-' before an option or set of options turns off those options. For example, the inline construct (?ix-ms) turns on the Ignore Case and Ignore Pattern White Space options and turns off the Multiline and Single-line options.

The options available are the following:

Option	Inline Flag	Description
Ignore Case	`i`	Specifies case-insensitive matching.
Multiline	`m`	Changes the meaning of `^` and `$` so that they match at the beginning and end of any line, not just the beginning and end of the whole string.
Explicit Capture	`n`	Specifies that the only valid captures are explicitly named or numbered groups of the form `(?<name>...)`. This allows parentheses to act as non-capturing groups without the syntactic clumsiness of `(?:...)`.
Single-line	`s`	Changes the meaning of the period character '`.`' so that it matches every character instead of every character except `\n`.
Ignore Pattern Whitespace	`x`	Specifies that un-escaped white space is excluded from the pattern and enables comments following a number sign '`#`'. Note that white space is never eliminated from within a character class. For a list of escaped white-space characters, see the Escapes section.
ECMAScript	(N/A)	Enables ECMAScript-compliant behavior for the expression. This option can be used only in conjunction with the Ignore Case and Multiline options. Using other options will generate an error.
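
Inline options can be illustrated as follows. This is a sketch that assumes Python's `re` module, which supports the `i`, `m`, `s`, and `x` flags in inline form but has no `n` flag and no ECMAScript option; the patterns and names are hypothetical.

```python
import re

words = "Color, colour, COLOR"
case_sensitive = re.findall(r"colou?r", words)
ignore_case = re.findall(r"(?i)colou?r", words)   # inline flag i

# Scoped form: only the "t" is matched case-insensitively.
scoped = re.findall(r"(?i:t)ag", "Tag tag TAG")

# Single-line (s): "." also matches the newline.
dot_default = re.search(r"a.b", "a\nb") is not None
dot_single_line = re.search(r"(?s)a.b", "a\nb") is not None

# Ignore Pattern Whitespace (x): unescaped spaces are dropped
# and "#" starts a comment that runs to the end of the line.
version = re.compile(r"""(?x)
    (\d+)   # major
    \.      # literal dot
    (\d+)   # minor
""")
major_minor = version.fullmatch("3.11").groups()
```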

Zero-width Assertions

Zero-width assertions do not cause the matching engine to advance through the string or consume characters. They only cause a match to succeed or fail depending on the current position in the string. For example, ^ specifies that the current position is at the beginning of a line or string. Therefore, the regular expression ^FTP returns only those occurrences of the character string "FTP" that occur at the beginning of a line.

The assertions are expressed using the following meta-characters:

Assertion	Description
`^`	Specifies that the match must occur at the beginning of the text, or of the line if the Multiline option is set.
`$`	Specifies that the match must occur at the end of the text, before `\n` at the end of the text, or at the end of the line if the Multiline option is set.
`\A`	Specifies that the match must occur at the beginning of the text (ignores the Multiline option).
`\z`	Specifies that the match must occur at the end of the text (ignores the Multiline option).
`\Z`	Specifies that the match must occur at the end of the text or before `\n` at the end of the text (ignores the Multiline option).
`\G`	Specifies that the match must occur at the point where the previous match ended. This allows you to ensure that matches are all contiguous in some cases.
`\b`	Specifies that the match must occur on a boundary between `\w` (alphanumeric) and `\W` (non-alphanumeric) characters. The match must occur on word boundaries, that is, at the first or last characters in words separated by any non-alphanumeric characters.
`\B`	Specifies that the match must not occur on a `\b` boundary.
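
The anchors `^`, `$`, `\b`, and `\B` can be sketched as below, assuming Python's `re` module. The end-of-text anchors `\z` and `\Z` are left out on purpose, because Python spells them differently from the table above.

```python
import re

text = "FTP one\nuse FTP\nFTP two"

# Without the Multiline option "^" matches only at the start of the text.
start_of_text = [m.start() for m in re.finditer(r"^FTP", text)]
# With it, "^" also matches at the start of each line.
start_of_line = [m.start() for m in re.finditer(r"(?m)^FTP", text)]

# "$" also matches just before a newline that ends the text.
dollar_before_final_newline = re.search(r"end$", "the end\n") is not None

# \b keeps "cat" from matching inside "bearcat"; \B requires the opposite.
whole_word = [m.start() for m in re.finditer(r"\bcat\b", "bearcat cat")]
inside_word = [m.start() for m in re.finditer(r"\Bcat", "bearcat cat")]
```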

Grouping Constructs

Grouping constructs allow you to capture groups of sub-expressions and to increase the efficiency of regular expressions with non-capturing look-ahead and look-behind modifiers. The available grouping constructs are listed in the following table:

Construct	Description
`( )`	Captures the matched substring (or non-capturing group; see the Explicit Capture option for more information). Captures using `()` are numbered automatically based on the order of the opening parenthesis, starting from one. The first capture, numbered zero, is the text matched by the whole pattern.
`(?<name> )`	Captures the matched substring into a group name or number name. The string used for name must not contain any punctuation and it cannot begin with a number. You can use single quotes instead of angle brackets; for example, `(?'name')`.
`(?<name1-name2> )`	Balancing group definition. Deletes the definition of the previously defined group 'name2' and stores in group 'name1' the interval between the previously defined 'name2' group and the current group. If no group 'name2' is defined, the match backtracks. Because deleting the last definition of 'name2' reveals the previous definition of 'name2', this construct allows the stack of captures for group 'name2' to be used as a counter for keeping track of nested constructs such as parentheses. The 'name1' name is optional. You can also use single quotes instead of angle brackets; for example, `(?'name1-name2')`.
`(?imnsx-imnsx: )`	Applies or disables the specified options within the sub-expression. For example, `(?i-s: )` turns on the Ignore Case option and disables the Single-line option. See the Options section for a complete list of the possible options.
`(?: )`	Non-capturing group.
`(?= )`	Zero-width positive look-ahead assertion. Continues match only if the sub-expression matches at this position on the right. For example, `\w+(?=\d)` matches a word followed by a digit, without matching the digit. This construct does not backtrack.
`(?! )`	Zero-width negative look-ahead assertion. Continues match only if the sub-expression does not match at this position on the right. For example, `\b(?!un)\w+\b` matches words that do not begin with `un`.
`(?<= )`	Zero-width positive look-behind assertion. Continues match only if the sub-expression matches at this position on the left. For example, `(?<=19)99` matches instances of `99` that follow `19`. This construct does not backtrack.
`(?<! )`	Zero-width negative look-behind assertion. Continues match only if the sub-expression does not match at the position on the left.
`(?> )`	Non-backtracking sub-expression (also known as a "greedy" sub-expression). The sub-expression is fully matched once, and then does not participate piecemeal in backtracking. That is, the sub-expression matches only strings that would be matched by the sub-expression alone.
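
The capturing and look-around constructs can be sketched as follows. The example assumes Python's `re` module, which writes named groups as `(?P<name>...)` rather than `(?<name>...)` and has no balancing groups; the input strings are made up for illustration.

```python
import re

# Numbered groups: group 0 is the whole match, group 1 the first "(".
pair = re.search(r"(\w+)=(\w+)", "set mode=fast now")
whole, key, value = pair.group(0), pair.group(1), pair.group(2)

# Named groups (Python spelling of (?<name> )).
named = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "due 2024-05").groupdict()

# (?: ) groups without capturing, so only one group is reported.
non_capturing = re.search(r"(?:ab)+(c)", "ababc").groups()

# Look-ahead and look-behind examples taken from the table above.
before_digit = re.findall(r"\w+(?=\d)", "abc1 def")
not_un = re.findall(r"\b(?!un)\w+\b", "undo redo unhappy happy")
after_19 = [m.start() for m in re.finditer(r"(?<=19)99", "1999 2099 99")]
```

Note that `before_digit` contains `abc` without the trailing `1`: the look-ahead checks for the digit but does not consume it.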

Back-reference Constructs

Back-references provide a way to find repeating groups of characters. You can think of them as a shorthand instruction to match the same string again. You can use the following constructs:

Construct	Description
`\number`	Simple back-reference. For example, `(\w)\1` finds doubled word-characters.
`\k<name>`	Named back-reference. For example, `(?<char>\w)\k<char>` finds doubled word-characters. The expression `(?<43>\w)\43` does the same. You can also use single quotes instead of angle brackets; for example, `\k'char'`.

Back-references and octal notation

The expressions \1 through \9 always refer to back-references, not octal codes. Multi-digit expressions \11 and up are considered back-references if there is a back-reference corresponding to that number; otherwise, they are interpreted as octal codes (unless the starting digits are 8 or 9, in which case they are treated as literal "8" and "9"). If a regular expression contains a back-reference to an undefined group number, it is considered a parsing error.

If the ambiguity is a problem, you can always use the \k<n> notation, which cannot be confused with octal codes. Similarly, hexadecimal codes such as \xHH are unambiguous and cannot be confused with back-references.

Back-reference matching

A back-reference refers to the most recent definition of a group (the definition most immediately to the left, when matching left to right). That is, when a group makes multiple captures, a back-reference refers to the most recent capture. For example, (?<1>a)(?<1>\1b)* matches aababb, with the capturing pattern (a)(ab)(abb). Note that looping quantifiers do not clear group definitions.

If a group has not captured any substring, a back-reference to that group is undefined and never matches. For example, the expression \1() never matches anything, but the expression ()\1 matches the empty string.
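
Back-references can be tried out with the sketch below. It assumes Python's `re` module, where the named back-reference `\k<name>` is written `(?P=name)` and the group itself `(?P<name>...)`. The two named patterns are the Python spellings of the `grp1` expressions from the Examples section.

```python
import re

# Simple back-reference: findall returns the content of group 1.
doubled = re.findall(r"(\w)\1", "bookkeeper")

text = "Like the theory or the the theme"

# A white-space character plus word characters, immediately repeated.
repeated = [m.start() for m in re.finditer(r"(?P<grp1>\s\w+)(?P=grp1)", text)]

# Adding \b requires the repeat to end at a word boundary,
# which rules out " the the" inside " the theory".
repeated_word = [m.start() for m in re.finditer(r"(?P<grp1>\s\w+)(?P=grp1)\b", text)]

# An empty group followed by a reference to it matches the empty string.
empty_ok = re.fullmatch(r"()\1", "") is not None
```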

Conditional Constructs

Conditional constructs let you define expressions with either/or matching. The available constructs are listed in the following table:

Construct	Description
`|`	Or operator. Matches any one of the terms separated by the `|` (vertical bar) character; for example, `cat|dog|duck`. The leftmost successful match wins.
`(?(expression)yes|no)`	Matches the 'yes' part if the expression matches at this point; otherwise, matches the 'no' part. The 'no' part is optional. The expression can be any valid sub-expression, but it is turned into a zero-width assertion, so this syntax is equivalent to `(?(?=expression)yes|no)`. Note that if the expression is the name of a named group or a capturing group number, the alternation construct is interpreted as a capture test (see below). To avoid confusion in these cases, you can spell out the inside `(?=expression)` explicitly.
`(?(name)yes|no)`	Matches the 'yes' part if the named capture string has a match; otherwise, matches the 'no' part. The 'no' part is optional. If the given name does not correspond to the name or number of a capturing group used in this expression, the alternation construct is interpreted as an expression test (see above).
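
Alternation and the capture test can be sketched as below. The example assumes Python's `re` module, which supports the `(?(name)yes|no)` capture test but not the `(?(expression)yes|no)` form; the pattern for optional parentheses is a hypothetical illustration.

```python
import re

# Each match is whichever alternative succeeds at that position.
animals = re.findall(r"cat|dog|duck", "duck, dog, cat")

# Alternatives are tried left to right, so "do" wins over "dog" here.
leftmost = re.match(r"do|dog", "dog").group()

# Capture test: require ")" only if group 1 captured an opening "(".
balanced = re.compile(r"(\()?\d+(?(1)\))")
results = {s: balanced.fullmatch(s) is not None
           for s in ["123", "(123)", "(123", "123)"]}
```

Only `123` and `(123)` are accepted; the two inputs with a single unmatched parenthesis are rejected.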

Miscellaneous Constructs

The following table lists other constructs you can use in a regular expression:

Construct	Description
`(?imnsx-imnsx)`	Applies or disables the specified options within the sub-expression. For example, `(?i-s)` turns on the Ignore Case option and disables the Single-line option. See the Options section for a complete list of the possible options. The option changes are effective until the end of the enclosing group. You can also use the grouping construct `(?imnsx-imnsx: )`, which is a cleaner form.
`(?# )`	Inline comment inserted within a regular expression. The comment terminates at the first closing parenthesis character.
`# end-of-line`	X-mode comment. The comment begins at an un-escaped `#` and continues to the end of the line. The Ignore Pattern Whitespace option must be activated for this type of comment to be recognized.

Substitutions

Substitutions are allowed only within replacement patterns. For similar functionality within regular expressions, use a back-reference (e.g. \1).

Character escapes and substitutions are the only special constructs recognized in a replacement pattern. All other constructs are not recognized in replacement patterns. For example, the replacement pattern a*${txt}b inserts the string "a*" followed by the substring matched by the "txt" capturing group (if any), followed by the string "b". The character '*' is not recognized as a meta-character within a replacement pattern. Similarly, $ patterns are not recognized within regular expression search patterns (within the search pattern $ designates the end of the string).

You can use the following named and numbered replacement patterns:

Substitution	Description
`$number`	Substitutes the last substring matched by group number 'number' (decimal value, first group is `$1`).
`${name}`	Substitutes the last substring matched by a `(?<name> )` group.
`$$`	Substitutes a single literal '`$`' character.
`$&`	Substitutes a copy of the entire match itself.
`` $` ``	Substitutes all the text of the input string before the match.
`$'`	Substitutes all the text of the input string after the match.
`$+`	Substitutes the last group captured.
`$_`	Substitutes the entire input string.
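
The idea behind numbered and named substitutions can be sketched as follows. Note the assumption: this uses Python's `re` module, whose replacement syntax differs from the table above. Python writes `\1` for `$1`, `\g<name>` for `${name}`, and `\g<0>` for `$&`, and it has no direct equivalents of the remaining substitutions.

```python
import re

pattern = re.compile(r"(?P<last>\w+), (?P<first>\w+)")
text = "Doe, Jane"

by_number = pattern.sub(r"\2 \1", text)              # like $2 $1
by_name = pattern.sub(r"\g<first> \g<last>", text)   # like ${first} ${last}
whole_match = pattern.sub(r"[\g<0>]", text)          # like [$&]

# "*" has no special meaning in a replacement pattern; it is inserted literally.
literal_star = re.sub(r"(?P<txt>\d+)", r"a*\g<txt>b", "x42y")
```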
