模式语法

模式语法 -- 解说 Perl 兼容正则表达式的语法

说明

PCRE 库是一组用和 Perl 5 相同的语法和语义实现了正则表达式模式匹配的函数,不过有少许区别(见下面)。当前 PCRE 的实现是与 Perl 5.005 相符的。

与 Perl 的区别

这里谈到的区别是就 Perl 5.005 来说的。

  1. 默认情况下,空白字符是 C 语言库函数 isspace() 所能识别的任何字符,尽管有可能与别的字符类型表编译在一起。通常 isspace() 匹配空格,换页符,换行符,回车符,水平制表符和垂直制表符。Perl 5 不再将垂直制表符包括在空白字符中了。事实上长久以来存在于 Perl 文档中的转义序列 \v 从未被识别过,不过该字符至少到 5.002 为止都被当成空白字符的。在 5.004 和 5.005 中 \s 不匹配此字符。

  2. PCRE 不允许在向前断言中使用重复的数量符。Perl 允许这样,但可能不是你想象中的含义。例如,(?!a){3} 并不是断言下面三个字符不是“a”,而是断言下一个字符不是“a”三次。

  3. 捕获出现在排除模式断言中的子模式虽然被计数,但并未在偏移向量中设定其条目。Perl 在匹配失败前从此种模式中设定其数字变量,但只在排触摸式断言只包含一个分支时。

  4. 尽管目标字符串中支持二进制的零字符,但不能出现在模式字符串中,因为它被当作普通的 C 字符串传递,以二进制零终止。转义序列“\\x00”可以在模式中用来表示二进制零。

  5. 不支持下列 Perl 转义序列:\l,\u,\L,\U。事实上这些是由 Perl 的字符串处理来实现的,并不是模式匹配引擎的一部分。

  6. 不支持 Perl 的 \G 断言,因为这和单个的模式匹配无关。

  7. 很明显,PCRE 不支持 (?{code}) 结构。

  8. 当部分模式重复的时候,有关 Perl 5.005_02 捕获字符串的设定有些古怪的地方。举例说,用模式 /^(a(b)?)+$/ 去匹配 "aba" 会将 $2 设为 "b",但是用模式 /^(aa(bb)?)+$/ 去匹配 "aabbaa" 会使 $2 无值。然而,如果把模式改成 /^(aa(b(b))?)+$/,则 $2(和 $3)就有值了。在 Perl 5.004 中以上两种情况下 $2 都会被赋值,在 PCRE 中也是 TRUE。如果以后 Perl 改了,PCRE 可能也会跟着改。

  9. 另一个未解决的矛盾是 Perl 5.005_02 中模式 /^(a)?(?(1)a|b)+$/ 能匹配上字符串 "a",但是 PCRE 不会。然而,在 Perl 和 PCRE 中用 /^(a)?a/ 去匹配 "a" 会使 $1 没有值。

  10. PCRE 提供了一些对 Perl 正则表达式机制的扩展:

    1. 尽管向后断言必须匹配固定长度字符串,但每个向后断言的分支可以匹配不同长度的字符串。Perl 5.005 要求所有分支的长度相同。

    2. 如果设定了 PCRE_DOLLAR_ENDONLY 而没有设定 PCRE_MULTILINE,则 $ 元字符只匹配字符串的最末尾。

    3. 如果设定了 PCRE_EXTRA,反斜线后面跟一个没有特殊含义的字母会出错。

    4. 如果设定了 PCRE_UNGREEDY,则重复的数量符的 greed 被反转,即,默认时不是 greedy,但如果后面跟上一个问号就变成 greedy 了。

正则表达式详解

介绍

下面说明 PCRE 所支持的正则表达式的语法和语义。Perl 文档和很多其它书中也解说了正则表达式,有的书中有很多例子。Jeffrey Friedl 写的“Mastering Regular Expressions”,由 O'Reilly 出版社发行(ISBN 1-56592-257-3),包含了大量细节。这里的说明只是个参考文档。

正则表达式是从左向右去匹配目标字符串的一组模式。大多数字符在模式中表示它们自身并匹配目标中相应的字符。作为一个小例子,模式 The quick brown fox 匹配了目标字符串中与其完全相同的一部分。

元字符

正则表达式的威力在于其能够在模式中包含选择和循环。它们通过使用元字符来编码在模式中,元字符不代表其自身,它们用一些特殊的方式来解析。

有两组不同的元字符:一种是模式中除了方括号内都能被识别的,还有一种是在方括号内被识别的。方括号之外的元字符有这些:

\

有数种用途的通用转义符

^

断言目标的开头(或在多行模式下行的开头,即紧随一换行符之后)

$

断言目标的结尾(或在多行模式下行的结尾,即紧随一换行符之前)

.

匹配除了换行符外的任意一个字符(默认情况下)

[

字符类定义开始

]

字符类定义结束

|

开始一个多选一的分支

(

子模式开始

)

子模式结束

?

扩展 ( 的含义,也是 0 或 1 数量限定符,以及数量限定符最小值

*

匹配 0 个或多个的数量限定符

+

匹配 1 个或多个的数量限定符

{

最少/最多数量限定开始

}

最少/最多数量限定结束

模式中方括号内的部分称为“字符类”。字符类中可用的元字符为:

\

通用转义字符

^

排除字符类,但仅当其为第一个字符时有效

-

指出字符范围

]

结束字符类

以下说明了每一个元字符的用法。

反斜线(\)

反斜线字符有几种用途。首先,如果其后跟着一个非字母数字字符,则取消该字符可能具有的任何特殊含义。此种将反斜线用作转义字符的用法适用于无论是字符类之中还是之外。

例如,如果想匹配一个“*”字符,则在模式中用“\*”。这适用于无论下一个字符是否会被当作元字符来解释,因此在非字母数字字符之前加上一个“\”来指明该字符就代表其本身总是安全的。尤其是,如果要匹配一个反斜线,用“\\”。

如果模式编译时加上了 PCRE_EXTENDED 选项,模式中的空白字符(字符类中以外的)以及字符类之外的“#”到换行符之间的字符都被忽略。可以用转义的反斜线将空白字符或者“#”字符包括到模式中去。

反斜线的第二种用途提供了一种在模式中以可见方式去编码不可打印字符的方法。并没有不可打印字符出现的限制,除了代表模式结束的二进制零以外。但用文本编辑器来准备模式的时候,通常用以下的转义序列来表示那些二进制字符更容易一些:

\a

alarm,即 BEL 字符(0x07)

\cx

"control-x",其中 x 是任意字符

\e

escape(0x1B)

\f

换页符 formfeed(0x0C)

\n

换行符 newline(0x0A)

\r

回车符 carriage return(0x0D)

\t

制表符 tab(0x09)

\xhh

十六进制代码为 hh 的字符

\ddd

八进制代码为 ddd 的字符,或 backreference

\cx”的精确效果如下:如果“x”是小写字母,则被转换为大写字母。接着字符中的第 6 位(0x40)被反转。从而“\cz”成为 0x1A,但“\c{”成为 0x3B,而“\c;”成为 0x7B。

在“\x”之后最多再读取两个十六进制数字(其中的字母可以是大写或小写)。在 UTF-8 模式下,允许用“\x{...}”,花括号中的内容是表示十六进制数字的字符串。原来的十六进制转义序列 \xhh 如果其值大于 127 的话则匹配了一个双字节 UTF-8 字符。

在“\0”之后最多再读取两个八进制数字。以上两种情况下,如果少于两个数字,则只使用已出现的。因此序列“\0\x\07”代表两个二进制的零加一个 BEL 字符。如果是八进制数字则确保在开始的零后面再提供两个数字。

处理反斜线后面跟着一个不是 0 的数字比较复杂。在字符类之外,PCRE 以十进制数字读取该数字及其后面的数字。如果数字小于 10,或者之前表达式中捕获到至少该数字的左圆括号,则这个序列将被作为逆向引用。有关此如何运作的说明在后面,以及括号内的子模式。

在字符类之中,或者如果十进制数字大于 9 并且之前没有那么多捕获的子模式,PCRE 重新从反斜线开始读取其后的最多三个八进制数字,并以最低位的 8 个比特产生出一个单一字节。任何其后的数字都代表自身。例如:

\040

另一种表示空格的方法

\40

同上,如果之前捕获的子模式少于 40 个的话

\7

总是一个逆向引用

\11

可能是个逆向引用,或者是制表符 tab

\011

总是表示制表符 tab

\0113

表示制表符 tab 后面跟着一个字符“3”

\113

表示八进制代码为 113 的字符(因为不能超过 99 个逆向引用)

\377

表示一个所有的比特都是 1 的字节

\81

要么是一个逆向引用,要么是一个二进制的零后面跟着两个字符“8”和“1”

注意八进制值 100 或更大的值之前不能以零打头,因为不会读取(反斜线后)超过三个八进制数字。

所有的定义了一个单一字节的序列可以用于字符类之中或之外。此外,在字符类之中,序列“\b”被解释为反斜线字符(0x08),而在字符类之外有不同含义(见下面)。

反斜线的第三个用法是指定通用字符类型:

\d

任一十进制数字

\D

任一非十进制数的字符

\s

任一空白字符

\S

任一非空白字符

\w

任一“字”的字符

\W

任一“非字”的字符

任何一个转义序列将完整的字符组合分割成两个分离的部分。任一给定的字符匹配一个且仅一个转义序列。

“字”的字符是指任何一个字母或数字或下划线,也就是说,任何可以是 Perl "word" 的字符。字母和数字的定义由 PCRE 字符表控制,可能会根据指定区域的匹配而改变(见上面的“区域支持”)。举例说,在 "fr" (French) 区域,某些编码大于 128 的字符用来表示重音字母,这些字符能够被 \w 所匹配。

这些字符类型序列可以出现在字符类之中和之外。每一个匹配相应类型中的一个字符。如果当前匹配点在目标字符串的结尾,以上所有匹配都失败,因为没有字符可供匹配。

反斜线的第四个用法是某些简单的断言。断言是指在一个匹配中的特定位置必须达到的条件,并不会消耗目标字符串中的任何字符。子模式中更复杂的断言的用法在下面描述。反斜线的断言有:

\b

字分界线

\B

非字分界线

\A

目标的开头(独立于多行模式)

\Z

目标的结尾或位于结尾的换行符前(独立于多行模式)

\z

目标的结尾(独立于多行模式)

\G

目标中的第一个匹配位置

这些断言可能不能出现在字符类中(但是注意 "\b" 有不同的含义,在字符类之中也就是反斜线字符)。

字边界是目标字符串中的一个位置,其当前字符和前一个字符不能同时匹配 \w 或者 \W(也就是其中一个匹配 \w 而另一个匹配 \W),或者是字符串的开头或结尾,假如第一个或最后一个字符匹配 \w 的话。

\A\Z\z 断言与传统的音调符和美元符(下面说明)的不同之处在于它们仅匹配目标字符串的绝对开头和结尾而不管设定了任何选项。它们不受 PCRE_NOTBOLPCRE_NOTEOL 选项的影响。\Z\z 的不同之处在于 \Z 匹配了作为字符串最后一个字符的换行符之前以及字符串的结尾,而 \z 仅匹配字符串的结尾。

The \G assertion is true only when the current matching position is at the start point of the match, as specified by the offset argument of preg_match(). It differs from \A when the value of offset is non-zero. It is available since PHP 4.3.3.

\Q and \E can be used to ignore regexp metacharacters in the pattern since PHP 4.3.3. For example: \w+\Q.$.\E$ will match one or more word characters, followed by literals .$. and anchored at the end of the string.

Unicode character properties

Since PHP 4.4.0 and 5.1.0, three additional escape sequences to match generic character types are available when UTF-8 mode is selected. They are:

\p{xx}

a character with the xx property

\P{xx}

a character without the xx property

\X

an extended Unicode sequence

The property names represented by xx above are limited to the Unicode general category properties. Each character has exactly one such property, specified by a two-letter abbreviation. For compatibility with Perl, negation can be specified by including a circumflex between the opening brace and the property name. For example, \p{^Lu} is the same as \P{Lu}.

If only one letter is specified with \p or \P, it includes all the properties that start with that letter. In this case, in the absence of negation, the curly brackets in the escape sequence are optional; these two examples have the same effect:


      \p{L}
      \pL
     

表格 1. Supported property codes

COther
CcControl
CfFormat
CnUnassigned
CoPrivate use
CsSurrogate
LLetter
LlLower case letter
LmModifier letter
LoOther letter
LtTitle case letter
LuUpper case letter
MMark
McSpacing mark
MeEnclosing mark
MnNon-spacing mark
NNumber
NdDecimal number
NlLetter number
NoOther number
PPunctuation
PcConnector punctuation
PdDash punctuation
PeClose punctuation
PfFinal punctuation
PiInitial punctuation
PoOther punctuation
PsOpen punctuation
SSymbol
ScCurrency symbol
SkModifier symbol
SmMathematical symbol
SoOther symbol
ZSeparator
ZlLine separator
ZpParagraph separator
ZsSpace separator

Extended properties such as "Greek" or "InMusicalSymbols" are not supported by PCRE.

Specifying caseless matching does not affect these escape sequences. For example, \p{Lu} always matches only upper case letters.

The \X escape matches any number of Unicode characters that form an extended Unicode sequence. \X is equivalent to (?>\PM\pM*).

That is, it matches a character without the "mark" property, followed by zero or more characters with the "mark" property, and treats the sequence as an atomic group (see below). Characters with the "mark" property are typically accents that affect the preceding character.

Matching characters by Unicode property is not fast, because PCRE has to search a structure that contains data for over fifteen thousand characters. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE.

音调符(^)和美元符($)

在字符类之外,默认匹配模式下,音调符是一个仅在当前匹配点是目标字符串的开头时才为真的断言。在字符类之中,音调符的含义完全不同(见下面)。

如果涉及到几选一时音调符不需要是模式的第一个字符,但如果出现在某个分支中则应该是该选择分支的第一个字符。如果所有的选择分支都以音调符开头,这就是说,如果模式限制为只匹配目标的开头,那么这是一个紧固模式。(也有其它结构可以使模式成为紧固的。)

美元符是一个仅在当前匹配点是目标字符串的结尾或者当最后一个字符是换行符时其前面的位置时为 TRUE 的断言(默认情况下)。如果涉及到几选一时美元符不需要是模式的最后一个字符,但应该是其出现的分支中的最后一个字符。美元符在字符类之中没有特殊含义。

美元符的含义可被改变使其仅匹配字符串的结尾,只要在编译或匹配时设定了 PCRE_DOLLAR_ENDONLY 选项即可。这并不影响 \Z 断言。

如果设定了 PCRE_MULTILINE 选项则音调符和美元符的含义被改变了。此种情况下,它们分别匹配紧接着内部 "\n" 字符的之后和之前,再加上目标字符串的开头和结尾。例如模式 /^abc$/ 在多行模式下匹配了目标字符串 "def\nabc",但正常时不匹配。因此,由于所有分支都以 "^" 开头而在单行模式下成为紧固的模式在多行模式下为非紧固的。如果设定了 PCRE_MULTILINE,则 PCRE_DOLLAR_ENDONLY 选项会被忽略。

注意 \A,\Z 和 \z 序列在两种情况下都可以用来匹配目标的开头和结尾,如果模式所有的分支都以 \A 开始则其总是紧固的,不论是否设定了 PCRE_MULTILINE

句号(.)

在字符类之外,模式中的圆点可以匹配目标中的任何一个字符,包括不可打印字符,但不匹配换行符(默认情况下)。如果设定了 PCRE_DOTALL 则圆点也会匹配换行符。处理圆点与处理音调符和美元符是完全独立的,唯一的联系就是它们都涉及到换行符。圆点在字符类之中没有特殊含义。

\C 可以用来匹配单一字节。在 UTF-8 模式下这有意义,因为句号可以匹配由多个字节组成的整个字符。

方括号([])

左方括号开始了一个字符类,右方括号结束之。单独一个右方括号不是特殊字符。如果在字符类之中需要一个右方括号,则其应该是字符类中的第一个字符(如果有音调符的话,则紧接音调符之后),或者用反斜线转义。

字符类匹配目标中的一个字符,该字符必须是字符类定义的字符集中的一个;除非字符类中的第一个字符是音调符,此情况下目标字符必须不在字符类定义的字符集中。如果在字符类中需要音调符本身,则其必须不是第一个字符,或用反斜线转义。

举例说,字符类 [aeiou] 匹配了任何一个小写元音字母,而 [^aeiou] 匹配了任何一个不是小写元音字母的字符。注意音调符只是一个通过枚举指定那些不在字符类之中的字符的符号。不是断言:仍旧会消耗掉目标字符串中的一个字符,如果当前位置在字符串结尾的话则失败。

当设定了不区分大小写的匹配时,字符类中的任何字母同时代表了其大小写形式,因此举例说,小写的 [aeiou] 同时匹配了 "A" 和 "a",小写的 [^aeiou] 不匹配 "A",但区分大小写时则会匹配。

换行符在字符类中不会特殊对待,不论 PCRE_DOTALL 或者 PCRE_MULTILINE 选项设定了什么值。形如 [^a] 的字符类总是能够和换行符相匹配的。

减号(-)字符可以在字符类中指定一个字符范围。例如,[d-m] 匹配了 d 和 m 之间的任何字符,包括两者。如果字符类中需要减号本身,则必须用反斜线转义或者放到一个不能被解释为指定范围的位置,典型的位置是字符类中的第一个或最后一个字符。

字面上的 "]" 不可能被当成字符范围的结束。形如 [W-]46] 的模式会被解释为包括两个字符的字符类("W" and "-")后面跟着字符串 "46]",因此其会匹配 "W46]" 或者 "-46]"。然而,如果将 "]" 用反斜线转义,则会被当成范围的结束来解释。因此 [W-\]46] 会被解释为一个字符类,包含有一个范围以及两个单独的字符。八进制或十六进制表示的 "]" 也可以用来表示范围的结束。

范围是以 ASCII 比较顺序来操作的。也可以用于用数字表示的字符,例如 [\000-\037]。在不区分大小写匹配中如果范围里包括了字母,则同时匹配大小写字母。例如 [W-c] 等价于 [][\^_`wxyzabc] 不区分大小写地匹配。如果使用了 "fr" 区域的字符表,[\xc8-\xcb] 匹配了大小写的重音 E 字符。

字符类型 \d,\D,\s,\S,\w 和 \W 也可以出现于字符类中,并将其所能匹配的字符添加进字符类中。例如,[\dABCDEF] 匹配了任何十六进制数字。用音调符可以很方便地制定严格的字符集,例如 [^\W_] 匹配了任何字母或数字,但不匹配下划线。

任何除了 \,-,^(位于开头)以及结束的 ] 之外的非字母数字字符在字符类中都没有特殊含义,但是将它们转义也没有坏处。

竖线(|)

竖线字符用来分隔多选一模式。例如,模式:
gilbert|sullivan
匹配了 "gilbert" 或者 "sullivan" 中的一个。可以有任意多个分支,也可以有空的分支(匹配空字符串)。匹配进程从左到右轮流尝试每个分支,并使用第一个成功匹配的分支。如果分支在子模式(在下面定义)中,则“成功匹配”表示同时匹配了子模式中的分支以及主模式的其它部分。

内部选项设定

PCRE_CASELESSPCRE_MULTILINEPCRE_DOTALLPCRE_EXTRAPCRE_EXTENDED 的设定可以在模式内部通过包含在 "(?" 和 ")" 之间的 Perl 选项字母序列来改变。选项字母为:

表格 2. 内部选项字母

i代表 PCRE_CASELESS
m代表 PCRE_MULTILINE
s代表 PCRE_DOTALL
x代表 PCRE_EXTENDED
U代表 PCRE_UNGREEDY
X代表 PCRE_EXTRA

例如,(?im) 设定了不区分大小写,多行匹配。也可以通过在字母前加上减号来取消这些选项。例如组合的选项 (?im-sx),设定了 PCRE_CASELESSPCRE_MULTILINE,并取消了 PCRE_DOTALLPCRE_EXTENDED。如果一个字母在减号之前与之后都出现了,则该选项被取消设定。

如果选项改变出现于顶层(即不在子模式的括号中),则改变应用于其后的剩余模式。因此 /ab(?i)c/ 只匹配 "abc" 和 and "abC"。此行为是自 PHP 4.3.3 起绑定的 PCRE 4.0 中被修改的。在此版本之前 /ab(?i)c/ 的执行与 /abc/i 相同(例如匹配 "ABC" 和 "aBc")。

如果选项改变出现于子模式中,则效果不同。这是 Perl 5.005 的行为的一个变化。子模式中的选项改变只影响到子模式内部其后的部分,因此 (a(?i)b)c 将只匹配 "abc" 和 "aBc"(假定没有使用 PCRE_CASELESS)。这意味着选项在模式的不同部位可以造成不同的设定。在一个分支中的改变可以传递到同一个子模式中后面的分支中,例如 (a(?i)b|c) 将匹配 "ab","aB","c" 和 "C",尽管在匹配 "C" 的时候第一个分支会在选项设定之前就被丢弃。这是因为选项设定的效果是在编译时确定的,否则会造成非常怪异的行为。

PCRE 专用选项 PCRE_UNGREEDYPCRE_EXTRA 可以和 Perl 兼容选项以同样的方式来改变,分别使用字母 U 和 X。(?X) 标记设定有些特殊,它必须出现于任何其它特性之前。最好放在最开头的位置。

子模式

子模式由圆括号定界,可以嵌套。将模式中的一部分标记为子模式可以:

1. 将多选一的分支局部化。例如,模式:
cat(aract|erpillar|)
匹配了 "cat","cataract" 或 "caterpillar" 之一,没有圆括号的话将匹配 "cataract","erpillar" 或空字符串。

2. 将子模式设定为捕获子模式(如同以前定义的)。当整个模式匹配时,目标字符串中匹配了子模式的部分会通过 pcre_exec()ovector 参数传递回调用者。左圆括号从左到右计数(从 1 开始)以取得捕获子模式的数目。

例如,如果将字符串 "the red king" 来和模式
the ((red|white) (king|queen))
进行匹配,捕获的子串为 "red king","red" 以及 "king",并被计为 1,2 和 3。

The fact that plain parentheses fulfil two functions is not always helpful. There are often times when a grouping subpattern is required without a capturing requirement. If an opening parenthesis is followed by "?:", the subpattern does not do any capturing, and is not counted when computing the number of any subsequent capturing subpatterns. For example, if the string "the white queen" is matched against the pattern the ((?:red|white) (king|queen)) the captured substrings are "white queen" and "queen", and are numbered 1 and 2. The maximum number of captured substrings is 99, and the maximum number of all subpatterns, both capturing and non-capturing, is 200.

As a convenient shorthand, if any option settings are required at the start of a non-capturing subpattern, the option letters may appear between the "?" and the ":". Thus the two patterns


       (?i:saturday|sunday)
       (?:(?i)saturday|sunday)
    

match exactly the same set of strings. Because alternative branches are tried from left to right, and options are not reset until the end of the subpattern is reached, an option setting in one branch does affect subsequent branches, so the above patterns match "SUNDAY" as well as "Saturday".

It is possible to name the subpattern with (?P<name>pattern) since PHP 4.3.3. Array with matches will contain the match indexed by the string alongside the match indexed by a number, then.

Repetition

Repetition is specified by quantifiers, which can follow any of the following items:

  • a single character, possibly escaped

  • the . metacharacter

  • a character class

  • a back reference (see next section)

  • a parenthesized subpattern (unless it is an assertion - see below)

The general repetition quantifier specifies a minimum and maximum number of permitted matches, by giving the two numbers in curly brackets (braces), separated by a comma. The numbers must be less than 65536, and the first must be less than or equal to the second. For example: z{2,4} matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special character. If the second number is omitted, but the comma is present, there is no upper limit; if the second number and the comma are both omitted, the quantifier specifies an exact number of required matches. Thus [aeiou]{3,} matches at least 3 successive vowels, but may match many more, while \d{8} matches exactly 8 digits. An opening curly bracket that appears in a position where a quantifier is not allowed, or one that does not match the syntax of a quantifier, is taken as a literal character. For example, {,6} is not a quantifier, but a literal string of four characters.

The quantifier {0} is permitted, causing the expression to behave as if the previous item and the quantifier were not present.

For convenience (and historical compatibility) the three most common quantifiers have single-character abbreviations:

表格 3. Single-character quantifiers

*equivalent to {0,}
+equivalent to {1,}
?equivalent to {0,1}

It is possible to construct infinite loops by following a subpattern that can match no characters with a quantifier that has no upper limit, for example: (a?)*

Earlier versions of Perl and PCRE used to give an error at compile time for such patterns. However, because there are cases where this can be useful, such patterns are now accepted, but if any repetition of the subpattern does in fact match no characters, the loop is forcibly broken.

By default, the quantifiers are "greedy", that is, they match as much as possible (up to the maximum number of permitted times), without causing the rest of the pattern to fail. The classic example of where this gives problems is in trying to match comments in C programs. These appear between the sequences /* and */ and within the sequence, individual * and / characters may appear. An attempt to match C comments by applying the pattern /\*.*\*/ to the string /* first command */ not comment /* second comment */ fails, because it matches the entire string due to the greediness of the .* item.

However, if a quantifier is followed by a question mark, then it ceases to be greedy, and instead matches the minimum number of times possible, so the pattern /\*.*?\*/ does the right thing with the C comments. The meaning of the various quantifiers is not otherwise changed, just the preferred number of matches. Do not confuse this use of question mark with its use as a quantifier in its own right. Because it has two uses, it can sometimes appear doubled, as in \d??\d which matches one digit by preference, but can match two if that is the only way the rest of the pattern matches.

If the PCRE_UNGREEDY option is set (an option which is not available in Perl) then the quantifiers are not greedy by default, but individual ones can be made greedy by following them with a question mark. In other words, it inverts the default behaviour.

Quantifiers followed by + are "possessive". They eat as many characters as possible and don't return to match the rest of the pattern. Thus .*abc matches "aabc" but .*+abc doesn't because .*+ eats the whole string. Possessive quantifiers can be used to speed up processing since PHP 4.3.3.

When a parenthesized subpattern is quantified with a minimum repeat count that is greater than 1 or with a limited maximum, more store is required for the compiled pattern, in proportion to the size of the minimum or maximum.

If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent to Perl's /s) is set, thus allowing the . to match newlines, then the pattern is implicitly anchored, because whatever follows will be tried against every character position in the subject string, so there is no point in retrying the overall match at any position after the first. PCRE treats such a pattern as though it were preceded by \A. In cases where it is known that the subject string contains no newlines, it is worth setting PCRE_DOTALL when the pattern begins with .* in order to obtain this optimization, or alternatively using ^ to indicate anchoring explicitly.

When a capturing subpattern is repeated, the value captured is the substring that matched the final iteration. For example, after (tweedle[dume]{3}\s*)+ has matched "tweedledum tweedledee" the value of the captured substring is "tweedledee". However, if there are nested capturing subpatterns, the corresponding captured values may have been set in previous iterations. For example, after /(a|(b))+/ matches "aba" the value of the second captured substring is "b".

Back references

Outside a character class, a backslash followed by a digit greater than 0 (and possibly further digits) is a back reference to a capturing subpattern earlier (i.e. to its left) in the pattern, provided there have been that many previous capturing left parentheses.

However, if the decimal number following the backslash is less than 10, it is always taken as a back reference, and causes an error only if there are not that many capturing left parentheses in the entire pattern. In other words, the parentheses that are referenced need not be to the left of the reference for numbers less than 10. See the section entitled "Backslash" above for further details of the handling of digits following a backslash.

A back reference matches whatever actually matched the capturing subpattern in the current subject string, rather than anything matching the subpattern itself. So the pattern (sens|respons)e and \1ibility matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility". If caseful matching is in force at the time of the back reference, then the case of letters is relevant. For example, ((?i)rah)\s+\1 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original capturing subpattern is matched caselessly.

There may be more than one back reference to the same subpattern. If a subpattern has not actually been used in a particular match, then any back references to it always fail. For example, the pattern (a|(bc))\2 always fails if it starts to match "a" rather than "bc". Because there may be up to 99 back references, all digits following the backslash are taken as part of a potential back reference number. If the pattern continues with a digit character, then some delimiter must be used to terminate the back reference. If the PCRE_EXTENDED option is set, this can be whitespace. Otherwise an empty comment can be used.

A back reference that occurs inside the parentheses to which it refers fails when the subpattern is first used, so, for example, (a\1) never matches. However, such references can be useful inside repeated subpatterns. For example, the pattern (a|b\1)+ matches any number of "a"s and also "aba", "ababaa" etc. At each iteration of the subpattern, the back reference matches the character string corresponding to the previous iteration. In order for this to work, the pattern must be such that the first iteration does not need to match the back reference. This can be done using alternation, as in the example above, or by a quantifier with a minimum of zero.

Assertions

An assertion is a test on the characters following or preceding the current matching point that does not actually consume any characters. The simple assertions coded as \b, \B, \A, \Z, \z, ^ and $ are described above. More complicated assertions are coded as subpatterns. There are two kinds: those that look ahead of the current position in the subject string, and those that look behind it.

An assertion subpattern is matched in the normal way, except that it does not cause the current matching position to be changed. Lookahead assertions start with (?= for positive assertions and (?! for negative assertions. For example, \w+(?=;) matches a word followed by a semicolon, but does not include the semicolon in the match, and foo(?!bar) matches any occurrence of "foo" that is not followed by "bar". Note that the apparently similar pattern (?!foo)bar does not find an occurrence of "bar" that is preceded by something other than "foo"; it finds any occurrence of "bar" whatsoever, because the assertion (?!foo) is always TRUE when the next three characters are "bar". A lookbehind assertion is needed to achieve this effect.

Lookbehind assertions start with (?<= for positive assertions and (?<! for negative assertions. For example, (?<!foo)bar does find an occurrence of "bar" that is not preceded by "foo". The contents of a lookbehind assertion are restricted such that all the strings it matches must have a fixed length. However, if there are several alternatives, they do not all have to have the same fixed length. Thus (?<=bullock|donkey) is permitted, but (?<!dogs?|cats?) causes an error at compile time. Branches that match different length strings are permitted only at the top level of a lookbehind assertion. This is an extension compared with Perl 5.005, which requires all branches to match the same length of string. An assertion such as (?<=ab(c|de)) is not permitted, because its single top-level branch can match two different lengths, but it is acceptable if rewritten to use two top-level branches: (?<=abc|abde) The implementation of lookbehind assertions is, for each alternative, to temporarily move the current position back by the fixed width and then try to match. If there are insufficient characters before the current position, the match is deemed to fail. Lookbehinds in conjunction with once-only subpatterns can be particularly useful for matching at the ends of strings; an example is given at the end of the section on once-only subpatterns.

Several assertions (of any sort) may occur in succession. For example, (?<=\d{3})(?<!999)foo matches "foo" preceded by three digits that are not "999". Notice that each of the assertions is applied independently at the same point in the subject string. First there is a check that the previous three characters are all digits, then there is a check that the same three characters are not "999". This pattern does not match "foo" preceded by six characters, the first of which are digits and the last three of which are not "999". For example, it doesn't match "123abcfoo". A pattern to do that is (?<=\d{3}...)(?<!999)foo

This time the first assertion looks at the preceding six characters, checking that the first three are digits, and then the second assertion checks that the preceding three characters are not "999".

Assertions can be nested in any combination. For example, (?<=(?<!foo)bar)baz matches an occurrence of "baz" that is preceded by "bar" which in turn is not preceded by "foo", while (?<=\d{3}...(?<!999))foo is another pattern which matches "foo" preceded by three digits and any three characters that are not "999".

Assertion subpatterns are not capturing subpatterns, and may not be repeated, because it makes no sense to assert the same thing several times. If any kind of assertion contains capturing subpatterns within it, these are counted for the purposes of numbering the capturing subpatterns in the whole pattern. However, substring capturing is carried out only for positive assertions, because it does not make sense for negative assertions.

Assertions count towards the maximum of 200 parenthesized subpatterns.

Once-only subpatterns

With both maximizing and minimizing repetition, failure of what follows normally causes the repeated item to be re-evaluated to see if a different number of repeats allows the rest of the pattern to match. Sometimes it is useful to prevent this, either to change the nature of the match, or to cause it fail earlier than it otherwise might, when the author of the pattern knows there is no point in carrying on.

Consider, for example, the pattern \d+foo when applied to the subject line 123456bar

After matching all 6 digits and then failing to match "foo", the normal action of the matcher is to try again with only 5 digits matching the \d+ item, and then with 4, and so on, before ultimately failing. Once-only subpatterns provide the means for specifying that once a portion of the pattern has matched, it is not to be re-evaluated in this way, so the matcher would give up immediately on failing to match "foo" the first time. The notation is another kind of special parenthesis, starting with (?> as in this example: (?>\d+)bar

This kind of parenthesis "locks up" the part of the pattern it contains once it has matched, and a failure further into the pattern is prevented from backtracking into it. Backtracking past it to previous items, however, works as normal.

An alternative description is that a subpattern of this type matches the string of characters that an identical standalone pattern would match, if anchored at the current point in the subject string.

Once-only subpatterns are not capturing subpatterns. Simple cases such as the above example can be thought of as a maximizing repeat that must swallow everything it can. So, while both \d+ and \d+? are prepared to adjust the number of digits they match in order to make the rest of the pattern match, (?>\d+) can only match an entire sequence of digits.

This construction can of course contain arbitrarily complicated subpatterns, and it can be nested.

Once-only subpatterns can be used in conjunction with look-behind assertions to specify efficient matching at the end of the subject string. Consider a simple pattern such as abcd$ when applied to a long string which does not match. Because matching proceeds from left to right, PCRE will look for each "a" in the subject and then see if what follows matches the rest of the pattern. If the pattern is specified as ^.*abcd$ then the initial .* matches the entire string at first, but when this fails (because there is no following "a"), it backtracks to match all but the last character, then all but the last two characters, and so on. Once again the search for "a" covers the entire string, from right to left, so we are no better off. However, if the pattern is written as ^(?>.*)(?<=abcd) then there can be no backtracking for the .* item; it can match only the entire string. The subsequent lookbehind assertion does a single test on the last four characters. If it fails, the match fails immediately. For long strings, this approach makes a significant difference to the processing time.

When a pattern contains an unlimited repeat inside a subpattern that can itself be repeated an unlimited number of times, the use of a once-only subpattern is the only way to avoid some failing matches taking a very long time indeed. The pattern (\D+|<\d+>)*[!?] matches an unlimited number of substrings that either consist of non-digits, or digits enclosed in <>, followed by either ! or ?. When it matches, it runs quickly. However, if it is applied to aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa it takes a long time before reporting failure. This is because the string can be divided between the two repeats in a large number of ways, and all have to be tried. (The example used [!?] rather than a single character at the end, because both PCRE and Perl have an optimization that allows for fast failure when a single character is used. They remember the last single character that is required for a match, and fail early if it is not present in the string.) If the pattern is changed to ((?>\D+)|<\d+>)*[!?] sequences of non-digits cannot be broken, and failure happens quickly.

Conditional subpatterns

It is possible to cause the matching process to obey a subpattern conditionally or to choose between two alternative subpatterns, depending on the result of an assertion, or whether a previous capturing subpattern matched or not. The two possible forms of conditional subpattern are


       (?(condition)yes-pattern)
       (?(condition)yes-pattern|no-pattern)
    

If the condition is satisfied, the yes-pattern is used; otherwise the no-pattern (if present) is used. If there are more than two alternatives in the subpattern, a compile-time error occurs.

There are two kinds of condition. If the text between the parentheses consists of a sequence of digits, then the condition is satisfied if the capturing subpattern of that number has previously matched. Consider the following pattern, which contains non-significant white space to make it more readable (assume the PCRE_EXTENDED option) and to divide it into three parts for ease of discussion: ( \( )? [^()]+ (?(1) \) )

The first part matches an optional opening parenthesis, and if that character is present, sets it as the first captured substring. The second part matches one or more characters that are not parentheses. The third part is a conditional subpattern that tests whether the first set of parentheses matched or not. If they did, that is, if subject started with an opening parenthesis, the condition is TRUE, and so the yes-pattern is executed and a closing parenthesis is required. Otherwise, since no-pattern is not present, the subpattern matches nothing. In other words, this pattern matches a sequence of non-parentheses, optionally enclosed in parentheses.

If the condition is the string (R), it is satisfied if a recursive call to the pattern or subpattern has been made. At "top level", the condition is false.

If the condition is not a sequence of digits or (R), it must be an assertion. This may be a positive or negative lookahead or lookbehind assertion. Consider this pattern, again containing non-significant white space, and with the two alternatives on the second line:


       (?(?=[^a-z]*[a-z])
       \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
    

The condition is a positive lookahead assertion that matches an optional sequence of non-letters followed by a letter. In other words, it tests for the presence of at least one letter in the subject. If a letter is found, the subject is matched against the first alternative; otherwise it is matched against the second. This pattern matches strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.

Comments

The sequence (?# marks the start of a comment which continues up to the next closing parenthesis. Nested parentheses are not permitted. The characters that make up a comment play no part in the pattern matching at all.

If the PCRE_EXTENDED option is set, an unescaped # character outside a character class introduces a comment that continues up to the next newline character in the pattern.

Recursive patterns

Consider the problem of matching a string in parentheses, allowing for unlimited nested parentheses. Without the use of recursion, the best that can be done is to use a pattern that matches up to some fixed depth of nesting. It is not possible to handle an arbitrary nesting depth. Perl 5.6 has provided an experimental facility that allows regular expressions to recurse (among other things). The special item (?R) is provided for the specific case of recursion. This PCRE pattern solves the parentheses problem (assume the PCRE_EXTENDED option is set so that white space is ignored): \( ( (?>[^()]+) | (?R) )* \)

First it matches an opening parenthesis. Then it matches any number of substrings which can either be a sequence of non-parentheses, or a recursive match of the pattern itself (i.e. a correctly parenthesized substring). Finally there is a closing parenthesis.

This particular example pattern contains nested unlimited repeats, and so the use of a once-only subpattern for matching strings of non-parentheses is important when applying the pattern to strings that do not match. For example, when it is applied to (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() it yields "no match" quickly. However, if a once-only subpattern is not used, the match runs for a very long time indeed because there are so many different ways the + and * repeats can carve up the subject, and all have to be tested before failure can be reported.

The values set for any capturing subpatterns are those from the outermost level of the recursion at which the subpattern value is set. If the pattern above is matched against (ab(cd)ef) the value for the capturing parentheses is "ef", which is the last value taken on at the top level. If additional parentheses are added, giving \( ( ( (?>[^()]+) | (?R) )* ) \) then the string they capture is "ab(cd)ef", the contents of the top level parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE has to obtain extra memory to store data during a recursion, which it does by using pcre_malloc, freeing it via pcre_free afterwards. If no memory can be obtained, it saves data for the first 15 capturing parentheses only, as there is no way to give an out-of-memory error from within a recursion.

Since PHP 4.3.3, (?1), (?2) and so on can be used for recursive subpatterns too. It is also possible to use named subpatterns: (?P>foo).

If the syntax for a recursive subpattern reference (either by number or by name) is used outside the parentheses to which it refers, it operates like a subroutine in a programming language. An earlier example pointed out that the pattern (sens|respons)e and \1ibility matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility". If instead the pattern (sens|respons)e and (?1)ibility is used, it does match "sense and responsibility" as well as the other two strings. Such references must, however, follow the subpattern to which they refer.

Performances

Certain items that may appear in patterns are more efficient than others. It is more efficient to use a character class like [aeiou] than a set of alternatives such as (a|e|i|o|u). In general, the simplest construction that provides the required behaviour is usually the most efficient. Jeffrey Friedl's book contains a lot of discussion about optimizing regular expressions for efficient performance.

When a pattern begins with .* and the PCRE_DOTALL option is set, the pattern is implicitly anchored by PCRE, since it can match only at the start of a subject string. However, if PCRE_DOTALL is not set, PCRE cannot make this optimization, because the . metacharacter does not then match a newline, and if the subject string contains newlines, the pattern may match from the character immediately following one of them instead of from the very start. For example, the pattern (.*) second matches the subject "first\nand second" (where \n stands for a newline character) with the first captured substring being "and". In order to do this, PCRE has to retry the match starting after every newline in the subject.

If you are using such a pattern with subject strings that do not contain newlines, the best performance is obtained by setting PCRE_DOTALL, or starting the pattern with ^.* to indicate explicit anchoring. That saves PCRE from having to scan along the subject looking for a newline to restart at.

Beware of patterns that contain nested indefinite repeats. These can take a long time to run when applied to a string that does not match. Consider the pattern fragment (a+)*

This can match "aaaa" in 33 different ways, and this number increases very rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4 times, and for each of those cases other than 0, the + repeats can match different numbers of times.) When the remainder of the pattern is such that the entire match is going to fail, PCRE has in principle to try every possible variation, and this can take an extremely long time.

An optimization catches some of the more simple cases such as (a+)*b where a literal character follows. Before embarking on the standard matching procedure, PCRE checks that there is a "b" later in the subject string, and if there is not, it fails the match immediately. However, when there is no following literal this optimization cannot be used. You can see the difference by comparing the behaviour of (a+)*\d with the pattern above. The former gives a failure almost instantly when applied to a whole line of "a" characters, whereas the latter takes an appreciable time with strings longer than about 20 characters.