Import Tcl-core 8.6.6 (as of svn r86089)
This commit is contained in:
858
doc/re_syntax.n
Normal file
858
doc/re_syntax.n
Normal file
@@ -0,0 +1,858 @@
|
||||
'\"
|
||||
'\" Copyright (c) 1998 Sun Microsystems, Inc.
|
||||
'\" Copyright (c) 1999 Scriptics Corporation
|
||||
'\"
|
||||
'\" See the file "license.terms" for information on usage and redistribution
|
||||
'\" of this file, and for a DISCLAIMER OF ALL WARRANTIES.
|
||||
'\"
|
||||
.so man.macros
|
||||
.ie '\w'o''\w'\C'^o''' .ds qo \C'^o'
|
||||
.el .ds qo u
|
||||
.TH re_syntax n "8.1" Tcl "Tcl Built-In Commands"
|
||||
.BS
|
||||
.SH NAME
|
||||
re_syntax \- Syntax of Tcl regular expressions
|
||||
.BE
|
||||
.SH DESCRIPTION
|
||||
.PP
|
||||
A \fIregular expression\fR describes strings of characters.
|
||||
It's a pattern that matches certain strings and does not match others.
|
||||
.SH "DIFFERENT FLAVORS OF REs"
|
||||
Regular expressions
|
||||
.PQ RE s ,
|
||||
as defined by POSIX, come in two flavors: \fIextended\fR REs
|
||||
.PQ ERE s
|
||||
and \fIbasic\fR REs
|
||||
.PQ BRE s .
|
||||
EREs are roughly those of the traditional \fIegrep\fR, while BREs are
|
||||
roughly those of the traditional \fIed\fR. This implementation adds
|
||||
a third flavor, \fIadvanced\fR REs
|
||||
.PQ ARE s ,
|
||||
basically EREs with some significant extensions.
|
||||
.PP
|
||||
This manual page primarily describes AREs. BREs mostly exist for
|
||||
backward compatibility in some old programs; they will be discussed at
|
||||
the end. POSIX EREs are almost an exact subset of AREs. Features of
|
||||
AREs that are not present in EREs will be indicated.
|
||||
.SH "REGULAR EXPRESSION SYNTAX"
|
||||
.PP
|
||||
Tcl regular expressions are implemented using the package written by
|
||||
Henry Spencer, based on the 1003.2 spec and some (not quite all) of
|
||||
the Perl5 extensions (thanks, Henry!). Much of the description of
|
||||
regular expressions below is copied verbatim from his manual entry.
|
||||
.PP
|
||||
An ARE is one or more \fIbranches\fR,
|
||||
separated by
|
||||
.QW \fB|\fR ,
|
||||
matching anything that matches any of the branches.
|
||||
.PP
|
||||
A branch is zero or more \fIconstraints\fR or \fIquantified atoms\fR,
|
||||
concatenated.
|
||||
It matches a match for the first, followed by a match for the second, etc;
|
||||
an empty branch matches the empty string.
|
||||
.SS QUANTIFIERS
|
||||
A quantified atom is an \fIatom\fR possibly followed
|
||||
by a single \fIquantifier\fR.
|
||||
Without a quantifier, it matches a single match for the atom.
|
||||
The quantifiers,
|
||||
and what a so-quantified atom matches, are:
|
||||
.RS 2
|
||||
.TP 6
|
||||
\fB*\fR
|
||||
.
|
||||
a sequence of 0 or more matches of the atom
|
||||
.TP
|
||||
\fB+\fR
|
||||
.
|
||||
a sequence of 1 or more matches of the atom
|
||||
.TP
|
||||
\fB?\fR
|
||||
.
|
||||
a sequence of 0 or 1 matches of the atom
|
||||
.TP
|
||||
\fB{\fIm\fB}\fR
|
||||
.
|
||||
a sequence of exactly \fIm\fR matches of the atom
|
||||
.TP
|
||||
\fB{\fIm\fB,}\fR
|
||||
.
|
||||
a sequence of \fIm\fR or more matches of the atom
|
||||
.TP
|
||||
\fB{\fIm\fB,\fIn\fB}\fR
|
||||
.
|
||||
a sequence of \fIm\fR through \fIn\fR (inclusive) matches of the atom;
|
||||
\fIm\fR may not exceed \fIn\fR
|
||||
.TP
|
||||
\fB*? +? ?? {\fIm\fB}? {\fIm\fB,}? {\fIm\fB,\fIn\fB}?\fR
|
||||
.
|
||||
\fInon-greedy\fR quantifiers, which match the same possibilities,
|
||||
but prefer the smallest number rather than the largest number
|
||||
of matches (see \fBMATCHING\fR)
|
||||
.RE
|
||||
.PP
|
||||
The forms using \fB{\fR and \fB}\fR are known as \fIbound\fRs. The
|
||||
numbers \fIm\fR and \fIn\fR are unsigned decimal integers with
|
||||
permissible values from 0 to 255 inclusive.
|
||||
.SS ATOMS
|
||||
An atom is one of:
|
||||
.RS 2
|
||||
.IP \fB(\fIre\fB)\fR 6
|
||||
matches a match for \fIre\fR (\fIre\fR is any regular expression) with
|
||||
the match noted for possible reporting
|
||||
.IP \fB(?:\fIre\fB)\fR
|
||||
as previous, but does no reporting (a
|
||||
.QW non-capturing
|
||||
set of parentheses)
|
||||
.IP \fB()\fR
|
||||
matches an empty string, noted for possible reporting
|
||||
.IP \fB(?:)\fR
|
||||
matches an empty string, without reporting
|
||||
.IP \fB[\fIchars\fB]\fR
|
||||
a \fIbracket expression\fR, matching any one of the \fIchars\fR (see
|
||||
\fBBRACKET EXPRESSIONS\fR for more detail)
|
||||
.IP \fB.\fR
|
||||
matches any single character
|
||||
.IP \fB\e\fIk\fR
|
||||
matches the non-alphanumeric character \fIk\fR
|
||||
taken as an ordinary character, e.g. \fB\e\e\fR matches a backslash
|
||||
character
|
||||
.IP \fB\e\fIc\fR
|
||||
where \fIc\fR is alphanumeric (possibly followed by other characters),
|
||||
an \fIescape\fR (AREs only), see \fBESCAPES\fR below
|
||||
.IP \fB{\fR
|
||||
when followed by a character other than a digit, matches the
|
||||
left-brace character
|
||||
.QW \fB{\fR ;
|
||||
when followed by a digit, it is the beginning of a \fIbound\fR (see above)
|
||||
.IP \fIx\fR
|
||||
where \fIx\fR is a single character with no other significance,
|
||||
matches that character.
|
||||
.RE
|
||||
.SS CONSTRAINTS
|
||||
A \fIconstraint\fR matches an empty string when specific conditions
|
||||
are met. A constraint may not be followed by a quantifier. The
|
||||
simple constraints are as follows; some more constraints are described
|
||||
later, under \fBESCAPES\fR.
|
||||
.RS 2
|
||||
.TP 8
|
||||
\fB^\fR
|
||||
.
|
||||
matches at the beginning of a line
|
||||
.TP
|
||||
\fB$\fR
|
||||
.
|
||||
matches at the end of a line
|
||||
.TP
|
||||
\fB(?=\fIre\fB)\fR
|
||||
.
|
||||
\fIpositive lookahead\fR (AREs only), matches at any point where a
|
||||
substring matching \fIre\fR begins
|
||||
.TP
|
||||
\fB(?!\fIre\fB)\fR
|
||||
.
|
||||
\fInegative lookahead\fR (AREs only), matches at any point where no
|
||||
substring matching \fIre\fR begins
|
||||
.RE
|
||||
.PP
|
||||
The lookahead constraints may not contain back references (see later),
|
||||
and all parentheses within them are considered non-capturing.
|
||||
.PP
|
||||
An RE may not end with
|
||||
.QW \fB\e\fR .
|
||||
.SH "BRACKET EXPRESSIONS"
|
||||
A \fIbracket expression\fR is a list of characters enclosed in
|
||||
.QW \fB[\|]\fR .
|
||||
It normally matches any single character from the list
|
||||
(but see below). If the list begins with
|
||||
.QW \fB^\fR ,
|
||||
it matches any single character (but see below) \fInot\fR from the
|
||||
rest of the list.
|
||||
.PP
|
||||
If two characters in the list are separated by
|
||||
.QW \fB\-\fR ,
|
||||
this is shorthand for the full \fIrange\fR of characters between those two
|
||||
(inclusive) in the collating sequence, e.g.
|
||||
.QW \fB[0\-9]\fR
|
||||
in Unicode matches any conventional decimal digit. Two ranges may not share an
|
||||
endpoint, so e.g.
|
||||
.QW \fBa\-c\-e\fR
|
||||
is illegal. Ranges in Tcl always use the
|
||||
Unicode collating sequence, but other programs may use other collating
|
||||
sequences and this can be a source of incompatibility between programs.
|
||||
.PP
|
||||
To include a literal \fB]\fR or \fB\-\fR in the list, the simplest
|
||||
method is to enclose it in \fB[.\fR and \fB.]\fR to make it a
|
||||
collating element (see below). Alternatively, make it the first
|
||||
character (following a possible
|
||||
.QW \fB^\fR ),
|
||||
or (AREs only) precede it with
|
||||
.QW \fB\e\fR .
|
||||
Alternatively, for
|
||||
.QW \fB\-\fR ,
|
||||
make it the last character, or the second endpoint of a range. To use
|
||||
a literal \fB\-\fR as the first endpoint of a range, make it a
|
||||
collating element or (AREs only) precede it with
|
||||
.QW \fB\e\fR .
|
||||
With the exception of
|
||||
these, some combinations using \fB[\fR (see next paragraphs), and
|
||||
escapes, all other special characters lose their special significance
|
||||
within a bracket expression.
|
||||
.SS "CHARACTER CLASSES"
|
||||
Within a bracket expression, the name of a \fIcharacter class\fR
|
||||
enclosed in \fB[:\fR and \fB:]\fR stands for the list of all
|
||||
characters (not all collating elements!) belonging to that class.
|
||||
Standard character classes are:
|
||||
.IP \fBalpha\fR 8
|
||||
A letter.
|
||||
.IP \fBupper\fR 8
|
||||
An upper-case letter.
|
||||
.IP \fBlower\fR 8
|
||||
A lower-case letter.
|
||||
.IP \fBdigit\fR 8
|
||||
A decimal digit.
|
||||
.IP \fBxdigit\fR 8
|
||||
A hexadecimal digit.
|
||||
.IP \fBalnum\fR 8
|
||||
An alphanumeric (letter or digit).
|
||||
.IP \fBprint\fR 8
|
||||
A "printable" (same as graph, except also including space).
|
||||
.IP \fBblank\fR 8
|
||||
A space or tab character.
|
||||
.IP \fBspace\fR 8
|
||||
A character producing white space in displayed text.
|
||||
.IP \fBpunct\fR 8
|
||||
A punctuation character.
|
||||
.IP \fBgraph\fR 8
|
||||
A character with a visible representation (includes both \fBalnum\fR
|
||||
and \fBpunct\fR).
|
||||
.IP \fBcntrl\fR 8
|
||||
A control character.
|
||||
.PP
|
||||
A locale may provide others. A character class may not be used as an endpoint
|
||||
of a range.
|
||||
.RS
|
||||
.PP
|
||||
(\fINote:\fR the current Tcl implementation has only one locale, the Unicode
|
||||
locale, which supports exactly the above classes.)
|
||||
.RE
|
||||
.SS "BRACKETED CONSTRAINTS"
|
||||
There are two special cases of bracket expressions: the bracket
|
||||
expressions
|
||||
.QW \fB[[:<:]]\fR
|
||||
and
|
||||
.QW \fB[[:>:]]\fR
|
||||
are constraints, matching empty strings at the beginning and end of a word
|
||||
respectively.
|
||||
.\" note, discussion of escapes below references this definition of word
|
||||
A word is defined as a sequence of word characters that is neither preceded
|
||||
nor followed by word characters. A word character is an \fIalnum\fR character
|
||||
or an underscore
|
||||
.PQ \fB_\fR "" .
|
||||
These special bracket expressions are deprecated; users of AREs should use
|
||||
constraint escapes instead (see below).
|
||||
.SS "COLLATING ELEMENTS"
|
||||
Within a bracket expression, a collating element (a character, a
|
||||
multi-character sequence that collates as if it were a single
|
||||
character, or a collating-sequence name for either) enclosed in
|
||||
\fB[.\fR and \fB.]\fR stands for the sequence of characters of that
|
||||
collating element. The sequence is a single element of the bracket
|
||||
expression's list. A bracket expression in a locale that has
|
||||
multi-character collating elements can thus match more than one
|
||||
character. So (insidiously), a bracket expression that starts with
|
||||
\fB^\fR can match multi-character collating elements even if none of
|
||||
them appear in the bracket expression!
|
||||
.RS
|
||||
.PP
|
||||
(\fINote:\fR Tcl has no multi-character collating elements. This information
|
||||
is only for illustration.)
|
||||
.RE
|
||||
.PP
|
||||
For example, assume the collating sequence includes a \fBch\fR multi-character
|
||||
collating element. Then the RE
|
||||
.QW \fB[[.ch.]]*c\fR
|
||||
(zero or more
|
||||
.QW \fBch\fRs
|
||||
followed by
|
||||
.QW \fBc\fR )
|
||||
matches the first five characters of
|
||||
.QW \fBchchcc\fR .
|
||||
Also, the RE
|
||||
.QW \fB[^c]b\fR
|
||||
matches all of
|
||||
.QW \fBchb\fR
|
||||
(because
|
||||
.QW \fB[^c]\fR
|
||||
matches the multi-character
|
||||
.QW \fBch\fR ).
|
||||
.SS "EQUIVALENCE CLASSES"
|
||||
Within a bracket expression, a collating element enclosed in \fB[=\fR
|
||||
and \fB=]\fR is an equivalence class, standing for the sequences of
|
||||
characters of all collating elements equivalent to that one, including
|
||||
itself. (If there are no other equivalent collating elements, the
|
||||
treatment is as if the enclosing delimiters were
|
||||
.QW \fB[.\fR \&
|
||||
and
|
||||
.QW \fB.]\fR .)
|
||||
For example, if \fBo\fR and \fB\*(qo\fR are the members of an
|
||||
equivalence class, then
|
||||
.QW \fB[[=o=]]\fR ,
|
||||
.QW \fB[[=\*(qo=]]\fR ,
|
||||
and
|
||||
.QW \fB[o\*(qo]\fR \&
|
||||
are all synonymous. An equivalence class may not be an endpoint of a range.
|
||||
.RS
|
||||
.PP
|
||||
(\fINote:\fR Tcl implements only the Unicode locale. It does not define any
|
||||
equivalence classes. The examples above are just illustrations.)
|
||||
.RE
|
||||
.SH ESCAPES
|
||||
Escapes (AREs only), which begin with a \fB\e\fR followed by an
|
||||
alphanumeric character, come in several varieties: character entry,
|
||||
class shorthands, constraint escapes, and back references. A \fB\e\fR
|
||||
followed by an alphanumeric character but not constituting a valid
|
||||
escape is illegal in AREs. In EREs, there are no escapes: outside a
|
||||
bracket expression, a \fB\e\fR followed by an alphanumeric character
|
||||
merely stands for that character as an ordinary character, and inside
|
||||
a bracket expression, \fB\e\fR is an ordinary character. (The latter
|
||||
is the one actual incompatibility between EREs and AREs.)
|
||||
.SS "CHARACTER-ENTRY ESCAPES"
|
||||
Character-entry escapes (AREs only) exist to make it easier to specify
|
||||
non-printing and otherwise inconvenient characters in REs:
|
||||
.RS 2
|
||||
.TP 5
|
||||
\fB\ea\fR
|
||||
.
|
||||
alert (bell) character, as in C
|
||||
.TP
|
||||
\fB\eb\fR
|
||||
.
|
||||
backspace, as in C
|
||||
.TP
|
||||
\fB\eB\fR
|
||||
.
|
||||
synonym for \fB\e\fR to help reduce backslash doubling in some
|
||||
applications where there are multiple levels of backslash processing
|
||||
.TP
|
||||
\fB\ec\fIX\fR
|
||||
.
|
||||
(where \fIX\fR is any character) the character whose low-order 5 bits
|
||||
are the same as those of \fIX\fR, and whose other bits are all zero
|
||||
.TP
|
||||
\fB\ee\fR
|
||||
.
|
||||
the character whose collating-sequence name is
|
||||
.QW \fBESC\fR ,
|
||||
or failing that, the character with octal value 033
|
||||
.TP
|
||||
\fB\ef\fR
|
||||
.
|
||||
formfeed, as in C
|
||||
.TP
|
||||
\fB\en\fR
|
||||
.
|
||||
newline, as in C
|
||||
.TP
|
||||
\fB\er\fR
|
||||
.
|
||||
carriage return, as in C
|
||||
.TP
|
||||
\fB\et\fR
|
||||
.
|
||||
horizontal tab, as in C
|
||||
.TP
|
||||
\fB\eu\fIwxyz\fR
|
||||
.
|
||||
(where \fIwxyz\fR is one up to four hexadecimal digits) the Unicode
|
||||
character \fBU+\fIwxyz\fR in the local byte ordering
|
||||
.TP
|
||||
\fB\eU\fIstuvwxyz\fR
|
||||
.
|
||||
(where \fIstuvwxyz\fR is one up to eight hexadecimal digits) reserved
|
||||
for a Unicode extension up to 21 bits. The digits are parsed until the
|
||||
first non-hexadecimal character is encountered, the maximun of eight
|
||||
hexadecimal digits are reached, or an overflow would occur in the maximum
|
||||
value of \fBU+\fI10ffff\fR.
|
||||
.TP
|
||||
\fB\ev\fR
|
||||
.
|
||||
vertical tab, as in C are all available.
|
||||
.TP
|
||||
\fB\ex\fIhh\fR
|
||||
.
|
||||
(where \fIhh\fR is one or two hexadecimal digits) the character
|
||||
whose hexadecimal value is \fB0x\fIhh\fR.
|
||||
.TP
|
||||
\fB\e0\fR
|
||||
.
|
||||
the character whose value is \fB0\fR
|
||||
.TP
|
||||
\fB\e\fIxyz\fR
|
||||
.
|
||||
(where \fIxyz\fR is exactly three octal digits, and is not a \fIback
|
||||
reference\fR (see below)) the character whose octal value is
|
||||
\fB0\fIxyz\fR. The first digit must be in the range 0-3, otherwise
|
||||
the two-digit form is assumed.
|
||||
.TP
|
||||
\fB\e\fIxy\fR
|
||||
.
|
||||
(where \fIxy\fR is exactly two octal digits, and is not a \fIback
|
||||
reference\fR (see below)) the character whose octal value is
|
||||
\fB0\fIxy\fR
|
||||
.RE
|
||||
.PP
|
||||
Hexadecimal digits are
|
||||
.QR \fB0\fR \fB9\fR ,
|
||||
.QR \fBa\fR \fBf\fR ,
|
||||
and
|
||||
.QR \fBA\fR \fBF\fR .
|
||||
Octal digits are
|
||||
.QR \fB0\fR \fB7\fR .
|
||||
.PP
|
||||
The character-entry escapes are always taken as ordinary characters.
|
||||
For example, \fB\e135\fR is \fB]\fR in Unicode, but \fB\e135\fR does
|
||||
not terminate a bracket expression. Beware, however, that some
|
||||
applications (e.g., C compilers and the Tcl interpreter if the regular
|
||||
expression is not quoted with braces) interpret such sequences
|
||||
themselves before the regular-expression package gets to see them,
|
||||
which may require doubling (quadrupling, etc.) the
|
||||
.QW \fB\e\fR .
|
||||
.SS "CLASS-SHORTHAND ESCAPES"
|
||||
Class-shorthand escapes (AREs only) provide shorthands for certain
|
||||
commonly-used character classes:
|
||||
.RS 2
|
||||
.TP 10
|
||||
\fB\ed\fR
|
||||
.
|
||||
\fB[[:digit:]]\fR
|
||||
.TP
|
||||
\fB\es\fR
|
||||
.
|
||||
\fB[[:space:]]\fR
|
||||
.TP
|
||||
\fB\ew\fR
|
||||
.
|
||||
\fB[[:alnum:]_]\fR (note underscore)
|
||||
.TP
|
||||
\fB\eD\fR
|
||||
.
|
||||
\fB[^[:digit:]]\fR
|
||||
.TP
|
||||
\fB\eS\fR
|
||||
.
|
||||
\fB[^[:space:]]\fR
|
||||
.TP
|
||||
\fB\eW\fR
|
||||
.
|
||||
\fB[^[:alnum:]_]\fR (note underscore)
|
||||
.RE
|
||||
.PP
|
||||
Within bracket expressions,
|
||||
.QW \fB\ed\fR ,
|
||||
.QW \fB\es\fR ,
|
||||
and
|
||||
.QW \fB\ew\fR \&
|
||||
lose their outer brackets, and
|
||||
.QW \fB\eD\fR ,
|
||||
.QW \fB\eS\fR ,
|
||||
and
|
||||
.QW \fB\eW\fR \&
|
||||
are illegal. (So, for example,
|
||||
.QW \fB[a-c\ed]\fR
|
||||
is equivalent to
|
||||
.QW \fB[a-c[:digit:]]\fR .
|
||||
Also,
|
||||
.QW \fB[a-c\eD]\fR ,
|
||||
which is equivalent to
|
||||
.QW \fB[a-c^[:digit:]]\fR ,
|
||||
is illegal.)
|
||||
.SS "CONSTRAINT ESCAPES"
|
||||
A constraint escape (AREs only) is a constraint, matching the empty
|
||||
string if specific conditions are met, written as an escape:
|
||||
.RS 2
|
||||
.TP 6
|
||||
\fB\eA\fR
|
||||
.
|
||||
matches only at the beginning of the string (see \fBMATCHING\fR,
|
||||
below, for how this differs from
|
||||
.QW \fB^\fR )
|
||||
.TP
|
||||
\fB\em\fR
|
||||
.
|
||||
matches only at the beginning of a word
|
||||
.TP
|
||||
\fB\eM\fR
|
||||
.
|
||||
matches only at the end of a word
|
||||
.TP
|
||||
\fB\ey\fR
|
||||
.
|
||||
matches only at the beginning or end of a word
|
||||
.TP
|
||||
\fB\eY\fR
|
||||
.
|
||||
matches only at a point that is not the beginning or end of a word
|
||||
.TP
|
||||
\fB\eZ\fR
|
||||
.
|
||||
matches only at the end of the string (see \fBMATCHING\fR, below, for
|
||||
how this differs from
|
||||
.QW \fB$\fR )
|
||||
.TP
|
||||
\fB\e\fIm\fR
|
||||
.
|
||||
(where \fIm\fR is a nonzero digit) a \fIback reference\fR, see below
|
||||
.TP
|
||||
\fB\e\fImnn\fR
|
||||
.
|
||||
(where \fIm\fR is a nonzero digit, and \fInn\fR is some more digits,
|
||||
and the decimal value \fImnn\fR is not greater than the number of
|
||||
closing capturing parentheses seen so far) a \fIback reference\fR, see
|
||||
below
|
||||
.RE
|
||||
.PP
|
||||
A word is defined as in the specification of
|
||||
.QW \fB[[:<:]]\fR
|
||||
and
|
||||
.QW \fB[[:>:]]\fR
|
||||
above. Constraint escapes are illegal within bracket expressions.
|
||||
.SS "BACK REFERENCES"
|
||||
A back reference (AREs only) matches the same string matched by the
|
||||
parenthesized subexpression specified by the number, so that (e.g.)
|
||||
.QW \fB([bc])\e1\fR
|
||||
matches
|
||||
.QW \fBbb\fR
|
||||
or
|
||||
.QW \fBcc\fR
|
||||
but not
|
||||
.QW \fBbc\fR .
|
||||
The subexpression must entirely precede the back reference in the RE.
|
||||
Subexpressions are numbered in the order of their leading parentheses.
|
||||
Non-capturing parentheses do not define subexpressions.
|
||||
.PP
|
||||
There is an inherent historical ambiguity between octal
|
||||
character-entry escapes and back references, which is resolved by
|
||||
heuristics, as hinted at above. A leading zero always indicates an
|
||||
octal escape. A single non-zero digit, not followed by another digit,
|
||||
is always taken as a back reference. A multi-digit sequence not
|
||||
starting with a zero is taken as a back reference if it comes after a
|
||||
suitable subexpression (i.e. the number is in the legal range for a
|
||||
back reference), and otherwise is taken as octal.
|
||||
.SH "METASYNTAX"
|
||||
In addition to the main syntax described above, there are some special
|
||||
forms and miscellaneous syntactic facilities available.
|
||||
.PP
|
||||
Normally the flavor of RE being used is specified by
|
||||
application-dependent means. However, this can be overridden by a
|
||||
\fIdirector\fR. If an RE of any flavor begins with
|
||||
.QW \fB***:\fR ,
|
||||
the rest of the RE is an ARE. If an RE of any flavor begins with
|
||||
.QW \fB***=\fR ,
|
||||
the rest of the RE is taken to be a literal string, with
|
||||
all characters considered ordinary characters.
|
||||
.PP
|
||||
An ARE may begin with \fIembedded options\fR: a sequence
|
||||
\fB(?\fIxyz\fB)\fR (where \fIxyz\fR is one or more alphabetic
|
||||
characters) specifies options affecting the rest of the RE. These
|
||||
supplement, and can override, any options specified by the
|
||||
application. The available option letters are:
|
||||
.RS 2
|
||||
.TP 3
|
||||
\fBb\fR
|
||||
.
|
||||
rest of RE is a BRE
|
||||
.TP 3
|
||||
\fBc\fR
|
||||
.
|
||||
case-sensitive matching (usual default)
|
||||
.TP 3
|
||||
\fBe\fR
|
||||
.
|
||||
rest of RE is an ERE
|
||||
.TP 3
|
||||
\fBi\fR
|
||||
.
|
||||
case-insensitive matching (see \fBMATCHING\fR, below)
|
||||
.TP 3
|
||||
\fBm\fR
|
||||
.
|
||||
historical synonym for \fBn\fR
|
||||
.TP 3
|
||||
\fBn\fR
|
||||
.
|
||||
newline-sensitive matching (see \fBMATCHING\fR, below)
|
||||
.TP 3
|
||||
\fBp\fR
|
||||
.
|
||||
partial newline-sensitive matching (see \fBMATCHING\fR, below)
|
||||
.TP 3
|
||||
\fBq\fR
|
||||
.
|
||||
rest of RE is a literal
|
||||
.PQ quoted
|
||||
string, all ordinary characters
|
||||
.TP 3
|
||||
\fBs\fR
|
||||
.
|
||||
non-newline-sensitive matching (usual default)
|
||||
.TP 3
|
||||
\fBt\fR
|
||||
.
|
||||
tight syntax (usual default; see below)
|
||||
.TP 3
|
||||
\fBw\fR
|
||||
.
|
||||
inverse partial newline-sensitive
|
||||
.PQ weird
|
||||
matching (see \fBMATCHING\fR, below)
|
||||
.TP 3
|
||||
\fBx\fR
|
||||
.
|
||||
expanded syntax (see below)
|
||||
.RE
|
||||
.PP
|
||||
Embedded options take effect at the \fB)\fR terminating the sequence.
|
||||
They are available only at the start of an ARE, and may not be used
|
||||
later within it.
|
||||
.PP
|
||||
In addition to the usual (\fItight\fR) RE syntax, in which all
|
||||
characters are significant, there is an \fIexpanded\fR syntax,
|
||||
available in all flavors of RE with the \fB\-expanded\fR switch, or in
|
||||
AREs with the embedded x option. In the expanded syntax, white-space
|
||||
characters are ignored and all characters between a \fB#\fR and the
|
||||
following newline (or the end of the RE) are ignored, permitting
|
||||
paragraphing and commenting a complex RE. There are three exceptions
|
||||
to that basic rule:
|
||||
.IP \(bu 3
|
||||
a white-space character or
|
||||
.QW \fB#\fR
|
||||
preceded by
|
||||
.QW \fB\e\fR
|
||||
is retained
|
||||
.IP \(bu 3
|
||||
white space or
|
||||
.QW \fB#\fR
|
||||
within a bracket expression is retained
|
||||
.IP \(bu 3
|
||||
white space and comments are illegal within multi-character symbols
|
||||
like the ARE
|
||||
.QW \fB(?:\fR
|
||||
or the BRE
|
||||
.QW \fB\e(\fR
|
||||
.PP
|
||||
Expanded-syntax white-space characters are blank, tab, newline, and
|
||||
any character that belongs to the \fIspace\fR character class.
|
||||
.PP
|
||||
Finally, in an ARE, outside bracket expressions, the sequence
|
||||
.QW \fB(?#\fIttt\fB)\fR
|
||||
(where \fIttt\fR is any text not containing a
|
||||
.QW \fB)\fR )
|
||||
is a comment, completely ignored. Again, this is not
|
||||
allowed between the characters of multi-character symbols like
|
||||
.QW \fB(?:\fR .
|
||||
Such comments are more a historical artifact than a useful facility,
|
||||
and their use is deprecated; use the expanded syntax instead.
|
||||
.PP
|
||||
\fINone\fR of these metasyntax extensions is available if the
|
||||
application (or an initial
|
||||
.QW \fB***=\fR
|
||||
director) has specified that the
|
||||
user's input be treated as a literal string rather than as an RE.
|
||||
.SH MATCHING
|
||||
In the event that an RE could match more than one substring of a given
|
||||
string, the RE matches the one starting earliest in the string. If
|
||||
the RE could match more than one substring starting at that point, its
|
||||
choice is determined by its \fIpreference\fR: either the longest
|
||||
substring, or the shortest.
|
||||
.PP
|
||||
Most atoms, and all constraints, have no preference. A parenthesized
|
||||
RE has the same preference (possibly none) as the RE. A quantified
|
||||
atom with quantifier \fB{\fIm\fB}\fR or \fB{\fIm\fB}?\fR has the same
|
||||
preference (possibly none) as the atom itself. A quantified atom with
|
||||
other normal quantifiers (including \fB{\fIm\fB,\fIn\fB}\fR with
|
||||
\fIm\fR equal to \fIn\fR) prefers longest match. A quantified atom
|
||||
with other non-greedy quantifiers (including \fB{\fIm\fB,\fIn\fB}?\fR
|
||||
with \fIm\fR equal to \fIn\fR) prefers shortest match. A branch has
|
||||
the same preference as the first quantified atom in it which has a
|
||||
preference. An RE consisting of two or more branches connected by the
|
||||
\fB|\fR operator prefers longest match.
|
||||
.PP
|
||||
Subject to the constraints imposed by the rules for matching the whole
|
||||
RE, subexpressions also match the longest or shortest possible
|
||||
substrings, based on their preferences, with subexpressions starting
|
||||
earlier in the RE taking priority over ones starting later. Note that
|
||||
outer subexpressions thus take priority over their component
|
||||
subexpressions.
|
||||
.PP
|
||||
The quantifiers \fB{1,1}\fR and \fB{1,1}?\fR can be used to
|
||||
force longest and shortest preference, respectively, on a
|
||||
subexpression or a whole RE.
|
||||
.RS
|
||||
.PP
|
||||
\fBNOTE:\fR This means that you can usually make a RE be non-greedy overall by
|
||||
putting \fB{1,1}?\fR after one of the first non-constraint atoms or
|
||||
parenthesized sub-expressions in it. \fIIt pays to experiment\fR with the
|
||||
placing of this non-greediness override on a suitable range of input texts
|
||||
when you are writing a RE if you are using this level of complexity.
|
||||
.PP
|
||||
For example, this regular expression is non-greedy, and will match the
|
||||
shortest substring possible given that
|
||||
.QW \fBabc\fR
|
||||
will be matched as early as possible (the quantifier does not change that):
|
||||
.PP
|
||||
.CS
|
||||
ab{1,1}?c.*x.*cba
|
||||
.CE
|
||||
.PP
|
||||
The atom
|
||||
.QW \fBa\fR
|
||||
has no greediness preference, we explicitly give one for
|
||||
.QW \fBb\fR ,
|
||||
and the remaining quantifiers are overridden to be non-greedy by the preceding
|
||||
non-greedy quantifier.
|
||||
.RE
|
||||
.PP
|
||||
Match lengths are measured in characters, not collating elements. An
|
||||
empty string is considered longer than no match at all. For example,
|
||||
.QW \fBbb*\fR
|
||||
matches the three middle characters of
|
||||
.QW \fBabbbc\fR ,
|
||||
.QW \fB(week|wee)(night|knights)\fR
|
||||
matches all ten characters of
|
||||
.QW \fBweeknights\fR ,
|
||||
when
|
||||
.QW \fB(.*).*\fR
|
||||
is matched against
|
||||
.QW \fBabc\fR
|
||||
the parenthesized subexpression matches all three characters, and when
|
||||
.QW \fB(a*)*\fR
|
||||
is matched against
|
||||
.QW \fBbc\fR
|
||||
both the whole RE and the parenthesized subexpression match an empty string.
|
||||
.PP
|
||||
If case-independent matching is specified, the effect is much as if
|
||||
all case distinctions had vanished from the alphabet. When an
|
||||
alphabetic that exists in multiple cases appears as an ordinary
|
||||
character outside a bracket expression, it is effectively transformed
|
||||
into a bracket expression containing both cases, so that \fBx\fR
|
||||
becomes
|
||||
.QW \fB[xX]\fR .
|
||||
When it appears inside a bracket expression,
|
||||
all case counterparts of it are added to the bracket expression, so
|
||||
that
|
||||
.QW \fB[x]\fR
|
||||
becomes
|
||||
.QW \fB[xX]\fR
|
||||
and
|
||||
.QW \fB[^x]\fR
|
||||
becomes
|
||||
.QW \fB[^xX]\fR .
|
||||
.PP
|
||||
If newline-sensitive matching is specified, \fB.\fR and bracket
|
||||
expressions using \fB^\fR will never match the newline character (so
|
||||
that matches will never cross newlines unless the RE explicitly
|
||||
arranges it) and \fB^\fR and \fB$\fR will match the empty string after
|
||||
and before a newline respectively, in addition to matching at
|
||||
beginning and end of string respectively. ARE \fB\eA\fR and \fB\eZ\fR
|
||||
continue to match beginning or end of string \fIonly\fR.
|
||||
.PP
|
||||
If partial newline-sensitive matching is specified, this affects
|
||||
\fB.\fR and bracket expressions as with newline-sensitive matching,
|
||||
but not \fB^\fR and \fB$\fR.
|
||||
.PP
|
||||
If inverse partial newline-sensitive matching is specified, this
|
||||
affects \fB^\fR and \fB$\fR as with newline-sensitive matching, but
|
||||
not \fB.\fR and bracket expressions. This is not very useful but is
|
||||
provided for symmetry.
|
||||
.SH "LIMITS AND COMPATIBILITY"
|
||||
No particular limit is imposed on the length of REs. Programs
|
||||
intended to be highly portable should not employ REs longer than 256
|
||||
bytes, as a POSIX-compliant implementation can refuse to accept such
|
||||
REs.
|
||||
.PP
|
||||
The only feature of AREs that is actually incompatible with POSIX EREs
|
||||
is that \fB\e\fR does not lose its special significance inside bracket
|
||||
expressions. All other ARE features use syntax which is illegal or
|
||||
has undefined or unspecified effects in POSIX EREs; the \fB***\fR
|
||||
syntax of directors likewise is outside the POSIX syntax for both BREs
|
||||
and EREs.
|
||||
.PP
|
||||
Many of the ARE extensions are borrowed from Perl, but some have been
|
||||
changed to clean them up, and a few Perl extensions are not present.
|
||||
Incompatibilities of note include
|
||||
.QW \fB\eb\fR ,
|
||||
.QW \fB\eB\fR ,
|
||||
the lack of special treatment for a trailing newline, the addition of
|
||||
complemented bracket expressions to the things affected by
|
||||
newline-sensitive matching, the restrictions on parentheses and back
|
||||
references in lookahead constraints, and the longest/shortest-match
|
||||
(rather than first-match) matching semantics.
|
||||
.PP
|
||||
The matching rules for REs containing both normal and non-greedy
|
||||
quantifiers have changed since early beta-test versions of this
|
||||
package. (The new rules are much simpler and cleaner, but do not work
|
||||
as hard at guessing the user's real intentions.)
|
||||
.PP
|
||||
Henry Spencer's original 1986 \fIregexp\fR package, still in
|
||||
widespread use (e.g., in pre-8.1 releases of Tcl), implemented an
|
||||
early version of today's EREs. There are four incompatibilities
|
||||
between \fIregexp\fR's near-EREs
|
||||
.PQ RREs " for short"
|
||||
and AREs. In roughly increasing order of significance:
|
||||
.IP \(bu 3
|
||||
In AREs, \fB\e\fR followed by an alphanumeric character is either an
|
||||
escape or an error, while in RREs, it was just another way of writing
|
||||
the alphanumeric. This should not be a problem because there was no
|
||||
reason to write such a sequence in RREs.
|
||||
.IP \(bu 3
|
||||
\fB{\fR followed by a digit in an ARE is the beginning of a bound,
|
||||
while in RREs, \fB{\fR was always an ordinary character. Such
|
||||
sequences should be rare, and will often result in an error because
|
||||
following characters will not look like a valid bound.
|
||||
.IP \(bu 3
|
||||
In AREs, \fB\e\fR remains a special character within
|
||||
.QW \fB[\|]\fR ,
|
||||
so a literal \fB\e\fR within \fB[\|]\fR must be written
|
||||
.QW \fB\e\e\fR .
|
||||
\fB\e\e\fR also gives a literal \fB\e\fR within \fB[\|]\fR in RREs,
|
||||
but only truly paranoid programmers routinely doubled the backslash.
|
||||
.IP \(bu 3
|
||||
AREs report the longest/shortest match for the RE, rather than the
|
||||
first found in a specified search order. This may affect some RREs
|
||||
which were written in the expectation that the first match would be
|
||||
reported. (The careful crafting of RREs to optimize the search order
|
||||
for fast matching is obsolete (AREs examine all possible matches in
|
||||
parallel, and their performance is largely insensitive to their
|
||||
complexity) but cases where the search order was exploited to
|
||||
deliberately find a match which was \fInot\fR the longest/shortest
|
||||
will need rewriting.)
|
||||
.SH "BASIC REGULAR EXPRESSIONS"
|
||||
BREs differ from EREs in several respects.
|
||||
.QW \fB|\fR ,
|
||||
.QW \fB+\fR ,
|
||||
and \fB?\fR are ordinary characters and there is no equivalent for their
|
||||
functionality. The delimiters for bounds are \fB\e{\fR and
|
||||
.QW \fB\e}\fR ,
|
||||
with \fB{\fR and \fB}\fR by themselves ordinary characters. The
|
||||
parentheses for nested subexpressions are \fB\e(\fR and
|
||||
.QW \fB\e)\fR ,
|
||||
with \fB(\fR and \fB)\fR by themselves ordinary
|
||||
characters. \fB^\fR is an ordinary character except at the beginning
|
||||
of the RE or the beginning of a parenthesized subexpression, \fB$\fR
|
||||
is an ordinary character except at the end of the RE or the end of a
|
||||
parenthesized subexpression, and \fB*\fR is an ordinary character if
|
||||
it appears at the beginning of the RE or the beginning of a
|
||||
parenthesized subexpression (after a possible leading
|
||||
.QW \fB^\fR ).
|
||||
Finally, single-digit back references are available, and \fB\e<\fR and
|
||||
\fB\e>\fR are synonyms for
|
||||
.QW \fB[[:<:]]\fR
|
||||
and
|
||||
.QW \fB[[:>:]]\fR
|
||||
respectively; no other escapes are available.
|
||||
.SH "SEE ALSO"
|
||||
RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n)
|
||||
.SH KEYWORDS
|
||||
match, regular expression, string
|
||||
.\" Local Variables:
|
||||
.\" mode: nroff
|
||||
.\" End:
|
||||
Reference in New Issue
Block a user