Pcre exec

Matching a pattern
The function pcre_exec is called to match a subject string against a compiled pattern, which is passed in the code argument. If the pattern was studied, the result of the study should be passed in the extra argument. This function is the main matching facility of the library, and it operates in a Perl-like manner. For specialist use there is also an alternative matching function, which is described below in the section about the pcre_dfa_exec function.

In most applications, the pattern will have been compiled (and optionally studied) in the same process that calls pcre_exec. However, it is possible to save compiled patterns and study data, and then use them later in different processes, possibly even on different hosts. For a discussion about this, see the pcreprecompile documentation.

Here is an example of a simple call to pcre_exec:

How pcre_exec returns captured substrings
In general, a pattern matches a certain portion of the subject, and in addition, further substrings from the subject may be picked out by parts of the pattern. Following the usage in Jeffrey Friedl's book, this is called "capturing" in what follows, and the phrase "capturing subpattern" is used for a fragment of a pattern that picks out a substring. PCRE supports several other kinds of parenthesized subpattern that do not cause substrings to be captured.

Captured substrings are returned to the caller via a vector of integers whose address is passed in ovector. The number of elements in the vector is passed in ovecsize, which must be a non-negative number. Note: this argument is NOT the size of ovector in bytes.

The first two-thirds of the vector is used to pass back captured substrings, each substring using a pair of integers. The remaining third is used as workspace by pcre_exec while matching, and is not available for passing back information. The number passed in ovecsize should always be a multiple of three. If it is not, it is rounded down.

When a match is successful, information about captured substrings is returned in pairs of integers, starting at the beginning of ovector, and continuing up to two-thirds of its length at the most. The first element of each pair is set to the byte offset of the first character in a substring, and the second is set to the byte offset of the first character after the end of a substring. Note: these values are always byte offsets, even in UTF-8 mode. They are not character counts.

The first pair of integers, ovector[0] and ovector[1], identify the portion of the subject string matched by the entire pattern. The next pair is used for the first capturing subpattern, and so on. The value returned by pcre_exec is one more than the highest numbered pair that has been set. For example, if two substrings have been captured, the returned value is 3. If there are no capturing subpatterns, the return value from a successful match is 1, indicating that just the first pair of offsets has been set.
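The offset arithmetic can be sketched in plain C. The helper below is hypothetical and only illustrates how a pair of offsets maps to a substring; the library's own convenience functions, mentioned later in this section, are the supported way to do this:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy capture number `n` (0 = whole match) into `buf` as a C string.
   `rc` is the value pcre_exec() returned and `ovector` the vector it
   filled. Returns the substring length in bytes, or -1 if the capture is
   not available or does not fit. Hypothetical helper, not part of PCRE. */
int copy_capture(const char *subject, const int *ovector, int rc,
                 int n, char *buf, size_t bufsize)
{
    int start, end, len;

    assert(subject != NULL && ovector != NULL && buf != NULL);

    if (n < 0 || n >= rc)          /* pair n was not set by this match */
        return -1;

    start = ovector[2 * n];        /* byte offset of the first character */
    end   = ovector[2 * n + 1];    /* byte offset just past the last one */
    if (start < 0 || end < start)  /* unset pairs hold -1 */
        return -1;

    len = end - start;
    if ((size_t)len + 1 > bufsize) /* need room for the terminating zero */
        return -1;

    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

Because the offsets are byte offsets, the same arithmetic applies unchanged in UTF-8 mode.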

If a capturing subpattern is matched repeatedly, it is the last portion of the string that it matched that is returned.

If the vector is too small to hold all the captured substring offsets, it is used as far as possible (up to two-thirds of its length), and the function returns a value of zero. If the substring offsets are not of interest, pcre_exec may be called with ovector passed as NULL and ovecsize as zero. However, if the pattern contains back references and the ovector is not big enough to remember the related substrings, PCRE has to get additional memory for use during matching. Thus it is usually advisable to supply an ovector.
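As a sketch of how a caller might turn the return value into "how many pairs can I read", assuming only the rules stated above (the helper name pairs_filled is hypothetical):

```c
#include <assert.h>

/* Number of offset pairs in ovector that hold data after a successful
   match, given the pcre_exec() return value `rc` and the `ovecsize` that
   was passed in. Hypothetical helper for illustration. */
int pairs_filled(int rc, int ovecsize)
{
    assert(ovecsize >= 0);
    if (rc > 0)
        return rc;            /* one more than the highest pair that was set */
    if (rc == 0)
        return ovecsize / 3;  /* vector too small: two-thirds of it was used */
    return 0;                 /* negative: no complete match; error codes,
                                 including the partial-match code, are
                                 handled separately */
}
```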

The pcre_fullinfo function can be used to find out how many capturing subpatterns there are in a compiled pattern. The smallest size for ovector that will allow for n captured substrings, in addition to the offsets of the substring matched by the whole pattern, is (n+1)*3.
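The sizing rule is simple enough to express directly. In the sketch below the capture count is a plain argument; in a real program it would come from pcre_fullinfo (in PCRE 8.x, via the PCRE_INFO_CAPTURECOUNT request). Both helper names are hypothetical:

```c
#include <assert.h>

/* Smallest ovecsize that holds the whole-match pair plus `capture_count`
   captured substrings, including the workspace third. */
int ovector_size_for(int capture_count)
{
    assert(capture_count >= 0);
    return (capture_count + 1) * 3;
}

/* The size pcre_exec effectively uses: ovecsize rounded down to a
   multiple of three. */
int effective_ovecsize(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize - (ovecsize % 3);
}
```

For a pattern with nine capturing subpatterns this gives 30 elements, of which the first 20 can carry offsets.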

It is possible for capturing subpattern number n+1 to match some part of the subject when subpattern n has not been used at all. For example, if the string "abc" is matched against the pattern (a|(z))(bc) the return from the function is 4, and subpatterns 1 and 3 are matched, but 2 is not. When this happens, both values in the offset pairs corresponding to unused subpatterns are set to -1.

Offset values that correspond to unused subpatterns at the end of the expression are also set to -1. For example, if the string "abc" is matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not matched. The return from the function is 2, because the highest used capturing subpattern number is 1, and the offsets for the second and third capturing subpatterns (assuming the vector is large enough, of course) are set to -1.
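Both cases can be covered by one check. The following sketch (the helper name capture_is_set is hypothetical) treats a capture as unset either when its number is not below the return value or when its pair holds -1:

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if capture `n` took part in the match, 0 otherwise.
   A capture is unset either because n >= rc (an unused subpattern at the
   end of the expression) or because its pair holds -1 (an unused
   subpattern that precedes a used one). */
int capture_is_set(const int *ovector, int rc, int n)
{
    assert(ovector != NULL);
    if (n < 0 || n >= rc)
        return 0;
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}
```

With the offsets described above for (a|(z))(bc) against "abc", captures 1 and 3 report as set and capture 2 does not.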

Note: Elements of ovector that do not correspond to capturing parentheses in the pattern are never changed. That is, if a pattern contains n capturing parentheses, no more than ovector[0] to ovector[2n+1] are set by pcre_exec. The other elements retain whatever values they previously had.

Some convenience functions are provided for extracting the captured substrings as separate strings. These are described below.

Extra data for pcre_exec
If the extra argument is not NULL, it must point to a pcre_extra data block. The pcre_study function returns such a block (when it doesn't return NULL), but you can also create one for yourself, and pass additional information in it. The pcre_extra block contains the following fields (not necessarily in this order): flags, study_data, match_limit, match_limit_recursion, callout_data, tables, and mark.

The flags field is a bitmap that specifies which of the other fields are set. The flag bits are:

PCRE_EXTRA_STUDY_DATA
PCRE_EXTRA_MATCH_LIMIT
PCRE_EXTRA_MATCH_LIMIT_RECURSION
PCRE_EXTRA_CALLOUT_DATA
PCRE_EXTRA_TABLES
PCRE_EXTRA_MARK

Other flag bits should be set to zero. The study_data field is set in the pcre_extra block that is returned by pcre_study, together with the appropriate flag bit. You should not set this yourself, but you may add to the block by setting the other fields and their corresponding flag bits.

The match_limit field provides a means of preventing PCRE from using up a vast amount of resources when running patterns that are not going to match, but which have a very large number of possibilities in their search trees. The classic example is a pattern that uses nested unlimited repeats.

Internally, PCRE uses a function called match which it calls repeatedly (sometimes recursively). The limit set by match_limit is imposed on the number of times this function is called during a match, which has the effect of limiting the amount of backtracking that can take place. For patterns that are not anchored, the count restarts from zero for each position in the subject string.

The default value for the limit can be set when PCRE is built; the built-in default is 10 million, which handles all but the most extreme cases. You can override the default by supplying pcre_exec with a pcre_extra block in which match_limit is set, and PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is exceeded, pcre_exec returns PCRE_ERROR_MATCHLIMIT.

The match_limit_recursion field is similar to match_limit, but instead of limiting the total number of times that match is called, it limits the depth of recursion. The recursion depth is a smaller number than the total number of calls, because not all calls to match are recursive. This limit is of use only if it is set smaller than match_limit.

Limiting the recursion depth limits the amount of stack that can be used, or, when PCRE has been compiled to use memory on the heap instead of the stack, the amount of heap memory that can be used.

The default value for match_limit_recursion can be set when PCRE is built; the built-in default is the same value as the default for match_limit. You can override the default by supplying pcre_exec with a pcre_extra block in which match_limit_recursion is set, and PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the limit is exceeded, pcre_exec returns PCRE_ERROR_RECURSIONLIMIT.

The callout_data field is used in conjunction with the "callout" feature, and is described in the pcrecallout documentation.

The tables field is used to pass a character tables pointer to pcre_exec; this overrides the value that is stored with the compiled pattern. A non-NULL value is stored with the compiled pattern only if custom tables were supplied to pcre_compile via its tableptr argument. If NULL is passed to pcre_exec using this mechanism, it forces PCRE's internal tables to be used. This facility is helpful when re-using patterns that have been saved after compiling with an external set of tables, because the external tables might be at a different address when pcre_exec is called. See the pcreprecompile documentation for a discussion of saving compiled patterns for later use.

If PCRE_EXTRA_MARK is set in the flags field, the mark field must be set to point to a char * variable. If the pattern contains any backtracking control verbs such as (*MARK:NAME), and the execution ends up with a name to pass back, a pointer to the name string (zero terminated) is placed in the variable pointed to by the mark field. The names are within the compiled pattern; if you wish to retain such a name you must copy it before freeing the memory of a compiled pattern. If there is no name to pass back, the variable pointed to by the mark field is set to NULL. For details of the backtracking control verbs, see the section entitled "Backtracking control" in the pcrepattern documentation.

Option bits for pcre_exec
The unused bits of the options argument for pcre_exec must be zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_SOFT, and PCRE_PARTIAL_HARD.

PCRE_ANCHORED

The PCRE_ANCHORED option limits pcre_exec to matching at the first matching position. If a pattern was compiled with PCRE_ANCHORED, or turned out to be anchored by virtue of its contents, it cannot be made unanchored at matching time.

PCRE_BSR_ANYCRLF
PCRE_BSR_UNICODE

These options (which are mutually exclusive) control what the \R escape sequence matches. The choice is either to match only CR, LF, or CRLF, or to match any Unicode newline sequence. These options override the choice that was made or defaulted when the pattern was compiled.

PCRE_NEWLINE_CR
PCRE_NEWLINE_LF
PCRE_NEWLINE_CRLF
PCRE_NEWLINE_ANYCRLF
PCRE_NEWLINE_ANY

These options override the newline definition that was chosen or defaulted when the pattern was compiled. For details, see the description of pcre_compile above. During matching, the newline choice affects the behaviour of the dot, circumflex, and dollar metacharacters. It may also alter the way the match position is advanced after a match failure for an unanchored pattern.

When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is set, and a match attempt for an unanchored pattern fails when the current position is at a CRLF sequence, and the pattern contains no explicit matches for CR or LF characters, the match position is advanced by two characters instead of one, in other words, to after the CRLF.

The above rule is a compromise that makes the most common cases work as expected. For example, if the pattern is .+A (and the PCRE_DOTALL option is not set), it does not match the string "\r\nA" because, after failing at the start, it skips both the CR and the LF before retrying. However, the pattern [\r\n]A does match that string, because it contains an explicit CR or LF reference, and so advances only by one character after the first failure.

An explicit match for CR or LF is either a literal appearance of one of those characters, or one of the \r or \n escape sequences. Implicit matches such as [^X] do not count, nor does \s (which includes CR and LF in the characters that it matches).

Notwithstanding the above, anomalous effects may still occur when CRLF is a valid newline sequence and explicit \r or \n escapes appear in the pattern.

PCRE_NOTBOL

This option specifies that the first character of the subject string is not the beginning of a line, so the circumflex metacharacter should not match before it. Setting this without PCRE_MULTILINE (at compile time) causes circumflex never to match. This option affects only the behaviour of the circumflex metacharacter. It does not affect \A.

PCRE_NOTEOL

This option specifies that the end of the subject string is not the end of a line, so the dollar metacharacter should not match it nor (except in multiline mode) a newline immediately before it. Setting this without PCRE_MULTILINE (at compile time) causes dollar never to match. This option affects only the behaviour of the dollar metacharacter. It does not affect \Z or \z.

PCRE_NOTEMPTY

An empty string is not considered to be a valid match if this option is set. If there are alternatives in the pattern, they are tried. If all the alternatives match the empty string, the entire match fails. For example, if the pattern

a?b?

is applied to a string not beginning with "a" or "b", it matches an empty string at the start of the subject. With PCRE_NOTEMPTY set, this match is not valid, so PCRE searches further into the string for occurrences of "a" or "b".

PCRE_NOTEMPTY_ATSTART

This is like PCRE_NOTEMPTY, except that an empty string match that is not at the start of the subject is permitted. If the pattern is anchored, such a match can occur only if the pattern contains \K.

Perl has no direct equivalent of PCRE_NOTEMPTY or PCRE_NOTEMPTY_ATSTART, but it does make a special case of a pattern match of the empty string within its split function, and when using the /g modifier. It is possible to emulate Perl's behaviour after matching a null string by first trying the match again at the same offset with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then if that fails, by advancing the starting offset (see below) and trying an ordinary match again. There is some code that demonstrates how to do this in the pcredemo sample program. In the most general case, you have to check to see if the newline convention recognizes CRLF as a newline, and if so, and the current character is CR followed by LF, advance the starting offset by two characters instead of one.

PCRE_NO_START_OPTIMIZE

There are a number of optimizations that pcre_exec uses at the start of a match, in order to speed up the process. For example, if it is known that an unanchored match must start with a specific character, it searches the subject for that character, and fails immediately if it cannot find it, without actually running the main matching function. This means that a special item such as (*COMMIT) at the start of a pattern is not considered until after a suitable starting point for the match has been found. When callouts or (*MARK) items are in use, these "start-up" optimizations can cause them to be skipped if the pattern is never actually used. The start-up optimizations are in effect a pre-scan of the subject that takes place before the pattern is run.

The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations, possibly causing performance to suffer, but ensuring that in cases where the result is "no match", the callouts do occur, and that items such as (*COMMIT) and (*MARK) are considered at every possible starting position in the subject string. If PCRE_NO_START_OPTIMIZE is set at compile time, it cannot be unset at matching time.

Setting PCRE_NO_START_OPTIMIZE can change the outcome of a matching operation. Consider the pattern

(*COMMIT)ABC

When this is compiled, PCRE records the fact that a match must start with the character "A". Suppose the subject string is "DEFABC". The start-up optimization scans along the subject, finds "A" and runs the first match attempt from there. The (*COMMIT) item means that the pattern must match the current starting position, which in this case, it does. However, if the same match is run with PCRE_NO_START_OPTIMIZE set, the initial scan along the subject string does not happen. The first match attempt is run starting from "D" and when this fails, (*COMMIT) prevents any further matches being tried, so the overall result is "no match". If the pattern is studied, more start-up optimizations may be used. For example, a minimum length for the subject may be recorded. Consider the pattern

(*MARK:A)(X|Y)

The minimum length for a match is one character. If the subject is "ABC", there will be attempts to match "ABC", "BC", "C", and then finally an empty string. If the pattern is studied, the final attempt does not take place, because PCRE knows that the subject is too short, and so the (*MARK) is never encountered. In this case, studying the pattern does not affect the overall match result, which is still "no match", but it does affect the auxiliary information that is returned.

PCRE_NO_UTF8_CHECK

When PCRE_UTF8 is set at compile time, the validity of the subject as a UTF-8 string is automatically checked when pcre_exec is subsequently called. The value of startoffset is also checked to ensure that it points to the start of a UTF-8 character. There is a discussion about the validity of UTF-8 strings in the section on UTF-8 support in the main pcre page. If an invalid UTF-8 sequence of bytes is found, pcre_exec returns the error PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 character at the end of the subject, PCRE_ERROR_SHORTUTF8. If startoffset contains a value that does not point to the start of a UTF-8 character (or to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.

If you already know that your subject is valid, and you want to skip these checks for performance reasons, you can set the PCRE_NO_UTF8_CHECK option when calling pcre_exec. You might want to do this for the second and subsequent calls to pcre_exec if you are making repeated calls to find all the matches in a single subject string. However, you should be sure that the value of startoffset points to the start of a UTF-8 character (or the end of the subject). When PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid UTF-8 string as a subject or an invalid value of startoffset is undefined. Your program may crash.
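A caller that sets PCRE_NO_UTF8_CHECK could guard startoffset with a cheap check of its own. The sketch below (helper name hypothetical) relies only on the UTF-8 property that continuation bytes have the form 10xxxxxx; it does not validate the subject as a whole:

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if `offset` is a usable startoffset for a UTF-8 subject of
   `length` bytes: either exactly at the end of the subject, or inside it
   and not on a continuation byte (10xxxxxx). This does not check that
   the subject itself is valid UTF-8. */
int is_utf8_char_start(const char *subject, int length, int offset)
{
    assert(subject != NULL && length >= 0);
    if (offset < 0 || offset > length)
        return 0;
    if (offset == length)
        return 1;
    return ((unsigned char)subject[offset] & 0xC0) != 0x80;
}
```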

PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT

These options turn on the partial matching feature. For backwards compatibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial match occurs if the end of the subject string is reached successfully, but there are not enough subject characters to complete the match. If this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set, matching continues by testing any remaining alternatives. Only if no complete match can be found is PCRE_ERROR_PARTIAL returned instead of PCRE_ERROR_NOMATCH. In other words, PCRE_PARTIAL_SOFT says that the caller is prepared to handle a partial match, but only if no complete match can be found.

If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this case, if a partial match is found, pcre_exec immediately returns PCRE_ERROR_PARTIAL, without considering any other alternatives. In other words, when PCRE_PARTIAL_HARD is set, a partial match is considered to be more important than an alternative complete match.

In both cases, the portion of the string that was inspected when the partial match was found is set as the first matching string. There is a more detailed discussion of partial and multi-segment matching, with examples, in the pcrepartial documentation.

The string to be matched by pcre_exec

The subject string is passed to pcre_exec as a pointer in subject, a length (in bytes) in length, and a starting byte offset in startoffset. If this is negative or greater than the length of the subject, pcre_exec returns PCRE_ERROR_BADOFFSET. When the starting offset is zero, the search for a match starts at the beginning of the subject, and this is by far the most common case. In UTF-8 mode, the byte offset must point to the start of a UTF-8 character (or the end of the subject). Unlike the pattern string, the subject may contain binary zero bytes.

A non-zero starting offset is useful when searching for another match in the same subject by calling pcre_exec again after a previous success. Setting startoffset differs from just passing over a shortened string and setting PCRE_NOTBOL in the case of a pattern that begins with any kind of lookbehind. For example, consider the pattern

\Biss\B

which finds occurrences of "iss" in the middle of words. (\B matches only if the current position in the subject is not a word boundary.) When applied to the string "Mississippi" the first call to pcre_exec finds the first occurrence. If pcre_exec is called again with just the remainder of the subject, namely "issippi", it does not match, because \B is always false at the start of the subject, which is deemed to be a word boundary. However, if pcre_exec is passed the entire string again, but with startoffset set to 4, it finds the second occurrence of "iss" because it is able to look behind the starting point to discover that it is preceded by a letter.
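The difference can be illustrated with a simplified model of the word-boundary test. This is not PCRE's implementation: it assumes ASCII word characters only, and both function names are hypothetical. It shows why the character before the starting point matters:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Simplified word-character test: ASCII letters, digits and underscore.
   Positions outside the subject count as non-word characters. */
static int is_word_char(const char *s, int length, int pos)
{
    if (pos < 0 || pos >= length)
        return 0;
    return isalnum((unsigned char)s[pos]) || s[pos] == '_';
}

/* Returns 1 if position `pos` is a word boundary, so that \B would fail
   there, and 0 otherwise. */
int is_word_boundary(const char *subject, int length, int pos)
{
    assert(subject != NULL && pos >= 0 && pos <= length);
    return is_word_char(subject, length, pos - 1)
        != is_word_char(subject, length, pos);
}
```

At offset 4 of the full subject the preceding byte is a letter, so the position is not a boundary and \B can succeed. At offset 0 of the shortened subject there is nothing to look behind at, so the position counts as a boundary.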

Finding all the matches in a subject is tricky when the pattern can match an empty string. It is possible to emulate Perl's /g behaviour by first trying the match again at the same offset, with the PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and then if that fails, advancing the starting offset and trying an ordinary match again. There is some code that demonstrates how to do this in the pcredemo sample program. In the most general case, you have to check to see if the newline convention recognizes CRLF as a newline, and if so, and the current character is CR followed by LF, advance the starting offset by two characters instead of one.
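The "advance the starting offset" step can be sketched as a small helper. The name next_start_offset and its two flag parameters are hypothetical; how the caller learns the newline convention and UTF-8 mode of the compiled pattern is left out. Skipping UTF-8 continuation bytes is an assumption added here so that the new offset lands on a character start:

```c
#include <assert.h>
#include <stddef.h>

/* After an empty match at `offset` could not be retried as a non-empty
   anchored match, compute where the next ordinary match should start.
   Returns -1 when the end of the subject has been reached.
   `crlf_is_newline` and `utf8` describe the compiled pattern; obtaining
   them is left to the caller. Hypothetical helper for illustration. */
int next_start_offset(const char *subject, int length, int offset,
                      int crlf_is_newline, int utf8)
{
    assert(subject != NULL && offset >= 0 && offset <= length);

    if (offset >= length)
        return -1;                       /* nothing left to scan */

    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;               /* step over the whole CRLF */

    offset++;                            /* step over one byte ... */
    if (utf8) {
        /* ... and any continuation bytes of the same UTF-8 character */
        while (offset < length &&
               ((unsigned char)subject[offset] & 0xC0) == 0x80)
            offset++;
    }
    return offset;
}
```

In a /g-style loop, the returned value would be passed as startoffset to the next ordinary pcre_exec call, and -1 would end the loop.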

If a non-zero starting offset is passed when the pattern is anchored, one attempt to match at the given offset is made. This can only succeed if the pattern does not require the match to be at the start of the subject.