Matching a pattern
int pcre_exec(const pcre *code, const pcre_extra *extra, const char *subject, int length, int startoffset, int options, int *ovector, int ovecsize);
The function pcre_exec() is called to match a subject string against a compiled pattern, which is passed in the code argument. If the pattern was studied, the result of the study should be passed in the extra argument. This function is the main matching facility of the library, and it operates in a Perl-like manner. For specialist use there is also an alternative matching function, which is described below in the section about the pcre_dfa_exec() function.
In most applications, the pattern will have been compiled (and optionally studied) in the same process that calls pcre_exec(). However, it is possible to save compiled patterns and study data, and then use them later in different processes, possibly even on different hosts. For a discussion about this, see the pcreprecompile documentation.
Here is an example of a simple call to pcre_exec():
    int rc;
    int ovector[30];
    rc = pcre_exec(
      re,             /* result of pcre_compile() */
      NULL,           /* we didn't study the pattern */
      "some string",  /* the subject string */
      11,             /* the length of the subject string */
      0,              /* start at offset 0 in the subject */
      0,              /* default options */
      ovector,        /* vector of integers for substring information */
      30);            /* number of elements (NOT size in bytes) */
How pcre_exec() returns captured substrings
In general, a pattern matches a certain portion of the subject, and in addition, further substrings from the subject may be picked out by parts of the pattern. Following the usage in Jeffrey Friedl's book, this is called "capturing" in what follows, and the phrase "capturing subpattern" is used for a fragment of a pattern that picks out a substring. PCRE supports several other kinds of parenthesized subpattern that do not cause substrings to be captured.
Captured substrings are returned to the caller via a vector of integers whose address is passed in ovector. The number of elements in the vector is passed in ovecsize, which must be a non-negative number. Note: this argument is NOT the size of ovector in bytes.
The first two-thirds of the vector is used to pass back captured substrings, each substring using a pair of integers. The remaining third is used as workspace by pcre_exec() during matching and is not available for passing back information. The number passed in ovecsize should always be a multiple of three. If it is not, it is rounded down.
When a match is successful, information about captured substrings is returned in pairs of integers, starting at the beginning of ovector, and continuing up to two-thirds of its length at the most. The first element of each pair is set to the byte offset of the first character in a substring, and the second is set to the byte offset of the first character after the end of a substring. Note: these values are always byte offsets, even in UTF-8 mode. They are not character counts.
The first pair of integers, ovector[0] and ovector[1], identify the portion of the subject string matched by the entire pattern. The next pair is used for the first capturing subpattern, and so on. The value returned by pcre_exec() is one more than the highest numbered pair that has been set. For example, if two substrings have been captured, the returned value is 3. If there are no capturing subpatterns, the return value from a successful match is 1, indicating that just the first pair of offsets has been set.
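The offset arithmetic described above can be sketched as a small helper. This is only an illustration: copy_group() is a hypothetical name, not a PCRE function, and the library's own convenience functions for extracting substrings should normally be preferred. The sketch assumes rc is the value returned by a successful pcre_exec() call and treats any group numbered rc or higher as unavailable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of PCRE): copy capture group `group`
   out of `subject` into `buf`, using the offset pairs that pcre_exec()
   stored in `ovector`. `rc` is the value pcre_exec() returned.
   Returns the substring length in bytes, or -1 if the group is not
   available or does not fit in the buffer. */
int copy_group(const char *subject, const int *ovector, int rc,
               int group, char *buf, size_t bufsize)
{
    assert(subject != NULL && ovector != NULL && buf != NULL);

    if (group < 0 || group >= rc)
        return -1;                       /* pair was not set by this match */

    int start = ovector[2 * group];      /* byte offset of first character */
    int end   = ovector[2 * group + 1];  /* byte offset just past the end  */
    if (start < 0 || end < start)
        return -1;                       /* unset group (offsets are -1)   */

    size_t len = (size_t)(end - start);
    if (len + 1 > bufsize)
        return -1;                       /* no room for text plus '\0'     */

    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (int)len;
}
```

Group 0 is the whole match, so copy_group(subject, ovector, rc, 0, buf, sizeof buf) copies the text matched by the entire pattern.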
If a capturing subpattern is matched repeatedly, it is the last portion of the string that it matched that is returned.
If the vector is too small to hold all the captured substring offsets, it is used as far as possible (up to two-thirds of its length), and the function returns a value of zero. If the substring offsets are not of interest, pcre_exec() may be called with ovector passed as NULL and ovecsize as zero. However, if the pattern contains back references and the ovector is not big enough to remember the related substrings, PCRE has to get additional memory for use during matching. Thus it is usually advisable to supply an ovector.
The pcre_fullinfo() function can be used to find out how many capturing subpatterns there are in a compiled pattern. The smallest size for ovector that will allow for n captured substrings, in addition to the offsets of the substring matched by the whole pattern, is (n+1)*3.
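The sizing rule can be written down directly. The two function names below are hypothetical and exist only for illustration; the capture count would typically come from pcre_fullinfo().

```c
#include <assert.h>

/* Smallest ovecsize that holds the whole-match pair plus
   `capture_count` captured substrings: (n + 1) * 3. */
int ovector_size_for(int capture_count)
{
    assert(capture_count >= 0);
    return (capture_count + 1) * 3;
}

/* Number of offset pairs pcre_exec() can pass back for a given
   ovecsize. The size is rounded down to a multiple of three, and only
   the first two-thirds of the vector hold pairs (two ints per pair),
   which works out to ovecsize / 3 pairs. */
int usable_pairs(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

With the 30-element vector used in the earlier example call, usable_pairs(30) is 10: the whole match plus up to nine captured substrings.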
It is possible for capturing subpattern number n+1 to match some part of the subject when subpattern n has not been used at all. For example, if the string "abc" is matched against the pattern (a|(z))(bc) the return from the function is 4, and subpatterns 1 and 3 are matched, but 2 is not. When this happens, both values in the offset pairs corresponding to unused subpatterns are set to -1.
Offset values that correspond to unused subpatterns at the end of the expression are also set to -1. For example, if the string "abc" is matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not matched. The return from the function is 2, because the highest used capturing subpattern number is 1, and the offsets for the second and third capturing subpatterns (assuming the vector is large enough, of course) are set to -1.
Note: Elements of ovector that do not correspond to capturing parentheses in the pattern are never changed. That is, if a pattern contains n capturing parentheses, no more than ovector[0] to ovector[2n+1] are set by pcre_exec(). The other elements retain whatever values they previously had.
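Because a group numbered below the return value can still be unset, a caller has to check both conditions. A minimal sketch, with group_is_set() as a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if capture group `group` took part in
   the match, 0 otherwise. `rc` is the value returned by pcre_exec(). */
int group_is_set(const int *ovector, int rc, int group)
{
    assert(ovector != NULL);

    if (group < 0 || group >= rc)
        return 0;  /* beyond the highest-numbered pair that was set */

    /* A group below rc may still be unused; both offsets are then -1. */
    return ovector[2 * group] >= 0 && ovector[2 * group + 1] >= 0;
}
```

For the (a|(z))(bc) example above, where the return value is 4, this reports groups 1 and 3 as set and group 2 as unset.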
Some convenience functions are provided for extracting the captured substrings as separate strings. These are described below.
Extra data for pcre_exec()
If the extra argument is not NULL, it must point to a pcre_extra data block. The pcre_study() function returns such a block (when it doesn't return NULL), but you can also create one for yourself, and pass additional information in it. The pcre_extra block contains the following fields (not necessarily in this order):
    unsigned long int flags;
    void *study_data;
    unsigned long int match_limit;
    unsigned long int match_limit_recursion;
    void *callout_data;
    const unsigned char *tables;
    unsigned char **mark;
The flags field is a bitmap that specifies which of the other fields are set. The flag bits are:
    PCRE_EXTRA_STUDY_DATA
    PCRE_EXTRA_MATCH_LIMIT
    PCRE_EXTRA_MATCH_LIMIT_RECURSION
    PCRE_EXTRA_CALLOUT_DATA
    PCRE_EXTRA_TABLES
    PCRE_EXTRA_MARK
Other flag bits should be set to zero. The study_data field is set in the pcre_extra block that is returned by pcre_study(), together with the appropriate flag bit. You should not set this yourself, but you may add to the block by setting the other fields and their corresponding flag bits.
The match_limit field provides a means of preventing PCRE from using up a vast amount of resources when running patterns that are not going to match, but which have a very large number of possibilities in their search trees. The classic example is a pattern that uses nested unlimited repeats.
Internally, PCRE uses a function called match() which it calls repeatedly (sometimes recursively). The limit set by match_limit is imposed on the number of times this function is called during a match, which has the effect of limiting the amount of backtracking that can take place. For patterns that are not anchored, the count restarts from zero for each position in the subject string.
The default value for the limit can be set when PCRE is built; the default default is 10 million, which handles all but the most extreme cases. You can override the default by supplying pcre_exec() with a pcre_extra block in which match_limit is set, and PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
The match_limit_recursion field is similar to match_limit, but instead of limiting the total number of times that match() is called, it limits the depth of recursion. The recursion depth is a smaller number than the total number of calls, because not all calls to match() are recursive. This limit is of use only if it is set smaller than match_limit.
Limiting the recursion depth limits the amount of stack that can be used, or, when PCRE has been compiled to use memory on the heap instead of the stack, the amount of heap memory that can be used.
The default value for match_limit_recursion can be set when PCRE is built; the default default is the same value as the default for match_limit. You can override the default by supplying pcre_exec() with a pcre_extra block in which match_limit_recursion is set, and PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
The callout_data field is used in conjunction with the "callout" feature, and is described in the pcrecallout documentation.
The tables field is used to pass a character tables pointer to pcre_exec(); this overrides the value that is stored with the compiled pattern. A non-NULL value is stored with the compiled pattern only if custom tables were supplied to pcre_compile() via its tableptr argument. If NULL is passed to pcre_exec() using this mechanism, it forces PCRE's internal tables to be used. This facility is helpful when re-using patterns that have been saved after compiling with an external set of tables, because the external tables might be at a different address when pcre_exec() is called. See the pcreprecompile documentation for a discussion of saving compiled patterns for later use.
If PCRE_EXTRA_MARK is set in the flags field, the mark field must be set to point to a char * variable. If the pattern contains any backtracking control verbs such as (*MARK:NAME), and the execution ends up with a name to pass back, a pointer to the name string (zero terminated) is placed in the variable pointed to by the mark field. The names are within the compiled pattern; if you wish to retain such a name you must copy it before freeing the memory of a compiled pattern. If there is no name to pass back, the variable pointed to by the mark field is set to NULL. For details of the backtracking control verbs, see the section entitled "Backtracking control" in the pcrepattern documentation.
Option bits for pcre_exec()
The unused bits of the options argument for pcre_exec() must be zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_SOFT, and PCRE_PARTIAL_HARD.
The PCRE_ANCHORED option limits pcre_exec() to matching at the first matching position. If a pattern was compiled with PCRE_ANCHORED, or turned out to be anchored by virtue of its contents, it cannot be made unanchored at matching time.
The PCRE_BSR_ANYCRLF and PCRE_BSR_UNICODE options (which are mutually exclusive) control what the \R escape sequence matches. The choice is either to match only CR, LF, or CRLF, or to match any Unicode newline sequence. These options override the choice that was made or defaulted when the pattern was compiled.
PCRE_NEWLINE_CR PCRE_NEWLINE_LF PCRE_NEWLINE_CRLF PCRE_NEWLINE_ANYCRLF PCRE_NEWLINE_ANY
These options override the newline definition that was chosen or defaulted when the pattern was compiled. For details, see the description of pcre_compile() above. During matching, the newline choice affects the behaviour of the dot, circumflex, and dollar metacharacters. It may also alter the way the match position is advanced after a match failure for an unanchored pattern.
When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is set, and a match attempt for an unanchored pattern fails when the current position is at a CRLF sequence, and the pattern contains no explicit matches for CR or LF characters, the match position is advanced by two characters instead of one, in other words, to after the CRLF.
The above rule is a compromise that makes the most common cases work as expected. For example, if the pattern is .+A (and the PCRE_DOTALL option is not set), it does not match the string "\r\nA" because, after failing at the start, it skips both the CR and the LF before retrying. However, the pattern [\r\n]A does match that string, because it contains an explicit CR or LF reference, and so advances only by one character after the first failure.
An explicit match for CR or LF is either a literal appearance of one of those characters, or one of the \r or \n escape sequences. Implicit matches such as [^X] do not count, nor does \s (which includes CR and LF in the characters that it matches).
Notwithstanding the above, anomalous effects may still occur when CRLF is a valid newline sequence and explicit \r or \n escapes appear in the pattern.
The PCRE_NOTBOL option specifies that the first character of the subject string is not the beginning of a line, so the circumflex metacharacter should not match before it. Setting this without PCRE_MULTILINE (at compile time) causes circumflex never to match. This option affects only the behaviour of the circumflex metacharacter. It does not affect \A.
The PCRE_NOTEOL option specifies that the end of the subject string is not the end of a line, so the dollar metacharacter should not match it nor (except in multiline mode) a newline immediately before it. Setting this without PCRE_MULTILINE (at compile time) causes dollar never to match. This option affects only the behaviour of the dollar metacharacter. It does not affect \Z or \z.
An empty string is not considered to be a valid match if the PCRE_NOTEMPTY option is set. If there are alternatives in the pattern, they are tried. If all the alternatives match the empty string, the entire match fails. For example, if the pattern
    a?b?
is applied to a string not beginning with "a" or "b", it matches an empty string at the start of the subject. With PCRE_NOTEMPTY set, this match is not valid, so PCRE searches further into the string for occurrences of "a" or "b".
PCRE_NOTEMPTY_ATSTART is like PCRE_NOTEMPTY, except that an empty string match that is not at the start of the subject is permitted. If the pattern is anchored, such a match can occur only if the pattern contains \K.
Perl has no direct equivalent of PCRE_NOTEMPTY or PCRE_NOTEMPTY_ATSTART, but it does make a special case of a pattern match of the empty string within its split() function, and when using the /g modifier. It is possible to emulate Perl's behaviour after matching a null string by first trying the match again at the same offset with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then if that fails, by advancing the starting offset (see below) and trying an ordinary match again. There is some code that demonstrates how to do this in the pcredemo sample program. In the most general case, you have to check to see if the newline convention recognizes CRLF as a newline, and if so, and the current character is CR followed by LF, advance the starting offset by two characters instead of one.
There are a number of optimizations that pcre_exec() uses at the start of a match, in order to speed up the process. For example, if it is known that an unanchored match must start with a specific character, it searches the subject for that character, and fails immediately if it cannot find it, without actually running the main matching function. This means that a special item such as (*COMMIT) at the start of a pattern is not considered until after a suitable starting point for the match has been found. When callouts or (*MARK) items are in use, these "start-up" optimizations can cause them to be skipped if the pattern is never actually used. The start-up optimizations are in effect a pre-scan of the subject that takes place before the pattern is run.
The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations, possibly causing performance to suffer, but ensuring that in cases where the result is "no match", the callouts do occur, and that items such as (*COMMIT) and (*MARK) are considered at every possible starting position in the subject string. If PCRE_NO_START_OPTIMIZE is set at compile time, it cannot be unset at matching time.
Setting PCRE_NO_START_OPTIMIZE can change the outcome of a matching operation. Consider the pattern
    (*COMMIT)ABC
When this is compiled, PCRE records the fact that a match must start with the character "A". Suppose the subject string is "DEFABC". The start-up optimization scans along the subject, finds "A" and runs the first match attempt from there. The (*COMMIT) item means that the pattern must match the current starting position, which in this case, it does. However, if the same match is run with PCRE_NO_START_OPTIMIZE set, the initial scan along the subject string does not happen. The first match attempt is run starting from "D" and when this fails, (*COMMIT) prevents any further matches being tried, so the overall result is "no match".
If the pattern is studied, more start-up optimizations may be used. For example, a minimum length for the subject may be recorded. Consider the pattern
    (*MARK:A)(X|Y)
The minimum length for a match is one character. If the subject is "ABC", there will be attempts to match "ABC", "BC", "C", and then finally an empty string. If the pattern is studied, the final attempt does not take place, because PCRE knows that the subject is too short, and so the (*MARK) is never encountered. In this case, studying the pattern does not affect the overall match result, which is still "no match", but it does affect the auxiliary information that is returned.
When PCRE_UTF8 is set at compile time, the validity of the subject as a UTF-8 string is automatically checked when pcre_exec() is subsequently called. The value of startoffset is also checked to ensure that it points to the start of a UTF-8 character. There is a discussion about the validity of UTF-8 strings in the section on UTF-8 support in the main pcre page. If an invalid UTF-8 sequence of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 character at the end of the subject, PCRE_ERROR_SHORTUTF8. If startoffset contains a value that does not point to the start of a UTF-8 character (or to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.
If you already know that your subject is valid, and you want to skip these checks for performance reasons, you can set the PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to do this for the second and subsequent calls to pcre_exec() if you are making repeated calls to find all the matches in a single subject string. However, you should be sure that the value of startoffset points to the start of a UTF-8 character (or the end of the subject). When PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid UTF-8 string as a subject or an invalid value of startoffset is undefined. Your program may crash.
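When the checks are skipped, the caller becomes responsible for the startoffset condition. One way to test it, sketched below, relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx. The function name is hypothetical, and the sketch assumes the subject itself is already known to be valid UTF-8; it does not validate the string.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `offset` is an acceptable
   startoffset for a valid UTF-8 subject of `length` bytes, that is,
   it points to the first byte of a character or to the end of the
   subject. Returns 0 otherwise. */
int is_utf8_char_start(const char *subject, int length, int offset)
{
    assert(subject != NULL && length >= 0);

    if (offset < 0 || offset > length)
        return 0;                 /* outside the subject               */
    if (offset == length)
        return 1;                 /* end of the subject is allowed     */

    /* Continuation bytes look like 10xxxxxx; anything else starts a
       character in valid UTF-8. */
    return ((unsigned char)subject[offset] & 0xC0) != 0x80;
}
```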
The PCRE_PARTIAL_HARD and PCRE_PARTIAL_SOFT options turn on the partial matching feature. For backwards compatibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial match occurs if the end of the subject string is reached successfully, but there are not enough subject characters to complete the match. If this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set, matching continues by testing any remaining alternatives. Only if no complete match can be found is PCRE_ERROR_PARTIAL returned instead of PCRE_ERROR_NOMATCH. In other words, PCRE_PARTIAL_SOFT says that the caller is prepared to handle a partial match, but only if no complete match can be found.
If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this case, if a partial match is found, pcre_exec() immediately returns PCRE_ERROR_PARTIAL, without considering any other alternatives. In other words, when PCRE_PARTIAL_HARD is set, a partial match is considered to be more important than an alternative complete match.
In both cases, the portion of the string that was inspected when the partial match was found is set as the first matching string. There is a more detailed discussion of partial and multi-segment matching, with examples, in the pcrepartial documentation.
The string to be matched by pcre_exec()
The subject string is passed to pcre_exec() as a pointer in subject, a length (in bytes) in length, and a starting byte offset in startoffset. If this is negative or greater than the length of the subject, pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting offset is zero, the search for a match starts at the beginning of the subject, and this is by far the most common case. In UTF-8 mode, the byte offset must point to the start of a UTF-8 character (or the end of the subject). Unlike the pattern string, the subject may contain binary zero bytes.
A non-zero starting offset is useful when searching for another match in the same subject by calling pcre_exec() again after a previous success. Setting startoffset differs from just passing over a shortened string and setting PCRE_NOTBOL in the case of a pattern that begins with any kind of lookbehind. For example, consider the pattern
    \Biss\B
which finds occurrences of "iss" in the middle of words. (\B matches only if the current position in the subject is not a word boundary.) When applied to the string "Mississipi" the first call to pcre_exec() finds the first occurrence. If pcre_exec() is called again with just the remainder of the subject, namely "issipi", it does not match, because \B is always false at the start of the subject, which is deemed to be a word boundary. However, if pcre_exec() is passed the entire string again, but with startoffset set to 4, it finds the second occurrence of "iss" because it is able to look behind the starting point to discover that it is preceded by a letter.
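The difference between the two calls comes down to what precedes the starting position. The sketch below is not PCRE's implementation. It is a simplified, ASCII-only approximation of a word-boundary test, with hypothetical function names, meant only to show why offset 0 of "issipi" and offset 4 of "Mississipi" behave differently.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* ASCII-only notion of a "word" character: letter, digit or underscore. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Hypothetical helper: returns 1 if position `pos` in the subject is a
   word boundary. Positions before the start and after the end of the
   subject count as non-word, so offset 0 of a subject that begins with
   a letter is always a boundary. */
int at_word_boundary(const char *subject, int length, int pos)
{
    assert(subject != NULL && pos >= 0 && pos <= length);

    int before = pos > 0      && is_word_char((unsigned char)subject[pos - 1]);
    int after  = pos < length && is_word_char((unsigned char)subject[pos]);
    return before != after;
}
```

Here at_word_boundary("issipi", 6, 0) is 1, so a leading \B cannot succeed there, while at_word_boundary("Mississipi", 10, 4) is 0 because the preceding "s" is still visible.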
Finding all the matches in a subject is tricky when the pattern can match an empty string. It is possible to emulate Perl's /g behaviour by first trying the match again at the same offset, with the PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and then if that fails, advancing the starting offset and trying an ordinary match again. There is some code that demonstrates how to do this in the pcredemo sample program. In the most general case, you have to check to see if the newline convention recognizes CRLF as a newline, and if so, and the current character is CR followed by LF, advance the starting offset by two characters instead of one.
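The "advance the starting offset" step might look like the sketch below. The function name and both flag parameters are hypothetical: the caller is assumed to have worked out already whether the newline convention treats CRLF as a newline and whether the pattern was compiled in UTF-8 mode. Skipping continuation bytes in UTF-8 mode is an addition beyond the text above, included so that the new offset always lands on the start of a character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given the offset at which an empty-string match
   could not be retried, return the offset for the next ordinary match
   attempt. Requires 0 <= offset < length. */
int next_start_offset(const char *subject, int length, int offset,
                      int crlf_is_newline, int utf8)
{
    assert(subject != NULL && offset >= 0 && offset < length);

    /* Step over a CRLF pair as a unit when CRLF counts as a newline. */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;

    int next = offset + 1;
    if (utf8) {
        /* Skip continuation bytes (10xxxxxx) of a multi-byte character. */
        while (next < length &&
               ((unsigned char)subject[next] & 0xC0) == 0x80)
            next++;
    }
    return next;
}
```

If the offset is already at the end of the subject, there is nothing left to advance over and the matching loop should stop instead of calling this helper.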
If a non-zero starting offset is passed when the pattern is anchored, one attempt to match at the given offset is made. This can only succeed if the pattern does not require the match to be at the start of the subject.