
Debugging regular expressions

There are two ways to enable debugging output for regular expressions.

If your perl is compiled with -DDEBUGGING, you may use the -Dr flag on the command line.

Otherwise, one can put use re 'debug' in the program; this has effects at both compile time and run time. It is not lexically scoped.
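As a minimal sketch of the second approach, the script below turns on regex debugging and then runs the pattern used in the examples that follow. The target string is the one from the run-time trace later in this section, and the variable name $matched is only for illustration. The debugging output goes to STDERR, and its exact layout varies between perl versions.

```perl
use strict;
use warnings;
use re 'debug';    # regex debugging output is written to STDERR

# The pattern and target string are taken from the examples in this section.
my $matched = ("abcdefg__gh__" =~ /[bc]d(ef*g)+h[ij]k$/) ? 1 : 0;

# The target contains no "k", so the match fails.
print "matched: $matched\n";
```

Redirecting STDERR to a file (for example with 2> on the command line) keeps the dump separate from the program's own output.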

Compile-time output

The debugging output at compile time looks like this:

 
  Compiling REx `[bc]d(ef*g)+h[ij]k$'
  size 45 Got 364 bytes for offset annotations.
  first at 1
  rarest char g at 0
  rarest char d at 0
     1: ANYOF[bc](12)
    12: EXACT <d>(14)
    14: CURLYX[0] {1,32767}(28)
    16:   OPEN1(18)
    18:     EXACT <e>(20)
    20:     STAR(23)
    21:       EXACT <f>(0)
    23:     EXACT <g>(25)
    25:   CLOSE1(27)
    27:   WHILEM[1/1](0)
    28: NOTHING(29)
    29: EXACT <h>(31)
    31: ANYOF[ij](42)
    42: EXACT <k>(44)
    44: EOL(45)
    45: END(0)
  anchored `de' at 1 floating `gh' at 3..2147483647 (checking floating) 
        stclass `ANYOF[bc]' minlen 7 
  Offsets: [45]
  	1[4] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 5[1]
  	0[0] 12[1] 0[0] 6[1] 0[0] 7[1] 0[0] 9[1] 8[1] 0[0] 10[1] 0[0]
  	11[1] 0[0] 12[0] 12[0] 13[1] 0[0] 14[4] 0[0] 0[0] 0[0] 0[0]
  	0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 18[1] 0[0] 19[1] 20[0]  
  Omitting $` $& $' support.  

The first line shows the pre-compiled form of the regex. The second shows the size of the compiled form (in arbitrary units, usually 4-byte words) and the total number of bytes allocated for the offset/length table, usually 4+size*8. The next line shows the label id of the first node that does a match.
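The relation between the two numbers on the second line can be checked by hand. The sketch below pulls them out of the sample header line and recomputes the table size from the 4+size*8 rule of thumb; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Header line copied from the sample compile-time output above.
my $header = 'size 45 Got 364 bytes for offset annotations.';

my ($size, $bytes) = $header =~ /size (\d+) Got (\d+) bytes/
    or die "unexpected header format";

# The usual relation between the two numbers: 4 + size * 8.
my $expected = 4 + $size * 8;

print "size=$size bytes=$bytes expected=$expected\n";
# prints: size=45 bytes=364 expected=364
```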

The

 
  anchored `de' at 1 floating `gh' at 3..2147483647 (checking floating) 
        stclass `ANYOF[bc]' minlen 7   

line (split into two lines above) contains optimizer information. In the example shown, the optimizer found that the match should contain a substring de at offset 1, plus substring gh at some offset between 3 and infinity. Moreover, when checking for these substrings (to abandon impossible matches quickly), Perl will check for the substring gh before checking for the substring de. The optimizer may also use the knowledge that the match starts (at the first id) with a character class, and no string shorter than 7 characters can possibly match.
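To make that reasoning concrete, here is a rough sketch of the kind of cheap pre-check these facts permit. It is an illustration only, not how perl implements its optimizer; the sub name could_match is hypothetical, and the stclass check is left out for brevity.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 0 when the optimizer facts rule a match out,
# 1 when the regex engine would still have to be consulted.
sub could_match {
    my ($str) = @_;

    return 0 if length($str) < 7;      # minlen 7

    # Check the floating substring first, as "(checking floating)" says.
    my $gh = rindex($str, 'gh');       # last occurrence of 'gh'
    return 0 if $gh < 3;               # must sit at offset 3 or later

    # 'de' sits at offset 1 of the match, so it cannot start the string.
    my $de = index($str, 'de', 1);     # first 'de' at position >= 1
    return 0 if $de < 0;

    # 'gh' at match offset >= 3 means at least 2 characters past the 'd'.
    return $gh >= $de + 2 ? 1 : 0;
}

print could_match('abcdefg__gh__'), "\n";    # prints 1
print could_match('abcdefg'), "\n";          # prints 0
```

A result of 1 only means the string was not rejected; whether it actually matches is still up to the regex engine.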

The fields of interest which may appear in this line are:

anchored STRING at POS
floating STRING at POS1..POS2
    See above.
checking floating/anchored
    Which substring to check first.
minlen
    The minimal length of the match.
stclass TYPE
    Type of first matching node.
noscan
    Don't scan for the found substrings.
isall
    Means that the optimizer information is all that the regular expression contains, and thus one does not need to enter the regex engine at all.
GPOS
    Set if the pattern contains \G.
plus
    Set if the pattern starts with a repeated char (as in x+y).
implicit
    Set if the pattern starts with .*.
with eval
    Set if the pattern contains eval-groups, such as (?{ code }) and (??{ code }).
anchored(TYPE)
    Set if the pattern may match only at a handful of places, with TYPE being BOL, MBOL, or GPOS. See the table below.

If a substring is known to match at end-of-line only, it may be followed by $, as in floating `k'$.
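The fields described above can be pulled out of the optimizer line with a few patterns. The following is a sketch that assumes the layout of the sample line shown earlier; the sub name parse_optimizer_line and the hash keys are hypothetical, and only some of the fields are handled.

```perl
use strict;
use warnings;

# Optimizer line from the sample output, joined onto one line.
my $line = q{anchored `de' at 1 floating `gh' at 3..2147483647 }
         . q{(checking floating) stclass `ANYOF[bc]' minlen 7};

# Hypothetical helper: returns a hash reference with the fields it found.
sub parse_optimizer_line {
    my ($text) = @_;
    my %info;

    # The optional \$ allows for the end-of-line marker, as in floating `k'$.
    $info{anchored} = { str => $1, at => $2 }
        if $text =~ /anchored `([^']*)'\$? at (\d+)/;
    $info{floating} = { str => $1, from => $2, to => $3 }
        if $text =~ /floating `([^']*)'\$? at (\d+)\.\.(\d+)/;

    ($info{checking}) = $text =~ /\(checking (floating|anchored)\)/;
    ($info{stclass})  = $text =~ /stclass `([^']*)'/;
    ($info{minlen})   = $text =~ /minlen (\d+)/;

    return \%info;
}

my $info = parse_optimizer_line($line);
print "check '$info->{floating}{str}' first\n"
    if $info->{checking} eq 'floating';
print "minlen $info->{minlen}\n";
```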

The optimizer-specific information is used to avoid entering the (slow) regex engine on strings that definitely will not match. If the isall flag is set, a call to the regex engine may be avoided even when the optimizer has found an appropriate place for the match.

Above the optimizer section is the list of nodes of the compiled form of the regex. Each line has the format

id: TYPE OPTIONAL-INFO (next-id)
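To see how the next-id links tie the nodes together, the sketch below parses the node lines from the sample dump and follows the links starting at node 1. The parsing pattern is an assumption based on the sample lines only, and the hash names are illustrative.

```perl
use strict;
use warnings;

# Node lines copied from the sample compile-time output.
my @dump = (
    '   1: ANYOF[bc](12)',
    '  12: EXACT <d>(14)',
    '  14: CURLYX[0] {1,32767}(28)',
    '  16:   OPEN1(18)',
    '  18:     EXACT <e>(20)',
    '  20:     STAR(23)',
    '  21:       EXACT <f>(0)',
    '  23:     EXACT <g>(25)',
    '  25:   CLOSE1(27)',
    '  27:   WHILEM[1/1](0)',
    '  28: NOTHING(29)',
    '  29: EXACT <h>(31)',
    '  31: ANYOF[ij](42)',
    '  42: EXACT <k>(44)',
    '  44: EOL(45)',
    '  45: END(0)',
);

my (%type_of, %next_of);
for my $line (@dump) {
    # id: TYPE OPTIONAL-INFO (next-id)
    next unless $line =~ /^\s*(\d+):\s*([A-Z]+\d*)\s*(.*?)\((\d+)\)\s*$/;
    $type_of{$1} = $2;
    $next_of{$1} = $4;
}

# Follow the next-id links from node 1 until a next-id of 0 is reached.
my @chain;
for (my $id = 1; $id; $id = $next_of{$id}) {
    push @chain, "$id:$type_of{$id}";
}
print "@chain\n";
# prints: 1:ANYOF 12:EXACT 14:CURLYX 28:NOTHING 29:EXACT 31:ANYOF 42:EXACT 44:EOL 45:END
```

Nodes 16 through 27 do not appear in the chain because they form the body of the CURLYX loop, which is reached through the CURLYX node rather than through a top-level next-id.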

Types of nodes

Here are the possible types, with short descriptions:

 
    # TYPE arg-description [num-args] [longjump-len] DESCRIPTION

    # Exit points
    END		no	End of program.
    SUCCEED	no	Return from a subroutine, basically.

    # Anchors:
    BOL		no	Match "" at beginning of line.
    MBOL	no	Same, assuming multiline.
    SBOL	no	Same, assuming singleline.
    EOS		no	Match "" at end of string.
    EOL		no	Match "" at end of line.
    MEOL	no	Same, assuming multiline.
    SEOL	no	Same, assuming singleline.
    BOUND	no	Match "" at any word boundary
    BOUNDL	no	Match "" at any word boundary
    NBOUND	no	Match "" at any word non-boundary
    NBOUNDL	no	Match "" at any word non-boundary
    GPOS	no	Matches where last m//g left off.

    # [Special] alternatives
    ANY		no	Match any one character (except newline).
    SANY	no	Match any one character.
    ANYOF	sv	Match character in (or not in) this class.
    ALNUM	no	Match any alphanumeric character
    ALNUML	no	Match any alphanumeric char in locale
    NALNUM	no	Match any non-alphanumeric character
    NALNUML	no	Match any non-alphanumeric char in locale
    SPACE	no	Match any whitespace character
    SPACEL	no	Match any whitespace char in locale
    NSPACE	no	Match any non-whitespace character
    NSPACEL	no	Match any non-whitespace char in locale
    DIGIT	no	Match any numeric character
    NDIGIT	no	Match any non-numeric character

    # BRANCH	The set of branches constituting a single choice are hooked
    #		together with their "next" pointers, since precedence prevents
    #		anything being concatenated to any individual branch.  The
    #		"next" pointer of the last BRANCH in a choice points to the
    #		thing following the whole choice.  This is also where the
    #		final "next" pointer of each individual branch points; each
    #		branch starts with the operand node of a BRANCH node.
    #
    BRANCH	node	Match this alternative, or the next...

    # BACK	Normal "next" pointers all implicitly point forward; BACK
    #		exists to make loop structures possible.
    # not used
    BACK	no	Match "", "next" ptr points backward.

    # Literals
    EXACT	sv	Match this string (preceded by length).
    EXACTF	sv	Match this string, folded (prec. by length).
    EXACTFL	sv	Match this string, folded in locale (w/len).

    # Do nothing
    NOTHING	no	Match empty string.
    # A variant of above which delimits a group, thus stops optimizations
    TAIL	no	Match empty string. Can jump here from outside.

    # STAR,PLUS	'?', and complex '*' and '+', are implemented as circular
    #		BRANCH structures using BACK.  Simple cases (one character
    #		per match) are implemented with STAR and PLUS for speed
    #		and to minimize recursive plunges.
    #
    STAR	node	Match this (simple) thing 0 or more times.
    PLUS	node	Match this (simple) thing 1 or more times.

    CURLY	sv 2	Match this simple thing {n,m} times.
    CURLYN	no 2	Match next-after-this simple thing 
    #			{n,m} times, set parens.
    CURLYM	no 2	Match this medium-complex thing {n,m} times.
    CURLYX	sv 2	Match this complex thing {n,m} times.

    # This terminator creates a loop structure for CURLYX
    WHILEM	no	Do curly processing and see if rest matches.

    # OPEN,CLOSE,GROUPP	...are numbered at compile time.
    OPEN	num 1	Mark this point in input as start of #n.
    CLOSE	num 1	Analogous to OPEN.

    REF		num 1	Match some already matched string
    REFF	num 1	Match already matched string, folded
    REFFL	num 1	Match already matched string, folded in loc.

    # grouping assertions
    IFMATCH	off 1 2	Succeeds if the following matches.
    UNLESSM	off 1 2	Fails if the following matches.
    SUSPEND	off 1 1	"Independent" sub-regex.
    IFTHEN	off 1 1	Switch, should be preceded by switcher.
    GROUPP	num 1	Whether the group matched.

    # Support for long regex
    LONGJMP	off 1 1	Jump far away.
    BRANCHJ	off 1 1	BRANCH with long offset.

    # The heavy worker
    EVAL	evl 1	Execute some Perl code.

    # Modifiers
    MINMOD	no	Next operator is not greedy.
    LOGICAL	no	Next opcode should set the flag only.

    # This is not used yet
    RENUM	off 1 1	Group with independently numbered parens.

    # This is not really a node, but an optimized away piece of a "long" node.
    # To simplify debugging output, we mark it as if it were a node
    OPTIMIZED	off	Placeholder for dump.  

Following the optimizer information is a dump of the offset/length table, here split across several lines:

 
  Offsets: [45]
  	1[4] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 5[1]
  	0[0] 12[1] 0[0] 6[1] 0[0] 7[1] 0[0] 9[1] 8[1] 0[0] 10[1] 0[0]
  	11[1] 0[0] 12[0] 12[0] 13[1] 0[0] 14[4] 0[0] 0[0] 0[0] 0[0]
  	0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 18[1] 0[0] 19[1] 20[0]    

The first line here indicates that the offset/length table contains 45 entries. Each entry is a pair of integers, denoted by offset[length]. Entries are numbered starting with 1, so entry #1 here is 1[4] and entry #12 is 5[1].

1[4] indicates that the node labeled 1: (the 1: ANYOF[bc]) begins at character position 1 in the pre-compiled form of the regex, and has a length of 4 characters. 5[1] in position 12 indicates that the node labeled 12: (the 12: EXACT <d>) begins at character position 5 in the pre-compiled form of the regex, and has a length of 1 character. 12[1] in position 14 indicates that the node labeled 14: (the 14: CURLYX[0] {1,32767}) begins at character position 12 in the pre-compiled form of the regex, and has a length of 1 character; that is, it corresponds to the + symbol in the pre-compiled regex.

0[0] items indicate that there is no corresponding node.
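The mapping from table entries back to the regex source can be reproduced with a short script. This sketch assumes the sample table and regex shown above; the variable names are illustrative.

```perl
use strict;
use warnings;

my $regex = '[bc]d(ef*g)+h[ij]k$';

# Offset/length table copied from the dump above.
my $table = join ' ',
    '1[4] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 5[1]',
    '0[0] 12[1] 0[0] 6[1] 0[0] 7[1] 0[0] 9[1] 8[1] 0[0] 10[1] 0[0]',
    '11[1] 0[0] 12[0] 12[0] 13[1] 0[0] 14[4] 0[0] 0[0] 0[0] 0[0]',
    '0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 18[1] 0[0] 19[1] 20[0]';

my %source_for;    # node id => piece of the regex source it came from
my $entry = 0;     # entries are numbered starting with 1
while ($table =~ /(\d+)\[(\d+)\]/g) {
    my ($offset, $length) = ($1, $2);
    $entry++;
    next if $offset == 0 && $length == 0;    # 0[0]: no corresponding node
    # Offsets in the table are 1-based; substr() is 0-based.
    $source_for{$entry} = substr($regex, $offset - 1, $length);
}

printf "%2d: '%s'\n", $_, $source_for{$_}
    for sort { $a <=> $b } keys %source_for;
```

Entries with a length of 0, such as 12[0] for nodes 27 and 28, come out as empty strings: those nodes exist but correspond to no characters of their own in the source.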

Run-time output

First of all, when doing a match, one may get no run-time output even if debugging is enabled. This means that the regex engine was never entered and that all of the work was therefore done by the optimizer.

If the regex engine was entered, the output may look like this:

 
  Matching `[bc]d(ef*g)+h[ij]k$' against `abcdefg__gh__'
    Setting an EVAL scope, savestack=3
     2 <ab> <cdefg__gh_>    |  1: ANYOF
     3 <abc> <defg__gh_>    | 11: EXACT <d>
     4 <abcd> <efg__gh_>    | 13: CURLYX {1,32767}
     4 <abcd> <efg__gh_>    | 26:   WHILEM
				0 out of 1..32767  cc=effff31c
     4 <abcd> <efg__gh_>    | 15:     OPEN1
     4 <abcd> <efg__gh_>    | 17:     EXACT <e>
     5 <abcde> <fg__gh_>    | 19:     STAR
			     EXACT <f> can match 1 times out of 32767...
    Setting an EVAL scope, savestack=3
     6 <bcdef> <g__gh__>    | 22:       EXACT <g>
     7 <bcdefg> <__gh__>    | 24:       CLOSE1
     7 <bcdefg> <__gh__>    | 26:       WHILEM
				    1 out of 1..32767  cc=effff31c
    Setting an EVAL scope, savestack=12
     7 <bcdefg> <__gh__>    | 15:         OPEN1
     7 <bcdefg> <__gh__>    | 17:         EXACT <e>
       restoring \1 to 4(4)..7
				    failed, try continuation...
     7 <bcdefg> <__gh__>    | 27:         NOTHING
     7 <bcdefg> <__gh__>    | 28:         EXACT <h>
				    failed...
				failed...  

The most significant information in the output is about the particular node of the compiled regex that is currently being tested against the target string. The format of these lines is

STRING-OFFSET <PRE-STRING> <POST-STRING> |ID: TYPE

The TYPE info is indented with respect to the backtracking level. Other incidental information appears interspersed within.
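A trace line in this format can be split into its parts with one pattern. The sketch below assumes the layout of the sample trace above and that neither PRE-STRING nor POST-STRING contains a > character; the sub name parse_trace_line and the hash keys are hypothetical. The width of the indentation before TYPE is kept as a rough indicator of the backtracking level.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash reference for a node line,
# or undef for incidental lines such as "Setting an EVAL scope".
sub parse_trace_line {
    my ($line) = @_;
    return undef unless $line =~
        /^\s*(\d+)\s+<([^>]*)>\s+<([^>]*)>\s*\|\s*(\d+):(\s*)(\S.*?)\s*$/;
    return {
        offset => $1,            # STRING-OFFSET
        pre    => $2,            # PRE-STRING
        post   => $3,            # POST-STRING
        id     => $4,            # node ID
        indent => length($5),    # spaces before TYPE
        type   => $6,            # TYPE plus any extra info
    };
}

my $node = parse_trace_line('     4 <abcd> <efg__gh_>    | 15:     OPEN1');
print "node $node->{id} ($node->{type}) at offset $node->{offset}\n";
# prints: node 15 (OPEN1) at offset 4
```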

 

  

 

Disclaimer: This documentation is provided only for the benefits of our web hosting customers.
For authoritative source of the documentation, please refer to http://www.perldoc.com