Matching with Regular Expressions

Matches with m//

We’ve been writing patterns in pairs of forward slashes, like /fred/. But this is actually a shortcut for the m// (pattern match) operator. As you saw with the qw// operator, you may choose any pair of delimiters to quote the contents. So you could write that same expression as m(fred), m<fred>, m{fred}, or m[fred] using those paired delimiters, or as m,fred,, m!fred!, m^fred^, or many other ways using nonpaired delimiters. (Nonpaired delimiters are the ones that don’t have a different “left” and “right” variety; the same punctuation mark is used for both ends.)

The shortcut is that if you choose the forward slash as the delimiter, you may omit the initial m. Since Perl folks love to avoid typing extra characters, you’ll see most pattern matches written using slashes, as in /fred/.

If you’re using paired delimiters, you generally don’t have to worry about using the delimiter inside the pattern, since it will usually be paired there too. That is, m(fred(.*)barney) and m{\w{2,}} and m[wilma[\n\t]+betty] are all fine, even though the pattern contains the quoting character, since each “left” has a corresponding “right.” But the angle brackets (< and >) aren’t regular expression metacharacters, so they may not come in pairs inside a pattern; if you quote with m<...>, a lone > in the pattern needs a backslash so that it doesn’t end the pattern early.
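
As a minimal sketch of the delimiter choices described above (the sample strings are made up for illustration, and $1, which holds the text matched inside the parentheses, is explained later under The Match Variables):

```perl
use strict;
use warnings;

my $text = "fred and barney";

# Four spellings of the same match; each one succeeds on this string.
my $slashes = $text =~ /fred/;     # slashes: the leading m is optional
my $parens  = $text =~ m(fred);    # paired delimiters
my $braces  = $text =~ m{fred};
my $bangs   = $text =~ m!fred!;    # nonpaired delimiter

# Paired delimiters nest, so the inner parentheses need no backslashes.
my $between = "";
if ("fred123barney" =~ m(fred(.*)barney)) {
    $between = $1;
}

print "all four spellings matched\n"
    if $slashes && $parens && $braces && $bangs;
print "text between the names: $between\n";
```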

Option Modifiers

There are several option modifier letters, sometimes called flags, which may be appended as a group right after the ending delimiter of a regular expression to change its behavior from the default.

Case-insensitive Matching with /i

To make a case-insensitive pattern match, so you can match FRED as easily as fred or Fred, use the /i modifier:

print "Would you like to play a game? ";
chomp($_ = <STDIN>);
if (/yes/i) { # case insensitive match
    print "In that case, I recommend that you go bowling.\n";
}

Matching Any Character with /s

By default, the dot (.) doesn’t match newline, and this makes sense for most “look within a single line” patterns. If you might have newlines in your strings, and you want the dot to be able to match them, the /s modifier will do the job. It changes every dot in the pattern to act like the character class [\d\D] does, which is to match any character, even if it is a newline. Of course, you have to have a string with newlines for this to make a difference:

$_ = "I saw Barney\ndown at the bowling alley\nwith Fred\nlast night.\n"; 
if (/Barney.*Fred/s) { 
    print "That string mentions Fred after Barney!\n"; 
}

Without the /s modifier, that match would fail, since the two names aren’t on the same line.
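
To see the difference for yourself, here is a small sketch that runs the same pattern with and without /s on that string (the variable names are ours, chosen for illustration):

```perl
use strict;
use warnings;

my $string = "I saw Barney\ndown at the bowling alley\nwith Fred\nlast night.\n";

# Without /s the dot stops at each newline; with /s it crosses them.
my $without_s = ($string =~ /Barney.*Fred/)  ? "match" : "no match";
my $with_s    = ($string =~ /Barney.*Fred/s) ? "match" : "no match";

print "without /s: $without_s\n";
print "with /s:    $with_s\n";
```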

Adding Whitespace with /x

The third modifier you’ll see allows you to add arbitrary whitespace to a pattern to make it easier to read:

/-?\d+\.?\d*/ # what is this doing
/ -? \d+ \.? \d* /x # a little better

Since the /x allows whitespace inside the pattern, a literal space or tab character within the pattern is ignored. You could use a backslashed space or \t (among many other ways) to match these, but it’s more common to use \s (or \s* or \s+) when you want to match whitespace anyway.

Remember that in Perl, comments may be included as part of the whitespace. Now we can put comments into that pattern to tell what it’s really doing:

/
-?   # an optional minus sign
\d+  # one or more digits before the decimal point
\.?  # an optional decimal point
\d*  # some optional digits after the decimal point
/x   # end of pattern

Since the pound sign indicates the start of a comment, use \# or [#] in the rare case that you need to match a pound sign. And be careful not to include the closing delimiter inside the comments, or it will prematurely terminate the pattern.
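
Here is a hedged sketch of both ways to match a literal pound sign under /x. The "issue #42" string and the variable names are invented for the example, and note that the comments inside the pattern avoid the closing delimiter:

```perl
use strict;
use warnings;

my $ticket = "issue #42";
my ($with_backslash, $with_class) = ("", "");

if ($ticket =~ /
        issue \s+   # the word, then real whitespace, written as \s
        \#          # a literal pound sign, backslashed
        (\d+)       # the digits we want
    /x) {
    $with_backslash = $1;
}

# The same idea, using a character class for the pound sign.
if ($ticket =~ / issue \s+ [#] (\d+) /x) {
    $with_class = $1;
}

print "backslash form found $with_backslash\n";
print "class form found $with_class\n";
```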

Combining Option Modifiers

If you have more than one option modifier to use on the same pattern, they may be used one after the other (their order isn’t significant):

if (/barney.*fred/is) { # both /i and /s
    print "That string mentions Fred after Barney!\n";
}

Or as a more expanded version with comments:

if (m{
    barney # the little guy
    .* # anything in between
    fred # the loud guy
    }six) { # all three of /s and /i and /x
    print "That string mentions Fred after Barney!\n";
}

Note the shift to curly braces here for the delimiters as well, allowing programmer-style editors to easily bounce from the beginning to the end of the regular expression.

Other Options

There are many other option modifiers available. We’ll cover those as we get to them, or you can read about them in the perlop manpage and in the descriptions of m// and the other regular expression operators that you’ll see later in this chapter.

Anchors

By default, if a pattern doesn’t match at the start of the string, it can “float” on down the string, trying to match somewhere else. But there are a number of anchors that may be used to hold the pattern at a particular point in a string.

The caret anchor (^) marks the beginning of the string, while the dollar sign ($) marks the end (see Note 1). So the pattern /^fred/ will match fred only at the start of the string; it wouldn’t match manfred mann. And /rock$/ will match rock only at the end of the string; it wouldn’t match knute rockne.

Note 1: Actually, it matches either the end of the string or at a newline at the end of the string. That’s so you can match the end of the string whether it has a trailing newline or not. Most folks don’t worry about this distinction much, but once in a long while it’s important to remember that /^fred$/ will match either "fred" or "fred\n" with equal ease.

Sometimes you’ll want to use both of these anchors to ensure that the pattern matches an entire string. A common example is /^\s*$/, which matches a blank line. But this “blank” line may include some whitespace characters, like tabs and spaces, which are invisible to you and me.
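
The anchor behavior described above, including the trailing-newline case, can be checked with a short sketch like this one (each variable is 1 for a match and 0 for no match; the names are ours):

```perl
use strict;
use warnings;

my $start_anchor = "manfred mann" =~ /^fred/  ? 1 : 0;  # fred is not at the start
my $end_anchor   = "knute rockne" =~ /rock$/  ? 1 : 0;  # rock is not at the end
my $plain        = "fred"         =~ /^fred$/ ? 1 : 0;  # the whole string
my $with_newline = "fred\n"       =~ /^fred$/ ? 1 : 0;  # trailing newline allowed
my $blank_line   = " \t "         =~ /^\s*$/  ? 1 : 0;  # only invisible whitespace

print "start=$start_anchor end=$end_anchor plain=$plain ",
      "newline=$with_newline blank=$blank_line\n";
```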

Word Anchors

Anchors aren’t just at the ends of the string. The word-boundary anchor, \b, matches at either end of a word. So you can use /\bfred\b/ to match the word fred but not frederick or alfred or manfred mann. This is similar to the feature often called something like “match whole words only” in a word processor’s search command. More precisely, the \b anchor matches at the start or end of a group of \w characters.

The nonword-boundary anchor is \B; it matches at any point where \b would not match. So the pattern /\bsearch\B/ will match searches, searching, and searched, but not search or researching.
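
As a quick check of both anchors, this sketch filters a couple of word lists with grep, which keeps the items for which the pattern matches (the lists simply reuse the words mentioned above):

```perl
use strict;
use warnings;

# \b on both sides: only the standalone word survives.
my @whole_word = grep { /\bfred\b/ } ("fred", "frederick", "alfred", "manfred mann");

# \b before and \B after: search must start a word and be followed by more of it.
my @stems = grep { /\bsearch\B/ }
    ("search", "searches", "searching", "searched", "researching");

print "whole word: @whole_word\n";
print "stems: @stems\n";
```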

The Binding Operator, =~

Matching against $_ is merely the default; the binding operator, =~, tells Perl to match the pattern on the right against the string on the left, instead of matching against $_. For example:

my $some_other = "I dream of betty rubble."; 
if ($some_other =~ /\brub/) { 
    print "Aye, there's the rub.\n"; 
}

Another example:

print "Do you like Perl? ";
my $likes_perl = (<STDIN> =~ /\byes\b/i); # the parentheses are not necessary
... # Time passes...
if ($likes_perl) {
    print "You said earlier that you like Perl, so...\n";
    ...
}

Interpolating into Patterns

The regular expression is double-quote interpolated, just as if it were a double-quoted string. This allows us to write a quick grep-like program like this:

#!/usr/bin/perl -w 
my $what = "larry"; 
while (<>) { 
    if (/^($what)/) { # pattern is anchored at beginning of string 
        print "We saw $what in beginning of $_"; 
    } 
}

The pattern will be built up out of whatever’s in $what when we run the pattern match. In this case, it’s the same as if we had written /^(larry)/, looking for larry at the start of each line.

But we didn’t have to get the value of $what from a literal string; we could have gotten it instead from the command-line arguments in @ARGV:

my $what = shift @ARGV;

The Match Variables

So far, when we’ve put parentheses into patterns, they’ve been used only for their ability to group parts of a pattern together. But parentheses also trigger the regular expression engine’s memory. The memory holds the part of the string matched by the part of the pattern inside parentheses. If there is more than one pair of parentheses, there will be more than one memory. Each regular expression memory holds part of the original string, not part of the pattern.

Since these variables hold strings, they are scalar variables; in Perl, they have names like $1 and $2. There are as many of these variables as there are pairs of memory parentheses in the pattern. As you’d expect, $4 means the string matched by the fourth set of parentheses.

These match variables are a big part of the power of regular expressions because they let us pull out the parts of a string:

$_ = "Hello there, neighbor"; 
if (/\s(\w+),/) { # memorize the word between space and comma 
    print "the word was $1\n"; # the word was there 
}

Or you could use more than one memory at once:

$_ = "Hello there, neighbor"; 
if (/(\S+) (\S+), (\S+)/) { 
    print "words were $1 $2 $3\n"; 
}

You could even have an empty match variable if that part of the pattern might be empty. That is, a match variable may contain the empty string:

my $dino = "I fear that I'll be extinct after 1000 years."; 
if ($dino =~ /(\d*) years/) { 
    print "That said '$1' years.\n"; # 1000 
} 

$dino = "I fear that I'll be extinct after a few million years."; 
if ($dino =~ /(\d*) years/) { 
    print "That said '$1' years.\n"; # empty string 
}

The Persistence of Memory

These match variables generally stay around until the next successful pattern match. That is, an unsuccessful match leaves the previous memories intact, but a successful one resets them all. This correctly implies that you shouldn’t use these match variables unless the match succeeded; otherwise, you could be seeing a memory from some previous pattern. The following (bad) example is supposed to print a word matched from $wilma. But if the match fails, it will use whatever leftover string happens to be found in $1.

$wilma =~ /(\w+)/; # BAD! Untested match result 
print "Wilma's word was $1... or was it?\n";

This is another reason that a pattern match is almost always found in the conditional expression of an if or while:

if ($wilma =~ /(\w+)/) { 
    print "Wilma's word was $1.\n"; 
} else { 
    print "Wilma doesn't have a word.\n"; 
}
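
To watch a leftover memory survive a failed match, you can try a sketch like the following. The matches are deliberately left untested here just to expose the problem, and the strings are made up for the demonstration:

```perl
use strict;
use warnings;

my $first_ok  = "wilma" =~ /(\w+)/;   # succeeds, so $1 is now "wilma"
my $second_ok = "!!!"   =~ /(\w+)/;   # fails: no word characters to find
my $leftover  = $1;                   # still the value from the first match

print "second match ", ($second_ok ? "succeeded" : "failed"),
      ", but \$1 is still '$leftover'\n";
```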

Since these memories don’t stay around forever, you shouldn’t use a match variable like $1 more than a few lines after its pattern match. If your maintenance programmer adds a new regular expression between your regular expression and your use of $1, you’ll be getting the value of $1 for the second match, rather than the first. For this reason, if you need a memory for more than a few lines, it’s generally best to copy it into an ordinary variable. Doing this helps make the code more readable at the same time:

if ($wilma =~ /(\w+)/) { 
    my $wilma_word = $1; 
    # ... 
}

Noncapturing Parentheses

Consider a regular expression where we want to make part of it optional, but capture only another part of it. In this example, we want “bronto” to be optional, but to make it optional, we have to group that sequence of characters with parentheses. Later in the pattern, we use an alternation to get either “steak” or “burger”, and we want to know which one we found.

if (/(bronto)?saurus (steak|burger)/) { 
    print "Fred wants a $2\n"; 
}

Even if bronto is not there, its part of the pattern goes into $1. Perl just counts the order of the opening parentheses to decide what the memory variables will be. The part that we want to remember ends up in $2. In more complicated patterns, this situation can get quite confusing.

Fortunately, Perl’s regular expressions have a way to use parentheses to group things but not trigger the memory variables. We call these noncapturing parentheses, and we write them with a special sequence. We add a question mark and a colon after the opening parenthesis, (?:), and that tells Perl we use these parentheses only for grouping.

We change our regular expression to use noncapturing parentheses around “bronto”, and the part that we want to remember now shows up in $1:

if (/(?:bronto)?saurus (steak|burger)/) { 
    print "Fred wants a $1\n"; 
}
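
The renumbering can be seen side by side in this sketch. The "saurus burger" order and the variable names are ours, and since “bronto” is absent from that string, the first memory of the capturing version is left undefined:

```perl
use strict;
use warnings;

my $order = "saurus burger";
my ($first, $second, $only);

# Capturing parentheses: the optional group still claims $1.
if ($order =~ /(bronto)?saurus (steak|burger)/) {
    ($first, $second) = ($1, $2);
}

# Noncapturing parentheses: the alternation is now $1.
if ($order =~ /(?:bronto)?saurus (steak|burger)/) {
    $only = $1;
}

print "capturing: \$1 is ", (defined $first ? $first : "undefined"),
      ", \$2 is $second\n";
print "noncapturing: \$1 is $only\n";
```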

Named Captures

Instead of remembering numbers such as $1, Perl 5.10 lets us name the captures directly in the regular expression. It saves the text it matches in the hash named %+: the key is the label we used and the value is the part of the string that it matched. To label a match variable, we use (?<LABEL>PATTERN) where we replace LABEL with our own names. We label the first capture name1 and the second one name2, and look in $+{name1} and $+{name2} to find their values:

use 5.010; # say needs Perl 5.10 or later

my $names = 'Fred or Barney';
if( $names =~ m/(?<name1>\w+) (and|or) (?<name2>\w+)/ ) {
    say "I saw $+{name1} and $+{name2}";
}

Now that we have a way to label matches, we also need a way to refer to them for back references. Previously, we used either \1 or \g{1} for this. With a labeled group, we can use the label in \g{label}:

my $names = 'Fred Flintstone and Wilma Flintstone';
if( $names =~ m/(?<last_name>\w+) and \w+ \g{last_name}/ ) { 
    say "I saw $+{last_name}"; 
}

We can do the same thing with another syntax. Instead of using \g{label}, we use \k<label>:

my $names = 'Fred Flintstone and Wilma Flintstone';
if( $names =~ m/(?<last_name>\w+) and \w+ \k<last_name>/ ) {
    say "I saw $+{last_name}"; 
}
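
One detail worth checking for yourself is that a labeled group is still counted like any other pair of capturing parentheses, so the same text is available by number as well. This is a small sketch with variable names of our own choosing:

```perl
use 5.010;
use strict;
use warnings;

my $names = 'Fred Flintstone and Wilma Flintstone';
my ($by_label, $by_number) = ("", "");

if ($names =~ /(?<last_name>\w+) and \w+ \k<last_name>/) {
    $by_label  = $+{last_name};   # looked up by label
    $by_number = $1;              # the same capture, by position
}

say "by label: $by_label, by number: $by_number";
```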

The Automatic Match Variables

There are three more match variables that you get for free, whether the pattern has memory parentheses or not. The names are punctuation marks: $&, $`, and $'. The part of the string that actually matched the pattern is automatically stored in $&:

if ("Hello there, neighbor" =~ /\s(\w+),/) { 
    print "That actually matched '$&'.\n"; 
}

That tells us that the part that matched was " there," (with a space, a word, and a comma). Memory one, in $1, has just the five-letter word there, but $& has the entire matched section.

Whatever came before the matched section is in $`, and whatever was after it is in $'. Another way to say that is that $` holds whatever the regular expression engine had to skip over before it found the match, and $' has the remainder of the string that the pattern never got to. If you glue these three strings together in order, you’ll always get back the original string:

if ("Hello there, neighbor" =~ /\s(\w+),/) { 
    print "That was ($`)($&)($').\n"; 
}

Any or all of these three automatic match variables may be empty, of course, just like the numbered match variables. And they have the same scope as the numbered match variables. Generally, that means that they’ll stay around until the next successful pattern match.
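
The “glue them back together” claim is easy to verify with a sketch like this one, which copies the three variables into ordinary ones right after the match (the names are ours):

```perl
use strict;
use warnings;

my $string = "Hello there, neighbor";
my ($before, $matched, $after) = ("", "", "");

if ($string =~ /\s(\w+),/) {
    ($before, $matched, $after) = ($`, $&, $');
}

# Concatenating the three pieces in order rebuilds the original string.
my $rebuilt = $before . $matched . $after;

print "($before)($matched)($after)\n";
print "rebuilt string is ", ($rebuilt eq $string ? "identical" : "different"), "\n";
```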

Now, we said earlier that these three are “free.” Well, freedom has its price. In this case, the price is that once you use any one of these automatic match variables anywhere in your entire program, other regular expressions will run a little more slowly. Now, this isn’t a giant slowdown, but it’s enough of a worry that many Perl programmers will simply never use these automatic match variables. Instead, they’ll use a workaround. For example, if the only one you need is $&, just put parentheses around the whole pattern and use $1 instead (you may need to renumber the pattern’s memories, of course).

General Quantifiers

A quantifier in a pattern means to repeat the preceding item a certain number of times. You’ve already seen three quantifiers: *, +, and ?. But if none of those three suits your needs, just use a comma-separated pair of numbers inside curly braces ({}) to specify exactly how few and how many repetitions are allowed.

So the pattern /a{5,15}/ will match from 5 to 15 repetitions of the letter a. If you omit the second number (but include the comma), there’s no upper limit to the number of times the item will match. So, /(fred){3,}/ will match if there are three or more instances of fred in a row (with no extra characters, like spaces, allowed between each fred and the next). If you omit the comma as well as the upper bound, the number given is an exact count: /\w{8}/ will match exactly eight word characters.

In fact, the three quantifier characters that you saw earlier are just common shortcuts. The star is the same as the quantifier {0,}, meaning zero or more. The plus is the same as {1,}, meaning one or more. And the question mark could be written as {0,1}. In practice, it’s unusual to need any curly-brace quantifiers, since the three shortcut characters are nearly always the only ones needed.
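
Here is a hedged sketch of the curly-brace quantifiers at work. We add ^ and $ anchors so that the counts apply to the whole string, since an unanchored /a{5,15}/ would also succeed on a longer run of a’s by matching just part of it:

```perl
use strict;
use warnings;

# Which run lengths of "a" fit entirely within {5,15}?
my @in_range = grep { ("a" x $_) =~ /^a{5,15}$/ } (4, 5, 15, 16);

# Three or more freds in a row, with nothing in between.
my $packed = "fredfredfred"   =~ /(fred){3,}/ ? 1 : 0;
my $spaced = "fred fred fred" =~ /(fred){3,}/ ? 1 : 0;

# An exact count of eight word characters.
my $eight = "dinosaur"  =~ /^\w{8}$/ ? 1 : 0;
my $nine  = "dinosaurs" =~ /^\w{8}$/ ? 1 : 0;

print "lengths in range: @in_range\n";
print "packed=$packed spaced=$spaced eight=$eight nine=$nine\n";
```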

Precedence

Regular expression precedence

Regular expression feature          Example
Parentheses (grouping or memory)    (...), (?:...), (?<LABEL>...)
Quantifiers                         a* a+ a? a{n,m}
Anchors and sequence                abc ^a a$
Alternation                         a|b|c
Atoms                               a [abc] \d \1

Examples of Precedence

When you need to decipher a complex regular expression, you’ll need to do as Perl does and use the precedence chart to see what’s really going on.

For example, /^fred|barney$/ is probably not what the programmer intended. That’s because the vertical bar of alternation is very low precedence; it cuts the pattern in two. That pattern matches either fred at the beginning of the string or barney at the end. It’s much more likely that the programmer wanted /^(fred|barney)$/, which will match if the whole line has nothing but fred or nothing but barney. And what will /(wilma|pebbles?)/ match? The question mark applies only to the previous character, so that will match either wilma or pebbles or pebble, perhaps as part of a larger string (since there are no anchors).

The pattern /^(\w+)\s+(\w+)$/ matches lines that have a “word,” some required whitespace, and another “word,” with nothing else before or after. That might be used to match lines like fred flintstone, for example. The parentheses around the words aren’t needed for grouping, so they may be intended to save those substrings into the regular expression memories.

When you’re trying to understand a complex pattern, it may be helpful to add parentheses to clarify the precedence. That’s okay, but remember that grouping parentheses are also automatically memory parentheses; use the noncapturing parentheses if you just want to group things.
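
To make the difference concrete, this sketch runs the loose and the grouped versions of the fred-or-barney pattern over a few sample lines of our own invention, using noncapturing parentheses since we only want grouping:

```perl
use strict;
use warnings;

my @lines = ("fred", "frederick", "barney", "big barney", "wilma");

# Alternation splits the pattern: fred at the start, OR barney at the end.
my @loose = grep { /^fred|barney$/ } @lines;

# Grouping first makes both anchors apply to each alternative.
my @whole_line = grep { /^(?:fred|barney)$/ } @lines;

print "loose: ", join(", ", @loose), "\n";
print "whole line: ", join(", ", @whole_line), "\n";
```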

And There’s More

Although we’ve covered all of the regular expression features that most people are likely to need for everyday programming, there are still even more features. A few are covered in the Alpaca book, but also check the perlre, perlrequick, and perlretut manpages for more information about what patterns in Perl can do. And check out YAPE::Regex::Explain on CPAN as a regular-expression-to-English translator.

A Pattern Test Program

#!/usr/bin/perl 
while (<>) { # take one input line at a time 
    chomp; 
    if (/YOUR_PATTERN_GOES_HERE/) { 
        print "Matched: |$`<$&>$'|\n"; # the special match vars 
    } else { 
        print "No match: |$_|\n"; 
    } 
}
