Regex basics

Using Simple Patterns

To match a pattern (regular expression) against the contents of $_, simply put the pattern between a pair of forward slashes (/), as we do here:

$_ = "yabba dabba doo"; 
if (/abba/) { 
    print "It matched!\n"; 
}

The expression /abba/ looks for that four-letter string in $_; if it finds it, it returns a true value. In this case, it’s found more than once, but that doesn’t make any difference. If it’s found at all, it’s a match; if it’s not in there at all, it fails.

About Metacharacters

Of course, if patterns matched only simple literal strings, they wouldn’t be very useful. That’s why there are a number of special characters, called metacharacters, that have special meanings in regular expressions.

For example, the dot (.) is a wildcard character—it matches any single character except a newline (which is represented by "\n"). So, the pattern /bet.y/ would match betty. Or it would match betsy, or bet=y, or bet.y, or any other string that has bet, followed by any one character (except a newline), followed by y. It wouldn’t match bety or betsey, though, since those don’t have exactly one character between the t and the y. The dot always matches exactly one character.

So, if you wanted to match a period in the string, you could use the dot. But that would match any possible character (except a newline), which might be more than you wanted. If you want the dot to match just a period, simply backslash it. In fact, that rule goes for all of Perl’s regular expression metacharacters: a backslash in front of any metacharacter makes it nonspecial. So, the pattern /3\.14159/ doesn’t have a wildcard character.
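
The difference between the wildcard dot and the backslashed dot can be sketched as follows. The sample strings and the array names are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical sample strings, chosen only for this sketch.
my @candidates = ("betty", "bet.y", "bety", "betsey");

my (@wildcard_hits, @literal_hits);
for (@candidates) {
    push @wildcard_hits, $_ if /bet.y/;    # dot is a wildcard here
    push @literal_hits,  $_ if /bet\.y/;   # backslashed dot is a literal period
}

print "wildcard: @wildcard_hits\n";   # prints "wildcard: betty bet.y"
print "literal:  @literal_hits\n";    # prints "literal:  bet.y"
```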

Simple Quantifiers

It often happens that you’ll need to repeat something in a pattern. The star (*) means to match the preceding item zero or more times. So, /fred\t*barney/ matches any number of tab characters between fred and barney, including none at all.

What if you wanted to allow something besides tab characters? The dot matches any character (except a newline), so .* will match any character, any number of times.

The star is formally called a quantifier, meaning that it specifies a quantity of the preceding item. But it’s not the only quantifier; the plus (+) is another. The plus means to match the preceding item one or more times: /fred +barney/ matches if fred and barney are separated by spaces and only spaces. (The space is not a metacharacter.) This won’t match fredbarney, since the plus means that there must be one or more spaces between the two names, so at least one space is required.

There’s a third quantifier like the star and plus, but more limited. It’s the question mark (?), which means that the preceding item is optional. That is, the preceding item may occur once or not at all. Like the other two quantifiers, the question mark means that the preceding item appears a certain number of times. It’s just that in this case the item may match one time (if it’s there) or zero times (if it’s not). There aren’t any other possibilities.
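
As a rough sketch, the three quantifiers can be compared side by side by applying each one to a space. The input strings and array names below are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical inputs: zero, one, and three spaces between the names.
my @inputs = ("fredbarney", "fred barney", "fred   barney");

my (@star, @plus, @optional);
for (@inputs) {
    push @star,     (/fred *barney/ ? 1 : 0);   # zero or more spaces
    push @plus,     (/fred +barney/ ? 1 : 0);   # one or more spaces
    push @optional, (/fred ?barney/ ? 1 : 0);   # zero or one space
}

print "star:     @star\n";       # prints "star:     1 1 1"
print "plus:     @plus\n";       # prints "plus:     0 1 1"
print "optional: @optional\n";   # prints "optional: 1 1 0"
```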

Grouping in Patterns

As in mathematics, parentheses, written ( ), may be used for grouping. So, parentheses are also metacharacters. As an example, the pattern /fred+/ matches strings like freddddddddd, since the plus applies only to the d, but strings like that don’t show up often in real life. But the pattern /(fred)+/ matches strings like fredfredfred, which is more likely to be what you wanted. And what about the pattern /(fred)*/? Since the star allows zero repetitions, that matches any string at all, even hello, world.
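
Here is a small sketch of how the quantifier after a group behaves. The inputs are invented for illustration. Because these patterns can match anywhere in the string, /(fred)+/ also succeeds on freddddd by finding a single fred.

```perl
use strict;
use warnings;

# Hypothetical inputs for this sketch.
my @inputs = ("fredfredfred", "freddddd", "hello, world");

my (@one_or_more, @zero_or_more);
for (@inputs) {
    push @one_or_more,  (/(fred)+/ ? 1 : 0);   # needs at least one fred
    push @zero_or_more, (/(fred)*/ ? 1 : 0);   # zero freds still counts as a match
}

print "(fred)+: @one_or_more\n";    # prints "(fred)+: 1 1 0"
print "(fred)*: @zero_or_more\n";   # prints "(fred)*: 1 1 1"
```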

The parentheses also give us a way to reuse part of the string directly in the match. We can use back references to refer to text that we matched in the parentheses. We denote a back reference as a backslash followed by a number, like \1, \2, and so on. The number denotes the parentheses group. Examples:

$_ = "abba"; 
if (/(.)\1/) { # matches 'bb' 
    print "It matched same character next to itself!\n"; 
}

The back reference doesn’t have to be right next to the parentheses group. The next pattern matches any four nonnewline characters after a literal y, and we use the \1 back reference to denote that we want to match the same four characters after the d:

$_ = "yabba dabba doo"; 
if (/y(....) d\1/) {
    print "It matched the same after y and d!\n"; 
}

We can use multiple groups of parentheses, and each group gets its own back reference. We want to match a nonnewline character in a parentheses group, followed by another nonnewline character in a parentheses group. After those two groups, we use the back reference \2 followed by the back reference \1. In effect, we’re matching a palindrome such as abba:

$_ = "yabba dabba doo"; 
if (/y(.)(.)\2\1/) { # matches 'abba' 
    print "It matched a palindrome after y!\n"; 
}

Now, this brings up the question “How do I know which group gets which number?” Fortunately, Larry did the easiest thing for humans to understand: just count the opening parentheses from left to right and ignore nesting. In the next pattern, the outer group is \1, and the two inner groups are \2 and \3:

$_ = "yabba dabba doo"; 
if (/y((.)(.)\3\2) d\1/) { 
    print "It matched!\n"; 
}
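
One way to check the numbering is to look at what each group captured. The sketch below relies on a feature not covered above: in list context, a successful match returns the text captured by each group, in group order. The @groups array is a name made up for this example.

```perl
use strict;
use warnings;

$_ = "yabba dabba doo";

# In list context, a successful match returns the captured text of
# group 1, group 2, group 3, in that order.
my @groups = /y((.)(.)\3\2) d\1/;

# The outer parentheses open first, so they are group 1.
print "group $_: $groups[$_ - 1]\n" for 1 .. @groups;
# prints:
# group 1: abba
# group 2: a
# group 3: b
```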

Perl 5.10 has a new way to denote back references. Instead of using the backslash and a number, it uses \g{N}, where N is the number of the back reference that you want to use. This notation can make it easier for us to show what we intend in the pattern.

Consider the problem where you want to use a back reference next to a part of the pattern that is a number. In this regular expression, we want to use \1 to repeat the character we matched in the parentheses and follow that with the literal string 11:

# a wrong example
$_ = "aa11bb"; 
if (/(.)\111/) { 
    print "It matched!\n"; 
}

Perl has to guess what we mean there. Is that \1, \11, or \111? Perl reads all the digits together as \111. Since the pattern doesn’t have 111 parentheses groups, Perl can’t treat that as a back reference (it takes a multi-digit escape like this one as an octal character escape instead), so the pattern doesn’t mean \1 followed by 11, and it doesn’t match our string.

By using \g{1}, we disambiguate the back reference and the literal parts of the pattern:

use 5.010; 
$_ = "aa11bb"; 
if (/(.)\g{1}11/) { 
    print "It matched!\n"; 
}

With the \g{N} notation, we can also use negative numbers. Instead of specifying the absolute number of the parentheses group, we can specify a relative back reference. We can rewrite the last example to use -1 as the number to do the same thing:

use 5.010; 
$_ = "aa11bb"; 
if (/(.)\g{-1}11/) { 
    print "It matched!\n"; 
}
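
Relative back references count backward from where they appear, so \g{-1} is the most recent group and \g{-2} is the one before it. As a minimal sketch, with made-up inputs and array names, the palindrome pattern from earlier can be written either way:

```perl
use 5.010;
use strict;
use warnings;

my (@absolute, @relative);
for ("abba", "abab") {
    push @absolute, (/(.)(.)\2\1/         ? 1 : 0);   # absolute group numbers
    push @relative, (/(.)(.)\g{-1}\g{-2}/ ? 1 : 0);   # same groups, counted backward
}

print "absolute: @absolute\n";   # prints "absolute: 1 0"
print "relative: @relative\n";   # prints "relative: 1 0"
```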

Alternatives

The vertical bar (|), often pronounced “or” in this usage, means that either the left side may match, or the right side. That is, if the part of the pattern on the left of the bar fails, the part on the right gets a chance to match. So, /fred|barney|betty/ will match any string that mentions fred, or barney, or betty.

Now you can make patterns like /fred( |\t)+barney/, which matches if fred and barney are separated by spaces, tabs, or a mixture of the two.

If you wanted the characters between fred and barney to all be the same, you could rewrite that pattern as /fred( +|\t+)barney/. In this case, the separators must be all spaces or all tabs.
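
The two separator patterns can be compared with a short sketch. The inputs (two spaces, two tabs, and a space followed by a tab) are assumptions chosen for this example.

```perl
use strict;
use warnings;

# Hypothetical inputs: all spaces, all tabs, and a mixture.
my @inputs = ("fred  barney", "fred\t\tbarney", "fred \tbarney");

my (@mixed_ok, @uniform_only);
for (@inputs) {
    push @mixed_ok,     (/fred( |\t)+barney/   ? 1 : 0);   # any mix of spaces and tabs
    push @uniform_only, (/fred( +|\t+)barney/  ? 1 : 0);   # all spaces or all tabs
}

print "mixed allowed: @mixed_ok\n";       # prints "mixed allowed: 1 1 1"
print "uniform only:  @uniform_only\n";   # prints "uniform only:  1 1 0"
```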

Character Classes

A character class, a list of possible characters inside square brackets ([]), matches any single character from within the class. It matches just one single character, but that one character may be any of the ones listed.

For example, the character class [abcwxyz] may match any one of those seven characters. For convenience, you may specify a range of characters with a hyphen (-) so that class may also be written as [a-cw-z]. That didn’t save much typing, but it’s more usual to make a character class like [a-zA-Z] to match any one letter out of that set of 52.

Of course, a character class will be just part of a full pattern; it will never stand on its own in Perl. For example, you might see code that says something like this:

$_ = "The HAL-9000 requires authorization to continue."; 
if (/HAL-[0-9]+/) { 
    print "The string mentions some model of HAL computer.\n"; 
}

Sometimes, it’s easier to specify the characters left out, rather than the ones within the character class. A caret (^) at the start of the character class negates it. That is, [^def] will match any single character except one of those three. And [^n\-z] matches any character except for n, hyphen, or z. (Note that the hyphen is backslashed because it’s special inside a character class. But the first hyphen in /HAL-[0-9]+/ doesn’t need a backslash because hyphens aren’t special outside a character class.)
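
To see why the backslash on the hyphen matters, this sketch compares [^n\-z] with [^n-z], where the unescaped hyphen makes a range from n through z. The test characters are chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical single-character inputs; p lies between n and z.
my @chars = ("n", "-", "z", "a", "p");

my (@escaped, @range);
for (@chars) {
    push @escaped, $_ if /[^n\-z]/;   # excludes only n, hyphen, and z
    push @range,   $_ if /[^n-z]/;    # excludes every letter from n through z
}

print "escaped hyphen: @escaped\n";   # prints "escaped hyphen: a p"
print "range:          @range\n";     # prints "range:          - a"
```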

Character Class Shortcuts

Some character classes appear so frequently that they have shortcuts. For example, the character class for any digit, [0-9], may be abbreviated as \d. Thus, the pattern from the example about HAL could be written /HAL-\d+/ instead.

The shortcut \w is a so-called word character: for ASCII text, [A-Za-z0-9_]. Of course, \w doesn’t match a “word”; it merely matches a single “word” character. To match an entire word, though, the plus quantifier is handy. A pattern like /fred \w+ barney/ will match fred and a space, then a “word,” then a space and barney. That is, it’ll match if there’s one word between fred and barney, set off by single spaces.

As you may have noticed in that previous example, it might be handy to be able to match spaces more flexibly. The \s shortcut is good for whitespace; it’s the same as [\f\t\n\r ]. That is, it’s the same as a class containing the five whitespace characters: form-feed, tab, newline, carriage return, and the space character itself. These are the characters that merely move the printing position around; they don’t use any ink. Still, like the other shortcuts you’ve just seen, \s matches just a single character from the class, so it’s usual to use either \s* for any amount of whitespace (including none at all), or \s+ for one or more whitespace characters.
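
As a small sketch of the difference, the pattern with literal single spaces can be compared with one that uses \s+ around the word. The inputs and array names are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical inputs: single spaces, mixed whitespace, and two words.
my @inputs = ("fred and barney", "fred \t and\nbarney", "fred and or barney");

my (@single_space, @flexible);
for (@inputs) {
    push @single_space, (/fred \w+ barney/     ? 1 : 0);   # exactly one space each side
    push @flexible,     (/fred\s+\w+\s+barney/ ? 1 : 0);   # any whitespace each side
}

print "single space: @single_space\n";   # prints "single space: 1 0 0"
print "flexible:     @flexible\n";       # prints "flexible:     1 1 0"
```

Neither pattern matches the last string, because \w+ can’t span the space between and and or.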

Perl 5.10 adds more character classes for whitespace. The \h shortcut only matches horizontal whitespace, which you can write as the character class [\t ] to match a tab and a space. The \v shortcut only matches vertical whitespace, or [\f\n\r]. The \R shortcut matches any sort of linebreak, meaning that you don’t have to think about which operating system you’re using and what it thinks a linebreak is since \R will figure it out.
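
A quick sketch of \R and \h, using invented inputs that put a newline, a carriage return followed by a newline, a lone carriage return, or a tab between a and b:

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical inputs: three linebreak styles, then a tab.
my @inputs = ("a\nb", "a\r\nb", "a\rb", "a\tb");

my (@linebreak, @horizontal);
for (@inputs) {
    push @linebreak,  (/a\Rb/ ? 1 : 0);   # any sort of linebreak
    push @horizontal, (/a\hb/ ? 1 : 0);   # horizontal whitespace only
}

print "\\R: @linebreak\n";    # prints "\R: 1 1 1 0"
print "\\h: @horizontal\n";   # prints "\h: 0 0 0 1"
```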

Negating the Shortcuts

Sometimes you may want the opposite of one of the \d, \w, or \s shortcuts. That is, you may want [^\d], [^\w], or [^\s], meaning a nondigit character, a nonword character, or a nonwhitespace character. That’s easy enough to accomplish by using their uppercase counterparts: \D, \W, or \S. These match any character that their counterpart would not match.

Any of these shortcuts will work either in place of a character class (standing on their own in a pattern), or inside the square brackets of a larger character class. That means that you could now use /[\dA-Fa-f]+/ to match hexadecimal (base 16) numbers, which use letters ABCDEF (or the same letters in lowercase) as additional digits.

Another compound character class is [\d\D], which means any digit, or any nondigit. That is to say, any character at all! This is a common way to match any character, even a newline. (As opposed to ., which matches any character except a newline.)
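
The difference shows up as soon as the string contains a newline. This is a minimal sketch, and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;

$_ = "fred\nbarney";

my $dot_matches = /fred.barney/      ? 1 : 0;   # dot refuses the newline
my $any_matches = /fred[\d\D]barney/ ? 1 : 0;   # [\d\D] accepts any character

print "dot:    $dot_matches\n";   # prints "dot:    0"
print "[\\d\\D]: $any_matches\n"; # prints "[\d\D]: 1"
```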
