Processing Text with Regular Expressions

Substitutions with s///

If you think of the m// pattern match as being like your word processor’s “search” feature, the “search and replace” feature would be Perl’s s/// substitution operator. This simply replaces whatever part of a variable matches a pattern with a replacement string:

$_ = "He's out bowling with Barney tonight."; 
s/Barney/Fred/; # Replace Barney with Fred 
print "$_\n";

If the match fails, nothing happens, and the variable is untouched:

# Continuing from above; $_ has "He's out bowling with Fred tonight." 
s/Wilma/Betty/; # Replace Wilma with Betty (fails)

Of course, both the pattern and the replacement string could be more complex. Here, the replacement string uses the first memory variable, $1, which is set by the pattern match:

s/with (\w+)/against $1's team/; 
print "$_\n"; # says "He's out bowling against Fred's team tonight."

Here are some other possible substitutions:

$_ = "green scaly dinosaur";
s/(\w+) (\w+)/$2, $1/; # Now it's "scaly, green dinosaur"
s/^/huge, /; # Now it's "huge, scaly, green dinosaur"
s/,.*een//; # Empty replacement: Now it's "huge dinosaur"
s/green/red/; # Failed match: still "huge dinosaur"
s/\w+$/($`!)$&/; # Now it's "huge (huge !)dinosaur"
s/\s+(!\W+)/$1 /; # Now it's "huge (huge!) dinosaur"
s/huge/gigantic/; # Now it's "gigantic (huge!) dinosaur"

There’s a useful Boolean value from s///; it’s true if a substitution was successful; otherwise, it’s false:

$_ = "fred flintstone"; 
if (s/fred/wilma/) { 
    print "Successfully replaced fred with wilma!\n"; 
}

Note: unlike m//, which can match against any string expression, s/// is modifying data that must therefore be contained in what’s known as an lvalue. This is nearly always a variable, although it could actually be anything that could be used on the left side of an assignment operator.
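
As a small illustration of the lvalue requirement, here is a sketch of the common copy-then-modify idiom. The variable names $original, $copy, and @names are made up for this example:

```perl
use strict;
use warnings;

my $original = "He's out bowling with Barney tonight.";

# The parenthesized assignment is itself an lvalue, so s/// changes
# $copy and leaves $original alone.
(my $copy = $original) =~ s/Barney/Fred/;

# Any lvalue works, such as a single array element.
my @names = ("fred", "barney");
$names[1] =~ s/^b/B/;

print "$original\n$copy\n@names\n";
```

Binding s/// to a literal string instead would be a compile-time error, since there would be nowhere to store the result.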

Global Replacements with /g

As you may have noticed in a previous example, s/// will make just one replacement, even if others are possible. Of course, that’s just the default. The /g modifier tells s/// to make all possible nonoverlapping replacements:

$_ = "home, sweet home!"; 
s/home/cave/g; 
print "$_\n"; # "cave, sweet cave!"
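
The Boolean result mentioned earlier carries a little more information with /g. As a hedged sketch ($count is just a name chosen here), the return value of s///g in scalar context is the number of replacements made:

```perl
use strict;
use warnings;

$_ = "home, sweet home!";

# In scalar context, s///g returns how many replacements it made
# (or a false value if there were none).
my $count = s/home/cave/g;

print "Made $count replacements: $_\n";
```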

A fairly common use of a global replacement is to collapse whitespace; that is, to turn any arbitrary whitespace into a single space:

$_ = "Input  data\t may have    extra whitespace.";
s/\s+/ /g; # Now it says "Input data may have extra whitespace."

Once we show collapsing whitespace, everyone wants to know about stripping leading and trailing whitespace. That’s easy enough, in two steps:

s/^\s+//; # Replace leading whitespace with nothing 
s/\s+$//; # Replace trailing whitespace with nothing

We could do that in one step with an alternation and the /g flag, but that turns out to be a bit slower, at least when we wrote this. The regular expression engine is always being tuned, but to learn more about that, you can get Jeffrey Friedl’s Mastering Regular Expressions (O’Reilly) and find out what makes regular expressions fast (or slow).

s/^\s+|\s+$//g; # Strip leading, trailing whitespace
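
To see that the two approaches give the same result, here is a minimal sketch that wraps each one in a subroutine. The names trim_two_steps and trim_one_step are hypothetical helpers, not anything built into Perl:

```perl
use strict;
use warnings;

# Hypothetical helper names; neither sub is part of Perl itself.
sub trim_two_steps {
    my ($text) = @_;
    $text =~ s/^\s+//;   # leading whitespace
    $text =~ s/\s+$//;   # trailing whitespace
    return $text;
}

sub trim_one_step {
    my ($text) = @_;
    $text =~ s/^\s+|\s+$//g;   # both ends, thanks to /g
    return $text;
}

my $input = "  \t fred flintstone  \n";
print "[", trim_two_steps($input), "]\n";
print "[", trim_one_step($input), "]\n";
```

Which one is faster depends on the Perl version and the data, so it is worth measuring before relying on either claim.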

Different Delimiters

Just as we did with m// and qw//, we can change the delimiters for s///. But the substitution uses three delimiter characters, so things are a little different.

With ordinary (nonpaired) characters, which don’t have a left and right variety, just use three of them, as we did with the forward slash. Here, we’ve chosen the pound sign as the delimiter:

s#^https://#http://#;

But if you use paired characters, which have a left and right variety, you have to use two pairs: one to hold the pattern and one to hold the replacement string. In this case, the delimiters don’t have to be the same kind around the string as they are around the pattern. In fact, the delimiters of the string could even be nonpaired. These are all the same:

s{fred}{barney}; 
s[fred](barney); 
s<fred>#barney#;

Note: Although the pound sign is generally the start of a comment in Perl, it won’t start a comment when the parser knows to expect a delimiter—in this case, immediately after the s that starts the substitution.

Option Modifiers

In addition to the /g modifier, substitutions may use the /i, /x, and /s modifiers that you saw in ordinary pattern matching (the order of modifiers isn’t significant):

s#wilma#Wilma#gi; # replace every WiLmA or WILMA with Wilma 
s{__END__.*}{}s; # chop off the end marker and all following lines

The Binding Operator

Just as you saw with m//, we can choose a different target for s/// by using the binding operator:

$file_name =~ s#^.*/##s; # In $file_name, remove any Unix-style path

Case Shifting

It often happens in a substitution that you’ll want to make sure that a replacement word is properly capitalized (or not, as the case may be). That’s easy to accomplish with Perl, by using some backslash escapes. The \U escape forces what follows to all uppercase:

$_ = "I saw Barney with Fred."; 
s/(fred|barney)/\U$1/gi; # $_ is now "I saw BARNEY with FRED."

Similarly, the \L escape forces lowercase. Continuing from the previous code:

s/(fred|barney)/\L$1/gi; # $_ is now "I saw barney with fred."

By default, these affect the rest of the (replacement) string, or you can turn off case shifting with \E:

s/(\w+) with (\w+)/\U$2\E with $1/i; # $_ is now "I saw FRED with barney."

When written in lowercase (\l and \u ), they affect only the next character:

s/(fred|barney)/\u$1/ig; # $_ is now "I saw FRED with Barney."

You can even stack them up. Using \u with \L means “all lowercase, but capitalize the first letter”:

s/(fred|barney)/\u\L$1/ig; # $_ is now "I saw Fred with Barney."

As it happens, although we’re covering case shifting in relation to substitutions, these escape sequences are available in any double-quotish string:

print "Hello, \L\u$name\E, would you like to play a game?\n";

Note: The \L and \u may appear together in either order. Larry realized that people would sometimes get those two backward, so he made Perl figure out that you want just the first letter capitalized and the rest lowercase.
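
A quick way to check the claim about ordering is to try both forms on the same value. This is only a sketch, and the sample value of $name is chosen for illustration:

```perl
use strict;
use warnings;

my $name = "fRED";              # sample value, chosen for illustration

my $first  = "\u\L$name\E";     # lowercase everything, then capitalize
my $second = "\L\u$name\E";     # same result with the escapes swapped

print "$first $second\n";
```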

The split Operator

Another operator that uses regular expressions is split, which breaks up a string according to a pattern. This is useful for tab-separated data, or colon-separated, whitespace-separated, or anything-separated data, really.(1) So long as you can specify the separator with a regular expression (and generally, it’s a simple regular expression), you can use split. It looks like this:

@fields = split /separator/, $string;

The split operator drags the pattern through a string and returns a list of fields (substrings) that were separated by the separators. Whenever the pattern matches, that’s the end of one field and the start of the next. So, anything that matches the pattern will never show up in the returned fields. Here’s a typical split pattern, splitting on colons:

@fields = split /:/, "abc:def:g:h"; # gives ("abc", "def", "g", "h")

You could even have an empty field, if there were two delimiters together:

@fields = split /:/, "abc:def::g:h"; # gives ("abc", "def", "", "g", "h")

Here’s a rule that seems odd at first, but it rarely causes problems: leading empty fields are always returned, but trailing empty fields are discarded. For example:

@fields = split /:/, ":::a:b:c:::"; # gives ("", "", "", "a", "b", "c")
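
If the discarded trailing fields matter, split also accepts an optional third argument, a limit, which the text above does not cover. As a hedged sketch, a negative limit asks split to keep every field, including trailing empty ones:

```perl
use strict;
use warnings;

my $line = ":::a:b:c:::";

my @default  = split /:/, $line;       # trailing empty fields dropped
my @keep_all = split /:/, $line, -1;   # negative limit keeps them

print scalar(@default), " fields by default\n";
print scalar(@keep_all), " fields with limit -1\n";
```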

It’s also common to split on whitespace, using /\s+/ as the pattern. Under that pattern, all whitespace runs are equivalent to a single space:

my $some_input = "This is a \t test.\n"; 
my @args = split /\s+/, $some_input; # ("This", "is", "a", "test.")

The default for split is to break up $_ on whitespace:

my @fields = split; # like split /\s+/, $_;

This is almost the same as using /\s+/ as the pattern, except that in this special case a leading empty field is suppressed—so, if the line starts with whitespace, you won’t see an empty field at the start of the list. (If you’d like to get the same behavior when splitting another string on whitespace, just use a single space in place of the pattern: split ' ', $other_string. Using a space instead of the pattern is a special kind of split.)
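
The difference only shows up when the string starts with whitespace. A minimal sketch, using a made-up input line:

```perl
use strict;
use warnings;

my $line = "   fred  barney\twilma\n";

my @with_pattern = split /\s+/, $line;   # leading empty field is kept
my @with_space   = split ' ', $line;     # leading empty field is suppressed

print scalar(@with_pattern), " fields from /\\s+/\n";
print scalar(@with_space), " fields from ' '\n";
```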

Note1: Except “comma-separated values,” normally called CSV files. Those are a pain to do with split; you’re better off getting the Text::CSV module from CPAN.

The join Function

The join function doesn’t use patterns, but performs the opposite function of split: split breaks up a string into a number of pieces, and join glues together a bunch of pieces to make a single string. The join function looks like this:

my $result = join $glue, @pieces;

The first argument to join is the glue, which may be any string. The remaining arguments are a list of pieces. join puts the glue string between the pieces and returns the resulting string:

my $x = join ":", 4, 6, 8, 10, 12; # $x is "4:6:8:10:12"

There will be one fewer piece of glue than the number of items in the list. This means that there may be no glue at all if the list doesn’t have at least two elements:

my $y = join "foo", "bar"; # gives just "bar", since no fooglue is needed
my @empty; # empty array
my $empty = join "baz", @empty; # no items, so it's an empty string

Using $x from above, we can break up a string and put it back together with a different delimiter:

my @values = split /:/, $x; # @values is (4, 6, 8, 10, 12) 
my $z = join "-", @values; # $z is "4-6-8-10-12"

m// in List Context

When a pattern match (m//) is used in a list context, the return value is a list of the memory variables created in the match, or an empty list if the match failed:

$_ = "Hello there, neighbor!"; 
my($first, $second, $third) = /(\S+) (\S+), (\S+)/; 
print "$second is my $third\n";

The /g modifier that you first saw on s/// also works with m//, which lets it match at more than one place in a string. In this case, a pattern with a pair of parentheses will return a memory from each time it matches:

my $text = "Fred dropped a 5 ton granite block on Mr. Slate"; 
my @words = ($text =~ /([a-z]+)/ig); 
print "Result: @words\n"; # Result: Fred dropped a ton granite block on Mr Slate

This is like using split “inside out”: instead of specifying what we want to remove, we specify what we want to keep.

In fact, if there is more than one pair of parentheses, each match may return more than one string. Let’s say that we have a string that we want to read into a hash, something like this:

my $data = "Barney Rubble Fred Flintstone Wilma Flintstone"; 
my %last_name = ($data =~ /(\w+)\s+(\w+)/g);

Each time the pattern matches, it returns a pair of memories. Those pairs of values then become the key-value pairs in the newly created hash.
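
To confirm what ends up in the hash, this sketch repeats the match and prints each pair. Sorting the keys is our addition, there only to make the output order predictable:

```perl
use strict;
use warnings;

my $data = "Barney Rubble Fred Flintstone Wilma Flintstone";
my %last_name = ($data =~ /(\w+)\s+(\w+)/g);

# Sort the keys so the output order is predictable.
for my $first (sort keys %last_name) {
    print "$first => $last_name{$first}\n";
}
```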

More Powerful Regular Expressions

Nongreedy Quantifiers

The four quantifiers (*, +, ?, {m,n}) you’ve already seen are all greedy. That means that they match as much as they can, only to reluctantly give some back if that’s necessary to allow the overall pattern to succeed. For each of the greedy quantifiers, though, there’s also a nongreedy quantifier available. Instead of the plus (+), we can use the nongreedy quantifier +?, which matches one or more times (just as the plus does), except that it prefers to match as few times as possible, rather than as many as possible.

Since the nongreedy form of the plus was +? and the nongreedy form of the star was *?, you’ve probably realized that the other two quantifiers look similar. The nongreedy form of any curly-brace quantifier looks the same, but with a question mark after the closing brace, like {5,10}? or {8,}?. And even the question-mark quantifier has a nongreedy form: ??. That matches either once or not at all, but it prefers not to match anything.
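
A side-by-side comparison may make the preference clearer. This sketch matches the same made-up string with each greedy quantifier and its nongreedy twin, capturing what was matched:

```perl
use strict;
use warnings;

my $text = "aaaaaa";

# Each pair uses the same quantifier, greedy first and nongreedy second.
my ($plus)       = $text =~ /(a+)/;       # as many as possible
my ($plus_lazy)  = $text =~ /(a+?)/;      # as few as possible: one
my ($curly)      = $text =~ /(a{2,4})/;   # takes four
my ($curly_lazy) = $text =~ /(a{2,4}?)/;  # settles for two
my ($opt)        = $text =~ /(a?)/;       # takes one
my ($opt_lazy)   = $text =~ /(a??)/;      # prefers to match nothing

print "[$plus] [$plus_lazy] [$curly] [$curly_lazy] [$opt] [$opt_lazy]\n";
```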

Here is an example. Suppose you had some HTML-like text, and you want to remove all of the tags <BOLD> and </BOLD>, leaving their contents intact. Here’s the text:

I’m talking about the cartoon with Fred and <BOLD>Wilma</BOLD>!

And here’s a substitution to remove those tags. But what’s wrong with it?

s#<BOLD>(.*)</BOLD>#$1#g;

The problem is that the star is greedy. What if the text had said this instead?

I thought you said Fred and <BOLD>Velma</BOLD>, not <BOLD>Wilma</BOLD>

In that case, the pattern would match from the first <BOLD> to the last </BOLD>, leaving intact the ones in the middle of the line. Oops! Instead, we want a nongreedy quantifier. The nongreedy form of star is *?, so the substitution now looks like this:

s#<BOLD>(.*?)</BOLD>#$1#g;
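
Running both substitutions on the second sample text shows the difference. This is a sketch; the variable names $greedy and $nongreedy are ours:

```perl
use strict;
use warnings;

my $text = "I thought you said Fred and <BOLD>Velma</BOLD>, not <BOLD>Wilma</BOLD>";

# Copy the text twice and apply one version of the substitution to each copy.
(my $greedy    = $text) =~ s#<BOLD>(.*)</BOLD>#$1#g;
(my $nongreedy = $text) =~ s#<BOLD>(.*?)</BOLD>#$1#g;

print "greedy:    $greedy\n";
print "nongreedy: $nongreedy\n";
```

The greedy version strips only the outermost pair of tags, while the nongreedy version removes both pairs.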

Matching Multiple-Line Text

Classic regular expressions were used to match just single lines of text. But since Perl can work with strings of any length, Perl’s patterns can match multiple lines of text as easily as single lines. Of course, you have to include an expression that holds more than one line of text. Here’s a string that’s four lines long:

$_ = "I'm much better\nthan Barney is\nat bowling,\nWilma.\n";

Now, the anchors ^ and $ are normally anchors for the start and end of the whole string. But the /m regular expression option lets them match at internal newlines as well (think m for multiple lines). This makes them anchors for the start and end of each line, rather than the whole string. So this pattern can match:

print "Found 'wilma' at start of line\n" if /^wilma\b/im;

Similarly, you could do a substitution on each line in a multiline string. Here, we read an entire file into one variable, then add the file’s name as a prefix at the start of each line:

open my $fh, '<', $filename or die "Can't open '$filename': $!"; # three-argument open, lexical filehandle
my $lines = join '', <$fh>; # read every line into one string
$lines =~ s/^/$filename: /gm;
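
The effect of /m is easy to try without a file. This sketch reuses the four-line string from earlier in this section; $label is a stand-in for the filename and is purely illustrative:

```perl
use strict;
use warnings;

my $lines = "I'm much better\nthan Barney is\nat bowling,\nWilma.\n";
my $label = "note";   # stand-in for the $filename used above

(my $without_m = $lines) =~ s/^/$label: /g;    # ^ matches only at string start
(my $with_m    = $lines) =~ s/^/$label: /gm;   # ^ matches at every line start

print $with_m;
```

Without /m, the /g flag has nothing extra to find, so only the first line gets the prefix.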

Updating Many Files

The most common way of programmatically updating a text file is by writing an entirely new file that looks similar to the old one, but making whatever changes we need as we go along. As you’ll see, this technique gives nearly the same result as updating the file itself, but it has some beneficial side effects as well.

In this example, we’ve got hundreds of files with a similar format. One of them is fred03.dat, and it’s full of lines like these:

Program name: granite 
Author: Gilbert Bates 
Company: RockSoft 
Department: R&D 
Phone: +1 503 555-0095 
Date: Tues March 9, 2004 
Version: 2.1 
Size: 21k 
Status: Final beta

We need to fix this file so that it has some different information. Here’s roughly what this one should look like when we’re done:

Program name: granite 
Author: Randal L. Schwartz
Company: RockSoft 
Department: R&D 
Date: June 12, 2008 6:38 pm 
Version: 2.1 
Size: 21k 
Status: Final beta

In short, we need to make three changes. The name of the Author should be changed; the Date should be updated to today’s date, and the Phone should be removed completely. And we have to make these changes in hundreds of similar files as well.

Perl supports a way of in-place editing of files with a little extra help from the diamond operator (<>). Here’s a program to do what we want, although it may not be obvious how it works at first. This program’s only new feature is the special variable $^I; ignore that for now, and we’ll come back to it:

#!/usr/bin/perl -w 
use strict; 

chomp(my $date = `date`); 
$^I = ".bak"; 
while (<>) { 
    s/^Author:.*/Author: Randal L. Schwartz/; 
    s/^Phone:.*\n//; 
    s/^Date:.*/Date: $date/; 
    print; 
}

Since we need today’s date, the program starts by using the system date command. A better way to get the date (in a slightly different format) would almost surely be to use Perl’s own localtime function in a scalar context:

my $date = localtime;
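
If the default format doesn’t suit, one option is the strftime function from the standard POSIX module. This is a sketch rather than the program’s actual method, and the format string is only an example meant to resemble the date shown in the target file:

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Scalar context gives a ready-made string such as
# "Thu Jun 12 18:38:00 2008" (the value depends on when you run it).
my $date = localtime;

# In list context, localtime returns the individual fields, which
# strftime can format. The format string here is only an example.
my $custom = strftime("%B %d, %Y %I:%M %p", localtime);

print "$date\n$custom\n";
```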

The next line sets $^I, but keep ignoring that for the moment.

The list of files for the diamond operator comes from the command line. The main loop reads, updates, and prints one line at a time. Note that the second substitution can replace the entire line containing the phone number with an empty string (leaving not even a newline), so when that’s printed, nothing comes out, and it’s as if the Phone never existed. Most input lines won’t match any of the three patterns, and those will be unchanged in the output.

So this result is close to what we want, except that we haven’t shown you how the updated information gets back out on to the disk. The answer is in the variable $^I. By default it’s undef, and everything is normal. But when it’s set to some string, it makes the diamond operator (<>) even more magical than usual.

Let’s say it’s time for the diamond to open our file fred03.dat. It opens it like before, but now it renames it, calling it fred03.dat.bak. We’ve still got the same file open, but now it has a different name on the disk. Next, the diamond creates a new file and gives it the name fred03.dat. That’s okay; we weren’t using that name any more. And now the diamond selects the new file as the default for output, so that anything that we print will go into that file. So now the while loop will read a line from the old file, update that, and print it out to the new file. This program can update thousands of files in a few seconds on a typical machine.
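
Written out by hand for a single file, those steps look roughly like the sketch below. It is an approximation of the behavior described above, not Perl’s actual implementation: it renames before opening (which is simpler to write portably), and it works on a throwaway file in a temporary directory so that it is safe to run. The sample file contents are made up:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Set up a throwaway file so the sketch is safe to run anywhere.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/fred03.dat";
open my $setup, '>', $file or die "Can't create '$file': $!";
print $setup "Author: Gilbert Bates\nPhone: +1 503 555-0095\nVersion: 2.1\n";
close $setup;

# Roughly what the diamond does for each file when $^I is ".bak":
my $backup = "$file.bak";
rename $file, $backup       or die "Can't rename '$file': $!";
open my $in,  '<', $backup  or die "Can't read '$backup': $!";
open my $out, '>', $file    or die "Can't write '$file': $!";
my $old_handle = select $out;        # print now defaults to the new file

while (<$in>) {
    s/^Author:.*/Author: Randal L. Schwartz/;
    s/^Phone:.*\n//;
    print;                           # goes to $out, not to the screen
}

select $old_handle;                  # restore the previous default
close $out;
close $in;
```

Setting $^I does all of this bookkeeping for every file named on the command line, which is why the real program is so much shorter.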

Some folks use a tilde (~) as the value for $^I since that resembles what emacs does for backup files. Another possible value for $^I is the empty string. This enables in-place editing, but doesn’t save the original data in a backup file. But since a small typo in your pattern could wipe out all of the old data, using the empty string is recommended only if you want to find out how good your backup tapes are. It’s easy enough to delete the backup files when you’re done. And when something goes wrong and you need to rename the backup files to their original names, you’ll be glad that you know how to use Perl to do that (see the multiple-file rename example in the “Strings and Sorting” chapter).

In-Place Editing from the Command Line

A program like the example from the previous section is fairly easy to write. But Larry decided it wasn’t easy enough.

Imagine that you need to update hundreds of files that have the misspelling Randall instead of the one-l name Randal. You could write a program like the one in the previous section. Or you could do it all with a one-line program, right on the command line:

$ perl -p -i.bak -w -e 's/Randall/Randal/g' fred*.dat

The -p option tells Perl to write a program for you. It’s not much of a program, though; it looks something like this:

while (<>) { 
    print; 
}

If you want even less, you could use -n instead; that leaves out the automatic print statement, so you can print only what you wish. (Fans of awk will recognize -p and -n.) Again, it’s not much of a program, but it’s pretty good for the price of a few keystrokes.

The next option is -i.bak, which you might have guessed sets $^I to ".bak" before the program starts. If you don’t want a backup file, you can use -i alone, with no extension. If you don’t want a spare parachute, you can leave the airplane with just one.

We’ve seen -w before—it turns on warnings.

The -e option says “executable code follows.” That means the s/Randall/Randal/g string is treated as Perl code. Since we’ve already got a while loop (from the -p option), this code is put inside the loop, before the print. For technical reasons, the last semicolon in the -e code is optional. But if you have more than one -e, and thus more than one chunk of code, only the semicolon at the end of the last one may safely be omitted.

The last command-line parameter is fred*.dat, which says that @ARGV should hold the list of filenames that match that filename pattern. Put the pieces all together, and it’s as if we had written a program like this, and put it to work on all of those fred*.dat files:

#!/usr/bin/perl -w 

$^I = ".bak"; 

while (<>) { 
    s/Randall/Randal/g; 
    print; 
}
