Scripting Clinic: Slice and Dice Text with Perl

By Carla Schroder | Jun 30, 2004
http://www.enterprisenetworkingplanet.com/netsysm/article.php/3375561/Scripting-Clinic-Slice-and-Dice-Text-with-Perl.htm

No no, don't run away just because I said "Perl"! Perl is a really nice scripting language, don't be afraid. Pretend you have never been abused by mean people on the Perl mailing lists. Pretend that you have never looked at a long, complex Perl program and wondered if it was really a program, or a random collection of characters generated as a practical joke. The key to using Perl effectively is to focus on two basic principles:

  • Learn a few basic Perl tools well. You can do a lot with a little in Perl. Don't worry about the show-offs who flaunt their strange, arcane Perl knowledge, just stay focused on the tasks you need to accomplish.
  • Write your Perl code for clarity and human understanding. Some Perl geeks love to compete to write the most obfuscated code. You're welcome to join in; but for writing easy-to-maintain scripts, clarity is the way to go.

Today we'll take a Real People look at Perl, and learn some Perl tricks for doing useful text searches and replacements.

Quiz
First, a pop quiz: Which line is a Perl expression, and which one is a random collection of characters?

@caps= m/(\b[^\Wa-z0-9_]+\b)/g;

$a//\*/\||+=#//\n

Stay tuned for the answer.

Simple String Searches
Nothing beats Perl regular expressions for searching text in any way you can imagine. Here are some simple string comparisons.

This is a simple search for a specific string:

/search-term/;

That's a nice easy search, but we can nail it down more precisely. Adding the caret looks for our search string at the beginning of lines:

 /^search-term/;

Lose the caret and add a dollar sign to find your search string at the end of the line:

 /search-term$/;

To do a case-insensitive search, add i:

/search-term$/i;

Combine the caret and dollar sign to find lines that consist only of the search string:

 /^search-term$/;

This conducts a whole-word search; the previous examples will return any matching string, even if it is inside another word:

/\bsearch-term\b/ ;
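
To see how the anchors, the i switch, and \b change what gets matched, it can help to run all six patterns over the same handful of strings. This is only a sketch: the sample lines and the matching() helper are made up for illustration, and "tofu" stands in for search-term.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample lines; "tofu" stands in for search-term.
my @lines = ("tofu for dinner", "more tofu", "Tofu", "tofutti bar", "tofu");

# Return the candidates that match the given pattern.
sub matching {
    my ($pattern, @candidates) = @_;
    return grep { $_ =~ $pattern } @candidates;
}

my @anywhere = matching(qr/tofu/,     @lines);  # plain search, case-sensitive
my @at_start = matching(qr/^tofu/,    @lines);  # beginning of line
my @at_end   = matching(qr/tofu$/,    @lines);  # end of line
my @nocase   = matching(qr/tofu$/i,   @lines);  # end of line, any case
my @exact    = matching(qr/^tofu$/,   @lines);  # the whole line
my @word     = matching(qr/\btofu\b/, @lines);  # whole word only

print "anywhere: ", scalar(@anywhere), "\n";  # 4 (misses "Tofu")
print "at start: ", scalar(@at_start), "\n";  # 3
print "at end:   ", scalar(@at_end),   "\n";  # 2
print "no case:  ", scalar(@nocase),   "\n";  # 3 (now catches "Tofu")
print "exact:    ", scalar(@exact),    "\n";  # 1
print "word:     ", scalar(@word),     "\n";  # 3 (skips "tofutti bar")
```

Note that the plain search happily matches "tofutti bar", while the \b version does not.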

Search And Replace
Well that was easy and fun. So what are you going to do with those search strings when you find them? You could have Perl replace them:

 s/\bGeorge Bush\b/Anyone At All/;

This replaces only the first instance it finds in the text being searched (the first one on each line, when you process a file line by line). To replace all occurrences, add the /g switch:

s/\btofu\b/chocolate/g;
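
Here is a small sketch of the difference the g switch makes. The sample sentence is invented, and the =~ operator is used to aim the substitution at a named variable instead of Perl's default variable. As a bonus, s///g hands back the number of replacements it made.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $once = "tofu, tofu, and more tofu";   # hypothetical sample text
my $all  = $once;

# Without g, only the first match is replaced.
$once =~ s/\btofu\b/chocolate/;

# With g, every match is replaced; the return value is the number of replacements.
my $replaced = ($all =~ s/\btofu\b/chocolate/g);

print "$once\n";       # chocolate, tofu, and more tofu
print "$all\n";        # chocolate, chocolate, and more chocolate
print "$replaced\n";   # 3
```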

This is a huge time-saver for anyone who needs to do a global search-and-replace in a batch of files, like Web pages. Run it from the command line, substituting your own text to search and replace:

$ perl  -e 's/string/stringier/gi'  -p  -i.bak  *.html

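perl -e means "run this command." The command is enclosed in single quotes. -p loops over every line of the input files and prints each line after the command has run on it. -i.bak edits the files in place and keeps backup copies of the originals with a .bak extension, and *.html means all .html files in the current directory.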
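
If the switches feel like magic, it may help to see roughly what they stand for. The sketch below approximates the one-liner as a script; the replace_in_place name is made up for illustration, and the real -p and -i switches do a little more bookkeeping than this.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Roughly what "perl -e 's/string/stringier/gi' -p -i.bak FILES" does.
sub replace_in_place {
    my @files = @_;
    local $^I   = '.bak';    # like -i.bak: edit in place, keep a .bak copy
    local @ARGV = @files;    # the files that <> will read
    while (<>) {             # like -p: loop over every line...
        s/string/stringier/gi;
        print;               # ...and print it, into the file being edited
    }
}

replace_in_place(@ARGV) if @ARGV;
```

Try it on a scratch copy first. The .bak files are your safety net, but only for the first run; a second run overwrites them.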


Mass Surgical Repairs
Computers, being overly literal, must be fed precise input. For example, suppose you have a file that needs to have comma-delimited fields. But the poor thing is littered with semi-colons, dashes, and who knows what-all. Here's where Perl really earns its keep. This expression hunts down all non-comma delimiters, and replaces them with a comma:

s/[^\w\s]/,/g;

This uses a nice feature of Perl called character classes. Character classes are enclosed by square brackets. There's a lot going on inside these brackets. The caret inside the square brackets means "not." \w means "any alpha-numeric character or underscore." \s means "any whitespace," so spaces, and the newline at the end of each line, are left alone. This keeps it from inserting commas between phrases. For example, the line "Schroder, Carla Jean; home address- 1234 Main st: Nowheresville" becomes "Schroder, Carla Jean, home address, 1234 Main st, Nowheresville"

That's ever so much more fun than making manual corrections.
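
Before trusting a character class with real data, it is worth running it on a sample line. This sketch uses the class [^\w\s] ("neither a word character nor whitespace") on the sample address, then shows a common trap: inside square brackets a pipe does not mean "or," it is just a literal pipe character. The variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "Schroder, Carla Jean; home address- 1234 Main st: Nowheresville\n";

# Anything that is neither a word character nor whitespace becomes a comma.
(my $fixed = $line) =~ s/[^\w\s]/,/g;
print $fixed;   # Schroder, Carla Jean, home address, 1234 Main st, Nowheresville

# Inside a class, | is an ordinary character: here it is protected from
# replacement along with the word characters, while the semicolon is not.
(my $piped = "a|b;c") =~ s/[^\w|]/,/g;
print "$piped\n";   # a|b,c
```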

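Some configuration files are sensitive to leading spaces. The leading space acts like a comment. This expression gets rid of all leading whitespace:

 s/^\s+//;

The /g switch is not needed here, because the caret matches only at the very start of the line. Be aware that \s also matches the newline, so a line containing nothing but whitespace vanishes entirely; use [ \t]+ instead of \s+ if you want to keep blank lines.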
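
One thing to watch when stripping leading whitespace: \s matches the newline too, so a line that holds nothing but spaces disappears completely. The sketch below compares \s+ with the narrower [ \t]+ on a few invented lines; strip_blanks and strip_all are hypothetical helper names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: spaces, tabs, a whitespace-only line, and a plain line.
my @lines = ("   indented\n", "\t\ttabbed\n", "  \n", "plain\n");

# Strip leading spaces and tabs only; the newline survives.
sub strip_blanks {
    my ($line) = @_;
    $line =~ s/^[ \t]+//;
    return $line;
}

# Strip any leading whitespace, newline included.
sub strip_all {
    my ($line) = @_;
    $line =~ s/^\s+//;
    return $line;
}

my $kept      = join '', map { strip_blanks($_) } @lines;
my $swallowed = join '', map { strip_all($_) } @lines;

print $kept;        # four lines; the third is now empty
print "--\n";
print $swallowed;   # three lines; the whitespace-only line is gone
```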

Character classes give you a powerful tool for doing fine-grained search and replacing. See the table below for a listing of Perl's built-in character classes. You can also define your own: just stick anything you want between the square brackets. Let's say you hate the numbers 3 and 7:

/[37]/

Or you want to find anything containing certain vowels, in both upper and lower case:

/[eoyEOY]/

Remember, Perl is case-sensitive.
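
A homemade class matches any one of the characters between the brackets, so the quickest way to get a feel for it is to filter a list. This sketch uses an invented word list to show both classes, plus what happens when the capital letters are left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @words = qw(apple kiwi Egg sky plum 1984 73);   # hypothetical sample data

my @threes_sevens = grep { /[37]/ }      @words;   # contains a 3 or a 7
my @vowels        = grep { /[eoyEOY]/ }  @words;   # e, o, or y in either case
my @lower_only    = grep { /[eoy]/ }     @words;   # lower case only

print "@threes_sevens\n";   # 73
print "@vowels\n";          # apple Egg sky
print "@lower_only\n";      # apple sky
```

"Egg" drops out of the last list because its only e is a capital.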

Using Expressions In Scripts
While some simple Perl commands can be run from a shell prompt, Perl is used primarily in scripts and programs. For example, to run a simple search-and-replace on a file, first create a script. Let's call it replace-tofu:

1  #!/usr/bin/perl
2  my $file = $ARGV[0];
3  open (FILE, $file) || die "Sorry, I cannot read from $file\n";
4  open (TMP, ">$file.$$") || die "Sorry, I cannot write to $file.$$\n";
5  my $count = 0;
6
7  while (<FILE>) {
8  $count += s/\btofu\b/chocolate/g;   # s///g returns the number of replacements
9  print TMP;
10 }
11
12 close FILE; close TMP;
13 print "I found $count instances of tofu, and changed it to chocolate.\n";
14 rename ("$file.$$", $file) || die "Cannot update $file\n";

To use this script, run it like this:

$ ./replace-tofu  dessert.txt

Remember to not copy the line numbers, and to chmod +x the script.

A whole lot of things happen in this little script. Line 2 grabs the file name from the command line; it is the first (and only) argument used.

Line 3 opens the file for reading. die is a Perl function that gives you a quick and easy way to generate an error message on a failure. Line 4 creates a temporary file; that is where changes are initially written. Its name is the original file name plus the script's process ID, which Perl keeps in $$.

Line 5 initializes a counter, which will count how many instances of your search term are found. You don't have to call it "count", it can be anything you like.

Line 7 is good ole "while", which is the same everywhere. It reads the file one line at a time, so the search-and-replace command runs on each line in turn. Line 8 does the search-and-replace and updates the counter.

In Line 9, the results are copied to the temporary file. Line 12 closes both files. Line 13 prints a summary report to the screen, and Line 14 renames the temp file over the original file, replacing it.

This is a nice, simple script with a bit of built-in error-checking that you can use for real search-and-replaces, or you can modify it to count things without changing them, or use it to test search expressions.
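
As one example of such a modification, here is a sketch of a count-only variant that reports matches and leaves the file untouched. The count_matches name is made up for illustration, and it uses lexical filehandles and three-argument open, which are a stylistic choice rather than anything the script above requires.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count every match of a pattern in a file without changing the file.
sub count_matches {
    my ($file, $pattern) = @_;
    open(my $in, '<', $file) or die "Sorry, I cannot read from $file: $!\n";
    my $count = 0;
    while (my $line = <$in>) {
        # In scalar context, m//g finds one match per pass through the loop.
        $count++ while $line =~ /$pattern/g;
    }
    close $in;
    return $count;
}

if (@ARGV) {
    my $file  = $ARGV[0];
    my $count = count_matches($file, qr/\btofu\b/);
    print "I found $count instances of tofu in $file.\n";
}
```

Swap in any expression you want to test in place of \btofu\b, and run it the same way as replace-tofu.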

Glossary
/ =  expression delimiter
^ =  beginning of line
$ =  end of line, when placed before the delimiter
i = case insensitive, placed after the delimiter 
s = substitute
g = global, placed after the delimiter 
\b = word boundary anchor

Table Of Built-in Character Classes
\d  =  [0-9]
\w  =  [a-zA-Z0-9_]
\s  =  [ \r\t\n\f]  (\r = carriage return, \t = tab, \n = newline, \f = formfeed)
\D  =  [^0-9]
\W  =  [^a-zA-Z0-9_]
\S  =  [^ \r\t\n\f]

Did you figure out the answer to the pop quiz? The fake Perl expression is the second one. The first one finds words that are in all caps.
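
You don't have to take that on faith. The sketch below runs the quiz expression against an invented sentence; the only change is that =~ points it at a named variable. The class [^\Wa-z0-9_] means "a word character that is not a lowercase letter, a digit, or an underscore," which leaves only capital letters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "NASA and the FBI met at noon, OK?";   # hypothetical sample text

# In list context, m//g returns every captured match.
my @caps = $text =~ m/(\b[^\Wa-z0-9_]+\b)/g;

print "@caps\n";   # NASA FBI OK
```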

Next Time
For next month's Scripting Clinic, we'll dig into real-life useful Perl examples.

Resources
Start at Perl.org for all kinds of links to howtos, mail lists and archives, and other useful stuff.