Scripting Clinic: Slice and Dice Text with Perl - Page 2

By Carla Schroder | Posted Jun 30, 2004
Continued From Page 1

Mass Surgical Repairs
Computers, being overly literal, must be fed precise input. For example, suppose you have a file that needs to have comma-delimited fields. But the poor thing is littered with semi-colons, dashes, and who knows what-all. Here's where Perl really earns its keep. This expression hunts down all non-comma delimiters, and replaces them with a comma:

s/[^\w \n]/,/g ;

This uses a nice feature of Perl called character classes. Character classes are enclosed by square brackets, and there's a lot going on inside these brackets. The caret at the start of the class means "not." \w means "any alphanumeric character or underscore." The space means "and leave spaces alone, too," which keeps it from inserting commas between words, and \n does the same for the newline at the end of each line. Don't use a pipe to separate the items: inside a character class a pipe is just a literal pipe character, not "or." For example, the line "Schroder, Carla Jean; home address- 1234 Main st: Nowheresville" becomes "Schroder, Carla Jean, home address, 1234 Main st, Nowheresville".
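Here is a minimal sketch of that substitution applied to the sample line. The variable name $line is only illustrative, and the class used here, [^\w \n], is one reasonable choice rather than the only one: it replaces anything that is not a word character, a space, or a newline.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample line from the text, with a trailing newline as if read from a file.
my $line = "Schroder, Carla Jean; home address- 1234 Main st: Nowheresville\n";

# Replace every character that is not a word character, a space,
# or a newline with a comma.
$line =~ s/[^\w \n]/,/g;

print $line;
```

Leaving \n out of the class would turn the newline at the end of each line into a comma as well, which is rarely what you want when processing a file line by line.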

That's ever so much more fun than making manual corrections.

Some configuration files are sensitive to leading spaces. There, a leading space acts like a comment. This expression strips the leading whitespace from a line:

s/^\s+// ;
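As a rough illustration, the snippet below strips leading whitespace from a few made-up configuration lines. The sample lines are hypothetical, and the class [ \t] is used in place of \s as a deliberate assumption: it removes only spaces and tabs, so a line that contains nothing but whitespace keeps its newline.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical configuration lines with stray leading spaces and tabs.
my @lines = ("  ServerName example\n", "\tPort 8080\n", "Timeout 30\n");

for my $line (@lines) {
    # The ^ anchor ties the match to the start of the line, so only
    # leading spaces and tabs are removed.
    $line =~ s/^[ \t]+//;
    print $line;
}
```

Because ^ can match only once per line here, the g modifier adds nothing to this expression.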

Character classes give you a powerful tool for fine-grained searching and replacing. See the table below for a listing of Perl's built-in character classes. You can also define your own: put any characters you want between the square brackets. Let's say you hate the numbers 3 and 7:

/[37]/

Or you want to find and delete words containing certain vowels:

/[eoyEOY]/

Remember, Perl is case-sensitive.
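To see custom classes and case sensitivity side by side, here is a small sketch. The word list is invented for the example, and grep is used only as a convenient way to collect the matching words.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative sample words; any list would do.
my @words = qw(Tofu route37 EGG sky 1984);

# Words containing a 3 or a 7.
my @has_3_or_7 = grep { /[37]/ } @words;

# Words containing a lowercase e, o, or y. "EGG" is skipped because
# this class lists only lowercase letters.
my @lower_only = grep { /[eoy]/ } @words;

# The i modifier makes the same class match either case.
my @any_case = grep { /[eoy]/i } @words;

print "3 or 7:     @has_3_or_7\n";
print "e, o, or y: @lower_only\n";
print "any case:   @any_case\n";
```

Listing both cases inside the brackets, as in /[eoyEOY]/, and adding the i modifier to /[eoy]/ give the same matches; which one reads better is a matter of taste.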

Using Expressions In Scripts
While some simple Perl commands can be run from a shell prompt, Perl is used primarily in scripts and programs. For example, to run a simple search-and-replace on a file, first create a script. Let's call it replace-tofu:

1  #!/usr/bin/perl
2  my $file = $ARGV[0];
3  open (FILE,$file) || die "Sorry, I cannot read from $file\n";
4  open (TMP, ">$file.$$") || die "Sorry, I cannot write to $file.$$\n";
5  my $count = 0;
6
7  while (<FILE>) {
8      $count += s/\btofu\b/chocolate/g;   # s///g returns the number of replacements
9      print TMP;
10 }
11
12 close FILE; close TMP;
13 print "I found $count instances of tofu, and changed it to chocolate.\n";
14 rename("$file.$$", $file) || die "Cannot update $file\n";

To use this script, run it like this:

$ ./replace-tofu  dessert.txt

Remember to not copy the line numbers, and to chmod +x the script.

A whole lot of things happen in this little script. Line 2 grabs the file name from the command line; it is the first (and only) argument used.

Line 3 opens the file for reading. die is a Perl function that gives you a quick and easy way to generate an error message on a failure. Line 4 creates a temporary file; that is where changes are initially written. The $$ in its name is the script's process ID, which keeps the temporary file name unique.

Line 5 initializes a counter, which will count how many instances of your search term are found. You don't have to call it "count"; it can be anything you like.

Line 7 is good ole "while", which is the same everywhere. It makes the search-and-replace command loop over each line in the file. Line 8 does the search-and-replace, and counts each occurrence of the search term.

In Line 9, the results are written to the temporary file. Line 12 closes both files. Line 13 prints a summary report to the screen, and Line 14 renames the temporary file over the original file, replacing it.

This is a nice, simple script with a bit of built-in error-checking that you can use for real search-and-replaces, or you can modify it to count things without changing them, or use it to test search expressions.
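One way to make the count-only modification mentioned above is sketched here. The script name count-tofu and the subroutine name count_matches are hypothetical, and the sketch uses lexical filehandles and three-argument open, which differ in style from the script above but behave the same way for this purpose.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count whole-word matches of "tofu" in a file without modifying it.
# Usage (script name is hypothetical): ./count-tofu dessert.txt
sub count_matches {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Sorry, I cannot read from $path: $!\n";
    my $count = 0;
    while (my $line = <$fh>) {
        # In list context, a match with the g modifier returns every
        # match on the line; assigning through () yields how many.
        my $hits = () = $line =~ /\btofu\b/g;
        $count += $hits;
    }
    close $fh;
    return $count;
}

# Only run when a file name was given on the command line.
if (@ARGV) {
    my $file = $ARGV[0];
    print "I found ", count_matches($file), " instances of tofu in $file.\n";
}
```

Since nothing is written, there is no temporary file and no rename step, which also makes this a safe way to test a search expression before using it in a real replacement.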

Glossary
/ =  expression delimiter
^ =  beginning of line
$ =  end of line, when placed before the delimiter
i = case insensitive, placed after the delimiter 
s = substitute
g = global, placed after the delimiter 
\b = word boundary anchor
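
The glossary entries can be tried out together in a short sketch like the one below. The sample lines are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample lines.
my @lines = ("Tofu is bland", "I like tofu", "tofurkey again");

for my $line (@lines) {
    # ^ anchors at the beginning of the line; i ignores case.
    print "starts with tofu (any case): $line\n" if $line =~ /^tofu/i;

    # $ anchors at the end of the line.
    print "ends with tofu: $line\n" if $line =~ /tofu$/;

    # \b on both sides matches tofu only as a whole word.
    print "tofu as a whole word: $line\n" if $line =~ /\btofu\b/;
}
```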

Table Of Built-in Character Classes
\d = [0-9]
\w = [a-zA-Z0-9_]
\s = [ \r\t\n\f] (\r = carriage return, \t = tab, \n = newline, \f = form feed)
\D = [^0-9]
\W = [^a-zA-Z0-9_]
\S = [^ \r\t\n\f]
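
A quick way to get a feel for the built-in classes is to count what each one matches in a sample string. The string below is invented for the example, and the "= () =" construct is simply a common Perl idiom for counting matches.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented sample text containing digits, a colon, a tab, and a newline.
my $text = "Order 66:\tship 3 crates\n";

# Count the characters matched by each class.
my $digits     = () = $text =~ /\d/g;   # digits
my $whitespace = () = $text =~ /\s/g;   # spaces, the tab, the newline
my $non_word   = () = $text =~ /\W/g;   # everything \w does not match

print "digits: $digits, whitespace: $whitespace, non-word: $non_word\n";
```

Note that every whitespace character is also a non-word character, so the \W count is always at least as large as the \s count.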

Did you figure out the answer to the pop quiz? The fake Perl expression is the second one. The first one finds words that are in all caps.

Next Time
For next month's Scripting Clinic, we'll dig into real-life useful Perl examples.

Resources
Start at Perl.org for all kinds of links to howtos, mail lists and archives, and other useful stuff.
