Next: Associative arrays Up: Programming in Perl Previous: Input/Output   Contents

Matching and regular expressions

We mentioned in the beginning that Perl is well suited to text manipulation. A particularly useful feature in this regard is a powerful matching operator that extends naturally to a search-and-replace operator.

Matching is done with the operator =~, which tests a string against a pattern, usually delimited by /. For instance, to print all lines from a file that contain the string CMI, we would write:

   while ($line = <INFILE>) {
     if ($line =~ /CMI/){
        print $line;
     }
   }

More generally, the pattern can be a regular expression. The syntax for regular expressions is similar to that in text editors like vi or emacs or in the Unix grep command. In a regular expression, an ordinary character stands for itself. A sequence of characters in square brackets stands for a choice among those characters. Thus, the pattern /[Cc][Mm][Ii]/ matches any combination of lower and upper case letters that spell cmi: for instance, CMI, CmI, and so on. The character . is special and matches any single character other than a newline, so, for instance, the pattern /[Cc].i/ matches any three-character string that begins with C or c and ends with i. A pattern matches if it occurs anywhere in the string. We can specify a case-insensitive search by appending the modifier i at the end of the pattern.

  if ($line =~ /CMI/i) { print $line; }   # Same test as ($line =~ /[Cc][Mm][Ii]/)
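
As a small illustration of these patterns, the sketch below counts how many lines of a made-up list match each one. The sample lines and variable names are hypothetical, and the built-in grep function stands in for the explicit loop over INFILE.

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for the contents of INFILE.
my @lines = ("Visit CMI today\n", "the cmi campus\n", "Chennai\n", "C-i\n");

# Lines containing the exact string CMI.
my @exact = grep { $_ =~ /CMI/ } @lines;

# Lines containing cmi in any mix of upper and lower case.
my @any_case = grep { $_ =~ /CMI/i } @lines;

# Lines containing C or c, then any one character, then a lower case i.
my @dot = grep { $_ =~ /[Cc].i/ } @lines;

print "exact:    ", scalar(@exact), "\n";
print "any case: ", scalar(@any_case), "\n";
print "dot:      ", scalar(@dot), "\n";
```

Note that "Visit CMI today" does not match /[Cc].i/, because the third character of CMI is an upper case I.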

Perl provides some special abbreviations for commonly used choices of alternatives. The expression \w (for word) represents any of the characters _, a, ..., z, A, ..., Z, 0, ..., 9. The expression \d (for digit) represents 0, ..., 9, while \s represents a whitespace character (such as a space, tab or newline).
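
To see what each abbreviation picks out, the following sketch collects every character of a hypothetical sample string that matches \w, \d and \s. It relies on the g modifier in list context, which returns all matches of the pattern; the string and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample string: two words, a number, a space and a tab.
my $text = "room_42 is\tfree";

# In list context, a match with the g modifier returns every match.
my @word_chars  = ($text =~ /\w/g);   # letters, digits and underscore
my @digits      = ($text =~ /\d/g);   # digits only
my @whitespace  = ($text =~ /\s/g);   # the space and the tab

print "word characters: ", scalar(@word_chars), "\n";
print "digits:          ", scalar(@digits), "\n";
print "whitespace:      ", scalar(@whitespace), "\n";
```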

Repetition is described using * (zero or more repetitions), + (one or more repetitions) and ? (zero or one repetition). For instance, the expression \d+ matches a nonempty sequence of digits, while \s*a\s* matches a single a along with all its surrounding whitespace, if any. More controlled repetition is given by the syntax {m,n}, which specifies between m and n repetitions. Thus \d{6,8} matches a sequence of 6 to 8 digits.
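
The repetition operators can be tried out on a few fixed strings, as in the sketch below. All the strings and variable names are hypothetical. The last test is a reminder that a pattern may match inside a longer string: nine digits in a row still contain a run of 6 to 8 digits.

```perl
use strict;
use warnings;

# Each result is 1 if the pattern matches somewhere in the string, 0 otherwise.
my $plus_yes  = ("abc123" =~ /\d+/)         ? 1 : 0;  # at least one digit present
my $plus_no   = ("abc" =~ /\d+/)            ? 1 : 0;  # no digit at all
my $star_wide = ("x  a  y" =~ /x\s*a\s*y/)  ? 1 : 0;  # \s* absorbs the spaces
my $star_none = ("xay" =~ /x\s*a\s*y/)      ? 1 : 0;  # \s* may match nothing
my $optional  = ("color" =~ /colou?r/)      ? 1 : 0;  # the u is optional
my $five      = ("12345" =~ /\d{6,8}/)      ? 1 : 0;  # too few digits
my $six       = ("123456" =~ /\d{6,8}/)     ? 1 : 0;  # within the range
my $nine      = ("123456789" =~ /\d{6,8}/)  ? 1 : 0;  # the first 8 digits match

print "$plus_yes $plus_no $star_wide $star_none $optional $five $six $nine\n";
```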

A close relative of the match operator is the search and replace operator, which is given by =~ s/pattern/replacement/. For instance, we can replace each tab (\t) in $line by a single space by writing

   $line =~ s/\t/ /;

More precisely, this replaces the first tab in $line by a space. To replace all tabs we have to add the modifier g at the end, as follows.

   $line =~ s/\t/ /g;
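
The difference between the two forms shows up on a line with more than one tab. The sketch below applies each substitution to its own copy of a hypothetical line.

```perl
use strict;
use warnings;

# Hypothetical line containing two tabs.
my $line = "a\tb\tc";

# Without g, only the first tab is replaced.
my $first_only = $line;
$first_only =~ s/\t/ /;

# With g, every tab is replaced.
my $all = $line;
$all =~ s/\t/ /g;

print "first only: [$first_only]\n";
print "all:        [$all]\n";
```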

Often, we need to reuse the portion that was matched in the search pattern in the replacement string. Suppose that we have a file with lines of the form

  phone-number name

which we wish to read and print out in the form

  name phone-number

If we match each line against the pattern /\d+\s*\w.*/, then \d+ matches the phone number, \s* matches the whitespace between the phone number and the first part of the name (which could have many parts), and \w.* matches the rest of the line, containing all parts of the name. We want to reproduce the phone number and the name, which correspond to the first and third portions of the pattern. To do this, we enclose the portions that we want to "capture" in parentheses and then use $1, $2, ... in the replacement to recover each captured portion. (The older notation \1, \2, ... also works in the replacement, but Perl warns about it when warnings are enabled; \1 is meant for referring back to a group inside the pattern itself.) In particular, if $line contains a line of the form phone-number name, to modify it to the new form name phone-number we could write

  $line =~ s/(\d+)\s*(\w.*)/$2 $1/;  # $1 is what \d+ matches,
                                     # $2 is what \w.* matches
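
Here is a minimal sketch of the swap applied to a couple of lines. The phone numbers and names are invented for illustration. The replacement uses $1 and $2, the form Perl recommends for referring to captured portions outside the pattern.

```perl
use strict;
use warnings;

# Hypothetical input lines of the form "phone-number name", already
# stripped of their trailing newlines.
my @entries = ("5550101 Asha Rao", "5550102    R K Iyer");

my @swapped;
foreach my $line (@entries) {
    # $1 holds the digits, $2 holds the name; the spaces between them
    # are matched by \s* but not captured, so they are dropped.
    $line =~ s/(\d+)\s*(\w.*)/$2 $1/;
    push @swapped, $line;
}

print "$_\n" foreach @swapped;
```

Since \s* sits outside both pairs of parentheses, any run of spaces between the number and the name is replaced by the single space written in the replacement.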

One thing to remember is that if we assigned a value to $line using the <> operator, it initially has a trailing newline character. Because . does not match a newline, the second captured group stops just before it, so the newline stays at the end of the line after the substitution above. It would end up between the name and the phone number only if . were allowed to match a newline (with the modifier s) or if the pattern captured it in some other way. The function chomp $line removes the trailing newline from $line, if there is one, and it is good practice to use it to strip unwanted newlines when reading data from a file.
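
A short sketch of chomp in use follows. The line is a hypothetical stand-in for one read with <INFILE>, so it ends in a newline; the second string shows that chomp leaves a string without a trailing newline unchanged.

```perl
use strict;
use warnings;

# Hypothetical line as it would arrive from <INFILE>, newline included.
my $line = "5550101 Asha Rao\n";

chomp $line;                          # drop the trailing newline, if any
$line =~ s/(\d+)\s*(\w.*)/$2 $1/;     # swap number and name
print "[$line]\n";                    # add the newline back when printing

# chomp does nothing to a string that has no trailing newline.
my $clean = "no newline";
chomp $clean;
print "[$clean]\n";
```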


Madhavan Mukund 2004-04-29