Regular Expressions in Perl.

In this section we discuss how to check that information entered by a user is well formed. For example, if the user was prompted for a zip code, we need to check that the entry is a 5-digit number and not arbitrary text; if we expect an e-mail address, the entry should have the form name@server.network.

A simple way to validate an entry is to compare it with a given constant, but that does not help here: we do not know in advance what the user will enter. Another approach applies to numbers: convert the entered string into a number and check that it falls within the expected range. This method is simple, but it does not always work. It fails, for example, for phone numbers. A phone number can be entered as:

  • 123-4567
  • 123 4567
  • 123 45 67
  • 123.45.67
  • and in other ways (not to mention that it may include an area code).

    Thus, we would like a tool that checks whether a string matches a specific pattern.

    Regular expression

    Perl has a very powerful tool for this task: regular expressions. A regular expression lets us check whether a string contains a substring that matches a specific pattern. In this section we do not yet discuss how to use regular expressions; for the time being we only discuss how to write them.

    A simple regular expression uses no special characters; it contains only the literal string you want to find in your test string. To describe a pattern (a set of rules) that the test string should match, we have to use special characters. For example, to find out whether a test string contains a phone number of the form ***-****, we use the special character \d, which matches any single digit and nothing else. The regular expression is then:

    /\d\d\d-\d\d\d\d/;
    
    This regular expression requires the test string to contain three digits, then a dash, and then four more digits. The following tables list other special characters we can use in Perl regular expressions:

    Matching metacharacters

    Character  Matches                                 Example
    \b         Word boundary                           /\bor/ matches "origami" and "or" but not "normal"
                                                       /or\b/ matches "traitor" and "or" but not "perform"
                                                       /\bor\b/ matches "or" only as a separate word
    \B         Word nonboundary                        /\Bor/ matches "normal" but not "origami"
                                                       /or\B/ matches "normal" and "origami" but not "traitor"
                                                       /\Bor\B/ matches "normal" but not "origami" or "traitor"
    \d         Numeral 0 through 9                     /\d\d\d/ matches "212" and "415" but not "B17" or "ABC"
    \D         Nonnumeral                              /\D\D\D/ matches "ABC" and "GEF" but not "B17" or "123"
    \s         Single white space                      /over\sbite/ matches "over bite" but not "overbite" or "over  bite"
    \S         Single nonwhite space                   /over\Sbite/ matches "over-bite" but not "overbite" or "over bite"
    \w         Letter, numeral, or underscore          /A\w/ matches "A1" and "AC" but not "A+"
    \W         Not a letter, numeral, or underscore    /A\W/ matches "A+" but not "AC" or "A2"
    \A         Beginning of the string                 /\AFred/ matches "Fred is OK" but not "I'm with Fred" or "Is Fred here?"
    \Z         End of string or before final newline   /Fred\Z/ matches "I'm with Fred" and "I'm with Fred\n" but not "Fred is OK" or "Is Fred here?"
    \z         End of the string                       /Fred\z/ matches "I'm with Fred" but not "I'm with Fred\n", "Fred is OK", or "Is Fred here?"
    .          Any character except newline            /.../ matches "abC", "12f", "1+ ", or any three characters
    [...]      Character set                           /[AN]BC/ matches "ABC" and "NBC" but not "BBC"
    [^...]     Negated character set                   /[^AN]BC/ matches "BBC" and "CBC" but not "ABC" or "NBC"
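
A few of the claims in the table above can be checked mechanically. The following is a minimal sketch: the list of cases and the variable names are illustrative choices, and the pattern /\Bor\B/ is written out in full here.

```perl
use strict;
use warnings;

# Each entry: [pattern, test string, expected result (1 = match, 0 = no match)].
my @cases = (
    [ qr/\bor/,       'origami',   1 ],
    [ qr/\bor/,       'normal',    0 ],
    [ qr/or\b/,       'traitor',   1 ],
    [ qr/or\b/,       'perform',   0 ],
    [ qr/\Bor\B/,     'normal',    1 ],
    [ qr/\Bor\B/,     'traitor',   0 ],
    [ qr/\d\d\d/,     'B17',       0 ],
    [ qr/over\sbite/, 'over bite', 1 ],
    [ qr/A\w/,        'A+',        0 ],
    [ qr/[^AN]BC/,    'BBC',       1 ],
);

my $failures = 0;
for my $case (@cases) {
    my ($re, $text, $expected) = @$case;
    my $got = ($text =~ $re) ? 1 : 0;      # normalize the result to 1 or 0
    $failures++ if $got != $expected;
    printf "%-20s %-12s %s\n", $re, "'$text'", $got ? 'match' : 'no match';
}
print "Unexpected results: $failures\n";
```

The qr// operator used here simply stores a compiled pattern in a variable so that it can be kept in a list.
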
    Counting metacharacters

    Character  Matches the preceding item    Example
    *          Zero or more times            /Ja*vaScript/ matches "JvaScript", "JavaScript", and "JaaaavaScript" but not "JuvaScript"
    ?          Zero or one time              /Ja?vaScript/ matches "JvaScript" or "JavaScript" but not "JaavaScript"
    +          One or more times             /Ja+vaScript/ matches "JavaScript" or "JaaaavaScript" but not "JvaScript"
    {n}        Exactly n times               /Ja{2}vaScript/ matches "JaavaScript" but not "JvaScript" or "JaaaavaScript"
    {n,}       n or more times               /Ja{2,}vaScript/ matches "JaavaScript" or "JaaaavaScript" but not "JvaScript"
    {n,m}      At least n, at most m times   /Ja{2,3}vaScript/ matches "JaavaScript" or "JaaavaScript" but not "JvaScript" or "JaaaaavaScript"
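
To see how the counting metacharacters differ, the sketch below runs each of them against the same set of spellings. The sample list and the hash names are assumptions made for illustration; the patterns are anchored with ^ and $ so that the whole string has to match.

```perl
use strict;
use warnings;

# Spellings with zero, one, two, three, and four letters "a" after the "J".
my @samples = qw(JvaScript JavaScript JaavaScript JaaavaScript JaaaavaScript);

my %patterns = (
    'a*'     => qr/^Ja*vaScript$/,
    'a?'     => qr/^Ja?vaScript$/,
    'a+'     => qr/^Ja+vaScript$/,
    'a{2}'   => qr/^Ja{2}vaScript$/,
    'a{2,}'  => qr/^Ja{2,}vaScript$/,
    'a{2,3}' => qr/^Ja{2,3}vaScript$/,
);

# For every quantifier, collect the spellings it accepts.
my %accepted;
for my $name (sort keys %patterns) {
    $accepted{$name} = [ grep { $_ =~ $patterns{$name} } @samples ];
    print "$name: @{ $accepted{$name} }\n";
}
```
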
    Positional metacharacters

    Character  Matches                                 Example
    ^          Beginning of the string                 /^Fred/ matches "Fred is OK" but not "I'm with Fred" or "Is Fred here?"
    $          End of string or before final newline   /Fred$/ matches "I'm with Fred" and "I'm with Fred\n" but not "Fred is OK" or "Is Fred here?"

    For example, if you want to make sure that a Roman numeral is matched only when it is at the start of a line and is followed by a dot, you check for the match

       /^[IVXLCDM]+\./
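
As a quick illustration of the anchored pattern, the sketch below filters a few made-up outline lines. The sample lines are hypothetical, and the character set includes L so that all seven Roman digits are covered.

```perl
use strict;
use warnings;

# Hypothetical lines; only some start with a Roman numeral followed by a dot.
my @lines = (
    'IV. Regular expressions',
    'Chapter IV. Regular expressions',
    'XII. String replacement',
    'MIX the ingredients',
);

# ^ anchors the match at the start of the string; \. is a literal dot.
my @numbered = grep { /^[IVXLCDM]+\./ } @lines;
print "$_\n" for @numbered;
```

Note that the pattern only checks which characters are used, not whether they form a valid numeral, so a line starting with "IIII." would be accepted as well.
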

    Not to be confused with the metacharacters listed in the tables above are the escape sequences for the following characters:

    Symbol    Escape sequence  Description
    tab       \t               Tabulation symbol
    newline   \n               New line symbol
    return    \r               Carriage return
    formfeed  \f               Formfeed symbol (printer command)
    vtab      \v               Vertical tabulation symbol (since Perl 5.10, \v in a pattern matches any vertical whitespace character)
    .         \.               Dot
    ^         \^               Caret symbol
    $         \$               Dollar sign
    \         \\               Backslash
    /         \/               Slash
    -         \-               Dash
    (         \(               Open parenthesis
    )         \)               Close parenthesis
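
The effect of escaping is easiest to see side by side. In the sketch below the sample text is made up; the alternative delimiters m{...} and the built-in function quotemeta are not discussed in this section and are shown only as a convenience.

```perl
use strict;
use warnings;

my $text = 'Price: $9.99 (see docs/prices.txt)';

# \$ and \. match a literal dollar sign and a literal dot.
my $has_price = $text =~ /\$\d+\.\d\d/ ? 1 : 0;

# Unescaped, the dot matches any character, so "$9x99" is accepted too.
my $loose = '$9x99' =~ /\$\d+.\d\d/ ? 1 : 0;

# A slash must be escaped inside /.../; other delimiters avoid that.
my $has_path_a = $text =~ /docs\/prices\.txt/ ? 1 : 0;
my $has_path_b = $text =~ m{docs/prices\.txt} ? 1 : 0;

# quotemeta escapes every non-word character of a literal string.
my $literal = quotemeta('a.b(c)');

print "$has_price $loose $has_path_a $has_path_b $literal\n";
```
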

    Modifiers

    When we create a regular expression we can specify one or more modifiers that change the behavior of the pattern. For example, the modifier i makes matching case-insensitive: the pattern /bob/i matches "Bob", "bob", and "Is BOB here?".
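
The following sketch shows three commonly used modifiers. Only i appears in the text above; g is the global modifier, and x is added here as an extra illustration. All sample strings are made up.

```perl
use strict;
use warnings;

# i: ignore case.
my @hits = grep { /bob/i } ('Bob', 'bob', 'Is BOB here?', 'Rob');

# x: whitespace and comments inside the pattern are ignored,
# which makes longer patterns easier to read.
my $phone_re = qr/
    \d{3}   # three digits
    -       # a dash
    \d{4}   # four digits
/x;
my $found = 'Call 123-4567 today' =~ $phone_re ? 1 : 0;

# g: find every match; in list context all matches are returned.
my @numbers = 'a1b22c333' =~ /\d+/g;

print scalar(@hits), " $found @numbers\n";
```
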

    Using Regular Expressions

    To check whether a string contains a match, Perl uses the special operator =~. The usual syntax is:
    $variable =~ m/regular_expression/
    The result of this operator is either true (if there is a match) or false (no match). The following example checks whether a string contains a phone number:
    $str = "My phone number is 123-3445. This is my home phone.";
    if( $str =~ /\d{3}-\d{4}/ ){
       print "There is a phone number in the string '$str'\n";
    }
    else{
       print "There is not a phone number in the string '$str'\n";
    }
    
    As you can see, the character m may be omitted (m stands for "match").
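
The same operator can validate the phone formats listed at the start of this section. This is only a sketch: the set of accepted separators (dash, dot, space) and the rule that the first separator is required are assumptions, not a general phone number format.

```perl
use strict;
use warnings;

# Three digits, a separator, then four digits optionally split into two pairs.
my $phone_re = qr/^\d{3}[-. ]\d{2}[-. ]?\d{2}$/;

my @inputs = ('123-4567', '123 4567', '123 45 67', '123.45.67',
              '1234-567', 'phone');

# Keep only the inputs in which the whole string matches the pattern.
my @valid = grep { $_ =~ $phone_re } @inputs;
print "valid: $_\n" for @valid;
```

Because the pattern is anchored with ^ and $, the whole entry has to be a phone number; without the anchors any string that merely contains one would pass. The $ anchor still tolerates a trailing newline; \z is stricter.
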

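    Often we need to know not only whether there is a match, but also which substring matched the pattern. Perl provides several special variables for that purpose: $& holds the matched substring, $` holds the part of the string before the match, and $' holds the part after the match.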

    Let's modify the previous example:
    $str = "My phone number is 123-3445. This is my home phone.";
    if( $str =~ /\d{3}-\d{4}/ ){
       print "There is a phone number in the string '$str'\n";
       print "The phone is: $&\n";
       print "      Before: '$`'\n";
       print "       After: '$''\n";
    }
    else{
       print "There is not a phone number in the string '$str'\n";
    }
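
In older Perl versions, mentioning $&, $` or $' anywhere in a program could slow down every pattern match. A hedged alternative, assuming Perl 5.10 or later, is the p modifier together with ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH}; the arrays @- and @+ additionally give the offsets of the match.

```perl
use strict;
use warnings;

my $str = "My phone number is 123-3445. This is my home phone.";
my ($before, $match, $after, $start, $end);

# The p modifier preserves the matched text for the ${^...} variables.
if ($str =~ /\d{3}-\d{4}/p) {
    ($before, $match, $after) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
    ($start, $end) = ($-[0], $+[0]);   # start and end offsets of the match
    print "The phone is: $match\n";
    print "      Before: '$before'\n";
    print "       After: '$after'\n";
    print "     Offsets: $start to $end\n";
}
```
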
    

    If we want to find all matches in a string, we can use the operator =~ in a loop, each time setting the variable $str to the substring to the right of the match:

    $str = "My phones: 123-3456 (home), 234-4557 (office), 456-4564 (cell).";
    while( $str =~ /\d{3}-\d{4}/ ){
       print "The phone is: $&\n";
       $str = $';
    }
    
    Alternatively, we can use the global modifier g and the operator =~ in list context. When the result is assigned to an array, the operator =~ returns the list of all matches:
    $str = "My phones: 123-3456 (home), 234-4557 (office), 456-4564 (cell).";
    @phones = $str =~ /\d{3}-\d{4}/g;
    foreach $phone (@phones){
       print "$phone\n";
    }
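
A third option, sketched below, combines the loop with the g modifier. In scalar context each evaluation of a /g match continues where the previous one stopped, so the original string does not have to be modified. The array names are illustrative.

```perl
use strict;
use warnings;

my $str = "My phones: 123-3456 (home), 234-4557 (office), 456-4564 (cell).";
my (@phones, @ends);

while ($str =~ /\d{3}-\d{4}/g) {
    push @phones, $&;          # the text of the current match
    push @ends,   pos($str);   # offset where the next search will start
}

print "$phones[$_] ends at offset $ends[$_]\n" for 0 .. $#phones;
```
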

    In addition to the operator =~, Perl has the operator !~, which returns true if there is no match and false otherwise.
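
For instance, !~ is convenient for collecting entries that fail a check. The sketch below uses the zip code case from the beginning of this section; the list of entries is made up.

```perl
use strict;
use warnings;

# Hypothetical form input: keep only entries that are NOT a 5-digit zip code.
my @entries  = ('12345', '1234', 'abcde', '90210');
my @rejected = grep { $_ !~ /^\d{5}$/ } @entries;

print "Rejected: @rejected\n";
```
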

    Getting information about a match

    We can not only verify that a one-field date entry is in the desired format, but also extract components of the match. To get a piece of the substring that matches the pattern, we enclose the corresponding part of the pattern in parentheses. Note that parentheses are themselves special symbols inside patterns and do not match any character in the string. If a pattern includes one or more sets of parentheses, Perl places the substring matched by each set in the special variables $1, $2, $3, etc., numbered by the position of the opening parenthesis (Perl does not stop at $9).

    For example, if we are checking that a date was entered in either "mm/dd/yyyy" or "mm-dd-yyyy" format and also need the values of the month, day, and year, we can use the following regular expression:

    $today = "1/24/2003";
    if( $today =~ /\b(1[0-2]|0?[1-9])[\-\/](0?[1-9]|[12][0-9]|3[01])[\-\/]((19|20)\d{2})/ ){
       print "Date: $&\n";
       print "Month: $1\n";
       print "Day: $2\n";
       print "Year: $3\n";
       print "Century: $4\n";
    }
    
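    Let's take a closer look at this expression. It consists of three parts:

  • (1[0-2]|0?[1-9]) matches the month: 10 through 12, or 1 through 9 with an optional leading zero;
  • (0?[1-9]|[12][0-9]|3[01]) matches the day: 1 through 9 with an optional leading zero, 10 through 29, or 30 and 31;
  • ((19|20)\d{2}) matches a four-digit year that starts with 19 or 20; the inner parentheses capture the century in $4.

    Combining these three parts and allowing either a dash or a slash as the separator ([\-\/]), we come up with the code above.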
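
Instead of reading $1, $2, and $3 afterwards, the captured groups can be assigned directly, because a successful match in list context returns them. The sketch below assumes the same date format; it uses m{...} delimiters so that the slash needs no escaping, and a non-capturing group (?:...) for the century so that only three values are returned. Neither is required by the original pattern.

```perl
use strict;
use warnings;

my $today = "1/24/2003";

# In list context the match returns the captured month, day, and year.
my ($month, $day, $year) =
    $today =~ m{\b(1[0-2]|0?[1-9])[-/](0?[1-9]|[12][0-9]|3[01])[-/]((?:19|20)\d{2})};

if (defined $month) {
    # Print the date in year-month-day order with leading zeros.
    printf "%04d-%02d-%02d\n", $year, $month, $day;
}
else {
    print "Not a date: $today\n";
}
```
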

    String Replacement

    Let's consider a small example about credit card numbers. A credit card number can be entered as
  • 6432-2345-2342-2342
    or
  • 6432 2345 2342 2342
    or
  • 6432234523422342
    Our goal is to recognize a valid number and transform it to the first form. Let's use the following regular expression:
    $card = "6432 23452342-2342";
    if( $card =~ /(\d\d\d\d)[\-\s]?(\d\d\d\d)[\-\s]?(\d\d\d\d)[\-\s]?(\d\d\d\d)/ ){
       $card = "$1-$2-$3-$4";
    }
    else{
       $card = "Invalid credit card number!";
    }
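
The pattern above is not anchored, so it also accepts a longer string that merely contains sixteen digits. A stricter variant, sketched below, anchors the pattern with \A and \z. The helper name normalize_card is hypothetical, and the check covers only the layout of the digits, not whether the number is a real card number.

```perl
use strict;
use warnings;

# Returns the number as dddd-dddd-dddd-dddd, or undef if the input
# is not exactly four 4-digit groups with optional separators.
sub normalize_card {
    my ($input) = @_;
    return "$1-$2-$3-$4"
        if $input =~ /\A(\d{4})[-\s]?(\d{4})[-\s]?(\d{4})[-\s]?(\d{4})\z/;
    return;
}

for my $input ('6432 23452342-2342', '16432-2345-2342-23429', 'abc') {
    my $card = normalize_card($input);
    print defined($card) ? "$card\n" : "Invalid credit card number: $input\n";
}
```
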
    

    To replace the part of a string that matches a regular expression, we can use the substitution operator s///. It takes a pattern (between the first and second slash) and a replacement string (between the second and third slash). The operator =~ performs the substitution when used with s///, and the result is true if a substitution was made. Thus, the following example replaces the first 4 digits of a credit card number with stars:

    $card = "6432 23452342-2342";
    if( $card =~ s/\d{4}/****/ ){
       print "$card\n";
    }
    else{
       print "Sorry, there is no match\n";
    }
    
    When used with the modifier g, such an expression substitutes all matches in the string. In the following example we first bring the card number into the normal form and then replace all digits but the last 4 with stars:
    $card = "6432 23452342-2342";
    $card =~ s/(\d\d\d\d)[\-\s]?/$1-/g; # separate 4-digit groups with dashes
    $card =~ s/-$//;                    # remove the trailing dash
    $card =~ s/\d{4}-/****-/g;          # substitute 4-digit groups with 4 stars
    print "$card\n";
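
The three substitutions can be wrapped in a subroutine so that the caller's variable is left untouched. This is a sketch only; mask_card is a hypothetical helper name. The last lines show that s/// returns the number of substitutions it made.

```perl
use strict;
use warnings;

sub mask_card {
    my ($card) = @_;                  # work on a copy of the argument
    $card =~ s/(\d{4})[-\s]?/$1-/g;   # separate 4-digit groups with dashes
    $card =~ s/-$//;                  # remove the trailing dash
    $card =~ s/\d{4}-/****-/g;        # mask every group followed by a dash
    return $card;
}

print mask_card('6432 23452342-2342'), "\n";
print mask_card('6432234523422342'), "\n";

# In scalar context s///g yields the number of replacements.
my $text     = '6432-2345-2342-2342';
my $replaced = ($text =~ s/\d{4}-/****-/g);
print "$replaced groups masked\n";
```
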
    

    Please consult the Perl regular expression documentation (perlre and perlretut) for more details.