Science Library - free educational site

PERL Compatible Regular Expressions

Regular expressions are a technique available in many programming languages. They are a system for matching patterns in strings, useful for identifying elements and making conditional substitutions automatically. One application we use them for is redirecting URLs according to set patterns.

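PHP has supported two flavours of regular expression: POSIX Extended and PCRE (Perl-Compatible Regular Expressions). PCRE is more powerful but more complex than POSIX, and is now the standard choice: the POSIX (ereg) functions were deprecated in PHP 5.3 and removed in PHP 7.0.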

The preg_match(pattern, subject) function returns 1 if the pattern matches the subject, 0 if it does not, and false if an error occurs. The function stops once a single match has been found. To find all matches, use preg_match_all. Place the pattern in single quotation marks to avoid problems with escape sequences such as \n, which PHP itself interprets within double quotation marks.
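As a minimal sketch of the difference between the two functions, and of why single quotes are the safer choice, consider the following. The sample text and variable names are assumptions made up for illustration.

```php
<?php
// Hypothetical sample text, chosen only for illustration.
$text = 'cat, dog, cat, bird, cat';

// preg_match() stops at the first match: 1 = found, 0 = not found.
$found = preg_match('/cat/', $text);

// preg_match_all() keeps searching and returns the number of matches.
$count = preg_match_all('/cat/', $text);

// In single quotes, \n reaches the regex engine as backslash + n.
// In double quotes, PHP converts it to a newline before PCRE sees it.
$singleQuoted = '/\n/';   // 4 characters: / \ n /
$doubleQuoted = "/\n/";   // 3 characters: / newline /

var_dump($found, $count, strlen($singleQuoted), strlen($doubleQuoted));
// int(1) int(3) int(4) int(3)
```

Both quoted patterns happen to match a newline here, but with other sequences (such as \b or \$) the two quoting styles can behave very differently.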

Delimiters: these are characters which indicate where the pattern starts and ends. A delimiter may be any character that is not alphanumeric, a backslash, or whitespace. Use the same delimiter to mark both ends of the pattern. A common delimiter is the forward slash (/). If / itself is to be matched within the string, either escape it (\/) or use a different delimiter, such as # or !.
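The choice of delimiter matters most when matching URLs. The following sketch compares the two approaches; the URL and variable names are hypothetical.

```php
<?php
// Hypothetical URL used only for illustration.
$url = 'https://example.com/docs/regex';

// With / as the delimiter, every literal slash must be escaped.
$withSlash = preg_match('/^https:\/\/example\.com\/docs\//', $url);

// With # as the delimiter, the slashes can be written as they are.
$withHash = preg_match('#^https://example\.com/docs/#', $url);

var_dump($withSlash, $withHash);
// int(1) int(1)
```

Both patterns are equivalent; the second is simply easier to read.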

Example

$name = "Frau Vladimira Fang";

if (preg_match('/Frau/', $name)) {
    print 'Sehr geehrte Frau ...';
} else if (preg_match('/Signora/', $name)) {
    print 'Gentile Signora...';
} else {
    print 'Dear Madam...';
}

Patterns

Literals

These are characters which are interpreted 'literally', exactly as they are written. Matching is case-sensitive by default: /frau/ will not match 'Frau' unless the i modifier is added (/frau/i).

Meta-Characters

Some characters have a meaning beyond their literal value. These meta-characters have the following special meanings within regular expressions:

Meta-Character    Interpretation
\                 escapes the following character
^                 (caret) indicates string start
$                 indicates string end
.                 any single character except newline
\Q .. \E          escapes all characters in between
|                 (vertical bar, or pipe) or
[ .. ]            start .. end of class
( .. )            start .. end of subpattern
{ .. }            start .. end of quantifier
\\\\              matches a single literal backslash (as typed in a PHP string)
?                 0 or 1
*                 0 or more
+                 1 or more
{n}               exactly n occurrences
{n,m}             between n and m occurrences
{n,}              at least n occurrences

Meta-characters can be used singly or in combination to make simple or complex regular expression patterns. To match a meta-character literally, escape it with a backslash (\).

e.g.: the pattern e.g will match both 'e.g' and 'egg', because the unescaped . stands for any single character, while e\.g will match only 'e.g'.
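The effect of escaping can be checked directly. This is a small sketch with made-up variable names; preg_quote() is PHP's built-in helper for escaping meta-characters automatically, and \Q .. \E does the same job inside the pattern itself.

```php
<?php
$unescaped = '/e.g/';    // . stands for any single character
$escaped   = '/e\.g/';   // \. stands for a literal full stop

$a = preg_match($unescaped, 'egg');   // 1
$b = preg_match($unescaped, 'e.g');   // 1
$c = preg_match($escaped, 'egg');     // 0
$d = preg_match($escaped, 'e.g');     // 1

// preg_quote() escapes meta-characters (and the given delimiter) for us.
$quoted = preg_quote('e.g', '/');     // e\.g

// \Q .. \E treats everything in between as literal text.
$literal   = preg_match('/^\Q1+1=2\E$/', '1+1=2');   // 1
$asPattern = preg_match('/^1+1=2$/', '1+1=2');       // 0, because + is a quantifier

var_dump($a, $b, $c, $d, $quoted, $literal, $asPattern);
```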

The pattern ^<p> will match a string that begins with a p tag, and <p>$ will match a string that ends with a p tag.

preg_match('/theat(re|er)/', $string) will allow for English ('theatre') and digressions thereof ('theater').

preg_match('/colou?r/', $string) will allow both spellings.

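preg_match('/^\$.+00$/', $string) will check for an amount of any length that starts with a dollar sign and ends with '00'. The dollar sign must be escaped (\$) to be matched literally, since an unescaped $ means string end.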

preg_match('/^(0|1|2|3|4|5|6|7|8|9){8}$/', $string) will check that an 8-digit number has been entered.

preg_match('/^(0|1|2|3|4|5|6|7|8|9){4,}$/', $string) checks that there are at least 4 digits.

preg_match('/^(0|1|2|3|4|5|6|7|8|9){4,6}$/', $string) checks that there are at least 4 digits, and not more than 6.
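The ^ and $ anchors in these patterns do as much work as the quantifier. A brief sketch, with hypothetical test values, shows what happens when they are left out:

```php
<?php
$anchored   = '/^(0|1|2|3|4|5|6|7|8|9){8}$/';
$unanchored = '/(0|1|2|3|4|5|6|7|8|9){8}/';

$exact   = preg_match($anchored, '12345678');      // 1: exactly 8 digits
$tooLong = preg_match($anchored, '123456789');     // 0: 9 digits
$slipsBy = preg_match($unanchored, '123456789');   // 1: 8 digits found inside the 9

$range   = '/^(0|1|2|3|4|5|6|7|8|9){4,6}$/';
$inRange = preg_match($range, '12345');            // 1: 5 digits
$over    = preg_match($range, '1234567');          // 0: 7 digits

var_dump($exact, $tooLong, $slipsBy, $inRange, $over);
```

Without the anchors, the pattern only asks whether eight digits occur somewhere in the string, which is rarely what a validation check intends.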

Character Classes

To save having to write all possible values in a PCRE, character classes may be used.

Class             Code (shortcut)   Interpretation
[0-9]             \d                Numerals (single digits)
[A-Za-z]                            Alphabet
[^A-Z]                              Not capital letters
[A-Za-z0-9_]      \w                Any word character (letters, digits, underscore)
[^A-Za-z0-9_]     \W                Not a word character
[^ \f\r\t\n\v]    \S                Not white space

Within classes, all meta-characters are literal, except for \ (escape), ^ (negation, only when it is the first character), - (range) and ] (end of class).
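A few quick checks illustrate how classes and their shortcuts behave. The sample strings are assumptions chosen for illustration; note in particular that \w counts the underscore as a word character.

```php
<?php
// \d{8} is a compact way to ask for exactly eight digits.
$eightDigits = preg_match('/^\d{8}$/', '20160305');       // 1

// The underscore is a word character, the hyphen is not.
$snake = preg_match('/\W/', 'snake_case');                // 0
$kebab = preg_match('/\W/', 'kebab-case');                // 1

// A negated class matches anything except the listed characters.
$noCaps  = preg_match('/^[^A-Z]+$/', 'lowercase only');   // 1
$hasCaps = preg_match('/^[^A-Z]+$/', 'Mixed');            // 0

// Inside a class, the full stop is literal, not "any character".
$dotVsLetter = preg_match('/^[.]$/', 'a');                // 0
$dotVsDot    = preg_match('/^[.]$/', '.');                // 1

var_dump($eightDigits, $snake, $kebab, $noCaps, $hasCaps, $dotVsLetter, $dotVsDot);
```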

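[A-Za-z' .-] will allow all letters, as upper and lower cases, the apostrophe, the space, the full stop and the hyphen, which would permit most forms of name written in unaccented Latin letters to be matched. Note that the range A-z is not a safe shorthand for all letters, since it also covers the characters [ \ ] ^ _ and `, and that a full stop inside a class is a literal full stop, not 'any character'.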

can\s?not matches both "can not" and "cannot". if(preg_match('/^(can\s?not|can\'t)$/', $word)) would check for all three possibilities: "can not", "cannot" and "can't".

A word boundary (\b) is the position between a word character (\w) and a non-word character, or the start or end of the string. \band\b uses it to check that 'and' appears as a whole word in a string, e.g. \band\b would match 'the long and the short', but not 'bandaid'.

The upper case \B is the opposite: \Band\B would match 'bandaid' but not 'the long and the short'.
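The two boundary shortcuts can be compared side by side. The sketch below uses the strings from the text plus one hypothetical extra ('rock-and-roll') to show that punctuation counts as a boundary, not only spaces.

```php
<?php
$phrase = 'the long and the short';

$wordInPhrase  = preg_match('/\band\b/', $phrase);           // 1
$wordInBandaid = preg_match('/\band\b/', 'bandaid');         // 0

$innerInBandaid = preg_match('/\Band\B/', 'bandaid');        // 1
$innerInPhrase  = preg_match('/\Band\B/', $phrase);          // 0

// A hyphen is a non-word character, so it creates a boundary too.
$hyphenated = preg_match('/\band\b/', 'rock-and-roll');      // 1

var_dump($wordInPhrase, $wordInBandaid, $innerInBandaid, $innerInPhrase, $hyphenated);
```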

Email address validation

preg_match('/^[\w.-]+@[\w.-]+\.[A-Za-z]{2,10}$/', $string)

NB: in the past, the length of the final part (the top-level domain) was commonly restricted to between two and six characters {2,6}. With the new TLDs, this may no longer be sufficient.
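The pattern could be wrapped in a small helper, as in the sketch below. The function name looksLikeEmail and the sample addresses are hypothetical, and the check is only a rough plausibility test, not full address validation.

```php
<?php
// Hypothetical helper: returns true if $email fits the simple pattern above.
function looksLikeEmail(string $email): bool
{
    return preg_match('/^[\w.-]+@[\w.-]+\.[A-Za-z]{2,10}$/', $email) === 1;
}

$plain   = looksLikeEmail('andrew@example.com');          // true
$museum  = looksLikeEmail('curator@example.museum');      // true: 6-letter TLD
$noAt    = looksLikeEmail('no-at-sign.example.com');      // false: no @
$longTld = looksLikeEmail('user@example.photography');    // false: 11-letter TLD

var_dump($plain, $museum, $noAt, $longTld);
```

The last case shows the limit mentioned in the NB: an 11-letter TLD is rejected by {2,10}. PHP's built-in filter_var($email, FILTER_VALIDATE_EMAIL) is an alternative worth considering when a hand-written pattern is not required.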

Matches found

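preg_match(pattern, subject, $match) fills the array $match with information about what was matched: $match[0] holds the text matched by the whole pattern, $match[1] the text matched by the first subpattern, and so on.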

preg_match_all(pattern, subject, $match) returns the number of matches found, and fills $match with an array of all of them.
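The contents of the match array are easiest to understand from an example. This sketch pulls dates out of a hypothetical string; the variable names are made up for illustration.

```php
<?php
// Hypothetical text containing two dates.
$text    = 'Created: 2015-09-10, updated: 2016-03-05';
$pattern = '/(\d{4})-(\d{2})-(\d{2})/';

// preg_match() captures only the first date.
$hit = preg_match($pattern, $text, $match);
// $match is ['2015-09-10', '2015', '09', '10']

// preg_match_all() captures every date, grouped by subpattern.
$total = preg_match_all($pattern, $text, $all);
// $all[0] is ['2015-09-10', '2016-03-05'] (whole matches)
// $all[1] is ['2015', '2016']             (first subpattern: years)

print_r($match);
print_r($all);
```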

Replacing text

preg_replace(pattern, replacement, subject, limit): looks for the pattern in the subject string and returns a copy in which each match is replaced with the replacement string, up to an optional limit number of times (by default, every match is replaced).
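A short sketch of preg_replace in use, with hypothetical sample strings. The third call also shows that the replacement string can refer back to subpatterns as $1, $2 and so on.

```php
<?php
$text = 'colour, colour, colour';

// Without a limit, every match is replaced.
$all = preg_replace('/colour/', 'color', $text);
// 'color, color, color'

// With a limit of 1, only the first match is replaced.
$first = preg_replace('/colour/', 'color', $text, 1);
// 'color, colour, colour'

// $1 and $2 in the replacement refer to the captured subpatterns.
$swapped = preg_replace('/(\w+) (\w+)/', '$2 $1', 'Vladimira Fang');
// 'Fang Vladimira'

var_dump($all, $first, $swapped);
```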

Content © Andrew Bone. All rights reserved. Created: September 10, 2015. Last updated: March 5, 2016.
