Science Library - free educational site

PERL Compatible Regular Expressions

Regular expressions are a technique available in many programming languages. They are a system for matching patterns in strings, useful for identifying elements and making conditional substitutions automatically. One application we use them for is redirecting URLs according to set patterns.

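PHP has supported two flavours: POSIX Extended (the ereg functions) and PCRE (Perl-Compatible Regular Expressions, the preg functions). PCRE is more powerful but more complex than POSIX. The POSIX functions were deprecated in PHP 5.3 and removed in PHP 7.0, so PCRE is now the type to use.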

The preg_match(pattern, subject) function returns 1 if the pattern matches the subject string and 0 if it does not (or false if an error occurs). The function stops once a single match has been found. To find all matches use preg_match_all. Place the pattern in single quotation marks to avoid problems with escape sequences like \n, which take on special meanings within double quotation marks.
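As a minimal sketch of the points above (the sample text and variable names are made up for illustration), the following compares preg_match with preg_match_all, and single with double quotation marks:

```php
<?php
// Hypothetical sample text, for illustration only.
$text = 'cat, bat, rat';

// preg_match stops at the first match and returns 1 (or 0 if there is none).
$first = preg_match('/at/', $text);

// preg_match_all keeps going and returns the number of matches found.
$count = preg_match_all('/at/', $text);

// In single quotes the backslash survives and reaches the regex engine;
// in double quotes PHP itself turns \n into a newline character first.
$single = strlen('\n'); // 2 characters: a backslash and an n
$double = strlen("\n"); // 1 character: a newline
```

Here $first should be 1 and $count should be 3.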

Delimiters: these are characters which indicate where the pattern starts and ends. A delimiter may be any character that is not alphanumeric, a backslash or white space. Use the same delimiter to mark both ends of the pattern. A common delimiter is the forward slash (/). If / itself is to be matched within the string, either escape it (\/) or use a different delimiter such as # or !.
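A short sketch of why the choice of delimiter matters, assuming a made-up URL: both patterns below test the same thing, but the second avoids escaping every slash.

```php
<?php
// Hypothetical URL, for illustration only.
$url = 'http://example.com/articles/regex';

// With / as the delimiter, every / inside the pattern must be escaped.
$withSlash = preg_match('/^http:\/\/example\.com\//', $url);

// With # as the delimiter, the slashes can be written as they are.
$withHash = preg_match('#^http://example\.com/#', $url);
```

Both calls should return 1 for this URL.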

Example

$name = "Frau Vladimira Fang";

// Pick a salutation according to the title found in the name.
if (preg_match('/Frau/', $name)) {
    print 'Sehr geehrte Frau ...';
} elseif (preg_match('/Signora/', $name)) {
    print 'Gentile Signora...';
} else {
    print 'Dear Madam...';
}

Patterns

Literals

These are characters which are interpreted 'literally', exactly as they are written. Matching is case-sensitive: /Frau/ matches 'Frau' but not 'frau'.
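A small sketch of case-sensitivity with literals. The i modifier used in the second call is a standard PCRE option for case-insensitive matching that the text above does not cover; the input string is made up.

```php
<?php
// Hypothetical input with a lower-case "frau".
$name = 'frau Vladimira Fang';

// Literal match: "F" does not match "f", so this returns 0.
$strict = preg_match('/Frau/', $name);

// The i modifier after the closing delimiter ignores case, so this returns 1.
$relaxed = preg_match('/Frau/i', $name);
```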

Meta-Characters

Some characters have a meaning beyond their literal value. These meta-characters have special meanings within regular expressions:

Meta-Character    Interpretation
\                 escapes the following character
^                 (caret) indicates string start
$                 indicates string end
.                 any single character except newline
\Q .. \E          escapes all characters in between
|                 (pipe, or vertical bar) or
[ .. ]            start .. end of class
( .. )            start .. end of subpattern
{ .. }            start .. end of quantifier
\\\\              (as written in a PHP string) matches a single literal backslash
?                 0 or 1
*                 0 or more
+                 1 or more
{n}               exactly n occurrences
{n,m}             between n and m occurrences (no space after the comma)
{n,}              at least n occurrences

Meta-characters can be used singly or in combination to make simple or complex regular expression patterns. To match a meta-character literally, escape it with a backslash (\).

e.g.: the pattern e.g will match both 'e.g' and 'egg', while e\.g will match only 'e.g'.
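The difference between the escaped and unescaped dot can be checked with a few lines. This is only a sketch; preg_quote() is a built-in PHP function that adds the backslashes for you when a literal string has to be placed inside a pattern.

```php
<?php
$loose  = '/e.g/';  // the dot matches any single character
$strict = '/e\.g/'; // the escaped dot matches only a full stop

$looseOnEgg  = preg_match($loose, 'egg');  // 1
$strictOnEgg = preg_match($strict, 'egg'); // 0
$strictOnEg  = preg_match($strict, 'e.g'); // 1

// preg_quote() escapes the meta-characters in a literal string.
$quoted = preg_quote('e.g'); // e\.g
```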

The pattern ^<p> will match a string that begins with a p tag, and </p>$ will match a string that ends with a closing p tag.

preg_match('/theat(re|er)/', $string) will match both the British spelling 'theatre' and the American 'theater'.

preg_match('/colou?r/', $string) will allow both spellings, 'colour' and 'color'.
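A brief sketch that runs these two patterns against a few made-up words, to show what alternation and the ? quantifier accept and reject:

```php
<?php
$theatre = '/theat(re|er)/'; // alternation: "re" or "er"
$colour  = '/colou?r/';      // the "u" may appear 0 or 1 times

$british   = preg_match($theatre, 'theatre'); // 1
$american  = preg_match($theatre, 'theater'); // 1
$truncated = preg_match($theatre, 'theatr');  // 0

$withU   = preg_match($colour, 'colour');  // 1
$noU     = preg_match($colour, 'color');   // 1
$doubleU = preg_match($colour, 'colouur'); // 0
```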

preg_match('/^\$.+00$/', $string) will check for an amount of any length that starts with a dollar sign and ends with '00'. The dollar sign must be escaped (\$), because an unescaped $ means 'end of string'.
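A sketch of the dollar-amount check, with made-up amounts. It contrasts an unescaped $ (which is read as the end-of-string anchor) with an escaped one, and adds ^ and $ anchors so that the whole string must fit the pattern.

```php
<?php
// Single quotes, so PHP does not try to read $1500 as a variable.
$amount = '$1500';

// Unescaped: $ means "end of string", and nothing can follow it, so this is 0.
$unescaped = preg_match('/$.+00/', $amount);

// Escaped and anchored: a literal dollar sign, anything, then 00 at the end.
$escaped = preg_match('/^\$.+00$/', $amount);

// An amount that does not end in 00 is rejected.
$wrongEnding = preg_match('/^\$.+00$/', '$1550');
```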

preg_match('/^(0|1|2|3|4|5|6|7|8|9){8}$/', $string) will check that an 8-digit number has been entered.

preg_match('/^(0|1|2|3|4|5|6|7|8|9){4,}$/', $string) checks that there are at least 4 digits.

preg_match('/^(0|1|2|3|4|5|6|7|8|9){4,6}$/', $string) checks that there are at least 4 digits, and not more than 6.
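The three digit checks differ only in their quantifier, so they can be sketched with one small helper. The function name hasDigitCount is hypothetical, introduced here for illustration only.

```php
<?php
// Hypothetical helper: true if $s consists of between $min and $max digits.
function hasDigitCount(string $s, int $min, int $max): bool
{
    // Builds e.g. {4,6}; there must be no space inside the braces.
    $pattern = '/^(0|1|2|3|4|5|6|7|8|9){' . $min . ',' . $max . '}$/';
    return preg_match($pattern, $s) === 1;
}

$eightDigits = hasDigitCount('12345678', 8, 8); // true
$tooShort    = hasDigitCount('123', 4, 6);      // false
$tooLong     = hasDigitCount('1234567', 4, 6);  // false
$notDigits   = hasDigitCount('12a4', 4, 6);     // false
```

The ^ and $ anchors are what make the count apply to the whole string; without them, a longer string containing four digits somewhere would also pass.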

Character Classes

To save having to write all possible values in a PCRE, character classes may be used.

Class              Code (shortcut)   Interpretation
[0-9]              \d                Numerals (single digits)
[A-Za-z]                             Alphabet
[^A-Z]                               Not capital letters
[A-Za-z0-9_]       \w                Any word character (letters, digits, underscore)
[^A-Za-z0-9_]      \W                Not a word character
[ \f\r\t\n\v]      \s                White space
[^ \f\r\t\n\v]     \S                Not white space

Within classes, all meta-characters are literal, except for \ (escape), ^ (negation, only when it is the first character), - (range, when it sits between two characters) and ] (end of class).
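A minimal sketch of how classes behave, using made-up sample strings: it checks that [0-9] and \d agree on plain ASCII input, that a leading ^ negates a class, and that a dot inside a class is literal.

```php
<?php
// [0-9] and \d should give the same answer for each of these samples.
$samples = ['2016', '20a6', ''];
$same = true;
foreach ($samples as $s) {
    if (preg_match('/^[0-9]+$/', $s) !== preg_match('/^\d+$/', $s)) {
        $same = false;
    }
}

// A leading ^ negates the class: is there any character that is not A-Z?
$allCapitals = preg_match('/[^A-Z]/', 'ABCD'); // 0: nothing but capitals
$hasOther    = preg_match('/[^A-Z]/', 'ABcD'); // 1: the "c" is not a capital

// Inside a class the dot is literal, so [.] matches only a full stop.
$dotOnX   = preg_match('/^[.]$/', 'x'); // 0
$dotOnDot = preg_match('/^[.]$/', '.'); // 1
```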

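[A-Za-z' .-] will allow all letters, in upper and lower case, the apostrophe, the space, the full stop and the hyphen, which would permit most unaccented names to be matched. Avoid the shortcut [A-z]: in ASCII the range from A to z also takes in the characters [ \ ] ^ _ and `. Note too that within a class the full stop is literal, and does not stand for 'any character'.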

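can\s?not matches both "can not" and "cannot". if(preg_match('/^(can\s?not|can\'t)$/', $word)) would check for all three possibilities. (Writing the two alternatives as separate optional groups, (can\s?not)?(can\'t)?, would also accept an empty string.)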
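A minimal sketch that runs such a check over a few sample words, using alternation (|) so that exactly one of the three forms must be present. The list of words is made up for illustration.

```php
<?php
// \' is how an apostrophe is written inside a single-quoted PHP string.
$pattern = '/^(can\s?not|can\'t)$/';

$oneWord  = preg_match($pattern, 'cannot');      // 1
$twoWords = preg_match($pattern, 'can not');     // 1
$short    = preg_match($pattern, "can't");       // 1
$empty    = preg_match($pattern, '');            // 0
$doubled  = preg_match($pattern, "cannotcan't"); // 0
```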

A word boundary (\b) is the position between a word character (\w) and a non-word character, or the start or end of the string. \band\b checks that 'and' appears as a whole word in a string, e.g. it would match 'the long and the short', but not 'bandaid'.

The upper case \B is the opposite, a position that is not a word boundary: \Band\B would match 'bandaid' but not 'the long and the short'.
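The boundary examples can be verified directly. In this sketch the last string, 'rock-and-roll', is an added illustration that a boundary does not need a space: a hyphen is also a non-word character.

```php
<?php
$wordInSentence = preg_match('/\band\b/', 'the long and the short'); // 1
$wordInBandaid  = preg_match('/\band\b/', 'bandaid');                // 0

$innerInBandaid  = preg_match('/\Band\B/', 'bandaid');                // 1
$innerInSentence = preg_match('/\Band\B/', 'the long and the short'); // 0

// Hyphens count as non-word characters, so "and" is a whole word here too.
$hyphenated = preg_match('/\band\b/', 'rock-and-roll'); // 1
```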

Email address validation

preg_match('/^[\w.-]+@[\w.-]+\.[A-Za-z]{2,10}$/', $email)

NB: in the past, the length of the final part (the top-level domain) was commonly restricted to between two and six characters {2,6}. With the new TLDs, this may no longer be sufficient.
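A sketch of the pattern in use, wrapped in a helper whose name (looksLikeEmail) is hypothetical, as are the addresses. It is a rough plausibility check, not a full validation of every legal address.

```php
<?php
// Hypothetical helper wrapping the pattern from the text.
function looksLikeEmail(string $email): bool
{
    return preg_match('/^[\w.-]+@[\w.-]+\.[A-Za-z]{2,10}$/', $email) === 1;
}

$ok      = looksLikeEmail('vladimira.fang@example.com');    // true
$noTld   = looksLikeEmail('vladimira@example');             // false
$noAt    = looksLikeEmail('vladimira.example.com');         // false
// An 11-letter TLD exceeds the {2,10} limit and is rejected.
$longTld = looksLikeEmail('vladimira@example.photography'); // false
```

The last case shows the limitation mentioned in the note: any fixed upper limit will reject some valid top-level domains. PHP also offers the built-in filter_var() with FILTER_VALIDATE_EMAIL, which may be a better fit where stricter checking is needed.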

Matches found

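preg_match(pattern, subject, $match) provides information about what was matched: $match[0] holds the text that matched the whole pattern, $match[1] the text matched by the first subpattern, and so on.

preg_match_all(pattern, subject, $match) returns the number of matches found and fills $match with an array of all of them.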
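A short sketch of what ends up in the third argument, using made-up sample text. With preg_match_all in its default mode, the full matches are collected in element 0 of the array.

```php
<?php
// Hypothetical sample text, for illustration only.
$text = 'Created: 2015, updated: 2016';

// Two subpatterns: the first two digits and the last two digits of a year.
$found = preg_match('/(\d{2})(\d{2})/', $text, $match);
// $match[0] is '2015', $match[1] is '20', $match[2] is '15'

// preg_match_all returns the count and collects every match.
$total = preg_match_all('/\d{4}/', $text, $all);
// $all[0] is ['2015', '2016']
```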

Replacing text

preg_replace(pattern, replacement, subject, limit): looks for the pattern in the subject string and returns a copy in which each match is replaced by the replacement string, up to an optional limit number of times (by default, all matches are replaced).
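A minimal sketch of preg_replace with and without a limit, using made-up strings. The last call also shows that the replacement may refer back to subpatterns as $1, $2 and so on, a standard feature not described above.

```php
<?php
$text = 'colour, colour, colour';

// No limit: every match is replaced.
$all = preg_replace('/colour/', 'color', $text);
// 'color, color, color'

// Limit of 1: only the first match is replaced.
$once = preg_replace('/colour/', 'color', $text, 1);
// 'color, colour, colour'

// $1 and $2 in the replacement stand for the first and second subpattern.
$swapped = preg_replace('/(\w+) (\w+)/', '$2 $1', 'Vladimira Fang');
// 'Fang Vladimira'
```

The replacement string is in single quotation marks so that PHP does not try to interpret $1 and $2 itself.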

Content © Renewable-Media.com. All rights reserved. Created : September 10, 2015 Last updated :March 5, 2016

