PHP Lesson 17 - Regular Expressions

Started by Parham, December 06, 2003, 08:07:09 PM

Parham

Regular expressions are one of the trickiest things to learn.  There are a lot of components to them, but at the same time they can be very powerful.  A regular expression is a pattern which lets you match an arbitrary string, dissect it, and check it for validity.  A regular expression uses a set of special characters to describe the strings it should match.  For those of you that have used DOS or the Unix shell, the wildcards in "dir *.txt" (for DOS) or "ls *.txt" work on a similar idea: they ask the dir/ls commands to only return names that end with ".txt" and have "any other characters" before them.  Strictly speaking those are shell wildcards rather than regular expressions, but the idea of matching against a pattern is the same.

Why would you want to use regular expressions in your scripts?  The biggest reason would be to validate what a user inputs into fields in an HTML form and submits to your PHP script.  I won't go into the negatives, but for example, if you had the HTML field "age", you would only expect the user to input a number.  If the user inputs anything other than numbers, you don't want that information to go into your database.  You can use regular expressions to validate what the user inputs in the "age" field, and if they type in something bad, you can warn them.

The six basic patterns used in regular expressions are:

Pattern: a*
Matches: '', 'a', 'aa', ...
Explanation: match "a" zero or more times

Pattern: b+
Matches: 'b', 'bb', ...
Explanation: match "b" one or more times

Pattern: ab?c
Matches: 'ac', 'abc'
Explanation: match "a" followed by "b" optionally and then "c"

Pattern: [abc]
Matches: 'a' or 'b' or 'c'
Explanation: match "a" or "b" or "c" once

Pattern: [a-c]
Matches: 'a' or 'b' or 'c'
Explanation: Abbreviation for the above

Pattern: [abc]*
Matches: '', 'accb', ...
Explanation: Combination of "one from a set" and "zero or more"; match "a" or "b" or "c" zero or more times from the set

The "^" character is used to check whether something "starts at the beginning of the string".  The "$" character is used to check whether something "finishes at the end of the string".  The "|" character is used as the "or" separator.  The "|" character is not like the square bracket characters, because the | character separates regular expressions, NOT characters.  Parentheses are used to group regular expressions.  Curly brackets are used to match a regular expression a certain number of times (or a minimum/maximum number of times).  I know this is a little too much to take in, but soon there will be a massive amount of examples to explain all of these regular expression characters.

There are also a few special sequences which are used to represent common characters.  Those are:

\t -> Tab
\n -> Newline
\r -> Carriage Return
\* -> Asterisk
\\ -> Backslash
\d -> Digits [0-9]
\w -> Word [a-zA-Z0-9_] (letters, numbers, and the underscore)
\s -> Whitespace [ \t\r\n\f] (a space, a tab, a carriage return, a newline, a form feed)
. -> Anything except end-of-line [^\n] (literally any character that isn't a newline)
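A quick way to get a feel for these sequences is to run a few of them through preg_match(), the function described in the next paragraph.  The following is only a sketch; the sample strings and the $checks and $results variables are made up for illustration:

```php
<?php
// Each entry is a [pattern, subject] pair.
// Single-quoted PHP strings pass backslashes such as \d through to the regex engine unchanged.
$checks = [
    ['/\d/',  'room 7'],     // "7" is a digit
    ['/\d/',  'no digits'],  // no digit anywhere
    ['/\w/',  '_'],          // the underscore counts as a word character
    ['/\s/',  "a\tb"],       // a tab is whitespace
    ['/\s/',  'a b'],        // so is a plain space
    ['/a.c/', 'abc'],        // "." stands in for the "b"
    ['/a.c/', "a\nc"],       // "." does not match a newline
    ['/\*/',  '2*3'],        // an escaped asterisk matches a literal "*"
];

$results = [];
foreach ($checks as list($pattern, $subject)) {
    // preg_match() gives 1 for a match and 0 for no match
    $results[] = preg_match($pattern, $subject);
}

echo implode(' ', $results), "\n"; // prints: 1 0 1 1 1 1 0 1
```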

The function used in PHP to match a string using regular expressions is the preg_match() function.  This function uses Perl-compatible regular expressions (the PCRE library) to match a string.  The function takes the following [simple] parameters:

int preg_match (string pattern, string subject)

The "pattern" must start and end with a delimiter, which is most commonly the "/" character.  The main reason for this is that this function uses a Perl-compatible regular expressions library, and Perl writes its patterns between "/"s (if you've used Perl, regular expressions don't use functions, instead they use m//).  The function returns 1 if the "pattern" matched something in the "subject" and 0 otherwise (and false if the pattern itself is broken).
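Since preg_match() can return 1, 0, or false (on an error such as a malformed pattern), it can be handy to wrap it so a script only ever sees true or false.  The sketch below uses a hypothetical helper called matches(), which is not part of PHP.  It also shows that PCRE accepts delimiters other than "/", such as "#" or "~":

```php
<?php
// Hypothetical helper (not a PHP built-in): true only when the pattern really matched.
function matches($pattern, $subject)
{
    // Compare against 1 explicitly so that both 0 (no match) and false (error) count as "no".
    return preg_match($pattern, $subject) === 1;
}

var_dump(matches('/cat/', 'concatenate'));       // bool(true), "cat" is inside the word
var_dump(matches('/dog/', 'concatenate'));       // bool(false)

// Any matching pair of non-alphanumeric delimiters works, not only "/".
var_dump(matches('#cat#', 'concatenate'));       // bool(true)
var_dump(matches('~^/home/~', '/home/parham'));  // bool(true), the slashes need no escaping here
```

Using the result in an if statement then reads naturally: if (matches('/^\d+$/', $age)) { ... }.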

Here are a few simple examples:


echo preg_match("/a/", "a"); //matches "a"
echo preg_match("/b/", "a"); //doesn't match, needs a "b"
echo preg_match("/a+/",""); //doesn't match, needs to have at least 1 "a"
echo preg_match("/a+/","a"); //matches, at least one "a"
echo preg_match("/a+/","aaaaaa"); //matches, at least one "a"
echo preg_match("/a*/",""); //matches, 0 or more "a"'s
echo preg_match("/a*/","aaaaaaaaaa"); //matches, 0 or more "a"'s
echo preg_match("/[xyz]/","x"); //matches, there is an "x"
echo preg_match("/[xyz]/","y"); //matches, there is a "y"
echo preg_match("/[xyz]/","z"); //matches, there is a "z"
echo preg_match("/[xyz]/","a"); //doesn't match, there is no "x", "y", or "z"
echo preg_match("/[a-z]/","q"); //matches, "q" is in the range from "a" to "z"
echo preg_match("/[0-9]/","5"); //matches, "5" is in the range from "0" to "9"
echo preg_match("/[0-9]/","s"); //doesn't match, "s" is not in the range from "0" to "9"


examples of the "|" character:


//note that the | does not apply only to the character before or after it;
//it separates everything before it from everything after it unless you
//group them

//not grouped
echo preg_match("/ab|cd/","ab"); //matches
echo preg_match("/ab|cd/","cd"); //matches

//grouped
echo preg_match("/a(b|c)d/","abd"); //matches
echo preg_match("/a(b|c)d/","acd"); //matches
echo preg_match("/a(b|c)d/","ad"); //doesn't match


examples of the "*" character:


echo preg_match("/ab*/","abbbb"); //matches
echo preg_match("/ab*/","bbbbb"); //fails


examples of the "+" character:


echo preg_match("/a+b/","aaaab"); //matches
echo preg_match("/a+b/","b"); //fails


examples with "\w" character:


echo preg_match("/\w+/","abc"); //matches
echo preg_match("/\w+/","a_b_c"); //matches
echo preg_match("/\w+/","0123456789"); //matches
echo preg_match("/\w+/","-"); //fails, "-" is not a part of \w
echo preg_match("/\w+/"," "); //fails, space is not a part of \w
echo preg_match("/\w+/",""); //fails, have to have a least one \w


examples with "?" character:


echo preg_match("/a?b?c?/","a"); //matches
echo preg_match("/a?b?c?/","b"); //matches
echo preg_match("/a?b?c?/","c"); //matches
echo preg_match("/a?b?c?/","abc"); //matches
echo preg_match("/a?b?c?/","ab"); //matches
echo preg_match("/a?b?c?/","bc"); //matches
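One thing worth noticing about the "?" examples: because every piece of /a?b?c?/ is optional, the pattern is also happy to match nothing at all, and "nothing" can be found in any string.  A small sketch (the $loose and $strict names are just for illustration) comparing it with a version anchored by "^" and "$":

```php
<?php
$loose  = '/a?b?c?/';    // everything optional, no anchors
$strict = '/^a?b?c?$/';  // same pieces, but the WHOLE string has to fit

echo preg_match($loose, 'xyz'), "\n";   // 1, matches the empty string at the start
echo preg_match($loose, ''), "\n";      // 1, an empty subject is fine too

echo preg_match($strict, 'xyz'), "\n";  // 0, "x", "y", "z" are not allowed
echo preg_match($strict, 'ac'), "\n";   // 1, "b" was skipped
echo preg_match($strict, 'ca'), "\n";   // 0, wrong order
echo preg_match($strict, ''), "\n";     // 1, all three pieces skipped
```

So when the goal is validation, the anchored form is usually the one you want.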


examples with "^" and "$" characters:


echo preg_match("/^im/","image"); //matches
echo preg_match("/^im/","imagine"); //matches
echo preg_match("/^im/","embrace"); //doesn't match
echo preg_match("/er$/","programmer"); //matches
echo preg_match("/er$/","designer"); //matches
echo preg_match("/er$/","designing"); //doesn't match
echo preg_match("/^(ab|cd)$/","ab"); //matches
echo preg_match("/^(ab|cd)$/","cd"); //matches
echo preg_match("/^(ab|cd)$/","abcd"); //doesn't match
echo preg_match("/^(ab|cd)$/","xy"); //doesn't match


examples with curly brackets:


echo preg_match("/a{2}/","aaa"); //matches, found "aa" somewhere
echo preg_match("/^a{2}$/","aa"); //matches, entire string is "aa"
echo preg_match("/^a{2}$/","aaa"); //doesn't match, entire string isn't "aa"
echo preg_match("/a{2,4}/","aaa"); //matches, minimum "aa", maximum "aaaa"
echo preg_match("/a{2,4}/","aaaa"); //matches
echo preg_match("/a{2,4}/","a"); //doesn't match
echo preg_match("/^a{2,4}$/","aabaa"); //doesn't match
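Two more things about curly brackets that the examples above don't show: the maximum can be left out ("{2,}" means "two or more"), and the count applies only to the single item right before it unless you group.  A short sketch, with a made-up $checks list mapping each pattern/subject pair to its result:

```php
<?php
$checks = [
    ['/^a{2,}$/',   'aaaaaa'],  // two or more "a"s, no upper limit
    ['/^a{2,}$/',   'a'],       // only one "a", not enough
    ['/^(ab){2}$/', 'abab'],    // the GROUP "ab" repeated twice
    ['/^(ab){2}$/', 'abb'],     // not "ab" twice
    ['/^ab{2}$/',   'abb'],     // without the group, {2} applies to "b" only
];

$results = [];
foreach ($checks as list($pattern, $subject)) {
    $results[] = preg_match($pattern, $subject);
}

echo implode(' ', $results), "\n"; // prints: 1 0 1 0 1
```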


a few common regular expressions (these are by no means secure... JUST simple):


echo preg_match("/^[-.\w]+\@[-.\w]+$/","user@example.com"); //email addresses
echo preg_match("/^\d{2}$/","24"); //ages
echo preg_match("/^(19|20)\d\d$/","1983"); //years
echo preg_match("/^([\w\s]+)$/","hello there"); //a simple string
echo preg_match("/^(http:\/\/www\.|http:\/\/|www\.)([\w\.\/\=\?\&\-]+)$/","http://www.google.com"); //urls
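To tie this back to the "age" field mentioned at the start, here is a rough sketch of how such a check could look in a script.  The function name is_valid_age() and the one-to-three digit limit are assumptions made for this example, and in a real form the value would come from something like $_POST['age'].  One detail: "$" also matches just before a trailing newline, so the sketch uses "\z" (the very end of the string) instead.

```php
<?php
// Hypothetical helper: accepts a string of one to three digits and nothing else.
function is_valid_age($input)
{
    // \d{1,3} = one to three digits, \z = the very end of the string
    return preg_match('/^\d{1,3}\z/', $input) === 1;
}

// Sample values standing in for what a user might type into the form.
$samples = ['24', '7', '103', 'abc', '24 years', '', "24\n"];
foreach ($samples as $value) {
    echo var_export($value, true), ' => ', is_valid_age($value) ? 'ok' : 'rejected', "\n";
}

// For comparison: with "$" the trailing newline slips through.
echo preg_match('/^\d{1,3}$/', "24\n"), "\n"; // 1
```

Like the patterns above, this only checks the shape of the input; it is not a complete validation or security layer.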

Tyris

hey, thanks a lot for that!!!! :)
just wish you'd written it a bit earlier ;)
It really helped me understand Regular Expressions a lot better than I did...
as well as cleared up quite a few problems I was pondering.

one question...
with
echo preg_match("/^(ab|cd)$/","abcd"); //doesn't match
that kinda thing...
from what it seems like you've stated in the ^ and $ description...
it should match either an 'ab or cd' at the start, and either an 'ab or cd' at the end...
so wouldnt that pass...?
I understand that you've said that ^blah$ means the WHOLE string is and ONLY is "blah"... but I dont quite understand why because of your explanation of ^ and $...
how would you make it start AND end with "ab"... coz preg_match("/^ab$/","ab_blah_ab") would apparently fail...

thanx.

Parham

I should probably clear that up... I'm 99.9999% sure the ^ and $ characters can only exist at the very beginning and very end of the regex, nowhere in between (they mean nothing in those places).  Now if you only have the ^, that'll mean that your regex must match the very beginning of the string.  If you only have $, that means your regex must match the very end.  If you have both, then your regex must match the entire string.

Look carefully at the regex and the example you're inquiring about Tyris.  If you read the regex back, it says something like this: "From beginning to end, the string must contain 'ab' OR 'cd'".  The "|" character splits up your regex, and it'll match either one or the other, NOT both.  If you wanted to alter that regex to match both, then you'd have to do something like this:


echo preg_match("/^(ab|cd)(ab|cd)$/","abcd"); //matches


Now if you wanted to make it start and end with "ab", then this would probably be the regex you want:


echo preg_match("/^ab.*ab$/","abcdab"); //note, .* means match any character that isn't a newline zero or more times


Reading this regex literally: "The string must start with 'ab', then anything can exist in the middle, and it must end with 'ab'"
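Here is that regex run against a few more strings, as a quick sketch (the $pattern and $results names are just for this example).  Note that the two "ab"s can't share characters, so the shortest string that matches is "abab":

```php
<?php
$pattern = '/^ab.*ab$/';

$results = [];
foreach (['abcdab', 'ab_blah_ab', 'abab', 'ab', 'aba', 'xab_ab'] as $subject) {
    $results[$subject] = preg_match($pattern, $subject);
    echo $subject, ' => ', $results[$subject], "\n";
}
// abcdab     => 1
// ab_blah_ab => 1
// abab       => 1  (.* is allowed to match nothing)
// ab         => 0  (one "ab" can't be both the start and the end)
// aba        => 0
// xab_ab     => 0  (doesn't start with "ab")
```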

You might just be thinking of them a little too generally or abstractly.  Say out loud what you want to match, and how you want to match it, and then try to build a regex from that.  When you know something MUST exist for it to match, put it in there literally, and then just try to fill the gaps with more general regex characters.

Hope that helps :)

Parham

oh, for you VERY curious people, there is this tool that'll check to see whether regular expressions match strings you input:

http://www.weitz.de/regex-coach/

There is also A LOT of literature about the subject.  O'Reilly's "Mastering Regular Expressions" over at http://www.amazon.ca/exec/obidos/ASIN/0596002890/qid=1070772784/sr=1-1/ref=sr_1_0_1/701-8818493-0043539

I know post-secondary institutions are teaching theory on regular expressions also.  I at least was taught about them this year, both in theory and in practice.

Tyris

ok, sweet thanks... just wanted to confirm that using both means it must match the entire string as such.. :)

am using the coach on my laptop too (tho laptop is dead temporarily)...
Thanx again :)

[Unknown]

The ^ and $ can be anywhere, here's an example:

/abc(d|$)/ -> matches anything + "abc" or anything + "abcd" + anything.

/^(ab|cd)$/ -> matches: 'ab' OR 'cd'.  The problem I think you are having is the word OR - in general programming, or usually means "this, that, or both" but in regular expressions it means "this, that, but not both."
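A small sketch to see this in action.  The first pattern is the one above; the second, with "^" inside a group, is a made-up example along the same lines:

```php
<?php
$endOrD   = '/abc(d|$)/';   // "abc" followed by "d", OR "abc" at the very end
$startOrDash = '/(^|-)php/'; // "php" at the very start, OR right after a "-"

echo preg_match($endOrD, 'xabc'), "\n";    // 1, "abc" sits at the end
echo preg_match($endOrD, 'abcdef'), "\n";  // 1, "abcd" is found
echo preg_match($endOrD, 'abcx'), "\n";    // 0, neither "d" nor the end follows "abc"

echo preg_match($startOrDash, 'php'), "\n";    // 1
echo preg_match($startOrDash, 'x-php'), "\n";  // 1
echo preg_match($startOrDash, 'xphp'), "\n";   // 0
```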

-[Unknown]

Parham

Quote from: [Unknown] on December 07, 2003, 12:30:21 AM
/abc(d|$)/ -> matches anything + "abc" or anything + "abcd" + anything.

it's still _technically_ at the end of the regex though, no?  You can either choose to end with a d or "abc" must show up at the end of the string.  What I meant was that you can't do something like "/beginning(end)$more/".  Meaning "end" should match at the end of the string.  What I meant was that regular expressions are very linear: match this, then this, then this or this, then this...

[Unknown]

Yeah... that's true, they are incredibly linear.

-[Unknown]

Tyris

ok, thanks to both of you... its pretty clear to me now :)

pulpitfire

whew...i'm starting to understand this...good job :)

Parham


Søren Bjerg

Quote from: Parham on December 06, 2003, 11:52:42 PM
[...] http://www.weitz.de/regex-coach/ [...]

Gotta say "The Regex Coach" is pretty helpful for regex newbies like myself, but I seem to have a problem with getting something, which works wonders in "The Regex Coach", to work in PHP;

Would like to remove the MD5 hash from the string below:

$page = 'file.php?s=843157a0b0c79d147e7bb94c3a3e9f76&action=view';

The following highlights flawlessly in "The Regex Coach"...

/?s=\w+

...and I'd then use it like this

preg_replace('/?s=\w+', 's=', $page);

But the problem is the starting and ending with '/'s... can't seem to get it to work then :(.

Any straightforward solution I've totally gone blind from?
RUNE HORDES dot INFO - SMF 1.1.10 w/ Custom Profile Mod... and various permissions hooks and template changes (new topic form).

Tyris

preg_replace('~/?s=\w+~', 's=', $page);
I'm still pretty useless with preg_replace stuff,... but that should work...
basically you need an opening and closing character (in this case "~") in regex statements.
the coach doesnt need them tho (as it deals with the statements themselves).

also... you probably want
$page = preg_replace('~/?s=\w+~', 's=', $page);
(maybe...)

Søren Bjerg

Ah... yes, that works as planned! Much thanks... I tried opening and ending with '/' as Parham explained in his first post, but I think I sort of get it now... appreciate your help :).

Tyris

opening and ending with '/'  is fine... however then you HAVE to escape any '/' characters within the regex... otherwise it thinks it's the end... so using ~ is less of a problem coz they're far less used...
:)
glad I could help :)
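To make the two options concrete, here is a sketch that runs the same replacement both ways on the string from the question.  The variable names $tilde and $slash are just for this example:

```php
<?php
$page = 'file.php?s=843157a0b0c79d147e7bb94c3a3e9f76&action=view';

// "~" as the delimiter: the "/" inside the pattern needs no escaping.
$tilde = preg_replace('~/?s=\w+~', 's=', $page);

// "/" as the delimiter: the "/" inside the pattern has to be written as "\/".
$slash = preg_replace('/\/?s=\w+/', 's=', $page);

echo $tilde, "\n";            // file.php?s=&action=view
var_dump($tilde === $slash);  // bool(true), both spellings do the same thing
```

preg_replace() returns the new string rather than changing $page, which is why the result has to be assigned to something.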

Parham


Saleh

Quote
Pattern: ab?c
Matches: 'ac', 'abc'
Explanation: match "a" followed by "b" optionally and then "c"
I think you mean it matches 'ab', 'abc' not 'ac'!! right?

We don't need a reason to help people

Parham

Quote from: NeverMind on May 27, 2004, 05:59:18 AM
Quote
Pattern: ab?c
Matches: 'ac', 'abc'
Explanation: match "a" followed by "b" optionally and then "c"
I think you mean it matches 'ab', 'abc' not 'ac'!! right?

nope, the "?" is on the "b"... the "b" is therefore optional.  try it in a script, you'll see that "ab" will not match.
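For anyone who wants to try it, a minimal sketch (the list of test strings is made up for this example):

```php
<?php
$results = [];
foreach (['ac', 'abc', 'ab', 'abbc'] as $subject) {
    $results[$subject] = preg_match('/ab?c/', $subject);
    echo $subject, ' => ', $results[$subject], "\n";
}
// ac   => 1  (the optional "b" is skipped)
// abc  => 1  (the optional "b" is present)
// ab   => 0  (the "c" is required and missing)
// abbc => 0  ("?" allows at most one "b")
```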

[Unknown]

To clarify:

[a][b]?[c]

Means:

[a] and then maybe [b] and then [c].

-[Unknown]
