Friday 30 October 2009

Regular Expression (Validation) Examples in PHP

The Regular Expressions quickly covers the basics of regular-expression syntax, then delves into the mechanics of expression-processing, common pitfalls, performance issues, and implementation-specific differences. Written in an engaging style and sprinkled with solutions to complex real-world problems, MRE offers a wealth information that you use. I will start with some simple usage examples of the regular expressions and continue with a huge list of cases for various situations where we would normally need a regex to operate. We will use simple functions which return TRUE or FALSE. $regex will serve as our regular expression to match against and $text will be our text (pretty obvious):
function do_reg($text, $regex)
{
 if (preg_match($regex, $text)) {
  return TRUE;
 }
 else {
  return FALSE;
 }
}
The next function will get the part of a given string ($text) matched by the regex ($regex) using a group srorage ($regs). By changing the $regs[0] to $regs[1] we can use a capturing group (in this case griup 1) to match against. The capturing group can also have a name ($regs['groupname']):
function do_reg($text, $regex, $regs)
{
 if (preg_match($regex, $text, $regs)) {
  $result = $regs[0];
 }
 else {
  $result = "";
 }
 return $result;
}  

The following function will return an array of all regex matches in a given string ($text):
function do_reg($text, $regex)
{
 preg_match_all($regex, $text, $result, PREG_PATTERN_ORDER);
 return $result = $result[0];
}

Next we can iterate (loop) over all matches in a string ($text) and output the results:
function do_reg($text, $regex, $regs)
{
 preg_match_all($regex, $text, $result, PREG_PATTERN_ORDER);
  for ($i = 0; $i < count($result[0]); $i++) {
  $result[0][$i];
 }
}  

Extending the above one we can iterate over all matches ($text) and capture groups in a string ($text):
function do_reg($text, $regex)
{
 preg_match_all($regex, $text, $result, PREG_SET_ORDER);
 for ($matchi = 0; $matchi < count($result); $matchi++) {
  for ($backrefi = 0; $backrefi < count($result[$matchi]); $backrefi++) {
   $result[$matchi][$backrefi];
  }
 }
} 

EXAMPLES:
For Date

$date_regex = "/^(0[1-9]|[12][0-9]|3[01])[\-\/\.](0[1-9]|1[012])[\-\/\.](19|20)[\d][\d]$/";
$date_regex = "/^(0[1-9]|[12][0-9]|3[01])[\-\/.](0[1-9]|1[012])[\-\/.](19|20)[\d][\d]$/";


//Date d/m/yy and dd/mm/yyyy
//1/1/00 through 31/12/99 and 01/01/1900 through 31/12/2099
//Matches invalid dates such as February 31st
'\b(0?[1-9]|[12][0-9]|3[01])[- /.](0?[1-9]|1[012])[- /.](19|20)?[0-9]{2}\b'

//Date dd/mm/yyyy
//01/01/1900 through 31/12/2099
//Matches invalid dates such as February 31st
'(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)[0-9]{2}'

//Date m/d/y and mm/dd/yyyy
//1/1/99 through 12/31/99 and 01/01/1900 through 12/31/2099
//Matches invalid dates such as February 31st
//Accepts dashes, spaces, forward slashes and dots as date separators
'\b(0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])[- /.](19|20)?[0-9]{2}\b'

//Date mm/dd/yyyy
//01/01/1900 through 12/31/2099
//Matches invalid dates such as February 31st
'(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)[0-9]{2}'

//Date yy-m-d or yyyy-mm-dd
//00-1-1 through 99-12-31 and 1900-01-01 through 2099-12-31
//Matches invalid dates such as February 31st
'\b(19|20)?[0-9]{2}[- /.](0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])\b'

//Date yyyy-mm-dd
//1900-01-01 through 2099-12-31
//Matches invalid dates such as February 31st
'(19|20)[0-9]{2}[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])' 
/-------------------------------------------------------------------------------/ For Address /-------------------------------------------------------------------------------/
//Address: State code (US)
'/\\b(?:A[KLRZ]|C[AOT]|D[CE]|FL|GA|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|PA|RI|S[CD]|T[NX]|UT|V[AT]|W[AIVY])\\b/'

//Address: ZIP code (US)
'\b[0-9]{5}(?:-[0-9]{4})?\b' 
/-------------------------------------------------------------------------------/ For Email /-------------------------------------------------------------------------------/
//Email address
//Use this version to seek out email addresses in random documents and texts.
//Does not match email addresses using an IP address instead of a domain name.
//Does not match email addresses on new-fangled top-level domains with more than 4 letters such as .museum.
//Including these increases the risk of false positives when applying the regex to random documents.
'\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'

//Email address (anchored)
//Use this anchored version to check if a valid email address was entered.
//Does not match email addresses using an IP address instead of a domain name.
//Does not match email addresses on new-fangled top-level domains with more than 4 letters such as .museum.
//Requires the "case insensitive" option to be ON.
'^[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$'

//Email address (anchored; no consecutive dots)
//Use this anchored version to check if a valid email address was entered.
//Improves on the original email address regex by excluding addresses with consecutive dots such as john@aol...com
//Does not match email addresses using an IP address instead of a domain name.
//Does not match email addresses on new-fangled top-level domains with more than 4 letters such as .museum.
//Including these increases the risk of false positives when applying the regex to random documents.
'^[A-Z0-9._%-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$'

//Email address (no consecutive dots)
//Use this version to seek out email addresses in random documents and texts.
//Improves on the original email address regex by excluding addresses with consecutive dots such as john@aol...com
//Does not match email addresses using an IP address instead of a domain name.
//Does not match email addresses on new-fangled top-level domains with more than 4 letters such as .museum.
//Including these increases the risk of false positives when applying the regex to random documents.
'\b[A-Z0-9._%-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}\b'

//Email address (specific TLDs)
//Does not match email addresses using an IP address instead of a domain name.
//Matches all country code top level domains, and specific common top level domains.
'^[A-Z0-9._%-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|biz|info|name|aero|biz|info|jobs|museum|name)$'

//Email address: Replace with HTML link
'\b(?:mailto:)?([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b' 
/-------------------------------------------------------------------------------/ For Credit Cards /-------------------------------------------------------------------------------/
//Credit card: All major cards
'^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6011[0-9]{12}|3(?:0[0-5]|[68][0-9])[0-9]{11}|3[47][0-9]{13})$'

//Credit card: American Express
'^3[47][0-9]{13}$'

//Credit card: Diners Club
'^3(?:0[0-5]|[68][0-9])[0-9]{11}$'

//Credit card: Discover
'^6011[0-9]{12}$'

//Credit card: MasterCard
'^5[1-5][0-9]{14}$'

//Credit card: Visa
'^4[0-9]{12}(?:[0-9]{3})?$'

//Credit card: remove non-digits
'/[^0-9]+/'

//Delimiters: Replace commas with tabs
//Replaces commas with tabs, except for commas inside double-quoted strings
'((?:"[^",]*+")|[^,]++)*+,'

/-------------------------------------------------------------------------------/ For HTML /-------------------------------------------------------------------------------/
//HTML comment
''

//HTML file
//Matches a complete HTML file. Place round brackets around the .*? parts you want to extract from the file.
//Performance will be terrible on HTML files that miss some of the tags
//(and thus won't be matched by this regular expression). Use the atomic version instead when your search
//includes such files (the atomic version will also fail invalid files, but much faster).
'.*?.*?.*?.*?]*>.*?.*?'

//HTML file (atomic)
//Matches a complete HTML file. Place round brackets around the .*? parts you want to extract from the file.
//Atomic grouping maintains the regular expression's performance on invalid HTML files.
'(?>.*?)(?>.*?)(?>.*?)(?>.*?]*>)(?>.*?).*?'

//HTML tag
//Matches the opening and closing pair of whichever HTML tag comes next.
//The name of the tag is stored into the first capturing group.
//The text between the tags is stored into the second capturing group.
'<([A-Z][A-Z0-9]*)[^>]*>(.*?)'

//HTML tag
//Matches the opening and closing pair of a specific HTML tag.
//Anything between the tags is stored into the first capturing group.
//Does NOT properly match tags nested inside themselves.
'<%TAG%[^>]*>(.*?)'

//HTML tag
//Matches any opening or closing HTML tag, without its contents.
']*>'