ITEC 325
2015spring
ibarland


lect16-regexps
regular expressions

regular expression intro

regexps.pdf

Representing regular expressions: They're not the same type as strings[2]. Various languages have slightly different ways of representing regexps: Some …

PHP regular expressions

PHP regular-expression matching: In PHP, regular expressions are written as strings delimited by a special character (usually /). Some very quick examples: lect16-regexps-examples.php

      echo preg_match( '/abcd/', 'abcd' );
      echo preg_match( '/abcd/', 'azcd' );      // false

      echo preg_match( '/a..d/', 'azcd' );      // . matches any single character (except newline, by default)

      echo preg_match( '/ab*cd/', 'abbbbcd' );  // b*  matches 0-or-more-b's.
      echo preg_match( '/ab*cd/', 'abcd' );  
      echo preg_match( '/ab*cd/', 'acd'  ); 

      echo preg_match( '/ab+cd/', 'abbbbcd' );  // b+  matches 1-or-more-b's.
      echo preg_match( '/ab+cd/', 'abcd' );  
      echo preg_match( '/ab+cd/', 'acd'  );     // false

      echo preg_match( '/ab.*cd/', 'abcd' );
      echo preg_match( '/ab.*cd/', 'abXd' );               // false: no "cd".  (.* and .+ are common patterns.)
      echo preg_match( '/ab.*cd/', 'abBlahBlahBlahcd' );
      echo preg_match( '/ab.+cd/', 'abcd' );               // false
      echo preg_match( '/ab.+cd/', 'abXcd' );
      echo preg_match( '/ab.+cd/', 'abBlahBlahBlahcd' );   // true: .+ matches "BlahBlahBlah"

      
      echo preg_match( '/ab?cd/', 'acd' );  // b?  matches 0-or-1 b
      echo preg_match( '/ab?cd/', 'abcd' );  
      echo preg_match( '/ab?cd/', 'abbbbcd'  );     // false

      echo preg_match( '/(ab)*cd/', 'ababababcd' );  // parens do grouping
      echo preg_match( '/(ab)*cd/', 'abbbbcd' );  // true! zero "ab"s followed by "cd" occurs inside the string (see WARNING below)


      // WARNING: preg_match looks to see if the string *contains* a match!
      echo preg_match( '/row/', 'How now, brown cow?' );  // true
      echo preg_match( '/.ow/', 'How now, brown cow?' );  // true

      // Use "^" to specify the start-of-string, and "$" to specify end-of-string.
      echo preg_match( '/^.ow$/', 'How now, brown cow?' );  // false
      echo preg_match( '/^.ow$/', 'Zow' );
      echo preg_match( '/^.ow/', 'Zowee' );
      echo preg_match( '/w.e$/', 'Wowee Zowee' );

      
      echo preg_match( '/[WZ]ow/', 'Wowee Zowee' );  // square-brackets match any one character from the set
      echo preg_match( '/[WZ]ow/', 'Yow' );          // false
      echo preg_match( '/[W-Z]ow/', 'Yowee' );       // square-brackets can contain a *range*
      echo preg_match( '/ab[0-9]+de/', 'ab789de' );


      echo preg_match( '/[0-9]*/', '00047' );        // Beware: matching just a * expression (w/o ^,$)!
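
A note on reading the output of the examples above: preg_match returns the integer 1 for a match and 0 for no match (the cases commented "false"), and returns false only on an error such as a malformed pattern. The following is a minimal sketch that spells the result out in words; the helper name show_match is hypothetical and is not part of lect16-regexps-examples.php.

```php
<?php
// Hypothetical helper (not from the lecture's example file):
// describe the result of preg_match in words.
function show_match( $pattern, $subject ) {
    $result = preg_match( $pattern, $subject );
    if ($result === false) {
        return "error";                 // e.g. a malformed pattern
    }
    return ($result === 1) ? "match" : "no match";
}

// Unanchored: the pattern only needs to occur somewhere inside the string.
echo show_match( '/.ow/',   'How now, brown cow?' ), "\n";   // match

// Anchored with ^ and $: the whole string must fit the pattern.
echo show_match( '/^.ow$/', 'How now, brown cow?' ), "\n";   // no match
echo show_match( '/^.ow$/', 'Zow' ), "\n";                   // match
```

Comparing the result with === 1 (rather than relying on truthiness) keeps the "no match" and "error" cases distinct.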
    
See the manual.
Note: the name “preg” comes from “Perl-compatible regular expressions”; earlier versions of PHP also offered POSIX regexps (the ereg functions), but PHP has deprecated those.

Your task: What is a regular expression to match...

False positive, and false negative:

                              test result
                              negative          positive
is test accurate?   false     false negative    false positive
                    true      true negative     true positive
In our setting, the “test” is preg_match, and “accurate” means whether the value returned is what we want it to be.
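
To make the four outcomes concrete, here is a small sketch. The goal used here (accept exactly the strings consisting of one or more digits) and the helper name classify are assumptions made for illustration; they are not taken from the lecture.

```php
<?php
// Hypothetical helper: compare what preg_match reports with what we wanted.
//   $wanted is true if the string *should* be accepted.
function classify( $pattern, $subject, $wanted ) {
    $got      = (preg_match( $pattern, $subject ) === 1);
    $accuracy = ($got === $wanted) ? "true" : "false";
    $outcome  = $got ? "positive" : "negative";
    return "$accuracy $outcome";
}

// Assumed goal: accept exactly the strings made of one or more digits.
echo classify( '/[0-9]+/',        'abc123', false ), "\n";  // false positive (unanchored)
echo classify( '/^[1-9][0-9]*$/', '0',      true  ), "\n";  // false negative (too strict)
echo classify( '/^[0-9]+$/',      '00047',  true  ), "\n";  // true positive
echo classify( '/^[0-9]+$/',      'abc123', false ), "\n";  // true negative
```

A pattern that is too loose tends to produce false positives; one that is too strict tends to produce false negatives.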

Warning: Beware matching a top-level * expression: the empty-string matches it, and any string contains the empty-string! Thus preg_match_all( "/(xyz)*/", "uh-oh") === 6!!, since "uh-oh" has zero "xyz"'s at the start, followed by 'u', followed by zero more "xyz"'s, followed by 'h', followed by ….
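
The warning can be checked directly. This is a short sketch; the variable names are arbitrary, and the two alternatives shown (using + or anchoring with ^ and $) are just two possible ways to avoid the empty matches.

```php
<?php
// (xyz)* can match the empty string, so each of the 6 positions in "uh-oh"
// (before each of the 5 characters, plus the very end) yields an empty match.
$count = preg_match_all( '/(xyz)*/', 'uh-oh', $matches );
echo $count, "\n";                                             // 6

// Requiring at least one repetition rules out the empty matches.
echo preg_match_all( '/(xyz)+/', 'uh-oh', $matches2 ), "\n";   // 0

// Anchoring forces the whole string to consist of "xyz"s.
echo preg_match( '/^(xyz)*$/', 'uh-oh' ), "\n";                // 0
echo preg_match( '/^(xyz)*$/', 'xyzxyz' ), "\n";               // 1
```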

Atomic regexps

Compound regexps

There are also ways of building bigger regexps out of smaller ones:

NOTE: In PHP, unlike in JavaScript, a trailing “g” modifier is not allowed! (To find every match, call preg_match_all instead.)
(There are other trailing modifiers however — e.g. i for case-insensitive, and more.)
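
A brief sketch of a trailing modifier in use, and of preg_match_all standing in for JavaScript's “g”. The sample strings are reused from the examples earlier on this page; the variable name $found is arbitrary.

```php
<?php
// Without a modifier, matching is case-sensitive.
echo preg_match( '/zow/',  'Wowee Zowee' ), "\n";   // 0
// The trailing "i" makes the match case-insensitive.
echo preg_match( '/zow/i', 'Wowee Zowee' ), "\n";   // 1

// No "g" modifier in PHP: preg_match_all finds every match
// and returns how many there were.
$count = preg_match_all( '/.ow/', 'How now, brown cow?', $found );
echo $count, "\n";                                  // 4
echo implode( ' ', $found[0] ), "\n";               // How now row cow
```

Here $found[0] holds the text of each full match, in order of appearance.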

Three helpful functions:

regexps vs. unicode


[2] C, ML, and Haskell let you introduce new type-synonyms, but Java does not: there you have to introduce a new class that wraps strings.

[1] Similarly, I'd love to have a language that lets me rename types[2], so that I could have string (for raw data), html-data (safe to concatenate to HTML), and sql-data (safe to splice into a SQL query as data). htmlspecialchars and mysql_real_escape_string can be thought of as constructors for these new types, but the type system doesn't help protect me from passing a raw (unsanitized) string where already-sanitized data was expected.

[3] Okay, that's actually vague: is all of “de Soto” or “da Vinci” the last name? Names are notoriously hard to characterize, especially across multiple cultures. My advice is to liberally accept whatever characters people say their name contains; trimming and collapsing whitespace is about all I'd do.



©2015, Ian Barland, Radford University
Last modified 2015.Feb.23 (Mon)
Please mail any suggestions
(incl. typos, broken links)
to ibarlandradford.edu
Rendered by Racket.