ITEC 325
2021spring

regular expressions

intro

Review git.html; who ran into conflicts? Auto-resolved? Glance quickly at the git submits; observe: tabs can ruin indentation for others; do indent meaningfully! Poor indentation would be cause for an interviewer to dismiss somebody immediately.

prof: Use colors effectively: Say, patterns in green, strings in black, and explaining-meaning-of-regexp-chars in blue?

Unwieldy video from distance lecture, 2017-feb-21 (1h35m)
but you can watch just [0:24:40,1:33:00) (1h08m) about regular-expressions.
(You can skip over the first 25min (opening pause [0:00:00,0:00:30), some ERD review [0:00:30,0:04:30), git-review [0:04:30,0:22:00), and a few comments about projects-with-external-clients [0:22:00,0:24:40)), as well as the final few “shutting down” minutes at [1:33:00,1:35:00).)

regular expression intro

regexps.pdf

Representing regular expressions

Regular expressions are not the same type as strings². Various languages have slightly different ways to represent regexps. Video (14m23s).

For example:

As a final annoyance to polyglots, languages differ on what they do with “illegal” character-escapes. For example, given "a\zb" as a string-literal in code: (And that leaves me further wondering: what do the regexp library implementations inside each of these languages choose to do with illegal escape-characters at the regexp level, after having converted strings into regexps. I would hope they throw an exception, rather than take either of the latter two approaches above.)
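For PHP specifically, the string-literal half of that question can be checked directly. The sketch below (variable names are illustrative only) relies on PHP's rule that an unrecognized escape in a string literal keeps its backslash. It says nothing about what the regexp engine does with the result afterwards.

```php
<?php
// In PHP, an unrecognized escape such as \z is left alone:
// the backslash stays in the string, in both quoting styles.
$double = "a\zb";   // double-quoted: \z is not a known escape
$single = 'a\zb';   // single-quoted: only \\ and \' are escapes

echo strlen($double), "\n";      // 4 characters: a \ z b
var_dump($double === $single);   // bool(true)

// Contrast with a recognized escape, which collapses to one character:
echo strlen("a\nb"), "\n";       // 3 characters: a, newline, b
```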


PHP regular expressions

PHP regular-expression matching: In PHP, regular expressions are strings delimited by a special character (usually /). Some examples follow; see “playing with regexps in PHP” for details.

      // preg_match returns 1 if the string contains a match, 0 if not (and false on error).
      echo preg_match( '/abcd/', 'abcd' );      // 1
      echo preg_match( '/abcd/', 'azcd' );      // 0

      echo preg_match( '/a..d/', 'azcd' );      // 1: . matches any single character (besides newline, by default)

      echo preg_match( '/ab*cd/', 'abbbbcd' );  // 1: b*  matches 0-or-more b’s.
      echo preg_match( '/ab*cd/', 'abcd' );     // 1
      echo preg_match( '/ab*cd/', 'acd'  );     // 1

      echo preg_match( '/ab+cd/', 'abbbbcd' );  // 1: b+  matches 1-or-more b’s.
      echo preg_match( '/ab+cd/', 'abcd' );     // 1
      echo preg_match( '/ab+cd/', 'acd'  );     // 0

      // .* and .+ are common patterns.
      echo preg_match( '/ab.*cd/', 'abcd' );               // 1
      echo preg_match( '/ab.*cd/', 'abXd' );               // 0: the string has no "cd"
      echo preg_match( '/ab.*cd/', 'abBlahBlahBlahcd' );   // 1
      echo preg_match( '/ab.+cd/', 'abcd' );               // 0
      echo preg_match( '/ab.+cd/', 'abXcd' );              // 1


      echo preg_match( '/ab?cd/', 'acd' );          // 1: b?  matches 0-or-1 b
      echo preg_match( '/ab?cd/', 'abcd' );         // 1
      echo preg_match( '/ab?cd/', 'abbbbcd'  );     // 0

      echo preg_match( '/(ab)*cd/', 'ababababcd' );  // 1: parens do grouping
      echo preg_match( '/(ab)*cd/', 'abbbbcd' );     // 1, not 0: zero ab’s followed by the final "cd"
      echo preg_match( '/^(ab)*cd$/', 'abbbbcd' );   // 0: anchored (see ^ and $ below)


      // WARNING: preg_match looks to see if the string *contains* a match!
      echo preg_match( '/row/', 'How now, brown cow?' );  // 1
      echo preg_match( '/.ow/', 'How now, brown cow?' );  // 1

      // Use "^" to specify the start-of-string, and "$" to specify end-of-string.
      echo preg_match( '/^.ow$/', 'How now, brown cow?' );  // 0
      echo preg_match( '/^.ow$/', 'Zow' );                  // 1
      echo preg_match( '/^.ow/', 'Zowee' );                 // 1
      echo preg_match( '/w.e$/', 'Wowee Zowee' );           // 1


      echo preg_match( '/[WZ]ow/', 'Wowee Zowee' );  // 1: square-brackets match any one character from the set
      echo preg_match( '/[WZ]ow/', 'Yow' );          // 0
      echo preg_match( '/[W-Z]ow/', 'Yowee' );       // 1: square-brackets can contain a *range*
      echo preg_match( '/ab[0-9]+de/', 'ab789de' );  // 1


      echo preg_match( '/[0-9]*/', '00047' );        // 1, but beware: a bare * expression (w/o ^,$) matches every string!
    
See the manual.
Note: the name “preg” comes from “Perl-compatible regular expressions”; earlier PHP used POSIX regexps (the ereg family of functions), but PHP decided to deprecate those.
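Because a PHP regexp is just a string wrapped in delimiters, the delimiter can be swapped when the pattern itself is full of slashes. A small sketch, with a made-up path as the subject string:

```php
<?php
$path = '/usr/local/bin';   // hypothetical subject string

// With / as the delimiter, every literal slash must be backslashed:
$slash = preg_match('/^\/usr\//', $path);

// Another punctuation character as delimiter avoids the backslashes:
$hash = preg_match('#^/usr/#', $path);

// Matching bracket-like characters also work as a delimiter pair:
$brace = preg_match('{^/usr/}', $path);

echo $slash, $hash, $brace, "\n";   // prints 111
```

Stick with / unless the pattern gives a reason not to, since that is what other readers expect.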

Your task: What is a regular expression to match…

False positive, and false negative:

                              test result
                              negative          positive
is test accurate?    false    false negative    false positive
                     true     true negative     true positive
In our setting, the “test” is preg_match, and “accurate” means whether the value returned is what we want it to be.
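To make the four cases concrete, here is a sketch under an assumed goal (not one of the tasks above): accept exactly the non-empty strings of digits. The pattern variable names are made up for the example.

```php
<?php
// Assumed goal for illustration: accept exactly the non-empty strings of digits.

// Too permissive: unanchored, so any string *containing* a digit passes.
$loose = '/[0-9]+/';
// Too strict: anchored, but forgets that 0 is a digit.
$strict = '/^[1-9]+$/';
// Anchored, and covers every digit.
$right = '/^[0-9]+$/';

echo preg_match($loose,  'abc123'), "\n";  // 1: false positive (we wanted 0)
echo preg_match($strict, '100'),    "\n";  // 0: false negative (we wanted 1)
echo preg_match($right,  'abc123'), "\n";  // 0: true negative
echo preg_match($right,  '100'),    "\n";  // 1: true positive
```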

Warning: Beware matching a top-level * expression: the empty string matches it, and any string contains the empty string! Thus preg_match_all( "/(xyz)*/", "uh-oh" ) === 6, since "uh-oh" has zero "xyz"s at the start, followed by 'u', followed by zero more "xyz"s, followed by 'h', followed by ….
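The warning can be checked directly. This sketch counts the empty matches, then shows one way to avoid them (anchoring and requiring at least one character); the test strings are arbitrary.

```php
<?php
// A bare * expression matches the empty string, which is "found" everywhere.
$count = preg_match_all('/(xyz)*/', 'uh-oh');
echo $count, "\n";   // 6: one empty match at each of the 6 positions in a 5-character string

echo preg_match('/[0-9]*/', 'no digits'), "\n";    // 1: the empty string matches at the start

// Anchoring and requiring at least one digit behaves as intended:
echo preg_match('/^[0-9]+$/', 'no digits'), "\n";  // 0
echo preg_match('/^[0-9]+$/', '00047'), "\n";      // 1
```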

Atomic regexps

Compound regexps

There are also ways of building bigger regexps out of smaller ones:

NOTE: Unlike JavaScript, PHP does not allow a trailing “g” modifier!
(There are other trailing modifiers, however: e.g. i for case-insensitive, and more.)
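A short sketch of both points: the i modifier, and preg_match_all standing in for what a “g” flag would do in JavaScript. The subject string is reused from the examples above; the variable names are illustrative.

```php
<?php
$s = 'How now, brown cow?';

echo preg_match('/how/',  $s), "\n";   // 0: matching is case-sensitive by default
echo preg_match('/how/i', $s), "\n";   // 1: trailing i makes it case-insensitive

// Instead of a "g" flag, preg_match_all finds every match; it returns the
// count and fills $matches[0] with the matched substrings.
$n = preg_match_all('/.ow/', $s, $matches);
echo $n, "\n";                          // 4
echo implode(' ', $matches[0]), "\n";   // How now row cow
```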

Three helpful functions:

regexps vs. unicode

xkcd on regex golf

a warning

regular expressions aren't meant(able) to do everything!

1 Btw, I’d love to have a language that lets me rename types², so that I could have string (for raw data), html-data (safe to concatenate to HTML), and sql-data (safe to splice into a SQL query as data). htmlspecialchars and mysql_real_escape_string can be thought of as constructors for these new types, but the type system doesn’t help protect me from giving a raw (unsanitized) string when it was expecting some already-sanitized data.
2 C, ML, and Haskell let you introduce new type-synonyms, but not Java: you have to introduce a new class that wraps strings.
3 Actually, they can begin/end with pretty much any punctuation character, or even matching parenthesis-like characters, so that’s nice. But use / unless you have reason not to, since that’s the standard/expected practice.
4 Though we can avoid the backslashing by ingeniously using Racket's at-expressions.
5 Okay, that’s actually vague: is “de Soto” or “da Vinci” all last name? Names are notoriously hard to characterize, especially across multiple cultures. My advice is to liberally accept what characters people say their name is; trimming and collapsing whitespace is about all I’d do.
6 Although it's well understood that regexes can't match arbitrary HTML, this question only wants to match open-tag expressions. That seems much more plausible. However, it will be fairly difficult for several reasons: It might be possible to work around each of these still with mere regular expressions, but it won't be fun, it will be error-prone, and the correct solution is to do proper HTML parsing (for which many packages already exist).
7 Well, extended regexps, which include back-references, can get around this limitation somewhat (a finite number of times per expression).

This page licensed CC-BY 4.0 Ian Barland
Please mail any suggestions
(incl. typos, broken links)
to ibarlandradford.edu
Rendered by Racket.