ITEC325 regular expressions

Review git usage: who ran into conflicts? Were they auto-resolved? Glance quickly at the git submits; observe: tabs can ruin indentation for others; do indent meaningfully! Poor indentation could cause an interviewer to dismiss somebody immediately.

Unwieldy video from distance lecture, 2017-feb-21 (1h35m)
but you can watch just [0:24:40,1:33:00) (1h08m) about regular-expressions.
(You can skip over the first 25min (opening pause [0:00:00,0:00:30), some ERD review [0:00:30,0:04:30), git-review [0:04:30,0:22:00), and a few comments about projects-with-external-clients [0:22:00,0:24:40)), as well as the final few "shutting down" minutes at [1:33:00,1:35:00).)

regular expression intro

Representing regular expressions

Regular expressions are not the same type as strings ². Various languages have slightly different ways to represent regexps. video (14m23s)

For example:

PHP regular expressions

PHP regular-expression matching: In PHP, regular expressions are written as strings in which the pattern is delimited by a special character (usually /) ³. Some examples; see playing with regexps in PHP for details.
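As a minimal sketch of the point above (the sample text is made up for illustration), a pattern is just a PHP string whose first and last characters are the delimiters, and it is handed to one of the preg_ functions:

```php
<?php
// Hypothetical sample data, chosen only for illustration.
$pattern = '/[0-9]+/';   // an ordinary PHP string; the two slashes are the delimiters
$subject = 'ITEC325 meets in room 225';

// preg_match returns 1 if the pattern matches, 0 if it does not (false on error).
// The text of the first match is stored in $matches[0].
$found = preg_match($pattern, $subject, $matches);

echo $found, "\n";        // 1
echo $matches[0], "\n";   // 325
```

Single-quoted strings are used here so that PHP itself leaves any backslashes in the pattern alone.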

Atomic regexps

Compound regexps

                              test result
                              negative          positive
is test accurate?    false    false negative    false positive
                     true     true negative     true positive

NOTE: Unlike JavaScript, PHP does not allow a trailing “g” modifier! (To find every match rather than just the first, use preg_match_all.)
(There are other trailing modifiers, however: e.g. i for case-insensitive, and more.)
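A small sketch of the note above, using made-up sample text: preg_match_all plays the role that the “g” flag plays in JavaScript, and the trailing i modifier switches on case-insensitive matching.

```php
<?php
$subject = 'Cat, cot, CUT';   // hypothetical sample text

// preg_match stops at the first match ...
preg_match('/c.t/i', $subject, $first);

// ... while preg_match_all collects every match and returns how many it found.
$count = preg_match_all('/c.t/i', $subject, $all);

// Without the trailing i, matching is case-sensitive, so only "cot" matches.
$countSensitive = preg_match_all('/c.t/', $subject, $sensitive);

echo $first[0], "\n";             // Cat
echo implode(' ', $all[0]), "\n"; // Cat cot CUT
echo $countSensitive, "\n";       // 1
```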

regexps vs. unicode

a warning

² C, ML, and Haskell let you introduce new type-synonyms, but not Java -- you have to introduce a new class that wraps strings. ↩

¹ Btw, I’d love to have a language that lets me rename types ², so that I could have string (for raw-data), html-data (safe to concatenate to HTML), and sql-data (safe to splice into a SQL query as data). htmlspecialchars and mysql_real_escape_string can be thought of as constructors for these new types, but the type-system doesn’t help protect me from giving a raw (unsanitized) string when it was expecting some already-sanitized data. ↩
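One way to approximate this in PHP is the wrapper-class approach mentioned in footnote ². The sketch below is only an illustration: HtmlData, fromRaw, and paragraph are hypothetical names, not part of PHP or of these notes, and typed properties assume PHP 7.4 or later.

```php
<?php
// HtmlData is a hypothetical wrapper class: a string known to be HTML-escaped.
final class HtmlData {
    private string $safe;

    private function __construct(string $safe) { $this->safe = $safe; }

    // The only way to obtain an HtmlData is to escape a raw string.
    public static function fromRaw(string $raw): HtmlData {
        return new HtmlData(htmlspecialchars($raw, ENT_QUOTES, 'UTF-8'));
    }

    public function __toString(): string { return $this->safe; }
}

// Because the parameter is declared as HtmlData, passing a raw string
// is rejected with a TypeError instead of silently reaching the page.
function paragraph(HtmlData $body): string {
    return '<p>' . $body . '</p>';
}

$html = paragraph(HtmlData::fromRaw('1 < 2 & "ok"'));
echo $html, "\n";   // <p>1 &lt; 2 &amp; &quot;ok&quot;</p>
```

The private constructor is the design choice that matters: it makes the escaping function the sole “constructor” for the type, which is the role the footnote assigns to htmlspecialchars.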

³ Actually, they can begin/end with pretty much any punctuation character, or even matching parenthesis-like characters, so that’s nice. But use / unless you have reason not to, since that’s the standard/expected practice. ↩
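A typical “reason not to” use / is a pattern that itself contains slashes. The sketch below (the URL is a made-up example) writes the same pattern with three different delimiters:

```php
<?php
$url = 'http://example.com/itec325/notes';   // hypothetical URL

// With / as the delimiter, every literal slash in the pattern needs a backslash.
$withSlashes = preg_match('/^http:\/\/[^\/]+\/itec325\//', $url);

// With # as the delimiter, the same pattern needs no escaping.
$withHashes = preg_match('#^http://[^/]+/itec325/#', $url);

// Matching bracket-style delimiters work too.
$withBraces = preg_match('{^http://[^/]+/itec325/}', $url);

echo $withSlashes, $withHashes, $withBraces, "\n";   // 111
```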

⁴ Okay, that’s actually vague: is “de Soto” or “da Vinci” all part of the last name? Names are notoriously hard to characterize, especially across multiple cultures. My advice is to liberally accept whatever characters people say their name contains; trimming and collapsing whitespace is about all I’d do. ↩
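The “trim and collapse whitespace” advice can be sketched in one line of PHP. The function name normalizeName is hypothetical, and the return-type declaration assumes PHP 7 or later.

```php
<?php
// normalizeName is a hypothetical helper name, used only for illustration.
function normalizeName(string $raw): string {
    // \s+ matches a run of one or more whitespace characters;
    // replace each run with a single space, then trim both ends.
    return trim(preg_replace('/\s+/', ' ', $raw));
}

echo normalizeName("  Leonardo   da \t Vinci \n"), "\n";   // Leonardo da Vinci
```

Nothing else about the name is checked or rejected, in keeping with the advice to accept whatever people enter.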

⁵ Although it's well understood that regexps can't match arbitrary HTML, this question only wants to match open-tag expressions. That seems much more plausible. However, it will be fairly difficult for several reasons:

Once you match “<”, it's not true that the next “>” will be closing it: attribute-values can contain “>” characters.
Oh wait… you also have to be sure that the initial “<” you match is not actually sitting inside some other attribute-value.
“naked” less-than symbols might be encountered inside CDATA sections, html comments, script elements, and style elements (h/t to several in that thread, including itsadok's).

It might be possible to work around each of these still with mere regular expressions, but it won't be fun, it will be error-prone, and the correct solution is to do proper HTML parsing (for which many packages already exist). ↩

⁶ Well, extended-regexps, which include back-references, can get around this limitation somewhat (a finite number of times per expression). ↩
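For readers who have not met back-references, here is a minimal sketch of what one does (the sample sentences are made up, and the example only shows the mechanism, not the specific limitation the footnote refers to): \1 must match exactly the text that the first parenthesized group matched.

```php
<?php
// \1 refers back to whatever the first group (\w+) matched,
// so this pattern finds a word that is immediately repeated.
$doubled = '/\b(\w+) \1\b/';

$hit = preg_match($doubled, 'it was the the best of times', $m);
$miss = preg_match($doubled, 'it was the best of times');

echo $hit, ' ', $m[0], "\n";   // 1 the the
echo $miss, "\n";              // 0
```

Each \1 ties together only one repetition, which is why an expression can use this trick only a fixed, finite number of times.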