Unwieldy video from the distance lecture, 2017-feb-21 (1h35m),
but you can watch just [0:24:40,1:33:00) (1h08m), the part about regular-expressions.
(You can skip over the first 25min (opening pause [0:00:00,0:00:30),
some ERD review [0:00:30,0:04:30),
git-review [0:04:30,0:22:00),
and a few comments about projects-with-external-clients [0:22:00,0:24:40)),
as well as the final few "shutting down" minutes at [1:33:00,1:35:00).)
Regular expressions are not the same type as strings.
Various languages have slightly different ways to represent regexps.
video (14m23s)
For example:
The thing to be careful about is regexps that contain a backslash:
To get the pattern
This sometimes leads to backslash hell:
If a regexp wants to match a backslash,
the regexp will contain
Even where we don't have this situation,
regexps that use a lot of backslashes end up looking even less readable as Java strings.
This solution is nice, because it emphasizes the difference between strings and regexps at the type level (which we like), without needing convenience functions which immediately obscure the difference and erode the helpfulness of the type-checker.
Note that internally, there is still a
python: There aren't regexp-literals built in, but even better:
python has “raw-string literals”, which can be used in many situations,
including being terribly convenient for writing regular expressions:
the raw-string-literal
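As a small sketch of the raw-string point (the sample patterns and the subject strings here are made up for illustration, not taken from the lecture): the regular string literal and the raw-string literal below denote the same characters, and compiling either one gives a pattern object whose type is not `str`.

```python
import re

# A regular string literal needs every backslash doubled;
# a raw-string literal (prefix r) passes backslashes through untouched.
plain = "\\d+\\.\\d+"
raw = r"\d+\.\d+"
same_decimal_pattern = (plain == raw)      # both are the 8 characters \d+\.\d+

# A regexp that matches one literal backslash is the two characters \\ ;
# as a regular string literal that takes four backslashes.
backslash_raw = r"\\"
backslash_plain = "\\\\"
same_backslash_pattern = (backslash_raw == backslash_plain)

# Compiling gives a pattern object, which is a different type from str.
decimal = re.compile(raw)
found = decimal.search("pi is about 3.14159")
matched_text = found.group(0)

# The subject string C:\temp contains one backslash (written doubled here).
has_backslash = re.search(backslash_raw, "C:\\temp") is not None
```

The four-backslash version is the "backslash hell" mentioned for Java strings; the raw-string version keeps the regexp readable as written.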
PHP: PHP does have single-quotes for raw-string-literals, so you'd think it'd be cool like python.
But not quite (of course):
For PHP, the regular-expression libraries expect strings which begin and end with
“
They take the fine idea of raw-strings from python, but (for regexps) they then tack on a superfluous javascript-but-not requirement that detracts from their usefulness. (See phpsadness.com.)
PHP regular-expression matching:
In php, regular expressions are strings delimited by a special character
(usually
echo preg_match( '/abcd/', 'abcd' );
echo preg_match( '/abcd/', 'azcd' );                 // false
echo preg_match( '/a..d/', 'azcd' );                 // . matches any single character (besides newline, by default)

echo preg_match( '/ab*cd/', 'abbbbcd' );             // b* matches 0-or-more-b's.
echo preg_match( '/ab*cd/', 'abcd' );
echo preg_match( '/ab*cd/', 'acd' );
echo preg_match( '/ab+cd/', 'abbbbcd' );             // b+ matches 1-or-more-b's.
echo preg_match( '/ab+cd/', 'abcd' );
echo preg_match( '/ab+cd/', 'acd' );                 // false

// .* and .+ are common patterns.
echo preg_match( '/ab.*cd/', 'abcd' );
echo preg_match( '/ab.*cd/', 'abXd' );               // false
echo preg_match( '/ab.*cd/', 'abBlahBlahBlahcd' );
echo preg_match( '/ab.+cd/', 'abcd' );               // false
echo preg_match( '/ab.+cd/', 'abXcd' );

echo preg_match( '/ab?cd/', 'acd' );                 // b? matches 0-or-1 b
echo preg_match( '/ab?cd/', 'abcd' );
echo preg_match( '/ab?cd/', 'abbbbcd' );             // false

echo preg_match( '/(ab)*cd/', 'ababababcd' );        // parens do grouping
echo preg_match( '/(ab)*cd/', 'abbbbcd' );           // true (!): the trailing 'cd' alone matches, with zero (ab)'s

// WARNING: preg_match looks to see if the string *contains* a match!
echo preg_match( '/row/', 'How now, brown cow?' );   // true
echo preg_match( '/.ow/', 'How now, brown cow?' );   // true

// Use "^" to specify the start-of-string, and "$" to specify end-of-string.
echo preg_match( '/^.ow$/', 'How now, brown cow?' ); // false
echo preg_match( '/^.ow$/', 'Zow' );
echo preg_match( '/^.ow/', 'Zowee' );
echo preg_match( '/w.e$/', 'Wowee Zowee' );

echo preg_match( '/[WZ]ow/', 'Wowee Zowee' );        // square-brackets match any one character from the set
echo preg_match( '/[WZ]ow/', 'Yow' );                // false
echo preg_match( '/[W-Z]ow/', 'Yowee' );             // square-brackets can contain a *range*
echo preg_match( '/ab[0-9]+de/', 'ab789de' );
echo preg_match( '/[0-9]*/', '00047' );              // Beware: matching just a * expression (w/o ^,$)!
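The "contains a match" warning is not special to php. Here is a rough sketch of the same distinction using python's `re` module (the helper names `contains_match` and `whole_match` are made up for this illustration): `re.search` asks whether the string contains a match, while `re.fullmatch` behaves roughly like wrapping the pattern in `^…$`.

```python
import re

def contains_match(pattern, text):
    """True if some substring of text matches the pattern."""
    return re.search(pattern, text) is not None

def whole_match(pattern, text):
    """True only if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

sentence = "How now, brown cow?"
results = {
    "row contained": contains_match(r"row", sentence),          # inside "brown"
    ".ow whole sentence": whole_match(r".ow", sentence),
    ".ow whole Zow": whole_match(r".ow", "Zow"),
    # An unanchored (ab)*cd is satisfied by the trailing "cd" alone (zero ab's).
    "(ab)*cd contained": contains_match(r"(ab)*cd", "abbbbcd"),
    "(ab)*cd whole": whole_match(r"(ab)*cd", "abbbbcd"),
}
```

When validating input (rather than searching inside it), the whole-string check is usually the one you want.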
Your task: What is a regular expression to match...
False positive, and false negative:
is test accurate? | test result: negative | test result: positive |
---|---|---|
false | false negative | false positive |
true | true negative | true positive |
Warning: Beware matching a top-level * expression: the empty-string matches it, and any string contains the empty-string! Thus preg_match_all( "/(xyz)*/", "uh-oh" ) === 6 !!, since "uh-oh" has zero "xyz"'s at the start, followed by 'u', followed by zero more "xyz"'s, followed by 'h', followed by ….
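The same count shows up outside php. This sketch uses python's `re.finditer` as a stand-in for `preg_match_all` (an analogy, not the php call itself): a five-character string has six positions, and the star pattern matches the empty string at each of them.

```python
import re

text = "uh-oh"

# Every position in the string (before each character, plus the very end)
# yields an empty match for the unanchored star pattern.
star_matches = [m.group(0) for m in re.finditer(r"(xyz)*", text)]
star_count = len(star_matches)

# Requiring at least one repetition, and the whole string, removes the surprise.
plus_whole_bad = re.fullmatch(r"(xyz)+", text) is not None
plus_whole_good = re.fullmatch(r"(xyz)+", "xyzxyz") is not None
```

Here `star_count` equals `len(text) + 1`, and every entry of `star_matches` is the empty string.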
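bug?: preg_match_all( '/\p{N}/', "⁴٤𝟜4" ) is returning 1 for me (not 4), in php 5.4.3. (A likely cause: without the trailing u modifier, the string is treated as individual bytes rather than UTF-8 characters, so only the ASCII '4' counts as a number.)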
Checking for space (one of the most common situations) is the one
that is hardest:
We have
Also, we are probably least able to ignore this problem, because in web forms people might paste in text from web pages or Word documents, and those expert-typesetting programs do tend to use the various special-spaces.
Solution: Either
$str2 = preg_replace( '/\p{Z}+/', ' ', $str );
…
if (preg_match( '/\s+/', $str2 )) …
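A sketch of the same normalize-first idea in python, under some assumptions: python's `re` module has no `\p{Z}`, so the hypothetical helper `normalize_spaces` below checks the Unicode category by hand, and the `re.ASCII` flag is used only to imitate a matcher whose `\s` knows just the ASCII whitespace characters. (Without that flag, python's `\s` already matches many Unicode spaces, so the php problem does not carry over exactly.)

```python
import re
import unicodedata

def normalize_spaces(text):
    """Replace each run of Unicode separators (categories Zs, Zl, Zp) with one ASCII space."""
    out = []
    in_run = False
    for ch in text:
        if unicodedata.category(ch).startswith("Z"):
            if not in_run:
                out.append(" ")      # first separator of a run becomes a plain space
            in_run = True            # later separators in the same run are dropped
        else:
            out.append(ch)
            in_run = False
    return "".join(out)

# Pasted text containing a NO-BREAK SPACE followed by an EM SPACE.
pasted = "first\u00a0\u2003name"

# An ASCII-only \s does not see either special space...
ascii_hit_before = re.search(r"\s", pasted, re.ASCII) is not None

# ...but after normalizing, the plain check works.
cleaned = normalize_spaces(pasted)
ascii_hit_after = re.search(r"\s", cleaned, re.ASCII) is not None
```

Normalizing once, right where the form input arrives, means the rest of the code can keep using the simple whitespace checks.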
NOTE:
In php, no trailing “
(There are other trailing modifiers however —
e.g. i for case-insensitive,
and more.)
Warning:
This page licensed CC-BY 4.0 Ian Barland Page last generated 2018.Mar.12 (Mon) | Please mail any suggestions (incl. typos, broken links) to ibarlandradford.edu |