Review git usage; who ran into conflicts? Auto-resolved? Glance quickly at the git submits; observe: tabs can ruin indentation for others; do indent meaningfully! Poor indentation would cause an interviewer to dismiss somebody immediately.
prof: Use colors effectively: Say, patterns in green, strings in black, and explaining-meaning-of-regexp-chars in blue?
A
So if
My favorite words include
"bot",
"bat",
"blt",
and even
"b4t"
and
"b%t"
and
any letter between a b and a t,
then
Well, actually
Hey Barland — Why regular expressions? I mean, when do you use these things?
In programs: verify that a string has a certain structure. (e.g. certain URLs might need to start with "http://" or "https://", then at least one "." in the , and then at least two later "/"s separated by letters.)
In fact, we'll use regular expressions to validate data. E.g. a text-field for credit-card-numbers might need to be in a certain format.
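As a minimal sketch of this kind of validation, in Python: the names `CARD_FORMAT` and `looks_like_card_number`, and the format itself (four groups of four digits, separated by single spaces or hyphens), are illustrative assumptions and not a real card-validation rule.

```python
import re

# Assumed format, for illustration only: four groups of four digits,
# each group separated by one space or one hyphen.
CARD_FORMAT = re.compile(r"[0-9]{4}([ -][0-9]{4}){3}")

def looks_like_card_number(text):
    """Return True if the *entire* string has the assumed format."""
    return CARD_FORMAT.fullmatch(text) is not None

print(looks_like_card_number("1234 5678 9012 3456"))             # True
print(looks_like_card_number("1234-5678-9012-3456"))             # True
print(looks_like_card_number("1234 5678 9012"))                  # False: only three groups
print(looks_like_card_number("my card is 1234 5678 9012 3456"))  # False: extra text
```

Using `fullmatch` means the whole text-field must have the structure, not merely contain it somewhere.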
A “
So
Say we want to have a regexp matching strings with hi
occurring mid-word, in between any two other (lower-case) vowels.
We can do this with what we've already seen:
Conceptually the same, what about "hi" embedded between any two
other lower-case letters?
It turns out, it's pretty common that you want to
have an "
From here, it's a small jump to allow entire ranges,
as a shorthand:
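The three steps above (spelling out every alternative, then a set of characters, then a range) might be sketched in Python like this. The variable names, the helper `contains`, and the sample words are illustrative assumptions.

```python
import re

# Step 1: "hi" between two lower-case vowels, spelling out every alternative.
between_vowels_alt = re.compile(r"(a|e|i|o|u)hi(a|e|i|o|u)")
# Step 2: the same thing, with square-brackets for "any one character from this set".
between_vowels = re.compile(r"[aeiou]hi[aeiou]")
# Step 3: a range, as shorthand for "any one lower-case letter".
between_letters = re.compile(r"[a-z]hi[a-z]")

def contains(pattern, text):
    """True if the pattern matches somewhere inside text."""
    return pattern.search(text) is not None

print(contains(between_vowels_alt, "ohio"))    # True:  o-hi-o
print(contains(between_vowels, "ohio"))        # True
print(contains(between_vowels, "machine"))     # False: the "hi" follows a 'c'
print(contains(between_letters, "machine"))    # True:  c-hi-n
print(contains(between_letters, "hi"))         # False: no letters around it
```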
essentials: The two ascii codes every CS person should have on the tip of their tongue are those for 'A' and ' ' (space). Knowing '0' is nice too; I don't find I need much beyond that. (Well, the hex ascii-codes for space and / come up a lot in encoded URLs, so they're good to know.)
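A quick way to check these codes, sketched in Python (the sample string "a b/c" is just an illustration):

```python
from urllib.parse import quote

print(ord("A"), ord(" "), ord("0"))     # 65 32 48
print(hex(ord(" ")), hex(ord("/")))     # 0x20 0x2f

# In an encoded URL, space and / show up as % followed by their hex ascii-codes.
encoded = quote("a b/c", safe="")
print(encoded)                          # a%20b%2Fc
```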
Unwieldy video from distance lecture, 2017-feb-21 (1h35m)
but you can watch just [0:24:40,1:33:00) (1h08m) about regular-expressions.
(You can skip over the first 25min (opening pause [0:00:00,0:00:30),
some ERD review [0:00:30,0:04:30),
git-review [0:04:30,0:22:00),
and a few comments about projects-with-external-clients [0:22:00,0:24:40)),
as well as the final few "shutting down" minutes at [1:33:00,1:35:00).)
Regular expressions are not the same type as strings.
Various languages have slightly different ways to represent regexps.
video (14m23s)
For example:
The thing to be careful about is regexps that contain a backslash:
To get the pattern
This sometimes leads to backslash hell:
If a regexp wants to match a backslash,
the regexp will contain
Even where we don’t have this situation,
regexps that use a lot of backslashes end up looking even less readable as Java strings.
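The same doubling shows up in any language whose ordinary string literals treat backslash as an escape character. Here is a small sketch in Python rather than Java, with hypothetical variable names:

```python
import re

# The regexp we want consists of two characters, backslash backslash,
# which (as a regexp) matches one literal backslash.
# In an ordinary string literal each of those backslashes must itself be escaped:
pattern_text = "\\\\"
print(len(pattern_text))        # 2: the string holds just two backslash characters

one_backslash = "\\"            # a one-character string: a single backslash
print(len(one_backslash))       # 1

print(re.fullmatch(pattern_text, one_backslash) is not None)   # True
```

So four backslashes in the source code become two in the string, which the regexp engine reads as one literal backslash.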
This solution is nice, because it emphasizes the difference between strings and regexps at the type level (which we like), without needing convenience functions which immediately obscure the difference and erode the helpfulness of the type-checker.
Note that internally, there is still a
python: There aren’t regexp-literals built in, but even better:
python has “raw-string literals”, which can be used in many situations,
including being terribly convenient for writing regular expressions:
the raw-string-literal
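A minimal sketch of why raw-string literals are convenient here (the pattern and the name `decimal` are illustrative assumptions):

```python
import re

# A raw-string literal leaves backslashes alone, so the regexp reads as written.
raw_version = r"[0-9]+\.[0-9]+"
ordinary_version = "[0-9]+\\.[0-9]+"    # the same characters, but with the backslash doubled
print(raw_version == ordinary_version)  # True

decimal = re.compile(raw_version)       # digits, a literal dot, digits
print(decimal.fullmatch("3.14") is not None)   # True
print(decimal.fullmatch("3x14") is not None)   # False: \. matches only a real dot
```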
PHP: PHP does have single-quotes for raw-string-literals, so you’d think it’d be cool like python.
But not quite (of course):
For PHP, the regular-expression libraries expect strings which begin and end with
“
They take the fine idea of raw-strings from python, but (for regexps) they then tack on a superfluous javascript-but-not requirement that detracts from its usefulness. phpsadness.com.
PHP regular-expression matching:
In php, regular expressions are strings delimited by a special character
(usually
    echo preg_match( '/abcd/', 'abcd' );
    echo preg_match( '/abcd/', 'azcd' );                // false
    echo preg_match( '/a..d/', 'azcd' );                // . matches any single character (besides newline, null)
    echo preg_match( '/ab*cd/', 'abbbbcd' );            // b* matches 0-or-more-b’s.
    echo preg_match( '/ab*cd/', 'abcd' );
    echo preg_match( '/ab*cd/', 'acd' );
    echo preg_match( '/ab+cd/', 'abbbbcd' );            // b+ matches 1-or-more-b’s.
    echo preg_match( '/ab+cd/', 'abcd' );
    echo preg_match( '/ab+cd/', 'acd' );                // false
    echo preg_match( '/ab.*cd/', 'abcd' );
    echo preg_match( '/ab.*cd/', 'abXd' );              // .* and .+ are common patterns.
    echo preg_match( '/ab.*cd/', 'abBlahBlahBlahcd' );
    echo preg_match( '/ab.+cd/', 'abcd' );              // false
    echo preg_match( '/ab.+cd/', 'abXcd' );
    echo preg_match( '/ab?cd/', 'acd' );                // b? matches 0-or-1 b
    echo preg_match( '/ab?cd/', 'abcd' );
    echo preg_match( '/ab?cd/', 'abbbbcd' );            // false
    echo preg_match( '/(ab)*cd/', 'ababababcd' );       // parens do grouping
    echo preg_match( '/(ab)*cd/', 'abbbbcd' );          // false

    // WARNING: preg_match looks to see if the string *contains* a match!
    echo preg_match( '/row/', 'How now, brown cow?' );  // true
    echo preg_match( '/.ow/', 'How now, brown cow?' );  // true

    // Use "^" to specify the start-of-string, and "$" to specify end-of-string.
    echo preg_match( '/^.ow$/', 'How now, brown cow?' ); // false
    echo preg_match( '/^.ow$/', 'Zow' );
    echo preg_match( '/^.ow/', 'Zowee' );
    echo preg_match( '/w.e$/', 'Wowee Zowee' );

    echo preg_match( '/[WZ]ow/', 'Wowee Zowee' );       // square-brackets match any one character from the set
    echo preg_match( '/[WZ]ow/', 'Yow' );               // false
    echo preg_match( '/[W-Z]ow/', 'Yowee' );            // square-brackets can contain a *range*
    echo preg_match( '/ab[0-9]+de/', 'ab789de' );
    echo preg_match( '/[0-9]*/', '00047' );             // Beware: matching just a * expression (w/o ^,$)!
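The "contains a match" warning is not special to PHP. As a sketch of the same distinction in Python: `re.search` looks anywhere in the string, while `re.fullmatch` insists that the whole string match (the variable name `text` is an illustrative assumption).

```python
import re

text = "How now, brown cow?"

print(re.search(r"row", text) is not None)       # True: "row" occurs inside "brown"
print(re.search(r"^.ow$", text) is not None)     # False: anchored at both ends
print(re.fullmatch(r".ow", text) is not None)    # False: fullmatch anchors implicitly
print(re.fullmatch(r".ow", "Zow") is not None)   # True
```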
Your task: What is a regular expression to match…
False positive, and false negative:
| | | test result: negative | test result: positive |
|---|---|---|---|
| is test accurate? | false | false negative | false positive |
| | true | true negative | true positive |
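To connect the table to regexps: a pattern is a test, and it can be wrong in either direction. The helper `classify` and the goal "accept strings that consist only of digits" in this Python sketch are hypothetical, chosen only for illustration.

```python
import re

def classify(pattern, text, should_match):
    """Compare what the regexp says with what we wanted it to say."""
    says_match = re.search(pattern, text) is not None
    if says_match and should_match:
        return "true positive"
    if says_match and not should_match:
        return "false positive"
    if not says_match and should_match:
        return "false negative"
    return "true negative"

# Hypothetical goal: accept strings that consist only of digits.
print(classify(r"[0-9]+", "12345", True))     # true positive
print(classify(r"[0-9]+", "abc", False))      # true negative
print(classify(r"[0-9]+", "12a45", False))    # false positive: pattern is not anchored
print(classify(r"^[0-9]$", "12345", True))    # false negative: pattern allows only one digit
```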
Warning: Beware matching a top-level * expression: the empty-string matches it, and any string contains the empty-string! Thus preg_match_all( "/(xyz)*/", "uh-oh" ) === 6 !!, since "uh-oh" has zero "xyz"s at the start, followed by 'u', followed by zero more "xyz"s, followed by 'h', followed by ….
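Python's `re` module counts the same way for this pattern, which makes the empty matches easy to inspect. A small sketch (the variable name `matches` is an assumption):

```python
import re

matches = list(re.finditer(r"(xyz)*", "uh-oh"))

print(len(matches))                             # 6
print([m.start() for m in matches])             # [0, 1, 2, 3, 4, 5]
print(all(m.group(0) == "" for m in matches))   # True: every match is the empty-string

# Requiring the whole string to match removes the surprise:
print(re.fullmatch(r"(xyz)*", "uh-oh") is None)         # True
print(re.fullmatch(r"(xyz)*", "xyzxyz") is not None)    # True
```

There is one empty match at each of the five characters, plus one at the very end of the string.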
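bug?: preg_match_all( '/\p{N}/', "⁴٤𝟜4" ) is returning 1 for me (not 4), in php 5.4.3. (Likely cause: without the trailing u modifier, the pattern and subject are treated as bytes rather than as UTF-8, so only the ascii '4' is recognized as a number.)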
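To see why 4 is the expected answer: each of those four characters belongs to one of Unicode's "N" (number) categories. Python's built-in `re` module does not support `\p{N}`, so this sketch uses the standard `unicodedata` module instead; the variable names are assumptions.

```python
import unicodedata

# superscript four, Arabic-Indic digit four, double-struck digit four, ascii 4
text = "\u2074\u0664\U0001D7DC4"

categories = [unicodedata.category(ch) for ch in text]
number_count = sum(cat.startswith("N") for cat in categories)

print(len(text))        # 4 characters
print(number_count)     # 4: each one is in an "N" category
```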
Checking for space (one of the most common situations) is the one
that is hardest:
We have
Also, we are probably least able to ignore this problem, because in web forms people might paste in text from web pages or Word documents, and those expert typesetting programs do tend to use the various special-spaces.
Solution: Either
    $str2 = preg_replace( '/\p{Z}+/', ' ', $str );
    …
    if( preg_match( '/\s+/', $str2 ) ) …
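The normalize-first idea can be sketched in Python as well. The function name `normalize_spaces` and the sample text are illustrative assumptions; the sketch replaces every character in a Unicode separator category (Zs, Zl, Zp) with a plain space, then collapses runs of spaces.

```python
import re
import unicodedata

def normalize_spaces(text):
    """Turn every Unicode separator character into ' ', then collapse runs of spaces."""
    plain = "".join(
        " " if unicodedata.category(ch).startswith("Z") else ch
        for ch in text
    )
    return re.sub(r" +", " ", plain)

# U+00A0 is a no-break space; U+2003 is an em space (both common in pasted text).
pasted = "credit\u00a0card\u2003\u2003number"
print(normalize_spaces(pasted))             # credit card number
print(normalize_spaces(pasted).split(" "))  # ['credit', 'card', 'number']
```

After normalizing, later checks only ever have to think about the ordinary space character.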
NOTE:
In php, no trailing “
(There are other trailing modifiers however —
e.g. i for case-insensitive,
and more.)
Warning:
This page licensed CC-BY 4.0 Ian Barland. Please mail any suggestions (incl. typos, broken links) to ibarlandradford.edu