Review git.html; who ran into conflicts? Auto-resolved? Glance quickly at the git submits; observe: tabs can ruin indentation for others; do indent meaningfully! Poor indentation would cause an interviewer to dismiss somebody immediately.
prof: Use colors effectively: Say, patterns in green, strings in black, and explaining-meaning-of-regexp-chars in blue?
A . ("dot") means any one character.
So if my favorite words include "bot", "bat", "blt", and even "b4t" and "b%t" and any letter between a b and a t, then b.t matches the things I like.
Well, actually . matches any character except for a newline.
So if a file contains a line ending in “b” and the next line starting in “t”,
it won't match. Fair enough.
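Here is a quick check of the claims above, sketched with Python's re module (my choice for illustration; the examples later in these notes use PHP). The word list is just the sample strings from the text, and "crab\ntop" is a made-up two-line input.

```python
import re

pattern = re.compile(r"b.t")

# Each sample word from the text has exactly one character between b and t.
words = ["bot", "bat", "blt", "b4t", "b%t"]
matches = [w for w in words if pattern.fullmatch(w)]

# "." refuses to match a newline, so a "b" ending one line and a "t"
# starting the next line are not joined by it.
across_lines = pattern.search("crab\ntop")

# Two characters in the middle is one too many for a single dot.
too_long = pattern.fullmatch("boot")
```

Here fullmatch asks whether the entire string fits the pattern, while search asks whether the string contains a match anywhere.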
Hey Barland — I can kinda see what you're doing, but why regular expressions? I mean, when might you use them in real life, and not just Shatner and Bill the Cat specs?
In programs: verify that a string has a certain structure. (e.g. certain URLs might need to start with “http://” or “https://”, then contain at least one “.” in the next group-of-letters, and then at least two later “/”s separated by letters.)
In fact, we'll use regular expressions to validate data. E.g. a text-field for credit-card-numbers might need to be in a certain format.
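As a rough sketch of that kind of validation, here it is in Python. The format (four groups of four digits, optionally separated by a space or hyphen) and the name looks_like_card_number are assumptions made up for illustration, not a real card-validation rule.

```python
import re

# Hypothetical format: four groups of four digits, with an optional
# space or hyphen between groups. {4} means "exactly four repetitions".
# This only checks the shape of the string; real validation needs more
# (e.g. a checksum).
CARD_SHAPE = re.compile(r"[0-9]{4}([ -]?[0-9]{4}){3}")

def looks_like_card_number(text):
    """Return True if the whole string has the assumed card-number shape."""
    return CARD_SHAPE.fullmatch(text) is not None
```

Using fullmatch (rather than search) matters here: a text field containing a valid-looking number plus extra junk should be rejected.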
Wildcard expansion (“globbing”) is performed by the shell, not by each program.
That is, by the time a program's main is given its String... args,
the program has no idea if those strings had been typed in by hand, or had been expanded/globbed by the shell.
Which is good, because
(a) the program doesn't care about the detail,
and (b) every single program doesn't need to repeat the code to expand wildcards.
Note that the two commands ls *.pdf and echo *.pdf give pretty much identical results!
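To see that wildcard expansion is its own job, separate from whatever a program does with the resulting names, here is a sketch using Python's glob module, which does shell-style wildcard matching from inside a program. The helper name expand_like_shell and the file names are made up for illustration, and a real shell has more rules (quoting, brace expansion, and so on) than this shows.

```python
import glob
import os
import tempfile

def expand_like_shell(pattern, directory):
    """Return the sorted file names in `directory` matching a wildcard pattern."""
    paths = glob.glob(os.path.join(directory, pattern))
    return sorted(os.path.basename(p) for p in paths)

# Build a scratch directory so the example does not depend on real files.
with tempfile.TemporaryDirectory() as scratch:
    for name in ["notes.pdf", "slides.pdf", "todo.txt"]:
        open(os.path.join(scratch, name), "w").close()
    expanded = expand_like_shell("*.pdf", scratch)
```

Note that wildcards like *.pdf are not regular expressions: in a wildcard, * by itself stands for any run of characters, where a regexp would write .* instead.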
A “|” means “or”.
So Eric|Erik|Eriq can be useful, if I'm searching for any of those variants in a file.
You can use ? to mean “optional” (even though it's just shorthand for using |, similar to how + can be viewed as syntactic sugar involving *).
For example: Erick?. Note that this has very-high precedence, and you often use parentheses along with ?.
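A small sketch of alternation and of the precedence of ?, again in Python. The pattern Eric(son)? is an extra example I made up to show parentheses widening what ? applies to.

```python
import re

# Alternation: any of the three spellings.
variants = re.compile(r"Eric|Erik|Eriq")

# "?" binds tightly: it applies only to the "k", not to all of "Eric".
optional_k = re.compile(r"Erick?")

# Parentheses widen what "?" applies to: here the whole "son" is optional.
optional_son = re.compile(r"Eric(son)?")
```

So Erick? accepts "Eric" and "Erick" but not the empty string, while Eric(son)? accepts "Eric" and "Ericson" but not "Erics".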
Say we want to have a regexp matching strings with hi occurring mid-word, in between any two other (lower-case) vowels. We can do this with what we've already seen: (a|e|i|o|u)hi(a|e|i|o|u)
Conceptually the same, what about “hi” embedded between any two other lower-case letters? (a|b|c|…|z)hi(a|b|c|…|z) Oh, brother!
It turns out, it's pretty common that you want to have an “|” of many single letters. So, a special feature was added to regexps: character sets, between [ and ]: [aeiou]hi[aeiou].
From here, it's a small jump to allow entire ranges, as a shorthand: [a-z]hi[a-z]. Here's trying to match “hi” between a possibly-uppercase letter in front, and a trailing letter-or-exclamation-point-or-prime-digit after: [A-Za-z]hi[235a-z!7] (which matches This. and chi5 and ahi!).
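The character-set patterns above can be tried out like this (a Python sketch; the extra samples "chi4", "hi!", "ohio" and "chia" are my own additions, chosen to show non-matches as well).

```python
import re

vowel_hi = re.compile(r"[aeiou]hi[aeiou]")
mixed = re.compile(r"[A-Za-z]hi[235a-z!7]")

# re.search asks whether the text *contains* a match.
samples = ["This.", "chi5", "ahi!", "chi4", "hi!"]
hits = [s for s in samples if mixed.search(s)]

in_ohio = vowel_hi.search("ohio")   # "o", "hi", "o"
in_chia = vowel_hi.search("chia")   # "c" is not a vowel
```

"chi4" fails because 4 is not in the trailing set, and "hi!" fails because nothing precedes the "hi".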
essentials: The two ascii codes every CS person should have on the tip of their tongue are those for 'A' and ' ' (space). Knowing '0' is nice too; I don't find I need much beyond that. (Well, the hex ascii-codes for space and / come up a lot in encoded URLs, so they're good to know.)
Unwieldy video from distance lecture, 2017-feb-21 (1h35m)
but you can watch just [0:24:40,1:33:00) (1h08m) about regular-expressions.
(You can skip over the first 25min (opening pause [0:00:00,0:00:30),
some ERD review [0:00:30,0:04:30),
git-review [0:04:30,0:22:00),
and a few comments about projects-with-external-clients [0:22:00,0:24:40)),
as well as the final few “shutting down” minutes at [1:33:00,1:35:00).)
Regular expressions are not the same type as strings.
Various languages have slightly different ways to represent regexps.
video (14m23s)
For example:
The thing to be careful about is regexps that contain a backslash:
To get the pattern a\sb,
your Java would use the string-literal "a\\sb".
This sometimes leads to backslash hell:
If a regexp wants to match a backslash,
the regexp will contain \\,
and the java-string-to-represent-that-regexp would be written with \\\\.
Even where we don’t have this situation,
regexps that use a lot of backslashes end up looking even less readable as Java strings.
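The same doubling shows up in any language whose ordinary string literals use backslash as an escape character. Here is a sketch in Python rather than Java (an ordinary, non-raw Python literal treats \\ the same way a Java literal does); the variable names are just for illustration.

```python
import re

# Four typed backslashes in an ordinary literal produce a 2-character
# string: the regexp \\ .
regexp_text = "\\\\"
regexp_length = len(regexp_text)

# That 2-character regexp matches exactly one literal backslash.
one_backslash = "\\"            # a 1-character string
match_one = re.fullmatch(regexp_text, one_backslash)

# It does not match two backslashes in a row.
match_two = re.fullmatch(regexp_text, "\\\\")
```

There are two layers of escaping here: the string literal halves the backslashes once, and the regexp engine halves them again.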
This solution is nice, because it emphasizes the difference between strings and regexps at the type level (which we like), without needing convenience functions which immediately obscure the difference and erode the helpfulness of the type-checker.
Note that internally, there is still a class RegExp, which these literals are instances of. In addition, Javascript also allows for modifying the meaning of a regexp by adding one of a few possible suffixes; e.g. /a\sb/i is a case-insensitive regexp.
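For comparison, here is a sketch of the same two ideas in Python, which has no regexp literals: a compiled pattern is still its own type rather than a string, and the modifier is passed as a flag argument instead of a suffix. This assumes a reasonably recent Python 3, where the pattern type is exposed as re.Pattern.

```python
import re

pattern = re.compile(r"a\sb", re.IGNORECASE)

# The compiled pattern is its own type, not a str.
is_string = isinstance(pattern, str)
is_pattern = isinstance(pattern, re.Pattern)

# With IGNORECASE, upper- and lower-case letters match each other.
hit = pattern.fullmatch("A B")
```

The flag re.IGNORECASE plays the role of the trailing i in /a\sb/i.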
python: There aren’t regexp-literals built in, but even better: python has “raw-string” literals, which can be used in many situations, including being terribly convenient for writing regular expressions: the raw-string-literal r'a\sb' holds the pattern a\sb exactly as typed, with no doubled backslashes.
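A minimal sketch of what a raw-string literal buys you, using the running pattern a\sb. The variable names are illustrative only.

```python
import re

raw = r"a\sb"        # raw literal: the backslash is kept as typed
escaped = "a\\sb"    # ordinary literal: the backslash must be doubled

same_text = (raw == escaped)
length = len(raw)    # four characters: a, backslash, s, b

# \s matches any whitespace character, such as the tab between a and b.
hit = re.fullmatch(raw, "a\tb")
```

Both literals produce the identical 4-character string; the raw form simply reads the way the regexp itself reads.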
PHP: Since PHP does have single-quotes for raw-string-literals, you’d think it’d be cool like python. But not quite (of course): For PHP, the regular-expression libraries expect strings which begin and end with “/”! So we’d use '/a\sb/' (or equivalently, "/a\\sb/").
They have the fine idea of raw-strings from python, but (for regexps) they then tack on a superfluous javascript-but-not requirement that detracts from its usefulness. phpsadness.com.
PHP regular-expression matching: In php, regular expressions are strings delimited by a special character (usually /). Some examples; see playing with regexps in PHP for details.
    echo preg_match( '/abcd/', 'abcd' );
    echo preg_match( '/abcd/', 'azcd' );        // false
    echo preg_match( '/a..d/', 'azcd' );        // . matches any single character (besides newline, null)
    echo preg_match( '/ab*cd/', 'abbbbcd' );    // b* matches 0-or-more-b’s.
    echo preg_match( '/ab*cd/', 'abcd' );
    echo preg_match( '/ab*cd/', 'acd' );
    echo preg_match( '/ab+cd/', 'abbbbcd' );    // b+ matches 1-or-more-b’s.
    echo preg_match( '/ab+cd/', 'abcd' );
    echo preg_match( '/ab+cd/', 'acd' );        // false
    echo preg_match( '/ab.*cd/', 'abcd' );
    echo preg_match( '/ab.*cd/', 'abXd' );      // false (there is no "cd"); .* and .+ are common patterns.
    echo preg_match( '/ab.*cd/', 'abBlahBlahBlahcd' );
    echo preg_match( '/ab.+cd/', 'abcd' );      // false
    echo preg_match( '/ab.+cd/', 'abXcd' );
    echo preg_match( '/ab?cd/', 'acd' );        // b? matches 0-or-1 b
    echo preg_match( '/ab?cd/', 'abcd' );
    echo preg_match( '/ab?cd/', 'abbbbcd' );    // false
    echo preg_match( '/(ab)*cd/', 'ababababcd' );  // parens do grouping
    echo preg_match( '/(ab)*cd/', 'abbbbcd' );     // true! zero copies of "ab", and the string contains "cd"

    // WARNING: preg_match looks to see if the string *contains* a match!
    echo preg_match( '/row/', 'How now, brown cow?' );  // true
    echo preg_match( '/.ow/', 'How now, brown cow?' );  // true

    // Use "^" to specify the start-of-string, and "$" to specify end-of-string.
    echo preg_match( '/^.ow$/', 'How now, brown cow?' );  // false
    echo preg_match( '/^.ow$/', 'Zow' );
    echo preg_match( '/^.ow/', 'Zowee' );
    echo preg_match( '/w.e$/', 'Wowee Zowee' );

    echo preg_match( '/[WZ]ow/', 'Wowee Zowee' );  // square-brackets match any one character from the set
    echo preg_match( '/[WZ]ow/', 'Yow' );          // false
    echo preg_match( '/[W-Z]ow/', 'Yowee' );       // square-brackets can contain a *range*
    echo preg_match( '/ab[0-9]+de/', 'ab789de' );
    echo preg_match( '/[0-9]*/', '00047' );        // Beware: matching just a * expression (w/o ^,$)!
Your task: What is a regular expression to match…
False positive, and false negative:
is test accurate? | test result: negative | test result: positive |
---|---|---|
false | false negative | false positive |
true | true negative | true positive |
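To connect the table to regexps, here is a sketch in Python with two deliberately flawed "is this a whole number?" tests. The function names and the sample inputs are made up for illustration.

```python
import re

def naive_is_number(text):
    """Too lenient: true if the text merely *contains* a run of digits."""
    return re.search(r"[0-9]+", text) is not None

def strict_is_number(text):
    """Too strict: forbids a leading zero, so it rejects "0" as well."""
    return re.fullmatch(r"[1-9][0-9]*", text) is not None

def classify(test_says_yes, truly_yes):
    """Name the outcome of one test, using the table's four categories."""
    accuracy = "true" if test_says_yes == truly_yes else "false"
    result = "positive" if test_says_yes else "negative"
    return f"{accuracy} {result}"

# (input, whether it really is a whole number)
cases = [("47", True), ("x47", False), ("forty", False)]
naive_outcomes = [classify(naive_is_number(t), truth) for t, truth in cases]

strict_on_zero = classify(strict_is_number("0"), True)
```

A pattern that is too lenient produces false positives; one that is too strict produces false negatives. Test cases for a regexp should probe both directions.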
Warning: Beware matching a top-level * expression: the empty-string matches it, and any string contains the empty-string! Thus preg_match_all( "/(xyz)*/", "uh-oh" ) === 6!!, since "uh-oh" has zero "xyz"'s at the start, followed by 'u', followed by zero more "xyz"'s, followed by 'h', followed by ….
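The same surprise can be reproduced outside PHP. A sketch using Python's re.findall in place of preg_match_all:

```python
import re

# (?:...) groups without capturing, so findall returns the whole matches.
empties = re.findall(r"(?:xyz)*", "uh-oh")
count = len(empties)

# Requiring at least one repetition removes the empty matches.
non_empty = re.findall(r"(?:xyz)+", "uh-oh")
```

Every one of the matches found by the * version is the empty string: one before each of the five characters, and one at the very end.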
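bug?: preg_match_all( '/\p{N}/', "⁴٤𝟜4" ) is returning 1 for me (not 4), in php 5.4.3. (Probably not a bug: without the u modifier, PHP treats the pattern and subject as bytes rather than as UTF-8, so only the ASCII 4 matches; '/\p{N}/u' should return 4.)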
Checking for space (one of the most common situations) is the one that is hardest: We have \p{Z}, as well as [[:space:]] and \s. The first, unicode-property version checks for separating space characters (thin-space, etc), but doesn’t catch tabs (or newlines). On the other hand, \s does catch tabs, newlines etc., but doesn’t catch thin-space or nonbreaking-space or other space-looking characters.
Also, we are probably least able to ignore this problem, because in web forms people might paste in text from web pages or Word documents, and those expert-typesetting programs do tend to use the various special-spaces.
Solution: Either
    $str2 = preg_replace( '/\p{Z}+/u', ' ', $str );  // u modifier: treat $str as UTF-8 (assumes it is valid UTF-8)
    …
    if( preg_match( '/\s+/', $str2 ) ) …
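Here is the same normalize-then-match idea sketched in Python, with two caveats: Python's built-in re module does not support \p{Z}, so this version looks up each character's Unicode category instead, and Python's \s already matches Unicode whitespace by default, so re.ASCII is used to imitate the narrower \s described above. The name normalize_spaces and the sample string are made up for illustration.

```python
import re
import unicodedata

def normalize_spaces(text):
    """Replace each run of Unicode separator characters (category Z*) with one space."""
    spaced = "".join(
        " " if unicodedata.category(ch).startswith("Z") else ch
        for ch in text
    )
    return re.sub(r" +", " ", spaced)

# re.ASCII narrows \s to the ASCII whitespace characters only.
ascii_space = re.compile(r"\s+", re.ASCII)

pasted = "total:\u00a0\u200942"      # no-break space, then thin space
before = ascii_space.search(pasted)
cleaned = normalize_spaces(pasted)
after = ascii_space.search(cleaned)
```

Tabs and newlines are not in the Z categories, so they pass through normalize_spaces untouched and are still caught by \s afterwards.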
NOTE:
In php, no trailing “g” allowed, as in javascript!
(There are other trailing modifiers however —
e.g. i for case-insensitive,
and more.)
Warning: preg_match_all("/x/","abc",$results) will always have $results be non-empty: it will be an array containing an empty array. (In particular, the condition of if ($results) … will always be true.) You can instead use preg_match_all("/x/","abc",$results,PREG_SET_ORDER) which will have zero elements for zero matches (and if ($results) … will behave as you want). Each element of $results will still be another entire array though: the full match plus any sub-patterns.
This page licensed CC-BY 4.0 Ian Barland Page last generated | Please mail any suggestions (incl. typos, broken links) to ibarlandradford.edu |