LING 388: Language and Computers Sandiway Fong 9/20 Lecture 8.


1 LING 388: Language and Computers Sandiway Fong 9/20 Lecture 8

2 Administrivia Homework 3 –due tonight at midnight

3 Today’s Topic Regular Expressions (RE) –used for searching text (information extraction applications and text processing) –an (industry) standard notation for specifying a search pattern

4 Exercise (Ungraded homework exercise) Write a Prolog program to enumerate the integer line Where would you start? i.e. a program that would print out all and only the numbers on the integer line (given enough time…)

5 Exercise Program
    nn(1).
    nn(N) :- nn(M), N is M+1.
    int(0).
    int(N) :- nn(M), (N = M ; N is -M).
Output:
    ?- int(X).
    X = 0 ; X = 1 ; X = -1 ; X = 2 ; X = -2 ; X = 3 ; X = -3 ; X = 4 ; X = -4 ; X = 5 ; X = -5 ; X = 6
Note: the predicate is named int/1 because integer/1 is already taken in SWI-Prolog.
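For comparison only, here is a minimal Python sketch of the same enumeration order (0, then each positive number followed by its negation). The function name `integers` is an illustrative choice, not part of the lecture; the generator plays the role that backtracking plays in the Prolog version.

```python
from itertools import count, islice

def integers():
    """Yield 0, 1, -1, 2, -2, ... without ever terminating."""
    yield 0
    for m in count(1):      # count(1) plays the role of nn/1: 1, 2, 3, ...
        yield m             # corresponds to N = M
        yield -m            # corresponds to N is -M

# Take only a finite prefix of the infinite enumeration.
first_seven = list(islice(integers(), 7))
print(first_seven)
```

Every integer appears after finitely many steps, which is what "all and only the numbers on the integer line" requires.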

6 Today’s Topic Regular Expressions (RE) –used for searching text (information extraction applications and text processing) –an (industry) standard notation for specifying a search pattern

7 Regular Expressions (formally) equivalent to –finite state automata (FSA), and –regular grammars. Used in –string pattern matching, typically for a single word form –searching text: Unix (e)grep, Perl, Microsoft Word. Caution: –differences in notation and implementation. [Diagram: Regular Grammars, FSA, Regular Expressions]

8 Regular Expressions shorthand for describing sets of strings String –sequence of zero or more characters –(typically, unbroken by spaces) Examples –aaa –john –mary45 –NT$ –ε (empty string)

9 Regular Expressions –shorthand: stringⁿ –exactly n occurrences of string –n = 0,1,2,3,... examples –a⁴b³ = aaaabbb –(uv)² = uvuv –((ab)²(ba)²)² = ababbabaababbaba Note: –parentheses are used to group sequences of characters (strings)
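The n-fold repetition shorthand can be checked mechanically. This is a small sketch, assuming Python's built-in string repetition as a stand-in for the exponent notation; the helper name `power` is hypothetical. The last line uses the `{n}` regex operator, which expresses the same idea in grep-style syntax.

```python
import re

def power(s, n):
    """Return s repeated exactly n times (n = 0 gives the empty string)."""
    return s * n

a4b3 = power("a", 4) + power("b", 3)
uv2 = power("uv", 2)
nested = power(power("ab", 2) + power("ba", 2), 2)

print(a4b3, uv2, nested)

# The regex a{4}b{3} describes exactly the one string built above.
exact = re.fullmatch(r"a{4}b{3}", a4b3) is not None
```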

10 Regular Expressions shorthand for describing sets of strings string+ –set of one or more occurrences of string –i.e. the set {string¹, string², string³, ...} –Note: set is infinite examples –a+ = {a, aa, aaa, aaaa, aaaaa, …} –(abc)+ = {abc, abcabc, abcabcabc, …}

11 Regular Expressions shorthand for describing sets of strings string* –set of zero or more occurrences of string –i.e. the set {string⁰, string¹, string², string³, ...} –string⁰ = ε (the empty string) Language = a set of strings examples –a* = {ε, a, aa, aaa, aaaa, …} –(abc)* = {ε, abc, abcabc, …} Note: –a a* = a+ –a {ε, a, aa, aaa, aaaa, …} = {a, aa, aaa, aaaa, aaaaa, …}
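Membership in these infinite sets can be tested one string at a time. The sketch below assumes Python's `re.fullmatch`, which succeeds only if the whole string matches; the helper name `in_language` is illustrative. It also spot-checks the identity a a* = a+ on short strings, which is a sanity check rather than a proof.

```python
import re

def in_language(pattern, s):
    """True if the entire string s belongs to the set described by pattern."""
    return re.fullmatch(pattern, s) is not None

candidates = ["", "abc", "abcabc", "ab"]
plus_results = [in_language(r"(abc)+", s) for s in candidates]
star_results = [in_language(r"(abc)*", s) for s in candidates]

# Only the empty string distinguishes + (one or more) from * (zero or more).
print(plus_results, star_results)

# Compare aa* against a+ on the strings "", "a", "aa", ..., "aaaaa".
same = all(in_language(r"aa*", "a" * n) == in_language(r"a+", "a" * n)
           for n in range(6))
```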

12 Regular Expressions Wildcard Characters: match a range of characters. The . (period) matches any single character. examples –.+ed = set of all strings of length 3 or greater containing ed and having at least one character preceding it. Matches: worked, bed, pre-education. Non-matches: ed, education –.*fix = set of all strings of length 3 or greater containing fix. Matches: prefix, infix, infixed, suffix, fix
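The two wildcard examples can be tried on the word lists from the slide. This sketch assumes Python's `re.search`, which looks for the pattern anywhere inside the word, matching the slide's "containing" wording.

```python
import re

ed_words = ["worked", "bed", "pre-education", "ed", "education"]
fix_words = ["prefix", "infix", "infixed", "suffix", "fix"]

# .+ed needs at least one character before "ed".
ed_matches = [w for w in ed_words if re.search(r".+ed", w)]

# .*fix allows zero characters before "fix", so bare "fix" qualifies.
fix_matches = [w for w in fix_words if re.search(r".*fix", w)]

print(ed_matches)
print(fix_matches)
```

"ed" and "education" fail because nothing precedes their only occurrence of "ed".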

13 Regular Expressions Wildcard Characters: match a range of characters. [characters] (list of matching characters) matches any single character in the list. examples –[sz]ation matches organization, organisation (the listed characters need no separator; in grep, [s,z] would also match a comma) –[a-z] any character in the range lowercase a to z. Note: not uppercase –[0-9] any digit. ASCII (American Standard Code for Information Interchange) chart: computers only understand numbers, so each character has a numeric code and a range such as a-z is a range of codes.
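A short sketch of bracket expressions, assuming Python's `re` module (other tools may differ in detail). The word "organication" is a made-up non-match. The `ord` calls show the ASCII codes that make ranges such as a-z meaningful.

```python
import re

words = ["organization", "organisation", "organication"]
spelling = [w for w in words if re.search(r"[sz]ation", w)]

lower = re.findall(r"[a-z]", "aB3z")      # lowercase letters only
digits = re.findall(r"[0-9]", "mary45")   # digits only

# A comma inside the brackets is just another listed character.
comma_matches = re.search(r"[s,z]ation", ",ation") is not None

# Ranges follow character codes: a..z are 97..122, A is 65, 0 is 48.
codes = (ord("a"), ord("z"), ord("A"), ord("0"))
print(spelling, lower, digits, comma_matches, codes)
```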

14 Regular Expressions One of the most popular programs for searching files and returning lines that match a regular expression pattern is called grep –name comes from Unix ed command g/re/p –“search globally for lines matching the regular expression, and print them” –[Source: http://en.wikipedia.org/wiki/Grep] –Most programming languages, e.g. C, C++, Java (initially) etc., don’t come with regular expression search as standard… –However (later) programming languages, e.g. Perl, have standardized on grep’s syntax and expanded on its functionality. –(Java has java.util.regex. Python has a re module.)

15 Regular Expressions: grep excerpts from the grep manpage –The caret ^ and the dollar sign $ are metacharacters that respectively match the empty string at the beginning and end of a line. –The symbol \b matches the empty string at the edge of a word –The symbols \< and \> respectively match the empty string at the beginning and end of a word. terminology –word: unbroken sequence of digits, underscores and letters
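The anchors can be tried on a few sample lines. This is a sketch in Python rather than grep; Python's `re` module supports ^, $ and \b but does not provide grep's separate beginning-of-word and end-of-word operators, so \b is used on both sides of the word. The sample lines are invented.

```python
import re

lines = ["the cat sat", "cat", "concatenate", "a cat"]

# ^cat: "cat" must start the line.
starts = [l for l in lines if re.search(r"^cat", l)]

# cat$: "cat" must end the line.
ends = [l for l in lines if re.search(r"cat$", l)]

# \bcat\b: "cat" must be a whole word, not part of a longer word.
whole_word = [l for l in lines if re.search(r"\bcat\b", l)]

print(starts, ends, whole_word)
```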

16 Regular Expressions: grep Excerpts from the manpage –A regular expression may be followed by one of several repetition operators:
    ?      The preceding item is optional and matched at most once.
    *      The preceding item will be matched zero or more times.
    +      The preceding item will be matched one or more times.
    {n}    The preceding item is matched exactly n times.
    {n,}   The preceding item is matched n or more times.
    {n,m}  The preceding item is matched at least n times, but not more than m times.
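The counted repetition operators are easy to confuse, so here is a small sketch that tabulates them, assuming Python's `re` module (which shares this notation with egrep). The helper `full` and the test strings are illustrative.

```python
import re

def full(pattern, s):
    """True if pattern matches the whole of s."""
    return re.fullmatch(pattern, s) is not None

# For n = 1..4, test the string of n a's against a{3}, a{2,} and a{2,3}.
counts = {n: (full(r"a{3}", "a" * n),
              full(r"a{2,}", "a" * n),
              full(r"a{2,3}", "a" * n))
          for n in range(1, 5)}

# ? makes the preceding item optional: at most one "u".
optional = [full(r"colou?r", w) for w in ["color", "colour", "colouur"]]

for n, row in counts.items():
    print(n, row)
print(optional)
```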

17 Regular Expressions: GNU grep concatenation –Two regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating two substrings that respectively match the concatenated subexpressions. disjunction – Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression. Excerpts from the manpage

18 Regular Expressions: Examples Regular Expression –gupp(y|ies) examples –guppy –guppies Regular Expression –beds? examples –bed –beds
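Both patterns can be verified directly. A minimal sketch, assuming Python's `re` module; the non-matching words ("guppys", "gupp", "bedss") are added here for contrast and are not from the slide.

```python
import re

guppy = re.compile(r"gupp(y|ies)")   # concatenation of "gupp" with a disjunction
beds = re.compile(r"beds?")          # final "s" is optional

guppy_results = [bool(guppy.fullmatch(w))
                 for w in ["guppy", "guppies", "guppys", "gupp"]]
beds_results = [bool(beds.fullmatch(w)) for w in ["bed", "beds", "bedss"]]

print(guppy_results, beds_results)
```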

19 Regular Expressions: Examples Example –\b99 matches 99 in “there are 99 bottles …” –but not the 99 in “there are 299 bottles …” –Note: in $99 the $ is not a word character, so there is a word edge between $ and 99, and \b99 will match 99 here –word: unbroken sequence of digits, underscores and letters
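The word-edge behaviour can be checked with a short sketch, assuming Python's `re` module, whose \b uses the same notion of word (letters, digits, underscore). The last test string, "route_99", is an added example showing that an underscore counts as a word character.

```python
import re

pattern = re.compile(r"\b99")

tests = ["there are 99 bottles",
         "there are 299 bottles",
         "it costs $99",
         "route_99"]

# True where a 99 begins at a word edge somewhere in the string.
hits = [bool(pattern.search(t)) for t in tests]
print(hits)
```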

20 Regular Expressions: Examples Example (sheeptalk) –baa! –baaa! –baaaa! –… regular expression –baaa*! –baa+!
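The two sheeptalk patterns should accept exactly the same strings. This sketch, assuming Python's `re` module, compares them on a handful of samples; the rejected samples are added for contrast.

```python
import re

samples = ["ba!", "baa!", "baaa!", "baaaa!", "baa", "bbaa!"]

# baaa*! : b, two a's, then zero or more further a's, then !
star = [bool(re.fullmatch(r"baaa*!", s)) for s in samples]

# baa+!  : b, one a, then one or more further a's, then !
plus = [bool(re.fullmatch(r"baa+!", s)) for s in samples]

print(star)
print(plus)
```

Both require at least two a's, so "ba!" is rejected by each.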

21 Regular Expressions: Microsoft Word terminology: –wildcard search

22 Regular Expressions: Microsoft Word

23 Sample Article From American National Corpus (ANC), Slate Magazine 8/12/1999 Will There Be Life After Greenspan? If you blinked you missed it, but for a short while yesterday morning the stock and bond markets dived after a rumor that Alan Greenspan was resigning hit the Street. The story was quickly... well, it wasn't exactly refuted, since Greenspan didn't say actually say "I'm not resigning," but it was rejected as unlikely, and both markets rebounded nicely. Fleeting as it was, the momentary episode of selling panic was interesting for a couple of reasons. In the first place, the rumor had all the makings of a story that was being floated by someone who had taken a large short position (in other words, who was wagering that the market was going down) and was trying to knock the market down after it opened strongly. There's something weirdly old-fashioned about the idea that a big Wall Street insider could say "Pssst! Hey, buddy, I hear Al's on his way out!" and send stocks tumbling. It fits our ideas of the 1920s, when the market was incredibly manipulable, or even of the 1980s more than it does the late 1990s. But the truth is that in the short run, markets can occasionally be pushed, especially when so many decisions to buy or sell are keyed off what everyone else in the market is doing. Chain reactions are not much harder to start (in fact, given how quickly price moves get noticed, they may be easier) than they were 70 years ago. All that notwithstanding, the interesting thing about the Greenspan resignation rumor was that it raised an obvious question: Would it really matter? As Jacob Weisberg just pointed out in " Ballot Box," Steve Forbes is apparently the only American who doesn't think Greenspan has done a terrific job as Fed chairman. And most of us would be happy to have Greenspan stay in office even after his current term expires in the middle of next year. 
But it's interesting to note that in the past couple of months there have been more than a few voices--including those of economists Greg Mankiw and Robert Barr--suggesting that Greenspan has been more the beneficiary of good economic fundamentals than the creator of them. By James Surowiecki That position may be a bit overstated, particularly since Greenspan has shown an unusual ability to let his thinking on inflation, productivity, and the economy's possible growth rate evolve in response to changing data. But the essential point, that the soundness of this economy does not depend on Greenspan's presence at the head of the Fed, is right. That might not be the case if Greenspan's successor were either an inflation dove like William Greider or a perma-bear like Jim Grant. But whoever would succeed Greenspan would be nothing of the sort. He or she would be, in a word, Greenspanian, still concerned about the possibility of an overheating economy but also convinced that important technological changes have allowed this economy to grow faster than in the past without sparking inflation. If anything, in fact, the bond market should have rallied on news that Greenspan might be stepping down, since he has long since stopped being paranoid enough for bondholders, who seem perpetually convinced that the United States is about to become Brazil. There are certainly Fed governors out there who would be far more likely to raise interest rates aggressively at the first hint of price pressures than Greenspan. The momentary sell-off, though, was not driven by any rational consideration of what Greenspan's departure might mean. Instead, everyone assumes that Greenspan's resignation will knock down the market, so Greenspan's resignation--or rumors of it--knocks down the market. But this is not the summer of 1998 or the fall of 1997. We don't need Greenspan to reassure us that the world isn't going to fall apart anymore. When he leaves, the market will hiccup. 
But it would be surprising if it did more than that. Data file available on class homepage as Article247_300.txt

24 Class Exercise Let’s create a regular expression in Microsoft Word to look for decades in the text (and highlight them) –Example: … the late 1990s.
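As a cross-check outside Microsoft Word, the same search can be sketched with Python's `re` module. The pattern below is one possible answer under the assumption that a decade is three digits, a 0, and an s standing as a whole word; Word's wildcard syntax differs and is not shown. The sample sentences are taken from the article.

```python
import re

text = ("It fits our ideas of the 1920s, when the market was incredibly "
        "manipulable, or even of the 1980s more than it does the late 1990s. "
        "But this is not the summer of 1998 or the fall of 1997.")

# Three digits, then 0, then s, with word edges on both sides.
decade = re.compile(r"\b[0-9]{3}0s\b")

decades = decade.findall(text)
print(decades)
```

Plain years such as 1998 and 1997 are skipped because they lack the trailing s.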

