LING 438/538 Computational Linguistics Sandiway Fong Lecture 21: 11/7
2 Administrivia
Short Lecture Today
Homework 5
–out today
–due next Tuesday
–usual rules
3 Homework 5
4 Britney Spears
Webpage
–http://news.bbc.co.uk/cbbcnews/hi/music/newsid_1953000/1953614.stm
5 from the BBC website
–typos: #2 is wrong
–#3 and #5 shouldn’t be the same
1. Brittany
2. Brittney
3. Britany
4. Britny
5. Briteny
6. Britteny
7. Briney
8. Brittny
=9. Brintey
=9. Britanny
use this list for the homework
http://www.google.com/jobs/britney.html
6 Question 1: Britney Spears
Question 1
–Part 1 (3pts)
Compute the edit distances for the misspellings of Britney (Spears)
use insert=delete=1, substitute=2
–Part 2 (3pts)
Compute the edit distances for the misspellings of Britney (Spears)
use insert=delete=1, substitute=1
–Part 3 (4pts)
Come up with a metric that correctly ranks the top 7 misspellings for either Part 1 or Part 2
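The two cost settings in Parts 1 and 2 differ only in the substitution cost, so a single routine with cost parameters covers both. The following is a minimal Python sketch of the standard dynamic-programming minimum edit distance; the function name `edit_distance` and its parameter names are illustrative choices, not something the slides prescribe.

```python
def edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Minimum edit distance from source to target (dynamic programming)."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = cheapest way to turn source[:i] into target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + del_cost
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,                       # delete
                dist[i][j - 1] + ins_cost,                       # insert
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # copy or substitute
            )
    return dist[-1][-1]


# Part 1 setting (substitute=2) and Part 2 setting (substitute=1)
print(edit_distance("Britney", "Brittany", sub_cost=2))
print(edit_distance("Britney", "Brittany", sub_cost=1))
```

Looping the same call over the misspelling list gives the values asked for in Parts 1 and 2; Part 3 still needs a metric of your own design.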
7 Making Money from Misspellings
Webpage
–http://news.bbc.co.uk/1/hi/sci/tech/1575060.stm
8 Making Money from Misspellings
Excerpts:
–US legal authorities are appealing for help in tracking down John Zuccarini, who they say is making more than a million dollars a year from a collection of misspelled domain names.
–The Federal Trade Commission is now looking for ways to recover the cash Mr Zuccarini has made from the domain names.
–Mr Zuccarini has been practising a novel variation of cybersquatting which usually involves gaining control of a website that you have no real claim to, and then offering it for sale to the rightful owner at a premium.
–The domains registered by Mr Zuccarini were typically misspellings of well-known names. Mr Zuccarini has reportedly registered 15 variations of the spelling of Cartoon Network TV channel, and 41 of pop star Britney Spears.
9 Question 2
10 Corpus
homework corpus
–WSJ9_041.txt
–from the course homepage
Wall Street Journal articles (July 26–28 1989)
this is the text file you will use
contains almost 22,000 lines and 150,000 words
use only the text between the SGML markers
example...
11 Question 2 Sun Microsystems Inc. said it will post a larger-than-expected fourth-quarter loss of as much as $26 million and may show a loss in the current first quarter, raising further troubling questions about the once high-flying computer workstation maker. Sun reported last month that management errors, rather than a weakness in the market for computer workstations, would result in lower earnings or a "slight loss" in the quarter ended June 30. But the amount cited in yesterday's disclosure was far greater than analysts had suspected, and suggested deepening troubles. "It is extremely disconcerting," said Peter Rogers, computer analyst at Robertson Stephens & Co. in San Francisco. "Many of us had been led to believe that most (of the management-systems problems) had been put behind them. It looks like there is another layer." "On the surface, it would lead one to conclude that Sun has at least temporarily completely lost control of its operations," Mr. Rogers added. The maker of high-performance desktop computers now says the loss was probably between $20 million and $26 million, compared with year-ago net income of $25.3 million, or 66 cents a share. The huge fourth-quarter loss will bring year-end earnings to between $55 million and $61 million, or between 72 cents and 78 cents a share, compared with year-ago net of $66.4 million, or 89 cents a share. Sun said it expects to report fourth-quarter revenue of $425 million to $435 million, up 16% to 19% from a year earlier. That would contrast sharply with Sun's third quarter, when revenue surged 92%, and would put full-year revenue in the $1.75 billion to $1.77 billion range, up from $1.05 billion a year earlier.
12 Question 2 Sun said the problems that led to the loss have "largely been resolved," and that it received record bookings in its fourth quarter. Still, Sun said profitability in the current quarter, ending Sept. 30, can't be assured. The company added that a return to profitability will depend on the effectiveness of cost-cutting measures and its ability to obtain parts. A spokeswoman said the company still faces a shortage of certain parts. In June, Sun said operations were disrupted by a change to a new system for getting information to management. The company also cited faulty forecasting of demand, problems in manufacturing new machines and a shortage of certain parts. The spokeswoman reiterated that the company sees strong demand for its products, and believes the market for computer workstations remains healthy. Sun said it has imposed a hiring freeze in all areas except sales and customer service, postponed moving into new facilities and curtailed other expenses. The spokeswoman wouldn't say how much the cost-cutting measures will save. The announcement was made after the market closed. Sun's stock closed at $16.25, up 62.5 cents, in national over-the-counter trading.
13 Question 2 edit out other stuff... WSJ890728-0079 = 890728 890728-0079. Major Deficit @ Signaled by Sun @ Microsystems @ --- @ Firm to Post Quarterly Loss @ As Much as $26 Million; @ Deepening Trouble Seen @ ---- @ By Carrie Dolan @ Staff Reporter of The Wall Street Journal 07/28/89 WALL STREET JOURNAL (J) SUNW COMPUTERS AND INFORMATION TECHNOLOGY (CPR) MOUNTAIN VIEW, Calif.
14 (Ngram Statistics Package) NSP
for homework question 2
–suggest you use the NSP software package,
–brew your own, or
–any other package you want to use...
(Ngram Statistics Package) NSP
–Ted Pedersen’s Perl-based Ngram Statistics Package (NSP)
–http://www.d.umn.edu/~tpederse/nsp.html
–you need to install a free Perl on your system if not already available
e.g. ActiveState Perl
15 (Ngram Statistics Package) NSP
–you only need to use the Perl program file
–count.pl
NSP on Windows
–command line options
    perl count.pl --help
    Usage: count.pl [OPTIONS] DESTINATION SOURCE [[, SOURCE]...]
    Counts up the frequency of all n-grams occurring in SOURCE. Sends to DESTINATION the list of n-grams found, along with the frequencies of combinations of the n tokens that the n-gram is composed of. If SOURCE is a directory, all text files in it are counted.
    OPTIONS:
    --ngram N    Creates n-grams of N tokens each. N = 2 by default.
    --newLine    Prevents n-grams from spanning across the new-line character.
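For those who prefer to brew their own counter rather than install Perl, here is a rough Python sketch of n-gram counting. It is not a reimplementation of count.pl: the function name `count_ngrams` is hypothetical, tokens are split on whitespace only (which will likely differ from NSP's own tokenization), and n-grams never span a line break, which loosely mirrors what the `--newLine` option describes.

```python
from collections import Counter


def count_ngrams(lines, n=2):
    """Count n-grams of whitespace-separated tokens, one line at a time."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        # slide a window of n tokens across the line; never cross a line break
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


sample = [
    "Sun said the problems have largely been resolved",
    "Sun said profitability can't be assured",
]
for ngram, freq in count_ngrams(sample, n=2).most_common(3):
    print(" ".join(ngram), freq)
```

Passing an open file object for the corpus in place of `sample` works the same way, since iterating over a file yields its lines.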
16 Question 2
(4pts)
–List the most frequent closed-class word (for each class) in the corpus
use the definition of closed-classes (and your judgment) listed in section 8.1 of the textbook
(2pts)
–What is the most frequent proper noun?
–What is the most frequent (non-auxiliary) verb?
17 Question 2
(8pts)
compute the probability of the (similar) sentences
–Bristol-Myers agreed to merge with Sun Microsystems
–Bristol-Myers and Sun Microsystems agreed to merge
using both the bigram and trigram approximations
use add-one smoothing where relevant
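As a rough illustration of the bigram case, the Python sketch below estimates each factor with add-one smoothing as (count(prev word) + 1) / (count(prev) + V), where V is the vocabulary size. Several things here are assumptions rather than part of the assignment: the function names, the two-sentence toy corpus, whitespace tokenization, a START token prepended to every sentence, and counting START as part of the vocabulary. The trigram version is analogous, conditioning on the two previous tokens.

```python
from collections import Counter


def train(sentences):
    """Collect unigram and bigram counts, with START prepended to each sentence."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["START"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(sentence, unigrams, bigrams):
    """Bigram approximation of p(sentence) with add-one smoothing."""
    vocab_size = len(unigrams)  # assumption: START counts as a vocabulary item
    tokens = ["START"] + sentence.split()
    prob = 1.0  # p(START) is taken to be 1
    for prev, word in zip(tokens, tokens[1:]):
        prob *= (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return prob


# toy corpus for illustration only; the homework uses WSJ9_041.txt
toy_unigrams, toy_bigrams = train(["Sun agreed to merge", "Sun said it agreed"])
print(bigram_prob("Sun agreed", toy_unigrams, toy_bigrams))
```

On a corpus the size of the homework file the product of many small factors gets tiny, so summing log probabilities instead of multiplying is a common alternative.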
18 Question 2
Note
–given the chain rule
$p(w_1 w_2 w_3 \ldots w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1 w_2) \cdots p(w_n \mid w_1 \ldots w_{n-2} w_{n-1})$
–what is $w_1$?
–if we’re talking about a sentence, $w_1$ = START
–Example: a sentence beginning with Sun... (see the file excerpt at the end of this slide)
p(Sun|START)
assume p(START) = 1
Note
–Pedersen’s program does not take START into account
–you’ll have to calculate this separately or modify the corpus before running NSP...
–sentence start symbol = START
–file:
START Sun Microsystems Inc. said it will post a larger-than-expected fourth-quarter loss of as much as $26 million and may show a loss in the current first quarter, raising further troubling questions about the once high-flying computer workstation maker.
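One way to modify the corpus before counting is to prefix each sentence with the start symbol. The Python sketch below does this under a strong assumption: that the text has already been split so that each line holds exactly one sentence, which is not guaranteed for the raw corpus file. The function name `add_start_symbols` is illustrative.

```python
def add_start_symbols(lines, start="START"):
    """Prefix every non-empty line with the start symbol.

    Assumes one sentence per line; blank lines are dropped.
    """
    return [f"{start} {line.strip()}" for line in lines if line.strip()]


sentences = [
    "Sun said the problems have largely been resolved.",
    "",
    "The announcement was made after the market closed.",
]
for line in add_start_symbols(sentences):
    print(line)
```

With the corpus rewritten this way, counts for pairs such as START followed by Sun come out of the ordinary bigram counts, so p(Sun|START) needs no separate calculation.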
19 Summary
for both 438/538
–Question 1: 10pts
–Question 2: 14pts
–Total: 24pts