1 Information Extraction (Several slides based on those by Ray Mooney, Cohen/McCallum (via Dan Weld’s class)) Make-up Class: Tomorrow (Wed) 10:30-11:45AM, BY 210 (next to the advising office)

2 Intended Use of Semantic Web? Pages should be annotated with RDF triples, with links to an RDF-S (or OWL) background ontology. E.g. see Jim Hendler’s page…

3 Database vs. Semantic Web Inference (and the Magellan Story) Also templated extraction as undoing XML → HTML conversion. Templated extraction is by DOM patterns; unstructured extraction is (sort of) by grammar parse-tree patterns. Grammar learning is mostly from positive (+ve) examples. To be added. Rinku Patel

4 Who will annotate the data? Semantic web works if the users annotate their pages using some existing ontology (or their own ontology, but with mapping to other ontologies) –But users typically do not conform to standards.. and are not patient enough for delayed gratification… Three Solutions –1. Intercede in the way pages are created (act as if you are helping them write web pages) What if we change MS FrontPage/Claris Home Page so that they (slyly) add annotations? E.g. the Mangrove project at U. Wash. –Help users in tagging their data (allow graphical editing) –Provide instant gratification by running services that use the tags. –2. Collaborative tagging! “Folksonomies” (look at the Wikipedia article) –Flickr, Technorati, del.icio.us etc.; CBIOC, ESP game etc. –Need to incentivize users to do the annotations.. –3. Automated information extraction (next topic)

5 Folksonomies—The good Bottom-up approach to taxonomies/ontologies –[In systems like] Furl, Flickr and Del.icio.us... people classify their pictures/bookmarks/web pages with tags (e.g. wedding), and then the most popular tags float to the top (e.g. Flickr's tags or Del.icio.us on the right).... –[F]olksonomies can work well for certain kinds of information because they offer a small reward for using one of the popular categories (such as your photo appearing on a popular page). People who enjoy the social aspects of the system will gravitate to popular categories while still having the freedom to keep their own lists of tags.

6 Works best when Many people Tag the same Info…

7 Folksonomies… the bad On the other hand, it is not hard to see a few reasons why a folksonomy would be less than ideal in a lot of cases: –None of the current implementations have synonym control (e.g. "selfportrait" and "me" are distinct Flickr tags, as are "mac" and "macintosh" on Del.icio.us). –Also, there's a certain lack of precision involved in using simple one-word tags--like which Lance are we talking about? –And, of course, there's no hierarchy and the content types (bookmarks, photos) are fairly simple. For indexing and library people, folksonomies are about as appealing as Wikipedia is to encyclopedia editors. –But.. there's some interesting stuff happening around them.

8 Mass Collaboration (& Mice running the Earth) The quality of the tags generated through folksonomies is notoriously hard to control –So, design mechanisms that ensure correctness of tags.. The ESP game makes it fun to tag; CBIOC and Google Co-op restrict annotation privileges to trusted users.. It is hard to get people to tag things in which they don’t have personal interest.. –Find incentive structures.. ESP makes it a “game” with points; CBIOC and Google Co-op try to promise delayed gratification in terms of improved search later..

10 Information Extraction (IE) Identify specific pieces of information (data) in an unstructured or semi-structured textual document. Transform unstructured information in a corpus of documents or web pages into a structured database. Applied to different types of text: –Newspaper articles –Web pages –Scientific articles –Newsgroup messages –Classified ads –Medical notes –Wikipedia (info boxes)..

11 Information Extraction vs. NLP? Information extraction is attempting to find some of the structure and meaning in the hopefully template driven web pages. As IE becomes more ambitious and text becomes more free form, then ultimately we have IE becoming equal to NLP. Web does give one particular boost to NLP –Massive corpora..

12 MUC DARPA funded significant efforts in IE in the early to mid 1990’s. Message Understanding Conference (MUC) was an annual event/competition where results were presented. Focused on extracting information from news articles: –Terrorist events –Industrial joint ventures –Company management changes Information extraction of particular interest to the intelligence community (CIA, NSA).

13 Other Applications Job postings: –Newsgroups: Rapier from austin.jobs –Web pages: Flipdog Job resumes: –BurningGlass –Mohomine Seminar announcements Company information from the web Continuing education course info from the web University information from the web Apartment rental ads Molecular biology information from MEDLINE

14 Wikipedia Infoboxes.. Wikipedia has both unstructured text and structured info boxes.. Infobox

15 Sample Job Posting Subject: US-TN-SOFTWARE PROGRAMMER Date: 17 Nov 1996 17:37:29 GMT Organization: Reference.Com Posting Service Message-ID: SOFTWARE PROGRAMMER Position available for Software Programmer experienced in generating software for PC-Based Voice Mail systems. Experienced in C Programming. Must be familiar with communicating with and controlling voice cards; preferable Dialogic, however, experience with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more experience with PC Based Voice Mail, but will consider as little as 2 years. Need to find a Senior level person who can come on board and pick up code with very little training. Present Operating System is DOS. May go to OS-2 or UNIX in future. Please reply to: Kim Anderson AdNET (901) 458-2888 fax kimander@memphisonline.com

16 Extracted Job Template computer_science_job
id: 56nigp$mrs@bilbo.reference.com
title: SOFTWARE PROGRAMMER
salary:
company:
recruiter:
state: TN
city:
country: US
language: C
platform: PC \ DOS \ OS-2 \ UNIX
application:
area: Voice Mail
req_years_experience: 2
desired_years_experience: 5
req_degree:
desired_degree:
post_date: 17 Nov 1996

17 Amazon Book Description …. The Age of Spiritual Machines : When Computers Exceed Human Intelligence by <a href="/exec/obidos/search-handle-url/index=books&field-author= Kurzweil%2C%20Ray/002-6235079-4593641"> Ray Kurzweil <img src="http://images.amazon.com/images/P/0140282025.01.MZZZZZZZ.gif" width=90 height=140 align=left border=0> List Price: $14.95 Our Price: $11.96 You Save: $2.99 (20%) …

18 Extracted Book Template Title: The Age of Spiritual Machines : When Computers Exceed Human Intelligence Author: Ray Kurzweil List-Price: $14.95 Price: $11.96 :

19 Extraction from Templated Text Many web pages are generated automatically from an underlying database. Therefore, the HTML structure of pages is fairly specific and regular (semi-structured). However, output is intended for human consumption, not machine interpretation. An IE system for such generated pages allows the web site to be viewed as a structured database. An extractor for a semi-structured web site is sometimes referred to as a wrapper. Process of extracting from such pages is sometimes referred to as screen scraping.

20 Templated Extraction using DOM Trees Web extraction may be aided by first parsing web pages into DOM trees. Extraction patterns can then be specified as paths from the root of the DOM tree to the node containing the text to extract. May still need regex patterns to identify proper portion of the final CharacterData node.

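21 Sample DOM Tree Extraction
[Figure: DOM tree. Element nodes: HTML at the root, with HEADER and BODY below it; BODY contains B and FONT; FONT contains A. Character-Data nodes: “Age of Spiritual Machines” under B, “by” under FONT, “Ray Kurzweil” under A.]
Title: HTML → BODY → B → CharacterData
Author: HTML → BODY → FONT → A → CharacterData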
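The two root-to-node paths above can be tried out with a small sketch. It uses Python's standard `html.parser` rather than a full DOM library, and the `PathExtractor` class, the `extract` helper and the `PAGE` string are illustrative assumptions, not part of the original slides. It also assumes reasonably well-formed HTML without void elements such as `<img>`.

```python
from html.parser import HTMLParser


class PathExtractor(HTMLParser):
    """Collect character data whose root-to-node element path equals a target path."""

    def __init__(self, target_path):
        super().__init__()
        self.target = [t.lower() for t in target_path]
        self.stack = []    # currently open elements, root first
        self.matches = []  # text found exactly at the target path

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag (tolerates unclosed inner tags).
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack == self.target:
            self.matches.append(text)


def extract(html, path):
    """Return all text nodes located at the given element path."""
    parser = PathExtractor(path)
    parser.feed(html)
    parser.close()
    return parser.matches


# Hypothetical page mirroring the tree on the slide.
PAGE = ("<html><body><b>Age of Spiritual Machines</b>"
        "<font>by <a href='#'>Ray Kurzweil</a></font></body></html>")

title = extract(PAGE, ["html", "body", "b"])
author = extract(PAGE, ["html", "body", "font", "a"])
print(title, author)
```

The text "by" sits directly under FONT, so neither path picks it up; a regex over the final character data would only be needed when the target node mixes the filler with other text.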

22 Template Types Slots in template typically filled by a substring from the document. Some slots may have a fixed set of pre-specified possible fillers that may not occur in the text itself. –Terrorist act: threatened, attempted, accomplished. –Job type: clerical, service, custodial, etc. –Company type: SEC code Some slots may allow multiple fillers. –Programming language Some domains may allow multiple extracted templates per document. –Multiple apartment listings in one ad

23 Simple Extraction Patterns Specify an item to extract for a slot using a regular expression pattern. –Price pattern: “\$\d+(\.\d{2})?\b” May require preceding (pre-filler) pattern to identify proper context. –Amazon list price: Pre-filler pattern: “ List Price: ” Filler pattern: “\$\d+(\.\d{2})?\b” May require succeeding (post-filler) pattern to identify the end of the filler. –Amazon list price: Pre-filler pattern: “ List Price: ” Filler pattern: “.+” Post-filler pattern: “ ”
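A minimal sketch of pre-filler / filler / post-filler matching with Python's `re` module follows. The helper name `extract_filler`, the sample `TEXT`, and the post-filler string `" ("` are assumptions made for illustration. The price regex has no leading `\b`, because a word boundary directly before `$` only matches when a word character precedes the dollar sign. The lazy `.+?` is swapped in for the greedy `.+` so the filler stops at the first post-filler match.

```python
import re

# Filler pattern for a price such as $14.95 or $12.
PRICE = r"\$\d+(?:\.\d{2})?\b"


def extract_filler(text, filler, pre="", post=""):
    """Return the first filler that directly follows `pre` and precedes `post`."""
    pattern = re.escape(pre) + "(" + filler + ")" + re.escape(post)
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Hypothetical text, loosely following the Amazon example.
TEXT = "List Price: $14.95 Our Price: $11.96 You Save: $2.99 (20%)"

first_price = extract_filler(TEXT, PRICE)                      # no context: first price wins
our_price = extract_filler(TEXT, PRICE, pre="Our Price: ")     # pre-filler selects the slot
you_save = extract_filler(TEXT, r".+?", pre="You Save: ", post=" (")  # post-filler ends it
print(first_price, our_price, you_save)
```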

24 Simple Template Extraction Extract slots in order, starting the search for the filler of the n+1 slot where the filler for the nth slot ended. Assumes slots always in a fixed order. –Title –Author –List price –… Make patterns specific enough to identify each filler always starting from the beginning of the document.
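The in-order strategy can be sketched as below: each slot's search starts where the previous filler ended. The function name, the slot patterns and the sample text are hypothetical; a real wrapper would use patterns tuned to the site's HTML.

```python
import re


def extract_in_order(text, slot_patterns):
    """Fill slots in the given order; each search resumes where the last filler ended."""
    record, pos = {}, 0
    for slot, pattern in slot_patterns:
        match = re.compile(pattern).search(text, pos)
        if match is None:
            record[slot] = None  # slot stays empty; the search position is unchanged
            continue
        record[slot] = match.group(1)
        pos = match.end(1)
    return record


PRICE = r"(\$\d+(?:\.\d{2})?)\b"
# Assumed fixed slot order: title, author, list price, price.
SLOTS = [
    ("title", r"^(.+?) by "),
    ("author", r"by (.+?) List Price:"),
    ("list_price", r"List Price: " + PRICE),
    ("price", r"Our Price: " + PRICE),
]
TEXT = ("The Age of Spiritual Machines by Ray Kurzweil "
        "List Price: $14.95 Our Price: $11.96")

record = extract_in_order(TEXT, SLOTS)
print(record)
```

The alternative named on the slide, making every pattern specific enough to be matched from the beginning of the document, corresponds to always searching from position 0 instead of `pos`.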

25 Pre-Specified Filler Extraction If a slot has a fixed set of pre-specified possible fillers, text categorization can be used to fill the slot. –Job category –Company type Treat each of the possible values of the slot as a category, and classify the entire document to determine the correct filler.
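As a rough illustration of filling a slot by text categorization, the sketch below trains a tiny multinomial Naive Bayes classifier with add-one smoothing and classifies a whole posting into one job type. The classifier choice, the function names and the six training snippets are assumptions for illustration only; any text categorizer could play this role.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Count words per category; labeled_docs is a list of (text, category)."""
    word_counts, class_counts, vocab = defaultdict(Counter), Counter(), set()
    for text, label in labeled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        class_counts[label] += 1
        vocab.update(words)
    return word_counts, class_counts, vocab


def classify(model, text):
    """Return the category with the highest (log) posterior for the whole document."""
    word_counts, class_counts, vocab = model
    n_docs = sum(class_counts.values())

    def log_posterior(label):
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / n_docs)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        return score

    return max(class_counts, key=log_posterior)


# Hypothetical training postings for the "job type" slot.
TRAIN = [
    ("filing typing reception duties", "clerical"),
    ("typing and data entry", "clerical"),
    ("waiter serving tables", "service"),
    ("serving customers at counter", "service"),
    ("cleaning floors and building maintenance", "custodial"),
    ("janitor cleaning offices", "custodial"),
]
model = train_nb(TRAIN)
job_type = classify(model, "part-time typing and filing position")
print(job_type)
```

The filler ("clerical") never has to appear in the posting itself, which is the point of a pre-specified filler set.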

26 Learning for IE Writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering. Alternative is to use machine learning: –Build a training set of documents paired with human-produced filled extraction templates. –Learn extraction patterns for each slot using an appropriate machine learning algorithm.

Finding “Sweet Spots” in computer-mediated cooperative work It is possible to get by with techniques blithely ignorant of semantics, when you have humans in the loop –All you need is to find the right sweet spot, where the computer plays a pre-processing role and presents “potential solutions” –…and the human very gratefully does the in-depth analysis on those few potential solutions Examples: –The incredible success of “Bag of Words” model! Bag of letters would be a disaster ;-) Bag of sentences and/or NLP would be good –..but only to your discriminating and irascible searchers ;-)

Collaborative Computing AKA Brain Cycle Stealing AKA Computizing Eyeballs A lot of exciting research related to the web currently involves “co-opting” the masses to help with large-scale tasks –It is like “cycle stealing”, except we are stealing “human brain cycles” (the most idle of the computers if there is ever one ;-) Remember the mice in the Hitchhiker’s Guide to the Galaxy? (..who were running a mass-scale experiment on the humans to figure out the question..) –Collaborative knowledge compilation (Wikipedia!) –Collaborative curation –Collaborative tagging –Paid collaboration/contracting Many big open issues –How do you pose the problem such that it can be solved using collaborative computing? –How do you “incentivize” people into letting you steal their brain cycles? Pay them! (Amazon mturk.com)

Tapping into the Collective Unconscious Another thread of exciting research is driven by the realization that WEB is not random at all! –It is written by humans –…so analyzing its structure and content allows us to tap into the collective unconscious.. Meaning can emerge from syntactic notions such as “co-occurrences” and “connectedness” Examples: –Analyzing term co-occurrences in the web-scale corpora to capture semantic information (today’s paper) –Analyzing the link-structure of the web graph to discover communities DoD and NSA are very much into this as a way of breaking terrorist cells –Analyzing the transaction patterns of customers (collaborative filtering)

37 Information Extraction from unstructured text

39 Information Extraction from Unstructured Text: Semantic web needs: –Tagged data –Background knowledge (blue sky approaches to) automate both –Knowledge Extraction Extract base level knowledge (“facts”) directly from the web –Automated tagging Start with a background ontology and tag other web pages –Semtag/Seeker

40 Fielded IE Systems: Citeseer, Google Scholar; Libra How do they do it? Why do they fail? 

What is “Information Extraction” As a task: filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
IE output:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
Slides from Cohen & McCallum

What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + association + clustering
Applied to the same October 14, 2002 news article:
Segmented and classified mentions: Microsoft Corporation; CEO; Bill Gates; Microsoft; Gates; Microsoft; Bill Veghte; Microsoft; VP; Richard Stallman; founder; Free Software Foundation
After association and clustering:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
Slides from Cohen & McCallum

IE in Context
[Figure: pipeline diagram. Document collection → Spider → Filter by relevance → IE (Segment, Classify, Associate, Cluster) → Load DB → Database → Query, Search; Data mine. Supporting steps: Create ontology; Label training data; Train extraction models.]
Slides from Cohen & McCallum

IE History Pre-Web Mostly news articles –De Jong’s FRUMP [1982] Hand-built system to fill Schank-style “scripts” from news wire –Message Understanding Conference (MUC) DARPA [’87-’95], TIPSTER [’92- ’96] Most early work dominated by hand-built models –E.g. SRI’s FASTUS, hand-built FSMs. –But by 1990’s, some machine learning: Lehnert, Cardie, Grishman and then HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98] Web AAAI ’94 Spring Symposium on “Software Agents” –Much discussion of ML applied to Web. Maes, Mitchell, Etzioni. Tom Mitchell’s WebKB, ‘96 –Build KB’s from the Web. Wrapper Induction –First by hand, then ML: [Doorenbos ‘96], [Soderland ’96], [Kushmerick ’97],… Slides from Cohen & McCallum

What makes IE from the Web Different? Less grammar, but more formatting & linking. The directory structure, link structure, formatting & layout of the Web is its own new grammar.
Newswire: Apple to Open Its First Retail Store in New York City MACWORLD EXPO, NEW YORK--July 17, 2002-- Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience. "Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles."
Web: www.apple.com/retail, www.apple.com/retail/soho, www.apple.com/retail/soho/theatre.html
Slides from Cohen & McCallum

Landscape of IE Tasks (1/4): Pattern Feature Domain
–Text paragraphs without formatting, e.g.: Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.
–Grammatical sentences and some formatting & links
–Non-grammatical snippets, rich formatting & links
–Tables
Slides from Cohen & McCallum

Landscape of IE Tasks (2/4): Pattern Scope
–Web site specific (formatting): Amazon.com Book Pages
–Genre specific (layout): Resumes
–Wide, non-specific (language): University Names
Slides from Cohen & McCallum

Landscape of IE Tasks (3/4): Pattern Complexity E.g. word patterns:
–Closed set (U.S. states): “He was born in Alabama…”; “The big Wyoming sky…”
–Regular set (U.S. phone numbers): “Phone: (413) 545-1323”; “The CALD main office can be reached at 412-268-1299”
–Complex pattern (U.S. postal addresses): “University of Arkansas P.O. Box 140 Hope, AR 71802”; “Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio 45210”
–Ambiguous patterns, needing context + many sources of evidence (person names): “…was among the six houses sold by Hope Feldman that year.”; “Pawel Opalinski, Software Engineer at WhizBang Labs.”
Slides from Cohen & McCallum

Landscape of IE Tasks (4/4): Pattern Combinations
Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.
–Single entity (“named entity” extraction): Person: Jack Welch; Person: Jeffrey Immelt; Location: Connecticut
–Binary relationship: Relation: Person-Title, Person: Jack Welch, Title: CEO; Relation: Company-Location, Company: General Electric, Location: Connecticut
–N-ary record: Relation: Succession, Company: General Electric, Title: CEO, Out: Jack Welch, In: Jeffrey Immelt
Slides from Cohen & McCallum

Evaluation of Single Entity Extraction
TRUTH: Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.
PRED: Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.
[The segment highlighting from the slide is not preserved: 4 true segments, 6 predicted segments, 2 of them correct.]
$\text{Precision} = \frac{\#\,\text{correctly predicted segments}}{\#\,\text{predicted segments}} = \frac{2}{6}$
$\text{Recall} = \frac{\#\,\text{correctly predicted segments}}{\#\,\text{true segments}} = \frac{2}{4}$
$F_1 = \text{harmonic mean of Precision and Recall} = \frac{1}{\left(\frac{1}{P} + \frac{1}{R}\right)/2}$
Slides from Cohen & McCallum
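The segment-level metrics can be checked with a short sketch. Segments are compared by exact match on (start, end) token offsets; since the slide's highlighting is lost, the particular predicted spans below are hypothetical and chosen only to reproduce the counts 2 correct, 6 predicted, 4 true.

```python
from fractions import Fraction


def prf1(true_segments, predicted_segments):
    """Exact-match precision, recall and F1 over sets of (start, end) segments."""
    true_set, pred_set = set(true_segments), set(predicted_segments)
    correct = len(true_set & pred_set)
    precision = Fraction(correct, len(pred_set)) if pred_set else Fraction(0)
    recall = Fraction(correct, len(true_set)) if true_set else Fraction(0)
    if precision + recall == 0:
        return precision, recall, Fraction(0)
    # Harmonic mean: 1 / (((1/P) + (1/R)) / 2) = 2PR / (P + R)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Half-open token spans over the example sentence (whitespace tokens).
TRUE = [(0, 2), (3, 5), (11, 14), (15, 17)]   # the four person names
PRED = [(0, 2), (3, 5), (7, 8), (11, 12), (13, 14), (15, 16)]  # hypothetical predictions

precision, recall, f1 = prf1(TRUE, PRED)
print(precision, recall, f1)
```

Note that a partially overlapping prediction such as (15, 16) counts as wrong under exact matching, which is why segment-level scores are usually lower than token-level ones.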

State of the Art Performance Named entity recognition –Person, Location, Organization, … –F1 in high 80’s or low- to mid-90’s Binary relation extraction –Contained-in (Location1, Location2) Member-of (Person1, Organization1) –F1 in 60’s or 70’s or 80’s Wrapper induction –Extremely accurate performance obtainable –Human effort (~30min) required on each site Slides from Cohen & McCallum

Landscape of IE Techniques (1/1): Models Any of these models can be used to capture words, formatting or both. Running example: “Abraham Lincoln was born in Kentucky.”
–Lexicons: is a candidate a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
–Classify pre-segmented candidates: a classifier decides which class each candidate belongs to.
–Sliding window: a classifier decides which class each window belongs to; try alternate window sizes.
–Boundary models: a classifier marks BEGIN and END positions.
–Finite state machines: most likely state sequence?
–Context free grammars: most likely parse? (parse tree over tags such as NNP, V, NP, PP, VP, S)
–…and beyond
Slides from Cohen & McCallum

Landscape: Focus of this Tutorial
–Pattern complexity: closed set, regular, complex, ambiguous
–Pattern feature domain: words, words + formatting, formatting
–Pattern scope: site-specific, genre-specific, general
–Pattern combinations: entity, binary, n-ary
–Models: lexicon, regex, window, boundary, FSM, CFG
Slides from Cohen & McCallum

Sliding Windows Slides from Cohen & McCallum

Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement Slides from Cohen & McCallum

A “Naïve Bayes” Sliding Window Model [Freitag 1997]
… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
Window layout: prefix $w_{t-m} \dots w_{t-1}$, contents $w_t \dots w_{t+n}$, suffix $w_{t+n+1} \dots w_{t+n+m}$
–Estimate Pr(LOCATION|window) using Bayes rule
–Try all “reasonable” windows (vary length, position)
–Assume independence for length, prefix words, suffix words, content words
–Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)
–If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.
Other examples of sliding window: [Baluja et al 2000] (decision tree over individual words & their context)
Slides from Cohen & McCallum
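A simplified sketch of the sliding-window idea is given below; it is not Freitag's exact estimator. Each window is scored by a length prior plus independent per-word terms for prefix, contents and suffix. As an assumption of this sketch, each word term is a log-ratio against a background unigram model (so windows of different lengths stay comparable), and the best-scoring window is returned instead of applying a probability threshold. The class name, the three training announcements and the test announcement are made up for illustration.

```python
import math
from collections import Counter


class NBWindow:
    def __init__(self, context=2):
        self.m = context  # number of prefix/suffix tokens considered
        self.parts = {"prefix": Counter(), "contents": Counter(), "suffix": Counter()}
        self.background = Counter()  # unigram counts over all training tokens
        self.lengths = Counter()     # histogram of field lengths
        self.vocab = 1

    def _split(self, tokens, s, e):
        return {"prefix": tokens[max(0, s - self.m):s],
                "contents": tokens[s:e],
                "suffix": tokens[e:e + self.m]}

    def train(self, docs):
        """docs: iterable of (tokens, (start, end)) with a half-open field span."""
        for tokens, (s, e) in docs:
            self.background.update(tokens)
            self.lengths[e - s] += 1
            for part, words in self._split(tokens, s, e).items():
                self.parts[part].update(words)
        self.vocab = len(self.background) + 1  # +1 reserves mass for unseen tokens

    def _logp(self, counts, word):
        # Add-one smoothed log probability of `word` under `counts`.
        return math.log((counts[word] + 1) / (sum(counts.values()) + self.vocab))

    def score(self, tokens, s, e):
        """Log length prior plus naive (independent) per-word log-ratios."""
        total = math.log(self.lengths[e - s] / sum(self.lengths.values()))
        for part, words in self._split(tokens, s, e).items():
            for w in words:
                total += self._logp(self.parts[part], w) - self._logp(self.background, w)
        return total

    def best_window(self, tokens):
        # Only window lengths seen in training are tried.
        cands = [(self.score(tokens, s, s + k), s, s + k)
                 for k in self.lengths
                 for s in range(len(tokens) - k + 1)]
        return max(cands)


TRAIN = [
    ("Time : 3:30 pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split(), (6, 10)),
    ("Time : 2:00 pm Place : Doherty Hall 1112 Speaker : Tom Mitchell".split(), (6, 9)),
    ("Place : Baker Hall 355 Time : noon Speaker : Jaime Carbonell".split(), (2, 5)),
]
model = NBWindow(context=2)
model.train(TRAIN)

tokens = "Time : 4:00 pm Place : Wean Hall 7500 Speaker : Dana Scott".split()
_, start, end = model.best_window(tokens)
location = " ".join(tokens[start:end])
print(location)
```

Restricting candidates to lengths seen in training follows the "no shorter than the shortest example, no longer than the longest" heuristic; a thresholded version would keep every window whose score clears a tuned cut-off.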

“Naïve Bayes” Sliding Window Results
Domain: CMU UseNet Seminar Announcements
Field         F1
Person Name   30%
Location      61%
Start Time    98%
Slides from Cohen & McCallum

Realistic sliding-window-classifier IE What windows to consider? –all windows containing as many tokens as the shortest example, but no more tokens than the longest example How to represent a classifier? It might: –Restrict the length of window; –Restrict the vocabulary or formatting used before/after/inside window; –Restrict the relative order of tokens, etc. Learning Method –SRV: Top-Down Rule Learning [Freitag AAAI ‘98] –Rapier: Bottom-Up [Califf & Mooney, AAAI ‘99] Slides from Cohen & McCallum

Rapier: results – precision/recall Slides from Cohen & McCallum

Rapier – results vs. SRV Slides from Cohen & McCallum

Rule-learning approaches to sliding-window classification: Summary SRV, Rapier, and WHISK [Soderland KDD ‘97] –Representations for classifiers allow restriction of the relationships between tokens, etc –Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog) –Use of these “heavyweight” representations is complicated, but seems to pay off in results Can simpler representations for classifiers work? Slides from Cohen & McCallum

BWI: Learning to detect boundaries Another formulation: learn three probabilistic classifiers: –START(i) = Prob( position i starts a field) –END(j) = Prob( position j ends a field) –LEN(k) = Prob( an extracted field has length k) Then score a possible extraction (i,j) by START(i) * END(j) * LEN(j-i) LEN(k) is estimated from a histogram [Freitag & Kushmerick, AAAI 2000] Slides from Cohen & McCallum
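The scoring rule START(i) * END(j) * LEN(j-i) can be sketched as follows. The boundary probabilities here are made-up numbers standing in for the output of learned detectors, and the function names are illustrative; only the combination rule and the histogram estimate of LEN come from the slide.

```python
from collections import Counter


def length_model(training_lengths):
    """LEN(k): relative frequency of field length k among the training fields."""
    hist = Counter(training_lengths)
    total = sum(hist.values())
    return lambda k: hist[k] / total


def score_extractions(start, end, length_prob, threshold=0.0):
    """Score every pair (i, j), i < j, as START(i) * END(j) * LEN(j - i)."""
    scored = []
    for i, p_start in start.items():
        for j, p_end in end.items():
            if j <= i:
                continue
            score = p_start * p_end * length_prob(j - i)
            if score > threshold:
                scored.append((score, i, j))
    return sorted(scored, reverse=True)


# Hypothetical detector outputs: boundary position -> probability.
START = {6: 0.9, 7: 0.2}
END = {9: 0.8, 10: 0.3}
LEN = length_model([3, 3, 4, 2])  # field lengths observed in training

ranked = score_extractions(START, END, LEN)
best_score, best_i, best_j = ranked[0]
print(ranked)
```

Because START and END are scored independently, the LEN term is what discourages pairing a confident start with a distant, unrelated end.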

BWI: Learning to detect boundaries BWI uses boosting to find “detectors” for START and END Each weak detector has a BEFORE and AFTER pattern (on tokens before/after position i). Each “pattern” is a sequence of –tokens and/or –wildcards like: anyAlphabeticToken, anyNumber, … Weak learner for “patterns” uses greedy search (+ lookahead) to repeatedly extend a pair of empty BEFORE,AFTER patterns Slides from Cohen & McCallum

BWI: Learning to detect boundaries
Field         F1
Person Name   30%
Location      61%
Start Time    98%
Slides from Cohen & McCallum

Problems with Sliding Windows and Boundary Finders Decisions in neighboring parts of the input are made independently from each other. –Naïve Bayes Sliding Window may predict a “seminar end time” before the “seminar start time”. –It is possible for two overlapping windows to both be above threshold. –In a Boundary-Finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step. Solution? Joint inference… Slides from Cohen & McCallum

Extraction: Named Entity → Binary Relations How to Extend a Sliding Window Approach?

Snowball

Pattern Representation Brittle candidate generation? –Can’t extract if location mentioned before organization? –Tag_ is a named entity tag –Pat_ is vector (in term space) Degree of Match Dependence on Alembic Tagger

Generating & Evaluating Patterns Generation of Candidate Patterns Evaluation of Candidate Patterns –Selectivity vs Coverage vs Confidence (Precision) –Riloff’s Conf * log |Positive| 2/2 ~ 4/12

Evaluating Tuples
$\mathrm{Conf}(T) = 1 - \prod_{i=0}^{|P|} \left(1 - \mathrm{Conf}(P_i) \cdot \mathrm{Match}(T, P_i)\right)$
$\mathrm{Conf}(P) = \mathrm{Conf}_n(P) \cdot W + \mathrm{Conf}_o(P) \cdot (1 - W)$
Comments? Simulated Annealing? Discard poor tuples? (vs not count as seeds) Lower confidence of old tuples?
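The two confidence formulas can be worked through numerically with a small sketch. The function names and the example values are assumptions; Conf_n and Conf_o are read here as the new and old pattern confidences, with W weighting the new one.

```python
import math


def tuple_confidence(pattern_matches):
    """Conf(T) = 1 - prod_i (1 - Conf(P_i) * Match(T, P_i)).

    pattern_matches: list of (Conf(P_i), Match(T, P_i)) pairs, one per
    pattern that generated the tuple.
    """
    return 1.0 - math.prod(1.0 - conf * match for conf, match in pattern_matches)


def updated_pattern_confidence(conf_new, conf_old, w):
    """Conf(P) = Conf_n(P) * W + Conf_o(P) * (1 - W)."""
    return conf_new * w + conf_old * (1.0 - w)


# A tuple supported by two patterns: a strong exact match and a weaker partial match.
conf_t = tuple_confidence([(0.8, 1.0), (0.5, 0.6)])   # 1 - (0.2 * 0.7)
conf_p = updated_pattern_confidence(conf_new=0.9, conf_old=0.5, w=0.5)
print(conf_t, conf_p)
```

The product form means every additional supporting pattern can only raise a tuple's confidence, which is one way to read the "lower confidence of old tuples?" question on the slide.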

Overall Algorithm Relation to EM? Relation to KnowItAll Will it work for the long tail? Tagging vs Full NLP Synonyms Negative Examples General Relations vs Functions (Keys)

Evaluation Effect of Seed Quality Effect of Seed Quantity Other Domains –Shouldn’t this expt be easy? Ease of Use –Training Examples vs Parameter Tweaking

Contributions Techniques for Pattern Generation Strategies for Evaluating Patterns & Tuples Evaluation Methodology & Metrics

References
[Bikel et al 1997] Bikel, D.; Miller, S.; Schwartz, R.; and Weischedel, R.: Nymble: a high-performance learning name-finder. In Proceedings of ANLP’97, p. 194-201.
[Califf & Mooney 1999] Califf, M.E.; Mooney, R.: Relational Learning of Pattern-Match Rules for Information Extraction. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).
[Cohen, Hurst, Jensen 2002] Cohen, W.; Hurst, M.; Jensen, L.: A flexible learning system for wrapping tables and lists in HTML documents. In Proceedings of the Eleventh International World Wide Web Conference (WWW-2002).
[Cohen, Kautz, McAllester 2000] Cohen, W.; Kautz, H.; McAllester, D.: Hardening soft information sources. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000).
[Cohen 1998] Cohen, W.: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity. In Proceedings of ACM SIGMOD-98.
[Cohen 2000a] Cohen, W.: Data Integration using Similarity Joins and a Word-based Information Representation Language. ACM Transactions on Information Systems, 18(3).
[Cohen 2000b] Cohen, W.: Automatically Extracting Features for Concept Learning from the Web. In Machine Learning: Proceedings of the Seventeenth International Conference (ML-2000).
[Collins & Singer 1999] Collins, M.; and Singer, Y.: Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
[De Jong 1982] De Jong, G.: An Overview of the FRUMP System. In: Lehnert, W. & Ringle, M. H. (eds), Strategies for Natural Language Processing. Lawrence Erlbaum, 1982, 149-176.
[Freitag 1998] Freitag, D.: Information extraction from HTML: application of a general machine learning approach. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98).
[Freitag 1999] Freitag, D.: Machine Learning for Information Extraction in Informal Domains. Ph.D. dissertation, Carnegie Mellon University.
[Freitag 2000] Freitag, D.: Machine Learning for Information Extraction in Informal Domains. Machine Learning 39(2/3): 99-101 (2000).
[Freitag & Kushmerick 2000] Freitag, D.; Kushmerick, N.: Boosted Wrapper Induction. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000).
[Freitag & McCallum 1999] Freitag, D.; and McCallum, A.: Information extraction using HMMs and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction. AAAI Technical Report WS-99-11.
[Kushmerick 2000] Kushmerick, N.: Wrapper Induction: efficiency and expressiveness. Artificial Intelligence, 118, pp. 15-68.
[Lafferty, McCallum & Pereira 2001] Lafferty, J.; McCallum, A.; and Pereira, F.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML-2001.
[Leek 1997] Leek, T. R.: Information extraction using hidden Markov models. Master’s thesis, UC San Diego.
[McCallum, Freitag & Pereira 2000] McCallum, A.; Freitag, D.; and Pereira, F.: Maximum entropy Markov models for information extraction and segmentation. In Proceedings of ICML-2000.
[Miller et al 2000] Miller, S.; Fox, H.; Ramshaw, L.; Weischedel, R.: A Novel Use of Statistical Parsing to Extract Information from Text. In Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL), p. 226-233.
Slides from Cohen & McCallum

82 More Ambitious (Blue Sky) Approaches Semantic web needs: –Tagged data –Background knowledge (Blue sky approaches to) automate both: –Knowledge Extraction: extract base-level knowledge (“facts”) directly from the web –Automated tagging: start with a background ontology and tag other web pages –Semtag/Seeker The information extraction tasks in fielded applications like Citeseer/Libra are narrowly focused: –We assume that we are learning specific relations (e.g. author/title etc.) –We assume that the extracted relations will be put in a database for db-style look-up Let’s look at the state of the feasible art before going to blue sky.

83 Extraction from Free Text involves Natural Language Processing If extracting from automatically generated web pages, simple regex patterns usually work. If extracting from more natural, unstructured, human-written text, some NLP may help. –Part-of-speech (POS) tagging: mark each word as a noun, verb, preposition, etc. –Syntactic parsing: identify phrases (NP, VP, PP) –Semantic word categories (e.g. from WordNet): KILL: kill, murder, assassinate, strangle, suffocate Off-the-shelf software is available to do this! –The “Brill” tagger Extraction patterns can use POS or phrase tags. Analogy to regex patterns on DOM trees for structured text.

84 I. Generate-n-Test Architecture Generic extraction patterns (Hearst ’92): “…Cities such as Boston, Los Angeles, and Seattle…” (“C such as NP1, NP2, and NP3”) => IS-A(each(head(NP)), C), … “…Detailed information for several countries such as maps, …” (rejected by requiring ProperNoun(head(NP))) “I listen to pretty much all music but prefer country such as Garth Brooks” (still passes the pattern, so candidates need testing) Template-Driven Extraction (where the template is in terms of the syntax tree)
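The “C such as NP1, NP2, and NP3” pattern can be sketched with a regular expression. This is only an illustration under strong assumptions: a noun phrase is approximated as a run of capitalized words (which also stands in for the ProperNoun filter), and the names `PROPER_NP`, `SUCH_AS` and `hearst_such_as` are hypothetical, not taken from any system on the slides. A real extractor would work on a parse tree.

```python
import re

# Assumption: a proper-noun phrase is a run of capitalized words.
PROPER_NP = r"[A-Z][\w'-]*(?: [A-Z][\w'-]*)*"

# "C such as NP1, NP2, and NP3" with one or more proper-noun phrases.
SUCH_AS = re.compile(
    rf"(?P<cls>\w+) such as "
    rf"(?P<items>{PROPER_NP}(?:, {PROPER_NP})*(?:,? and {PROPER_NP})?)"
)


def hearst_such_as(text):
    """Return candidate (instance, class) IS-A pairs found in text."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        cls = match.group("cls").lower()
        # Split the enumeration on ", " and on an optional-comma " and ".
        items = re.split(r",? and |, ", match.group("items"))
        pairs.extend((item, cls) for item in items if item)
    return pairs


if __name__ == "__main__":
    print(hearst_such_as("Cities such as Boston, Los Angeles, and Seattle are busy"))
    print(hearst_such_as("several countries such as maps, charts"))
    print(hearst_such_as("but prefer country such as Garth Brooks"))
```

The second sentence yields nothing because “maps” is not capitalized, while the third still produces a wrong candidate, which is why a separate test step is needed.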

85 Test Assess candidate extractions using Mutual Information (PMI-IR) (Turney ’01). Many variations are possible…

86 ...but many things indicate “city”ness. PMI = frequency of I & D co-occurrence. 5-50 discriminators D_i. Each PMI for D_i is a feature f_i. Naïve Bayes evidence combination. PMI is used for feature selection; NBC is used for learning. Hits are used for assessing PMI as well as conditional probabilities. Discriminator phrases f_i: “x is a city”, “x has a population of”, “x is the capital of y”, “x’s baseball team…” Keep the probabilities with the extracted facts.
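The evidence combination can be sketched as a standard naive Bayes posterior. The assumptions here are mine: each PMI score is turned into a boolean feature by a threshold, and the conditional probabilities are supplied by the caller (the slide says they are estimated from hit counts, but not how). The function name and the numbers in the usage example are hypothetical.

```python
def naive_bayes_posterior(features, likelihoods, prior=0.5):
    """P(class | f_1..f_n), assuming features are independent given the class.

    features:    list of booleans; f_i is True when the PMI for
                 discriminator D_i exceeds its threshold (assumption).
    likelihoods: list of (P(f_i | class), P(f_i | not class)) pairs.
    prior:       P(class) before seeing any feature.
    """
    p_pos, p_neg = prior, 1.0 - prior
    for fired, (p_given_pos, p_given_neg) in zip(features, likelihoods):
        p_pos *= p_given_pos if fired else 1.0 - p_given_pos
        p_neg *= p_given_neg if fired else 1.0 - p_given_neg
    return p_pos / (p_pos + p_neg)


if __name__ == "__main__":
    # Two discriminators, both fire; made-up conditional probabilities.
    print(naive_bayes_posterior([True, True], [(0.9, 0.1), (0.8, 0.2)]))
```

The returned probability is what would be stored alongside the extracted fact.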

87 Assessment In Action 1. I = “Yakima” (1,340,000) 2. D = 3. I+D = “Yakima city” (2760) 4. PMI = (2760 / 1.34M) ≈ 0.002 I = “Avocado” (1,000,000) I+D = “Avocado city” (10) PMI = 0.00001 << 0.002
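The ratio used in this assessment is easy to reproduce from the hit counts quoted above. The sketch below assumes the simplified form shown on the slide (hits for instance plus discriminator, divided by hits for the instance alone); the function name `pmi_score` is hypothetical.

```python
def pmi_score(hits_instance_and_discriminator, hits_instance):
    """Simplified PMI-IR score: co-occurrence hits / instance hits."""
    if hits_instance == 0:
        return 0.0  # no evidence at all for this instance
    return hits_instance_and_discriminator / hits_instance


# Hit counts taken from the slide.
yakima = pmi_score(2760, 1_340_000)
avocado = pmi_score(10, 1_000_000)

if __name__ == "__main__":
    print(f"Yakima:  {yakima:.5f}")
    print(f"Avocado: {avocado:.5f}")
    print("Yakima looks far more city-like:", yakima > 100 * avocado)
```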

88 Some Sources of ambiguity Time: “Clinton is the president” (in 1996). Context: “common misconceptions..” Opinion: Elvis… Multiple word senses: Amazon, Chicago, Chevy Chase, etc. –Dominant senses can mask recessive ones! –Approach: unmasking. ‘Chicago –City’

89 Chicago (figure: City sense vs. Movie sense)

90 Chicago Unmasked (figure: City sense vs. Movie sense)

91 Impact of Unmasking on PMI
Name          Recessive  Original  Unmask  Boost
Washington    city       0.50      0.99    96%
Casablanca    city       0.41      0.93    127%
Chevy Chase   actor      0.09      0.58    512%
Chicago       movie      0.02      0.21    972%

92 CBioC: Collaborative Bio-Curation Motivation –To help extract information nuggets from articles and abstracts and store them in a database. –The challenge is that the number of articles is huge and keeps growing, and the articles must be processed as natural language. –The two existing approaches are human curation and automatic information extraction systems. –Neither meets the challenge: the first is expensive, while the second is error-prone.

93 CBioC (cont’d) Approach: We propose a solution that is inexpensive and that scales up. –Our approach takes advantage of automatic information extraction methods as a starting point. –It is based on the premise that if there are a lot of articles, then there must be a lot of readers and authors of these articles. –We provide a mechanism by which the readers of the articles can participate and collaborate in the curation of information. –We refer to our approach as “Collaborative Curation”.

94 Using the C-BioCurator System (cont’d)

What is the main difference between KnowItAll and CBioC? Assessment: KnowItAll does it by search-engine hit counts; CBioC does it by voting.

96 Annotation “The Chicago Bulls announced yesterday that Michael Jordan will...” “The <resource ref="http://tap.stanford.edu/BasketballTeam_Bulls">Chicago Bulls</resource> announced yesterday that <resource ref="http://tap.stanford.edu/AthleteJordan,_Michael">Michael Jordan</resource> will...”
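Producing this kind of markup from a table of known labels can be sketched as follows. This is an illustration only: the label-to-URI dictionary, the function name `annotate`, and the closing `</resource>` tags are assumptions, and no disambiguation is attempted.

```python
import re
from xml.sax.saxutils import quoteattr


def annotate(text, uri_for_label):
    """Wrap every known label in text with a <resource ref="..."> element.

    uri_for_label: hypothetical dict mapping a surface label to its URI.
    """
    if not uri_for_label:
        return text
    # Longest labels first, so a longer label wins over its own prefix.
    labels = sorted(uri_for_label, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(label) for label in labels))

    def wrap(match):
        label = match.group(0)
        return f"<resource ref={quoteattr(uri_for_label[label])}>{label}</resource>"

    # A single pass avoids re-matching labels inside inserted tags.
    return pattern.sub(wrap, text)


if __name__ == "__main__":
    table = {
        "Chicago Bulls": "http://tap.stanford.edu/BasketballTeam_Bulls",
        "Michael Jordan": "http://tap.stanford.edu/AthleteJordan,_Michael",
    }
    print(annotate("The Chicago Bulls announced yesterday that Michael Jordan will...", table))
```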

97 Semantic Annotation Picture from http://lsdis.cs.uga.edu/courses/SemWebFall2005/courseMaterials/CSCI8350-Metadata.ppt The simplest task of metadata extraction on NL is to establish a “type” relation between entities in the NL resources and concepts in ontologies: Named Entity Identification.

98 Semantics Semantic Annotation –The content of an annotation consists of some rich semantic information –Targeted not only at human readers of resources but also at software agents –Formal: metadata following structural standards; informal: personal notes written in the margin while reading an article –Explicit: carries sufficient information for interpretation; tacit: many personal annotations (telegraphic and incomplete) http://www-scf.usc.edu/~csci586/slides/6

99 Uses of Annotation http://www-scf.usc.edu/~csci586/slides/8

100 Objectives of Annotation Generate metadata for existing information –e.g., author-tag in HTML –RDF descriptions to HTML –Content description to multimedia files Employ metadata for –Improved search –Navigation –Presentation –Summarization of contents http://www.aifb.uni-karlsruhe.de/WBS/sst/Teaching/Intelligente%20System%20im%20WWW%20SS%202000/10-Annotation.pdf

101 Annotation Current practice of annotation for knowledge identification and extraction –is time consuming –needs annotation by experts –is complex Reduce the burden of text annotation for Knowledge Management www.racai.ro/EUROLAN-2003/html/presentations/SheffieldWilksBrewsterDingli/Eurolan2003AlexieiDingli.ppt

SemTag & Seeker WWW-03 Best Paper Prize Seeded with TAP ontology (72k concepts) –And ~700 human judgments Crawled 264 million web pages Extracted 434 million semantic tags –Automatically disambiguated

103 SemTag Research project at IBM Very large scale: largest to date, 264 million web pages Goal: to provide an early set of widespread semantic tags through automated generation

104 SemTag Uses broad, shallow knowledge base TAP – lexical and taxonomic information about popular objects –Music –Movies –Sports –Etc.

105 SemTag Problem: –No write access to original document, so how do you annotate? Solution: –Store annotations in a web-available database

106 SemTag Semantic Label Bureau –Separate store of semantic annotation information –HTTP server that can be queried for annotation information –Example Find all semantic tags for a given document Find all semantic tags for a particular object
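The two example queries suggest a store indexed both by document and by object. A minimal in-memory sketch follows, under the assumption that a tag is a (document URL, object URI, token offset) triple; the class and method names are hypothetical, and the HTTP front end is left out.

```python
from collections import defaultdict


class LabelBureau:
    """Toy annotation store kept separate from the annotated documents."""

    def __init__(self):
        self._by_document = defaultdict(list)
        self._by_object = defaultdict(list)

    def add_tag(self, doc_url, object_uri, offset):
        """Record that object_uri is mentioned in doc_url at a token offset."""
        tag = (doc_url, object_uri, offset)
        self._by_document[doc_url].append(tag)
        self._by_object[object_uri].append(tag)

    def tags_for_document(self, doc_url):
        """Find all semantic tags for a given document."""
        return list(self._by_document.get(doc_url, []))

    def tags_for_object(self, object_uri):
        """Find all semantic tags for a particular object."""
        return list(self._by_object.get(object_uri, []))


if __name__ == "__main__":
    bureau = LabelBureau()
    bureau.add_tag("http://example.org/a", "tap:Bulls", 1)
    bureau.add_tag("http://example.org/b", "tap:Bulls", 7)
    print(bureau.tags_for_document("http://example.org/a"))
    print(bureau.tags_for_object("tap:Bulls"))
```

Keeping both indexes makes each query a single lookup, at the cost of storing every tag twice.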

107 SemTag Methodology

108 SemTag Three phases 1. Spotting Pass: –Tokenize the document –All instances plus 20 word window 2. Learning Pass: –Find corpus-wide distribution of terms at each internal node of taxonomy –Based on a representative sample 3. Tagging Pass: –Scan windows to disambiguate each reference –Finally determined to be a TAP object
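The spotting pass can be sketched as tokenizing the document and keeping a window of words around each label occurrence. The slide gives only the window size; splitting it evenly before and after the label, the `\w+` tokenizer, and the function name `spot` are assumptions made for illustration.

```python
import re


def spot(text, labels, window=20):
    """Return (label, token_index, context) for each label occurrence.

    The context is up to window/2 tokens before and after the label
    (an assumed split; the slide only states a 20-word window).
    """
    tokens = re.findall(r"\w+", text)
    lowered = [token.lower() for token in tokens]
    half = window // 2
    spots = []
    for label in labels:
        words = label.lower().split()
        n = len(words)
        for i in range(len(tokens) - n + 1):
            if lowered[i:i + n] == words:
                context = tokens[max(0, i - half):i] + tokens[i + n:i + n + half]
                spots.append((label, i, context))
    return spots


if __name__ == "__main__":
    sentence = "The Chicago Bulls announced yesterday that Michael Jordan will retire"
    for found in spot(sentence, ["Michael Jordan"], window=4):
        print(found)
```

Each (label, context) pair is what the later tagging pass would score against the taxonomy nodes.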

109 SemTag Another problem magnified by the scale: –Ambiguity Resolution Two fundamental categories of ambiguities: 1.Some labels appear at multiple locations 2.Some entities have labels that occur in contexts that have no representative in the taxonomy

110 SemTag Solution: – Taxonomy Based Disambiguation (TBD) TBD expectation: –Human tuned parameters used in small, critical sections –Automated approaches deal with bulk of information

111 SemTag TBD methodology: –Each node in the taxonomy is associated with a set of labels: Cats, Football, Cars all contain “jaguar” –Each label in the text is stored with a window of 20 words (the context) –Each node has an associated similarity function mapping a context to a similarity: higher similarity → more likely to contain a reference

112 SemTag Similarity: –Built a 200,000-word lexicon (the 200,100 most common words minus the 100 most common) –200,000-dimensional vector space –Training: spots (label, context) and correct node –Estimated the distribution of terms for nodes –Standard cosine similarity on TF-IDF vectors (context vs. node)
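The context-versus-node comparison can be sketched with sparse TF-IDF vectors and cosine similarity. Everything concrete here is assumed for illustration: the tiny IDF table, the two node vectors, and the function names; the slides do not give the actual weighting details.

```python
import math
from collections import Counter


def tfidf(tokens, idf):
    """Sparse TF-IDF vector (dict) over the words present in the lexicon."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_node(context_tokens, node_vectors, idf):
    """Pick the taxonomy node whose term vector is closest to the context."""
    context = tfidf(context_tokens, idf)
    return max(node_vectors, key=lambda node: cosine(context, node_vectors[node]))


if __name__ == "__main__":
    # Made-up lexicon and node vectors for the "jaguar" example.
    idf = {"engine": 1.0, "speed": 1.0, "fur": 1.0, "prey": 1.0}
    nodes = {
        "Cars": {"engine": 2.0, "speed": 1.0},
        "Cats": {"fur": 2.0, "prey": 1.0},
    }
    words = "the jaguar has a fast engine and top speed".split()
    print(best_node(words, nodes, idf))
```

Words outside the lexicon are simply dropped, which mirrors the idea of excluding the most common words from the vector space.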

113 SemTag References inside the taxonomy vs. References outside the taxonomy Multiple nodes: b = r  b != p(v) Is a context c appropriate for a node v

114 SemTag Some internal nodes very popular: –Associate a measurement of how accurate Sim is likely to be at a node –Also, how ambiguous the node is overall (consistency of human judgment) TBD Algorithm: returns 1 or 0 to indicate whether a particular context c is on topic for a node v 82% accuracy on 434 million spots

115 SemTag

116 Summary Information extraction can be motivated either as explicating more structure from the data or as an automated route to the Semantic Web. Extraction complexity depends on whether the text you have is “templated” or “free-form”: –Extraction from templated text can be done by regular expressions –Extraction from free-form text requires NLP, and can be done in terms of part-of-speech tagging. “Annotation” involves connecting terms in a free-form text to items in the background knowledge –It too can be automated
