Wrapper Induction & Other Use of “Structure” 2-1-2007.


1 Wrapper Induction & Other Use of “Structure” 2-1-2007

2 Outline Deadlines and projects and such What wrapper learning is about BWI and some related work MUC7: MENE & others

3 Administrative issues
Don't forget – critiques due on Tues!
–"1/22: Paper critiques for the week should be submitted on Tuesday before class. E.g., summaries for the four papers discussed the week of Jan 30th (Jansche and Abney, Cohen et al, Freitag & Kushmerick, Borthwick et al) should be submitted before class Jan 30th. Submit your critiques by email (to vitor@cs, cc wcohen@cs). Each paper critique should be about 200-500 words (think half a page or a page) discussing one thing you liked and/or one thing you didn't like about the paper." [course web page]
–Please use "Subject: critique …"
–Next week: Borkar "Datamold" paper, Freitag et al MEMM paper, Ratnaparkhi MaxEnt tagger paper
Grading:
–Grades are based on project content, project write-up, class participation, and critiques
–You'll learn more if you read and think in advance of the lectures

4 Administrative issues
Some scary math:
–23 sessions left (after this week)
–28 people enrolled → 28 student "optional" paper presentations
My plan: the next 14-15 sessions will each have
–two 20-min student presentations
–discussion/questions
–50-60 min of lecture
Vitor and I will construct a signup mechanism (Google spreadsheet?), email you the details, and post them on the web page
–Procrastinate intelligently
–If you don't get the email, contact Vitor (we have Andrew emails from the roster)
What you can present
–Any "optional" paper from the syllabus
–Anything else you want that's related (check with William)
When you can present
–Any time after the topic has been covered in lecture
–Preferably soon after

5 Projects
Next Tuesday 2/6: everyone submits an abstract (email to William, cc Vitor, and hardcopy)
–One page, covering some subset of:
What you plan to do
Why you think it's interesting
Any relevant superpowers you might have
How you plan to evaluate
What techniques you plan to use
What question you want to answer
Who you might work with
–These will be posted on the class web site
Following Tuesday 2/13:
–A similar abstract from each team
A team is (preferably) 2-3 people, but I'm flexible
Main new information: who's on what team

6 Previous projects
"Extracting person names from email, augmented by name co-reference information" – extension in EMNLP 2005 with 2/3 of team members
"Learning to Extract Signature and Reply Lines from Email" – 1 person; extension appeared in CEAS 2004
"Evaluating CRF and MaxEnt Methods with Language-Model Derived Features for Protein Name Extraction" – 2 students; one followed up with published work in this area
"Filtering Web Searching Results on People" – 2 students
–a paper with the same idea & comparable results from different people has since appeared in ??
"Semantic Role Labeling with Conditional Random Fields" – 2 students
–one left CMU and went to the Univ of Rochester
BiDirectional MEMM-like model (evaluated on secondary structure prediction, compared to MaxEnt, CRFs) – 2 people
"Comparing SVM, HMM, MaxEnt and CRF in Video Segmentation" – 2 students – negative results but a good grade

7 Some other ideas… Exploiting burstiness in named entities on the web.

8

9 Notice that we get something useful from just identifying the person names and then doing some counting and trending

10 Some other ideas… Exploiting burstiness in named entities on the web – I have some resources to help

11 Some other ideas
Semi-supervised relation extraction for biology
–Large subdomains where entity extraction is easy (e.g., extraction of yeast proteins) but relation extraction is hard (e.g., determining if "A inhibits B" or if "A binds to B")
–This gives you unlabeled examples of interactions
–There are also small amounts of labeled interaction data
–There are also large amounts of "weakly labeled" data: for many proteins we know (e.g.) how A and B actually do interact

12 Other project ideas
Project: Semi-supervised learning of named entity extractors from the web. The goal of this project is to develop a cotraining-style approach to learning named entity extractors using the web plus a handful of seed examples (e.g., to train an extractor for the class "university" we might provide names of half a dozen universities). The key to cotraining is to identify multiple redundant views of the data, and corresponding classifiers for each view, that can be used to cotrain each other. For example, three possible views for classifying named entities are (1) HTML web page structure, such as "if X appears as an item in an HTML list, and the other items in the list are person names, then X is probably a person name"; (2) features of the entity tokens, such as "capitalized tokens that end with 'ski' are likely to be last names of people"; and (3) information in URLs, such as "URLs of the form 'www.X.edu' suggest that X is a university name or an abbreviation of one." The first step of this project will be to identify an interesting set of views such as the above, which complement those we are already developing. The second step will be to implement a cotraining method to learn the different types of classifiers jointly from mostly unlabeled data. Tom Mitchell has volunteered to work with the student team on this project. The project would fit into the larger "Read the Web" research project, for which Google has provided special permission to make a large number of queries to their web search engine. Please contact tom.mitchell@cmu.edu if you would like to discuss this.

13 Other project ideas
Project: Joint learning of named-entity extractors and relation extractors. The goal of this project is to develop more accurate learning methods for training named-entity extractors and relation extractors, by coupling these two learning problems into a single joint problem. Consider, for example, learning named entity extractors for 'person(x)' and for 'university(x)' and also learning relation extractors for the relations advisor_of(x,y) and affiliated_with(a,b). One could treat these as independent learning tasks. However, they can be coupled because the relations are typed: for advisor_of(x,y) to be true, it must also be true that person(x) and person(y). Let's design a learning method that learns these jointly, resulting in better learning from less total labeled data. A possible extension to this is to consider methods that automatically find useful subclasses of the named entities, where "useful" means that learning this subclass helps with relation extraction. For example, a potentially useful subclass of 'person' is 'student,' because this is a more useful class for y in extracting instances of the advisor_of(x,y) relation. Tom Mitchell has volunteered to work with the team on this project. The project would fit into the larger "Read the Web" research project, for which Google has provided special permission to make a large number of queries to their web search engine. Please contact tom.mitchell@cmu.edu if you would like to discuss this.

14 Outline Deadlines and projects and such What wrapper learning is about BWI and some related work MUC7 and MENE

15 Managing information from many places
Facts (combined via inference): advisor(wc,vc), advisor(yh,tm), affil(wc,mld), affil(vc,lti), fn(wc,"William"), fn(vc,"Vitor"), first(vc,"Vitor"), from(vc,lti)
Query: X: advisor(wc,Y) & affil(X,lti)? Answer: {X=em; X=vc}
Data integration problems:
1. Heterogeneous databases have different schemas (which the user or integrator must understand)
2. Heterogeneous databases have different object identifiers.

16 Wrappers

17 Goal: learn from a human teacher how to extract certain database records from a particular web site.

18 Learner

19

20

21 Why learning from few examples is important At training time, only four examples are available—but one would like to generalize to future pages as well… Must generalize across time as well as across a single site

22 Some history

23 Kushmerick's WIEN system
The earliest wrapper-learning system (published at IJCAI '97)
Special things about WIEN:
–Treats the document as a string of characters
–Learns to extract a relation directly, rather than extracting fields and then associating them together in some way
–Each example is a completely labeled page

24 WIEN system: a sample wrapper

25 Left delimiters L1=“ ”, L2=“ ”; Right R1=“ ”, R2=“ ”

26 WIEN system: a sample wrapper
Learning means finding L1,…,Lk and R1,…,Rk
–Li must precede every instance of field i
–Ri must follow every instance of field i
–Li, Ri can't contain data items
–There is a limited number of possible candidates for Li, Ri
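A minimal sketch of how an LR wrapper with learned delimiters L1,…,Lk and R1,…,Rk extracts a k-field relation from a page treated as a string of characters. The page and delimiter strings below are invented for illustration:

```python
def lr_extract(page, lefts, rights):
    """Extract tuples with an LR wrapper: for each of the k fields in
    turn, scan for its left delimiter, read up to its right delimiter,
    and keep cycling through the fields until no match remains."""
    tuples, pos = [], 0
    while True:
        row = []
        for L, R in zip(lefts, rights):
            start = page.find(L, pos)
            if start < 0:
                return tuples
            start += len(L)
            end = page.find(R, start)
            if end < 0:
                return tuples
            row.append(page[start:end])
            pos = end + len(R)
        tuples.append(tuple(row))

# Hypothetical page: names in bold, phone numbers in italics.
page = "<b>Ann</b> <i>555-1234</i> <b>Bob</b> <i>555-9876</i>"
print(lr_extract(page, ["<b>", "<i>"], ["</b>", "</i>"]))
```

This also shows why the candidate delimiters are limited: each Li/Ri must work for every row of the relation on the training page.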

27 WIEN system: a more complex class of wrappers (HLRT) Extension: use the Li, Ri delimiters only after a "head" (after the first occurrence of H) and before a "tail" (the first occurrence of T) H = " ", T = " "

28 Kushmerick: overview of various extensions to LR

29 Wrapper Induction History
Kushmerick '97
STALKER '98-'01
–Commercialization as FETCH.com ~2000
–Rivals: Connotate, etc.
Semisupervised tree-structure learning (2002, 2004, 2005)

30 Kushmerick and Freitag: Boosted wrapper induction

31 Review of boosting Generalized version of AdaBoost (Schapire & Singer, 1999) Allows "real-valued" predictions for each "base hypothesis", including a value of zero.
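The generalized AdaBoost loop can be sketched in a few lines: weights are rescaled by exp(-y·h(x)) and the final score is the sum of the real-valued weak predictions. The toy 1-D stump weak learner below is an illustration of the interface, not BWI's actual weak learner:

```python
import math

def boost(examples, labels, weak_learn, rounds=10):
    """Generalized AdaBoost (confidence-rated predictions): weak
    hypotheses return real values (possibly zero); example weights are
    scaled by exp(-y * h(x)) and renormalized each round, and the
    final score is the sum of the weak predictions."""
    w = [1.0 / len(examples)] * len(examples)
    hyps = []
    for _ in range(rounds):
        h = weak_learn(examples, labels, w)     # callable x -> real value
        hyps.append(h)
        w = [wi * math.exp(-y * h(x)) for wi, x, y in zip(w, examples, labels)]
        z = sum(w)
        w = [wi / z for wi in w]                # renormalize to a distribution
    return lambda x: sum(h(x) for h in hyps)

def weak_learn(xs, ys, w):
    """Toy weak learner: best weighted threshold stump on 1-D data."""
    best = min(
        ((sum(wi for wi, x, y in zip(w, xs, ys)
              if (s if x >= t else -s) != y), t, s)
         for t in set(xs) for s in (1, -1)),
        key=lambda e: e[0])
    err, t, s = best
    a = 0.5 * math.log((1 - err + 1e-9) / (err + 1e-9))
    return lambda x, t=t, s=s, a=a: a * (s if x >= t else -s)

xs, ys = [1, 2, 3, 4], [-1, -1, 1, 1]
f = boost(xs, ys, weak_learn, rounds=3)
print([1 if f(x) > 0 else -1 for x in xs])   # recovers the labels
```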

32 Learning methods: boosting rules
Weak learner: to find weak hypothesis h_t:
1. Split the data into Growing and Pruning sets
2. Let R_t be an empty conjunction
3. Greedily add conditions to R_t, guided by the Growing set
4. Greedily remove conditions from R_t, guided by the Pruning set
5. Convert R_t to a weak hypothesis with confidence c_t = 0.5 * ln((W+ + ε) / (W- + ε)), where W+ and W- are the total weights of the positive and negative examples R_t covers
Constraint: W+ > W-; ε is a smoothing term
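Step 5 assigns the rule a confidence of half the smoothed log-odds of the weighted examples it covers; a minimal sketch (the default ε here is an arbitrary choice, not SLIPPER's exact smoothing term):

```python
import math

def rule_confidence(w_pos, w_neg, eps=0.005):
    """Real-valued prediction for a SLIPPER-style rule: half the
    log-odds of the total weight of positive vs. negative examples
    the rule covers, smoothed by eps so empty counts don't blow up.
    A useful positive rule satisfies W+ > W-, giving confidence > 0."""
    return 0.5 * math.log((w_pos + eps) / (w_neg + eps))

# A rule covering mostly positive weight gets a positive confidence;
# a balanced rule is uninformative (confidence 0).
print(rule_confidence(0.9, 0.1), rule_confidence(0.5, 0.5))
```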

33 Learning methods: boosting rules SLIPPER also produces fairly compact rule sets.

34 Learning methods: BWI
Boosted wrapper induction (BWI) learns to extract substrings from a document.
–Learns three concepts: firstToken(x), lastToken(x), and substringLength(k)
–Conditions are tests on tokens before/after x, e.g., tok_{i-2}='from', isNumber(tok_{i+1})
–SLIPPER weak learner, no pruning
–Greedy search extends the "window size" by at most L in each iteration, uses lookahead L, and has no fixed limit on window size
Good results in (Kushmerick & Freitag, 2000)
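A sketch of how the three learned concepts combine at extraction time: a "fore" score for a field start, an "aft" score for a field end, and a length histogram, multiplied and thresholded per candidate span. The toy detectors and threshold below are stand-ins for the boosted ensembles BWI actually learns:

```python
def bwi_extract(tokens, fore, aft, length_prob, tau=0.5):
    """BWI-style extraction sketch: fore(tokens, i) scores position i
    as a likely field start, aft(tokens, j) scores position j as a
    likely field end, and length_prob is a histogram of field lengths
    seen in training. A span (i, j) is extracted when the product of
    the three scores clears the threshold tau."""
    spans = []
    n = len(tokens)
    for i in range(n):
        for j in range(i + 1, n + 1):
            score = fore(tokens, i) * aft(tokens, j) * length_prob.get(j - i, 0.0)
            if score > tau:
                spans.append((i, j, tokens[i:j]))
    return spans

# Toy detectors: a time field is preceded by 'at' and followed by 'in'.
tokens = "the talk is at 3 pm in Wean".split()
fore = lambda t, i: 1.0 if i > 0 and t[i - 1] == "at" else 0.0
aft = lambda t, j: 1.0 if j < len(t) and t[j] == "in" else 0.0
print(bwi_extract(tokens, fore, aft, {2: 0.9}))
```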

35 BWI algorithm

36 Lookahead search here

37 BWI example rules

38

39 BWI and SWI Kauchak, Smarr & Elkan, 2004: "Sources of Success for Boosted Wrapper Induction", Journal of Machine Learning Research, 5, 499-527. –SWI is a "hard" version of BWI, using set-covering instead of boosting

40

41 Datasets: websites from DBs; seminars, job ads; bio-entities in Medline abstracts. Annotations: no negative examples covered; BWI rule weights; same #rounds as SWI; secondary regularities?

42 = (#rules)/(#posExamples)

43 Cohen et al

44 Improving A Page Classifier with Anchor Extraction and Link Analysis William W. Cohen NIPS 2002

45 Previous work in page classification using links: exploit hyperlinks (Slattery & Mitchell 2000; Cohn & Hofmann 2001; Joachims 2001): documents pointed to by the same "hub" should have the same class.
What's new in this paper:
–Use the structure of hub pages (as well as the structure of the site graph) to find better "hubs"
–Adapt an existing "wrapper learning" system to find structure, on the task of classifying "executive bio pages"

46 Intuition: links from this “hub page” are informative… …especially these links

47 Task: train a page classifier, then use it to classify pages on a new, previously-unseen web site as executiveBio or other
Question: can index pages for executive biographies be used to improve classification?
Idea: use the wrapper-learner to learn to extract links to execBio pages, smoothing the "noisy" data produced by the initial page classifier.

48 Background: "co-training" (Blum & Mitchell, '98)
Suppose examples are of the form (x1, x2, y), where x1 and x2 are independent (given y), each xi is sufficient for classification, and unlabeled examples are cheap.
–(E.g., x1 = bag of words, x2 = bag of links)
Co-training algorithm:
1. Use the x1's (on labeled data D) to train f1(x)=y
2. Use f1 to label additional unlabeled examples U
3. Use the x2's (on the labeled part of U+D) to train f2(x)=y
4. Repeat...

49 Simple 1-step co-training for web pages
f1 is a bag-of-words page classifier, and S is a web site containing unlabeled pages.
Feature construction: represent a page x in S as the bag of pages that link to x (a "bag of hubs").
Learning: learn f2 from the bag-of-hubs examples, labeled with f1.
Labeling: use f2(x) to label pages from S.
Idea: use one round of co-training to bootstrap the bag-of-words classifier to one that uses site-specific features x2/f2.
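One round of this can be sketched as follows. For simplicity the "f2" learned from bag-of-hubs features is a majority vote over the hubs a page shares; the classifier f1 and the data structures are schematic stand-ins, not the paper's implementation:

```python
from collections import Counter, defaultdict

def one_step_cotrain(site_pages, hub_links, f1):
    """One-step co-training sketch: label the unlabeled site pages
    with the bag-of-words classifier f1, then relabel each page by
    the majority f1-label among pages sharing a hub with it (a toy
    stand-in for training f2 on bag-of-hubs features)."""
    votes = defaultdict(Counter)                  # hub -> f1-label counts
    for p in site_pages:
        for h in hub_links.get(p, ()):
            votes[h][f1(p)] += 1
    def f2(p):
        c = Counter()
        for h in hub_links.get(p, ()):
            c += votes[h]
        return c.most_common(1)[0][0] if c else f1(p)
    return {p: f2(p) for p in site_pages}

# Toy site: hub H links to p1, p2, p3; the noisy f1 mislabels p3,
# and the hub-based vote smooths that label over.
pages = ["p1", "p2", "p3"]
links = {"p1": {"H"}, "p2": {"H"}, "p3": {"H"}}
f1 = lambda p: "bio" if p in ("p1", "p2") else "other"
print(one_step_cotrain(pages, links, f1))
```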

50 Improved 1-step co-training for web pages
Feature construction:
–Label an anchor a in S as positive iff it points to a positive page x (according to f1). Let D = {(x',a): a is a positive anchor on x'}.
–Generate many small training sets Di from D by sliding small windows over D.
–Let P be the set of all "structures" found by any builder from any subset Di.
–Say that p links to x if p extracts an anchor that points to x. Represent a page x as the bag of structures in P that link to x.
Learning and labeling: as before.

51 builder extractor List1

52 builder extractor List2

53 builder extractor List3

54 BOH representation: { List1, List3,…}, PR { List1, List2, List3,…}, PR { List2, List 3,…}, Other { List2, List3,…}, PR … Learner

55 Experimental results Co-training hurts No improvement

56 Experimental results

57 Summary
–"Builders" (from a wrapper learning system) let one discover and use the structure of web sites and index pages to smooth page-classification results.
–Discovering good "hub structures" makes it possible to use 1-step co-training on small (50-200 example) unlabeled datasets.
–The average error rate was reduced from 8.4% to 3.6%.
–The difference is statistically significant with a 2-tailed paired sign test or t-test.
–EM with probabilistic learners also works; see (Blei et al., UAI 2002)

58 MUC-7
The last Message Understanding Conference (forerunner to ACE), circa 1998
–200 articles in the development set (aircraft accidents)
–200 articles in the final test (launch events)
–Names of persons, organizations, locations, dates, times, currency & percentages

59

60 Systems compared: LTG, IdentiFinder (HMMs), MENE+Proteus, Manitoba (NB-filtered names), NetOwl (commercial RBS)

61 Borthwick et al: the MENE system
Much like MXPost, with some tricks for NER:
–4 tags per field: x_start, x_continue, x_end, x_unique
–Features:
Section features
Tokens in a window
Lexical features of tokens in the window
Dictionary features of tokens (is the token a firstName?)
External-system features of tokens (is this a NetOwl_company_start? a proteus_person_unique?)
–Smooths by discarding low-count features
–No history: a Viterbi search is used to find the best consistent tag sequence (e.g., no continue without a start)
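The 4-tags-per-field scheme implies hard constraints on which tags may follow which; a minimal sketch of the consistency check a Viterbi-style decoder would enforce (the tag-naming convention here is illustrative, not MENE's exact identifiers):

```python
def valid_transition(prev, cur):
    """MENE-style tag-consistency check. Each field x uses four tags
    (x_start, x_continue, x_end, x_unique) plus an 'other' tag; a
    continue or end tag is only legal immediately after a start or
    continue tag of the same field. A decoder searches for the best
    tag sequence whose every adjacent pair passes this check."""
    def split(tag):
        if tag == "other":
            return None, None
        field, _, part = tag.rpartition("_")
        return field, part

    pf, pp = split(prev)
    cf, cp = split(cur)
    if cp in ("continue", "end"):
        return pf == cf and pp in ("start", "continue")
    return True  # start, unique, and 'other' may follow anything

# A decoder would prune any path containing, e.g.,
# ("other", "person_continue"), since this check fails there.
```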

62 Dictionaries in MENE

63 MENE results (dry run)

64 MENE learning curves 96.3 93.3 92.2

65 Largest U.S. Cable Operator Makes Bid for Walt Disney By ANDREW ROSS SORKIN The Comcast Corporation, the largest cable television operator in the United States, made a $54.1 billion unsolicited takeover bid today for The Walt Disney Company, the storied family entertainment colossus. If successful, Comcast's audacious bid would once again reshape the entertainment landscape, creating a new media behemoth that would combine the power of Comcast's powerful distribution channels to some 21 million subscribers in the nation with Disney's vast library of content and production assets. Those include its ABC television network, ESPN and other cable networks, and the Disney and Miramax movie studios. Short names Longer names

66 LTG system
Another MUC-7 competitor
Handcoded rules for "easy" cases (amounts, etc.)
A process of repeated tagging and "matching" for hard cases:
–Sure-fire (high-precision) rules for names where the type is clear ("Philip Morris, Inc." – "The Walt Disney Company")
–Partial matches to sure-fire rules are filtered with a maxent classifier (candidate filtering) using contextual information, etc.
–Higher-recall rules, avoiding conflicts with partial-match output ("Philip Morris announced today…" – "Disney's …")
–A final partial-match & filter step on titles, with a different learned filter
Exploits discourse/context information

67 LTG Results

68 Systems compared: LTG, IdentiFinder (HMMs), MENE+Proteus, Manitoba (NB-filtered names), NetOwl (commercial RBS)

