IE by Candidate Classification: Jansche & Abney, Cohen et al William Cohen 1/19/03.


1 IE by Candidate Classification: Jansche & Abney, Cohen et al William Cohen 1/19/03

2 SCAN: Search & Summarization for Audio Collections (AT&T Labs)

3

4 Why IE from personal voicemail Unified interface for email, voicemail, fax, … requires uniform headers: –Sender, Time, Subject, … –Headers are key for uniform interface Independently, voicemail access is slow: –useful to have fast access to important parts of message (contact number, caller)

5 Why else to read this paper
Robust information extraction
–Generalizing from manual transcripts (i.e., human-produced written version of voicemail) to automatic (ASR) transcripts
Place of hand-coding vs learning in information extraction
–How to break up task
–Where and how to use engineering
Diagram: Candidate Generator → Candidate phrase → Learned filter → Extracted phrase

6 Voicemail corpus About 10,000 manually transcribed and annotated voice messages. 1869 used for evaluation

7 Observation: caller phrases are short and near the beginning of the message.

8 Caller-phrase extraction Propose start positions i1,…,iN Use a learned decision tree to pick the best i Propose end positions i+j1,i+j2,…,i+jM Use a learned decision tree to pick the best j
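The two-stage selection on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the scoring functions toy_score_start and toy_score_end are hypothetical stand-ins for the two learned decision trees, and the cue words they use are assumptions made only for the example.

```python
def extract_caller_phrase(tokens, starts, lengths, score_start, score_end):
    """Pick the best start position i, then the best length j for that start.

    Returns a (start, end) token span, or None if nothing can be proposed.
    """
    starts = [i for i in starts if 0 <= i < len(tokens)]
    if not starts:
        return None
    # Stage 1: choose the best proposed start position.
    i = max(starts, key=lambda s: score_start(tokens, s))
    # Stage 2: choose the best proposed length for that start.
    valid = [j for j in lengths if j >= 1 and i + j <= len(tokens)]
    if not valid:
        return None
    j = max(valid, key=lambda n: score_end(tokens, i, n))
    return i, i + j


def toy_score_start(tokens, i):
    # Hypothetical scorer: favor positions right after "it's"/"is", and early ones.
    bonus = 1.0 if i > 0 and tokens[i - 1].lower() in {"it's", "is"} else 0.0
    return bonus - 0.01 * i


def toy_score_end(tokens, i, j):
    # Hypothetical scorer: favor phrases that end just before a cue word.
    nxt = tokens[i + j].lower() if i + j < len(tokens) else ""
    return (1.0 if nxt in {"and", "calling", "from"} else 0.0) - 0.01 * j


tokens = "hi there it's Bill and I wanted to ask".split()
span = extract_caller_phrase(tokens, range(0, 6), range(1, 4),
                             toy_score_start, toy_score_end)
caller = tokens[span[0]:span[1]]
```

Restricting the proposed starts to the first few tokens reflects the observation that caller phrases are short and near the beginning of the message.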

9 Baseline (HZP, Col log-linear)
IE as tagging: Pr(tag_i | word_i, word_i-1, …, word_i+1, …, tag_i-1, …) estimated via MAXENT model
Beam search to find best tag sequence given word sequence
Features of model are words, word pairs, word pair + tag trigrams, …
Words: Hi there it’s Bill and …
Tags: OUT IN OUT …
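A rough sketch of tagging with beam search, under loud assumptions: toy_log_prob is a hypothetical stand-in for the MAXENT model (it looks only at the previous word), and the tag set is reduced to IN/OUT as on the slide.

```python
import math

TAGS = ("OUT", "IN")


def beam_search(words, log_prob, beam_width=3):
    """Return the best tag sequence found by beam search.

    log_prob(words, i, prev_tag, tag) should return log Pr(tag_i | context).
    """
    beam = [(0.0, [])]  # (cumulative log-probability, tags so far)
    for i in range(len(words)):
        expanded = []
        for score, tags in beam:
            prev = tags[-1] if tags else "START"
            for tag in TAGS:
                expanded.append((score + log_prob(words, i, prev, tag),
                                 tags + [tag]))
        # Keep only the beam_width highest-scoring partial sequences.
        expanded.sort(key=lambda item: item[0], reverse=True)
        beam = expanded[:beam_width]
    return beam[0][1]


def toy_log_prob(words, i, prev_tag, tag):
    # Hypothetical model: a word following "it's" is probably inside a caller phrase.
    prev_word = words[i - 1].lower() if i > 0 else ""
    p_in = 0.9 if prev_word == "it's" else 0.05
    return math.log(p_in if tag == "IN" else 1.0 - p_in)


words = "hi there it's Bill and".split()
tags = beam_search(words, toy_log_prob)
```

A real model would also condition on previous tags and on word pairs, which is where the beam matters: a locally good tag can be abandoned when a later word makes the whole sequence unlikely.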

10 Performance

11 Observation: caller names are really short and near the beginning of the message.

12 What about ASR transcripts?

13 Extracting phone numbers
Phase 1: hand-coded grammar proposes candidate phone numbers
–Not too hard, due to limited vocabulary
–Optimize recall (96%) not precision (30%)
Phase 2: a learned decision tree filters candidates
–Use length, position, context, …
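The two phases might look like the sketch below. Everything specific here is an assumption for illustration: the "grammar" is reduced to maximal runs of spoken digit words, and toy_filter is a hand-written rule standing in for the learned decision tree.

```python
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}


def propose_candidates(tokens, min_len=3):
    """Phase 1 (recall-oriented): every maximal run of digit words."""
    candidates, start = [], None
    for i, tok in enumerate(list(tokens) + [""]):  # "" is an end sentinel
        if tok.lower() in DIGIT_WORDS:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                candidates.append((start, i))
            start = None
    return candidates


def features(tokens, span):
    """Length, position and context features of a candidate span."""
    start, end = span
    return {
        "length": end - start,
        "position": start / len(tokens),
        "prev_word": tokens[start - 1].lower() if start > 0 else "",
    }


def toy_filter(feats):
    # Phase 2 stand-in: a learned decision tree would replace this rule.
    return feats["length"] in (7, 10)


def to_digits(tokens, span):
    return "".join(DIGIT_WORDS[t.lower()] for t in tokens[span[0]:span[1]])


tokens = ("call me at five five five one two one two "
          "before three four five today").split()
candidates = propose_candidates(tokens)
numbers = [to_digits(tokens, c) for c in candidates
           if toy_filter(features(tokens, c))]
```

The generator deliberately over-proposes (here it also returns the run "three four five"), leaving precision to the filter.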

14 Results

15 Their Conclusions

16 Cohen, Wang, Murphy
Another paper with a similar flavor:
–IE for a particular task
–IE using similar propose-and-filter approach
–When and how do you engineer, and when and how do you use learning?

17 Background – subcellular localization The most important tool for studying protein localizations is fluorescence microscopy. New image processing techniques can automatically produce a quantitative description of subcellular localization.

18 Background – subcellular localization Two golgi proteins that cannot be distinguished by eye

19 Background – subcellular localization
Entrez: “a new 376kD Golgi complex outer membrane protein”
SWISSProt: “INTEGRAL MEMBRANE PROTEIN. GOLGI MEMBRANE”
Entrez: “GPP130; type II Golgi membrane protein”
SWISSProt: nothing

20 Background – subcellular localization Some other interesting facts: –Primary structure is poor indicator of localization –Many possible localizations with image analysis –Tens of thousands of images in open literature

21 Overview of SLIF: image analysis of existing images from online publications
Diagram components: On-line paper, Figure finder, Figure, Image, Panel Splitter, Panel Classifier, Scale Finder, Fl. Micr. Panel, Micr. Scale

22 Overview of SLIF: image analysis of existing images from online publications End result: collection of on-line fluorescence microscope images, with quantitative description of localization. E.g.: we know this figure section shows a tubulin-like protein… …but not which one!

23 Background – overview of SLIF1 Segment into “panels” Detect & remove annotations Classify panels FMI+

24 Background – overview of SLIF1
Segment FMI panels into individual cells.
Find scalebar and scale measurement
Figure 1. (A) Single confocal 0-GFP fusion. ……… Bars, 5 μm. Movement of Coiled Bodies Vol. 10, July 1999…
Rescale image of each cell, adjust contrast, and compute subcellular localization features as if it were an ordinary microscope image.
Of course, you still don’t know what it’s an image of…

25 Background – overview of SLIF2.0
Diagram components: Caption, Image Pointer Finder, Scope Finder, Name Finder, Panel Label Matcher, Image, Panel Splitter, Panel Classifier, Scale Finder, Fl. Micr. Panel, Micr. Scale, Cell Type, Protein Name

26 Background – overview of SLIF2.0
Figure 1. (A) Single confocal optical section of BY-2 cells expressing U2B 0-GFP, double labeled with GFP (left panel) and autoantibody against p80 coilin (right panel). Three nuclei are shown, and the bright GFP spots colocalize with bright foci of anti-coilin labeling. There is some labeling of the cytoplasm by anti-p80 coilin. (B) Single confocal optical section of BY-2 cells expressing U2B 0-GFP, double labeled with GFP (left panel) and 4G3 antibody (right panel). Three nuclei are shown. Most coiled bodies are in the nucleoplasm, but occasionally are seen in the nucleolus (arrows). All coiled bodies that contain U2B 0 also express the U2B 0-GFP fusion. Bars, 5 μm. Movement of Coiled Bodies Vol. 10, July 1999 2299
An old issue: entity recognition (BY-2, U2B 0-GFP, p80-coilin, anti-p80 coilin)
A new issue: “caption understanding” - where are the entities in the image?

27 Why caption understanding? (Figure 1 caption as on slide 26.)
- Location proteomics.
- Remove extraneous junk from caption text for “ordinary” IE, NLP, indexing, …
- Better text- or content-based image retrieval for scientific images.

28 (Figure 1 caption as on slide 26.)
Identify image pointers: Substrings that refer to parts of the image
Will focus on text issues, not matching

29 (Figure 1 caption as on slide 26.)
Identify image pointers: Substrings that refer to parts of the image
Classify image pointers as citation-style or bullet-style.

30 (Figure 1 caption as on slide 26.)
Classify image pointers as citation-style or bullet-style.
Compute scopes:
- The scope of a bullet-style image pointer is all words between it and the next “bullet”
Diagram labels: scope of (A), scope of (B)

31 (Figure 1 caption as on slide 26.)
Compute scopes:
- The scope of a bullet-style image pointer is all words after it, but before next “bullet”
- The scope of a citation-style image pointer is some set of words nearby it (heuristically determined by separating words and punctuation)
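The bullet-style scoping rule can be sketched in a few lines. This is only an illustration under assumptions: every parenthesized single capital letter is treated as a bullet-style pointer (the classification step is assumed to have happened already), the caption is abridged from the Figure 1 example, and citation-style scoping is not attempted.

```python
import re

# Assumed pattern for bullet-style pointers such as "(A)" or "(B)".
BULLET = re.compile(r"\(([A-Z])\)")


def bullet_scopes(caption):
    """Map each bullet label to the text after it, up to the next bullet."""
    matches = list(BULLET.finditer(caption))
    scopes = {}
    for k, m in enumerate(matches):
        # The scope ends where the next bullet starts, or at the caption end.
        end = matches[k + 1].start() if k + 1 < len(matches) else len(caption)
        scopes[m.group(1)] = caption[m.end():end].strip()
    return scopes


caption = ("Figure 1. (A) Single confocal optical section of BY-2 cells. "
           "Three nuclei are shown. (B) Most coiled bodies are in the "
           "nucleoplasm.")
scopes = bullet_scopes(caption)
```

Entities found inside scopes["A"] would then be associated with panel A, and likewise for B.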

32 (Figure 1 caption as on slide 26.)
Image pointers share all entities in their “scope”.
Entities are assigned to panels based on matches of image pointers to annotations in panels.

33 Outline Details on caption understanding –Baseline hand-coded methods –Learning methods –Experimental results

34 Task Identify image pointers in captions. Classify image pointers: –bullet-style, citation-style, or NP-style E.g., “Panels A and C show the …” Won’t talk about scoping Will focus first on extracting image pointers—i.e., binary classification of substrings “is this an image pointer” Data: 100 captions from 100 papers—about 600 positive examples.

35 Baseline methods
Labeled 100 sample figure captions.
HANDCODE-1: patterns like (A), (B-E), (c and d), etc.
HANDCODE-2: all short parenthesized expressions & patterns like “panel A” or “in B-C”

            HC-1   HC-2
Precision   98.5   74.5
Recall      45.6   98.0
F1          62.3   84.6

Some plausible tricks (like filtering HC-2) don’t help much…

            HC-1   HC-2f  HC-2
Precis.     98.5   89.0   74.5
Recall      45.6   54.8   98.0
F1          62.3   67.8   84.6
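A HANDCODE-1 style matcher and the precision/recall/F1 computation behind the tables could look roughly like this. The regular expression is a guess at "patterns like (A), (B-E), (c and d)", not the pattern actually used, and the caption and gold set are invented for the example.

```python
import re

# Assumed approximation of the HANDCODE-1 patterns: (A), (B-E), (c and d).
HANDCODE1 = re.compile(
    r"\(\s*[A-Za-z](?:\s*(?:-|,|and)\s*[A-Za-z])*\s*\)"
)


def handcode1(caption):
    """Return the set of substrings matched by the hand-coded pattern."""
    return {m.group(0) for m in HANDCODE1.finditer(caption)}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def evaluate(predicted, gold):
    """Precision, recall and F1 of a predicted set against a gold set."""
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    return p, r, f1(p, r)


caption = ("(A) Cells treated with mitotic extract (ME). "
           "(B-E) Panels C and D show controls (c and d).")
predicted = handcode1(caption)
# Hypothetical gold labels for this invented caption.
gold = {"(A)", "(B-E)", "(c and d)", "Panels C and D"}
precision, recall, score = evaluate(predicted, gold)
```

The pattern skips "(ME)" but also misses the NP-style pointer "Panels C and D", which is the high-precision, low-recall behavior the HC-1 column shows. As a check on the table, f1(98.5, 45.6) rounds to 62.3.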

36 How hard is the problem? Some citation-style image pointers

37 How hard is the problem? NP-style non-image pointers The difficulty of the task suggests using a learning approach

38 Another use of propose-and-filter
Diagram: Candidate Generator → Candidate phrase → Learned filter → Extracted phrase
Note that Hand-Code2 (recall 98%) is a natural candidate generator.
We’ll start with “off the shelf” features…
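Using a HANDCODE-2 style generator in front of a filter might be sketched as below. The patterns, the 15-character cutoff for "short", and the toy_keep rule (a stand-in for the learned filter) are all assumptions made for illustration.

```python
import re

# Assumed reading of "all short parenthesized expressions".
SHORT_PAREN = re.compile(r"\([^()]{1,15}\)")
# Assumed reading of patterns like "panel A" or "in B-C".
PANEL_REF = re.compile(
    r"\b(?:[Pp]anels?|in)\s+[A-Z](?:\s*(?:-|,|and)\s*[A-Z])*\b"
)


def propose(caption):
    """Candidate generator: recall-oriented, in order of appearance."""
    spans = {(m.start(), m.end())
             for pattern in (SHORT_PAREN, PANEL_REF)
             for m in pattern.finditer(caption)}
    return [caption[s:e] for s, e in sorted(spans)]


def toy_keep(candidate):
    # Stand-in for the learned filter: drop all-caps abbreviations like "(ME)".
    return re.fullmatch(r"\([A-Z]{2,}\)", candidate) is None


def propose_and_filter(caption, keep=toy_keep):
    return [c for c in propose(caption) if keep(c)]


caption = ("(A) NRK cells incubated with mitotic extract (ME). "
           "Panels B and C show controls.")
candidates = propose(caption)
extracted = propose_and_filter(caption)
```

Because the generator already reaches high recall, the learner only has to solve a binary accept/reject problem over a small candidate set.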

39 Learning methods: features
Start with: named sets of labeled substrings
–Image pointers and tokens (not marked)
Fig. 1. Kinase inactive Plk inhibits Golgi fragmentation by mitotic cytosol. (A) NRK cells were grown on coverslips and treated with 2 mM thymidine for 8 to 14 h. Cells were subsequently permeabilized with digitonin, washed with 1M KCl-containing buffer, and incubated with either 7 mg/ml interphase cytosol (IE), 7 mg/ml mitotic extract (ME), or mitotic extract to which 20 mg/ml kinase inactive Plk (ME + Plk-KD) was added. After a 60-min incubation at 32°C, cells were fixed and stained with anti-mannosidase II antibody to visualize the Golgi apparatus by fluorescence microscopy. (B) Percentage of cells with fragmented Golgi after incubation with mitotic extract (ME) in the absence or the presence of kinase inactive Plk (ME + Plk-KD). The histogram represents the average of four independent experiments.

40 Learning methods: features
Start with: named sets of labeled substrings
–Image pointers (label=y/n) and tokens (label=token)
–Substrings act as examples and features
To create features: use a “little language”: emit( token, before, -1, label ), emit( token, before, -2, label ), …
Fig. 1. Kinase inactive Plk inhibits Golgi fragmentation by mitotic cytosol. (A) NRK cells were grown on coverslips and treated with … either 7 mg/ml interphase cytosol (IE), 7 mg/ml mitotic extract (ME), or mitotic extract to which 20 mg/ml kinase inactive Plk (ME + Plk-KD) was added.

41 Learning methods: features
emit( token, before, -1, label ), emit( token, before, -2, label ), …
Fig. 1. Kinase inactive Plk inhibits Golgi fragmentation by mitotic cytosol. (A) NRK cells were grown on coverslips and treated with … either 7 mg/ml interphase cytosol (IE), 7 mg/ml mitotic extract (ME), or mitotic extract to which 20 mg/ml kinase inactive Plk (ME + Plk-KD) was added.
Arguments of emit, in order: kind of substring to look for; direction to go; distance to go; what to emit (substring label, distance in chars to substring, …)
Example: start, go 2 tokens back, emit “inactive”
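One way to read the emit( token, before, -2, label ) example is sketched below. The function signature, the whitespace tokenization and the tuple-shaped features are assumptions; only the behavior "start at the candidate, go 2 tokens back, emit the token found there" comes from the slide.

```python
def emit(tokens, span, direction, offset):
    """Emit a feature for the token `offset` steps before/after a candidate.

    `span` is the (start, end) token range of the candidate substring.
    Returns (direction, distance, token), or None if out of range.
    """
    start, end = span
    steps = abs(offset)  # the slide writes "before" offsets as -1, -2, ...
    if direction == "before":
        idx = start - steps
    elif direction == "after":
        idx = end - 1 + steps
    else:
        raise ValueError("direction must be 'before' or 'after'")
    if steps == 0 or not 0 <= idx < len(tokens):
        return None
    return (direction, steps, tokens[idx])


def window_features(tokens, span, width):
    """All token features within `width` tokens before/after the candidate."""
    feats = []
    for distance in range(1, width + 1):
        for direction in ("before", "after"):
            feature = emit(tokens, span, direction, distance)
            if feature is not None:
                feats.append(feature)
    return feats


tokens = ("mitotic extract to which 20 mg/ml kinase inactive Plk "
          "(ME + Plk-KD) was added.").split()
candidate = (9, 12)  # the tokens "(ME", "+", "Plk-KD)"
two_back = emit(tokens, candidate, "before", -2)
feats = window_features(tokens, candidate, 2)
```

With the candidate "(ME + Plk-KD)", going two tokens back reaches "inactive", matching the slide's example.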

42 Learning methods: boosting
Generalized version of AdaBoost (Singer & Schapire, 99)
Allows “real-valued” predictions for each “base hypothesis”, including value of zero.

43 Learning methods: boosting rules
Weak learner: to find weak hypothesis t:
1. Split Data into Growing and Pruning sets
2. Let R_t be an empty conjunction
3. Greedily add conditions to R_t guided by Growing set:
4. Greedily remove conditions from R_t guided by Pruning set:
5. Convert to weak hypothesis: where
Constraint: W+ > W- and caret is smoothing
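The formulas for this slide are not in the transcript. As an assumption based on the published description of SLIPPER (Cohen & Singer, 1999), and not a reconstruction of the slide itself, the weak hypothesis is usually written as follows, where W_+ and W_- are the total weights of positive and negative examples covered by R_t, n is the number of examples, and the hat marks the smoothed confidence:

```latex
% Assumed form, following the published SLIPPER description.
\[
h_t(x) =
\begin{cases}
\hat{C}_{R_t} & \text{if } x \text{ satisfies } R_t,\\
0 & \text{otherwise,}
\end{cases}
\qquad
\hat{C}_{R_t} = \frac{1}{2}\,
\ln\!\left(\frac{W_+ + \frac{1}{2n}}{W_- + \frac{1}{2n}}\right).
\]
% Growing step: conditions are added to maximize
\[
\tilde{Z} = \sqrt{W_+} - \sqrt{W_-}.
\]
```

Under this form the constraint W_+ > W_- makes the confidence positive, so each rule votes only for the positive class and abstains (value zero) on examples it does not cover.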

44 Learning methods: boosting rules SLIPPER also produces fairly compact rule sets.

45 Learning methods: BWI
Boosted wrapper induction (BWI) learns to extract substrings from a document.
–Learns three concepts: firstToken(x), lastToken(x), substringLength(k)
–Conditions are tests on tokens before/after x. E.g., tok_i-2 = ‘from’, isNumber(tok_i+1)
–SLIPPER weak learner, no pruning.
–Greedy search extends window size by at most L in each iteration, uses lookahead L, no fixed limit on window size.
Good results in (Kushmerick and Freitag, 2000)

46 Learning methods: ABWI
“Almost boosted wrapper induction” (ABWI) learns to extract substrings:
–Learns to filter candidate substrings (HandCode2)
–Conditions are the same tests on tokens near x. E.g., tok_i-2 = ‘from’, isNumber(tok_i+1)
–SLIPPER weak learner, no pruning.
–Greedy search extends window size any amount, uses no lookahead, has fixed limit on window size.
Optimal window sizes for this problem seem to be small…

47 Learning methods
Features: W tokens before/after, all tokens inside.
Learner: 100 rounds boosting conjunctions of feature tests
–Inspired by BWI (Freitag & Kushmerick)
–Implemented with SLIPPER learner

            HC-1   HC-2f  HC-2   ABWI (W=2)
Precis.     98.5   89.0   74.5   89.7
Recall      45.6   54.8   98.0   91.0
F1          62.3   67.8   84.6   90.3
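How a boosted set of conjunction rules classifies a candidate can be sketched as below. The rules, feature names and confidence values are invented for illustration; the only assumption about the learner is that each rule contributes its confidence when all of its conditions hold and zero otherwise, with a default rule that always fires.

```python
def score(features, rules):
    """Sum the confidences of all rules whose conditions all hold."""
    return sum(conf for conditions, conf in rules if conditions <= features)


def is_image_pointer(features, rules):
    """Accept the candidate when the summed confidence is positive."""
    return score(features, rules) > 0


# Hypothetical rule set: (conjunction of feature tests, confidence).
rules = [
    (frozenset(), -0.5),                               # default rule
    (frozenset({"inside=single_letter"}), 1.5),
    (frozenset({"before_1=panel", "inside=single_letter"}), 0.7),
]

cand_a = {"inside=single_letter", "after_1=NRK"}       # e.g. "(A)"
cand_me = {"inside=abbreviation", "before_1=extract"}  # e.g. "(ME)"

score_a = score(cand_a, rules)
accept_a = is_image_pointer(cand_a, rules)
accept_me = is_image_pointer(cand_me, rules)
```

Only the default rule fires for the second candidate, so it is rejected; this mirrors the filtering role ABWI plays on top of HandCode2 candidates.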

48 Other learning methods

          HC-1   HC-2f  HC-2   ABWI    ABWI     ABWI    ABWI   ABWI
                               (W=2)   Slipper  Ripper  SVM1   SVM2
Precis.   98.5   89.0   74.5   89.7    96.1     88.1    69.0   100.0
Recall    45.6   54.8   98.0   91.0    85.2     87.1    78.0   75.2
F1        62.3   67.8   84.6   90.3             87.6    73.2   85.6

All learning methods are competitive with hand-coded methods

49 Additional features
Check if candidate contains certain “special” substrings:
–Matches color name: labeled color
–Matches HANDCODE-1 pattern: handcode1
–Matches “mm”, “mg”, etc: measure
–Matches 1980,…,2003, “et al”: citation
–Matches “top”, “left”, etc: place
Added “sentence boundary” substrings:
–Feature is “distance to boundary”.
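The "special substring" checks could be implemented along these lines. The word lists and the HANDCODE-1 regular expression are assumptions (the slide gives only examples followed by "etc"), so this is a sketch of the idea and not the original feature set.

```python
import re

# Assumed, deliberately short word lists.
COLORS = {"red", "green", "blue", "yellow"}
MEASURES = {"mm", "mg", "ml"}
PLACES = {"top", "bottom", "left", "right"}
# Assumed approximation of the HANDCODE-1 pattern.
HANDCODE1 = re.compile(
    r"\(\s*[A-Za-z](?:\s*(?:-|,|and)\s*[A-Za-z])*\s*\)"
)


def special_features(candidate):
    """Return the names of the special-substring features that fire."""
    feats = set()
    low = candidate.lower()
    words = set(re.findall(r"[a-z]+", low))
    years = [int(y) for y in re.findall(r"\b\d{4}\b", candidate)]
    if words & COLORS:
        feats.add("color")
    if HANDCODE1.fullmatch(candidate.strip()):
        feats.add("handcode1")
    if words & MEASURES:
        feats.add("measure")
    if "et al" in low or any(1980 <= y <= 2003 for y in years):
        feats.add("citation")
    if words & PLACES:
        feats.add("place")
    return feats


examples = ["(A)", "(5 mm)", "(Smith et al, 1999)", "(left panel)", "(green)"]
fired = {text: special_features(text) for text in examples}
```

Features such as measure and citation mostly mark candidates that are not image pointers, so they help only a learner that can use negatively correlated evidence.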

50 Learning with expanded feature set

          HC-1   HC-2f  HC-2   ABWI (W=2)  ABWI + NA
Precis.   98.5   89.0   74.5   89.7        85.9
Recall    45.6   54.8   98.0   91.0        92.2
F1        62.3   67.8   84.6   90.3        89.0

Many new features are inversely correlated with class (e.g. citation), but ABWI looks only for positively-correlated patterns.

51 Learning with expanded feature set

          HC-1   HC-2f  HC-2   ABWI (W=2)  ABWI + NA  SABWI + NA
Precis.   98.5   89.0   74.5   89.7        85.9       88.6
Recall    45.6   54.8   98.0   91.0        92.2       93.8
F1        62.3   67.8   84.6   90.3        89.0       91.1

SABWI is a symmetric version of ABWI: can use rules and/or conditions negatively or positively correlated with the class

52

53 Task
Identify image pointers in captions.
Classify image pointers:
–bullet-style, citation-style, or NP-style
Combine these to get a four-class problem:
–bullet-style, citation-style, NP-style, or other
–no hand-coded baseline methods

54 Four-class extraction results

Method      Error rate
            W=2    W=3    W=5
ABWI        24.6   27.5   26.7
ABWI+NA     26.7   22.2   26.7
SABWI+NA    24.2   18.2   22.6

55 Further improvement is probable with additional labeled data

