1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

2 Information Extraction (IE)
Identify specific pieces of information (data) in an unstructured or semi-structured textual document.
Transform unstructured information in a corpus of documents or web pages into a structured database.
Applied to different types of text:
–Newspaper articles
–Web pages
–Scientific articles
–Newsgroup messages
–Classified ads
–Medical notes

3 MUC
DARPA funded significant efforts in IE in the early to mid 1990s.
The Message Understanding Conference (MUC) was an annual event/competition where results were presented.
Focused on extracting information from news articles:
–Terrorist events
–Industrial joint ventures
–Company management changes
Information extraction is of particular interest to the intelligence community (CIA, NSA).

4 Other Applications
Job postings:
–Newsgroups: Rapier from austin.jobs
–Web pages: Flipdog
Job resumes:
–BurningGlass
–Mohomine
Seminar announcements
Company information from the web
Continuing education course info from the web
University information from the web
Apartment rental ads
Molecular biology information from MEDLINE

5 Sample Job Posting
Subject: US-TN-SOFTWARE PROGRAMMER
Date: 17 Nov :37:29 GMT
Organization: Reference.Com Posting Service
Message-ID:
SOFTWARE PROGRAMMER
Position available for Software Programmer experienced in generating software for PC-Based Voice Mail systems. Experienced in C Programming. Must be familiar with communicating with and controlling voice cards; preferable Dialogic, however, experience with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more experience with PC Based Voice Mail, but will consider as little as 2 years. Need to find a Senior level person who can come on board and pick up code with very little training. Present Operating System is DOS. May go to OS-2 or UNIX in future.
Please reply to: Kim Anderson AdNET (901) fax

6 Extracted Job Template
computer_science_job
id:
title: SOFTWARE PROGRAMMER
salary:
company:
recruiter:
state: TN
city:
country: US
language: C
platform: PC \ DOS \ OS-2 \ UNIX
application:
area: Voice Mail
req_years_experience: 2
desired_years_experience: 5
req_degree:
desired_degree:
post_date: 17 Nov 1996

7 Amazon Book Description
…. The Age of Spiritual Machines : When Computers Exceed Human Intelligence by <a href="/exec/obidos/search-handle-url/index=books&field-author= Kurzweil%2C%20Ray/ "> Ray Kurzweil <img src=" width=90 height=140 align=left border=0> List Price: $14.95 Our Price: $11.96 You Save: $2.99 (20%) …

8 Extracted Book Template
Title: The Age of Spiritual Machines : When Computers Exceed Human Intelligence
Author: Ray Kurzweil
List-Price: $14.95
Price: $11.96
:

9 Web Extraction Many web pages are generated automatically from an underlying database. Therefore, the HTML structure of pages is fairly specific and regular (semi-structured). However, output is intended for human consumption, not machine interpretation. An IE system for such generated pages allows the web site to be viewed as a structured database. An extractor for a semi-structured web site is sometimes referred to as a wrapper. Process of extracting from such pages is sometimes referred to as screen scraping.

10 Web Extraction using DOM Trees Web extraction may be aided by first parsing web pages into DOM trees. Extraction patterns can then be specified as paths from the root of the DOM tree to the node containing the text to extract. May still need regex patterns to identify proper portion of the final CharacterData node.

11 Sample DOM Tree Extraction
[Figure: DOM tree with Element nodes HTML, HEADER, BODY, B, FONT, A and Character-Data nodes “Age of Spiritual Machines”, “by”, “Ray Kurzweil”]
Title: HTML → BODY → B → CharacterData
Author: HTML → BODY → FONT → A → CharacterData
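
The two paths above can be tried out with a small sketch. It assumes Python's standard `html.parser` module rather than a full DOM library, and the `PathExtractor` class, the `extract` helper, and the sample `PAGE` string are hypothetical names made up for this illustration. It tracks the stack of open tags and keeps any character data whose root-to-node path equals the requested one.

```python
from html.parser import HTMLParser


class PathExtractor(HTMLParser):
    """Collect character data found at a given root-to-node tag path."""

    def __init__(self, target_path):
        super().__init__()
        self.target_path = list(target_path)
        self.stack = []    # currently open tags, root first
        self.matches = []  # text found at the target path

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag. Unclosed tags such as <img>
        # stay on the stack until their parent closes, which is a known
        # limitation of this simplified sketch.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack == self.target_path:
            self.matches.append(text)


def extract(html, path):
    parser = PathExtractor(path)
    parser.feed(html)
    parser.close()
    return parser.matches


# Hypothetical page mirroring the shape of the sample tree.
PAGE = """<html><head><title>Book</title></head>
<body><b>Age of Spiritual Machines</b>
<font>by <a href="#">Ray Kurzweil</a></font></body></html>"""

title = extract(PAGE, ["html", "body", "b"])
author = extract(PAGE, ["html", "body", "font", "a"])
```

Real pages are rarely this clean, so a regex over the extracted character data may still be needed to trim the final string.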

12 Template Types
Slots in a template are typically filled by a substring from the document.
Some slots may have a fixed set of pre-specified possible fillers that may not occur in the text itself.
–Terrorist act: threatened, attempted, accomplished.
–Job type: clerical, service, custodial, etc.
–Company type: SEC code
Some slots may allow multiple fillers.
–Programming language
Some domains may allow multiple extracted templates per document.
–Multiple apartment listings in one ad
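
One way to picture single-filler versus multi-filler slots is as a record type. The sketch below is only an illustration: the `JobTemplate` class and its `filled_slots` helper are hypothetical, the slot names are a subset of those in the extracted job template, and the values are the ones shown there.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JobTemplate:
    # Single-filler slots: one substring of the posting, or None if unfilled.
    title: Optional[str] = None
    state: Optional[str] = None
    country: Optional[str] = None
    area: Optional[str] = None
    req_years_experience: Optional[int] = None
    desired_years_experience: Optional[int] = None
    # Multi-filler slots: zero or more substrings of the posting.
    language: List[str] = field(default_factory=list)
    platform: List[str] = field(default_factory=list)

    def filled_slots(self):
        """Return only the slots that received at least one filler."""
        return {k: v for k, v in vars(self).items() if v not in (None, [])}


job = JobTemplate(
    title="SOFTWARE PROGRAMMER",
    state="TN",
    country="US",
    area="Voice Mail",
    req_years_experience=2,
    desired_years_experience=5,
    language=["C"],
    platform=["PC", "DOS", "OS-2", "UNIX"],
)
```

A document with several listings, such as an apartment ad, would simply yield a list of such records.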

13 Simple Extraction Patterns
Specify an item to extract for a slot using a regular expression pattern.
–Price pattern: “\$\d+(\.\d{2})?\b”
May require a preceding (pre-filler) pattern to identify the proper context.
–Amazon list price:
Pre-filler pattern: “ List Price: ”
Filler pattern: “\$\d+(\.\d{2})?\b”
May require a succeeding (post-filler) pattern to identify the end of the filler.
–Amazon list price:
Pre-filler pattern: “ List Price: ”
Filler pattern: “.+”
Post-filler pattern: “ ”
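
A minimal sketch of these three pattern roles, using Python's `re` module on the price text from the Amazon example. The helper names `extract_with_prefiller` and `extract_between` are made up for this illustration, and the post-filler used here ("You Save:") is an assumption chosen only so that the example runs. The price pattern has no leading `\b`, because a word boundary cannot occur between a space and `$`.

```python
import re

# Filler pattern for a price such as $14.95 or $20.
PRICE = r"\$\d+(?:\.\d{2})?\b"


def extract_with_prefiller(text, pre_filler, filler):
    """Return the first filler match that directly follows the pre-filler."""
    pattern = re.compile(re.escape(pre_filler) + r"\s*(" + filler + r")")
    match = pattern.search(text)
    return match.group(1) if match else None


def extract_between(text, pre_filler, post_filler):
    """Return the shortest text between a pre-filler and a post-filler."""
    pattern = re.compile(
        re.escape(pre_filler) + r"\s*(.+?)\s*" + re.escape(post_filler)
    )
    match = pattern.search(text)
    return match.group(1) if match else None


TEXT = "List Price: $14.95 Our Price: $11.96 You Save: $2.99 (20%)"

all_prices = re.findall(PRICE, TEXT)
list_price = extract_with_prefiller(TEXT, "List Price:", PRICE)
our_price = extract_between(TEXT, "Our Price:", "You Save:")
```

The bare price pattern matches all three amounts, which is why the pre-filler is needed to pick out the list price.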

14 Simple Template Extraction
Extract slots in order, starting the search for the filler of slot n+1 where the filler for slot n ended. Assumes slots always appear in a fixed order.
–Title
–Author
–List price
–…
Alternatively, make patterns specific enough to identify each filler always starting from the beginning of the document.
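
The in-order strategy can be sketched as a loop that carries a search position from one slot to the next. The slot patterns below are hypothetical, written by hand for the visible text of the Amazon example (tags stripped); `extract_template` is a name invented for this sketch.

```python
import re

PRICE = r"\$\d+(?:\.\d{2})?"

# (slot name, pattern with one capturing group for the filler), in page order.
SLOT_PATTERNS = [
    ("title", re.compile(r"(.+?)\s+by\s")),
    ("author", re.compile(r"by\s+(.+?)\s+List Price:")),
    ("list_price", re.compile(r"List Price:\s*(" + PRICE + r")")),
    ("price", re.compile(r"Our Price:\s*(" + PRICE + r")")),
]


def extract_template(text, slot_patterns):
    """Fill slots in order; each search starts where the last filler ended."""
    template = {}
    pos = 0
    for slot, pattern in slot_patterns:
        match = pattern.search(text, pos)
        if match is None:
            template[slot] = None  # leave the slot empty, keep the position
            continue
        template[slot] = match.group(1)
        pos = match.end(1)
    return template


TEXT = (
    "The Age of Spiritual Machines : When Computers Exceed Human "
    "Intelligence by Ray Kurzweil List Price: $14.95 Our Price: $11.96 "
    "You Save: $2.99 (20%)"
)

book = extract_template(TEXT, SLOT_PATTERNS)
```

If the page listed the price before the author, the author search would start too late and miss, which is the fixed-order assumption in action.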

15 Pre-Specified Filler Extraction If a slot has a fixed set of pre-specified possible fillers, text categorization can be used to fill the slot. –Job category –Company type Treat each of the possible values of the slot as a category, and classify the entire document to determine the correct filler.
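
As a rough sketch of filling a slot by text categorization, the code below uses a tiny multinomial Naive Bayes classifier with add-one smoothing. This is one possible classifier, chosen here for brevity, not necessarily the one the slides have in mind. The class name `NaiveBayesSlotFiller` and the three training documents are invented toy data; only the category names (clerical, service, custodial) come from the job-type example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


class NaiveBayesSlotFiller:
    """Pick a slot filler by classifying the whole document."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for word in tokenize(doc):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set (made up for illustration).
train_docs = [
    "filing typing data entry office",
    "janitor cleaning floors building",
    "waiter serving customers restaurant",
]
train_labels = ["clerical", "custodial", "service"]

filler = NaiveBayesSlotFiller().fit(train_docs, train_labels)
job_type = filler.predict("office typing and filing position")
```

The predicted label becomes the slot value even though the word "clerical" never appears in the posting.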

16 Learning for IE Writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering. Alternative is to use machine learning: –Build a training set of documents paired with human-produced filled extraction templates. –Learn extraction patterns for each slot using an appropriate machine learning algorithm.

25 4th Nov, Happy Deepawali! & Halloween 10/31

26 October 31st

Finding “Sweet Spots” in computer-mediated cooperative work
It is possible to get by with techniques blithely ignorant of semantics, when you have humans in the loop
–All you need is to find the right sweet spot, where the computer plays a pre-processing role and presents “potential solutions”
–…and the human very gratefully does the in-depth analysis on those few potential solutions
Examples:
–The incredible success of the “Bag of Words” model!
Bag of letters would be a disaster ;-)
Bag of sentences and/or NLP would be good
–..but only to your discriminating and irascible searchers ;-)

Collaborative Computing AKA Brain Cycle Stealing AKA Computizing Eyeballs
A lot of exciting research related to the web currently involves “co-opting” the masses to help with large-scale tasks
–It is like “cycle stealing”, except we are stealing “human brain cycles” (the most idle of the computers if there ever was one ;-)
Remember the mice in The Hitchhiker's Guide to the Galaxy? (..who were running a mass-scale experiment on the humans to figure out the question..)
–Collaborative knowledge compilation (Wikipedia!)
–Collaborative curation
–Collaborative tagging
–Paid collaboration/contracting
Many big open issues
–How do you pose the problem such that it can be solved using collaborative computing?
–How do you “incentivize” people into letting you steal their brain cycles?
Pay them! (Amazon mturk.com)

Tapping into the Collective Unconscious
Another thread of exciting research is driven by the realization that the Web is not random at all!
–It is written by humans
–…so analyzing its structure and content allows us to tap into the collective unconscious..
Meaning can emerge from syntactic notions such as “co-occurrences” and “connectedness”
Examples:
–Analyzing term co-occurrences in web-scale corpora to capture semantic information (today’s paper)
–Analyzing the link-structure of the web graph to discover communities
DoD and NSA are very much into this as a way of breaking terrorist cells
–Analyzing the transaction patterns of customers (collaborative filtering)

30 Automated Support for “Semantic Web” Semantic web needs: –Tagged data –Background knowledge (blue sky approaches to) automate both –Automated tagging Start with a background ontology and tag other web pages –Semtag/Seeker –Knowledge Extraction Extract base level knowledge (“facts”) directly from the web

31 Extraction from Free Text involves Natural Language Processing
If extracting from automatically generated web pages, simple regex patterns usually work.
If extracting from more natural, unstructured, human-written text, some NLP may help.
–Part-of-speech (POS) tagging
Mark each word as a noun, verb, preposition, etc.
–Syntactic parsing
Identify phrases: NP, VP, PP
–Semantic word categories (e.g. from WordNet)
KILL: kill, murder, assassinate, strangle, suffocate
Off-the-shelf software available to do this!
–The “Brill” tagger
Extraction patterns can use POS or phrase tags.

32 I. Generate-n-Test Architecture
Generic extraction patterns (Hearst ’92):
“…Cities such as Boston, Los Angeles, and Seattle…”
(“C such as NP1, NP2, and NP3”) => IS-A(each(head(NP)), C), …
“Detailed information for several countries such as maps, …”
ProperNoun(head(NP))
“I listen to pretty much all music but prefer country such as Garth Brooks”
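
The “C such as NP1, NP2, and NP3” pattern can be approximated with a regular expression. This is a simplified sketch, not the extractor used in the systems discussed: it assumes no parser is available and stands in for noun phrases with runs of capitalized words, which also plays the role of the ProperNoun(head(NP)) constraint. The names `PROPER_NP`, `SEPARATOR`, `SUCH_AS`, and `hearst_such_as` are invented for this example.

```python
import re

# Stand-in for a proper-noun phrase: one or more capitalized words.
PROPER_NP = r"[A-Z]\w*(?:\s+[A-Z]\w*)*"
# Separators between list items: ", " / ", and " / " and " / " or ".
SEPARATOR = r"(?:\s*,\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+)"

SUCH_AS = re.compile(
    r"\b(?P<cls>[A-Za-z]+)\s+such\s+as\s+"
    r"(?P<items>" + PROPER_NP + r"(?:" + SEPARATOR + PROPER_NP + r")*)"
)


def hearst_such_as(sentence):
    """Return candidate IS-A pairs (instance, class) for one sentence."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        for instance in re.split(SEPARATOR, match.group("items")):
            pairs.append((instance, match.group("cls")))
    return pairs


cities = hearst_such_as(
    "Cities such as Boston, Los Angeles, and Seattle are large."
)
maps = hearst_such_as("Detailed information for several countries such as maps")
music = hearst_such_as(
    "I listen to pretty much all music but prefer country such as Garth Brooks"
)
```

The proper-noun restriction rejects “countries such as maps”, but the Garth Brooks sentence still yields a wrong candidate, which is why generated extractions need a separate test step.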

33 Test Assess candidate extractions using Mutual Information (PMI-IR) (Turney ’01). Many variations are possible…

34 Assessment
PMI = frequency of I & D co-occurrence
5-50 discriminators D_i
Each PMI for D_i is a feature f_i
Naïve Bayes evidence combination:
PMI is used for feature selection. NBC is used for learning.
Hits used for assessing PMI as well as conditional probabilities

35 Assessment In Action
1. I = “Yakima” (1,340,000 hits)
2. D =
3. I+D = “Yakima city” (2,760 hits)
4. PMI = (2,760 / 1.34M) ≈ 0.002
I = “Avocado” (1,000,000 hits)
I+D = “Avocado city” (10 hits)
PMI = (10 / 1M) = 0.00001 << 0.002
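
The arithmetic in this example is a ratio of hit counts: hits for the instance together with the discriminator, divided by hits for the instance alone. The sketch below just reproduces that ratio; the function name `pmi_score` is made up, and no search engine is queried, the counts are the ones quoted on the slide. With those counts, 2,760 / 1,340,000 comes to roughly 0.002.

```python
def pmi_score(hits_instance_and_discriminator, hits_instance):
    """Ratio of co-occurrence hits to instance hits, as used on the slide."""
    if hits_instance <= 0:
        raise ValueError("hits_instance must be positive")
    return hits_instance_and_discriminator / hits_instance


yakima = pmi_score(2_760, 1_340_000)   # "Yakima city" vs. "Yakima"
avocado = pmi_score(10, 1_000_000)     # "Avocado city" vs. "Avocado"
ratio = yakima / avocado               # how much stronger the Yakima evidence is
```

The absolute scores are tiny in both cases; what matters is that Yakima scores about two hundred times higher than Avocado.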

36 Some Sources of Ambiguity
Time: “Clinton is the president” (in 1996).
Context: “common misconceptions..”
Opinion: Elvis…
Multiple word senses: Amazon, Chicago, Chevy Chase, etc.
–Dominant senses can mask recessive ones!
–Approach: unmasking. ‘Chicago –City’

37 Chicago
[Figure labels: City, Movie]

38 Chicago Unmasked
[Figure labels: City sense, Movie sense]

39 Impact of Unmasking on PMI
[Table: Name, Recessive, Original, Unmask, Boost (%). Rows: Washington (city), Casablanca (city), Chevy Chase (actor), Chicago (movie)]

40 CBioC: Collaborative Bio-Curation
Motivation
–To help extract information nuggets from articles and abstracts and store them in a database.
–The challenge is that the number of articles is huge and keeps growing, and that natural language must be processed.
–The two existing approaches are human curation and the use of automatic information extraction systems. They are not able to meet the challenge, as the first is expensive, while the second is error-prone.

41 CBioC (cont’d)
Approach: We propose a solution that is inexpensive and that scales up.
–Our approach takes advantage of automatic information extraction methods as a starting point, based on the premise that if there are a lot of articles, then there must be a lot of readers and authors of these articles.
–We provide a mechanism by which the readers of the articles can participate and collaborate in the curation of information.
–We refer to our approach as “Collaborative Curation”.

42 Using the C-BioCurator System (cont’d)

What is the main difference between KnowItAll and CBioC?
Assessment: KnowItAll does it by hits (PMI over search-engine hit counts); CBioC does it by voting.

44 Annotation
“The Chicago Bulls announced yesterday that Michael Jordan will...”
The <resource ref=" BasketballTeam_Bulls">Chicago Bulls</resource> announced yesterday that <resource ref=" ">Michael Jordan</resource> will...

45 Semantic Annotation
The simplest task of metadata extraction on NL is to establish a “type” relation between entities in the NL resources and concepts in ontologies.
Named Entity Identification

46 Semantics
Semantic Annotation
- The content of annotation consists of some rich semantic information
- Targeted not only at human readers of resources but also at software agents
- formal: metadata following structural standards; informal: personal notes written in the margin while reading an article
- explicit: carry sufficient information for interpretation; tacit: many personal annotations (telegraphic and incomplete)

47 Uses of Annotation

48 Objectives of Annotation
Generate metadata for existing information
–e.g., author-tag in HTML
–RDF descriptions to HTML
–Content description to multimedia files
Employ metadata for
–Improved search
–Navigation
–Presentation
–Summarization of contents
karlsruhe.de/WBS/sst/Teaching/Intelligente%20System%20im%20WWW%20SS%202000/10-Annotation.pdf

49 Annotation
Current practice of annotation for knowledge identification and extraction
–is time consuming
–needs annotation by experts
–is complex
Reduce burden of text annotation for Knowledge Management

SemTag & Seeker
WWW-03 Best Paper Prize
Seeded with TAP ontology (72k concepts)
–And ~700 human judgments
Crawled 264 million web pages
Extracted 434 million semantic tags
–Automatically disambiguated

51 SemTag Research project IBM Very large scale – largest to date 264 million web pages Goal: to provide early set of widespread semantic tags through automated generation

52 SemTag Uses broad, shallow knowledge base TAP – lexical and taxonomic information about popular objects –Music –Movies –Sports –Etc.

53 SemTag Problem: –No write access to original document, so how do you annotate? Solution: –Store annotations in a web-available database

54 SemTag Semantic Label Bureau –Separate store of semantic annotation information –HTTP server that can be queried for annotation information –Example Find all semantic tags for a given document Find all semantic tags for a particular object

55 SemTag Methodology

56 SemTag
Three phases
1. Spotting Pass:
–Tokenize the document
–All instances plus 20 word window
2. Learning Pass:
–Find corpus-wide distribution of terms at each internal node of taxonomy
–Based on a representative sample
3. Tagging Pass:
–Scan windows to disambiguate each reference
–Finally determined to be a TAP object

57 SemTag Another problem magnified by the scale: –Ambiguity Resolution Two fundamental categories of ambiguities: 1.Some labels appear at multiple locations 2.Some entities have labels that occur in contexts that have no representative in the taxonomy

58 SemTag Solution: – Taxonomy Based Disambiguation (TBD) TBD expectation: –Human tuned parameters used in small, critical sections –Automated approaches deal with bulk of information

59 SemTag
TBD methodology:
–Each node in the taxonomy is associated with a set of labels
Cats, Football, Cars all contain “jaguar”
–Each label in the text is stored with a window of 20 words – the context
–Each node has an associated similarity function mapping a context to a similarity
Higher similarity → more likely to contain a reference

60 SemTag
Similarity:
–Built a 200,000 word lexicon (the 200,100 most common words minus the 100 most common)
–200,000 dimensional vector space
–Training: spots (label, context) and correct node
–Estimated the distribution of terms for nodes
–Standard cosine similarity – TFIDF vectors (context vs. node)
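
The context-versus-node comparison can be sketched with a few lines of TF-IDF and cosine similarity. This is a toy illustration, not SemTag's implementation: the node term lists and the context sentence are invented, the lexicon is whatever words occur in the toy nodes, and the function names are hypothetical. Only the node names (Cats, Football, Cars) and the ambiguous label “jaguar” come from the slides.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, n_docs):
    """Sparse TF-IDF vector; words outside the lexicon are dropped."""
    tf = Counter(tokens)
    return {
        term: count * math.log(n_docs / doc_freq[term])
        for term, count in tf.items()
        if doc_freq.get(term)
    }


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy term samples per taxonomy node (made up for illustration).
node_terms = {
    "Cats": "jaguar cat jungle prey spotted cat".split(),
    "Cars": "jaguar car engine luxury sedan car".split(),
    "Football": "jaguar team quarterback touchdown season team".split(),
}

n_docs = len(node_terms)
doc_freq = Counter(t for terms in node_terms.values() for t in set(terms))
node_vectors = {
    node: tfidf_vector(terms, doc_freq, n_docs)
    for node, terms in node_terms.items()
}

# A spotted label with its surrounding window of words (the context).
context = "the new jaguar sedan has a powerful engine".split()
context_vector = tfidf_vector(context, doc_freq, n_docs)

similarities = {n: cosine(context_vector, v) for n, v in node_vectors.items()}
best_node = max(similarities, key=similarities.get)
```

Because “jaguar” occurs under every node its IDF weight is zero, so the decision rests entirely on the surrounding context words.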

61 SemTag References inside the taxonomy vs. References outside the taxonomy Multiple nodes: b = r  b != p(v) Is a context c appropriate for a node v

62 SemTag Some internal nodes very popular: –Associate a measurement of how accurate Sim is likely to be at a node –Also, how ambiguous the node is overall (consistency of human judgment) TBD Algorithm: returns 1 or 0 to indicate whether a particular context c is on topic for a node v 82% accuracy on 434 million spots

63 SemTag

64 Summary
Information extraction can be motivated either as explicating more structure from the data or as an automated route to the Semantic Web
Extraction complexity depends on whether the text you have is “templated” or “free-form”
–Extraction from templated text can be done by regular expressions
–Extraction from free-form text requires NLP
Can be done in terms of part-of-speech tagging
“Annotation” involves connecting terms in a free-form text to items in the background knowledge
–It too can be automated