Information Extraction from Web Resources CENG 770.

Information Extraction (IE)
● Identify specific pieces of information (data) in an unstructured or semi-structured textual document.
● Transform unstructured information in a corpus of documents or web pages into a structured database.
● Applied to different types of text:
  ● Newspaper articles
  ● Web pages
  ● Scientific articles
  ● Newsgroup messages
  ● Classified ads
  ● Medical notes

Other Applications
● Job postings
● Job resumes
● Seminar announcements
● Company information from the web
● Continuing education course info from the web
● University information from the web
● Apartment rental ads
● Molecular biology information from MEDLINE

Sample Job Posting
Subject: US-TN-SOFTWARE PROGRAMMER
Date: 17 Nov :37:29 GMT
Organization: Reference.Com Posting Service
Message-ID:
SOFTWARE PROGRAMMER
Position available for Software Programmer experienced in generating software for PC-Based Voice Mail systems. Experienced in C Programming. Must be familiar with communicating with and controlling voice cards; preferable Dialogic, however, experience with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more experience with PC Based Voice Mail, but will consider as little as 2 years. Need to find a Senior level person who can come on board and pick up code with very little training. Present Operating System is DOS. May go to OS-2 or UNIX in future.
Please reply to: Kim Anderson AdNET (901) fax

Extracted Job Template
computer_science_job
id:
title: SOFTWARE PROGRAMMER
salary:
company:
recruiter:
state: TN
city:
country: US
language: C
platform: PC \ DOS \ OS-2 \ UNIX
application:
area: Voice Mail
req_years_experience: 2
desired_years_experience: 5
req_degree:
desired_degree:
post_date: 17 Nov 1996

Amazon Book Description
… The Age of Spiritual Machines : When Computers Exceed Human Intelligence by <a href="/exec/obidos/search-handle-url/index=books&field-author= Kurzweil%2C%20Ray/ "> Ray Kurzweil <img src=" width=90 height=140 align=left border=0> List Price: $14.95 Our Price: $11.96 You Save: $2.99 (20%) …

Extracted Book Template
Title: The Age of Spiritual Machines : When Computers Exceed Human Intelligence
Author: Ray Kurzweil
List-Price: $14.95
Price: $11.96
:

Web Extraction
● Many web pages are generated automatically from an underlying database.
● Therefore, the HTML structure of pages is fairly specific and regular (semi-structured).
● However, output is intended for human consumption, not machine interpretation.
● An IE system for such generated pages allows the web site to be viewed as a structured database.
● An extractor for a semi-structured web site is sometimes referred to as a wrapper.
● The process of extracting from such pages is sometimes referred to as screen scraping.

Template Types
● Slots in a template are typically filled by a substring from the document.
● Some slots may have a fixed set of pre-specified possible fillers that may not occur in the text itself.
  ● Terrorist act: threatened, attempted, accomplished
  ● Job type: clerical, service, custodial, etc.
  ● Company type: SEC code
● Some slots may allow multiple fillers.
  ● Programming language
● Some domains may allow multiple extracted templates per document.
  ● Multiple apartment listings in one ad

Simple Extraction Patterns
● Specify an item to extract for a slot using a regular expression pattern.
  ● Price pattern: “\$\d+(\.\d{2})?\b”
● May require a preceding (pre-filler) pattern to identify the proper context.
  ● Amazon list price:
    ● Pre-filler pattern: “ List Price: ”
    ● Filler pattern: “\$\d+(\.\d{2})?\b”
● May require a succeeding (post-filler) pattern to identify the end of the filler.
  ● Amazon list price:
    ● Pre-filler pattern: “ List Price: ”
    ● Filler pattern: “.+”
    ● Post-filler pattern: “ ”
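
The pre-filler / filler / post-filler idea can be sketched with Python's re module. This is a minimal illustration, not the pattern set of any real wrapper: the page fragment and its <b> markup are assumptions, and extract_slot is a hypothetical helper. Note that the price filler starts at the "$" itself with no \b in front of it, because a word boundary never occurs between whitespace and "$".

```python
import re

# Hypothetical page fragment: the <b> markup around the labels is an assumption.
PAGE = "<b>List Price:</b> $14.95 <b>Our Price:</b> $11.96"

# Filler pattern for a dollar amount with optional cents.
PRICE = r"\$\d+(?:\.\d{2})?\b"


def extract_slot(text, prefiller, filler, postfiller=""):
    """Return the first filler found between prefiller and postfiller, or None."""
    match = re.search(prefiller + "(" + filler + ")" + postfiller, text)
    return match.group(1) if match else None


# The pre-filler supplies the context that tells the two prices apart.
list_price = extract_slot(PAGE, r"List Price:</b>\s*", PRICE)
our_price = extract_slot(PAGE, r"Our Price:</b>\s*", PRICE)

# A generic ".+?" filler needs a post-filler to know where to stop.
first_label = extract_slot(PAGE, r"<b>", r".+?", r"</b>")

print(list_price, our_price, first_label)
```

The generic filler is written lazily (".+?") here so that it stops at the first post-filler match rather than the last one.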

Simple Template Extraction
● Extract slots in order, starting the search for the filler of slot n+1 where the filler for slot n ended. This assumes the slots always occur in a fixed order:
  ● Title
  ● Author
  ● List price
  ● …
● Alternatively, make patterns specific enough to identify each filler when always starting from the beginning of the document.
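
A minimal sketch of the in-order strategy, assuming a flattened text version of the book page and four hand-written slot patterns (the document string, the slot names and extract_template are all illustrative assumptions). Each search resumes where the previous filler ended.

```python
import re

# Hypothetical flattened page text; the slot order is assumed to be fixed.
DOC = ("The Age of Spiritual Machines by Ray Kurzweil "
       "List Price: $14.95 Our Price: $11.96")

# (slot name, pattern whose first group captures the filler), in document order.
SLOT_PATTERNS = [
    ("title", r"(.+?)(?= by )"),
    ("author", r"by (.+?)(?= List Price:)"),
    ("list_price", r"List Price: (\$\d+(?:\.\d{2})?)"),
    ("price", r"Our Price: (\$\d+(?:\.\d{2})?)"),
]


def extract_template(text, slot_patterns):
    """Fill slots in order; each search starts where the previous filler ended."""
    template = {}
    pos = 0
    for slot, pattern in slot_patterns:
        match = re.compile(pattern).search(text, pos)
        if match is None:
            template[slot] = None  # leave the slot empty and keep the position
            continue
        template[slot] = match.group(1)
        pos = match.end(1)
    return template


template = extract_template(DOC, SLOT_PATTERNS)
print(template)
```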

Natural Language Processing
● If extracting from automatically generated web pages, simple regex patterns usually work.
● If extracting from more natural, unstructured, human-written text, some NLP may help.
  ● Part-of-speech (POS) tagging: mark each word as a noun, verb, preposition, etc.
  ● Syntactic parsing: identify phrases (NP, VP, PP)
  ● Semantic word categories (e.g. from WordNet): KILL: kill, murder, assassinate, strangle, suffocate
● Extraction patterns can use POS or phrase tags.
  ● Crime victim:
    ● Prefiller: [POS: V, Hypernym: KILL]
    ● Filler: [Phrase: NP]
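
The crime-victim pattern can be illustrated on input that has already been tagged. The sketch below assumes a hand-annotated sentence (no real tagger or parser is called), and the chunk dictionaries and extract_victims are hypothetical names chosen for the example.

```python
# Semantic category from the slide: verbs whose hypernym is KILL.
KILL = {"kill", "murder", "assassinate", "strangle", "suffocate"}

# Hand-annotated toy input: phrase tag, POS of the head word, and head lemma.
sentence = [
    {"text": "The gunman", "phrase": "NP", "pos": "N", "lemma": "gunman"},
    {"text": "assassinated", "phrase": "VG", "pos": "V", "lemma": "assassinate"},
    {"text": "the mayor", "phrase": "NP", "pos": "N", "lemma": "mayor"},
]


def extract_victims(chunks):
    """Apply Prefiller [POS: V, Hypernym: KILL] followed by Filler [Phrase: NP]."""
    victims = []
    for prev, cur in zip(chunks, chunks[1:]):
        prefiller_ok = prev["pos"] == "V" and prev["lemma"] in KILL
        if prefiller_ok and cur["phrase"] == "NP":
            victims.append(cur["text"])
    return victims


victims = extract_victims(sentence)
print(victims)
```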

Learning for IE
● Writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering.
● The alternative is to use machine learning:
  ● Build a training set of documents paired with human-produced filled extraction templates.
  ● Learn extraction patterns for each slot using an appropriate machine learning algorithm.

Evaluating IE Accuracy
● Always evaluate performance on independent, manually annotated test data not used during system development.
● Measure for each test document:
  ● Total number of correct extractions in the solution template: N
  ● Total number of slot/value pairs extracted by the system: E
  ● Number of extracted slot/value pairs that are correct (i.e. in the solution template): C
● Compute the average value of metrics adapted from IR:
  ● Recall = C/N
  ● Precision = C/E
  ● F-measure = harmonic mean of recall and precision = 2 · Precision · Recall / (Precision + Recall)
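
As a worked example of N, E and C, the sketch below scores one document by treating the solution template and the system output as sets of slot/value pairs. The system output is invented for illustration, and score is a hypothetical helper.

```python
def score(solution, extracted):
    """Return (recall, precision, f_measure) for one document.

    solution and extracted are sets of (slot, value) pairs.
    """
    n = len(solution)               # N: pairs in the solution template
    e = len(extracted)              # E: pairs produced by the system
    c = len(solution & extracted)   # C: extracted pairs that are correct
    recall = c / n if n else 0.0
    precision = c / e if e else 0.0
    total = recall + precision
    f_measure = 2 * recall * precision / total if total else 0.0
    return recall, precision, f_measure


solution = {("title", "SOFTWARE PROGRAMMER"), ("state", "TN"),
            ("language", "C"), ("area", "Voice Mail")}
# Made-up system output: two correct pairs and one wrong filler.
extracted = {("title", "SOFTWARE PROGRAMMER"), ("state", "TN"),
             ("language", "Java")}

recall, precision, f_measure = score(solution, extracted)
print(recall, precision, f_measure)
```

Here N = 4, E = 3 and C = 2, so recall is 2/4, precision is 2/3 and the F-measure is 4/7.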

Information Integration
● Answering certain questions using the web requires integrating information from multiple web sites.
● Information integration concerns methods for automating this integration.
● Requires wrappers to accurately extract specific information from web pages from specific sites.
● Treat each wrapped site as a database table and answer complex queries using a database query language (e.g. SQL).

Information Integration Example
Question: What is the closest theater to my home where I can see both Monsters Inc. and Harry Potter?
● From austin360.com, extract theaters and their addresses where Harry Potter and Monsters Inc. are playing.
● Intersect the two to find the theaters playing both.
● Query mapquest.com for driving directions from your home address to the address of each of these theaters.
● Extract distance and driving instructions for each.
● Sort results by driving distance.
● Present driving instructions for closest theater.
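
Once wrappers have turned each site into rows, the intersect-and-sort steps reduce to an ordinary query. The sketch below uses Python's built-in sqlite3 module with two in-memory tables; the theater names and distances are made up, and no web site is actually contacted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE showtimes (movie TEXT, theater TEXT)")
conn.execute("CREATE TABLE distances (theater TEXT, miles REAL)")

# Rows a showtimes wrapper might have produced (invented data).
conn.executemany("INSERT INTO showtimes VALUES (?, ?)", [
    ("Harry Potter", "Theater A"),
    ("Harry Potter", "Theater B"),
    ("Harry Potter", "Theater C"),
    ("Monsters Inc.", "Theater B"),
    ("Monsters Inc.", "Theater C"),
])
# Rows a driving-directions wrapper might have produced (invented data).
conn.executemany("INSERT INTO distances VALUES (?, ?)", [
    ("Theater A", 2.1),
    ("Theater B", 7.4),
    ("Theater C", 3.8),
])

# The self-join keeps theaters showing both movies; ORDER BY sorts by distance.
QUERY = """
SELECT hp.theater, d.miles
FROM showtimes AS hp
JOIN showtimes AS mi ON mi.theater = hp.theater
JOIN distances AS d ON d.theater = hp.theater
WHERE hp.movie = 'Harry Potter' AND mi.movie = 'Monsters Inc.'
ORDER BY d.miles
"""
ranked = conn.execute(QUERY).fetchall()
closest = ranked[0][0]
print(ranked, closest)
```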

Automatic event extraction from web resources

Event
● An event is an activity that happens at a specific time and location and attracts people's attention.
● Involves entities and relations between them
● Implies a change of state
Example: The striker of Barcelona scored a wonderful goal in the 89th minute.
● 1 event (goal-shot)
● 2 entities (person and team)
● 1 change of state (the scoring)

Events in textual documents
Various types of text:
● Structured: processing requires pattern matching techniques; very little linguistic knowledge is needed.
● Semi-structured: requires a mixture of pattern matching and more linguistic knowledge.
● Unstructured: requires a mixture of layout analysis and linguistic knowledge.
Event extraction needs a domain-specific knowledge base (ontology).

Domain Knowledge
Domain knowledge can be organised in terminologies, thesauri, taxonomies or ontologies.

Automatic Event Extraction from Text
● A combination of human language technology (HLT) and semantic web technologies (ontologies)
● Statistical means (with minimal linguistic knowledge)

Linguistic Analysis
Free text documents undergoing linguistic analysis become available as semi-structured documents,
– from which meaningful units can be extracted automatically (information extraction) and
– organized through clustering or classification (text mining).
Basic linguistic analysis steps that underlie the extraction tasks:
– tokenization,
– morphological analysis,
– part-of-speech tagging,
– chunking,
– dependency structure analysis,
– semantic tagging.

Tokenisation
Tokenisation deals with the detection of the word units in a text and with the detection of sentence boundaries.
Example: The markets acknowledge the measures taken on the 24th of September by the CEO of XYZ Corp.
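
A minimal sketch of both tasks, word detection and sentence boundary detection, using a regular expression and a tiny abbreviation list. The list, the sample text and both function names are assumptions; the sketch deliberately does not handle an abbreviation that also ends a sentence, which is one reason real tokenisers are more involved.

```python
import re

# Tiny illustrative abbreviation list; a real system would use a larger one.
ABBREVIATIONS = {"Corp.", "Inc.", "Mr.", "Dr."}

TEXT = ("The CEO of XYZ Corp. announced the measures on the 24th of September. "
        "The markets acknowledge them.")


def tokenize(text):
    """Split on whitespace, then separate punctuation except in abbreviations."""
    tokens = []
    for raw in text.split():
        if raw in ABBREVIATIONS:
            tokens.append(raw)  # keep the period attached to the abbreviation
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", raw))
    return tokens


def split_sentences(tokens):
    """Close a sentence at every stand-alone '.', '!' or '?' token."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


tokens = tokenize(TEXT)
sentences = split_sentences(tokens)
print(sentences)
```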

Morphological Analysis
Morphological analysis is concerned with the inflectional, derivational, and compounding processes in word formation in order to determine properties such as stem and inflectional information. Together with part-of-speech (PoS) information, this process delivers the morpho-syntactic properties of a word.
Examples:
● Evlerinde (Turkish): stem ev
● Häusern (“houses”, German): [PoS=N NUM=PL CASE=DAT GEN=NEUT STEM=HAUS]

Part-of-Speech Tagging
Part-of-Speech (PoS) tagging is the process of determining the correct syntactic class (a part of speech, e.g. noun, verb, etc.) for a particular word given its current context. The word “works” in the following sentences will be either a verb or a noun:
● He works [N,V] the whole day for nothing.
● His works [N,V] have all been sold abroad.
PoS tagging involves disambiguating between multiple part-of-speech tags, as well as guessing the correct part-of-speech tag for unknown words on the basis of context information.
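
The disambiguation of “works” can be illustrated with a toy context rule. Everything here is an assumption made for the example: the three-word lexicon, the tag names PRON and POSS, and the single rule that a possessive is followed by a noun and a subject pronoun by a verb. Real taggers learn such preferences from annotated corpora.

```python
# Toy lexicon: each word maps to the set of tags it can take.
LEXICON = {"he": {"PRON"}, "his": {"POSS"}, "works": {"N", "V"}}


def disambiguate(tokens):
    """Tag tokens left to right, using the previous tag to resolve ambiguity."""
    tags = []
    for i, token in enumerate(tokens):
        candidates = LEXICON.get(token.lower(), {"UNK"})
        if len(candidates) == 1:
            tags.append(next(iter(candidates)))
            continue
        previous = tags[i - 1] if i else None
        if previous == "POSS" and "N" in candidates:
            tags.append("N")   # "His works": possessive + noun
        elif previous == "PRON" and "V" in candidates:
            tags.append("V")   # "He works": subject pronoun + verb
        else:
            tags.append(sorted(candidates)[0])  # arbitrary fallback
    return tags


print(disambiguate(["He", "works"]))
print(disambiguate(["His", "works"]))
```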

Chunking
Chunks are sequences of words which are grouped on the basis of linguistic properties, such as nominal, prepositional, adjectival and adverbial phrases and verb groups.
Example: [ NP His works] [ VG have] [ NP all] [ VG been sold] [ AdvP abroad].
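
A deliberately simple sketch that reproduces the bracketing of the example sentence: each PoS tag is mapped to a chunk type and neighbouring tokens of the same type are merged. The Penn-Treebank-style tags and the mapping table are assumptions, and this merging rule would wrongly fuse two adjacent noun phrases, so it is an illustration rather than a usable chunker.

```python
from itertools import groupby

# Assumed mapping from PoS tags to chunk types.
TAG_TO_CHUNK = {
    "PRP$": "NP", "DT": "NP", "NN": "NP", "NNS": "NP",
    "VBP": "VG", "VBN": "VG",
    "RB": "AdvP",
}

# Hand-tagged input for the sentence "His works have all been sold abroad".
tagged = [("His", "PRP$"), ("works", "NNS"), ("have", "VBP"), ("all", "DT"),
          ("been", "VBN"), ("sold", "VBN"), ("abroad", "RB")]


def chunk(tagged_tokens):
    """Merge runs of adjacent tokens that map to the same chunk type."""
    chunks = []
    by_type = groupby(tagged_tokens, key=lambda wt: TAG_TO_CHUNK.get(wt[1], "O"))
    for label, group in by_type:
        chunks.append((label, " ".join(word for word, _ in group)))
    return chunks


chunks = chunk(tagged)
print(" ".join(f"[{label} {words}]" for label, words in chunks))
```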

Named Entities Detection
● Related to chunking is the recognition of so-called named entities (names of institutions and companies, date expressions, etc.).
● The extraction of named entities is mostly based on a strategy that combines look-up in gazetteers (lists of companies, cities, etc.) with the definition of regular expression patterns.
Example: “…the secretary-general of the United Nations, Kofi Annan,…” will be annotated as a nominal phrase, including two named entities:
● United Nations, with named entity class: organization
● Kofi Annan, with named entity class: person
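
The combined strategy, gazetteer look-up for names plus a regular expression for date expressions, can be sketched as follows. The two-entry gazetteer, the date pattern and the sample sentence are illustrative assumptions.

```python
import re

# Tiny illustrative gazetteer: surface string -> named entity class.
GAZETTEER = {"United Nations": "organization", "Kofi Annan": "person"}

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
# Matches expressions such as "24th of September".
DATE_PATTERN = re.compile(
    r"\b\d{1,2}(?:st|nd|rd|th)? of (?:" + "|".join(MONTHS) + r")\b")


def find_entities(text):
    """Return (entity text, class) pairs in order of appearance."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(), label))
    for match in DATE_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "date"))
    return [(entity, label) for _, entity, label in sorted(found)]


TEXT = ("On the 24th of September the secretary-general of the "
        "United Nations, Kofi Annan, met the press.")
entities = find_entities(TEXT)
print(entities)
```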

Dependency Structure Analysis
A dependency structure consists of two or more linguistic units that immediately dominate each other in a syntax tree. The detection of such structures is generally not provided by chunking but builds on top of it. Two main types of dependencies are relevant for our purposes: on the one hand, the internal dependency structure of phrasal units or chunks, and on the other hand, the so-called grammatical functions (like subject and direct object).

Internal Dependency Structure
To describe the internal dependency structure of a phrase, linguistic analysis uses the terms head, complements and modifiers: the head is the dominating node in the syntax tree of a phrase (chunk), complements are necessary qualifiers of the head, and modifiers are optional qualifiers. Consider the following example: “The shot by Christian Ziege goes over the goal.” The prepositional phrase “by Christian Ziege” (containing the named entity Christian Ziege) depends on (and modifies) the head noun “shot”.

Grammatical Functions
Grammatical functions determine the role (function) of each linguistic chunk in the sentence and make it possible to identify the actors involved in certain events. For example, in the following sentence the syntactic (and also the semantic) subject is the NP constituent “The shot by Christian Ziege”: “The shot by Christian Ziege goes over the goal.” This nominal phrase depends on (and complements) the verb “goes”, whereas the noun “shot” is the head of the NP (it is the shot that goes over the goal, not Christian Ziege!).

Semantic Tagging
Automatic semantic annotation has developed within language technology in recent years in connection with more integrated tasks like information extraction, which require a certain level of semantic analysis. Semantic tagging consists of annotating each content word in a document with a semantic category. Semantic categories are assigned on the basis of a semantic resource like WordNet for English, or EuroWordNet, which links words between many European languages through a common inter-lingua of concepts.

Semantic Resources
Semantic resources are captured in dictionaries, thesauri, and semantic networks, all of which express, either implicitly or explicitly, an ontology of the world in general or of more specific domains, such as medicine. They can be roughly distinguished into the following three groups:
● Thesauri: semantic resources that group together similar words or terms according to a standard set of relations, including broader term, narrower term, sibling, etc.
● Semantic lexicons: semantic resources that group together words (or more complex lexical items) according to lexical semantic relations like synonymy, hyponymy, meronymy, and antonymy (like WordNet).
● Semantic networks: semantic resources that group together objects denoted by natural language expressions (terms) according to a set of relations that originate in the nature of the domain of application (like UMLS in the medical domain).

The WordNet Semantic Lexicon
WordNet has primarily been designed as a computational account of the human capacity of linguistic categorization and covers an extensive set of semantic classes (called synsets). Synsets are collections of synonyms, grouping together lexical items according to meaning similarity. Synsets are actually not made up of lexical items, but rather of lexical meanings (i.e. senses).

WordNet: An Example
The word 'tree' has two meanings that roughly correspond to the class of plants and that of diagrams, each with its own hierarchy of classes that are included in more general super-classes:

tree
  woody_plant, ligneous_plant
    vascular_plant, tracheophyte
      plant, flora, plant_life
        life_form, organism, being, living_thing
          entity, something

tree, tree_diagram
  plane_figure, two-dimensional_figure
    figure
      shape, form
        attribute
          abstraction
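
The two super-class hierarchies for 'tree' can be walked with a plain dictionary that maps each class to its immediate super-class. This sketch does not use the WordNet database itself: the entries are copied from the example, each synset is represented by its first member only, and the sense labels "tree (plant)" and "tree (diagram)" are hypothetical names introduced to keep the two meanings apart.

```python
# Immediate super-class of each class, taken from the 'tree' example.
HYPERNYM = {
    "tree (plant)": "woody_plant",
    "woody_plant": "vascular_plant",
    "vascular_plant": "plant",
    "plant": "life_form",
    "life_form": "entity",
    "tree (diagram)": "plane_figure",
    "plane_figure": "figure",
    "figure": "shape",
    "shape": "attribute",
    "attribute": "abstraction",
}


def hypernym_chain(sense):
    """Follow super-class links from a sense up to the most general class."""
    chain = [sense]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


plant_chain = hypernym_chain("tree (plant)")
diagram_chain = hypernym_chain("tree (diagram)")
print(" -> ".join(plant_chain))
print(" -> ".join(diagram_chain))
```

A semantic tagger could use such a chain to assign a coarse category (for example "plant" versus "figure") once the intended sense of the word is known.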