Information Extraction from Web Resources CENG 770.

Information Extraction (IE) ● Identify specific pieces of information (data) in an unstructured or semi-structured textual document. ● Transform unstructured information in a corpus of documents or web pages into a structured database. ● Applied to different types of text: ● Newspaper articles ● Web pages ● Scientific articles ● Newsgroup messages ● Classified ads ● Medical notes

Other Applications ● Job postings ● Job resumes ● Seminar announcements ● Company information from the web ● Continuing education course info from the web ● University information from the web ● Apartment rental ads ● Molecular biology information from MEDLINE

Sample Job Posting Subject: US-TN-SOFTWARE PROGRAMMER Date: 17 Nov 1996 17:37:29 GMT Organization: Reference.Com Posting Service Message-ID: SOFTWARE PROGRAMMER Position available for Software Programmer experienced in generating software for PC-Based Voice Mail systems. Experienced in C Programming. Must be familiar with communicating with and controlling voice cards; preferable Dialogic, however, experience with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more experience with PC Based Voice Mail, but will consider as little as 2 years. Need to find a Senior level person who can come on board and pick up code with very little training. Present Operating System is DOS. May go to OS-2 or UNIX in future. Please reply to: Kim Anderson AdNET (901) 458-2888 fax kimander@memphisonline.com

Extracted Job Template
computer_science_job
id: 56nigp$mrs@bilbo.reference.com
title: SOFTWARE PROGRAMMER
salary:
company:
recruiter:
state: TN
city:
country: US
language: C
platform: PC \ DOS \ OS-2 \ UNIX
application:
area: Voice Mail
req_years_experience: 2
desired_years_experience: 5
req_degree:
desired_degree:
post_date: 17 Nov 1996

Amazon Book Description …. The Age of Spiritual Machines : When Computers Exceed Human Intelligence by <a href="/exec/obidos/search-handle-url/index=books&field-author= Kurzweil%2C%20Ray/002-6235079-4593641"> Ray Kurzweil <img src="http://images.amazon.com/images/P/0140282025.01.MZZZZZZZ.gif" width=90 height=140 align=left border=0> List Price: $14.95 Our Price: $11.96 You Save: $2.99 (20%) …

Extracted Book Template
Title: The Age of Spiritual Machines : When Computers Exceed Human Intelligence
Author: Ray Kurzweil
List-Price: $14.95
Price: $11.96
…

Web Extraction Many web pages are generated automatically from an underlying database. Therefore, the HTML structure of such pages is fairly specific and regular (semi-structured). However, the output is intended for human consumption, not machine interpretation. An IE system for such generated pages allows the web site to be viewed as a structured database. An extractor for a semi-structured web site is sometimes referred to as a wrapper. The process of extracting from such pages is sometimes referred to as screen scraping.

Template Types Slots in a template are typically filled by a substring from the document. Some slots may have a fixed set of pre-specified possible fillers that may not occur in the text itself: Terrorist act: threatened, attempted, accomplished. Job type: clerical, service, custodial, etc. Company type: SEC code. Some slots may allow multiple fillers (e.g. programming language). Some domains may allow multiple extracted templates per document (e.g. multiple apartment listings in one ad).

Simple Extraction Patterns Specify an item to extract for a slot using a regular expression pattern. Price pattern: “\$\d+(\.\d{2})?\b” (no leading \b, because there is no word boundary between a space and “$”). May require a preceding (pre-filler) pattern to identify the proper context. Amazon list price: Pre-filler pattern: “List Price: ” Filler pattern: “\$\d+(\.\d{2})?\b” May require a succeeding (post-filler) pattern to identify the end of the filler. Amazon list price: Pre-filler pattern: “List Price: ” Filler pattern: “.+” Post-filler pattern: “ ”
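
The pre-filler / filler / post-filler idea can be sketched in Python as follows. This is a minimal illustration, not the course's implementation: the helper name `extract` and the sample string `page` (the Amazon text with its markup already stripped) are assumptions made for the example.

```python
import re

# Assumed input: the Amazon example with its HTML markup already removed.
page = "by Ray Kurzweil List Price: $14.95 Our Price: $11.96 You Save: $2.99 (20%)"

# Price filler. There is deliberately no leading \b: "$" is not a word
# character, so \b would fail to match between a space and "$".
PRICE = r"\$\d+(?:\.\d{2})?\b"

def extract(text, filler, pre="", post=""):
    """Return the first filler found between `pre` and `post`, or None."""
    match = re.search(pre + "(" + filler + ")" + post, text)
    return match.group(1) if match else None

# Pre-filler pattern gives the context for the price.
list_price = extract(page, PRICE, pre=r"List Price:\s*")
our_price = extract(page, PRICE, pre=r"Our Price:\s*")

# A generic filler needs a post-filler pattern to mark where it ends;
# the lazy ".+?" stops at the first place the post-filler can match.
author = extract(page, r".+?", pre=r"by\s+", post=r"\s+List Price:")

print(list_price, our_price, author)
```

The lazy quantifier is a design choice for this sketch: a greedy “.+” would run to the last possible post-filler match rather than the first.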

Simple Template Extraction Extract slots in order, starting the search for the filler of the (n+1)th slot where the filler for the nth slot ended. Assumes slots always occur in a fixed order: Title, Author, List price, … Alternatively, make patterns specific enough to identify each filler when always starting from the beginning of the document.
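
A rough sketch of in-order extraction, assuming the fixed slot order Title, Author, List price, Price. The slot table `SLOTS`, the function `extract_template`, and the sample string `page` are hypothetical names introduced for this example. Note that the two price slots share the same pattern; only the search position tells them apart.

```python
import re

# Assumed input: the Amazon example with its markup already removed.
page = ("The Age of Spiritual Machines : When Computers Exceed Human Intelligence "
        "by Ray Kurzweil List Price: $14.95 Our Price: $11.96 You Save: $2.99 (20%)")

PRICE = r"\$\d+(?:\.\d{2})?"

# (slot name, pre-filler, filler, post-filler), listed in document order.
SLOTS = [
    ("Title", r"", r".+?", r"\s+by\s+"),
    ("Author", r"by\s+", r".+?", r"\s+List Price:"),
    ("List-Price", r"Price:\s*", PRICE, r""),
    ("Price", r"Price:\s*", PRICE, r""),
]

def extract_template(text, slots):
    """Fill slots in order; each search starts where the previous filler ended."""
    template, pos = {}, 0
    for name, pre, filler, post in slots:
        match = re.compile(pre + "(" + filler + ")" + post).search(text, pos)
        if match is None:
            template[name] = None   # leave the slot empty and keep the position
            continue
        template[name] = match.group(1)
        pos = match.end(1)          # next search resumes after this filler
    return template

template = extract_template(page, SLOTS)
print(template)
```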

Natural Language Processing If extracting from automatically generated web pages, simple regex patterns usually work. If extracting from more natural, unstructured, human-written text, some NLP may help: Part-of-speech (POS) tagging: mark each word as a noun, verb, preposition, etc. Syntactic parsing: identify phrases (NP, VP, PP). Semantic word categories (e.g. from WordNet): KILL: kill, murder, assassinate, strangle, suffocate. Extraction patterns can use POS or phrase tags. Crime victim: Prefiller: [POS: V, Hypernym: KILL] Filler: [Phrase: NP]
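
The crime-victim pattern can be illustrated on pre-annotated input. In this sketch the sentence, its chunk annotations, and the function name `crime_victims` are all hypothetical; a real system would obtain the tags from a POS tagger and chunker and the KILL class from WordNet hypernyms rather than from a hand-written set.

```python
# Toy stand-in for a WordNet hypernym lookup (word list taken from the slide).
KILL = {"kill", "murder", "assassinate", "strangle", "suffocate"}

# Hand-annotated chunks of a made-up sentence:
# (surface text, phrase tag, POS of head word, lemma of head word)
sentence = [
    ("The gunman", "NP", "N", "gunman"),
    ("assassinated", "VG", "V", "assassinate"),
    ("the mayor", "NP", "N", "mayor"),
    ("in the city center", "PP", "P", "in"),
]

def crime_victims(chunks):
    """Prefiller: [POS: V, Hypernym: KILL]; Filler: the NP that follows it."""
    victims = []
    for prev, cur in zip(chunks, chunks[1:]):
        prefiller_ok = prev[2] == "V" and prev[3] in KILL
        filler_ok = cur[1] == "NP"
        if prefiller_ok and filler_ok:
            victims.append(cur[0])
    return victims

victims = crime_victims(sentence)
print(victims)
```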

Learning for IE Writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering. An alternative is to use machine learning: Build a training set of documents paired with human-produced filled extraction templates. Learn extraction patterns for each slot using an appropriate machine learning algorithm.

Evaluating IE Accuracy Always evaluate performance on independent, manually-annotated test data not used during system development. Measure for each test document: Total number of correct extractions in the solution template: N. Total number of slot/value pairs extracted by the system: E. Number of extracted slot/value pairs that are correct (i.e. in the solution template): C. Compute the average value of metrics adapted from IR: Recall = C/N, Precision = C/E, F-Measure = harmonic mean of recall and precision = 2 · Precision · Recall / (Precision + Recall).
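
As a worked example of these measures, the sketch below scores one document. The function name `score` and the "system output" are made up for illustration; the solution pairs are taken from the job template shown earlier.

```python
def score(solution, extracted):
    """Recall, precision and F-measure for one document.

    Both arguments are sets of (slot, value) pairs.
    """
    n = len(solution)               # N: pairs in the solution template
    e = len(extracted)              # E: pairs the system extracted
    c = len(solution & extracted)   # C: extracted pairs that are correct
    recall = c / n if n else 0.0
    precision = c / e if e else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return recall, precision, f_measure

solution = {
    ("title", "SOFTWARE PROGRAMMER"),
    ("state", "TN"),
    ("language", "C"),
    ("req_years_experience", "2"),
}
# Hypothetical system output: two correct pairs and one wrong one.
extracted = {
    ("title", "SOFTWARE PROGRAMMER"),
    ("state", "TN"),
    ("language", "DOS"),
}

recall, precision, f_measure = score(solution, extracted)
print(recall, precision, f_measure)
```

Here N = 4, E = 3 and C = 2, so recall is 2/4, precision is 2/3, and the F-measure is 4/7. Averaging these per-document values over the test set gives the reported figures.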

Information Integration ● Answering certain questions using the web requires integrating information from multiple web sites. ● Information integration concerns methods for automating this integration. ● Requires wrappers to accurately extract specific information from web pages from specific sites. ● Treat each wrapped site as a database table and answer complex queries using a database query language (e.g. SQL).

Information Integration Example Question: What is the closest theater to my home where I can see both Monsters Inc. and Harry Potter? ● From austin360.com, extract theaters and their addresses where Harry Potter and Monsters Inc. are playing. ● Intersect the two to find the theaters playing both. ● Query mapquest.com for driving directions from your home address to the address of each of these theaters. ● Extract distance and driving instructions for each. ● Sort results by driving distance. ● Present driving instructions for closest theater.
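
Once each site is wrapped, the steps above reduce to an ordinary database query. The sketch below uses Python's built-in sqlite3 module; the table names, theater names and distances are invented placeholders standing in for what the two wrappers would return.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE showings (theater TEXT, movie TEXT);   -- wrapped listings site
CREATE TABLE distances (theater TEXT, miles REAL);  -- wrapped directions site
""")

# Placeholder rows; a real system would fill these from the wrappers.
conn.executemany("INSERT INTO showings VALUES (?, ?)", [
    ("Theater A", "Harry Potter"),
    ("Theater A", "Monsters Inc."),
    ("Theater B", "Harry Potter"),
    ("Theater C", "Harry Potter"),
    ("Theater C", "Monsters Inc."),
])
conn.executemany("INSERT INTO distances VALUES (?, ?)", [
    ("Theater A", 7.5),
    ("Theater B", 1.2),
    ("Theater C", 3.4),
])

# Intersect the two theater lists, then sort by driving distance.
QUERY = """
SELECT d.theater, d.miles
FROM distances AS d
WHERE d.theater IN (
    SELECT theater FROM showings WHERE movie = 'Harry Potter'
    INTERSECT
    SELECT theater FROM showings WHERE movie = 'Monsters Inc.'
)
ORDER BY d.miles
"""
ranked = conn.execute(QUERY).fetchall()
closest = ranked[0]
print(closest)
```

Theater B is the nearest overall but shows only one of the two films, so the intersection step excludes it before sorting.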

Automatic event extraction from web resources

Event ● An event is an activity that happens at a specific time and location and attracts people's attention ● Involves entities and relations between them ● Implies a change of state Example: The striker of Barcelona scored a wonderful goal in the 89th minute. 1 event (goal-shot) 2 entities (person and team) 1 change of state (the scoring)

Events in textual documents Various types of text: Structured: processing requires pattern-matching techniques; very little linguistic knowledge is needed. Semi-structured: requires a mixture of pattern matching and more linguistic knowledge. Unstructured: requires a mixture of layout analysis and linguistic knowledge. Event extraction needs a domain-specific knowledge base (ontology).

Domain Knowledge Domain knowledge can be organised in terminologies, thesauri, taxonomies or ontologies.

Automatic Event Extraction from Text ● A combination of human language technology (HLT) and semantic web technologies (ontologies) ● Statistical means (with minimal linguistic knowledge)

Linguistic Analysis Free text documents undergoing linguistic analysis become available as semi-structured documents, – from which meaningful units can be extracted automatically (information extraction) and – organized through clustering or classification (text mining). Basic linguistic analysis steps that underlie the extraction tasks: – tokenization, – morphological analysis, – part-of-speech tagging, – chunking, – dependency structure analysis, – semantic tagging.

Tokenisation Tokenisation deals with the detection of the word units in a text and with the detection of sentence boundaries. The markets acknowledge the measures taken on the 24th of September by the CEO of XYZ Corp.
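
The example sentence is tricky because the final period belongs to the abbreviation “Corp.” and also ends the sentence. A crude sketch of both tasks follows; the abbreviation list, the function names, and the second test text are assumptions made for this example, and the boundary heuristic is deliberately simplistic (an abbreviation's period never ends a sentence unless the text ends there).

```python
import re

# Tiny, hypothetical abbreviation list; real tokenisers use much larger ones.
ABBREVIATIONS = {"Corp.", "Inc.", "Dr.", "Mr."}

def tokenize(text):
    """Split on whitespace, then separate punctuation except in abbreviations."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)    # keep the period attached
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

def split_sentences(text):
    """End a sentence at '.', '!' or '?' unless the word is an abbreviation."""
    sentences, current = [], []
    chunks = text.split()
    for i, chunk in enumerate(chunks):
        current.append(chunk)
        is_last = i == len(chunks) - 1
        ends_with_stop = chunk[-1] in ".!?"
        if is_last or (ends_with_stop and chunk not in ABBREVIATIONS):
            sentences.append(" ".join(current))
            current = []
    return sentences

slide = ("The markets acknowledge the measures taken on the 24th of September "
         "by the CEO of XYZ Corp.")
tokens = tokenize(slide)
sentences = split_sentences("Shares of XYZ Corp. rose. Analysts were surprised.")
print(tokens)
print(sentences)
```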

Morphological Analysis Morphological analysis is concerned with the inflectional, derivational, and compounding processes in word formation in order to determine properties such as stem and inflectional information. Together with part-of-speech (PoS) information this process delivers the morpho-syntactic properties of a word. Example: Evlerinde (in Turkish): stem ev. Häusern (houses) (in German): [PoS=N NUM=PL CASE=DAT GEN=NEUT STEM=HAUS]

Part-of-Speech Tagging Part-of-Speech (PoS) tagging is the process of determining the correct syntactic class (a part-of-speech, e.g. noun, verb, etc.) for a particular word given its current context. The word “works” in the following sentences will be either a verb or a noun: He works [N,V] the whole day for nothing. His works [N,V] have all been sold abroad. PoS tagging involves disambiguating between multiple part-of-speech tags, as well as guessing the correct part-of-speech tag for unknown words on the basis of context information.

Chunking Chunks are sequences of words which are grouped on the basis of linguistic properties, such as nominal, prepositional, adjectival and adverbial phrases and verb groups. [ NP His works] [ VG have] [ NP all] [ VG been sold] [ AdvP abroad].

Named Entities Detection ● Related to chunking is the recognition of so-called named entities (names of institutions and companies, date expressions, etc.). ● The extraction of named entities is mostly based on a strategy that combines look up in gazetteers (lists of companies, cities, etc.) with the definition of regular expression patterns. Example: “…the secretary-general of the United Nations, Kofi Annan,…” will be annotated as a nominal phrase, including two named entities: United Nations with named entity class: organization, and Kofi Annan with named entity class: person
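
The gazetteer-plus-patterns strategy can be sketched as below. The two gazetteer entries come from the slide's example; the date pattern, the function name `find_entities`, and the extended test sentence are assumptions added for illustration.

```python
import re

# Gazetteer: known names mapped to their named entity class.
GAZETTEER = {
    "United Nations": "organization",
    "Kofi Annan": "person",
}

# Regular expression for simple date expressions such as "24th of September".
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE = re.compile(r"\b\d{1,2}(?:st|nd|rd|th)? of (?:" + MONTHS + r")\b")

def find_entities(text):
    """Return (entity text, class) pairs in order of appearance."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(), label))
    for match in DATE.finditer(text):
        found.append((match.start(), match.group(), "date"))
    return [(surface, label) for _, surface, label in sorted(found)]

# Made-up continuation of the slide's example, to include a date expression.
text = ("the secretary-general of the United Nations, Kofi Annan, "
        "spoke on the 24th of September")
entities = find_entities(text)
print(entities)
```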

Dependency Structure Analysis A dependency structure consists of two or more linguistic units that immediately dominate each other in a syntax tree. The detection of such structures is generally not provided by chunking but builds on top of it. Two main types of dependencies are relevant for our purposes: on the one hand, the internal dependency structure of phrasal units or chunks, and on the other hand, the so-called grammatical functions (like subject and direct object).

Internal Dependency Structure In linguistic analysis, the internal dependency structure is described with the terms head, complements and modifiers, where the head is the dominating node in the syntax tree of a phrase (chunk), complements are necessary qualifiers thereof, and modifiers are optional qualifiers. Consider the following example: “The shot by Christian Ziege goes over the goal.” The prepositional phrase “by Christian Ziege” (containing the named entity Christian Ziege) depends on (and modifies) the head noun “shot”.

Grammatical Functions Grammatical functions determine the role (function) of each of the linguistic chunks in the sentence and make it possible to identify the actors involved in certain events. For example, in the following sentence, the syntactic (and also the semantic) subject is the NP constituent “The shot by Christian Ziege”: “The shot by Christian Ziege goes over the goal.” This nominal phrase depends on (and complements) the verb “goes”, whereas the noun “shot” is the head of the NP (it is the shot going over the goal, and not Christian Ziege!).

Semantic Tagging Automatic semantic annotation has developed within language technology in recent years in connection with more integrated tasks like information extraction, which require a certain level of semantic analysis. Semantic tagging consists of annotating each content word in a document with a semantic category. Semantic categories are assigned on the basis of a semantic resource such as WordNet for English, or EuroWordNet, which links words between many European languages through a common inter-lingua of concepts.

Semantic Resources Semantic resources are captured in dictionaries, thesauri, and semantic networks, all of which express, either implicitly or explicitly, an ontology of the world in general or of more specific domains, such as medicine. They can be roughly distinguished into the following three groups: Thesauri: Semantic resources that group together similar words or terms according to a standard set of relations, including broader term, narrower term, sibling, etc. Semantic Lexicons: Semantic resources that group together words (or more complex lexical items) according to lexical semantic relations like synonymy, hyponymy, meronymy, and antonymy (like WordNet) Semantic Networks: Semantic resources that group together objects denoted by natural language expressions (terms) according to a set of relations that originate in the nature of the domain of application (like UMLS in the medical domain)

The WordNet Semantic Lexicon WordNet has primarily been designed as a computational account of the human capacity of linguistic categorization and covers an extensive set of semantic classes (called synsets). Synsets are collections of synonyms, grouping together lexical items according to meaning similarity. Synsets are actually not made up of lexical items, but rather of lexical meanings (i.e. senses)

WordNet: An Example The word 'tree' has two meanings that roughly correspond to the classes of plants and that of diagrams, each with their own hierarchy of classes that are included in more general super-classes:
09396070 tree 0
  09395329 woody_plant 0 ligneous_plant 0
    09378438 vascular_plant 0 tracheophyte 0
      00008864 plant 0 flora 0 plant_life 0
        00002086 life_form 0 organism 0 being 0 living_thing 0
          00001740 entity 0 something 0
10025462 tree 0 tree_diagram 0
  09987563 plane_figure 0 two-dimensional_figure 0
    09987377 figure 0
      00015185 shape 0 form 0
        00018604 attribute 0
          00013018 abstraction 0
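
The two hierarchies can be represented as a small lookup structure and walked upwards to find the super-classes of a word sense. This is only a toy: the synset IDs and words are copied from the listing above, the parent links assume each entry is the direct super-class of the one before it, and the names `SYNSETS`, `PARENT`, `senses`, `hypernym_path` and `is_a` are invented for the example (this is not the WordNet API).

```python
# Synset ID -> member words, copied from the listing.
SYNSETS = {
    "09396070": ["tree"],
    "09395329": ["woody_plant", "ligneous_plant"],
    "09378438": ["vascular_plant", "tracheophyte"],
    "00008864": ["plant", "flora", "plant_life"],
    "00002086": ["life_form", "organism", "being", "living_thing"],
    "00001740": ["entity", "something"],
    "10025462": ["tree", "tree_diagram"],
    "09987563": ["plane_figure", "two-dimensional_figure"],
    "09987377": ["figure"],
    "00015185": ["shape", "form"],
    "00018604": ["attribute"],
    "00013018": ["abstraction"],
}

# Synset ID -> ID of its direct super-class (assumed from the listing order).
PARENT = {
    "09396070": "09395329", "09395329": "09378438", "09378438": "00008864",
    "00008864": "00002086", "00002086": "00001740",
    "10025462": "09987563", "09987563": "09987377", "09987377": "00015185",
    "00015185": "00018604", "00018604": "00013018",
}

def senses(word):
    """All synsets that contain the word, one per meaning."""
    return [sid for sid, words in SYNSETS.items() if word in words]

def hypernym_path(synset_id):
    """The synset followed by all of its super-classes, most general last."""
    path = [synset_id]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def is_a(word, ancestor):
    """True if some sense of `word` has `ancestor` among its super-classes."""
    return any(ancestor in SYNSETS[sid]
               for sense in senses(word)
               for sid in hypernym_path(sense))

for sense in senses("tree"):
    print(" -> ".join(SYNSETS[sid][0] for sid in hypernym_path(sense)))
```

A lookup such as `is_a("tree", "plant")` is the kind of test a semantic tagger or a hypernym-based extraction pattern would rely on.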
