Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou
Overview The need for information processing Structured vs. unstructured data (text) The challenges of text Textual information processing technologies
The need for info processing Large amounts of data in electronic form Need for large scale & fast info processing Most information to be found in text
Types of Data Structured data Semi-structured data Unstructured, free-text data
Structured Data: e.g. Databases Title: Introduction to Information Retrieval Author: C.D.Manning, P.Raghavan, H.Schütze Doc type: Book Publisher: Cambridge University Press Pub date: 2008 Id: CM20B Location: Computer Science section Keywords: Information Retrieval, Indexing, …
Semi-Structured Data (e.g. XML) bios/bymholt.html Berend Bymholt socialistisch en anarchistisch publicist en auteur van de Geschiedenis der Arbeidersbeweging in Nederland 77...
Free-Text/ Unstructured data Bertelsmann 9-mth profit slips on start-up losses FRANKFURT, Nov 10 (Reuters) - Media conglomerate Bertelsmann posted a slight decline in nine-month operating profit due to start-up losses related to new businesses. Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year. It had cut its outlook in August due to costs for new projects and rising energy prices. Bertelsmann owns publishers Gruner + Jahr and Random House as well as European TV broadcaster RTL Group and Arvato, an outsourcing service provider. Operating earnings before interest and tax (EBIT) eased by 1.1 percent to 1.03 billion euros ($1.4 billion) in the first nine months of 2011, Bertelsmann said.
Data Mining analysis of structured data detection of unknown interesting patterns: groups of data records (cluster analysis) unusual records (anomaly detection) data dependencies (association rule mining)
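As a rough illustration of one of the pattern types listed above (unusual records, i.e. anomaly detection), the sketch below flags values that lie far from the mean of a numeric column. The z-score rule, the `threshold` value and the sample numbers are all assumptions chosen for illustration; the slides do not prescribe a particular method.

```python
from statistics import mean, stdev

def anomalies(values, threshold=2.0):
    """Return the values whose z-score exceeds the (assumed) threshold."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical column of a structured data set; 50 is the planted outlier.
values = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = anomalies(values)
print(outliers)
```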
Text Mining / Text Analytics analysis of text (semi-/unstructured data) detection of unknown, interesting information: group documents (classification/clustering) extract information (content descriptors, concepts of interest) associate/link information (e.g. concept relations) discover previously unknown facts
The challenges of text Full text understanding beyond current technology Human understanding based on context Context: text, but also world knowledge Text: ambiguity (syntactic, semantic, lexical, pragmatic)
[Diagram: overview of text processing technologies, running from UNSTRUCTURED DATA to STRUCTURED DATA. Processes: IR, IE, Summarisation (or Abstracting), (Indexing), ATR, Data Mining, Reasoning, etc… Resources: Data Bases, Thesauri, Lexicons, Ontologies, Gazetteers. Data: Doc Collection, Relevant Docs, Important Info, Relevant Info, Index Terms, Terminology, Structured Info (NE …, EVENT …), Derived Info.]
IR: Select relevant documents Query: “query term” Relevant: Documents containing the “term” Methods: Indexing or Automatic Term Recognition
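The retrieval idea on this slide (index the documents, then return those containing the query term) can be sketched with a small inverted index. This is a minimal sketch, not a full IR system: the tokeniser is deliberately naive, and the document ids and texts are hypothetical.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?()\"'"

def tokenize(text):
    """Lowercase the text, split on whitespace and strip punctuation."""
    tokens = (tok.strip(PUNCTUATION).lower() for tok in text.split())
    return [tok for tok in tokens if tok]

def build_index(docs):
    """Map each index term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query_term):
    """Return the sorted ids of the documents that contain the query term."""
    return sorted(index.get(query_term.lower(), set()))

# Hypothetical three-document collection.
docs = {
    "d1": "Bertelsmann posted a slight decline in operating profit.",
    "d2": "The Spaniard owns the dog.",
    "d3": "Operating profit is expected to decline slightly.",
}
index = build_index(docs)
print(search(index, "profit"))
```

The index is built once, so each query is a dictionary lookup rather than a scan over every document.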
Automatic Term Recognition supervised/unsupervised task Methods: rule-based, statistics-based, machine learning, hybrid Objective: detect words or phrases denoting specialised concepts, i.e. terms
ATR: example (C-value candidate terms and their variant forms)
trade union [trade union, Trades Union, …]
ernst papanek [Ernst Papanek]
new york [New York]
press clipping [Press clippings, press -clippings, …]
world war [world war, world wars, World Wars, …]
print material [printed materials, Printed material, …]
executive committee [executive committee, …]
communist party [Communist party, …]
second world war [Second World War, …]
spanish civil war [Spanish Civil War, …]
great britain [Great Britain, Great -Britain]
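The slide names C-value but does not show how it is computed. Assuming it refers to the commonly used C-value termhood measure, a candidate term a of |a| words and frequency f(a) scores log2|a| · f(a) when it is not nested in a longer candidate, and log2|a| · (f(a) − average frequency of the longer candidates containing it) otherwise. The sketch below follows that assumed definition; the frequencies are invented for illustration and are not the values behind the slide.

```python
import math

def contains(longer, shorter):
    """True if the word sequence `shorter` occurs inside the longer term."""
    lw, sw = longer.split(), shorter.split()
    return len(lw) > len(sw) and any(
        lw[i:i + len(sw)] == sw for i in range(len(lw) - len(sw) + 1)
    )

def c_value(term, freq):
    """C-value of `term`, given a dict of candidate-term frequencies."""
    length_weight = math.log2(len(term.split()))
    nesting = [t for t in freq if contains(t, term)]  # longer terms containing `term`
    if not nesting:
        return length_weight * freq[term]
    nested_avg = sum(freq[t] for t in nesting) / len(nesting)
    return length_weight * (freq[term] - nested_avg)

# Hypothetical frequencies of normalised candidate terms.
freq = {"world war": 10, "second world war": 4, "trade union": 7}
scores = {term: c_value(term, freq) for term in freq}
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{score:.2f}  {term}")
```

Here "world war" is penalised because four of its ten occurrences are accounted for by the longer candidate "second world war".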
Document clustering unsupervised task groups (“clusters”) and categories are not known in advance machine learning and statistics-based approaches Objective: group documents based on their content / semantic similarities
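A minimal sketch of content-based grouping without predefined categories: documents become bag-of-words vectors, and each document joins the first cluster whose seed document is similar enough under cosine similarity. The greedy single-pass scheme, the `threshold` of 0.3 and the sample texts are assumptions for illustration; real systems typically use methods such as k-means or hierarchical clustering.

```python
import math
from collections import Counter

def vector(text):
    """Bag-of-words vector: term -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(docs, threshold=0.3):
    """Greedy clustering: compare each document with the seed of each cluster."""
    vectors = [vector(d) for d in docs]
    clusters = []  # each cluster is a list of document indices, seed first
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found: start a new one
    return clusters

# Hypothetical mini-collection.
docs = [
    "operating profit declined slightly",
    "the spaniard owns the dog",
    "profit outlook declined",
    "the norwegian owns the fox",
]
clusters = cluster(docs)
print(clusters)
```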
Document classification supervised task we know the classes/categories use of machine learning, or statistics-based methods Objective: classify documents based on their content / semantics
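To contrast with clustering, the sketch below trains a classifier on documents whose categories are known. It uses multinomial Naive Bayes with add-one smoothing as one example of a statistics-based method; the choice of method, the two class labels and the training sentences are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """Collect class frequencies and per-class word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled_docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training set with two known categories.
training = [
    ("profit declined on losses", "finance"),
    ("operating earnings rose", "finance"),
    ("the dog chased the fox", "pets"),
    ("the horse and the zebra", "pets"),
]
model = train(training)
print(classify("earnings and profit declined", model))
```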
Summarisation or Abstracting Bertelsmann 9-mth profit slips on start-up losses FRANKFURT, Nov 10 (Reuters) - Media conglomerate Bertelsmann posted a slight decline in nine-month operating profit due to start-up losses related to new businesses. Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year. It had cut its outlook in August due to costs for new projects and rising energy prices. Bertelsmann owns publishers Gruner + Jahr and Random House as well as European TV broadcaster RTL Group and Arvato, an outsourcing service provider. Operating earnings before interest and tax (EBIT) eased by 1.1 percent to 1.03 billion euros ($1.4 billion) in the first nine months of 2011, Bertelsmann said.
Information Extraction supervised, or unsupervised/generic task Methods: rule-based, machine learning Objective: detect specific types of info in documents, e.g. names, events, relations
IE tasks Named Entity (NE): recognise entities/concepts of interest, e.g. persons, organisations, dates & times Co-reference (CO): recognise mentions of the same entity Template Relation (TR) & Scenario Template (ST): recognise relations among concepts, e.g. concept properties & entities involved in facts & events of interest
IE Tasks
Bertelsmann said operating earnings before interest and tax (EBIT) rose 35 percent to 215 million euros ($272.1 million) compared with 2005, and sales were up 17.3 percent at 4.5 billion euros. Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year.
NE types: ORGANISATION, PERCENT, DATE, AMOUNT
Examples: ORGANISATION=“Bertelsmann”, DATE=“ ”
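A rule-based named entity tagger for the four types on this slide can be sketched with a gazetteer lookup plus regular expressions. The one-entry gazetteer and the three patterns are hypothetical and tuned only to the first example sentence; they are not the rules of any particular IE system.

```python
import re

# Hypothetical one-entry gazetteer of organisation names.
GAZETTEER = {"Bertelsmann"}

# Hypothetical patterns, written for this example sentence only.
PATTERNS = [
    ("AMOUNT", re.compile(r"\d+(?:\.\d+)? (?:million|billion) euros")),
    ("PERCENT", re.compile(r"\d+(?:\.\d+)? percent")),
    ("DATE", re.compile(r"\b(?:19|20)\d{2}\b")),  # four-digit years only
]

def tag_entities(text):
    """Return (type, mention) pairs in order of appearance in the text."""
    found = []
    for name in GAZETTEER:
        for m in re.finditer(re.escape(name), text):
            found.append((m.start(), "ORGANISATION", m.group()))
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            found.append((m.start(), label, m.group()))
    return [(label, mention) for _, label, mention in sorted(found)]

text = (
    "Bertelsmann said operating earnings before interest and tax (EBIT) "
    "rose 35 percent to 215 million euros ($272.1 million) compared with 2005, "
    "and sales were up 17.3 percent at 4.5 billion euros."
)
entities = tag_entities(text)
for label, mention in entities:
    print(label, "=", mention)
```

The dollar figure is not tagged because the assumed AMOUNT pattern only covers euro amounts, which shows how quickly hand-written rules run into coverage limits.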
IE Tasks
Bertelsmann said operating earnings before interest and tax (EBIT) rose 35 percent to 215 million euros ($272.1 million) compared with 2005, and sales were up 17.3 percent at 4.5 billion euros. Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year.
SALES_of
Event_type: sales
Organisation_type: Company
Organisation_name: Bertelsmann
Sector: media
Sales_mode: increase
Sales_amount:
Currency: euros
Period: ??
Date: ??
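Scenario template filling can be sketched as a record type with one field per slot and a rule that fills the slots it can find. The `SalesEvent` class, the `SALES_RULE` pattern and the `fill_template` helper are hypothetical names introduced here; the rule reads the amount from the clause "sales were up 17.3 percent at 4.5 billion euros" and leaves the period and date slots empty, as the slide does.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class SalesEvent:
    """Hypothetical record mirroring some slots of the SALES_of template."""
    event_type: str = "sales"
    organisation_name: Optional[str] = None
    sales_mode: Optional[str] = None
    sales_amount: Optional[str] = None
    currency: Optional[str] = None
    period: Optional[str] = None  # left unfilled
    date: Optional[str] = None    # left unfilled

# Hypothetical extraction rule for clauses like "sales were up ... at N billion euros".
SALES_RULE = re.compile(
    r"sales were (?P<mode>up|down) .*? at "
    r"(?P<amount>\d+(?:\.\d+)? (?:million|billion)) (?P<currency>euros)"
)

def fill_template(text, organisation):
    """Fill the slots the rule can find; all other slots stay None."""
    event = SalesEvent(organisation_name=organisation)
    match = SALES_RULE.search(text)
    if match:
        event.sales_mode = "increase" if match.group("mode") == "up" else "decrease"
        event.sales_amount = match.group("amount")
        event.currency = match.group("currency")
    return event

text = (
    "Bertelsmann said operating earnings before interest and tax (EBIT) "
    "rose 35 percent to 215 million euros ($272.1 million) compared with 2005, "
    "and sales were up 17.3 percent at 4.5 billion euros."
)
event = fill_template(text, "Bertelsmann")
print(event)
```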
Sentiment analysis/Opinion mining Polarity classification (positive/negative) Objectivity/Subjectivity detection
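Polarity classification can be sketched with a simple lexicon count: positive cue words add to a score and negative cue words subtract from it. Both word lists are hypothetical, and the "neutral" outcome for a zero score is an added assumption; practical systems usually rely on much larger lexicons or trained classifiers.

```python
# Hypothetical cue-word lexicons, for illustration only.
POSITIVE = {"rose", "up", "gain", "growth", "strong"}
NEGATIVE = {"decline", "slips", "losses", "cut", "weak"}

def polarity(text):
    """Classify text as positive, negative or neutral by counting cue words."""
    tokens = [tok.strip(".,;:!?()").lower() for tok in text.split()]
    score = sum(tok in POSITIVE for tok in tokens) - sum(tok in NEGATIVE for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("Bertelsmann 9-mth profit slips on start-up losses"))
```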
Structured Data: Ontologies Structure of concepts: Entities (concepts, objects) Properties (concept properties) Relations (links between concepts) Domain specific relations, e.g., “has_capital” Objective: describe domain knowledge and reason about concepts & relations
Einstein's riddle We have five houses in a row. Each house is painted a different colour and has a single inhabitant. Each inhabitant is of a different nationality, drinks a different beverage, owns a different pet, and smokes a different brand of cigarettes. Source:
Einstein's riddle 1. There are five houses. 2. The Englishman lives in the red house. 3. The Spaniard owns the dog. 4. Coffee is drunk in the green house. 5. The Ukrainian drinks tea.
Einstein's riddle 6. The green house is immediately to the right of the ivory house. 7. The Old Gold smoker owns snails. 8. Kools are smoked in the yellow house. 9. Milk is drunk in the middle house. 10. The Norwegian lives in the first house.
Einstein's riddle 11. The man who smokes Chesterfields lives in the house next to the man with the fox. 12. Kools are smoked in a house next to the house where the horse is kept. 13. The Lucky Strike smoker drinks orange juice. 14. The Japanese smokes Parliaments. 15. The Norwegian lives next to the blue house.
Einstein's riddle Who drinks water? Who owns a zebra?
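The fifteen statements can be checked mechanically. The sketch below is a plain brute-force search, not an ontology reasoner: every attribute value is assigned a house position from 1 (leftmost) to 5, and assignments that violate a statement are pruned as early as possible. It assumes "first house" means the leftmost house and "to the right of" means position plus one; the function and variable names are chosen here for illustration.

```python
from itertools import permutations

HOUSES = (1, 2, 3, 4, 5)  # positions, left to right

def next_to(a, b):
    """True if two house positions are adjacent."""
    return abs(a - b) == 1

def solve():
    """Return (who drinks water, who owns the zebra)."""
    for red, green, ivory, yellow, blue in permutations(HOUSES):
        if green != ivory + 1:  # statement 6
            continue
        for english, spaniard, ukrainian, norwegian, japanese in permutations(HOUSES):
            # statements 2, 10, 15
            if english != red or norwegian != 1 or not next_to(norwegian, blue):
                continue
            for coffee, tea, milk, juice, water in permutations(HOUSES):
                # statements 4, 5, 9
                if coffee != green or tea != ukrainian or milk != 3:
                    continue
                for old_gold, kools, chesterfield, lucky, parliament in permutations(HOUSES):
                    # statements 8, 13, 14
                    if kools != yellow or lucky != juice or parliament != japanese:
                        continue
                    for dog, snails, fox, horse, zebra in permutations(HOUSES):
                        # statements 3, 7, 11, 12
                        if (dog == spaniard and snails == old_gold
                                and next_to(chesterfield, fox)
                                and next_to(kools, horse)):
                            names = {
                                english: "Englishman", spaniard: "Spaniard",
                                ukrainian: "Ukrainian", norwegian: "Norwegian",
                                japanese: "Japanese",
                            }
                            return names[water], names[zebra]
    return None

answer = solve()
print("Drinks water: %s; owns the zebra: %s" % answer)
```

An ontology-based approach would instead encode the statements as property assertions and let a reasoner derive the two missing facts.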
Ontology: hierarchical structure
Thing/Root: Inhabitant, Colour, Pet, Beverage, House
House: House-1, House-2, House-3, House-4, House-5
Inhabitant: Englishman, Spaniard, Japanese, Norwegian, Ukrainian
Colour: Red, Green, Blue, Ivory, Yellow
Pet: Dog, Horse, Snails, Fox, Zebra
Ontology “is-a” or taxonomic relationships
“Is-a” relationships denote the “kind” of a concept. But ontologies are more than taxonomic relationships!
Thing/Root: Inhabitant, Colour, Pet, Brand, Beverage, House
House: House-1, House-2, ...
Inhabitant: Englishman, Spaniard, ...
Colour: Red, Green, ...
Pet: Dog, Horse, ...
Ontology: properties
Thing/Root: Inhabitant, Colour, Pet, House, Brand, Beverage
Properties of House (e.g. House-1):
Has_colour: [Colour] (inverse, on Colour: Is_ColourOf: [House])
Has_inhabitant: [Inhabitant] (inverse, on Inhabitant: LivesIn: [House])
Is_rightTo: [House]
Ontology: properties
Thing/Root: Inhabitant, Colour, Pet, House, Brand, Beverage
Properties of Inhabitant (e.g. Spaniard):
LivesIn: [House] (inverse, on House: Has_inhabitant: [Inhabitant])
Has_pet: [Pet] (inverse, on Pet: Has_owner: [Inhabitant])
Drinks: [Beverage] (inverse, on Beverage: Drunk_by: [Inhabitant])
Uses_brand: [Brand] (inverse, on Brand: Used_by: [Inhabitant])
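The property declarations on these slides can be sketched as a small in-memory structure in which each property has a domain class, a range class and an inverse property. Reading the bracketed pairs as inverse properties is an assumption, as are the `Ontology` class, the `IS_A` table and the method names; a real ontology would normally be written in a language such as OWL and queried through a reasoner.

```python
from collections import defaultdict

# Property name -> (domain class, range class, inverse property name).
# Property names follow the slides; the triple layout is an assumption.
PROPERTIES = {
    "Has_colour": ("House", "Colour", "Is_ColourOf"),
    "Has_inhabitant": ("House", "Inhabitant", "LivesIn"),
    "Has_pet": ("Inhabitant", "Pet", "Has_owner"),
    "Drinks": ("Inhabitant", "Beverage", "Drunk_by"),
    "Uses_brand": ("Inhabitant", "Brand", "Used_by"),
}

# A few "is-a" links from the taxonomy (instance -> class).
IS_A = {
    "House-1": "House", "Spaniard": "Inhabitant", "Ukrainian": "Inhabitant",
    "Dog": "Pet", "Tea": "Beverage", "Red": "Colour",
}

class Ontology:
    """Stores property assertions and keeps inverse properties in sync."""

    def __init__(self):
        self.facts = defaultdict(set)  # (subject, property) -> set of objects

    def assert_fact(self, subject, prop, obj):
        domain, range_, inverse = PROPERTIES[prop]
        if IS_A[subject] != domain or IS_A[obj] != range_:
            raise ValueError(f"{prop} links {domain} to {range_}")
        self.facts[(subject, prop)].add(obj)
        self.facts[(obj, inverse)].add(subject)  # record the inverse direction

    def query(self, subject, prop):
        return sorted(self.facts.get((subject, prop), set()))

onto = Ontology()
onto.assert_fact("Spaniard", "Has_pet", "Dog")   # statement 3 of the riddle
onto.assert_fact("Ukrainian", "Drinks", "Tea")   # statement 5 of the riddle
print(onto.query("Dog", "Has_owner"))
```

Asserting one direction of a property is enough: the inverse fact is derived from the declaration, and the domain and range check rejects assertions such as a house that drinks tea.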