Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou

1 Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl

2 Overview
- The need for information processing
- Structured vs. unstructured data (text)
- The challenges of text
- Textual information processing technologies

3 The need for info processing
- Large amounts of data in electronic form
- Need for large-scale, fast information processing
- Most information is to be found in text

4 Types of Data
- Structured data
- Semi-structured data
- Unstructured, free-text data

5 Structured Data: e.g. Databases
Title: Introduction to Information Retrieval
Author: C.D.Manning, P.Raghavan, H.Schütze
Doc type: Book
Publisher: Cambridge University Press
Pub date: 2008
Id: CM20B
Location: Computer Science section
Keywords: Information Retrieval, Indexing, …

6 Semi-Structured Data (e.g. XML)
[XML markup lost in extraction; surviving field values:]
bios/bymholt.html
Berend Bymholt
1864 07-09
1947 05-27
socialist and anarchist publicist and author of the Geschiedenis der Arbeidersbeweging in Nederland
77...

7 Free-Text / Unstructured data
Bertelsmann 9-mth profit slips on start-up losses
FRANKFURT, Nov 10 (Reuters) - Media conglomerate Bertelsmann posted a slight decline in nine-month operating profit due to start-up losses related to new businesses. Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year. It had cut its outlook in August due to costs for new projects and rising energy prices. Bertelsmann owns publishers Gruner + Jahr and Random House as well as European TV broadcaster RTL Group and Arvato, an outsourcing service provider. Operating earnings before interest and tax (EBIT) eased by 1.1 percent to 1.03 billion euros ($1.4 billion) in the first nine months of 2011, Bertelsmann said.

8 Data Mining
- Analysis of structured data
- Detection of unknown, interesting patterns:
  - groups of data records (cluster analysis)
  - unusual records (anomaly detection)
  - data dependencies (association rule mining)

9 Text Mining / Text Analytics
- Analysis of text (semi-/unstructured data)
- Detection of unknown, interesting information:
  - group documents (classification/clustering)
  - extract information (content descriptors, concepts of interest)
  - associate/link information (e.g. concept relations)
  - discover previously unknown facts

10 The challenges of text
- Full text understanding is beyond current technology
- Human understanding is based on context
- Context: the text, but also world knowledge
- Text: ambiguity (syntactic, semantic, lexical, pragmatic)

11 [Diagram: overview of text processing technologies, from unstructured data to structured data. Legend: Process, Resource. Labels: Doc Collection; IR; Relevant Docs; IE; Relevant Info; Summarisation (or Abstracting); Important Info; (Indexing); Index Terms; ATR; Terminology; Structured Info (NE …, EVENT …); Data Bases - Thesauri - Lexicons - Ontologies - Gazetteers; Data Mining, Reasoning, etc…; Derived Info]

12 IR: Select relevant documents
- Query: “query term”
- Relevant: documents containing the “term”
- Methods: Indexing or Automatic Term Recognition
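
The retrieval step on this slide can be sketched with a tiny inverted index. This is a minimal illustration, not the course's implementation: the function names `build_index` and `search`, the whitespace tokenisation, and the three sample documents are all assumptions made for the example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased token to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            # Strip surrounding punctuation so "profit." and "profit" match.
            index[token.strip(".,;:!?\"'()")].add(doc_id)
    return index

def search(index, term):
    """Return the sorted ids of documents that contain the query term."""
    return sorted(index.get(term.lower(), set()))

# Hypothetical mini-collection, loosely based on sentences from the slides.
docs = {
    "d1": "Bertelsmann posted a slight decline in operating profit.",
    "d2": "The Spaniard owns the dog.",
    "d3": "Operating profit is expected to decline slightly.",
}
index = build_index(docs)
print(search(index, "profit"))
```

A document counts as relevant here only if it contains the exact query token; stemming, ranking, and multi-word terms are left out on purpose.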

13 Automatic Term Recognition
- Supervised/unsupervised task
- Methods: rule-based, statistics-based, machine learning, hybrid
- Objective: detect words or phrases denoting specialised concepts, i.e. terms

14 ATR: example
C-value     Candidate term       Variant forms
338.13958   trade union          [trade union, Trades Union,…]
213.127     ernst papanek        [Ernst Papanek]
200.55471   new york             [New York]
143.48147   press clipping       [Press clippings, press -clippings,…]
139.07053   world war            [world war, world wars, World Wars,…]
134.47055   print material       [printed materials, Printed material,…]
131.19386   executive committee  [executive committee, …]
124.91502   communist party      [Communist party,…]
94.48066    second world war     [Second World War, …]
91.18482    spanish civil war    [Spanish Civil War, …]
90.80228    great britain        [Great Britain, Great -Britain]
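
The slide shows C-value scores but not how they are computed. The sketch below assumes the commonly cited definition by Frantzi, Ananiadou and Mima: for a multi-word candidate a with frequency f(a), C-value(a) = log2|a| · f(a) if a is not nested in a longer candidate, and log2|a| · (f(a) − mean frequency of the longer candidates containing a) otherwise. The frequencies are hypothetical and are not the counts behind the table above.

```python
import math

def contains(longer, shorter):
    """True if the token tuple `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

def c_values(freq):
    """Compute C-value for candidate terms given as {token tuple: frequency}."""
    scores = {}
    for term, f in freq.items():
        # Frequencies of longer candidates in which this term is nested.
        nesting = [f_b for other, f_b in freq.items()
                   if len(other) > len(term) and contains(other, term)]
        if nesting:
            # Discount occurrences that are explained by the longer terms.
            f = f - sum(nesting) / len(nesting)
        scores[term] = math.log2(len(term)) * f
    return scores

# Hypothetical frequencies, chosen only to show the nesting discount.
freq = {
    ("world", "war"): 10,
    ("second", "world", "war"): 4,
    ("trade", "union"): 7,
}
scores = c_values(freq)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:8.3f}  {' '.join(term)}")
```

Note that "world war" is penalised because some of its occurrences belong to "second world war", which mirrors the relationship between those two rows in the table.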

15 Document clustering
- Unsupervised task: the “clusters” (groups, categories) are unknown
- Machine learning and statistics-based approaches
- Objective: group documents based on their content / semantic similarities
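
As a rough illustration of unsupervised grouping, the sketch below uses single-pass ("leader") clustering over bag-of-words vectors with cosine similarity. The slides do not name a specific algorithm, so this choice, the threshold of 0.3, and the sample texts are all assumptions.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term-frequency vector for a text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(texts, threshold=0.3):
    """Single-pass clustering: join the first cluster whose leader is similar enough."""
    clusters = []  # each cluster is a list of indices; its first member is the leader
    vectors = [vectorize(t) for t in texts]
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no similar leader found: start a new cluster
    return clusters

# Hypothetical documents; no category labels are given to the algorithm.
texts = [
    "operating profit declined slightly",
    "the spaniard owns the dog",
    "profit expected to decline slightly",
    "the norwegian owns the fox",
]
print(cluster(texts))
```

The number of clusters is not fixed in advance; it falls out of the threshold, which is the main design trade-off of this simple scheme.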

16 Document classification
- Supervised task: we know the classes/categories
- Use of machine learning or statistics-based methods
- Objective: classify documents based on their content / semantics
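
For contrast with clustering, here is a minimal supervised sketch: a multinomial naive Bayes classifier with add-one smoothing. The slides only say "machine learning, or statistics-based methods", so naive Bayes is a swapped-in example, and the two classes and training texts are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """Collect per-class document counts and word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(model, text):
    """Return the class with the highest log-probability under naive Bayes."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token in vocab:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                count = word_counts[label][token] + 1
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labelled training data: the classes are known in advance.
training = [
    ("profit sales earnings euros", "finance"),
    ("operating profit decline", "finance"),
    ("dog horse fox zebra", "pets"),
    ("the spaniard owns the dog", "pets"),
]
model = train(training)
print(classify(model, "sales and profit rose"))
print(classify(model, "who owns the zebra"))
```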

17 [Diagram: overview of text processing technologies, repeated from slide 11]

18 Summarisation or Abstracting
Bertelsmann 9-mth profit slips on start-up losses
FRANKFURT, Nov 10 (Reuters) - Media conglomerate Bertelsmann posted a slight decline in nine-month operating profit due to start-up losses related to new businesses. Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year. It had cut its outlook in August due to costs for new projects and rising energy prices. Bertelsmann owns publishers Gruner + Jahr and Random House as well as European TV broadcaster RTL Group and Arvato, an outsourcing service provider. Operating earnings before interest and tax (EBIT) eased by 1.1 percent to 1.03 billion euros ($1.4 billion) in the first nine months of 2011, Bertelsmann said.

19 Information Extraction
- Supervised, or unsupervised/generic task
- Methods: rule-based, machine learning
- Objective: detect specific types of info in documents, e.g. names, events, relations

20 IE tasks
- Named Entity (NE): recognise entities/concepts of interest, e.g. persons, organisations, dates & times
- Co-reference (CO): recognise mentions of the same entity
- Template Relation (TR) & Scenario Template (ST): recognise relations among concepts, e.g. concept properties & entities involved in facts & events of interest

21 IE Tasks
Bertelsmann said operating earnings before interest and tax (EBIT) rose 35 percent to 215 million euros ($272.1 million) compared with 2005, and sales were up 17.3 percent at 4.5 billion euros. Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year.
Annotation labels: ORGANISATION, PERCENT, DATE, AMOUNT
Resolved values: ORGANISATION=“Bertelsmann”, DATE=“2011-11-10”
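
A rule-based named entity recogniser for the first sentence on this slide might look like the sketch below. The regular expressions are illustrative assumptions written for this one sentence, and the single-name ORGANISATION pattern merely stands in for a gazetteer lookup; real systems use far richer rules or trained models.

```python
import re

# Illustrative patterns only, keyed by the entity labels used on the slide.
PATTERNS = {
    "PERCENT": r"\d+(?:\.\d+)? percent",
    "AMOUNT": r"\$?\d+(?:\.\d+)? (?:million|billion)(?: euros)?",
    "DATE": r"\b(?:19|20)\d{2}\b",
    "ORGANISATION": r"\bBertelsmann\b",  # stands in for a gazetteer lookup
}

def extract_entities(text):
    """Return (label, matched text) pairs sorted by position in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            found.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(found)]

text = ("Bertelsmann said operating earnings before interest and tax (EBIT) "
        "rose 35 percent to 215 million euros ($272.1 million) compared with 2005, "
        "and sales were up 17.3 percent at 4.5 billion euros.")
entities = extract_entities(text)
for label, value in entities:
    print(f"{label:13} {value}")
```

Resolving "Europe's largest media group" to Bertelsmann, or "Thursday" to a calendar date, is a co-reference and normalisation problem that patterns like these do not address.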

22 IE Tasks
Bertelsmann said operating earnings before interest and tax (EBIT) rose 35 percent to 215 million euros ($272.1 million) compared with 2005, and sales were up 17.3 percent at 4.5 billion euros. Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year.
SALES_of
Event_type: sales
Organisation_type: Company
Organisation_name: Bertelsmann
Sector: media
Sales_mode: increase
Sales_amount: 4.500.000.000
Currency: euros
Period: ??
Date: ??
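
Filling part of the SALES_of template from the sentence can be sketched with one hand-written pattern. The `SalesEvent` class, its field names, and the pattern are hypothetical; fields the pattern cannot determine stay empty, just as Period and Date are marked "??" on the slide.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class SalesEvent:
    """A cut-down version of the slide's template (hypothetical field names)."""
    organisation_name: Optional[str] = None
    sales_mode: Optional[str] = None
    sales_amount: Optional[int] = None
    currency: Optional[str] = None
    period: Optional[str] = None  # not recoverable from this sentence

MULTIPLIER = {"million": 10**6, "billion": 10**9}

def fill_sales_template(text):
    """Fill a SalesEvent from one hand-written pattern; None if it does not match."""
    match = re.search(
        r"sales were (up|down) [\d.]+ percent at ([\d.]+) (million|billion) (\w+)",
        text,
    )
    if match is None:
        return None
    direction, number, scale, currency = match.groups()
    organisation = re.match(r"[A-Z]\w+", text)  # naive: first capitalised word
    return SalesEvent(
        organisation_name=organisation.group() if organisation else None,
        sales_mode="increase" if direction == "up" else "decrease",
        sales_amount=round(float(number) * MULTIPLIER[scale]),
        currency=currency,
    )

text = ("Bertelsmann said operating earnings before interest and tax (EBIT) "
        "rose 35 percent to 215 million euros ($272.1 million) compared with 2005, "
        "and sales were up 17.3 percent at 4.5 billion euros.")
event = fill_sales_template(text)
print(event)
```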

23 Sentiment analysis / Opinion mining
- Polarity classification (positive/negative)
- Objectivity/Subjectivity detection
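
Both tasks on this slide can be illustrated with a lexicon lookup. This is a deliberately naive sketch: the two word lists are tiny hand-made assumptions, and negation, intensity, and context are ignored.

```python
# Tiny hand-made lexicons, hypothetical and far from complete.
POSITIVE = {"good", "great", "excellent", "love", "enjoyable"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "boring"}

def tokens(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    return [t.strip(".,;:!?\"'") for t in text.lower().split()]

def polarity(text):
    """Classify a text as 'positive', 'negative', or 'neutral' by counting lexicon hits."""
    words = tokens(text)
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def is_subjective(text):
    """Treat a text as subjective if it contains any opinion word from the lexicons."""
    return any(w in POSITIVE or w in NEGATIVE for w in tokens(text))

print(polarity("The lecture was great and the examples were excellent."))
print(polarity("A boring, terrible film."))
print(is_subjective("Bertelsmann owns Random House."))
```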

24 [Diagram: overview of text processing technologies, repeated from slide 11]

25 Structured Data: e.g. Databases
Title: Introduction to Information Retrieval
Author: C.D.Manning, P.Raghavan, H.Schütze
Doc type: Book
Publisher: Cambridge University Press
Pub date: 2008
Id: CM20B
Location: Computer Science section
Keywords: Information Retrieval, Indexing, …

26 Structured Data: Ontologies
Structure of concepts:
- Entities (concepts, objects)
- Properties (concept properties)
- Relations (links between concepts)
- Domain-specific relations, e.g. “has_capital”
Objective:
- describe domain knowledge and reason about concepts & relations
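
The ingredients listed on this slide (entities, relations, and reasoning) can be sketched as a toy triple store with a transitive "is-a" check. The `Ontology` class and its method names are assumptions for illustration; only the relation name `has_capital` comes from the slide, and the example facts are added here.

```python
from collections import defaultdict

class Ontology:
    """A toy store of (subject, relation, object) facts plus an is-a hierarchy."""

    def __init__(self):
        self.facts = set()
        self.parents = defaultdict(set)

    def add_is_a(self, concept, parent):
        """Record a taxonomic link: `concept` is a kind of `parent`."""
        self.parents[concept].add(parent)

    def add_fact(self, subject, relation, obj):
        """Record a domain-specific relation between two entities."""
        self.facts.add((subject, relation, obj))

    def is_a(self, concept, ancestor):
        """Simple reasoning step: follow is-a links transitively."""
        stack, seen = [concept], set()
        while stack:
            current = stack.pop()
            if current == ancestor:
                return True
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents.get(current, ()))
        return False

    def objects(self, subject, relation):
        """Return all objects linked to `subject` by `relation`, sorted."""
        return sorted(o for s, r, o in self.facts if s == subject and r == relation)

# Example content chosen for illustration; it is not part of the slides.
onto = Ontology()
onto.add_is_a("Country", "Entity")
onto.add_is_a("City", "Entity")
onto.add_is_a("Netherlands", "Country")
onto.add_is_a("Amsterdam", "City")
onto.add_fact("Netherlands", "has_capital", "Amsterdam")
print(onto.is_a("Amsterdam", "Entity"))
print(onto.objects("Netherlands", "has_capital"))
```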

27 Einstein's riddle
We have five houses in a row:
- each house is painted a different colour,
- each house has a single inhabitant.
Each inhabitant:
- is of a different nationality,
- drinks a different beverage,
- owns a different pet,
- smokes a different brand of cigarettes.
Source: http://en.wikipedia.org/wiki/Zebra_puzzle

28 Einstein's riddle
1. There are five houses.
2. The Englishman lives in the red house.
3. The Spaniard owns the dog.
4. Coffee is drunk in the green house.
5. The Ukrainian drinks tea.
Source: http://en.wikipedia.org/wiki/Zebra_puzzle

29 Einstein's riddle
6. The green house is immediately to the right of the ivory house.
7. The Old Gold smoker owns snails.
8. Kools are smoked in the yellow house.
9. Milk is drunk in the middle house.
10. The Norwegian lives in the first house.
Source: http://en.wikipedia.org/wiki/Zebra_puzzle

30 Einstein's riddle
11. The man who smokes Chesterfields lives in the house next to the man with the fox.
12. Kools are smoked in a house next to the house where the horse is kept.
13. The Lucky Strike smoker drinks orange juice.
14. The Japanese smokes Parliaments.
15. The Norwegian lives next to the blue house.
Source: http://en.wikipedia.org/wiki/Zebra_puzzle

31 Einstein's riddle
- Who drinks water?
- Who owns a zebra?
Source: http://en.wikipedia.org/wiki/Zebra_puzzle
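
The fifteen clues can be checked mechanically. The sketch below is a plain brute-force search over house positions, pruned clue by clue; it is not the ontology-based reasoning the following slides build towards, and the function names and the 0..4 position encoding are assumptions.

```python
from itertools import permutations

def next_to(a, b):
    """True if houses a and b are adjacent."""
    return abs(a - b) == 1

def solve():
    """Return (water drinker, zebra owner) by searching house positions 0..4."""
    houses = range(5)  # position 0 is the first (leftmost) house
    first, middle = 0, 2
    for red, green, ivory, yellow, blue in permutations(houses):
        if green - ivory != 1:  # clue 6
            continue
        for english, spaniard, ukrainian, norwegian, japanese in permutations(houses):
            # clues 2, 10, 15
            if english != red or norwegian != first or not next_to(norwegian, blue):
                continue
            for coffee, tea, milk, juice, water in permutations(houses):
                # clues 4, 5, 9
                if coffee != green or tea != ukrainian or milk != middle:
                    continue
                for old_gold, kools, chesterfields, lucky, parliaments in permutations(houses):
                    # clues 8, 13, 14
                    if kools != yellow or lucky != juice or parliaments != japanese:
                        continue
                    for dog, snails, fox, horse, zebra in permutations(houses):
                        # clues 3, 7, 11, 12
                        if (dog == spaniard and snails == old_gold
                                and next_to(chesterfields, fox)
                                and next_to(kools, horse)):
                            owner = {english: "Englishman", spaniard: "Spaniard",
                                     ukrainian: "Ukrainian", norwegian: "Norwegian",
                                     japanese: "Japanese"}
                            return owner[water], owner[zebra]
    return None

print(solve())
```

Each variable holds the position of the house with that attribute, so a clue such as "the Englishman lives in the red house" becomes the equality `english == red`. Checking clues as early as possible keeps the search small.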

32 Ontology: hierarchical structure
Thing/Root
- House: House-1, House-2, House-3, House-4, House-5
- Inhabitant: Englishman, Spaniard, Japanese, Norwegian, Ukrainian
- Colour: Red, Green, Blue, Ivory, Yellow
- Pet: Dog, Horse, Snails, Fox, Zebra
- Beverage

33 Ontology
- “is-a” or taxonomic relationships denote the “kind” of a concept
- But ontologies are more than taxonomic relationships!
Thing/Root
- House: House-1, House-2, ...
- Inhabitant: Englishman, Spaniard, ...
- Colour: Red, Green, ...
- Pet: Dog, Horse, ...
- Brand
- Beverage

34 Ontology: properties
Classes under Thing/Root: Inhabitant, Colour, Pet, House, Brand, Beverage
Properties of House-1 (a House):
- Has_colour: [Colour] (Colour > Is_ColourOf: [House])
- Has_inhabitant: [Inhabitant] (Inhabitant > LivesIn: [House])
- Is_rightTo: [House]

35 Ontology: properties
Classes under Thing/Root: Inhabitant, Colour, Pet, House, Brand, Beverage
Properties of Spaniard (an Inhabitant):
- LivesIn: [House] (House > Has_inhabitant: [Inhabitant])
- Has_pet: [Pet] (Pet > Has_owner: [Inhabitant])
- Drinks: [Beverage] (Beverage > Drunk_by: [Inhabitant])
- Uses_brand: [Brand] (Brand > Used_by: [Inhabitant])
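
The property pairs on the two properties slides read like inverse relations: asserting one direction implies the other. The sketch below assumes that reading. The `Individuals` class is hypothetical, and the house number used in the example is an arbitrary illustrative choice.

```python
from collections import defaultdict

# Property name -> inverse property name, as paired on the properties slides.
INVERSE = {
    "Has_colour": "Is_ColourOf",
    "Has_inhabitant": "LivesIn",
    "Has_pet": "Has_owner",
    "Drinks": "Drunk_by",
    "Uses_brand": "Used_by",
}
# Make the table symmetric so either direction can be asserted.
INVERSE.update({inv: prop for prop, inv in list(INVERSE.items())})

class Individuals:
    """Stores property values and keeps each inverse property in sync."""

    def __init__(self):
        self.values = defaultdict(dict)

    def assert_property(self, subject, prop, value):
        """Set subject.prop = value and, if known, the inverse on `value`."""
        self.values[subject][prop] = value
        inverse = INVERSE.get(prop)
        if inverse is not None:
            self.values[value][inverse] = subject

    def get(self, subject, prop):
        """Return the stored value, or None if nothing was asserted."""
        return self.values.get(subject, {}).get(prop)

kb = Individuals()
kb.assert_property("Englishman", "LivesIn", "House-3")  # illustrative house number
kb.assert_property("House-3", "Has_colour", "Red")
kb.assert_property("Spaniard", "Has_pet", "Dog")
print(kb.get("House-3", "Has_inhabitant"))
print(kb.get("Dog", "Has_owner"))
print(kb.get("Red", "Is_ColourOf"))
```

Is_rightTo is left out of the table because the slides give no inverse for it.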

