Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou

Overview
- The need for information processing
- Structured vs. unstructured data (text)
- The challenges of text
- Textual information processing technologies

The need for info processing
- Large amounts of data in electronic form
- Need for large-scale and fast info processing
- Most information is to be found in text

Types of Data
- Structured data
- Semi-structured data
- Unstructured, free-text data

Structured Data: e.g. Databases
Title: Introduction to Information Retrieval
Author: C.D. Manning, P. Raghavan, H. Schütze
Doc type: Book
Publisher: Cambridge University Press
Pub date: 2008
Id: CM20B
Location: Computer Science section
Keywords: Information Retrieval, Indexing, …

Semi-Structured Data (e.g. XML)
bios/bymholt.html
Berend Bymholt
socialist and anarchist publicist and author of the Geschiedenis der Arbeidersbeweging in Nederland
77...

Free-Text / Unstructured data
Bertelsmann 9-mth profit slips on start-up losses
FRANKFURT, Nov 10 (Reuters) - Media conglomerate Bertelsmann posted a slight decline in nine-month operating profit due to start-up losses related to new businesses. Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year. It had cut its outlook in August due to costs for new projects and rising energy prices. Bertelsmann owns publishers Gruner + Jahr and Random House as well as European TV broadcaster RTL Group and Arvato, an outsourcing service provider. Operating earnings before interest and tax (EBIT) eased by 1.1 percent to 1.03 billion euros ($1.4 billion) in the first nine months of 2011, Bertelsmann said.

Data Mining
- analysis of structured data
- detection of unknown interesting patterns:
  - groups of data records (cluster analysis)
  - unusual records (anomaly detection)
  - data dependencies (association rule mining)

Text Mining / Text Analytics
- analysis of text (semi-/unstructured data)
- detection of unknown, interesting information:
  - group documents (classification/clustering)
  - extract information (content descriptors, concepts of interest)
  - associate/link information (e.g. concept relations)
  - discover previously unknown facts

The challenges of text
- Full text understanding is beyond current technology
- Human understanding is based on context
- Context: text, but also world knowledge
- Text: ambiguity (syntactic, semantic, lexical, pragmatic)

[Diagram: text processing overview, from UNSTRUCTURED DATA to STRUCTURED DATA.
Processes: IR, IE, Summarisation (or Abstracting), Indexing, ATR, Data Mining, Reasoning, etc.
Resources: Doc Collection, Relevant Docs, Relevant Info, Important Info, Index Terms, Terminology, Structured Info (NE, EVENT, …), Derived Info, Data Bases (Thesauri, Lexicons, Ontologies, Gazetteers)]

IR: Select relevant documents
- Query: “query term”
- Relevant: documents containing the “term”
- Methods: Indexing or Automatic Term Recognition
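The indexing idea on this slide can be illustrated with a minimal inverted index. This is only a sketch under simple assumptions: whitespace tokenisation, lower-casing, and exact term matching. The function names `build_index` and `search` and the three toy documents are hypothetical, not part of the course material.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lower-cased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            # Strip surrounding punctuation so "profit." and "profit" match.
            index[token.strip(".,;:()\"'")].add(doc_id)
    return index


def search(index, term):
    """Return the sorted ids of the documents that contain the query term."""
    return sorted(index.get(term.lower(), set()))


# Toy collection (hypothetical documents, for illustration only).
docs = {
    "d1": "Bertelsmann posted a slight decline in operating profit.",
    "d2": "The Spaniard owns the dog.",
    "d3": "Operating profit is expected to decline slightly.",
}
index = build_index(docs)
print(search(index, "profit"))
```

A document counts as relevant here simply because it contains the query term; ranking, stemming and multi-term queries are deliberately left out.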

Automatic Term Recognition
- supervised/unsupervised task
- Methods: rule-based, statistics-based, machine learning, hybrid
- Objective: detect words or phrases denoting specialised concepts, i.e. terms

ATR: example
C-value              Candidate term
trade union          [trade union, Trades Union, …]
ernst papanek        [Ernst Papanek]
new york             [New York]
press clipping       [Press clippings, press -clippings, …]
world war            [world war, world wars, World Wars, …]
print material       [printed materials, Printed material, …]
executive committee  [executive committee, …]
communist party      [Communist party, …]
second world war     [Second World War, …]
spanish civil war    [Spanish Civil War, …]
great britain        [Great Britain, Great -Britain]
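The slide names C-value without giving its definition. As a rough sketch, the commonly cited form of the measure (Frantzi, Ananiadou and Mima) scores a multi-word candidate by log2 of its length times its frequency, discounted by the average frequency of the longer candidates that contain it. The frequencies below are invented for illustration, and the published method also adjusts counts of nested terms, which this simplified version does not do.

```python
import math


def is_nested(short, long):
    """True if `short` occurs as a contiguous word sequence inside the longer `long`."""
    n = len(short)
    return len(long) > n and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def c_value(term, freq):
    """Simplified C-value of `term`, given a dict: candidate (tuple of words) -> frequency."""
    containing = [f for other, f in freq.items() if is_nested(term, other)]
    adjusted = freq[term]
    if containing:
        # Discount by the mean frequency of the longer candidates containing the term.
        adjusted -= sum(containing) / len(containing)
    return math.log2(len(term)) * adjusted


# Hypothetical frequencies (not taken from the slide).
freq = {
    ("world", "war"): 10,
    ("second", "world", "war"): 4,
    ("trade", "union"): 7,
}
for term in freq:
    print(" ".join(term), round(c_value(term, freq), 2))
```

Here "world war" is penalised because four of its ten occurrences are explained by the longer candidate "second world war".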

Document clustering
- unsupervised task: the “clusters” (groups, categories) are unknown
- machine learning and statistics-based approaches
- Objective: group documents based on their content / semantic similarities
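One very simple way to group documents by content similarity is single-pass threshold clustering over bag-of-words vectors with cosine similarity. This is a swapped-in illustrative technique, not necessarily the method the course uses; the threshold of 0.3 and the toy texts are assumptions.

```python
import math
from collections import Counter


def vectorise(text):
    """Bag-of-words vector: token -> count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(texts, threshold=0.3):
    """Assign each text to the first cluster whose seed document is similar enough."""
    vectors = [vectorise(t) for t in texts]
    clusters = []  # each cluster is a list of text indices; the first one is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found: start a new one
    return clusters


texts = [
    "operating profit declined slightly",
    "the norwegian lives in the first house",
    "operating profit rose this year",
    "the englishman lives in the red house",
]
print(cluster(texts))
```

No categories are given in advance: the two groups emerge only from word overlap, which is what makes the task unsupervised.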

Document classification
- supervised task: we know the classes/categories
- use of machine learning or statistics-based methods
- Objective: classify documents based on their content / semantics
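As an illustration of the supervised setting, here is a minimal multinomial Naive Bayes classifier with add-one smoothing. It is one possible statistics-based method, chosen for brevity; the labels "finance" and "sport" and the training sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def classify(text, model):
    """Return the class with the highest log-probability for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)  # class prior
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    ("operating profit declined", "finance"),
    ("sales rose 17 percent", "finance"),
    ("the team won the match", "sport"),
    ("the striker scored a goal", "sport"),
]
model = train(training)
print(classify("profit and sales rose", model))
```

Unlike clustering, the categories are fixed by the labelled training examples before any new document is seen.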

Summarisation or Abstracting
Bertelsmann 9-mth profit slips on start-up losses
FRANKFURT, Nov 10 (Reuters) - Media conglomerate Bertelsmann posted a slight decline in nine-month operating profit due to start-up losses related to new businesses. Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year. It had cut its outlook in August due to costs for new projects and rising energy prices. Bertelsmann owns publishers Gruner + Jahr and Random House as well as European TV broadcaster RTL Group and Arvato, an outsourcing service provider. Operating earnings before interest and tax (EBIT) eased by 1.1 percent to 1.03 billion euros ($1.4 billion) in the first nine months of 2011, Bertelsmann said.

Information Extraction
- supervised, or unsupervised/generic task
- Methods: rule-based, machine learning
- Objective: detect specific types of info in documents, e.g. names, events, relations

IE tasks
- Named Entity (NE): recognise entities/concepts of interest, e.g. persons, organisations, dates & times
- Co-reference (CO): recognise mentions of the same entity
- Template Relation (TR) & Scenario Template (ST): recognise relations among concepts, e.g. concept properties & entities involved in facts & events of interest

IE Tasks
Bertelsmann said operating earnings before interest and tax (EBIT) rose 35 percent to 215 million euros ($272.1 million) compared with 2005, and sales were up 17.3 percent at 4.5 billion euros. Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year.
Entity types: ORGANISATION, PERCENT, DATE, AMOUNT
ORGANISATION=“Bertelsmann”, DATE=“ ”
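A rule-based named entity recogniser for the entity types on this slide can be sketched with regular expressions plus a tiny gazetteer. The patterns and the one-entry organisation gazetteer are assumptions made for this example; a real system would need far broader rules or a trained model.

```python
import re

# Hypothetical hand-written rules, one per entity type.
PATTERNS = {
    "ORGANISATION": re.compile(r"\bBertelsmann\b"),  # one-entry gazetteer
    "PERCENT": re.compile(r"\b\d+(?:\.\d+)?\s+percent\b"),
    "AMOUNT": re.compile(r"\b\d+(?:\.\d+)?\s+(?:million|billion)\s+euros\b"),
    "DATE": re.compile(r"\b(?:19|20)\d{2}\b"),  # four-digit years only
}


def extract_entities(text):
    """Return (label, matched text) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(found)]


text = (
    "Bertelsmann said operating earnings before interest and tax (EBIT) "
    "rose 35 percent to 215 million euros ($272.1 million) compared with 2005, "
    "and sales were up 17.3 percent at 4.5 billion euros."
)
for label, value in extract_entities(text):
    print(label, "=", value)
```

The dollar figure is not matched because the AMOUNT rule only covers euro amounts, which shows how quickly hand-written rules run into coverage gaps.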

IE Tasks
Bertelsmann said operating earnings before interest and tax (EBIT) rose 35 percent to 215 million euros ($272.1 million) compared with 2005, and sales were up 17.3 percent at 4.5 billion euros. Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year.
SALES_of
Event_type: sales
Organisation_type: Company
Organisation_name: Bertelsmann
Sector: media
Sales_mode: increase
Sales_amount:
Currency: euros
Period: ??
Date: ??
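Filling a scenario template like the one above can be sketched as a single extraction rule that maps a matched sentence onto a record. The `SalesEvent` fields loosely mirror the slide's slot names, while the rule, the `unit` field and the function name are hypothetical; the organisation is passed in by the caller because resolving it from the text is a separate task.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SalesEvent:
    """A simplified sales template; unfilled slots stay None."""
    event_type: str = "sales"
    organisation_name: Optional[str] = None
    sales_mode: Optional[str] = None
    sales_amount: Optional[float] = None
    unit: Optional[str] = None
    currency: Optional[str] = None


# One hand-written rule (an assumption for this example).
SALES_RULE = re.compile(
    r"sales were (?P<mode>up|down) [\d.]+ percent at "
    r"(?P<amount>\d+(?:\.\d+)?) (?P<unit>million|billion) (?P<currency>euros)"
)


def fill_sales_template(text, organisation):
    """Return a SalesEvent if the rule matches the text, else None."""
    match = SALES_RULE.search(text)
    if match is None:
        return None
    return SalesEvent(
        organisation_name=organisation,
        sales_mode="increase" if match.group("mode") == "up" else "decrease",
        sales_amount=float(match.group("amount")),
        unit=match.group("unit"),
        currency=match.group("currency"),
    )


sentence = "Bertelsmann said sales were up 17.3 percent at 4.5 billion euros."
event = fill_sales_template(sentence, "Bertelsmann")
print(event)
```

Slots the rule cannot fill, such as the period or date, would simply remain empty, much like the "??" entries in the template.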

Sentiment analysis / Opinion mining
- Polarity classification (positive/negative)
- Objectivity/Subjectivity detection
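Polarity classification can be sketched in its simplest form as a lexicon lookup: count positive and negative cue words and compare. The two word lists below are tiny hypothetical lexicons; practical systems rely on large lexicons or trained classifiers and must handle negation and context.

```python
# Hypothetical cue-word lexicons (illustration only).
POSITIVE = {"good", "great", "rose", "gain", "excellent"}
NEGATIVE = {"bad", "poor", "decline", "losses", "slips"}


def polarity(text):
    """Classify text as positive, negative or neutral by counting cue words."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity("Profit slips on start-up losses"))
print(polarity("Sales rose, a great result!"))
```

A text with no cue words comes out as "neutral", which is only a crude stand-in for objectivity detection.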

Structured Data: e.g. Databases
Title: Introduction to Information Retrieval
Author: C.D. Manning, P. Raghavan, H. Schütze
Doc type: Book
Publisher: Cambridge University Press
Pub date: 2008
Id: CM20B
Location: Computer Science section
Keywords: Information Retrieval, Indexing, …

Structured Data: Ontologies
- Structure of concepts:
  - Entities (concepts, objects)
  - Properties (concept properties)
  - Relations (links between concepts)
  - Domain-specific relations, e.g. “has_capital”
- Objective: describe domain knowledge and reason about concepts & relations
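Entities and relations such as “has_capital” are often stored as subject-relation-object triples. The following is a minimal sketch of that idea; the triples and the helper names `objects` and `subjects` are chosen for illustration and are not a real ontology language or library.

```python
# Each fact is a (subject, relation, object) triple.
triples = {
    ("Netherlands", "is_a", "Country"),
    ("Amsterdam", "is_a", "City"),
    ("Netherlands", "has_capital", "Amsterdam"),
}


def objects(subject, relation):
    """All objects linked to `subject` by `relation`."""
    return {o for s, r, o in triples if s == subject and r == relation}


def subjects(relation, obj):
    """All subjects linked to `obj` by `relation` (the relation read backwards)."""
    return {s for s, r, o in triples if r == relation and o == obj}


print(objects("Netherlands", "has_capital"))
print(subjects("has_capital", "Amsterdam"))
```

Querying the same triple in both directions is the simplest form of reasoning over relations: the capital of a country, and the country a city is capital of.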

Einstein's riddle
We have five houses in a row:
- each house is painted a different colour,
- each house has a single inhabitant.
Each inhabitant:
- is of a different nationality,
- drinks a different beverage,
- owns a different pet,
- smokes a different brand of cigarettes.

Einstein's riddle
1. There are five houses.
2. The Englishman lives in the red house.
3. The Spaniard owns the dog.
4. Coffee is drunk in the green house.
5. The Ukrainian drinks tea.

Einstein's riddle
6. The green house is immediately to the right of the ivory house.
7. The Old Gold smoker owns snails.
8. Kools are smoked in the yellow house.
9. Milk is drunk in the middle house.
10. The Norwegian lives in the first house.

Einstein's riddle
11. The man who smokes Chesterfields lives in the house next to the man with the fox.
12. Kools are smoked in a house next to the house where the horse is kept.
13. The Lucky Strike smoker drinks orange juice.
14. The Japanese smokes Parliaments.
15. The Norwegian lives next to the blue house.

Einstein's riddle
- Who drinks water?
- Who owns a zebra?
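The fifteen clues can be checked mechanically. The sketch below is a plain brute-force search over permutations, not an ontology reasoner: each name is bound to a house position, and branches are pruned as soon as a clue fails. It assumes houses are numbered 0 to 4 from left to right, so "first" is 0, "middle" is 2, and "immediately to the right" means position plus one.

```python
from itertools import permutations


def next_to(a, b):
    """True if two house positions are adjacent."""
    return abs(a - b) == 1


def solve():
    """Return (water drinker, zebra owner) for the clues 1-15 above."""
    orderings = list(permutations(range(5)))  # clue 1: five houses, 0 = first
    for red, green, ivory, yellow, blue in orderings:
        if green != ivory + 1:                       # clue 6
            continue
        for english, spaniard, ukrainian, norwegian, japanese in orderings:
            if english != red:                       # clue 2
                continue
            if norwegian != 0:                       # clue 10
                continue
            if not next_to(norwegian, blue):         # clue 15
                continue
            for coffee, tea, milk, juice, water in orderings:
                if coffee != green:                  # clue 4
                    continue
                if ukrainian != tea:                 # clue 5
                    continue
                if milk != 2:                        # clue 9
                    continue
                for old_gold, kools, chesterfield, lucky, parliament in orderings:
                    if kools != yellow:              # clue 8
                        continue
                    if lucky != juice:               # clue 13
                        continue
                    if japanese != parliament:       # clue 14
                        continue
                    for dog, snails, fox, horse, zebra in orderings:
                        if spaniard != dog:          # clue 3
                            continue
                        if old_gold != snails:       # clue 7
                            continue
                        if not next_to(chesterfield, fox):  # clue 11
                            continue
                        if not next_to(kools, horse):       # clue 12
                            continue
                        who = {
                            english: "Englishman", spaniard: "Spaniard",
                            ukrainian: "Ukrainian", norwegian: "Norwegian",
                            japanese: "Japanese",
                        }
                        return who[water], who[zebra]
    return None


water_drinker, zebra_owner = solve()
print("Water:", water_drinker, "| Zebra:", zebra_owner)
```

Checking the cheapest clues in the outer loops keeps the search small, so the answer is found almost instantly despite the nominal 120^5 combinations.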

Ontology: hierarchical structure
Thing/Root
- House: House-1, House-2, House-3, House-4, House-5
- Inhabitant: Englishman, Spaniard, Japanese, Norwegian, Ukrainian
- Colour: Red, Green, Blue, Ivory, Yellow
- Pet: Dog, Horse, Snails, Fox, Zebra
- Beverage

Ontology
- “is-a” or taxonomic relationships denote the “kind” of a concept
- But ontologies are more than taxonomic relationships!
Thing/Root
- House: House-1, House-2, ...
- Inhabitant: Englishman, Spaniard, ...
- Colour: Red, Green, ...
- Pet: Dog, Horse, ...
- Brand
- Beverage
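The “is-a” relationships can be sketched as a child-to-parent map, with the “kind” of a concept found by walking up to the root. Only part of the riddle hierarchy is included, and the names `PARENT`, `ancestors` and `is_a` are hypothetical helpers for this example.

```python
# Child -> parent ("is-a") links for part of the riddle hierarchy.
PARENT = {
    "House": "Thing", "Inhabitant": "Thing", "Colour": "Thing", "Pet": "Thing",
    "Englishman": "Inhabitant", "Spaniard": "Inhabitant",
    "Red": "Colour", "Green": "Colour",
    "Dog": "Pet", "Horse": "Pet",
}


def ancestors(concept):
    """Follow is-a links upwards and return the chain of more general concepts."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def is_a(concept, kind):
    """True if `kind` is reachable from `concept` via is-a links."""
    return kind in ancestors(concept)


print(ancestors("Dog"))
print(is_a("Spaniard", "Inhabitant"), is_a("Spaniard", "Pet"))
```

A taxonomy like this can say that a dog is a pet, but not who owns it; that is where properties and relations come in.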

Ontology: properties
Classes under Thing/Root: House, Inhabitant, Colour, Pet, Brand, Beverage
House (e.g. House-1):
- Has_colour: [Colour] (Colour > Is_ColourOf: [House])
- Has_inhabitant: [Inhabitant] (Inhabitant > LivesIn: [House])
- Is_rightTo: [House]

Ontology: properties
Classes under Thing/Root: House, Inhabitant, Colour, Pet, Brand, Beverage
Inhabitant (e.g. Spaniard):
- LivesIn: [House] (House > Has_inhabitant: [Inhabitant])
- Has_pet: [Pet] (Pet > Has_owner: [Inhabitant])
- Drinks: [Beverage] (Beverage > Drunk_by: [Inhabitant])
- Uses_brand: [Brand] (Brand > Used_by: [Inhabitant])
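The paired property names on these slides (LivesIn with Has_inhabitant, Has_pet with Has_owner, and so on) read as properties that mirror each other. Assuming that reading, a minimal sketch can record the mirrored fact automatically whenever one direction is asserted. The `INVERSE` table and `assert_fact` helper are hypothetical, and the two example facts come from clues 3 and 10 of the riddle.

```python
from collections import defaultdict

# Property pairs taken from the slides, read here as mutual inverses (an assumption).
INVERSE = {
    "LivesIn": "Has_inhabitant",
    "Has_pet": "Has_owner",
    "Drinks": "Drunk_by",
    "Uses_brand": "Used_by",
    "Has_colour": "Is_ColourOf",
}
# Make the table symmetric so either direction can be asserted.
INVERSE.update({v: k for k, v in list(INVERSE.items())})

facts = defaultdict(set)  # (subject, property) -> set of objects


def assert_fact(subject, prop, obj):
    """Store a fact and, if the property has an inverse, the mirrored fact too."""
    facts[(subject, prop)].add(obj)
    inverse = INVERSE.get(prop)
    if inverse:
        facts[(obj, inverse)].add(subject)


assert_fact("Spaniard", "Has_pet", "Dog")        # clue 3
assert_fact("Norwegian", "LivesIn", "House-1")   # clue 10
print(facts[("Dog", "Has_owner")])
print(facts[("House-1", "Has_inhabitant")])
```

Stating a fact once and deriving its mirror image is a small instance of the reasoning that property definitions in an ontology make possible.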