1 Alexander Gelbukh Moscow, Russia. 2 Mexico 3 Computing Research Center (CIC), Mexico.

Presentation transcript:

1 Alexander Gelbukh Moscow, Russia

2 Mexico

3 Computing Research Center (CIC), Mexico

4 Chung-Ang University, Korea Electronic Commerce and Internet Application Lab

5 Special Topics in Computer Science The Art of Information Retrieval Alexander Gelbukh

6 Information Retrieval
In a huge amount of poorly structured information, find the information that you need when you don't know exactly what you need or can't explain it.
The Web; user information need; ranking

7

8

9 Information Retrieval
In a huge amount of poorly structured information, find the information that you need when you don't know exactly what you need or can't explain it.
The Web; user information need; ranking

10 Importance
Knowledge: the main treasure of man
Web: Repository? Cemetery of information!
Natural language and multimedia information
  o Poorly structured, badly written
Corporate and organizational document bases
  o Senate speeches: Mexico
  o Medical data collections
  o Corporate memory. Microsoft knowledge base
Future: data explosion → increasing importance

11 Perspectives
Corporations: corporate databases
Organizations: document bases
Government
  o European Union: multilingual problem
  o The same in Asia
Academia
  o Lots of open research topics
  o Web topics
  o Computational Linguistics topics
  o Intelligent technologies, AI

12 Textbook

13 Contents
1. Introduction
2. Modeling
3. Retrieval Evaluation
4. Query Languages
5. Query Operations
6. Text and Multimedia Languages and Properties
7. Text Operations
8. Indexing and Searching
9. Parallel and Distributed IR
10. User Interfaces and Visualization
11. Multimedia IR: Models and Languages
12. Multimedia IR: Indexing and Searching
13. Searching the Web
14. Libraries and Bibliographical Systems
15. Digital Libraries

14 Calendar
1. September 18: Chapter 1, Introduction
2. September 25: Chapter 2, Modeling
3. October 2: Chapter 3, Retrieval Evaluation
4. October 9: Chapter 4, Query Languages
5. October 16: Chapter 5, Query Operations
October 23: midterm exam
6. October 30: Chapter 6, Text and Multimedia Languages...
7. November 6: Chapter 7, Text Operations
8. November 13: Chapter 8, Indexing and Searching
9. November 20: Chapter 10, User Interfaces and Visualization
10. November 27: Chapter 13, Searching the Web
11. December 4: Chapter 14, Libraries and Bibliographical Systems
12. December 11: Chapter 15, Digital Libraries
December: final exam

15 Class structure
Main course: Information Retrieval
Discussion of previous chapter. Questions
I briefly present a new chapter
Research seminar: Natural Language Processing
Discussion of previous paper. Questions.
  o Identification of possible research topics
Presentation of a new paper or current work
Discussion and questions
Goal: publications!

16 Natural Language Processing Research Seminar

17 What CL is about
Computers to process natural language text: understand, generate, search, organize, translate, …
Useful in IR

18 Methods
No: text as a stream of letters
  o Brute force statistics
  o Simplified heuristics (e.g., Porter)
Yes: attention to language rules
  o Linguistically motivated approaches
  o Knowledge-based approaches
  o Corpus-based approaches

19 What IR is about
Classical IR: find words? Concepts!
Question answering
Summarization
Clustering
…
Take language seriously

20 Text representations for IR
Represent the retrieval features
  o Strings → stems (lexemes), synsets, phrases
  o Women → woman, lady, female
  o Old men and women → old woman
Structured representation of text
  o Network of related events and entities
  o Enables logical inference
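The mapping from strings to lemmas and synsets on this slide can be illustrated with a minimal Python sketch. The `LEMMAS` and `SYNSETS` tables and the synset labels are hypothetical toy data invented for this example; a real system would draw them from a morphological dictionary and a resource such as WordNet.

```python
# Toy lookup tables (hypothetical data, for illustration only).
LEMMAS = {"women": "woman", "men": "man", "ladies": "lady"}
SYNSETS = {
    "woman": "FEMALE_PERSON",
    "lady": "FEMALE_PERSON",
    "female": "FEMALE_PERSON",
    "man": "MALE_PERSON",
}


def features(text):
    """Map each token to its lemma, then to a synset label when one is known."""
    feats = set()
    for token in text.lower().split():
        lemma = LEMMAS.get(token, token)
        feats.add(SYNSETS.get(lemma, lemma))
    return feats


def matches(query, text):
    """A text matches when it contains every retrieval feature of the query."""
    return features(query) <= features(text)


print(matches("female", "old men and women"))  # concept match
print("female" in "old men and women".split())  # plain string match
```

With these tables the concept-level match succeeds where the plain string comparison fails, which is the point of indexing features rather than raw strings.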

21 CL tasks useful in IR
Morphology (stemming)
POS / Word sense disambiguation
Word relatedness
Anaphora resolution
Parsing and semantics (phrase search)
Synonymic rephrasing
Translation
etc…
Each one a whole science in itself

22 Morphology
Q: pig  T: piggish
Simple: stemming
  o piggish → pig-
Lexeme: set of word forms
  o Same stem can give different words
  o pigment is not pig; piny → pine, not pin
Dictionary/corpus-based methods
  o Learning; dictionary management
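A minimal Python sketch of the contrast drawn on this slide, assuming a hypothetical suffix list and a toy lexicon (neither is from the lecture, and the stripper is not the Porter algorithm): blind suffix stripping conflates pigment with pig and piny with pin, while a dictionary lookup keeps the lexemes apart.

```python
# Hypothetical suffix list and toy lexicon, for illustration only.
SUFFIXES = ["gish", "ment", "y", "s"]

LEXICON = {
    "piggish": "pig",
    "pigs": "pig",
    "pigment": "pigment",
    "piny": "pine",
    "pins": "pin",
}


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemma(word):
    """Dictionary lookup, falling back to the naive stemmer for unknown words."""
    return LEXICON.get(word, naive_stem(word))


for w in ["piggish", "pigment", "piny"]:
    print(w, naive_stem(w), lemma(w))
```

The fallback to `naive_stem` is one possible design choice: the dictionary handles known words exactly, and the heuristic still covers out-of-vocabulary words.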

23 Part of Speech Disambiguation
Q: oil well  T: He did very well
Q: what is an are?  T: They are nice
Important for English, Chinese. Less important for other language types
Perhaps not so helpful directly, but necessary for most other tasks
Usually statistical / heuristic methods

24 Word Sense Disambiguation
Q: bank account  T: on the beautiful banks of the Han River...
bill: document, banknote, law, ax, peak, Gates...
Very frequent: affects almost any word in text
Statistical & dictionary methods
International competitions
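One dictionary method can be sketched as a simplified Lesk procedure: choose the sense whose gloss shares the most words with the context. The glosses, sense labels, and stopword list below are hypothetical toy data written for this example, not entries from a real dictionary.

```python
STOPWORDS = {"a", "an", "the", "of", "on", "in", "that", "and", "to", "or"}

# Hypothetical sense inventory with made-up glosses.
SENSES = {
    "bank": {
        "financial institution":
            "an institution that keeps money in an account and lends money",
        "river side":
            "the sloping land on the side of a river or lake",
    }
}


def content_words(text):
    """Lowercase the tokens, strip punctuation, and drop stopwords."""
    return {w.strip(".,?!").lower() for w in text.split()} - STOPWORDS


def simplified_lesk(word, context):
    """Return the sense whose gloss overlaps most with the context words."""
    context_words = content_words(context) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context_words & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(simplified_lesk("bank", "bank account"))
print(simplified_lesk("bank", "on the beautiful banks of the Han river"))
```

Ties are resolved in favor of the first sense listed, which is an arbitrary choice of this sketch.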

25 Word relatedness
Q: female  T: woman (women)
  o Synonyms. Subtypes/super-types
  o Dictionaries. WordNet. Similarity. Lesk.
Q: Korea  T: Seoul
  o Other linguistic relationships (e.g., part)
  o Real-world relationships (facts)
Q: Clinton  T: Lewinsky
  o Statistical co-occurrence (MI)
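Statistical co-occurrence can be sketched with pointwise mutual information computed over document-level counts, PMI(a, b) = log2(P(a, b) / (P(a) P(b))). This is one common reading of "MI", assumed here; the four-document collection is invented for illustration.

```python
import math


def pmi(a, b, documents):
    """Pointwise mutual information of two words, counted per document."""
    doc_sets = [set(d.lower().split()) for d in documents]
    n = len(doc_sets)
    count_a = sum(a in s for s in doc_sets)
    count_b = sum(b in s for s in doc_sets)
    count_ab = sum(a in s and b in s for s in doc_sets)
    if count_ab == 0:
        return float("-inf")  # the words never co-occur
    # log2( (count_ab / n) / ((count_a / n) * (count_b / n)) )
    return math.log2(count_ab * n / (count_a * count_b))


# Hypothetical toy collection.
docs = [
    "clinton met lewinsky",
    "clinton lewinsky scandal",
    "clinton signed the law",
    "seoul is the capital of korea",
]

print(pmi("clinton", "lewinsky", docs))
print(pmi("clinton", "the", docs))
print(pmi("clinton", "seoul", docs))
```

A positive score means the pair co-occurs more often than independence would predict; on a collection this small the numbers are only indicative.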

26 Anaphora resolution
Q: Awards of Prof. Han  T: Prof. Han said... He did... IBM awarded him...
  o Frequency
  o Phrases, co-occurrence, summarization, inference, translation
Heuristic (Mitkov) and knowledge-based methods
Other types of co-reference

27 Parsing, semantics
Q: Awards of Prof. Han
T1: Prof. Han among many other prizes has several IBM awards
T2: Mr. Kang has an award Prof. Han does not know of
Understanding of text
  o Rich structured representation
Better phrase search; question answering, summarization,...

28 Synonymic rephrasing, reasoning
Q: experienced computer scientists
T: Prof. Han has been programming for many years and was awarded an IBM award
Requires good syntactic and semantic analysis
Knowledge-based methods

29 Multilingual access
Q:  T: We sell excellent yoghurt. Продаем йогурт. Se vende rico yogur.
  o Search multilingual collections
Europe: dozens of official languages of the EU
  o If you don't know how to say it in English
Dictionaries, bilingual corpora,...

30 Tasks are entangled
Many CL tasks require other tasks
  o Morphology → syntax → semantics
Many CL tasks form circles
  o parsing → WSD → parsing
  o I see a wild cat with a telescope (tripod?)
Can be done quick-and-dirty (?)
  o Fighting for the last percentage points
  o Zipf law: 20% of men drink 80% of beer

31 Tools and infrastructure
Analysis tools
  o Tasks, methods
Dictionaries and grammars
  o Types, structure
  o Automatic acquisition
Corpora
  o Corpora analysis tools and methods

32 Possible tasks
WSD to help IR
Clustering + summarization in IR results
Anaphora and coreference resolution to help IR
Multilingual IR
Applications to Korean
... a lot of others

33 Reading
Textbooks
  o Manning & Schütze, Allen, Jurafsky, Hausser,...
CICLing proceedings
Computational Linguistics
Google, ResearchIndex

34 Questions
Who expects to publish?
Who will make a presentation at the next seminar?

35 Thank you! Till September 18