Overview of Information Retrieval (CS598-CXZ Advanced Topics in IR Presentation) Jan. 18, 2005 ChengXiang Zhai Department of Computer Science University.

Presentation transcript:

Overview of Information Retrieval (CS598-CXZ Advanced Topics in IR Presentation) Jan. 18, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

What is Information Retrieval (IR)?
Narrow sense:
– IR = search engine technologies (IR = Google, library info system)
– IR = text matching/classification
Broad sense: IR = text information management
– General problem: how to manage text information?
– How to find useful information? (information retrieval) (e.g., Google)
– How to organize information? (text classification) (e.g., automatically assign to different folders)
– How to discover knowledge from text? (text mining) (e.g., discover correlation of events)

Why is IR Important?
More and more online information in general (information overload)
Many tasks rely on effective management and exploitation of information
Textual information plays an important role in our lives
Effective text management directly improves productivity

Elements of Text Info Management Technologies
[Diagram; labels only]
Components: Text; Natural Language Content Analysis; Search; Filtering; Categorization; Summarization; Clustering; Extraction; Mining; Visualization
Groupings: Information Access; Information Organization; Knowledge Acquisition
Applications: Retrieval Applications; Mining Applications

A Quick Tour of the State of the Art….

Component Technology 1: Natural Language Processing

What is NLP?
… يَجِبُ عَلَى الإنْسَانِ أن يَكُونَ أمِيْنَاً وَصَادِقَاً مَعَ نَفْسِهِ وَمَعَ أَهْلِهِ وَجِيْرَانِهِ وَأَنْ يَبْذُلَ كُلَّ جُهْدٍ فِي إِعْلاءِ شَأْنِ الوَطَنِ وَأَنْ يَعْمَلَ عَلَى مَا … (Arabic text)
How can a computer make sense out of this string?
– What are the basic units of meaning (words)?
– What is the meaning of each word?
– How are words related with each other?
– What is the “combined meaning” of words?
– What is the “meta-meaning”? (speech act)
– Handling a large chunk of text
– Making sense of everything
Levels of analysis: Morphology, Syntax, Semantics, Pragmatics, Discourse, Inference

An Example of NLP
Sentence: A dog is chasing a boy on the playground
Lexical analysis (part-of-speech tagging): Det Noun Aux Verb Det Noun Prep Det Noun
Syntactic analysis (parsing): Noun Phrase [A dog], Complex Verb [is chasing], Noun Phrase [a boy], Prep Phrase [on the playground]; these combine into Verb Phrase and Sentence
Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).
Inference: the rule Scared(x) if Chasing(_,x,_), applied to the facts above, gives Scared(b1)
Pragmatic analysis (speech act): A person saying this may be reminding another person to get the dog back…
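The inference step on this slide can be sketched in a few lines of Python. This is only an illustration: the tuple encoding of facts and the function name infer_scared are assumptions made here, not part of the original slides, and the facts are typed in by hand rather than produced by a real semantic analyzer.

```python
# Facts from the slide's semantic analysis, encoded (by assumption) as
# (predicate, arguments) tuples.
facts = {
    ("Dog", ("d1",)),
    ("Boy", ("b1",)),
    ("Playground", ("p1",)),
    ("Chasing", ("d1", "b1", "p1")),
}


def infer_scared(facts):
    """Apply the single rule from the slide: Scared(x) if Chasing(_, x, _)."""
    derived = set()
    for predicate, args in facts:
        if predicate == "Chasing" and len(args) == 3:
            _, chased, _ = args  # only the second argument matters to the rule
            derived.add(("Scared", (chased,)))
    return derived


new_facts = infer_scared(facts)
print(new_facts)
```

Hard-coding one rule keeps the sketch short; a general system would need a rule language and a matching procedure, which is where the slide's point about inference being hard comes in.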

What We Can Do in NLP
Sentence: A dog is chasing a boy on the playground
POS tagging (Det Noun Aux Verb Det Noun Prep Det Noun): 97%
Parsing (Noun Phrase, Complex Verb, Prep Phrase, Verb Phrase, Sentence): partial, >90% (?)
Semantics: some aspects
– Entity/relation extraction
– Word sense disambiguation
– Anaphora resolution
Speech act analysis: ???
Inference: ???

What We Can’t Do in NLP
100% POS tagging
– “He turned off the highway.” vs. “He turned off the fan.”
General complete parsing
– “A man saw a boy with a telescope.”
Deep semantic analysis
– Will we ever be able to precisely define the meaning of “own” in “John owns a restaurant.”?
Robust & general NLP tends to be “shallow” …
“Deep” understanding doesn’t scale up …

Component Technology 2: Search (ad hoc retrieval)

What is Search (Ad hoc IR)?
[Diagram] A user sends the query “robotics applications” to a retrieval system, which searches a database/collection of text docs and separates relevant docs (robotics) from non-relevant docs (others).

What We Can Do in Search
Search in a pure text collection is well studied
– Many different methods
– Equally effective when optimized
Basic search techniques (e.g., vector space, probabilistic models) are good enough for commercialization
– All implementing TF-IDF style heuristics
– Some new models have more potential for further optimization
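As a rough illustration of the “TF-IDF style heuristics” mentioned above, here is a minimal vector-space ranking sketch in Python. The whitespace tokenizer, the smoothed IDF formula, the function names, and the three toy documents are all assumptions chosen for brevity; real systems differ in weighting, length normalization, and indexing.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; no stemming or stopwords).
    return text.lower().split()


def build_index(docs):
    """Return per-document term counts and document frequencies."""
    term_counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())
    return term_counts, doc_freq


def tfidf_vector(counts, doc_freq, n_docs):
    # Smoothed IDF, log((N + 1) / (df + 1)) + 1: one of several common variants.
    return {
        t: tf * (math.log((n_docs + 1) / (doc_freq.get(t, 0) + 1)) + 1.0)
        for t, tf in counts.items()
    }


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (score, doc_index) pairs, best match first."""
    term_counts, doc_freq = build_index(docs)
    n = len(docs)
    q_vec = tfidf_vector(Counter(tokenize(query)), doc_freq, n)
    scores = [
        (cosine(q_vec, tfidf_vector(c, doc_freq, n)), i)
        for i, c in enumerate(term_counts)
    ]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


docs = [
    "robotics applications in manufacturing",
    "cooking recipes for busy people",
    "applications of robotics in surgery and robotics research",
]
ranking = rank("robotics applications", docs)
print(ranking)
```

The document that shares no terms with the query gets a score of zero and ends up last; the relative order of the two robotics documents depends on the weighting details, which is the kind of parameter sensitivity the slides return to later.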

What We Can’t Do in Search
Basic retrieval models
– No single model is the best on all test collections
– Automatic parameter optimization
Lack of interactive search support
Lack of personalization
Search context modeling
Retrieval with more than pure text
– With structures
– Multi-media

Component Technology 3: Information Filtering

What is Information Filtering?
Stable & long-term interest, dynamic info source
System must make a delivery decision immediately as a document “arrives”
[Diagram labels: Filtering System; my interest: …]
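The “decide immediately as each document arrives” setting can be sketched as a generator over a stream. Everything specific here is an assumption for illustration: the interest profile is a plain set of terms, the score is the fraction of a document's tokens that appear in the profile, and the threshold of 0.2 is arbitrary. Adaptive filtering systems would also update the profile and threshold from user feedback, which this sketch omits.

```python
def score(profile, text):
    """Fraction of the document's tokens that appear in the interest profile."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in profile) / len(tokens)


def filter_stream(profile, stream, threshold=0.2):
    """Yield (doc, delivered) pairs, deciding on each document as it arrives."""
    for doc in stream:
        yield doc, score(profile, doc) >= threshold


# A stable, long-term interest (hypothetical terms) and a small document stream.
profile = {"robotics", "robot", "automation"}
stream = iter([
    "new robot improves factory automation",
    "stock markets close higher today",
    "robotics startup raises funding",
])

decisions = [delivered for _, delivered in filter_stream(profile, stream)]
print(decisions)
```

Using a generator makes the key constraint visible: the decision for a document is made without seeing anything that arrives after it.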

State of the Art: Filtering
Content-based adaptive filtering
– Basic techniques, though not perfect, are there
– We haven’t seen many (any?) filtering applications
Collaborative filtering (recommender systems)
– Simple methods can be (are being) commercialized
– Real applications exist
– More applications are possible

Component Technology 4: Text Categorization

What is Text Categorization?
Pre-given categories and labeled document examples (categories may form a hierarchy)
Classify new documents
A standard supervised learning problem
[Diagram labels: Categorization System; categories: Sports, Business, Education, Science, …]
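To make the supervised setup concrete, here is a small multinomial Naive Bayes sketch with add-one smoothing. Naive Bayes is chosen here only because it fits in a few lines; it is not a method the slides single out. The four training sentences, the category labels attached to them, and the function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab


def classify(text, model):
    """Return the class with the highest log-posterior under Naive Bayes."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_logp = None, float("-inf")
    for label in sorted(class_counts):
        logp = math.log(class_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for w in text.lower().split():
            if w not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing over the vocabulary.
            logp += math.log(
                (word_counts[label][w] + 1) / (total_words + len(vocab))
            )
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Toy labeled examples (hypothetical).
training = [
    ("the team won the match", "Sports"),
    ("the striker scored a goal", "Sports"),
    ("shares fell as the market closed", "Business"),
    ("the company reported strong profit", "Business"),
]
model = train(training)
print(classify("the team scored", model))
print(classify("market profit rose", model))
```

The pre-given categories and labeled examples from the slide correspond to the training list; classifying a new document is the call to classify.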

State of the Art: Categorization
Many supervised learning methods have been developed
– SVM is often the best in performance
– Other methods are also competitive
– Commercial applications exist, but not at a large scale
– More applications can be developed
Feature selection/extraction is often more important than the choice of the learning algorithm
Applications have been developed
Relatively well explored

Component Technology 5: Clustering

The Clustering Problem
Discover “natural structure”
Group similar objects together
Objects can be documents, terms, or passages
[Example figure]
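One simple way to “group similar objects together” is single-pass (leader) clustering, sketched below. The choice of algorithm, the Jaccard similarity on token sets, the 0.25 threshold, and the toy documents are all assumptions for illustration; the slides do not prescribe a particular method, and the result of this one depends on the order in which documents are seen.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leader_cluster(docs, threshold=0.25):
    """Assign each document to the first cluster whose leader is similar enough."""
    leaders = []   # token set of the first document in each cluster
    clusters = []  # lists of document indices
    for i, text in enumerate(docs):
        tokens = set(text.lower().split())
        for k, leader in enumerate(leaders):
            if jaccard(tokens, leader) >= threshold:
                clusters[k].append(i)
                break
        else:
            # No existing leader is close enough: start a new cluster.
            leaders.append(tokens)
            clusters.append([i])
    return clusters


docs = [
    "robot arm control",
    "stock market prices",
    "robot arm design",
    "stock market crash",
]
clusters = leader_cluster(docs)
print(clusters)
```

The similarity function and threshold act as the clustering bias: changing either one changes which structure counts as “natural”.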

State of the Art: Clustering
Many methods have been developed, applicable in different situations
Difficult to predict which method is the best
When patterns are clear, most methods work well
In difficult situations
– Special clustering bias must be incorporated
– Properties of clustering methods need to be considered

End of State of the Art Tour…

Where is IR Going?
IR and related areas
Current trends
How would this course fit into the picture?

Related Areas
[Diagram; labels only]
Fields: Information Retrieval; Databases; Library & Info Science; Machine Learning; Pattern Recognition; Data Mining; Natural Language Processing; Statistics; Optimization; Software engineering; Computer systems; Applications (Web, Bioinformatics…)
Layers: Models; Algorithms; Applications; Systems

Current Trends
[Same diagram as Related Areas, annotated with trends]
– Web/Bioinformatics/…
– Literature/Digital Library
– Structured + Unstructured Data
– Human-Computer Interactions
– High-Performance Computing
– More Powerful Content Analysis
– More Principled Models/Algorithms

Publications/Societies
Info Retrieval: ACM SIGIR, ACM CIKM, TREC
Databases: ACM SIGMOD, VLDB, PODS, ICDE
Info. Science: ASIS, JCDL
Learning/Mining: ICML, NIPS, UAI, AAAI, ACM SIGKDD
NLP: ACL, COLING, EMNLP, ANLP, HLT
Applications: RECOMB, PSB, ISMB, WWW
Statistics: ??
Software/systems: ??

Let Users Lead the Way…
The underlying driving force has always been real-world applications
The ultimate impact of research in IR is to benefit people in accessing and using information in the real world
Research on many component technologies is reaching a stage of “diminishing returns”; the challenge is how to make use of such imperfect techniques
Think more about complete solutions (as opposed to component technologies) as well as new applications

How Would This Course Fit into the Picture?
Identify novel application problems
Identify new research topics
Examine existing research work in these directions
Design and carry out new projects in some of the directions
We will broadly look at 3 application domains: Web, , and Literature