Overview of IR Research ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.


What is Information Retrieval (IR)?
Salton’s definition (Salton 68): “information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information”
  – Information: mostly text, but can be anything (e.g., multimedia)
  – Retrieval:
      Narrow sense: search/querying
      Broad sense: filtering, classification, summarization, ...
In more general terms
  – Information access
  – Information seeking
  – Help people manage and make use of all kinds of information

Who is working on IR? (IR and Related Areas)
[Diagram: Information Retrieval at the center, surrounded by related areas]
Related areas: Databases; Library & Info Science; Machine Learning; Pattern Recognition; Data Mining; Natural Language Processing; Statistics; Optimization; Software engineering; Computer systems; Human-Computer Interaction; Computer Vision
Applications: Web, Bioinformatics…
Dimension labels: Models; Algorithms; Applications; Systems

IR and NLP
The two fields were closely related from day one, but became somewhat disconnected later, when NLP focused more on cognitive and symbolic approaches while IR focused more on purely statistical approaches
More recently, the two fields have regained close interaction
  – More complex retrieval tasks (question answering, opinions)
  – More scalable/robust NLP techniques (parsing, extraction)
IR researchers pioneered statistical approaches to NLP in the 1950s (e.g., H. P. Luhn); these became popular among NLP researchers only in the 1990s

IR and Databases
“Sibling” fields, but they didn’t get along well with each other
IR and DB share many common tasks, but the differences in the form of data and the nature of queries were large enough to keep the two fields separate for most of their history
Major differences in data, user, query, and what counts as an answer: DB emphasizes efficiency; IR emphasizes effectiveness
The two fields are now getting closer and closer (DB researchers have realized the importance of the 80% of data that is unstructured, and IR researchers have realized the importance of semantic search)

IR and Machine Learning
IR as a subfield of AI (IR = intelligent text access)?
  – AI is too big to have a coherent community (e.g., ML, NLP, and Computer Vision all “spun off”)
IR researchers were doing machine learning as early as the 1960s (Rocchio 1965, relevance feedback), but supervised learning didn’t become popular in IR until the early 1990s, when text categorization started getting a lot of attention
  – Lack of training data for search (no large-scale online system; users don’t like to spend effort on judgments)
  – Learning-based approaches didn’t prevail for ad hoc retrieval
Machine learning is now very important for IR
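
The relevance feedback idea cited above (Rocchio 1965) can be illustrated with a minimal sketch. It assumes queries and documents are sparse term-weight dictionaries; the function name `rocchio` and the default weights `alpha`, `beta`, and `gamma` are illustrative choices, not values taken from the slides.

```python
from collections import defaultdict


def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated query vector (term -> weight).

    The new query moves toward the centroid of the judged-relevant
    documents and away from the centroid of the judged-nonrelevant ones.
    The default weights are illustrative assumptions.
    """
    updated = defaultdict(float)
    for term, weight in query.items():
        updated[term] += alpha * weight
    for docs, coeff in ((relevant, beta), (nonrelevant, -gamma)):
        if not docs:
            continue  # no judgments of this kind: skip the centroid term
        for doc in docs:
            for term, weight in doc.items():
                updated[term] += coeff * weight / len(docs)
    # Terms whose weight ends up non-positive are dropped from the query.
    return {term: w for term, w in updated.items() if w > 0}


query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "car": 2.0}]
nonrelevant = [{"jaguar": 1.0, "cat": 2.0}]
new_query = rocchio(query, relevant, nonrelevant)
```

In this toy example the updated query gains the term "car" from the relevant document and drops "cat", which appears only in the nonrelevant one.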

IR and Library & Information Science
Inseparable from day one (“Information Science” vs. “Computer Science”)
Early IR work was mostly done in the context of library and information science (LIS)
I-School initiative/movement: drop “library” and enlarge the scope to “informatics”, leading to a merger of CS + LIS
Another example where the boundary between fields is disappearing (setting boundaries is generally harmful for research, but is sometimes needed in practice)

IR and Software Engineering
Scalability of IR wasn’t a major concern until the Web
  – Data collections were relatively small and didn’t grow quickly until the Web
  – The most effective retrieval models remain simple models based on bag-of-words representation
However, scalability has always been a core issue in IR, and how to engineer an IR system optimally is extremely important for IR applications
Nowadays, data-intensive computing is essential for large-scale IR applications
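
As an illustration of what a simple bag-of-words model looks like, the sketch below scores documents against a query with a basic TF-IDF sum. The whitespace tokenizer, the function names, and the particular IDF formula `log((N + 1) / df)` are assumptions made for this example; the slides do not specify a weighting scheme.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a demo.
    return text.lower().split()


def tfidf_scores(query, docs):
    """Score each document as the sum of tf * idf over the query terms."""
    bags = [Counter(tokenize(doc)) for doc in docs]  # bag of words per doc
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for bag in bags for term in bag)
    scores = []
    for bag in bags:
        score = 0.0
        for term in set(tokenize(query)):
            if term in bag:
                score += bag[term] * math.log((n_docs + 1) / doc_freq[term])
        scores.append(score)
    return scores


docs = [
    "information retrieval systems",
    "database systems",
    "retrieval of information and retrieval models",
]
scores = tfidf_scores("information retrieval", docs)
```

Word order is ignored entirely: each document is reduced to term counts, which is the sense in which the representation is a "bag" of words.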

IR and Applications
Early days: library search, literature
1970s: small-scale online search systems
1990s: large-scale systems
  – TREC (mostly news data, later other kinds of data)
  – Web search engines
2010s: search is everywhere!
More and more applications in the future

Publications/Societies (broad view)
[Diagram: research areas and their publication venues]
Areas: Info Retrieval; Databases; Info. Science; Learning/Mining; NLP; Applications; Statistics; Software/systems
Venues shown: ACM SIGIR; ECIR, CIKM, TREC; TOIS, IRJ, IPM; ACM SIGMOD, VLDB; ICDE, EDBT, TODS; JASIS; JCDL; ACL; COLING, EMNLP, NAACL HLT; ICML, NIPS, UAI; AAAI; ACM SIGKDD; ICDM, SDM; ISMB; RECOMB, PSB; WWW; WSDM; OSDI

Major IR Publication Venues
[Timeline figure; year labels: 1978, 1990<]
Conferences: ACM SIGIR, CIKM, ECIR, TREC, WWW, WSDM
Journals: ACM TOIS, IPM (ISR), IRJ, JASIST, JDoc

IR Research Topics (Broad View)
[Diagram labels] Users; Text; Text Acquisition; Natural Language Content Analysis; Information Access: Search, Filtering; Information Organization: Categorization, Clustering; Text Mining: Summarization, Extraction, Mining, Visualization; Retrieval Applications; Analytics Applications

IR Topics (narrow view)
[Diagram of the retrieval loop] Components: User, query, judgments, docs, results, Query Rep, Doc Rep, Ranking, Feedback
Stages: INDEXING, SEARCHING, QUERY MODIFICATION, LEARNING, INTERFACE
Topics:
  1. Evaluation
  2. Retrieval (Ranking) Models
  3. Document representation/structure
  4. Efficiency & scalability
  5. Search result summarization/presentation
  6. User interface (browsing)
  7. Feedback/Learning
“Core” topics: 1-4, 7, especially 1, 2, 7
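
The INDEXING and SEARCHING stages named above can be sketched with a toy inverted index. This is a minimal illustration under stated assumptions: whitespace tokenization, document IDs taken from list positions, and conjunctive (AND) matching with no ranking. The function names are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """INDEXING: map each term to the set of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """SEARCHING: return sorted IDs of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    # Look up each term's posting set; unknown terms yield an empty set.
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


docs = [
    "web search engines",
    "retrieval models for web search",
    "library catalog search",
]
index = build_index(docs)
results = search(index, "web search")
```

A real system would replace the final set intersection with a ranking function and feed user judgments back into query modification, as the diagram indicates.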

Major Research Milestones
Early days (late 1950s to 1960s): foundation and founding of the field
  – Luhn’s work on automatic encoding
  – Cleverdon’s Cranfield evaluation methodology and index experiments
  – Salton’s early work on SMART system and experiments
1970s-1980s: a large number of retrieval models
  – Vector space model
  – Probabilistic models
1990s: further development of retrieval models and new tasks
  – Language models
  – TREC evaluation
2000s-present: more applications, especially Web search and interactions with other fields
  – Web search
  – Learning to rank
  – Scalability (e.g., MapReduce)
[Slide callouts] Indexing: auto vs. manual; Evaluation; System; Indexing + Search; Theory; Large-scale evaluation, beyond ad hoc retrieval; Web search; Machine learning; Scalability

Frontier Topics in IR: Overview
Two types of topics
  – 30%: Fundamental challenges: IR models, evaluation, efficiency, user models/studies
  – 70%: Application-driven challenges: Web (1.0, 2.0, 3.0?), Enterprise (text analytics), Scientific Research (bioinformatics, …)
Methodology
  – 50%: Machine learning (feature set + supervised)
  – 30%: Language models (unigram + unsupervised)
  – 20%: Others (user studies, empirical experiments)
Trends
  – More interdisciplinary and internationalized
  – More diversification of topics (new applications, new methods)
  – Hard fundamental problems regularly revisited
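
To make the "unigram + unsupervised" language-model methodology concrete, here is a minimal sketch of query-likelihood scoring with a unigram document model. The slides do not name a smoothing method; Jelinek-Mercer interpolation with the collection model and the value `lam=0.5` are assumptions chosen for illustration, as is the function name.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, lam=0.5):
    """Log-probability of the query under a smoothed unigram document model.

    `query`, `doc`, and `collection` are lists of tokens. The document
    model is interpolated with the collection model (Jelinek-Mercer
    smoothing, an assumption of this sketch) using weight `lam`.
    """
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    doc_len, coll_len = len(doc), len(collection)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / doc_len if doc_len else 0.0
        p_coll = coll_tf[term] / coll_len if coll_len else 0.0
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen anywhere in the collection
        score += math.log(p)
    return score


d1 = ["language", "models", "for", "retrieval"]
d2 = ["web", "search", "engines"]
collection = d1 + d2
query = ["language", "retrieval"]
s1 = query_log_likelihood(query, d1, collection)
s2 = query_log_likelihood(query, d2, collection)
```

No relevance labels are needed: the model is estimated from term counts alone, which is why the slide groups it under unsupervised methods. Documents are ranked by descending score.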

Topics in SIGIR 2011/2012 CFP
  – Document Representation and Content Analysis (e.g., text representation, document structure, linguistic analysis, non-English IR, cross-lingual IR, information extraction, sentiment analysis, clustering, classification, topic models, facets)
  – Queries and Query Analysis (e.g., query representation, query intent, query log analysis, question answering, query suggestion, query reformulation)
  – Users and Interactive IR (e.g., user models, user studies, user feedback, search interface, summarization, task models, personalized search)
  – Retrieval Models and Ranking (e.g., IR theory, language models, probabilistic retrieval models, feature-based models, learning to rank, combining searches, diversity)
  – Search Engine Architectures and Scalability (e.g., indexing, compression, MapReduce, distributed IR, P2P IR, mobile devices)
  – Filtering and Recommending (e.g., content-based filtering, collaborative filtering, recommender systems, profiles)
  – Evaluation (e.g., test collections, effectiveness measures, experimental design)
  – Web IR and Social Media Search (e.g., link analysis, query logs, social tagging, social network analysis, advertising and search, blog search, forum search, CQA, adversarial IR, vertical and local search)
  – IR and Structured Data (e.g., XML search, ranking in databases, desktop search, entity search)
  – Multimedia IR (e.g., image search, video search, speech/audio search, music IR)
  – Other Applications (e.g., digital libraries, enterprise search, genomics IR, legal IR, patent search, text reuse)

My View of the Future of IR
[Diagram labels] Bag of words; Search; Keyword Queries; Access; Mining; Task Support; Entities-Relations; Knowledge Representation; Search History; Complete User Model; Current Search Engine; Personalization (User Modeling); Large-Scale Semantic Analysis; Full-Fledged Text Info. Management

What You Should Know
IR is a highly interdisciplinary area interacting with many other areas, especially NLP, ML, DB, HCI, software systems, and Information Science
Major publication venues, especially ACM SIGIR, ACM CIKM, ACM TOIS, IRJ, IPM, WSDM