Published by Donna Fletcher
1
IR in a Nutshell: Applications, Research, and Challenges
Session 1, Feb 21st, 2013
Tamer Elsayed
2
Roadmap
What is Information Retrieval (IR)?
● Overview and applications
Overview of my research interests
● Large-scale problems
● MapReduce extensions
● Twitter analysis
The future of IR research
● SWIRL 2012
3
WHAT IS IR? OVERVIEW & APPLICATIONS/RESEARCH TOPICS
4
Information Retrieval (IR) …
[Figure labels: information need, Query, Unstructured, Hits]
5
Who and Where?
*Source: Matt Lease (IR Course at UTexas)
6
IR is not just “Web Page” (or Document) Ranking (or Retrieval)
7
Web Search: Google
● Search suggestions
● Vertical search
● Query-biased summarization
● Sponsored search
● Search shortcuts
● Vertical search (news, blog, image)
8
Web Search: Google II
● Spelling correction
● Personalized search / social ranking
● Vertical search (local)
9
Cross-Lingual IR
1/3 of the Web is not in English
About 50% of Web users do not use English as their primary language
Many (maybe most) search applications have to deal with multiple languages
● monolingual search: search in one language, but with many possible languages
● cross-language search: search in multiple languages at the same time
10
Routing / Filtering
Given a standing query, analyze new information as it arrives
● Input: all email, RSS feed or listserv, …
● Typically classification rather than ranking
● Simple example: ham vs. spam
*Source: Matt Lease (IR Course at UTexas)
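The filtering idea can be sketched as a yes/no decision applied to each item as it arrives. This is only an illustration, not the method from the course material: the term-overlap rule, the `threshold` parameter, and the sample messages are all assumptions made for the example.

```python
def make_filter(standing_query, threshold=0.5):
    """Build an accept/reject function for a fixed (standing) query.

    Hypothetical rule: accept a message when at least `threshold` of the
    query terms occur in it. Real filters typically use a trained classifier.
    """
    query_terms = set(standing_query.lower().split())
    if not query_terms:
        raise ValueError("standing query must contain at least one term")

    def accept(message):
        terms = set(message.lower().split())
        overlap = len(query_terms & terms) / len(query_terms)
        return overlap >= threshold  # a binary decision, not a ranking score

    return accept


accept = make_filter("information retrieval workshop")

# Items are judged one at a time, in arrival order.
stream = [
    "Call for papers: information retrieval workshop",
    "Cheap watches, buy now",
    "New retrieval models for web search",
]
delivered = [message for message in stream if accept(message)]
```

Unlike ad hoc search, nothing is ranked here: each incoming item is either delivered or dropped.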
11
Content-based Music Search
*Source: Matt Lease (IR Course at UTexas)
12
Speech Retrieval
*Source: Matt Lease (IR Course at UTexas)
13
Entity Search
*Source: Matt Lease (IR Course at UTexas)
14
Question Answering & Focused Retrieval
*Source: Matt Lease (IR Course at UTexas)
15
Expert Search
*Source: Matt Lease (IR Course at UTexas)
16
Blog Search
*Source: Matt Lease (IR Course at UTexas)
17
μ-Blog Search (e.g. Twitter)
*Source: Matt Lease (IR Course at UTexas)
18
e-Discovery
*Source: Matt Lease (IR Course at UTexas)
19
Book Search
● Find books or more focused results
● Detect / generate / link table of contents
● Classification: detect genre (e.g. for browsing)
● Detect related books, revised editions
● Challenges: variable scan quality, OCR accuracy, copyright, etc.
20
Other Visual Interfaces
*Source: Matt Lease (IR Course at UTexas)
21
MY RESEARCH
22
My Research …
Text + Large-Scale Processing
● Emails: Enron collection (~500,000); application: Identity Resolution
● Web pages: CLuE Web collection (~1,000,000,000); application: Web Search
[Figure labels: User, Application]
23
Back in 2009 …
Before 2009, only small text collections were available
● Largest: ~1M documents
ClueWeb09
● Crawled by CMU in 2009
● ~1B documents!
● Need to move to cluster environments
MapReduce/Hadoop seemed like a promising framework
24
MapReduce Framework
(a) Map: input (k1, v1) → [k2, v2]
(b) Shuffle: group values by keys → (k2, [v2])
(c) Reduce: (k2, [v2]) → output [(k3, v3)]
Framework handles “everything else”!
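The three stages can be sketched in a few lines of single-process Python. This is only a toy model of the programming interface, not how Hadoop is implemented: `run_mapreduce`, `wc_map`, and `wc_reduce` are hypothetical names, and the “everything else” a real framework handles (distribution, scheduling, fault tolerance) is absent.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Simulate map, shuffle, and reduce in memory on one machine."""
    # (a) Map: each input (k1, v1) yields zero or more (k2, v2) pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))

    # (b) Shuffle: group values by key, giving (k2, [v2]).
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # (c) Reduce: each (k2, [v2]) yields zero or more (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output


def wc_map(doc_id, text):
    """Word count mapper: emit (term, 1) for every token."""
    for token in text.lower().split():
        yield token, 1


def wc_reduce(term, counts):
    """Word count reducer: sum the ones emitted for a term."""
    yield term, sum(counts)


docs = [("d1", "web search"), ("d2", "Web graph search search")]
word_counts = run_mapreduce(docs, wc_map, wc_reduce)
# word_counts == [("graph", 1), ("search", 3), ("web", 2)]
```

The programmer writes only the mapper and reducer; the grouping step in the middle belongs to the framework.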
25
E2E Search Toolkit using MapReduce: Ivory
Completely designed for the Hadoop environment
Experimental platform for research
Supports common text collections
● + ClueWeb09
Open source release
Implements state-of-the-art retrieval models
http://ivory.cc
26
(1) Pairwise Similarity in Large Collections
[Figure: a document collection and its matrix of pairwise similarity scores]
Applications:
● Clustering
● “more-like-that” queries
27
Decomposition
Each term contributes only if it appears in …
[Figure labels: map, reduce]
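One way to read this decomposition, assuming the similarity of two documents is the inner product of their term weights (the formula itself did not survive on this slide), is that a term adds to a pair's score only when both documents contain it. The sketch below follows that assumption: the “map” step walks one term's postings list and emits a partial product for each document pair, and the “reduce” step sums the partial products per pair. All names and weights are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def build_postings(doc_weights):
    """Invert {doc: {term: weight}} into {term: [(doc, weight), ...]}."""
    postings = defaultdict(list)
    for doc_id in sorted(doc_weights):
        for term, weight in doc_weights[doc_id].items():
            postings[term].append((doc_id, weight))
    return postings


def pairwise_similarity(doc_weights):
    """Sum term-wise partial products for every pair sharing a term."""
    sims = defaultdict(float)
    for term, plist in build_postings(doc_weights).items():
        # "map": a term only produces output for documents that contain it
        for (doc_i, w_i), (doc_j, w_j) in combinations(plist, 2):
            # "reduce": accumulate partial products for the same pair
            sims[(doc_i, doc_j)] += w_i * w_j
    return dict(sims)


# Hypothetical term weights for three documents.
docs = {
    "d1": {"web": 2.0, "search": 1.0},
    "d2": {"search": 3.0, "graph": 1.0},
    "d3": {"web": 1.0, "search": 1.0, "graph": 2.0},
}
sims = pairwise_similarity(docs)
# sims == {("d1", "d2"): 3.0, ("d1", "d3"): 3.0, ("d2", "d3"): 5.0}
```

Pairs that share no term never appear in the output, so the work is driven by postings list lengths rather than by the number of all possible pairs.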
28
(2) Cross-Lingual Pairwise Similarity
Find similar document pairs in different languages
● More difficult than monolingual!
Multilingual text mining, Machine Translation
Application: automatic generation of potential “interwiki” language links
Locality-sensitive Hashing
● Vectors close to each other are likely to have similar signatures
29
Solution Overview
● N_f German articles → CLIR projection; N_e English articles → Preprocess
● Result: N_e+N_f English document vectors
● Signature generation (Random Projection / Minhash / Simhash) → N_e+N_f signatures, e.g. 01110000101, 11100001010
● Sliding window algorithm → similar article pairs
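The signature and sliding-window steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline from the talk: it uses random-projection signatures only, sorts the signatures once (practical systems typically repeat the sort over several bit permutations), and the vectors, `num_bits`, `window`, and `max_distance` values are made up.

```python
import random


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw `num_bits` random hyperplanes with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(h * v for h, v in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(sig_a, sig_b):
    """Number of bit positions where two signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


def sliding_window_pairs(signatures, window, max_distance):
    """Sort signatures, then compare each one only to its next neighbours."""
    ordered = sorted(signatures.items(), key=lambda item: item[1])
    pairs = set()
    for i, (id_a, sig_a) in enumerate(ordered):
        for id_b, sig_b in ordered[i + 1 : i + window]:
            if hamming(sig_a, sig_b) <= max_distance:
                pairs.add(tuple(sorted((id_a, id_b))))
    return pairs


# Hypothetical document vectors, already projected into one vector space.
vectors = {
    "en1": [1.0, 0.5, 0.0],
    "de1": [2.0, 1.0, 0.0],    # same direction as en1
    "en2": [-1.0, -0.5, 0.0],  # opposite direction
}
planes = make_hyperplanes(dim=3, num_bits=16, seed=0)
sigs = {doc_id: signature(vec, planes) for doc_id, vec in vectors.items()}
pairs = sliding_window_pairs(sigs, window=2, max_distance=1)
```

Because each bit depends only on the sign of a dot product, vectors pointing in the same direction receive identical signatures, so `en1` and `de1` end up adjacent after sorting and are reported as a candidate pair.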
30
(3) Approximate Positional Indexes
Term positions → proximity features → “Learning to Rank” models learn effective ranking functions
● Exact positions: effective ranking functions (√), but large index (X) and slow query evaluation (X)
● Approximate positions: smaller index (√), faster query evaluation (√)
Close Enough is Good Enough?
31
Fixed-Width Buckets
Buckets of length W
[Figure: document d1 divided into buckets 1 to 5 and document d2 into buckets 1 to 3]
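A minimal sketch of fixed-width bucketing, assuming 0-based term positions and 1-based bucket numbers as in the figure. The function names, the sample positions, and the bucket-gap proximity measure are illustrative choices, not details taken from the talk.

```python
def bucket_id(position, width):
    """Map a 0-based term position to a 1-based bucket of length `width`."""
    return position // width + 1


def approximate_postings(positions, width):
    """Replace exact positions with the sorted distinct bucket ids."""
    return sorted({bucket_id(p, width) for p in positions})


def min_bucket_gap(buckets_a, buckets_b):
    """Coarse proximity: smallest bucket distance between two terms."""
    return min(abs(a - b) for a in buckets_a for b in buckets_b)


W = 10  # bucket length (assumed value)
exact_web = [0, 7, 42]    # hypothetical positions of "web" in a document
exact_search = [1, 44]    # hypothetical positions of "search"

web_buckets = approximate_postings(exact_web, W)        # [1, 5]
search_buckets = approximate_postings(exact_search, W)  # [1, 5]
gap = min_bucket_gap(web_buckets, search_buckets)       # 0: they share a bucket
```

Positions 0 and 7 collapse into the same bucket, which is where the index shrinks; the cost is that proximity is only known to within W positions.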
32
(4) Pseudo Training Data for Web Rankers
Training data: documents, queries, and relevance judgments
● Important driving force behind IR innovation
● In industry, easy to get
● In academia, hard and really expensive
33
Web Graph
[Figure: pages P1 to P7 linked by anchor text lines such as “web search”, “SIGIR 2012”, and “Google”]
34
Queries and Judgments?
[Figure: the Web graph of pages P1 to P7, with anchor text lines “web search”, “SIGIR 2012”, “Bing”, and “Google”]
● anchor text lines ≈ pseudo queries
● target pages ≈ relevant candidates
● noise reduction?
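The mapping from anchor text to pseudo training data can be sketched as below. The link triples and the `min_links` cutoff are assumptions for illustration; in particular, the cutoff is only a naive stand-in for the open “noise reduction” question and is not the method used in the research.

```python
from collections import Counter, defaultdict


def pseudo_judgments(anchor_links, min_links=1):
    """Turn (source, anchor_text, target) links into pseudo query judgments.

    Each distinct anchor text line becomes a pseudo query, and the pages it
    points to become relevant candidates. `min_links` is a hypothetical
    noise filter: a target is kept only if at least that many links with
    the same anchor text point to it.
    """
    counts = defaultdict(Counter)
    for _source, anchor_text, target in anchor_links:
        query = " ".join(anchor_text.lower().split())  # normalise the text
        counts[query][target] += 1

    judgments = {}
    for query, targets in counts.items():
        kept = sorted(t for t, n in targets.items() if n >= min_links)
        if kept:
            judgments[query] = kept
    return judgments


# Hypothetical links; the edges in the slide's graph were not recoverable.
links = [
    ("P1", "web search", "P5"),
    ("P2", "Web Search", "P5"),
    ("P3", "web search", "P6"),
    ("P4", "SIGIR 2012", "P7"),
]
judgments = pseudo_judgments(links, min_links=2)
# judgments == {"web search": ["P5"]}
```

With `min_links=1` every anchor text line survives, which shows why some filtering is needed before such data can train a ranker.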
35
(5) Extending MapReduce Framework
● Iterative computations (iHadoop)
● Concurrent jobs with shared data
● m maps - r reduces instead of 1 map - 1 reduce
36
(6) Twitter Analysis
Real-time search in Twitter
● TREC 2011 (6th out of 59 teams)
● TREC 2013?
Answering Real-time Questions from Arabic Social Media
● NPRP-submitted
37
FUTURE RESEARCH DIRECTIONS
38
SWIRL 2012
39
Goal of Report
● Inspire researchers and graduate students to address the questions raised
● Provide funding agencies with data to focus and coordinate support for information retrieval research
Participants were asked to focus on efforts that could be handled in an academic setting, without the requirement of large-scale commercial data.
40
Key Themes (across Topics)
Not just a ranked list
● Move beyond the classic “single ad hoc query and ranked list” approach
Help for users
● Support users more broadly, including ways to bring IR to inexperienced, illiterate, and disabled users
Capturing context
● Treat people using search systems, their context, and their information needs as critical aspects needing exploration
Information, not documents
● Move beyond document retrieval and into more complex types of data and more complicated results
New domains
● Data with restricted access, collections of “apps,” and richly connected workplace data
Evaluation
● Suggest new techniques for evaluation
41
“Most Interesting” Topics
42
[1] Conversational Answer Retrieval
IR: provides ranked lists of documents in response to a wide range of keyword queries
QA: provides more specific answers to a very limited range of natural language questions
Goal: combine the advantages of both to provide effective retrieval of appropriate answers to a wide range of questions expressed in natural language, with rich user-system dialogue
43
Proposed Research
Questions: open-domain, natural language text questions
Answers: develop more general approaches to identifying as many constraints as possible on the answers for questions
Dialogue would be initiated by the searcher and proactively by the system, for:
● refining the understanding of questions
● improving the quality of answers
Answers: short answers, text passages, clustered groups of passages, documents, or even groups of documents may be appropriate answers, as may tables, figures, images, or videos
44
Challenges
● Definitions of question and answer for open-domain searching
● Techniques for representing questions and answers
● Techniques for reasoning about and ranking answers
● Techniques for representing a mixed-initiative CAR dialogue
● Effective dialogue actions for improving question understanding
● Effective dialogue actions for refining answers
45
[2] Finding What You Need with Zero Query Terms (or Less)
Function without an explicit query, depending on context and personalization in order to understand user needs
Anticipate user needs and respond with information appropriate to the current context without the user having to enter a query (zero query terms) or even initiate an interaction with the system (or less)
In a mobile context: take the form of an app that recommends interesting places and activities based on the user’s location, personal preferences, past history, and environmental factors such as weather and time
In a traditional desktop environment: might monitor ongoing activities and suggest related information, or track news, blogs, and social media for interesting updates
Imagine a system that automatically gathers information related to an upcoming task.
46
Proposed Research
● New representations of information and user needs, along with methods for matching the two
● Modeling person, task, and context
● Methods for finding “objects of interest”, including content, people, objects, and actions
● Methods for determining what, how, and when to show material of interest
47
Challenges
● Time- and geo-sensitivity; trust, transparency, privacy; determining interruptibility; summarization
● Power management in mobile contexts
● Evaluation
48
[3] Mobile Information Retrieval Analytics (MIRA)
No company or researcher has an understanding of mobile information access across a variety of tasks, modes of interaction, or software applications.
● For example, a search service provider might know that a query was issued, but not know whether the results it provided led to any consequent action.
The identification of common types of web search queries led to query classification and algorithms tuned for different purposes, which improved web search accuracy.
● A similar understanding for mobile information seeking would focus research on the problems of highest value to mobile users.
Study what information, what kind of information, and what granularity of information to deliver for different tasks and contexts.
49
Proposed Research
● Methodology and tools for doing large-scale collection of data about mobile information access
● Research on incentive mechanisms is required to understand situations in which people are willing to allow their behavior to be monitored
● Research on privacy is required to understand what can be protected by dataset licenses alone, what must be anonymized, and tradeoffs between anonymization and data utility
● Development of well-defined information seeking tasks
● Support quantitative evaluation in well-defined evaluation frameworks that lead to repeatable scientific research
50
Challenges
● Developing incentive mechanisms
● Developing data collections that are sufficiently detailed to be useful while still protecting people’s privacy
● Collection of data in a manner that university internal review boards will consider acceptable ethically
● Collection of data in a manner that does not violate the Terms of Use restrictions of commercial service providers
51
[4] Empowering Users to Search and Learn
Search engines are currently optimized for look-up tasks and not tasks that require more sustained interactions with information.
People have been conditioned by current search engines to interact in particular ways that prevent them from achieving higher levels of learning.
We seek to empower users to be more proactive and critical thinkers during the information search process.
52
[5] The Structure Dimension
Better integration of structured and unstructured information to seamlessly meet a user’s information needs is a promising but underdeveloped area of exploration.
Examples of structure: named entities, user profiles, contextual annotations, as well as (typed) links between information objects ranging from web pages to social media messages.
53
[6] Understanding People in Order to Improve Information (Retrieval) Systems
Development of a research resource for the IR community:
1. from which hypotheses about how to support people in information interactions can be developed
2. in which IR system designs can be appropriately evaluated
Conducting studies of people
● before, during, and after engagement with information systems,
● at a variety of levels,
● using a variety of methods: ethnography, in situ observation, controlled observation, large-scale logging
55
Thank You!